mareko (Marek Olšák)
User

Projects

User does not belong to any projects.

User Details

User Since
Aug 17 2015, 4:04 AM (118 w, 2 h)

Recent Activity

Thu, Nov 16

mareko added a comment to D40047: AMDGPU/GCN: Remove xnack from 801 and 810.

I did notice that for Mesa3D, XNACK is forcibly disabled for all targets < gfx9 and forcibly enabled for all targets >= gfx9. Is that the best choice? It will likely not match how the driver configures the hardware. For example, how does the driver configure the APUs? Is XNACK always enabled for gfx9 (it is not for compute)?

Mesa can't modify SH_MEM_CONFIG to enable/disable XNACK; whatever is hardcoded in the kernel is what we get. XNACK is only enabled on compute rings on gfx8 APUs, and on all rings on gfx9. In practice, Mesa should never access an unmapped page. I don't know if setting -xnack on all chips is a good idea in that case. We might also have suboptimal performance on gfx9 due to XNACK always being enabled by the KMD.

If Mesa always guarantees that shaders will never access non-resident memory, and so will never trigger an XNACK, then it can always generate shaders with XNACK disabled, regardless of whether the kernel enables XNACK replay. In other words, enabling XNACK replay does not affect performance unless the compiler chooses to generate XNACK-compatible code. And the compiler does not need to generate XNACK-compatible code if the shader will never access a non-resident page.

For example, the OpenCL 1.2 runtime always ensures all buffers are resident, and so it compiles all shaders with XNACK disabled, regardless of whether the kernel has enabled XNACK replay.

So, does the Mesa runtime always ensure that all accessed data will be resident? Even for APUs? If so, it would likely be a performance gain to always request no-XNACK, unless page migration may also be active on the accessed data.
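
A minimal sketch (not Mesa's or ROCm's actual code) of the feature-string decision argued for above; the helper name and parameters are hypothetical:

    #include <cstdio>
    #include <string>

    // Request -xnack only when XNACK-compatible code is actually needed, i.e. when
    // a shader may touch a non-resident page. This is independent of whether the
    // kernel driver enables XNACK replay in SH_MEM_CONFIG.
    static std::string chooseXnackFeature(bool allDataResident, bool pageMigrationPossible) {
      if (allDataResident && !pageMigrationPossible)
        return "-xnack";   // smaller, faster code; replay never triggers
      return "+xnack";     // must tolerate retries after page faults
    }

    int main() {
      std::printf("%s\n", chooseXnackFeature(true, false).c_str()); // prints -xnack
      return 0;
    }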

Thu, Nov 16, 8:16 AM

Wed, Nov 15

mareko added a comment to D40047: AMDGPU/GCN: Remove xnack from 801 and 810.

I did notice that for Mesa3D, XNACK is forcibly disabled for all targets < gfx9 and forcibly enabled for all targets >= gfx9. Is that the best choice? It will likely not match how the driver configures the hardware. For example, how does the driver configure the APUs? Is XNACK always enabled for gfx9 (it is not for compute)?

Wed, Nov 15, 6:42 PM

Tue, Nov 14

mareko added a comment to D40047: AMDGPU/GCN: Remove xnack from 801 and 810.

Does this change break backwards compatibility with the gfx801 target? If so, which ROCm version will I need to use with these changes?

Tue, Nov 14, 6:19 PM
mareko added a comment to D40047: AMDGPU/GCN: Remove xnack from 801 and 810.

Looks good to me.

Tue, Nov 14, 5:51 PM

Wed, Nov 8

mareko committed rL317754: AMDGPU: Lower buffer store and atomic intrinsics manually.
AMDGPU: Lower buffer store and atomic intrinsics manually
Wed, Nov 8, 5:53 PM
mareko committed rL317755: AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4.
AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4
Wed, Nov 8, 5:53 PM
mareko closed D39060: AMDGPU: Lower buffer store and atomic intrinsics manually by committing rL317754: AMDGPU: Lower buffer store and atomic intrinsics manually.
Wed, Nov 8, 5:53 PM
mareko closed D39012: AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4 by committing rL317755: AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4.
Wed, Nov 8, 5:53 PM
mareko committed rL317753: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFSET into x2, x4.
AMDGPU: Merge BUFFER_LOAD_DWORD_OFFSET into x2, x4
Wed, Nov 8, 5:52 PM
mareko committed rL317752: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFEN into x2, x4.
AMDGPU: Merge BUFFER_LOAD_DWORD_OFFEN into x2, x4
Wed, Nov 8, 5:52 PM
mareko closed D38950: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFEN into x2, x4 by committing rL317752: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFEN into x2, x4.
Wed, Nov 8, 5:52 PM
mareko closed D38951: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFSET into x2, x4 by committing rL317753: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFSET into x2, x4.
Wed, Nov 8, 5:52 PM
mareko committed rL317751: AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4.
AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4
Wed, Nov 8, 5:52 PM
mareko committed rL317750: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.
AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM
Wed, Nov 8, 5:52 PM
mareko closed D38949: AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4 by committing rL317751: AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4.
Wed, Nov 8, 5:52 PM
mareko closed D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM by committing rL317750: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.
Wed, Nov 8, 5:52 PM

Tue, Oct 31

mareko added a comment to D38951: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFSET into x2, x4.

Ping

Tue, Oct 31, 2:10 PM
mareko committed rL317038: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset.
AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset
Tue, Oct 31, 2:07 PM
mareko closed D38914: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset by committing rL317038: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset.
Tue, Oct 31, 2:07 PM
mareko updated the diff for D39012: AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4.
  • don't ignore MOVolatile
Tue, Oct 31, 2:04 PM
mareko updated the diff for D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.
  • don't set MOVolatile
  • this precedes the buffer store merging patch
Tue, Oct 31, 2:04 PM
mareko added inline comments to D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.
Tue, Oct 31, 1:49 PM
mareko added a comment to D39012: AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4.

One minor comment inline.

The volatile / hasOrderedMemoryRef issue is worrying, and I don't think the change should go in as-is. Besides, it's probably not as big a win as the other patches, especially due to the added VGPR spills. Do you have all these patches in a branch somewhere?

Tue, Oct 31, 12:43 PM
mareko added inline comments to D38949: AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4.
Tue, Oct 31, 12:05 PM
mareko updated the diff for D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.
  • address feedback
  • Offset == 0 also means that the operand is not an immediate.
Tue, Oct 31, 11:53 AM
mareko added a comment to D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.

Another improvement that could be made here: LLVM sometimes generates an OR instead of an ADD, and I don't know how to tell when the OR is equivalent to an ADD.
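
As a self-contained illustration (a sketch, not this patch's code) of the property a selector would have to prove, e.g. via known-bits analysis, before folding an OR-based address the same way as an ADD-based one:

    #include <cassert>
    #include <cstdint>

    int main() {
      // If the two operands share no set bits, no carries can occur, so OR and
      // ADD compute the same value. This is the condition under which an OR
      // generated for address math can be treated exactly like an ADD.
      uint32_t base   = 0x1000; // low bits known zero (e.g. an aligned base address)
      uint32_t offset = 0x0010; // fits entirely within the zero bits of 'base'
      assert((base & offset) == 0);              // operands are bit-disjoint...
      assert((base | offset) == base + offset);  // ...so OR == ADD
      return 0;
    }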

Tue, Oct 31, 11:46 AM

Thu, Oct 26

mareko committed rL316666: AMDGPU: Handle s_buffer_load_dword hazard on SI.
AMDGPU: Handle s_buffer_load_dword hazard on SI
Thu, Oct 26, 7:43 AM
mareko closed D39171: AMDGPU: Handle s_buffer_load_dword hazard on SI by committing rL316666: AMDGPU: Handle s_buffer_load_dword hazard on SI.
Thu, Oct 26, 7:43 AM

Wed, Oct 25

mareko added a comment to D39040: AMDGPU: Fix creating invalid copy when adjusting dmask.

In case it's still confusing: the number of components returned by image opcodes is popcount(dmask). The code could just do popcount(dmask) instead of computing BitsSet.
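
A minimal, self-contained sketch of that suggestion (not the patch's code); the function name is hypothetical, and Gather4 would still need its special case:

    #include <bitset>
    #include <cstdio>

    // Number of components an image opcode returns: one per set bit of the 4-bit dmask.
    static unsigned numImageComponents(unsigned dmask) {
      return std::bitset<4>(dmask & 0xF).count();
    }

    int main() {
      std::printf("%u %u %u\n", numImageComponents(0xF), numImageComponents(0x2),
                  numImageComponents(0x5)); // prints 4 1 2
      return 0;
    }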

Wed, Oct 25, 7:33 AM
mareko added a comment to D39040: AMDGPU: Fix creating invalid copy when adjusting dmask.

With this, when dmask = 0x2, we get "image_get_lod v[0:1], ...". Based on what Marek said, wouldn't it only be returning a single value to v0? What would v1 get set to?

In comparison, for image_sample with dmask = 0x2 I see only a single destination register specified. That's also what I see on the proprietary driver for image_get_lod.

Wed, Oct 25, 7:30 AM

Tue, Oct 24

mareko committed rL316427: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).
AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1)
Tue, Oct 24, 3:27 AM
mareko closed D38544: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1) by committing rL316427: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).
Tue, Oct 24, 3:27 AM
mareko committed rL316426: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.
AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic
Tue, Oct 24, 3:27 AM
mareko closed D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic by committing rL316426: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.
Tue, Oct 24, 3:27 AM
mareko updated the diff for D39171: AMDGPU: Handle s_buffer_load_dword hazard on SI.

Cleanups.

Tue, Oct 24, 2:49 AM
mareko added inline comments to D39171: AMDGPU: Handle s_buffer_load_dword hazard on SI.
Tue, Oct 24, 2:47 AM
mareko accepted D39040: AMDGPU: Fix creating invalid copy when adjusting dmask.

LGTM.

Tue, Oct 24, 2:16 AM

Sun, Oct 22

mareko created D39171: AMDGPU: Handle s_buffer_load_dword hazard on SI.
Sun, Oct 22, 2:13 PM

Oct 18 2017

mareko added a comment to D39040: AMDGPU: Fix creating invalid copy when adjusting dmask.

Each bit of dmask determines whether the corresponding component is enabled. Image opcodes return 4 components if dmask == 0xf. If dmask == 0x2, image opcodes return only the 2nd component in <1 x float>. If dmask == 0x5, image opcodes return the 1st and 3rd components in <2 x float>. If dmask == 0xa, image opcodes return the 2nd and 4th components in <2 x float>.
Gather4 opcodes are an exception and always return 4 components.
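
For illustration only (a sketch of the mapping described above, not the backend code): each set dmask bit selects one of x/y/z/w, and the selected components come back packed contiguously.

    #include <cstdio>

    int main() {
      const char comp[4] = {'x', 'y', 'z', 'w'};
      unsigned dmask = 0x5;      // bits 0 and 2 -> returns x and z as <2 x float>
      unsigned packed = 0;
      for (unsigned bit = 0; bit < 4; ++bit)
        if (dmask & (1u << bit))
          std::printf("result element %u holds .%c\n", packed++, comp[bit]);
      return 0;
    }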

Oct 18 2017, 9:57 AM
mareko created D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.
Oct 18 2017, 9:49 AM

Oct 17 2017

mareko created D39012: AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4.
Oct 17 2017, 10:29 AM
mareko updated the diff for D38951: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFSET into x2, x4.

fix up merge-stores.ll

Oct 17 2017, 9:16 AM
mareko updated the diff for D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.

Rebase.

Oct 17 2017, 8:46 AM
mareko updated the diff for D38914: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset.

Preserve GLC.

Oct 17 2017, 8:45 AM
mareko added inline comments to D38914: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset.
Oct 17 2017, 8:30 AM

Oct 16 2017

mareko added a comment to D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.

My understanding is that the existing llvm.amdgcn.wqm() intrinsic annotates "expression trees" for the WQM pass and might not insert any instructions at the call site. This new wqm.vote intrinsic does translate to S_WQM_B{wavesize} directly.
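
A self-contained model of the whole-quad-mode mask computation that S_WQM_B64 performs on the vote mask (an assumption based on this thread, not the intrinsic's documentation):

    #include <cstdint>
    #include <cstdio>

    // For each 2x2 pixel quad (4 consecutive lanes), the result bit is set for
    // all four lanes if it is set for any lane of the quad.
    static uint64_t wholeQuadMode(uint64_t mask) {
      uint64_t out = 0;
      for (int quad = 0; quad < 64; quad += 4)
        if ((mask >> quad) & 0xF)
          out |= 0xFull << quad;
      return out;
    }

    int main() {
      std::printf("%016llx\n", (unsigned long long)wholeQuadMode(0x21)); // prints 00000000000000ff
      return 0;
    }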

Oct 16 2017, 3:01 PM
mareko created D38951: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFSET into x2, x4.
Oct 16 2017, 6:05 AM
mareko created D38950: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFEN into x2, x4.
Oct 16 2017, 6:05 AM
mareko created D38949: AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4.
Oct 16 2017, 6:04 AM

Oct 15 2017

mareko updated the diff for D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.

fix the undef test.

Oct 15 2017, 8:09 AM
mareko updated the diff for D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.

also test undef.

Oct 15 2017, 7:47 AM

Oct 13 2017

mareko created D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.
Oct 13 2017, 5:50 PM
mareko created D38914: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset.
Oct 13 2017, 5:50 PM
mareko added a comment to D38544: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).

Nicolai, that's a good point, though let's just merge this intrinsic replacement for now.

Oct 13 2017, 12:26 PM
mareko updated the diff for D38544: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).

Address feedback.

Oct 13 2017, 12:22 PM
mareko updated the diff for D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.

Address feedback.

Oct 13 2017, 10:01 AM

Oct 5 2017

mareko added inline comments to D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.
Oct 5 2017, 10:47 AM

Oct 4 2017

mareko added inline comments to D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.
Oct 4 2017, 3:42 PM
mareko added inline comments to D38544: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).
Oct 4 2017, 3:15 PM
mareko added inline comments to D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.
Oct 4 2017, 11:46 AM
mareko created D38544: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).
Oct 4 2017, 8:05 AM
mareko created D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.
Oct 4 2017, 8:04 AM

Aug 30 2017

mareko added a comment to D22898: AMDGPU: Fix ffloor for SI.

Is the MIN needed for correctness at all? Looking at the workaround docs, I see the explanation that "[FRACT] is outputting 1.0 for very small negative inputs". Sounds to me like v_fract correctly stays in the range [0, 1.0), except for those very small negative inputs, where it returns 1.0 (which happens to be correct for the ffloor lowering).

I guess so? I don't know the details of the bug, but this passes conformance now.

Thinking about it more, this makes sense: at exactly 1.0 the min kicks in and skips the fract result; up to 0.99... the fract result is used.

How does it make any sense? fract should return values in [0, 1). SI has a bug where it incorrectly returns 1 in one case. Doing min(x, 1) will have no effect on the result of the buggy fract. That min() is a no-op.

It's not actually a no-op; it's a canonicalize.

What does that mean?

IEEE canonicalize. http://www.llvm.org/docs/LangRef.html#llvm-canonicalize-intrinsic

NaNs are quieted and denormals may be flushed.
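
A hedged sketch of the ffloor-via-fract lowering being discussed, modelled in plain C++ rather than the backend's DAG nodes: floor(x) = x - min(fract(x), 1 - ulp), where the min() clamps fract into [0, 1) and, per the reply above, also acts as a canonicalize.

    #include <cmath>
    #include <cstdio>

    int main() {
      double x = 2.75;
      double fract = x - std::floor(x);                            // stand-in for V_FRACT_F64
      double clamped = std::fmin(fract, std::nextafter(1.0, 0.0)); // largest double below 1.0
      std::printf("floor(%g) ~= %g\n", x, x - clamped);            // prints floor(2.75) ~= 2
      return 0;
    }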

Aug 30 2017, 2:53 AM

Aug 29 2017

mareko added a comment to D22898: AMDGPU: Fix ffloor for SI.

Is the MIN needed for correctness at all? Looking at the workaround docs, I see the explanation that "[FRACT] is outputting 1.0 for very small negative inputs". Sounds to me like v_fract correctly stays in the range [0, 1.0), except for those very small negative inputs, where it returns 1.0 (which happens to be correct for the ffloor lowering).

I guess so? I don't know the details of the bug, but this passes conformance now.

Thinking about it more, this makes sense: at exactly 1.0 the min kicks in and skips the fract result; up to 0.99... the fract result is used.

How does it make any sense? fract should return values in [0, 1). SI has a bug where it incorrectly returns 1 in one case. Doing min(x, 1) will have no effect on the result of the buggy fract. That min() is a no-op.

It's not actually a no-op; it's a canonicalize.

Aug 29 2017, 1:18 PM

Jul 25 2017

mareko committed rL309028: AMDGPU/SI: Fix Depth and Height computation for SI scheduler.
AMDGPU/SI: Fix Depth and Height computation for SI scheduler
Jul 25 2017, 1:37 PM
mareko closed D34967: AMDGPU/SI: Fix Depth and Height computation for SI scheduler by committing rL309028: AMDGPU/SI: Fix Depth and Height computation for SI scheduler.
Jul 25 2017, 1:37 PM · Restricted Project
mareko committed rL309027: AMDGPU/SI: Force exports at the end for SI scheduler.
AMDGPU/SI: Force exports at the end for SI scheduler
Jul 25 2017, 1:37 PM
mareko closed D34965: AMDGPU/SI: Force exports at the end for SI scheduler by committing rL309027: AMDGPU/SI: Force exports at the end for SI scheduler.
Jul 25 2017, 1:37 PM · Restricted Project

Jul 4 2017

mareko closed D34190: [AMDGPU] Fix latency of MIMG instructions.

Committed manually. Closing.

Jul 4 2017, 7:44 AM
mareko committed rL307081: [AMDGPU] Fix latency of MIMG instructions.
[AMDGPU] Fix latency of MIMG instructions
Jul 4 2017, 7:44 AM

Jun 28 2017

mareko added a comment to D34767: AMDGPU: Remove SITypeRewriter.

I approve this. I don't know if Dave does.

Jun 28 2017, 1:59 PM

Jun 22 2017

mareko added a comment to D27586: AMDGPU/SI: Add llvm.amdgcn.s.buffer.load intrinsic.

I've got a scenario that will benefit from this change. It seems to me that this might be a more workable solution than D28993 (although a more generic approach like that is attractive). This change doesn't necessarily preclude something like D28993 at a later stage does it?

Is it possible to make the non-const offset change that mareko talks about as well?

Jun 22 2017, 11:02 AM

Jun 20 2017

mareko added a comment to D33865: Mark llvm.*annotation intrinsics as NoMem and Speculatable.

How do you use these integer annotations? If you're using them solely as optimization information (i.e. dropping them would be correct, but perhaps undesirable), it may be better to use metadata here.

Jun 20 2017, 3:42 AM

Jun 19 2017

mareko added a comment to D33865: Mark llvm.*annotation intrinsics as NoMem and Speculatable.

Do we need to add more reviewers?

JFYI: so far you haven't added any reviewers. :)

However, if you want to make this change, at the very least you also need to edit the language reference entry for llvm.var.annotation to state that it isn't allowed to express control flow dependent properties. Can you also please give some examples of how you use llvm.var.annotate (for discussion)?

Jun 19 2017, 9:05 AM

Jun 16 2017

mareko added a comment to D33865: Mark llvm.*annotation intrinsics as NoMem and Speculatable.

Do we need to add more reviewers?

Jun 16 2017, 5:18 AM
mareko added a comment to D33865: Mark llvm.*annotation intrinsics as NoMem and Speculatable.

Ping

Jun 16 2017, 5:17 AM

Jun 3 2017

mareko created D33865: Mark llvm.*annotation intrinsics as NoMem and Speculatable.
Jun 3 2017, 8:51 AM

May 24 2017

mareko committed rL303754: Revert "AMDGPU: Fold CI-specific complex SMRD patterns into existing complex….
Revert "AMDGPU: Fold CI-specific complex SMRD patterns into existing complex…
May 24 2017, 7:54 AM

May 23 2017

mareko committed rL303658: AMDGPU: Fold CI-specific complex SMRD patterns into existing complex patterns.
AMDGPU: Fold CI-specific complex SMRD patterns into existing complex patterns
May 23 2017, 10:14 AM
mareko closed D28994: AMDGPU: Fold CI-specific complex SMRD patterns into existing complex patterns by committing rL303658: AMDGPU: Fold CI-specific complex SMRD patterns into existing complex patterns.
May 23 2017, 10:14 AM

May 19 2017

mareko updated the diff for D28994: AMDGPU: Fold CI-specific complex SMRD patterns into existing complex patterns.

add one more use of has32BitLiteralSMRDOffset.

May 19 2017, 11:13 AM
mareko updated the diff for D28994: AMDGPU: Fold CI-specific complex SMRD patterns into existing complex patterns.

This is just a cleanup. Also, it adds a check that ByteCount is aligned to 4.

May 19 2017, 11:05 AM

May 4 2017

mareko committed rL302200: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.
AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5
May 4 2017, 3:38 PM
mareko closed D32645: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5 by committing rL302200: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.
May 4 2017, 3:38 PM
mareko added a comment to D32645: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.

Yeah, we could have a simpler test, but my approach is usually to copy an existing test and modify it, so it would help if we had such a test that screams "hey I'm a simple scratch buffer test!"

May 4 2017, 12:21 PM

May 3 2017

mareko abandoned D28993: AMDGPU: Try to select SMEM opcodes for llvm.amdgcn.buffer.load.

Another possible issue is that SMEM instructions ignore some bits of the resource descriptor. So you would need some way to tell the compiler that it wouldn't be ignoring any relevant resource-descriptor bits by selecting SMEM.

Doesn't this make this change unworkable? Presumably the front-end would need to annotate in some way to indicate that this is a legitimate transformation, in which case you might as well use a different intrinsic anyway. Are there any circumstances where you can determine if this is definitely the case?

I've got a situation that benefits from this change, but equally could use the solution in D27586. Perhaps that change could be enhanced with the non-const offset change in this review?

May 3 2017, 9:30 AM

May 2 2017

mareko updated the diff for D32645: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.

Directly set SGPR5 in the SIMachineFunctionInfo constructor.

May 2 2017, 12:13 PM
mareko abandoned D23020: [ValueTracking] bitreverse, sin, cos are safe to speculatively execute.
May 2 2017, 11:18 AM
mareko committed rL301930: AMDGPU: Add AMDGPU_HS calling convention.
AMDGPU: Add AMDGPU_HS calling convention
May 2 2017, 8:56 AM
mareko closed D32644: AMDGPU: Add AMDGPU_HS calling convention by committing rL301930: AMDGPU: Add AMDGPU_HS calling convention.
May 2 2017, 8:56 AM
mareko added a comment to D32645: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.

Ping

May 2 2017, 7:44 AM

Apr 28 2017

mareko committed rL301677: AMDGPU: Add new amdgcn.init.exec intrinsics.
AMDGPU: Add new amdgcn.init.exec intrinsics
Apr 28 2017, 1:35 PM
mareko closed D31762: AMDGPU: Add new amdgcn.init.exec intrinsics by committing rL301677: AMDGPU: Add new amdgcn.init.exec intrinsics.
Apr 28 2017, 1:35 PM
mareko updated the diff for D32645: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.

Use isPhysRegUsed.

Apr 28 2017, 12:28 PM
mareko added inline comments to D32645: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.
Apr 28 2017, 11:34 AM
mareko created D32645: AMDGPU: GFX9 GS and HS shaders always have the scratch wave offset in SGPR5.
Apr 28 2017, 7:24 AM
mareko created D32644: AMDGPU: Add AMDGPU_HS calling convention.
Apr 28 2017, 7:23 AM

Apr 24 2017

mareko added a comment to D31762: AMDGPU: Add new amdgcn.init.exec intrinsics.

Is it at all possible to get merged shaders where either part has thread_count == 0? We might want a way to annotate branches so that the skip-jump for EXEC=0 is not introduced.

Yes, thread_count == 0 is possible, and it's explained here: https://patchwork.freedesktop.org/patch/152356/

Ah, so that's why the barrier instruction is needed between shader parts? Interesting.

Apr 24 2017, 10:35 AM
mareko added a comment to D31762: AMDGPU: Add new amdgcn.init.exec intrinsics.

I don't see anything wrong with the code.

I agree that the design is a bit iffy. It's almost like these intrinsics are something that is part of the calling convention. But even these intrinsics cannot quite lead to optimal code for merged monolithic shaders, because there's an unnecessary initialization of EXEC in the first part of the shader.

Since what we need to do here in general really doesn't fit well into LLVM IR semantics, I suspect that no matter what we come up with, it's bound to be ugly. So we might as well go with this particular solution here.

Merged monolithic shaders set exec = -1 and then they use "if (tid < thread_count) ...". I think that's the only way to jump over the conditional code right now if the wave has thread_count == 0. If we don't want v_mbcnt+v_cmp, we could do something like "if (amdgcn.set.thread_count(n)) ..." that sets EXEC regardless of current EXEC and skips the conditional for thread_count == 0. The performance of that solution is unlikely to justify the implementation effort.

Is it at all possible to get merged shaders where either part has thread_count == 0? We might want a way to annotate branches so that the skip-jump for EXEC=0 is not introduced.
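
For illustration, a self-contained model (an assumption, not the backend code) of the EXEC value that the exec = -1 / "if (tid < thread_count)" discussion above amounts to for a 64-lane wave: the first thread_count lanes enabled, replacing the manual v_mbcnt + v_cmp sequence.

    #include <cstdint>
    #include <cstdio>

    static uint64_t execForThreadCount(unsigned threadCount) {
      if (threadCount >= 64)
        return ~0ull;                     // full wave
      return (1ull << threadCount) - 1;   // lanes [0, threadCount) active
    }

    int main() {
      std::printf("%016llx\n", (unsigned long long)execForThreadCount(0));  // prints 0000000000000000
      std::printf("%016llx\n", (unsigned long long)execForThreadCount(24)); // prints 0000000000ffffff
      return 0;
    }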

Apr 24 2017, 10:16 AM
mareko added a comment to D31762: AMDGPU: Add new amdgcn.init.exec intrinsics.

I don't see anything wrong with the code.

I agree that the design is a bit iffy. It's almost like these intrinsics are something that is part of the calling convention. But even these intrinsics cannot quite lead to optimal code for merged monolithic shaders, because there's an unnecessary initialization of EXEC in the first part of the shader.

Since what we need to do here in general really doesn't fit well into LLVM IR semantics, I suspect that no matter what we come up with, it's bound to be ugly. So we might as well go with this particular solution here.

Apr 24 2017, 9:42 AM