nhaehnle (Nicolai Hähnle)
User

Projects

User does not belong to any projects.

User Details

User Since
Oct 9 2015, 4:06 AM (110 w, 2 d)

Recent Activity

Thu, Nov 16

nhaehnle added a comment to D40047: AMDGPU/GCN: Remove xnack from 801 and 810.

Did notice that for Mesa3D XNACK is being forcibly disabled for all targets <gfx9 and forcibly enabled for all targets >=gfx9. Is that the best choice? It seems it will likely not match how the driver is configuring the hardware. For example, how does the driver configure the APUs? Is XNACK always being enabled for gfx9 (it is not for compute)?

Mesa can't modify SH_MEM_CONFIG to enable/disable XNACK. What is hardcoded in the kernel is what we get. XNACK is only enabled on compute rings on gfx8 APUs and on all rings on gfx9. In practice, Mesa should never access an unmapped page. I don't know if setting -xnack on all chips is a good idea in that case. We might also have suboptimal performance on gfx9 due to XNACK being always enabled by the KMD.

If Mesa always guarantees that the shaders will never access non-resident memory, and so will never have an XNACK, then it can always generate shaders that have XNACK disabled regardless of whether the kernel enables XNACK replay. In other words, enabling XNACK replay does not affect performance unless the shader chooses to generate XNACK compatible code. And the shader does not need to generate XNACK compatible code if it will never access a non-resident page.

For example, OpenCL 1.2 runtime always ensures all buffers are resident, and so compilers all shaders with XNACK disabled, regardless of whther the kernel has enabled XNACK replay.

So, does Mesa runtime always ensure all data accessed will be resident? Even for APUs? If so it would likely be a performance gain to always request no-XNACK unless page migration may also be active on the data accessed.

Actually not always. We support partially-resident buffers. If those cause an infinite replay loop, I think we have to disable replay in the KMD. I don't know if no-XNACK is OK for partially-resident buffers.

Thu, Nov 16, 10:11 AM
nhaehnle added a comment to D40051: AMDGPU: Rename Bonaire target to be gfx704; update target feature handling.

Fine with me.

Thu, Nov 16, 9:57 AM

Tue, Oct 24

nhaehnle added inline comments to D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.
Tue, Oct 24, 5:40 PM
nhaehnle added inline comments to D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.
Tue, Oct 24, 5:37 PM

Mon, Oct 23

nhaehnle added inline comments to D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.
Mon, Oct 23, 10:18 AM
nhaehnle added a comment to D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.

Can the same be achieved by implementing getTgtMemIntrinsic?

Mon, Oct 23, 10:15 AM
nhaehnle added inline comments to D39060: AMDGPU: Lower buffer store and atomic intrinsics manually.
Mon, Oct 23, 8:58 AM
nhaehnle added a comment to D39012: AMDGPU: Merge BUFFER_STORE_DWORD_OFFEN/OFFSET into x2, x4.

One minor comment inline.

Mon, Oct 23, 8:52 AM
nhaehnle accepted D38950: AMDGPU: Merge BUFFER_LOAD_DWORD_OFFEN into x2, x4.

LGTM

Mon, Oct 23, 8:45 AM
nhaehnle accepted D38949: AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4.

One comment inline, apart from that LGTM. I imagine some of those additional spills could be improved with better register allocation, and that's a bullet someone has to bite eventually, but not now.

Mon, Oct 23, 8:31 AM
nhaehnle added a comment to D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.

P.S.: It would help to use ReviewBoard's feature of defining dependent (parent/child) revisions, like I've just done now. It sucks that this is not done automatically, unfortunately I haven't found a good way to do proper reviews of series with it yet. (This is my mine gripe with all those shiny tools written by web hipsters...)

Mon, Oct 23, 8:16 AM
nhaehnle added a comment to D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.

Some small comments inline. More generally, I think stuff like this belongs in a MIR-level InstCombine pass. We just don't have the infrastructure for that yet, and this patch is a nice improvement in the meantime.

Mon, Oct 23, 8:15 AM
nhaehnle added a dependency for D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM: D38914: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset.
Mon, Oct 23, 8:08 AM
nhaehnle added a dependent revision for D38914: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset: D38915: AMDGPU: Fold immediate offset into BUFFER_LOAD_DWORD lowered from SMEM.
Mon, Oct 23, 8:08 AM
nhaehnle accepted D38914: AMDGPU: Select s_buffer_load_dword with a non-constant SGPR offset.

That should indeed help a lot :)
LGTM.

Mon, Oct 23, 8:05 AM
nhaehnle accepted D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.

I agree the naming of the intrinsics is a bit unfortunate, but it is what it is, and it's not the end of the world. Please remove speculatable. Other than that, LGTM.

Mon, Oct 23, 7:02 AM
nhaehnle closed D37847: AMDGPU: VALU carry-in and v_cndmask condition cannot be EXEC.

r314522

Mon, Oct 23, 6:51 AM
nhaehnle added inline comments to D39171: AMDGPU: Handle s_buffer_load_dword hazard on SI.
Mon, Oct 23, 6:51 AM
nhaehnle accepted D38544: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).

You're right, on second thought the existing intrinsic has the same problem. So this is still a strict improvement.

Mon, Oct 23, 2:59 AM

Oct 11 2017

nhaehnle accepted D38770: AMDGPU: Use stricter bounds for workitem builtins.

LGTM

Oct 11 2017, 11:31 PM

Oct 10 2017

nhaehnle added a comment to D38544: AMDGPU: Add new intrinsic llvm.amdgcn.kill(i1).

A bunch of inline comments, but also some higher level things that don't really fit anywhere. This is a useful feature, but I don't think we've ever gotten the design of kill just right, because kill is really an implicit control flow intrinsic.

Oct 10 2017, 3:17 AM
nhaehnle added a comment to D38543: AMDGPU: Add llvm.amdgcn.wqm.vote intrinsic.

Apart from Matt's comments, this looks good to me.

Oct 10 2017, 2:29 AM
nhaehnle accepted D37758: [AMDGPU] For amdpal, widen interpolation mode workaround.

LGTM

Oct 10 2017, 2:08 AM

Oct 2 2017

nhaehnle added a comment to D38173: AMDGPU: do not fold clamp instructions when sources are different.

Mostly it took me a while to grok why the function was looking at MAX instructions. As far as I understand now, it's because the code that deduces the clamp output modifier happens to always produce it on a MAX(x,x) instruction, and not a MIN.

Oct 2 2017, 9:44 AM

Sep 29 2017

nhaehnle updated the diff for D37850: AMDGPU: Split MUBUF offset into aligned components.

Use alignDown instead of magic 4092.

Sep 29 2017, 9:17 AM
nhaehnle added a comment to D37753: [AMDGPU] implemented pal metadata.

I'm not too familiar with the asm printer stuff, but this all looks reasonable to me.

Sep 29 2017, 8:38 AM
nhaehnle added inline comments to D37758: [AMDGPU] For amdpal, widen interpolation mode workaround.
Sep 29 2017, 8:28 AM
nhaehnle accepted D38173: AMDGPU: do not fold clamp instructions when sources are different.

Maybe you could add a helpful comment on isClamp? Either way, LGTM.

Sep 29 2017, 8:14 AM
nhaehnle added inline comments to D9375: An llvm.noalias intrinsic.
Sep 29 2017, 8:01 AM

Sep 14 2017

nhaehnle updated the diff for D37850: AMDGPU: Split MUBUF offset into aligned components.

Updated commit message: the highest value that can be represented with
immediate + inline is 4156 = 4092 + 64.

Sep 14 2017, 6:20 AM
nhaehnle created D37850: AMDGPU: Split MUBUF offset into aligned components.
Sep 14 2017, 6:19 AM
nhaehnle created D37847: AMDGPU: VALU carry-in and v_cndmask condition cannot be EXEC.
Sep 14 2017, 4:11 AM

Sep 6 2017

nhaehnle added a comment to D37205: AMDGPU: Make worst-case assumption about the wait states in inline assembly.

LGTM.

Sep 6 2017, 6:46 AM
nhaehnle added a comment to D37380: Triple: Add AMDPAL operating system type.

How about in unittests/ADT/TripleTest.cpp?

Sep 6 2017, 6:39 AM

Sep 4 2017

nhaehnle added a comment to D37380: Triple: Add AMDPAL operating system type.

TripleTest.cpp has some minimally meaningful tests though. So a test of amdgcn-amd-amdpal should be added there. Apart from that the patch looks good.

Sep 4 2017, 12:45 PM

Aug 28 2017

nhaehnle created D37205: AMDGPU: Make worst-case assumption about the wait states in inline assembly.
Aug 28 2017, 2:56 AM

Aug 8 2017

nhaehnle accepted D34718: [AMDGPU] Add llvm.amdgpu.update.dpp intrinsic.

LGTM

Aug 8 2017, 12:08 AM

Aug 4 2017

nhaehnle added a comment to D34716: [AMDGPU] Add pseudo "old" and "wqm_mode" source to all DPP instructions.

Some comments on what I noticed. Probably best if somebody else has a look as well.

Aug 4 2017, 1:38 AM

Aug 3 2017

nhaehnle added inline comments to D34719: [AMDGPU] Implement llvm.amdgcn.set.inactive intrinsic.
Aug 3 2017, 5:13 AM

Aug 2 2017

nhaehnle added inline comments to D35967: [AMDGPU] Collapse adjacent SI_END_CF.
Aug 2 2017, 8:38 AM
nhaehnle accepted D34719: [AMDGPU] Implement llvm.amdgcn.set.inactive intrinsic.

One comment, apart from that LGTM.

Aug 2 2017, 2:32 AM
nhaehnle accepted D34716: [AMDGPU] Add pseudo "old" and "wqm_mode" source to all DPP instructions.

LGTM

Aug 2 2017, 2:26 AM
nhaehnle accepted D35524: [AMDGPU] Add support for Whole Wavefront Mode.

One remaining style nitpick. LGTM with that fixed.

Aug 2 2017, 2:23 AM
nhaehnle accepted D35167: [AMDGPU] Add an llvm.amdgcn.wqm intrinsic for WQM.

Thanks. Assuming we agree on the derivative calculations logic I added in a comment, this LGTM.

Aug 2 2017, 2:15 AM

Aug 1 2017

nhaehnle created D36193: AMDGPU: IMPLICIT_DEFs do not contribute to wait states.
Aug 1 2017, 3:04 PM

Jul 21 2017

nhaehnle added a comment to D35524: [AMDGPU] Add support for Whole Wavefront Mode.

Looks mostly good, but I do have a couple of comments.

Jul 21 2017, 7:28 AM

Jul 20 2017

nhaehnle added a comment to D34716: [AMDGPU] Add pseudo "old" and "wqm_mode" source to all DPP instructions.

One question, apart from that looks good.

Jul 20 2017, 2:58 PM
nhaehnle accepted D35523: [AMDGPU] refactor WQM pass in preparation for WWM (NFCI).

One minor comment. Either way, LGTM.

Jul 20 2017, 2:39 PM
nhaehnle requested changes to D35167: [AMDGPU] Add an llvm.amdgcn.wqm intrinsic for WQM.

Some minor comments. In addition, I think it can be simplified, and we probably want the intrinsic to be convergent, because sinking WQM computations into a non-uniform branch could mean that the computation becomes non-WQM for practical purposes.

Jul 20 2017, 8:45 AM

Jul 14 2017

nhaehnle created D35416: AMDGPU: Fix crash when folding immediates into multiple uses.
Jul 14 2017, 6:57 AM
nhaehnle closed D23131: AMDGPU: Fix an interaction between WQM and polygon stippling.

This was submitted long time ago.

Jul 14 2017, 6:55 AM

Jun 28 2017

nhaehnle added a comment to D34716: [AMDGPU] Add pseudo "old" and "wqm_mode" source to all DPP instructions.

So, one thing that's not clear to me with is the semantics of how the update.dpp intrinsic is supposed to enable WQM or WWM. In your sequence of instructions, if you just put a WQM/WWM flag on the update.dpp intrinsic, how does LLVM know whether the regular ALU intrinsics in between should run in WQM/WWM or not?

Jun 28 2017, 3:19 PM

May 18 2017

nhaehnle created D33319: AMDGPU: M0 operands to spill/restore opcodes are dead.
May 18 2017, 8:10 AM

May 8 2017

nhaehnle added a comment to D32862: [AMDGPU] add intrinsic for s_getpc.

Low & high bits and use cases aside, isn't this something that could be covered by adding support for PC in the read_register intrinsic?

May 8 2017, 3:18 AM

Apr 24 2017

nhaehnle added a comment to D31762: AMDGPU: Add new amdgcn.init.exec intrinsics.

Is it at all possible to get merged shaders where either part has thread_count == 0? We might want a way to annotate branches so that the skip-jump for EXEC=0 is not introduced.

Yes, thread_count == 0 is possible, and it's explained here: https://patchwork.freedesktop.org/patch/152356/

Apr 24 2017, 10:24 AM
nhaehnle added inline comments to D32344: InstCombine/AMDGPU: Fix constant folding of llvm.amdgcn.{icmp,fcmp}.
Apr 24 2017, 10:20 AM
nhaehnle added a comment to D31762: AMDGPU: Add new amdgcn.init.exec intrinsics.

I don't see anything wrong with the code.

I agree that the design is a bit iffy. It's almost like these intrinsics are something that is part of the calling convention. But even these intrinsics cannot quite lead to optimal code for merged monolithic shaders, because there's an unnecessary initialization of EXEC in the first part of the shader.

Since what we need to do here in general really doesn't fit well into LLVM IR semantics, I suspect that no matter what we come up with, it's bound to be ugly. So we might as well go with this particular solution here.

Merged monolithic shaders set exec = -1 and then they use "if (tid < thread_count) ...". I think that's the only way to jump over the conditional code right now if the wave has thread_count == 0. If we don't want v_mbcnt+v_cmp, we could do something like "if (amdgcn.set.thread_count(n)) ..." that sets EXEC regardless of current EXEC and skips the conditional for thread_count == 0. The performance of that solution is unlikely to justify the implementation effort.

Apr 24 2017, 10:13 AM
nhaehnle updated the diff for D32344: InstCombine/AMDGPU: Fix constant folding of llvm.amdgcn.{icmp,fcmp}.

Address review comments and apply some more formatting fixes.

Apr 24 2017, 10:05 AM
nhaehnle updated the diff for D32343: AMDGPU: Move v_readlane lane select from VGPR to SGPR.

Address review comments.

Apr 24 2017, 10:04 AM
nhaehnle accepted D31762: AMDGPU: Add new amdgcn.init.exec intrinsics.

I don't see anything wrong with the code.

Apr 24 2017, 9:19 AM

Apr 21 2017

nhaehnle created D32345: AMDGPU: Fix crash when scheduling non-memory SMRD instructions.
Apr 21 2017, 3:44 AM
nhaehnle created D32344: InstCombine/AMDGPU: Fix constant folding of llvm.amdgcn.{icmp,fcmp}.
Apr 21 2017, 3:09 AM
nhaehnle created D32343: AMDGPU: Move v_readlane lane select from VGPR to SGPR.
Apr 21 2017, 2:21 AM

Apr 19 2017

nhaehnle accepted D31222: AMDPU: Change DivergenceAnalysis for function arguments.

LGTM

Apr 19 2017, 9:14 AM
nhaehnle accepted D31476: AMDGPU: Don't align callable functions to 256.

There might be a benefit to choosing a larger alignment, like 64 bytes, due to cache line alignment.

Apr 19 2017, 9:10 AM
nhaehnle accepted D30852: StructurizeCFG: Directly invert cmp instructions.

LGTM

Apr 19 2017, 9:07 AM

Apr 8 2017

nhaehnle added a comment to D31161: [AMDGPU] New Waitcnt Insertion Pass.

When applied to current trunk, this code can get into an infinite loop. Please see the sample LLVM IR at https://paste.debian.net/926533/

Apr 8 2017, 2:04 AM

Feb 15 2017

nhaehnle added inline comments to D26005: AMDGPU: Don't use stack space for SGPR->VGPR spills.
Feb 15 2017, 5:14 AM

Feb 10 2017

nhaehnle added a reviewer for D26005: AMDGPU: Don't use stack space for SGPR->VGPR spills: nhaehnle.
Feb 10 2017, 9:26 AM
nhaehnle added inline comments to D26005: AMDGPU: Don't use stack space for SGPR->VGPR spills.
Feb 10 2017, 9:25 AM

Feb 1 2017

nhaehnle added a comment to D28993: AMDGPU: Try to select SMEM opcodes for llvm.amdgcn.buffer.load.

I see the same problem as Tom here. Do those shaders use read-only SSBOs? If so, this could perhaps be done at the Mesa level. But even then, there'd be a problem if the same memory is bound to two different SSBOs, and one of them is written to, unless the SSBO is marked 'restrict'.

Can you be more specific about why it's incorrect? I only see an issue with L1 coherency (GLC=0).

Feb 1 2017, 4:04 AM
nhaehnle added a comment to D27586: AMDGPU/SI: Add llvm.amdgcn.s.buffer.load intrinsic.

I haven't looked in too much detail yet. I assume getelementptr doesn't work with these pointers, so it would be good to have a negative test which ensures that GEP use fails.

Feb 1 2017, 3:51 AM
nhaehnle added a comment to D28993: AMDGPU: Try to select SMEM opcodes for llvm.amdgcn.buffer.load.

I see the same problem as Tom here. Do those shaders use read-only SSBOs? If so, this could perhaps be done at the Mesa level. But even then, there'd be a problem if the same memory is bound to two different SSBOs, and one of them is written to, unless the SSBO is marked 'restrict'.

Feb 1 2017, 3:46 AM

Jan 30 2017

nhaehnle updated the diff for D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Backing out the effectively dead FMF code, but leave the corresponding
FIXMEs in place.

Jan 30 2017, 1:27 AM

Jan 27 2017

nhaehnle added a comment to D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Yeah, I personally feel more comfortable having the exact same change on the branch.

I meant that having getFlags() always return something dereferenceable seems safer.

Jan 27 2017, 12:59 PM
nhaehnle added a comment to D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Yeah, I personally feel more comfortable having the exact same change on the branch.

Jan 27 2017, 12:45 PM
nhaehnle updated the diff for D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Drop unnecessary getNode()s.

Jan 27 2017, 12:16 PM
nhaehnle added a comment to D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

But you can't because they'll crash [...]

Jan 27 2017, 12:13 PM
nhaehnle added a comment to D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

@arsenm, unfortunately I can't add flag-based tests, since as @hfinkel points out, you can't actually get the flag on an FMA or FMAD in the DAG. @hfinkel, the idea was to put the check in already anyway, since by the discussion with @spatel, the plan is to add FMF for all nodes eventually. That seems reasonable to me.

Jan 27 2017, 11:24 AM
nhaehnle updated the diff for D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Formatting of FIXMEs

Jan 27 2017, 11:24 AM

Jan 23 2017

nhaehnle accepted D28351: AMDGPU: Remove spurious out branches after a kill.

LGTM.

Jan 23 2017, 5:30 AM

Jan 20 2017

nhaehnle updated the diff for D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Added the DefaultFlags. It does fix @spatel's example for me. Anything else?

Jan 20 2017, 6:56 AM

Jan 19 2017

nhaehnle added a comment to D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Obviously it'd be even easier to skip the getFlags() thing entirely now, since it can never evaluate to true. But the right way forward is to extend support for FMF, so...

Jan 19 2017, 1:32 PM
nhaehnle added a comment to D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

The patch will cause a crash [...]

Jan 19 2017, 1:30 PM
nhaehnle updated the diff for D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.

Add FIXMEs. @hfinkel, do I understand your comment correctly that this patch is
good to go?

Jan 19 2017, 3:41 AM

Jan 17 2017

nhaehnle updated the diff for D28675: [DAGCombine] require UnsafeFPMath for re-association of addition.
  1. Take individual UnsafeAlgebra flags into account.
Jan 17 2017, 3:04 AM

Jan 13 2017

nhaehnle added a comment to D28351: AMDGPU: Remove spurious out branches after a kill.

Thanks! We generally use CHECK-LABEL only for the start of a function, not for labels inside a function. The idea is that the label helps FileCheck get less confused when there are many sub-tests in a single test file.

Jan 13 2017, 1:20 PM
nhaehnle added a comment to D28351: AMDGPU: Remove spurious out branches after a kill.

The approach is unfortunately not correct in general as far as I can tell. Say you have

Jan 13 2017, 6:15 AM
nhaehnle retitled D28675: [DAGCombine] require UnsafeFPMath for re-association of addition from to [DAGCombine] require UnsafeFPMath for re-association of addition.
Jan 13 2017, 5:38 AM

Jan 12 2017

nhaehnle accepted D28102: AMDGPU: Fold fneg into fmul_legacy.

LGTM

Jan 12 2017, 10:15 AM
nhaehnle accepted D28615: AMDGPU: Fold free fneg into sin.

LGTM

Jan 12 2017, 10:15 AM
nhaehnle accepted D28101: AMDGPU: Fold fneg into rcp.

I don't see the change that merges the FP_EXTEND / FP_ROUND cases. But the end result LGTM.

Jan 12 2017, 1:17 AM
nhaehnle accepted D28100: AMDGPU: Fold fneg into fp_round.

LGTM

Jan 12 2017, 1:15 AM
nhaehnle accepted D28099: AMDGPU: Fold fneg into fp_extend .

LGTM

Jan 12 2017, 1:14 AM

Jan 11 2017

nhaehnle added a comment to D28101: AMDGPU: Fold fneg into rcp.

This looks like it's missing a rebase onto the rest of the series.

Jan 11 2017, 2:49 PM
nhaehnle accepted D28098: AMDGPU: Fold fneg into fma or fmad .

LGTM

Jan 11 2017, 2:43 PM
nhaehnle accepted D28097: AMDGPU: Fold fneg into fmul if free.

LGTM

Jan 11 2017, 2:43 PM
nhaehnle accepted D28096: AMDGPU: Fold fneg into fadd if free.

LGTM

Jan 11 2017, 2:42 PM
nhaehnle accepted D28240: AMDGPU: Don't produce v_mov_b64_pseudo when folding operands.

A test case would be good. Apart from that, LGTM.

Jan 11 2017, 2:20 PM
nhaehnle accepted D28256: AMDGPU: Skip fneg/select combine if it can fold into other.

One minor comment, LGTM otherwise.

Jan 11 2017, 2:15 PM