This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][RFC] New llvm.amdgcn.ballot intrinsic
ClosedPublic

Authored by Flakebi on Jul 22 2019, 6:34 AM.

Details

Summary

Add a new llvm.amdgcn.ballot intrinsic modeled on the ballot function
in GLSL and other shader languages. It returns a bitfield containing the
result of its boolean argument in all active lanes, and zero in all
inactive lanes.

This is intended to replace the existing llvm.amdgcn.icmp and
llvm.amdgcn.fcmp intrinsics after a suitable transition period.

Use the new intrinsic in the atomic optimizer pass.

I'm not going to commit this as-is because tests are failing due to
poor code generation, e.g. test2 in ballot.ll generates:

v_cmp_eq_u32_e32 vcc, v0, v1
v_cndmask_b32_e64 v0, 0, 1, vcc
v_cmp_ne_u32_e64 s[4:5], 0, v0
v_mov_b32_e32 v0, s4
v_mov_b32_e32 v1, s5

instead of:

v_cmp_eq_u32_e32 s[4:5], v0, v1
v_mov_b32_e32 v0, s4
v_mov_b32_e32 v1, s5

I'd appreciate feedback on (a) the idea, (b) the implementation and
(c) how best to improve the code generation.

Diff Detail

Event Timeline

foad created this revision.Jul 22 2019, 6:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2019, 6:34 AM
arsenm added inline comments.Jul 22 2019, 6:44 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4235–4236

Should also do this in instcombine, as icmp/fcmp already do

4250

This is already the right type, so this should be unnecessary

llvm/test/CodeGen/AMDGPU/ballot.ll
4 ↗(On Diff #211078)

This needs the .i64 for the mangling, I'm surprised this works

49 ↗(On Diff #211078)

Can you add some different imp and fcmp sources?

Also need a separate wave32 test

foad updated this revision to Diff 211100.Jul 22 2019, 7:42 AM
foad marked 2 inline comments as done.

Address review comments.

foad marked 2 inline comments as done.Jul 22 2019, 7:46 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4235–4236

I've had a go, but is it really necessary to implement the same optimization cases in two places (especially the one that just replaces a ballot intrinsic with a read_register intrinsic), and how do I test them independently?

4250

Does anything guarantee that it's the right type? Is it OK for the compiler to crash if someone writes a .ll file that uses it with the wrong type?

Thanks for doing this. For the codegen quality question, I wonder if something like the following could be done:

  1. Remove what's currently in SITargetLowering.
  2. Add a custom DAG combine which combines (ballot (ISD::SETCC ...)) to (AMDGPUISD::SETCC ...)
  3. Do code generation for the remaining ballot cases either in AMDGPUISelDAGToDAG or via a TableGen pattern (the most generic case would have to map to some combination of COPY_TO_REGCLASS and S_AND_B32/B64).
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4235–4236

Independent tests as opt tests should go into test/Transforms/InstCombine. You can find existing tests for the icmp/fcmp intrinsics there.

Nitpick: please add braces for the if-body (multiple lines guarded by if).

arsenm added inline comments.Jul 24 2019, 5:54 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4235–4236

Yes. If I were to pick one, it would probably be instcombine. I think it’s unlikely this would come up in a context only visible after lowering. Reducing intrinsics is useful when reducing testcases, and any code in the DAG is just going to need to be redone in globalisel eventually

4250

Ideally we would verify wave32 IR somewhere and error, but I don’t think that exists today

Any update on this?

foad added a comment.Jan 13 2020, 11:42 AM

Any update on this?

No, I haven't done any more work on it. I guess I could pick it up again some time soon, unless anyone else wants to?

Any update on this?

No, I haven't done any more work on it. I guess I could pick it up again some time soon, unless anyone else wants to?

So far I've put off handling amdgcn.icmp/fcmp in globalisel, hoping this would supersede that so I'd like to see this move along

Agreed, it would be good to see progress on this.

Flakebi commandeered this revision.Mar 11 2020, 12:55 AM
Flakebi added a reviewer: foad.
Flakebi updated this revision to Diff 249563.Mar 11 2020, 1:14 AM

Address Nicolai’s comments and implement this as DAG combines and TableGen patterns.

The code generation for test2 is currently not optimal:

%trunc = trunc i32 %x to i1
%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %trunc)

generates

v_and_b32_e32 v0, 1, v0
v_cmp_eq_u32_e32 vcc, 1, v0
s_and_b64 s[4:5], vcc, exec

where the first compare stems from the truncate.
We could handle the case of (ballot (truncate x)) in the combining. Any opinions on this?

foad added a comment.Mar 11 2020, 2:18 AM

The code generation for test2 is currently not optimal:

%trunc = trunc i32 %x to i1
%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %trunc)

generates

v_and_b32_e32 v0, 1, v0
v_cmp_eq_u32_e32 vcc, 1, v0
s_and_b64 s[4:5], vcc, exec

where the first compare stems from the truncate.

I'm confused by this. What is the optimal code generation?

foad added inline comments.Mar 11 2020, 2:30 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
428 ↗(On Diff #249563)

If $src is the result of a v_cmp instruction then this s_and will be redundant, won't it? Because v_cmp is defined to return a zero in vcc for all inactive lanes.

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3968

Are you sure it's safe to use read_register(exec) for this? I'm not an expert but I have been told it's not safe, because the compiler doesn't understand that it can't reorder the read_register call past other instructions that modify exec.

Flakebi marked an inline comment as done.Mar 11 2020, 3:09 AM
Flakebi added inline comments.
llvm/lib/Target/AMDGPU/SIInstructions.td
428 ↗(On Diff #249563)

Yes, I agree that this is redundant.
As this happens in the lowering phase, we can only match for (i64 (int_amdgcn_ballot (i1 (trunc $src)))) in this case. I think we would need to combine machine instructions to get the v_cmp.

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3968

Is there a safe way to get the exec register? I’m fine with removing this optimization here and doing it only in the SDag.

The code generation for test2 is currently not optimal:

%trunc = trunc i32 %x to i1
%ballot = call i64 @llvm.amdgcn.ballot.i64(i1 %trunc)

generates

v_and_b32_e32 v0, 1, v0
v_cmp_eq_u32_e32 vcc, 1, v0
s_and_b64 s[4:5], vcc, exec

where the first compare stems from the truncate.

I'm confused by this. What is the optimal code generation?

The more optimal version would be to merge the compare and s_and exec like you said in your comment?

v_and_b32_e32 v0, 1, v0
v_cmp_eq_u32_e32 s[4:5], 1, v0
Flakebi marked an inline comment as done.Mar 12 2020, 2:11 AM
Flakebi added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3968

Actually, icmp and fcmp already create read_register(exec) in InstCombine. So, having this combination and replacing icmp with ballot won’t change behavior.

Flakebi updated this revision to Diff 251027.Mar 18 2020, 3:04 AM

Use getCopyFromReg(exec) and rebase on fixed whole-wave-mode.

The remaining exec copy can get fixed in multiple ways:

  • Prevent sinking the ctpop into the next basic block. Unfortunately marking the instruction as convergent does not work because the Machine Sinking pass moves it instead.
  • Reuse the copy of exec that gets created by s_and_saveexec. This might work with GlobalISel, so it’s probably worth to wait until this gets used.

Use getCopyFromReg(exec) and rebase on fixed whole-wave-mode.

The remaining exec copy can get fixed in multiple ways:

  • Prevent sinking the ctpop into the next basic block. Unfortunately marking the instruction as convergent does not work because the Machine Sinking pass moves it instead.
  • Reuse the copy of exec that gets created by s_and_saveexec. This might work with GlobalISel, so it’s probably worth to wait until this gets used.

I was afraid this would be a problem. We probably need a version of COPY with convergent set. If MachineSinking is moving a convergent instruction, that's also a bug

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8705

I didn't think of it before, but reading exec this way could potentially be dangerous due to the fact that we have the terrible operations that modify exec in the middle of IR blocks, and we split them later. We might have to do this fold later

llvm/lib/Target/AMDGPU/SIInstructions.td
428 ↗(On Diff #251027)

Is the COPY_TO_REGCLASS just a tablegen workaround?

433 ↗(On Diff #251027)

Probably need an explicit Src_b32:$src to make this work with GlobalISel

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3969

What happens for wave32 here?

llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
2412

Wave32 tests also

Flakebi marked an inline comment as done.Mar 18 2020, 9:41 AM
Flakebi added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8705

What are these operations and how can I fix them?
And shouldn’t this work as the instruction reads exec and thus should not be touched?

llvm/lib/Target/AMDGPU/SIInstructions.td
428 ↗(On Diff #251027)

Yes, we get an i1 as input though we know that it is stored an sgpr (pair), so we "cast" it into one. The same happens in line 803 to optimize icmp:

def : Pat <
  (i64 (int_amdgcn_icmp i1:$src, (i1 0), (i32 33))),
  (COPY $src) // Return the SGPRs representing i1 src
>;
Flakebi updated this revision to Diff 251121.Mar 18 2020, 9:58 AM

Add missing wave32 instcombining

Also should add GlobalISel tests

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8705

You can't really fix them in SelectionDAG. Things like the wwm/wqm intrinsics, and kills can exist in the middle of blocks (plus inline asm)

Flakebi updated this revision to Diff 251586.Mar 20 2020, 3:03 AM
This comment was removed by Flakebi.
Flakebi updated this revision to Diff 251587.Mar 20 2020, 3:04 AM

I removed the COPY_TO_REGCLASS, it looked flaky and does not work with GlobalISel.
Instead, the ballot intrinsic is morphed into an AMDGPUISD::SETCC. Compares are the only reasonable way to get a boolean value into the wavefront form as an i32/i64 and use it in LLVM.

Matt: How do I add these combines for GlobalISel?
Should they go into AMDGPUCombine.td? And do I have to write functions for this or does it work purely with pattern matching?

Harbormaster completed remote builds in B49856: Diff 251586.

I removed the COPY_TO_REGCLASS, it looked flaky and does not work with GlobalISel.
Instead, the ballot intrinsic is morphed into an AMDGPUISD::SETCC. Compares are the only reasonable way to get a boolean value into the wavefront form as an i32/i64 and use it in LLVM.

Matt: How do I add these combines for GlobalISel?
Should they go into AMDGPUCombine.td? And do I have to write functions for this or does it work purely with pattern matching?

If you're not going to have a fallback pattern in tablegen, then the GlobalISel support is probably a separate patch. This would not go in AMDGPUCombines.td as this is not an optimization combine. By having everything go through AMDGPUISD::SETCC, this develops the same problem the existing icmp/fcmp intrinsics have for GlobalISel support. As a separate patch, we need to replace how AMDGPUISD::SETCC works to avoid relying on the CondCode fiel

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8714–8716

Since you're now treating this as the full lowering path, and not an optimization for the special case with a compare, this should go in LowerINTRINSIC_WO_CHAIN

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
3967–3968

Should move the getDeclaration below to where you actually use it. If you happened to hit the other size break case, you would have introduced a dead declaration

Flakebi updated this revision to Diff 252064.Mar 23 2020, 8:48 AM

Move the code to lowering again, I’m back were Jay started.
Report a fatal error if the size is neither i32 nor i64.

Flakebi updated this revision to Diff 253569.Mar 30 2020, 5:45 AM

return instead of report_fatal_error

arsenm added inline comments.Mar 30 2020, 7:54 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
737

Can you add a test for this in test/Analysis/DivergenceAnalysis/AMDGPU

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4258–4260

DAG.getSetCC?

Flakebi updated this revision to Diff 253629.Mar 30 2020, 10:13 AM
Flakebi marked 2 inline comments as done.

Add uniformity test

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4258–4260

This creates AMDGPUISD::SETCC not setcc :)

arsenm accepted this revision.Mar 30 2020, 11:40 AM
This revision is now accepted and ready to land.Mar 30 2020, 11:40 AM
This revision was automatically updated to reflect the committed changes.