This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Move kill lowering to WQM pass and add live mask tracking
ClosedPublic

Authored by critson on Jan 14 2021, 10:04 PM.

Details

Summary

Move implementation of kill intrinsics to WQM pass. Add live lane
tracking by updating a stored exec mask when lanes are killed.
Use live lane tracking to enable early termination of shader
at any point in control flow.

Diff Detail

Event Timeline

critson created this revision.Jan 14 2021, 10:04 PM
critson requested review of this revision.Jan 14 2021, 10:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 14 2021, 10:04 PM
piotr added a subscriber: piotr.Jan 15 2021, 2:12 AM
piotr added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
618–626

Unused?

636

A missing word after "in"?

866

WQMMaskMI is always nullptr here.

1048

It looks the new param NonWWMState is unused?

piotr added inline comments.Jan 15 2021, 2:21 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
1048

I can see you are going to use it in D94747, but as it stands now it will cause a build warning.

piotr added inline comments.Jan 15 2021, 3:28 AM
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll
87

The generated code for this test (and a few others) is slightly unexpected (all three patches combined):

Before:

v_cmpx_lt_f32_e32 vcc, 0, v0

After:

v_cmp_gt_f32_e32 vcc, 0, v0
s_andn2_b64 exec, exec, vcc
s_andn2_b64 exec, exec, vcc
foad added inline comments.Jan 15 2021, 7:24 AM
llvm/lib/Target/AMDGPU/SIInsertSkips.cpp
213–214

Seems like dead code since OldSuccessor is never set to anything useful.

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
15

Are you changing whether or not this pass can assume SSA form?

150

I think this was changed to a MapVector to give a stable iteration order, so changing it back to DenseMap seems dangerous.

critson marked 8 inline comments as done.Jan 15 2021, 5:36 PM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIInsertSkips.cpp
213–214

This belongs in D94748.

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
15

Yes, since the pass is now run after scheduler drop support for SSA form.
Supporting live mask tracking is much simpler non-SSA (no need to add virtual registers and PHIs tracking through every program block).
While I could support both, the code in WQM pass would be very large for a behaviour we are not using.

150

It seems that change occurred during the development of this patch. I missed that, so failed to incorporate it.

618–626

This is legacy of an older version that should have been deleted.

866

Yep, this should be in D94747.

1048

Move to D94746.

critson updated this revision to Diff 317126.Jan 15 2021, 5:37 PM
critson marked 6 inline comments as done.
  • Address reviewer comments.
  • Move misplaced code to other reviews.
critson added inline comments.Jan 15 2021, 6:22 PM
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll
87

So what is happening is the mask update and the exec update use the same register, and shader is marked GS.

Post WQM:

// live mask generated:
%3:sreg_64 = COPY $exec

// kill: 
%0:vgpr_32 = COPY $vgpr0
V_CMP_GT_F32_e32 0, %0:vgpr_32, implicit-def $vcc, implicit $mode, implicit $exec

// live mask update:
dead %3:sreg_64 = S_ANDN2_B64 %3:sreg_64, $vcc, implicit-def $scc
SI_EARLY_TERMINATE_SCC0 implicit $exec, implicit $scc

// kill implemented:
$exec = S_ANDN2_B64 $exec, $vcc, implicit-def $scc

Here the SI_EARLY_TERMINATE_SCC0 generates nothing because the test shader is marked amdgpu_gs. I think the test shaders are too trivial to be representative of real code generation. I added the sendmsg otherwise these shaders optimise away to nothing. It still could be reasonable to add a peephole to clean these up.

critson updated this revision to Diff 317205.Jan 16 2021, 7:11 PM

Remove VCC def flags from SI_KILL_I1 and add test.
This bug existed prior to this patch, but was not causing any issues as control flow lowering does not track liveness.
However it matters for WQM lowering of kills as it can lead to stray definitions of VCC.

piotr added a comment.EditedJan 18 2021, 5:58 AM

LGTM with a few more nits (inline).

Being able to query live lanes at any point in the shader makes sense to me and I really like the removal of SI_KILL_CLEANUP.

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
863

LiveMaskWQM unused.

1057

Nit: I tend to agree with the clang-tidy warning. 's/isEntry/IsEntry/' for consistency?

1341

Is the comment up-to-date? Did you mean: "does not need WQM nor WWM"?

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll
87

Fair enough. Thanks for checking.

foad added a reviewer: piotr.Jan 18 2021, 6:45 AM
foad added inline comments.Jan 18 2021, 7:30 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
209

I don't think you need to "require" this since you don't explicitly use it for anything.

639

Maybe just choose the new opcode inside the switch, and pull the get/setDesc calls out?

686

I'm a bit sceptical of some of the opcodes in this table.

E.g. SETOGT --(swap operands)--> SETOLT --(invert)-->SETUGE which is V_CMP_NLT, but the table here says V_CMP_GT.

It might be more obvious what's going on if you call getSetCCSwappedOperands and getSetCCInverse first, and then have a simple lookup from SETxx to opcode. Or you could avoid the "invert" by using AND instead of ANDN2?

701

It's a bit more conventional to lump SETNE together with SETUNE (unlike all the other comparisons which go with the O form), though of course it doesn't really matter.

1066

No need to add braces.

critson marked 9 inline comments as done.Jan 19 2021, 4:56 PM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
686

I agree these are wrong, but the inversion is required because this will only generate comparison results for active lanes -- so in non-uniform control flow this will not generate a complete mask to use AND.

701

The branches of the switch are lifted from the old kill lowering code.
So I do not plan to touch it for now.

critson updated this revision to Diff 317727.Jan 19 2021, 4:56 PM
critson marked 2 inline comments as done.
  • Address review comments.
  • Fix conversion table for comparisons in F32 kills.
  • Fix a bug where conditional kills could terminate WQM.
piotr added inline comments.Jan 20 2021, 2:18 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
723–727

This looks suspicious: both SETUGT and SETUGE map to V_CMP_NGE.

critson updated this revision to Diff 318471.Jan 22 2021, 2:22 AM
  • Rework F32 kill condition code translation as it was still wrong.
  • There is still an outstanding bug in WWM somewhere.
foad added inline comments.Jan 22 2021, 6:34 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
689

I still don't trust this table! I think the "O" predicates should map to "N" opcodes e.g. SETOLT -> V_CMP_NLT as explained above.

critson added inline comments.Jan 27 2021, 3:37 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
689

Are you expecting the table to look like this?

case ISD::SETEQ:
  Opcode = AMDGPU::V_CMP_LG_F32_e64;
  break;
case ISD::SETGT:
  Opcode = AMDGPU::V_CMP_GE_F32_e64;
  break;
case ISD::SETGE:
  Opcode = AMDGPU::V_CMP_GT_F32_e64;
  break;
case ISD::SETLT:
  Opcode = AMDGPU::V_CMP_LE_F32_e64;
  break;
case ISD::SETLE:
  Opcode = AMDGPU::V_CMP_LT_F32_e64;
  break;
case ISD::SETNE:
  Opcode = AMDGPU::V_CMP_EQ_F32_e64;
  break;
case ISD::SETO:
  Opcode = AMDGPU::V_CMP_O_F32_e64;
  break;
case ISD::SETUO:
  Opcode = AMDGPU::V_CMP_U_F32_e64;
  break;
case ISD::SETOEQ:
case ISD::SETUEQ:
  Opcode = AMDGPU::V_CMP_NEQ_F32_e64;
  break;
case ISD::SETOGT:
case ISD::SETUGT:
  Opcode = AMDGPU::V_CMP_NLT_F32_e64;
  break;
case ISD::SETOGE:
case ISD::SETUGE:
  Opcode = AMDGPU::V_CMP_NLE_F32_e64;
  break;
case ISD::SETOLT:
case ISD::SETULT:
  Opcode = AMDGPU::V_CMP_NGT_F32_e64;
  break;
case ISD::SETOLE:
case ISD::SETULE:
  Opcode = AMDGPU::V_CMP_NGE_F32_e64;
  break;
case ISD::SETONE:
case ISD::SETUNE:
  Opcode = AMDGPU::V_CMP_NLG_F32_e64;
  break;

I am still testing, but I am unsure if VulkanCTS or game shaders can tell the differences.

foad added inline comments.Jan 27 2021, 5:36 AM
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
689

No I'm expecting it to look like this:

case ISD::SETOEQ:
case ISD::SETEQ:
  Opcode = AMDGPU::V_CMP_NEQ_F32_e64;
  break;
case ISD::SETOGT:
case ISD::SETGT:
  Opcode = AMDGPU::V_CMP_NLT_F32_e64;
  break;
case ISD::SETOGE:
case ISD::SETGE:
  Opcode = AMDGPU::V_CMP_NLE_F32_e64;
  break;
case ISD::SETOLT:
case ISD::SETLT:
  Opcode = AMDGPU::V_CMP_NGT_F32_e64;
  break;
case ISD::SETOLE:
case ISD::SETLE:
  Opcode = AMDGPU::V_CMP_NGE_F32_e64;
  break;
case ISD::SETONE:
  Opcode = AMDGPU::V_CMP_NLG_F32_e64;
  break;
case ISD::SETO:
  Opcode = AMDGPU::V_CMP_U_F32_e64;
  break;
case ISD::SETUO:
  Opcode = AMDGPU::V_CMP_O_F32_e64;
  break;
case ISD::SETUEQ:
  Opcode = AMDGPU::V_CMP_LG_F32_e64;
  break;
case ISD::SETUGT:
  Opcode = AMDGPU::V_CMP_GE_F32_e64;
  break;
case ISD::SETUGE:
  Opcode = AMDGPU::V_CMP_GT_F32_e64;
  break;
case ISD::SETULT:
  Opcode = AMDGPU::V_CMP_LE_F32_e64;
  break;
case ISD::SETULE:
  Opcode = AMDGPU::V_CMP_LT_F32_e64;
  break;
case ISD::SETUNE:
case ISD::SETNE:
  Opcode = AMDGPU::V_CMP_EQ_F32_e64;
  break;

E.g. SETOGT --(swap operands)--> SETOLT --(invert)-->SETUGE which is V_CMP_NLT.

(Though as mentioned previously, SETEQ is undefined on NaNs, so it doesn't matter whether it behaves the same as SETOEQ or the same as SETUEQ. And likewise for all the others that don't have an explicit O or U.)

critson marked 4 inline comments as done.Jan 27 2021, 8:49 PM
critson added inline comments.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
689

Thanks -- I have now validated this version as well.
So I suspect in practice we are not generating anything that hits the edge cases.

critson updated this revision to Diff 319750.Jan 27 2021, 8:49 PM
critson marked an inline comment as done.
  • Update F32 kill lowering table.
critson updated this revision to Diff 320012.Jan 28 2021, 5:28 PM
  • Add getClearedProperties to pass to ensure running !IsSSA
piotr accepted this revision.Feb 8 2021, 5:44 AM
This revision is now accepted and ready to land.Feb 8 2021, 5:44 AM
This revision was landed with ongoing or failed builds.Feb 11 2021, 3:32 AM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/AMDGPU/early-term.mir