Move implementation of kill intrinsics to WQM pass. Add live lane
tracking by updating a stored exec mask when lanes are killed.
Use live lane tracking to enable early termination of shader
at any point in control flow.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll | ||
---|---|---|
87 | The generated code for this test (and a few others) is slightly unexpected (all three patches combined): Before: v_cmpx_lt_f32_e32 vcc, 0, v0 After: v_cmp_gt_f32_e32 vcc, 0, v0 s_andn2_b64 exec, exec, vcc s_andn2_b64 exec, exec, vcc |
llvm/lib/Target/AMDGPU/SIInsertSkips.cpp | ||
---|---|---|
210–211 | Seems like dead code since OldSuccessor is never set to anything useful. | |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
15 | Are you changing whether or not this pass can assume SSA form? | |
153–157 | I think this was changed to a MapVector to give a stable iteration order, so changing it back to DenseMap seems dangerous. |
llvm/lib/Target/AMDGPU/SIInsertSkips.cpp | ||
---|---|---|
210–211 | This belongs in D94748. | |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
15 | Yes, since the pass is now run after scheduler drop support for SSA form. | |
153–157 | It seems that change occurred during the development of this patch. I missed that, so failed to incorporate it. | |
627–635 | This is legacy of an older version that should have been deleted. | |
875 | Yep, this should be in D94747. | |
1059 | Move to D94746. |
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll | ||
---|---|---|
87 | So what is happening is the mask update and the exec update use the same register, and shader is marked GS. Post WQM: // live mask generated: %3:sreg_64 = COPY $exec // kill: %0:vgpr_32 = COPY $vgpr0 V_CMP_GT_F32_e32 0, %0:vgpr_32, implicit-def $vcc, implicit $mode, implicit $exec // live mask update: dead %3:sreg_64 = S_ANDN2_B64 %3:sreg_64, $vcc, implicit-def $scc SI_EARLY_TERMINATE_SCC0 implicit $exec, implicit $scc // kill implemented: $exec = S_ANDN2_B64 $exec, $vcc, implicit-def $scc Here the SI_EARLY_TERMINATE_SCC0 generates nothing because the test shader is marked amdgpu_gs. I think the test shaders are too trivial to be representative of real code generation. I added the sendmsg otherwise these shaders optimise away to nothing. It still could be reasonable to add a peephole to clean these up. |
Remove VCC def flags from SI_KILL_I1 and add test.
This bug existed prior to this patch, but was not causing any issues as control flow lowering does not track liveness.
However it matters for WQM lowering of kills as it can lead to stray definitions of VCC.
LGTM with a few more nits (inline).
Being able to query live lanes at any point in the shader makes sense to me and I really like the removal of SI_KILL_CLEANUP.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
872 | LiveMaskWQM unused. | |
1069 | Nit: I tend to agree with the clang-tidy warning. 's/isEntry/IsEntry/' for consistency? | |
1375 | Is the comment up-to-date? Did you mean: "does not need WQM nor WWM"? | |
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.kill.ll | ||
87 | Fair enough. Thanks for checking. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
214 | I don't think you need to "require" this since you don't explicitly use it for anything. | |
648 | Maybe just choose the new opcode inside the switch, and pull the get/setDesc calls out? | |
695 | I'm a bit sceptical of some of the opcodes in this table. E.g. SETOGT --(swap operands)--> SETOLT --(invert)-->SETUGE which is V_CMP_NLT, but the table here says V_CMP_GT. It might be more obvious what's going on if you call getSetCCSwappedOperands and getSetCCInverse first, and then have a simple lookup from SETxx to opcode. Or you could avoid the "invert" by using AND instead of ANDN2? | |
710 | It's a bit more conventional to lump SETNE together with SETUNE (unlike all the other comparisons which go with the O form), though of course it doesn't really matter. | |
1078–1079 | No need to add braces. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
695 | I agree these are wrong, but the inversion is required because this will only generate comparison results for active lanes -- so in non-uniform control flow this will not generate a complete mask to use AND. | |
710 | The branches of the switch are lifted from the old kill lowering code. |
- Address review comments.
- Fix conversion table for comparisons in F32 kills.
- Fix a bug where conditional kills could terminate WQM.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
732–736 | This looks suspicious: both SETUGT and SETUGE map to V_CMP_NGE. |
- Rework F32 kill condition code translation as it was still wrong.
- There is still an outstanding bug in WWM somewhere.
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
698 | I still don't trust this table! I think the "O" predicates should map to "N" opcodes e.g. SETOLT -> V_CMP_NLT as explained above. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
698 | Are you expecting the table to look like this? case ISD::SETEQ: Opcode = AMDGPU::V_CMP_LG_F32_e64; break; case ISD::SETGT: Opcode = AMDGPU::V_CMP_GE_F32_e64; break; case ISD::SETGE: Opcode = AMDGPU::V_CMP_GT_F32_e64; break; case ISD::SETLT: Opcode = AMDGPU::V_CMP_LE_F32_e64; break; case ISD::SETLE: Opcode = AMDGPU::V_CMP_LT_F32_e64; break; case ISD::SETNE: Opcode = AMDGPU::V_CMP_EQ_F32_e64; break; case ISD::SETO: Opcode = AMDGPU::V_CMP_O_F32_e64; break; case ISD::SETUO: Opcode = AMDGPU::V_CMP_U_F32_e64; break; case ISD::SETOEQ: case ISD::SETUEQ: Opcode = AMDGPU::V_CMP_NEQ_F32_e64; break; case ISD::SETOGT: case ISD::SETUGT: Opcode = AMDGPU::V_CMP_NLT_F32_e64; break; case ISD::SETOGE: case ISD::SETUGE: Opcode = AMDGPU::V_CMP_NLE_F32_e64; break; case ISD::SETOLT: case ISD::SETULT: Opcode = AMDGPU::V_CMP_NGT_F32_e64; break; case ISD::SETOLE: case ISD::SETULE: Opcode = AMDGPU::V_CMP_NGE_F32_e64; break; case ISD::SETONE: case ISD::SETUNE: Opcode = AMDGPU::V_CMP_NLG_F32_e64; break; I am still testing, but I am unsure if VulkanCTS or game shaders can tell the differences. |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
698 | No I'm expecting it to look like this: case ISD::SETOEQ: case ISD::SETEQ: Opcode = AMDGPU::V_CMP_NEQ_F32_e64; break; case ISD::SETOGT: case ISD::SETGT: Opcode = AMDGPU::V_CMP_NLT_F32_e64; break; case ISD::SETOGE: case ISD::SETGE: Opcode = AMDGPU::V_CMP_NLE_F32_e64; break; case ISD::SETOLT: case ISD::SETLT: Opcode = AMDGPU::V_CMP_NGT_F32_e64; break; case ISD::SETOLE: case ISD::SETLE: Opcode = AMDGPU::V_CMP_NGE_F32_e64; break; case ISD::SETONE: Opcode = AMDGPU::V_CMP_NLG_F32_e64; break; case ISD::SETO: Opcode = AMDGPU::V_CMP_U_F32_e64; break; case ISD::SETUO: Opcode = AMDGPU::V_CMP_O_F32_e64; break; case ISD::SETUEQ: Opcode = AMDGPU::V_CMP_LG_F32_e64; break; case ISD::SETUGT: Opcode = AMDGPU::V_CMP_GE_F32_e64; break; case ISD::SETUGE: Opcode = AMDGPU::V_CMP_GT_F32_e64; break; case ISD::SETULT: Opcode = AMDGPU::V_CMP_LE_F32_e64; break; case ISD::SETULE: Opcode = AMDGPU::V_CMP_LT_F32_e64; break; case ISD::SETUNE: case ISD::SETNE: Opcode = AMDGPU::V_CMP_EQ_F32_e64; break; E.g. SETOGT --(swap operands)--> SETOLT --(invert)-->SETUGE which is V_CMP_NLT. (Though as mentioned previously, SETEQ is undefined on NaNs, so it doesn't matter whether it behaves the same as SETOEQ or the same as SETUEQ. And likewise for all the others that don't have an explicit O or U.) |
llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp | ||
---|---|---|
698 | Thanks -- I have now validated this version as well. |
Seems like dead code since OldSuccessor is never set to anything useful.