Since divergence-driven instruction selection has been enabled for AMDGPU,
all uniform instructions are expected to be selected to the SALU form, except those that have no such form.
VGPR to SGPR copies appear in MIR to connect value producers and consumers. This change implements an algorithm
that strikes a reasonable tradeoff between the profit gained from keeping uniform instructions in SALU form
and the overhead introduced by the data transfer between VGPRs and SGPRs.
Repository: rG LLVM Github Monorepo
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
943 | a bit misleading name, maybe "ShouldGoToVALU"? | |
972 | ditto | |
984 | Copies.count(J) and Copies[J] lookups J twice | |
1023 | if (LowerSpecialCase ... ) continue to reduce nesting | |
1064 | replace Inst->getParent()->end() with E | |
1072 | why isSGPRReg is in the loop? It isn't being changed in the loop | |
1097 | Copies[CurID] and Copies.count(CurID) lookups for CurID twice | |
1101 | ditto for S |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
581 | What's the advantage of splitting this into a separate loop? |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
581 | We, basically, have 3 phases - analysis that collect information related to the particular copy, a decision that is made based on the information collected, and lowering that lowers the copy according to the decision made. |
Fixed a bug that caused 5 Vulkan CTS tests to fail. Also fixed the "sibling" copies counting code.
It now considers subregs of the same register as different sources.
[AMDGPU] VGPR to SGPR copies lowering
Needs a better subject line. Previously, moving all VGPR uses from SALU to VALU was required for correctness. Now it is not, so this pass is a heuristic that decides whether to move SALU to VALU or to insert a readfirstlane instruction.
In the future I would like VGPR-to-SGPR copies to be legal, and always implemented as a readfirstlane, so that this whole pass could be truly optional and we could skip it at -O0.
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
98–105 | & should be after space, not before. Doesn't clang-format fix this? Please run git clang-format @^ on the patch anyway. | |
100 | Likewise. | |
580 | I don't understand this comment. | |
1126 | Typo readfirstlane | |
1130 | Should be const. But why is 4 enough? Isn't there some way you can get this programmatically from SIRegisterInfo? | |
1137 | Don't need the "if", you can call getSubRegClass unconditionally. | |
1142 | Don't need the "if", you can use the subreg version in all cases. |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
100 | camelLowerCase the name. Please also add a comment what special cases does it handle. | |
935 | Second to ask for a clang-format run. | |
946 | Double space after "to". | |
956 | #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) around dump(). | |
1040 | Capitalize 'Worklist'. | |
1052 | Typo "comtribute". | |
1070 | Why ++I is in parenthesis? | |
1145 | What happens to 16 bit subregs? |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
1145 | VGPR to SGPR copies are inserted by InstrEmitter to adjust the VALU result to the SALU consumer. |
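To make the shape of these copies concrete, here is a minimal hand-written MIR example (illustrative only, not taken from the patch): a VALU producer feeding a SALU consumer through a VGPR-to-SGPR COPY of the kind InstrEmitter inserts:

```
%0:vgpr_32 = V_MOV_B32_e32 0, implicit $exec      ; VALU producer
%1:sreg_32 = COPY %0                              ; VGPR-to-SGPR copy
%2:sreg_32 = S_ADD_I32 %1, %1, implicit-def $scc  ; SALU consumer
```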
In the future I would like VGPR-to-SGPR copies to be legal, and always implemented as a readfirstlane, so that this whole pass could be truly optional and we could skip it at -O0.
They are already legal. As soon as I integrate part2 that takes care of the REG_SEQUENCE and PHIs we can lower all to v_readfirstlane_b32 at -O0.
Now we cannot, because REG_SEQUENCEs and PHIs that have a VGPR input are converted to VALU unconditionally, and we are going to have bugs similar to the one just fixed in VK-CTS.
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
1130 | For now just SIRegisterInfo::getSubRegFromChannel is used. Later I would like to change the SIInstrInfo::readlaneVGPRToSGPR to serve all cases. |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
948 | You probably need to reset NextID to zero with each run of the pass. Better though make it a normal member of the Pass class itself. | |
1145 | Assume input like: %0:SGPR_LO16 = COPY %1.lo16:VGPR_32 — if I read it right, it will produce a V_READFIRSTLANE_B32 with a 16-bit destination and source, which does not work. Assuming that selection managed to produce such input, which path will it take here? | |
1150 | Run clang-format again? |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
1145 | BTW, right now it seems to go via moveToVALU: |

```
# RUN: llc -march=amdgcn -mcpu=gfx1100 -run-pass=si-fix-sgpr-copies -verify-machineinstrs -o - %s
---
name: v16_to_s16
body: |
  bb.0:
    %0:vgpr_32 = IMPLICIT_DEF
    %1:sgpr_lo16 = COPY %0.lo16:vgpr_32
    %2:vgpr_lo16 = COPY %1
    S_ENDPGM 0, implicit %1, implicit %2
...
```

Results in:

```
%0:vgpr_32 = IMPLICIT_DEF
%3:vgpr_lo16 = COPY %0.lo16
%2:vgpr_lo16 = COPY %3
S_ENDPGM 0, implicit %3, implicit %2
```
Perhaps I'm blind, but I don't see where the heuristic takes into account the number of ALU instructions that would be moved from SALU to VALU.
I am not sure that I understand the question.
Currently, we only have VGPR to SGPR copies for the cases where a uniform instruction produces a VGPR. Its uniform users that are SALU need an SGPR.
For each copy we have a choice: convert all its users to VALU, or replace the copy with v_readfirstlane_b32. The algorithm computes the tradeoff.
By inserting v_readfirstlane_b32 we add yet one more VALU instruction. If the SALU chain spawned by the copy is 2 instructions long, inserting v_readfirstlane_b32 is overkill.
We agreed on the heuristic: 1 v_readfirstlane_b32 for at least 3 SALU.
The score is the length of the SALU chain, minus the number of v_readfirstlane_b32 instructions to insert, minus the number of SGPR to VGPR copies needed to get the result back to the VALU.
So, if we decide to move the whole chain to VALU, it already means that keeping it SALU (by replacing the copy with v_readfirstlane_b32) is not profitable.
The number of SALU instructions converted to VALU in this case is SChain.size().
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
1145 | That is what I tried to explain. To reach the place we are talking about, the copy should spawn a SALU chain long enough to be selected for v_readfirstlane_b32. We have no SALU instructions that accept a 16-bit operand explicitly. All 16-bit SALU instructions really take a 32-bit SGPR but only use the HI/LO half of it. So, we cannot create the MIR for the case you are concerned with. |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
87 | It is uninitialized. |
Thanks for the explanation, it all makes sense now. I misread what SChain does, sorry about that.
Looks like this breaks tests: http://45.33.8.238/linux/81242/step_12.txt
Please take a look and revert for now if it takes a while to fix.
Currently SIInstrInfo::copyPhysReg does this:
```
if (RI.isSGPRClass(RC)) {
  if (!RI.isSGPRClass(SrcRC)) {
    reportIllegalCopy(this, MBB, MI, DL, DestReg, SrcReg, KillSrc);
    return;
  }
}
```
That is why I say they are illegal. I would like to change this, so that copyPhysReg will allow them and implement them by emitting readfirstlane.
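As an illustration of that proposal (a hand-written sketch of the intended lowering, not the current copyPhysReg behavior), such a copy would go from being rejected to being emitted as a readfirstlane:

```
; before: currently reported as an illegal copy
$sgpr0 = COPY $vgpr0
; with the proposed change, lowered to
$sgpr0 = V_READFIRSTLANE_B32 $vgpr0, implicit $exec
```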
OK. You can already do this now. I have pointed you to the experimental branch on my GitHub that propagates the DA information to MIR.
In this branch there is an assert if the defining instruction of a VGPR to SGPR copy is uniform. I had to maintain a list of exceptions for those instructions that are divergent but require an SGPR operand. It is not really a long list.
All the AMDGPU LIT tests pass without asserts.
For more information see https://github.com/alex-t/llvm-project/tree/dd_isel_exp: 850646bb02c204da602c7f4e654c7b65e59b6912
The branch is a pure draft for internal use. You may be interested in the DA information propagated to MIR and the exception instruction list.
As for -O0, the pass may be excluded as soon as I integrate part2 of this change, which covers REG_SEQUENCE and PHIs.