Greedy register allocator prefers to move a constrained
live range into a larger allocatable class over spilling
them. This patch defines the necessary superclasses for
vector registers. For subtargets that support copy between
VGPRs and AGPRs, the vector register spills during regalloc
now become just copies.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
411–412 | I think with the intent of this function, you don't need to return aligned classes for aligned classes, the unaligned versions are fine. I guess this gets more complicated in the case where we're using scratch instructions that do require alignment for multi-dword spilling |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
411–412 | I will see the impact of scratch instructions separately. |
There is an elephant in the room: it kills partial spill of wide tuples. I do not think we can afford it without addressing this problem first.
Note that to spill an AGPR to memory on gfx908 one needs an intermediate VGPR and extra VALU (32 extra VALU in a worst case + nops). In the updated tests we can clearly see the code bloat and memory traffic bloat because of the missing partial spill.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
411–412 | The copy of a 64-bit VGPR will use V_PK_MOV_B32 which uses 64 bit operands, which then shall be aligned. | |
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir | ||
63 ↗ | (On Diff #372126) | So this is a clear and predictable regression. Partial spill is killed by this patch. |
llvm/test/CodeGen/AMDGPU/spill-agpr.ll | ||
10 | Looks like the test does not test AGPR spills anymore, while we need to test it at least for gfx908. | |
llvm/test/CodeGen/AMDGPU/spill-agpr.mir | ||
34 ↗ | (On Diff #372126) | Another obvious regression. |
llvm/test/CodeGen/AMDGPU/spill-agpr.mir | ||
---|---|---|
34 ↗ | (On Diff #372126) | This is a regression, but I think it's an acceptable one. This is using regalloc fast, so you get no optimizations. You're only losing the optimization to copy between AGPR/VGPR at -O0. The intent of the test is to stress the low level spill handling, which this still accomplishes |
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir | ||
---|---|---|
5 ↗ | (On Diff #372126) | If we want to have CSR AGPRs, they should probably be spilled using the SplitCSR mechanism instead |
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir | ||
---|---|---|
63 ↗ | (On Diff #372126) | I don't think this is a situation that should happen in the first place. We should be able to split register tuples into different ranges for different subregisters. Overall the allocator needs to be smarter about knowing when only certain subregisters need spilling (which I'm hoping to look at once I come back from vacation) |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
---|---|---|
22 | This isn't allocated at all, this is running just PEI. For CSRs I think we should either not have CSR AGPRs, or use splitCSR |
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir | ||
---|---|---|
63 ↗ | (On Diff #372126) | Yes. I believe it shall be addressed first though. |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
---|---|---|
22 | OK, can we see partial cross class copy anywhere? |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
---|---|---|
22 | Not really, but this is just a general problem with the allocator which I hope to look into soon. It doesn't know how to introduce new subranges to avoid conflicts or to spill the minimum set of required lanes |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
---|---|---|
22 | Look, this is tolerable unless these are AGPRs with their 32 register tuples. |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
---|---|---|
22 | It is true, the Spiller during regalloc doesn't support partial/sub-range tuple spills. max_256_vgprs_spill_9x32_2bb in spill-vgpr-to-agpr.ll, for instance, used to get entire 1024 tuple spill stores and restores. I see similar copies inserted during greedy allocator. We currently don't have a test that captures it. I can include one. |
llvm/test/CodeGen/AMDGPU/agpr-csr.ll | ||
---|---|---|
41 ↗ | (On Diff #372126) | I think Matt it right, the easiest thing is to have no AGPR CSRs. We certainly do not want these scratch spills. |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
22 | This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes. |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
---|---|---|
22 |
I take my word back when I said regallc inserts sub-range tuple copies during tryInstructionSplit. It isn't entirely true. |
Rebase + converted AV spills into VGPR spills by introducing appropriate copies in between.
Added a test case for AV spills.
I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.
Even today we don't generate partial spills.
I am not sure how this patch is going to add any incremental impact on existing partial spill behavior.
What do you mean? The code you are removing does it and spill-to-agpr-partial.mir in particular tests it.
I see your point. The code I removed from SIFrameLowering is supposed to insert a partial tuple copy to agprs that won't happen now.
Is it ok if we retain this code until the sub-range tuple copy/spill issue is fixed?
In that way, we can effectively convert vector spills into a copy-to-superclass during regalloc in all possible cases. The remaining spill pseudos of tuples will get another chance for a partial copy at frame lowering.
Can you experiment with this? I am not sure it will really work and will not get first expensive wide copy and then spilling of that copied value. This at least needs a test to demonstrate how it will work.
Recently added lit test llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll has extreme pressure situations and the regalloc ends up inserting copies between virtual registers of identical regclasses.
It’s due to the allocator’s choice to spill the AGPRs and later restore them into its superclass.
After regalloc:
%563:areg_1024 = V_MFMA_F32_32X32X4F16_e64 %136, %136, %568, 1, 1, 1, implicit $mode, implicit $exec
SI_SPILL_A1024_SAVE %563, %stack.2, $sgpr32, 0, agpr spill
...
%555:av_1024 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0 restores to AV class.
%461.sub3:vreg_128 = COPY %555.sub31
%461.sub2:vreg_128 = COPY %555.sub30
%461.sub1:vreg_128 = COPY %555.sub29
%461.sub0:vreg_128 = COPY %555.sub28
GLOBAL_STORE_DWORDX4_SADDR %151, %461, renamable $sgpr6_sgpr7, 112, 0
The superclass eventually gets VGPRs.
After virtual reg-rewriter:
$agpr32_agpr33_..._agpr62_agpr63 = V_MFMA_F32_32X32X4F16_e64 $vgpr9_vgpr10, $vgpr9_vgpr10, killed $agpr32_agpr33_..._agpr62_agpr63, 1, 1, 1
SI_SPILL_A1024_SAVE killed $agpr32_agpr33_..._agpr62_agpr63, %stack.2, $sgpr32, 0 AGPR spill
...
$vgpr5_vgpr6_..._vgpr35_vgpr36 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0 restores to VGPRs.
renamable $vgpr4 = COPY renamable $vgpr36
renamable $vgpr3 = COPY renamable $vgpr35
renamable $vgpr2 = COPY renamable $vgpr34
renamable $vgpr1 = COPY renamable $vgpr33
GLOBAL_STORE_DWORDX4_SADDR renamable $vgpr0, killed renamable $vgpr1_vgpr2_vgpr3_vgpr4, renamable $sgpr6_sgpr7, 112, 0,
These VGPR copies are redundant here and should have optimized away. But they exist in the ISA.
It is possible that identical regclass copies are needed for alignment constraints in gfx90a and above.
Not sure how do we deal with it. Should we optimize them late in the pre-emit-peephole?
You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.
That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.
You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.
I missed your comment earlier, sorry.
Yes, trying to optimize them at late phases would be risky. It should be done no later than Post-RA scheduler.
But I am not sure we can correctly optimize the subreg tuple copies when strict alignment constraints exist.
I guess, after virtregrewriter the sub-registers are no longer tied together. Correct me if I'm wrong.
That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.
This lit test is compiled only for gfx908 and that would be the reason we see the misaligned tuple.
In fact restoring into av superclass also seems problematic. I believe we have agreed all the code here only work correctly if we have no actual av registers past selection.
That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.
This lit test is compiled only for gfx908 and that would be the reason we see the misaligned tuple.
OK, on gfx908 this is legal. Then something like this shall never happen on gfx90a. I hope it does not.
In fact restoring into av superclass also seems problematic. I believe we have agreed all the code here only work correctly if we have no actual av registers past selection.
The function getLargestLegalSuperClass is used mainly during Greedy, coalescer, and spiller.
Restore to AV class is part of the Spiller during regalloc. In this patch, I made a fix in storeRegToStackSlot to convert superclass spills into VGPR spills by introducing a copy.
AV classes appear only in COPY instances and they all become either VGPRs or AGPRs after virtregrewriter.
Retained SpillVGPRToAGPR implementation at frame lowering to handle the partial tuple spills to registers that are missed during regalloc.
Fixed a broken case in function spillVGPRtoAGPR.
Added more tests.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
1469 | When does this happen? On gfx90a there is no need to copy AGPR to VGPR, it can be stored (and loaded) directly. | |
1617 | Same here. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
1054 | We are already spilling. I.e. we ran out of registers of ValueReg's class and we know it. Why cannot we just copy into a different class instead of introducing an ambiguous AV class? | |
llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll | ||
13 | Can you add gfx90a run line and/or test? This copy should not happen on gfx90a, it can store AGPRs directly. | |
llvm/test/CodeGen/AMDGPU/spill-agpr.ll | ||
10 |
There shall be a test somewhere to do real spilling into memory on both gfx908 and gfx90a. | |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
234 | This should not happen. This is gfx90a and agpr has to be stored directly. This should happen on gfx908 though. |
The combination of Regcoalescer, superclass copy (tryInstructionSplit), and Spiller during regalloc introduce a superclass spill at extreme pressure situations.
Lit test llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll demonstrates that.
This mostly happens for gfx908 where we need a copy for AGPR to VGPR. I didn’t see this happening for gfx90a.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
1469 | Yes, this is mostly needed for gfx908. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
1054 | I didn't follow the question. Are you, in the first place, doubting the need for AV class as a superclass? | |
llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll | ||
13 | Will add a RUN line for gfx90a. | |
llvm/test/CodeGen/AMDGPU/spill-agpr.ll | ||
10 | Will do. | |
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir | ||
234 | You are right. The original test was for gfx90a. |
This might have caused all OpenMP offload tests (with math) to fail for gfx908/90a. @arsenm has a reproducer.
When does this happen? On gfx90a there is no need to copy AGPR to VGPR, it can be stored (and loaded) directly.