This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable copy between VGPR and AGPR classes during regalloc
ClosedPublic

Authored by cdevadas on Sep 5 2021, 10:19 PM.

Details

Summary

Greedy register allocator prefers to move a constrained
live range into a larger allocatable class over spilling
them. This patch defines the necessary superclasses for
vector registers. For subtargets that support copy between
VGPRs and AGPRs, the vector register spills during regalloc
now become just copies.

Diff Detail

Event Timeline

cdevadas created this revision.Sep 5 2021, 10:19 PM
cdevadas requested review of this revision.Sep 5 2021, 10:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 5 2021, 10:19 PM
arsenm added inline comments.Sep 7 2021, 4:59 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
417–418

I think with the intent of this function, you don't need to return aligned classes for aligned classes, the unaligned versions are fine. I guess this gets more complicated in the case where we're using scratch instructions that do require alignment for multi-dword spilling

cdevadas added inline comments.Sep 9 2021, 8:06 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
417–418

I will see the impact of scratch instructions separately.

cdevadas updated this revision to Diff 372126.Sep 12 2021, 10:03 AM
cdevadas retitled this revision from [AMDGPU] Enable VGPR to AGPR copy during regalloc to [AMDGPU] Enable copy between VGPR and AGPR classes during regalloc.
cdevadas edited the summary of this revision. (Show Details)

Defined superclasses for AGPRs to enable copy into a superclass instead of spills.

rampitec requested changes to this revision.Sep 20 2021, 3:15 PM

There is an elephant in the room: it kills partial spill of wide tuples. I do not think we can afford it without addressing this problem first.

Note that to spill an AGPR to memory on gfx908 one needs an intermediate VGPR and extra VALU (32 extra VALU in a worst case + nops). In the updated tests we can clearly see the code bloat and memory traffic bloat because of the missing partial spill.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
417–418

The copy of a 64-bit VGPR will use V_PK_MOV_B32 which uses 64 bit operands, which then shall be aligned.

llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
63 ↗(On Diff #372126)

So this is a clear and predictable regression. Partial spill is killed by this patch.

llvm/test/CodeGen/AMDGPU/spill-agpr.ll
10

Looks like the test does not test AGPR spills anymore, while we need to test it at least for gfx908.

llvm/test/CodeGen/AMDGPU/spill-agpr.mir
34 ↗(On Diff #372126)

Another obvious regression.

This revision now requires changes to proceed.Sep 20 2021, 3:15 PM
arsenm added inline comments.Sep 20 2021, 4:30 PM
llvm/test/CodeGen/AMDGPU/spill-agpr.mir
34 ↗(On Diff #372126)

This is a regression, but I think it's an acceptable one. This is using regalloc fast, so you get no optimizations. You're only losing the optimization to copy between AGPR/VGPR at -O0. The intent of the test is to stress the low level spill handling, which this still accomplishes

arsenm added inline comments.Sep 20 2021, 4:37 PM
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
5 ↗(On Diff #372126)

If we want to have CSR AGPRs, they should probably be spilled using the SplitCSR mechanism instead

arsenm added inline comments.Sep 20 2021, 4:40 PM
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
63 ↗(On Diff #372126)

I don't think this is a situation that should happen in the first place. We should be able to split register tuples into different ranges for different subregisters. Overall the allocator needs to be smarter about knowing when only certain subregisters need spilling (which I'm hoping to look at once I come back from vacation)

rampitec added inline comments.Sep 20 2021, 4:41 PM
llvm/test/CodeGen/AMDGPU/spill-agpr.mir
34 ↗(On Diff #372126)

For fast RA it is acceptable. What about greedy? Can we do partial spill?

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

This is greedy, not fastra, the same regression.

arsenm added inline comments.Sep 20 2021, 4:43 PM
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

This isn't allocated at all, this is running just PEI. For CSRs I think we should either not have CSR AGPRs, or use splitCSR

rampitec added inline comments.Sep 20 2021, 4:44 PM
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
63 ↗(On Diff #372126)

Yes. I believe it shall be addressed first though.

rampitec added inline comments.Sep 20 2021, 4:45 PM
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

OK, can we see partial cross class copy anywhere?

arsenm added inline comments.Sep 20 2021, 4:53 PM
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

Not really, but this is just a general problem with the allocator which I hope to look into soon. It doesn't know how to introduce new subranges to avoid conflicts or to spill the minimum set of required lanes

rampitec added inline comments.Sep 20 2021, 5:00 PM
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

Look, this is tolerable unless these are AGPRs with their 32 register tuples.

cdevadas added inline comments.Sep 20 2021, 6:54 PM
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

It is true, the Spiller during regalloc doesn't support partial/sub-range tuple spills.
But it's not the case with copy. Greedy did handle partial tuple copy into a superclass.

max_256_vgprs_spill_9x32_2bb in spill-vgpr-to-agpr.ll, for instance, used to get entire 1024 tuple spill stores and restores.
Now that has become minimal copies.

I see similar copies inserted during greedy allocator.
%1200.sub16_sub17_sub18_sub19:av_1024 = COPY %1201.sub16_sub17_sub18_sub19:vreg_1024 // copy to the super class, AV.
%1202.sub16_sub17_sub18_sub19:vreg_1024 = COPY %1200.sub16_sub17_sub18_sub19:av_1024
// later moving it back to V.

We currently don't have a test that captures it. I can include one.

rampitec added inline comments.Sep 21 2021, 11:25 AM
llvm/test/CodeGen/AMDGPU/agpr-csr.ll
41 ↗(On Diff #372126)

I think Matt it right, the easiest thing is to have no AGPR CSRs. We certainly do not want these scratch spills.

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes.

cdevadas added inline comments.Nov 4 2021, 5:37 AM
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
19 ↗(On Diff #372126)

This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes.

I take my word back when I said regallc inserts sub-range tuple copies during tryInstructionSplit. It isn't entirely true.
Register coalescer inserts copies that now become copies to equivalent superclasses as we defined getLargestLegalSuperClass for the vector classes. The trySplit function doesn't do anything new to introduce subrange copies.
I failed to come up with a test case that introduce tuple subrange copies for all supported AMDGPU tuple sizes.
Like Matt said, regalloc needs a fix to better handle the subranges.

cdevadas updated this revision to Diff 384840.Nov 4 2021, 12:17 PM

Rebase + converted AV spills into VGPR spills by introducing appropriate copies in between.
Added a test case for AV spills.

I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.

I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.

Even today we don't generate partial spills.
I am not sure how this patch is going to add any incremental impact on existing partial spill behavior.

I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.

Even today we don't generate partial spills.
I am not sure how this patch is going to add any incremental impact on existing partial spill behavior.

What do you mean? The code you are removing does it and spill-to-agpr-partial.mir in particular tests it.

I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.

Even today we don't generate partial spills.
I am not sure how this patch is going to add any incremental impact on existing partial spill behavior.

What do you mean? The code you are removing does it and spill-to-agpr-partial.mir in particular tests it.

I see your point. The code I removed from SIFrameLowering is supposed to insert a partial tuple copy to agprs that won't happen now.
Is it ok if we retain this code until the sub-range tuple copy/spill issue is fixed?
In that way, we can effectively convert vector spills into a copy-to-superclass during regalloc in all possible cases. The remaining spill pseudos of tuples will get another chance for a partial copy at frame lowering.

I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.

Even today we don't generate partial spills.
I am not sure how this patch is going to add any incremental impact on existing partial spill behavior.

What do you mean? The code you are removing does it and spill-to-agpr-partial.mir in particular tests it.

I see your point. The code I removed from SIFrameLowering is supposed to insert a partial tuple copy to agprs that won't happen now.
Is it ok if we retain this code until the sub-range tuple copy/spill issue is fixed?
In that way, we can effectively convert vector spills into a copy-to-superclass during regalloc in all possible cases. The remaining spill pseudos of tuples will get another chance for a partial copy at frame lowering.

Can you experiment with this? I am not sure it will really work and will not get first expensive wide copy and then spilling of that copied value. This at least needs a test to demonstrate how it will work.

Recently added lit test llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll has extreme pressure situations and the regalloc ends up inserting copies between virtual registers of identical regclasses.
It’s due to the allocator’s choice to spill the AGPRs and later restore them into its superclass.

After regalloc:
%563:areg_1024 = V_MFMA_F32_32X32X4F16_e64 %136, %136, %568, 1, 1, 1, implicit $mode, implicit $exec
SI_SPILL_A1024_SAVE %563, %stack.2, $sgpr32, 0, agpr spill
...
%555:av_1024 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0
restores to AV class.
%461.sub3:vreg_128 = COPY %555.sub31
%461.sub2:vreg_128 = COPY %555.sub30
%461.sub1:vreg_128 = COPY %555.sub29
%461.sub0:vreg_128 = COPY %555.sub28
GLOBAL_STORE_DWORDX4_SADDR %151, %461, renamable $sgpr6_sgpr7, 112, 0

The superclass eventually gets VGPRs.
After virtual reg-rewriter:
$agpr32_agpr33_..._agpr62_agpr63 = V_MFMA_F32_32X32X4F16_e64 $vgpr9_vgpr10, $vgpr9_vgpr10, killed $agpr32_agpr33_..._agpr62_agpr63, 1, 1, 1
SI_SPILL_A1024_SAVE killed $agpr32_agpr33_..._agpr62_agpr63, %stack.2, $sgpr32, 0 AGPR spill
...
$vgpr5_vgpr6_..._vgpr35_vgpr36 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0
restores to VGPRs.
renamable $vgpr4 = COPY renamable $vgpr36
renamable $vgpr3 = COPY renamable $vgpr35
renamable $vgpr2 = COPY renamable $vgpr34
renamable $vgpr1 = COPY renamable $vgpr33
GLOBAL_STORE_DWORDX4_SADDR renamable $vgpr0, killed renamable $vgpr1_vgpr2_vgpr3_vgpr4, renamable $sgpr6_sgpr7, 112, 0,

These VGPR copies are redundant here and should have optimized away. But they exist in the ISA.
It is possible that identical regclass copies are needed for alignment constraints in gfx90a and above.
Not sure how do we deal with it. Should we optimize them late in the pre-emit-peephole?

Recently added lit test llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll has extreme pressure situations and the regalloc ends up inserting copies between virtual registers of identical regclasses.
It’s due to the allocator’s choice to spill the AGPRs and later restore them into its superclass.

After regalloc:
%563:areg_1024 = V_MFMA_F32_32X32X4F16_e64 %136, %136, %568, 1, 1, 1, implicit $mode, implicit $exec
SI_SPILL_A1024_SAVE %563, %stack.2, $sgpr32, 0, agpr spill
...
%555:av_1024 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0
restores to AV class.
%461.sub3:vreg_128 = COPY %555.sub31
%461.sub2:vreg_128 = COPY %555.sub30
%461.sub1:vreg_128 = COPY %555.sub29
%461.sub0:vreg_128 = COPY %555.sub28
GLOBAL_STORE_DWORDX4_SADDR %151, %461, renamable $sgpr6_sgpr7, 112, 0

The superclass eventually gets VGPRs.
After virtual reg-rewriter:
$agpr32_agpr33_..._agpr62_agpr63 = V_MFMA_F32_32X32X4F16_e64 $vgpr9_vgpr10, $vgpr9_vgpr10, killed $agpr32_agpr33_..._agpr62_agpr63, 1, 1, 1
SI_SPILL_A1024_SAVE killed $agpr32_agpr33_..._agpr62_agpr63, %stack.2, $sgpr32, 0 AGPR spill
...
$vgpr5_vgpr6_..._vgpr35_vgpr36 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0
restores to VGPRs.
renamable $vgpr4 = COPY renamable $vgpr36
renamable $vgpr3 = COPY renamable $vgpr35
renamable $vgpr2 = COPY renamable $vgpr34
renamable $vgpr1 = COPY renamable $vgpr33
GLOBAL_STORE_DWORDX4_SADDR renamable $vgpr0, killed renamable $vgpr1_vgpr2_vgpr3_vgpr4, renamable $sgpr6_sgpr7, 112, 0,

These VGPR copies are redundant here and should have optimized away. But they exist in the ISA.
It is possible that identical regclass copies are needed for alignment constraints in gfx90a and above.
Not sure how do we deal with it. Should we optimize them late in the pre-emit-peephole?

You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.
That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.

You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.

I missed your comment earlier, sorry.
Yes, trying to optimize them at late phases would be risky. It should be done no later than Post-RA scheduler.
But I am not sure we can correctly optimize the subreg tuple copies when strict alignment constraints exist.
I guess, after virtregrewriter the sub-registers are no longer tied together. Correct me if I'm wrong.

That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.

This lit test is compiled only for gfx908 and that would be the reason we see the misaligned tuple.

You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.

I missed your comment earlier, sorry.
Yes, trying to optimize them at late phases would be risky. It should be done no later than Post-RA scheduler.
But I am not sure we can correctly optimize the subreg tuple copies when strict alignment constraints exist.
I guess, after virtregrewriter the sub-registers are no longer tied together. Correct me if I'm wrong.

In fact restoring into av superclass also seems problematic. I believe we have agreed all the code here only work correctly if we have no actual av registers past selection.

That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.

This lit test is compiled only for gfx908 and that would be the reason we see the misaligned tuple.

OK, on gfx908 this is legal. Then something like this shall never happen on gfx90a. I hope it does not.

In fact restoring into av superclass also seems problematic. I believe we have agreed all the code here only work correctly if we have no actual av registers past selection.

The function getLargestLegalSuperClass is used mainly during Greedy, coalescer, and spiller.
Restore to AV class is part of the Spiller during regalloc. In this patch, I made a fix in storeRegToStackSlot to convert superclass spills into VGPR spills by introducing a copy.
AV classes appear only in COPY instances and they all become either VGPRs or AGPRs after virtregrewriter.

cdevadas updated this revision to Diff 388004.Nov 17 2021, 11:42 AM
cdevadas edited the summary of this revision. (Show Details)

Retained SpillVGPRToAGPR implementation at frame lowering to handle the partial tuple spills to registers that are missed during regalloc.
Fixed a broken case in function spillVGPRtoAGPR.
Added more tests.

rampitec added inline comments.Nov 17 2021, 1:34 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1470

When does this happen? On gfx90a there is no need to copy AGPR to VGPR, it can be stored (and loaded) directly.

1618

Same here.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1060

We are already spilling. I.e. we ran out of registers of ValueReg's class and we know it. Why cannot we just copy into a different class instead of introducing an ambiguous AV class?

llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll
14

Can you add gfx90a run line and/or test? This copy should not happen on gfx90a, it can store AGPRs directly.

llvm/test/CodeGen/AMDGPU/spill-agpr.ll
10

Looks like the test does not test AGPR spills anymore, while we need to test it at least for gfx908.

There shall be a test somewhere to do real spilling into memory on both gfx908 and gfx90a.

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
234 ↗(On Diff #388004)

This should not happen. This is gfx90a and agpr has to be stored directly. This should happen on gfx908 though.

The combination of Regcoalescer, superclass copy (tryInstructionSplit), and Spiller during regalloc introduce a superclass spill at extreme pressure situations.
Lit test llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll demonstrates that.
This mostly happens for gfx908 where we need a copy for AGPR to VGPR. I didn’t see this happening for gfx90a.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1470

Yes, this is mostly needed for gfx908.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1060

I didn't follow the question. Are you, in the first place, doubting the need for AV class as a superclass?

llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll
14

Will add a RUN line for gfx90a.

llvm/test/CodeGen/AMDGPU/spill-agpr.ll
10

Will do.

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
234 ↗(On Diff #388004)

You are right. The original test was for gfx90a.
The new patterns should be in a separate test for gfx908. I will create one.

rampitec added inline comments.Nov 22 2021, 11:42 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1060

No, I just mean it is not needed right here. In particular that is the reason for regression in the spill-to-agpr-partial.mir.

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
234 ↗(On Diff #388004)

But this copy still should not happen on gfx90a as it does.

cdevadas updated this revision to Diff 389517.Nov 24 2021, 9:11 AM

Addressed the review comments.

rampitec added inline comments.Nov 24 2021, 1:33 PM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
107

It is not clear what's stored and what's loaded in this test and what are register classes here. I think this is important for this patch.

llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll
232–233

This comment is probably not relevant anymore?

cdevadas updated this revision to Diff 389662.Nov 24 2021, 10:51 PM

Suggestions addressed.

rampitec accepted this revision.Nov 29 2021, 11:19 AM

LGTM. Thanks!

This revision is now accepted and ready to land.Nov 29 2021, 11:19 AM
This revision was landed with ongoing or failed builds.Nov 29 2021, 7:22 PM
This revision was automatically updated to reflect the committed changes.
jdoerfert reopened this revision.Dec 8 2021, 8:42 AM
jdoerfert added a subscriber: jdoerfert.

This might have caused all OpenMP offload tests (with math) to fail for gfx908/90a. @arsenm has a reproducer.

This revision is now accepted and ready to land.Dec 8 2021, 8:42 AM

This might have caused all OpenMP offload tests (with math) to fail for gfx908/90a. @arsenm has a reproducer.

Posted D115439 to fix this issue.