This is an archive of the discontinued LLVM Phabricator instance.

Differential D109301

[AMDGPU] Enable copy between VGPR and AGPR classes during regalloc
ClosedPublic

Authored by cdevadas on Sep 5 2021, 10:19 PM.

Details

Reviewers

arsenm
rampitec

Commits

rG5297cbf04532: [AMDGPU] Enable copy between VGPR and AGPR classes during regalloc

Summary

The greedy register allocator prefers to move a constrained
live range into a larger allocatable class over spilling
it. This patch defines the necessary superclasses for
vector registers. For subtargets that support copies between
VGPRs and AGPRs, vector register spills during regalloc
now become just copies.
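
As a rough illustration of the idea in the summary, the sketch below models a register class as a bank plus a width and maps VGPR and AGPR classes to a combined AV class of the same width and alignment. The `Bank`, `RegClass`, and `largestLegalSuperClass` names are hypothetical stand-ins invented for this example; they are not the LLVM types, and the real hook is the `getLargestLegalSuperClass` override shown in the diff.

```cpp
#include <cassert>

// Hypothetical, simplified register-class model (not the LLVM data structures).
enum class Bank { SGPR, VGPR, AGPR, AV };

struct RegClass {
  Bank bank;       // which register file the class draws from
  unsigned bits;   // tuple width in bits (32, 64, ..., 1024)
  bool align2;     // true for the *_Align2 variants
};

// Toy analogue of the superclass query: on subtargets with MAI instructions,
// a VGPR or AGPR class widens to the AV class of the same width and alignment.
// Every other class is returned unchanged (a simplification for this sketch).
RegClass largestLegalSuperClass(RegClass rc, bool hasMAIInsts) {
  if (hasMAIInsts && (rc.bank == Bank::VGPR || rc.bank == Bank::AGPR))
    return RegClass{Bank::AV, rc.bits, rc.align2};
  return rc;
}
```

With this model, a constrained `vreg_1024` live range can be retyped to `av_1024`, which gives the allocator the option of satisfying it with either register file instead of spilling.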

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

cdevadas created this revision.Sep 5 2021, 10:19 PM

Herald added subscribers: foad, kerbowa, hiraditya and 8 others.Sep 5 2021, 10:19 PM

cdevadas requested review of this revision.Sep 5 2021, 10:19 PM

Herald added a project: Restricted Project.Sep 5 2021, 10:19 PM

Herald added subscribers: llvm-commits, wdng.

cdevadas added a parent revision: D109300: [AMDGPU] Make vector superclasses allocatable.Sep 5 2021, 10:19 PM

Harbormaster completed remote builds in B122717: Diff 370843.Sep 5 2021, 10:27 PM

• hafixo added a commit: rCRT373035: hwasan: Compatibility fixes for short granules..Sep 6 2021, 12:44 AM

• hafixo added a commit: rGc336557f0238: hwasan: Compatibility fixes for short granules..Sep 6 2021, 12:47 AM

thopre removed a commit: rGc336557f0238: hwasan: Compatibility fixes for short granules..Sep 7 2021, 2:47 AM

thopre removed a commit: rCRT373035: hwasan: Compatibility fixes for short granules..Sep 7 2021, 2:51 AM

arsenm added inline comments.Sep 7 2021, 4:59 PM

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
411–412	I think with the intent of this function, you don't need to return aligned classes for aligned classes, the unaligned versions are fine. I guess this gets more complicated in the case where we're using scratch instructions that do require alignment for multi-dword spilling

cdevadas added inline comments.Sep 9 2021, 8:06 AM

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
411–412	I will see the impact of scratch instructions separately.

Defined superclasses for AGPRs to enable copy into a superclass instead of spills.

Harbormaster completed remote builds in B123595: Diff 372126.Sep 12 2021, 10:43 AM

Ping

cdevadas added a child revision: D110053: [AMDGPU] Add a regclass flag for scalar registers.Sep 20 2021, 1:27 AM

There is an elephant in the room: it kills partial spill of wide tuples. I do not think we can afford it without addressing this problem first.

Note that to spill an AGPR to memory on gfx908 one needs an intermediate VGPR and extra VALU (32 extra VALU in the worst case + nops). In the updated tests we can clearly see the code bloat and memory traffic bloat because of the missing partial spill.
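
The "32 extra VALU" figure follows from the tuple width: a 1024-bit tuple holds 32 dwords, and each dword needs one move through a temporary VGPR when AGPRs cannot be stored directly. The helper names below are hypothetical, and the one-move-per-dword cost is an assumption for illustration that ignores the nops mentioned above.

```cpp
#include <cassert>

// Number of 32-bit lanes (dwords) in a register tuple of the given width.
constexpr unsigned dwordsInTuple(unsigned bits) { return bits / 32; }

// Assumed cost model: if the subtarget needs an intermediate VGPR to store an
// AGPR (as described for gfx908), every dword costs one extra VALU move.
// If AGPRs can be stored directly, no extra VALU instructions are counted.
constexpr unsigned extraValuForAgprSpill(unsigned bits, bool needsVgprTemp) {
  return needsVgprTemp ? dwordsInTuple(bits) : 0;
}
```

Under these assumptions, a full `areg_1024` spill costs 32 extra moves, while spilling only a 128-bit sub-range would cost 4, which is why losing partial spills matters most for wide tuples.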

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
411–412	The copy of a 64-bit VGPR will use V_PK_MOV_B32 which uses 64 bit operands, which then shall be aligned.
llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
63 ↗	(On Diff #372126)	So this is a clear and predictable regression. Partial spill is killed by this patch.
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
10	Looks like the test does not test AGPR spills anymore, while we need to test it at least for gfx908.
llvm/test/CodeGen/AMDGPU/spill-agpr.mir
34 ↗	(On Diff #372126)	Another obvious regression.

This revision now requires changes to proceed.Sep 20 2021, 3:15 PM

arsenm added inline comments.Sep 20 2021, 4:30 PM

llvm/test/CodeGen/AMDGPU/spill-agpr.mir
34 ↗	(On Diff #372126)	This is a regression, but I think it's an acceptable one. This is using regalloc fast, so you get no optimizations. You're only losing the optimization to copy between AGPR/VGPR at -O0. The intent of the test is to stress the low level spill handling, which this still accomplishes

arsenm added inline comments.Sep 20 2021, 4:37 PM

llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
5 ↗	(On Diff #372126)	If we want to have CSR AGPRs, they should probably be spilled using the SplitCSR mechanism instead

arsenm added inline comments.Sep 20 2021, 4:40 PM

llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
63 ↗	(On Diff #372126)	I don't think this is a situation that should happen in the first place. We should be able to split register tuples into different ranges for different subregisters. Overall the allocator needs to be smarter about knowing when only certain subregisters need spilling (which I'm hoping to look at once I come back from vacation)

rampitec added inline comments.Sep 20 2021, 4:41 PM

llvm/test/CodeGen/AMDGPU/spill-agpr.mir
34 ↗	(On Diff #372126)	For fast RA it is acceptable. What about greedy? Can we do partial spill?
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	This is greedy, not fastra, the same regression.

arsenm added inline comments.Sep 20 2021, 4:43 PM

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	This isn't allocated at all, this is running just PEI. For CSRs I think we should either not have CSR AGPRs, or use splitCSR

rampitec added inline comments.Sep 20 2021, 4:44 PM

llvm/test/CodeGen/AMDGPU/pei-build-spill-partial-agpr.mir
63 ↗	(On Diff #372126)	Yes. I believe it shall be addressed first though.

rampitec added inline comments.Sep 20 2021, 4:45 PM

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	OK, can we see partial cross class copy anywhere?

arsenm added inline comments.Sep 20 2021, 4:53 PM

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	Not really, but this is just a general problem with the allocator which I hope to look into soon. It doesn't know how to introduce new subranges to avoid conflicts or to spill the minimum set of required lanes

rampitec added inline comments.Sep 20 2021, 5:00 PM

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	Look, this is tolerable unless these are AGPRs with their 32 register tuples.

cdevadas added inline comments.Sep 20 2021, 6:54 PM

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	It is true, the Spiller during regalloc doesn't support partial/sub-range tuple spills. But it's not the case with copy. Greedy did handle partial tuple copy into a superclass. `max_256_vgprs_spill_9x32_2bb` in spill-vgpr-to-agpr.ll, for instance, used to get entire 1024 tuple spill stores and restores. Now that has become minimal copies. I see similar copies inserted during greedy allocator.
%1200.sub16_sub17_sub18_sub19:av_1024 = COPY %1201.sub16_sub17_sub18_sub19:vreg_1024 // copy to the super class, AV.
%1202.sub16_sub17_sub18_sub19:vreg_1024 = COPY %1200.sub16_sub17_sub18_sub19:av_1024 // later moving it back to V.
We currently don't have a test that captures it. I can include one.
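
To make the size of such a sub-range copy concrete, the sketch below splits a composite sub-register index name like `sub16_sub17_sub18_sub19` into the dword lanes it covers. The `lanesOfSubRegIndex` helper is hypothetical and assumes every component has the form `subN`; it is only a string-level illustration, not how LLVM represents sub-register indices.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: parse "subA_subB_..." into the lane numbers A, B, ...
// Assumes each underscore-separated component is "sub" followed by digits.
std::vector<unsigned> lanesOfSubRegIndex(const std::string &name) {
  std::vector<unsigned> lanes;
  std::size_t pos = 0;
  while (pos < name.size()) {
    std::size_t end = name.find('_', pos);
    if (end == std::string::npos)
      end = name.size();
    const std::string part = name.substr(pos, end - pos);
    // Skip the three-character "sub" prefix and read the lane number.
    lanes.push_back(static_cast<unsigned>(std::stoul(part.substr(3))));
    pos = end + 1;
  }
  return lanes;
}
```

For the copies quoted above, this yields lanes 16 through 19, so 4 of the 32 dwords in the 1024-bit tuple are copied rather than the whole tuple.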

rampitec added inline comments.Sep 21 2021, 11:25 AM

llvm/test/CodeGen/AMDGPU/agpr-csr.ll
41 ↗	(On Diff #372126)	I think Matt is right, the easiest thing is to have no AGPR CSRs. We certainly do not want these scratch spills.
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes.

cdevadas added inline comments.Nov 4 2021, 5:37 AM

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
22	"This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes."
I take my word back when I said regalloc inserts sub-range tuple copies during `tryInstructionSplit`. It isn't entirely true. Register coalescer inserts copies that now become copies to equivalent superclasses as we defined `getLargestLegalSuperClass` for the vector classes. The `trySplit` function doesn't do anything new to introduce subrange copies. I failed to come up with a test case that introduces tuple subrange copies for all supported AMDGPU tuple sizes. Like Matt said, regalloc needs a fix to better handle the subranges.

Rebase + converted AV spills into VGPR spills by introducing appropriate copies in between.
Added a test case for AV spills.

Harbormaster completed remote builds in B132522: Diff 384840.Nov 4 2021, 12:18 PM

I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.

In D109301#3109750, @rampitec wrote:

I still do not think we can do it without solving partial spill/copy issue. At least not for wide tuples.

Even today we don't generate partial spills.
I am not sure how this patch is going to add any incremental impact on existing partial spill behavior.

In D109301#3109806, @cdevadas wrote:

Even today we don't generate partial spills.
I am not sure how this patch is going to add any incremental impact on existing partial spill behavior.

What do you mean? The code you are removing does it and spill-to-agpr-partial.mir in particular tests it.

In D109301#3109814, @rampitec wrote:

What do you mean? The code you are removing does it and spill-to-agpr-partial.mir in particular tests it.

I see your point. The code I removed from SIFrameLowering is supposed to insert a partial tuple copy to agprs that won't happen now.
Is it ok if we retain this code until the sub-range tuple copy/spill issue is fixed?
In that way, we can effectively convert vector spills into a copy-to-superclass during regalloc in all possible cases. The remaining spill pseudos of tuples will get another chance for a partial copy at frame lowering.

In D109301#3110900, @cdevadas wrote:

I see your point. The code I removed from SIFrameLowering is supposed to insert a partial tuple copy to agprs that won't happen now.
Is it ok if we retain this code until the sub-range tuple copy/spill issue is fixed?
In that way, we can effectively convert vector spills into a copy-to-superclass during regalloc in all possible cases. The remaining spill pseudos of tuples will get another chance for a partial copy at frame lowering.

Can you experiment with this? I am not sure it will really work; we may first get an expensive wide copy and then a spill of that copied value. This at least needs a test to demonstrate how it will work.

Recently added lit test llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll has extreme pressure situations and the regalloc ends up inserting copies between virtual registers of identical regclasses.
It’s due to the allocator’s choice to spill the AGPRs and later restore them into its superclass.

After regalloc:
%563:areg_1024 = V_MFMA_F32_32X32X4F16_e64 %136, %136, %568, 1, 1, 1, implicit $mode, implicit $exec
SI_SPILL_A1024_SAVE %563, %stack.2, $sgpr32, 0 // agpr spill
...
%555:av_1024 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0 // restores to AV class.
%461.sub3:vreg_128 = COPY %555.sub31
%461.sub2:vreg_128 = COPY %555.sub30
%461.sub1:vreg_128 = COPY %555.sub29
%461.sub0:vreg_128 = COPY %555.sub28
GLOBAL_STORE_DWORDX4_SADDR %151, %461, renamable $sgpr6_sgpr7, 112, 0

The superclass eventually gets VGPRs.
After virtual reg-rewriter:
$agpr32_agpr33_..._agpr62_agpr63 = V_MFMA_F32_32X32X4F16_e64 $vgpr9_vgpr10, $vgpr9_vgpr10, killed $agpr32_agpr33_..._agpr62_agpr63, 1, 1, 1
SI_SPILL_A1024_SAVE killed $agpr32_agpr33_..._agpr62_agpr63, %stack.2, $sgpr32, 0 // AGPR spill
...
$vgpr5_vgpr6_..._vgpr35_vgpr36 = SI_SPILL_V1024_RESTORE %stack.2, $sgpr32, 0 // restores to VGPRs.
renamable $vgpr4 = COPY renamable $vgpr36
renamable $vgpr3 = COPY renamable $vgpr35
renamable $vgpr2 = COPY renamable $vgpr34
renamable $vgpr1 = COPY renamable $vgpr33
GLOBAL_STORE_DWORDX4_SADDR renamable $vgpr0, killed renamable $vgpr1_vgpr2_vgpr3_vgpr4, renamable $sgpr6_sgpr7, 112, 0,

These VGPR copies are redundant here and should have been optimized away, but they exist in the ISA.
It is possible that identical regclass copies are needed for alignment constraints in gfx90a and above.
Not sure how we should deal with it. Should we optimize them late in the pre-emit-peephole?

You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.
That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.

You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.

I missed your comment earlier, sorry.
Yes, trying to optimize them at late phases would be risky. It should be done no later than Post-RA scheduler.
But I am not sure we can correctly optimize the subreg tuple copies when strict alignment constraints exist.
I guess, after virtregrewriter the sub-registers are no longer tied together. Correct me if I'm wrong.

That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.

This lit test is compiled only for gfx908 and that would be the reason we see the misaligned tuple.

In D109301#3130948, @cdevadas wrote:

You cannot optimize it in pre-emit peephole as it will create new hazards which will not be handled.

I missed your comment earlier, sorry.
Yes, trying to optimize them at late phases would be risky. It should be done no later than Post-RA scheduler.
But I am not sure we can correctly optimize the subreg tuple copies when strict alignment constraints exist.
I guess, after virtregrewriter the sub-registers are no longer tied together. Correct me if I'm wrong.

In fact restoring into av superclass also seems problematic. I believe we have agreed all the code here only works correctly if we have no actual av registers past selection.

That is also not what we would want on gfx90a: '$vgpr5_vgpr6_.. ='. I am not sure if spilling code would handle it correctly but this is a misaligned tuple.

This lit test is compiled only for gfx908 and that would be the reason we see the misaligned tuple.

OK, on gfx908 this is legal. Then something like this shall never happen on gfx90a. I hope it does not.
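
The alignment point can be sketched as a simple legality check. The `Tuple` struct and `isLegalTuple` function are hypothetical names for this example, and the rule encoded here (multi-dword tuples must start at an even register when the subtarget requires aligned tuples) is a simplified reading of the `_Align2` classes discussed in this review.

```cpp
#include <cassert>

// Hypothetical description of a physical register tuple: index of its first
// 32-bit register and the number of consecutive dwords it spans.
struct Tuple {
  unsigned first;
  unsigned dwords;
};

// Assumed rule: when the subtarget requires aligned tuples, any tuple wider
// than one dword must begin at an even register index.
constexpr bool isLegalTuple(Tuple t, bool requiresAlignedTuples) {
  if (!requiresAlignedTuples || t.dwords < 2)
    return true;
  return t.first % 2 == 0;
}
```

Under this model, the `$vgpr5_vgpr6_...` tuple from the example starts at an odd index: it passes when alignment is not required (the gfx908 case) and fails when it is.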

In fact restoring into av superclass also seems problematic. I believe we have agreed all the code here only works correctly if we have no actual av registers past selection.

The function getLargestLegalSuperClass is used mainly during Greedy, coalescer, and spiller.
Restore to AV class is part of the Spiller during regalloc. In this patch, I made a fix in storeRegToStackSlot to convert superclass spills into VGPR spills by introducing a copy.
AV classes appear only in COPY instances and they all become either VGPRs or AGPRs after virtregrewriter.
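
The spill conversion described here can be sketched as a small model of the emitted instruction sequence. The `Bank` enum, the `emitSpillSave` function, and the generic mnemonic strings are placeholders invented for this example; the actual change is the `storeRegToStackSlot` hunk in the diff, which builds a `COPY` into an equivalent VGPR class before the spill pseudo.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical register banks for this sketch.
enum class Bank { VGPR, AGPR, AV };

// Returns placeholder mnemonics for the instructions a spill save would emit.
// An AV source is first copied into a temporary VGPR, then spilled as a VGPR.
std::vector<std::string> emitSpillSave(Bank bank) {
  std::vector<std::string> seq;
  if (bank == Bank::AV) {
    seq.push_back("COPY");  // AV -> temporary VGPR of the same width
    bank = Bank::VGPR;      // the spill now operates on the VGPR copy
  }
  seq.push_back(bank == Bank::AGPR ? "AGPR_SPILL_SAVE" : "VGPR_SPILL_SAVE");
  return seq;
}
```

In this model only the AV case grows by one instruction; plain VGPR and AGPR spills keep their single spill pseudo.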

Retained SpillVGPRToAGPR implementation at frame lowering to handle the partial tuple spills to registers that are missed during regalloc.
Fixed a broken case in function spillVGPRtoAGPR.
Added more tests.

Herald added a subscriber: MatzeB.Nov 17 2021, 11:42 AM

Harbormaster completed remote builds in B134782: Diff 388004.Nov 17 2021, 11:42 AM

rampitec added inline comments.Nov 17 2021, 1:34 PM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1469	When does this happen? On gfx90a there is no need to copy AGPR to VGPR, it can be stored (and loaded) directly.
1617	Same here.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1054	We are already spilling. I.e. we ran out of registers of ValueReg's class and we know it. Why cannot we just copy into a different class instead of introducing an ambiguous AV class?
llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll
13	Can you add gfx90a run line and/or test? This copy should not happen on gfx90a, it can store AGPRs directly.
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
10	Looks like the test does not test AGPR spills anymore, while we need to test it at least for gfx908. There shall be a test somewhere to do real spilling into memory on both gfx908 and gfx90a.
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
234	This should not happen. This is gfx90a and agpr has to be stored directly. This should happen on gfx908 though.

The combination of Regcoalescer, superclass copy (tryInstructionSplit), and Spiller during regalloc introduces a superclass spill in extreme pressure situations.
Lit test llvm/test/CodeGen/AMDGPU/schedule-xdl-resource.ll demonstrates that.
This mostly happens for gfx908 where we need a copy for AGPR to VGPR. I didn’t see this happening for gfx90a.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
1469	Yes, this is mostly needed for gfx908.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1054	I didn't follow the question. Are you, in the first place, doubting the need for AV class as a superclass?
llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll
13	Will add a RUN line for gfx90a.
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
10	Will do.
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
234	You are right. The original test was for gfx90a. The new patterns should be in a separate test for gfx908. I will create one.

rampitec added inline comments.Nov 22 2021, 11:42 AM

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1054	No, I just mean it is not needed right here. In particular that is the reason for regression in the spill-to-agpr-partial.mir.
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir
234	But this copy still should not happen on gfx90a as it does.

Addressed the review comments.

Harbormaster completed remote builds in B135858: Diff 389517.Nov 24 2021, 9:12 AM

rampitec added inline comments.Nov 24 2021, 1:33 PM

llvm/test/CodeGen/AMDGPU/spill-agpr.ll
107	It is not clear what's stored and what's loaded in this test and what are register classes here. I think this is important for this patch.
llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll
233	This comment is probably not relevant anymore?

Suggestions addressed.

Harbormaster completed remote builds in B135977: Diff 389662.Nov 24 2021, 10:51 PM

cdevadas removed a parent revision: D109300: [AMDGPU] Make vector superclasses allocatable.Nov 29 2021, 2:57 AM

LGTM. Thanks!

This revision is now accepted and ready to land.Nov 29 2021, 11:19 AM

This revision was landed with ongoing or failed builds.Nov 29 2021, 7:22 PM

Closed by commit rG5297cbf04532: [AMDGPU] Enable copy between VGPR and AGPR classes during regalloc (authored by cdevadas).

This revision was automatically updated to reflect the committed changes.

cdevadas added a commit: rG5297cbf04532: [AMDGPU] Enable copy between VGPR and AGPR classes during regalloc.

cdevadas removed a child revision: D110053: [AMDGPU] Add a regclass flag for scalar registers.Dec 1 2021, 2:32 AM

This might have caused all OpenMP offload tests (with math) to fail for gfx908/90a. @arsenm has a reproducer.

This revision is now accepted and ready to land.Dec 8 2021, 8:42 AM

In D109301#3179777, @jdoerfert wrote:

This might have caused all OpenMP offload tests (with math) to fail for gfx908/90a. @arsenm has a reproducer.

Posted D115439 to fix this issue.

cdevadas mentioned this in rGcf58b9ce9804: [AMDGPU] Add AV class spill pseudo instructions.Dec 10 2021, 12:13 AM

Merged D115439 with cf58b9ce98043d4c9af5ffb5b47a18009b145b5b

arsenm mentioned this in D115996: [AMDGPU] Don't remove VGPR to AGPR dead spills from frame info.Dec 20 2021, 8:09 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp	26 lines
llvm/lib/Target/AMDGPU/SIRegisterInfo.h	4 lines
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp	74 lines
llvm/test/CodeGen/AMDGPU/extend-phi-subrange-not-in-parent.mir	11 lines
llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll	38 lines
llvm/test/CodeGen/AMDGPU/spill-agpr.ll	84 lines
llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir	223 lines
llvm/test/CodeGen/AMDGPU/spill-vector-superclass.ll	23 lines
llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll	76 lines

Diff 388004

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

[1,428 lines hidden] void SIInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,

   MachinePointerInfo PtrInfo
     = MachinePointerInfo::getFixedStack(*MF, FrameIndex);
   MachineMemOperand *MMO = MF->getMachineMemOperand(
       PtrInfo, MachineMemOperand::MOStore, FrameInfo.getObjectSize(FrameIndex),
       FrameInfo.getObjectAlign(FrameIndex));
   unsigned SpillSize = TRI->getSpillSize(*RC);
 
+  MachineRegisterInfo &MRI = MF->getRegInfo();
   if (RI.isSGPRClass(RC)) {
     MFI->setHasSpilledSGPRs();
     assert(SrcReg != AMDGPU::M0 && "m0 should not be spilled");
     assert(SrcReg != AMDGPU::EXEC_LO && SrcReg != AMDGPU::EXEC_HI &&
            SrcReg != AMDGPU::EXEC && "exec should not be spilled");
 
     // We are only allowed to create one new instruction when spilling
     // registers, so we need to use pseudo instruction for spilling SGPRs.
     const MCInstrDesc &OpDesc = get(getSGPRSpillSaveOpcode(SpillSize));
 
     // The SGPR spill/restore instructions only work on number sgprs, so we need
     // to make sure we are using the correct register class.
     if (SrcReg.isVirtual() && SpillSize == 4) {
-      MachineRegisterInfo &MRI = MF->getRegInfo();
       MRI.constrainRegClass(SrcReg, &AMDGPU::SReg_32_XM0_XEXECRegClass);
     }
 
     BuildMI(MBB, MI, DL, OpDesc)
       .addReg(SrcReg, getKillRegState(isKill)) // data
       .addFrameIndex(FrameIndex) // addr
       .addMemOperand(MMO)
       .addReg(MFI->getStackPtrOffsetReg(), RegState::Implicit);
 
     if (RI.spillSGPRToVGPR())
       FrameInfo.setStackID(FrameIndex, TargetStackID::SGPRSpill);
     return;
   }
 
   unsigned Opcode = RI.isAGPRClass(RC) ? getAGPRSpillSaveOpcode(SpillSize)
                                        : getVGPRSpillSaveOpcode(SpillSize);
   MFI->setHasSpilledVGPRs();
 
+  if (RI.isVectorSuperClass(RC)) {
+    // Convert an AV spill into a VGPR spill. Introduce a copy from AV to an
+    // equivalent VGPR register beforehand. Regalloc might want to introduce
+    // AV spills only to be relevant until rewriter at which they become
+    // either spills of VGPRs or AGPRs.
+    Register TmpVReg = MRI.createVirtualRegister(RI.getEquivalentVGPRClass(RC));
+    BuildMI(MBB, MI, DL, get(TargetOpcode::COPY), TmpVReg)
+        .addReg(SrcReg, RegState::Kill);
+    SrcReg = TmpVReg;
+  }
+
   BuildMI(MBB, MI, DL, get(Opcode))
     .addReg(SrcReg, getKillRegState(isKill)) // data
     .addFrameIndex(FrameIndex) // addr
     .addReg(MFI->getStackPtrOffsetReg()) // scratch_offset
     .addImm(0) // offset
     .addMemOperand(MMO);
 }

[117 lines hidden] BuildMI(MBB, MI, DL, OpDesc, DestReg)
       .addMemOperand(MMO)
       .addReg(MFI->getStackPtrOffsetReg(), RegState::Implicit);
 
     return;
   }
 
   unsigned Opcode = RI.isAGPRClass(RC) ? getAGPRSpillRestoreOpcode(SpillSize)
                                        : getVGPRSpillRestoreOpcode(SpillSize);
 
+  bool IsVectorSuperClass = RI.isVectorSuperClass(RC);
+  Register TmpReg = DestReg;
+  if (IsVectorSuperClass) {
+    // For AV classes, insert the spill restore to a VGPR followed by a copy
+    // into an equivalent AV register.
+    MachineRegisterInfo &MRI = MF->getRegInfo();
+    DestReg = MRI.createVirtualRegister(RI.getEquivalentVGPRClass(RC));
+  }
   BuildMI(MBB, MI, DL, get(Opcode), DestReg)
     .addFrameIndex(FrameIndex) // vaddr
     .addReg(MFI->getStackPtrOffsetReg()) // scratch_offset
     .addImm(0) // offset
     .addMemOperand(MMO);
+
+  if (IsVectorSuperClass)
+    BuildMI(MBB, MI, DL, get(TargetOpcode::COPY), TmpReg)
+        .addReg(DestReg, RegState::Kill);
 }
 
 void SIInstrInfo::insertNoop(MachineBasicBlock &MBB,
                              MachineBasicBlock::iterator MI) const {
   insertNoops(MBB, MI, 1);
 }
 
 void SIInstrInfo::insertNoops(MachineBasicBlock &MBB,
[6,663 lines hidden]

llvm/lib/Target/AMDGPU/SIRegisterInfo.h

[71 lines hidden] public:
   const uint32_t *getNoPreservedMask() const override;
 
   // Stack access is very expensive. CSRs are also the high registers, and we
   // want to minimize the number of used registers.
   unsigned getCSRFirstUseCost() const override {
     return 100;
   }
 
+  const TargetRegisterClass *
+  getLargestLegalSuperClass(const TargetRegisterClass *RC,
+                            const MachineFunction &MF) const override;
+
   Register getFrameRegister(const MachineFunction &MF) const override;
 
   bool hasBasePointer(const MachineFunction &MF) const;
   Register getBaseRegister() const;
 
   bool shouldRealignStack(const MachineFunction &MF) const override;
   bool requiresRegisterScavenging(const MachineFunction &Fn) const override;
 
[305 lines hidden]

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

Show First 20 Lines • Show All 390 Lines • ▼ Show 20 Lines	default:
return nullptr;		return nullptr;
}		}
}		}

const uint32_t *SIRegisterInfo::getNoPreservedMask() const {		const uint32_t *SIRegisterInfo::getNoPreservedMask() const {
return CSR_AMDGPU_NoRegs_RegMask;		return CSR_AMDGPU_NoRegs_RegMask;
}		}

		const TargetRegisterClass *
		SIRegisterInfo::getLargestLegalSuperClass(const TargetRegisterClass *RC,
		const MachineFunction &MF) const {
		// FIXME: Should have a helper function like getEquivalentVGPRClass to get the
		// equivalent AV class. If used one, the verifier will crash after
		// RegBankSelect in the GISel flow. The aligned regclasses are not fully given
		// until Instruction selection.
		if (MF.getSubtarget<GCNSubtarget>().hasMAIInsts() &&
		(isVGPRClass(RC) \|\| isAGPRClass(RC))) {
		if (RC == &AMDGPU::VGPR_32RegClass \|\| RC == &AMDGPU::AGPR_32RegClass)
		return &AMDGPU::AV_32RegClass;
		if (RC == &AMDGPU::VReg_64RegClass \|\| RC == &AMDGPU::AReg_64RegClass)
		return &AMDGPU::AV_64RegClass;
		if (RC == &AMDGPU::VReg_64_Align2RegClass \|\|
		arsenm: I think with the intent of this function, you don't need to return aligned classes for aligned classes; the unaligned versions are fine. I guess this gets more complicated in the case where we're using scratch instructions that do require alignment for multi-dword spilling.
		cdevadas (Author): I will look at the impact of scratch instructions separately.
		rampitec: The copy of a 64-bit VGPR will use V_PK_MOV_B32, which uses 64-bit operands, which then shall be aligned.
		RC == &AMDGPU::AReg_64_Align2RegClass)
		return &AMDGPU::AV_64_Align2RegClass;
		if (RC == &AMDGPU::VReg_96RegClass || RC == &AMDGPU::AReg_96RegClass)
		return &AMDGPU::AV_96RegClass;
		if (RC == &AMDGPU::VReg_96_Align2RegClass ||
		RC == &AMDGPU::AReg_96_Align2RegClass)
		return &AMDGPU::AV_96_Align2RegClass;
		if (RC == &AMDGPU::VReg_128RegClass || RC == &AMDGPU::AReg_128RegClass)
		return &AMDGPU::AV_128RegClass;
		if (RC == &AMDGPU::VReg_128_Align2RegClass ||
		RC == &AMDGPU::AReg_128_Align2RegClass)
		return &AMDGPU::AV_128_Align2RegClass;
		if (RC == &AMDGPU::VReg_160RegClass || RC == &AMDGPU::AReg_160RegClass)
		return &AMDGPU::AV_160RegClass;
		if (RC == &AMDGPU::VReg_160_Align2RegClass ||
		RC == &AMDGPU::AReg_160_Align2RegClass)
		return &AMDGPU::AV_160_Align2RegClass;
		if (RC == &AMDGPU::VReg_192RegClass || RC == &AMDGPU::AReg_192RegClass)
		return &AMDGPU::AV_192RegClass;
		if (RC == &AMDGPU::VReg_192_Align2RegClass ||
		RC == &AMDGPU::AReg_192_Align2RegClass)
		return &AMDGPU::AV_192_Align2RegClass;
		if (RC == &AMDGPU::VReg_256RegClass || RC == &AMDGPU::AReg_256RegClass)
		return &AMDGPU::AV_256RegClass;
		if (RC == &AMDGPU::VReg_256_Align2RegClass ||
		RC == &AMDGPU::AReg_256_Align2RegClass)
		return &AMDGPU::AV_256_Align2RegClass;
		if (RC == &AMDGPU::VReg_512RegClass || RC == &AMDGPU::AReg_512RegClass)
		return &AMDGPU::AV_512RegClass;
		if (RC == &AMDGPU::VReg_512_Align2RegClass ||
		RC == &AMDGPU::AReg_512_Align2RegClass)
		return &AMDGPU::AV_512_Align2RegClass;
		if (RC == &AMDGPU::VReg_1024RegClass || RC == &AMDGPU::AReg_1024RegClass)
		return &AMDGPU::AV_1024RegClass;
		if (RC == &AMDGPU::VReg_1024_Align2RegClass ||
		RC == &AMDGPU::AReg_1024_Align2RegClass)
		return &AMDGPU::AV_1024_Align2RegClass;
		}

		return TargetRegisterInfo::getLargestLegalSuperClass(RC, MF);
		}
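
The FIXME above asks for a helper, analogous to getEquivalentVGPRClass, that returns the equivalent AV class. A minimal standalone sketch of that idea follows. It does not use LLVM: `Bank`, `RegClassDesc`, `getEquivalentAVClass` and `className` are all hypothetical names invented for illustration, and the sketch only models the rule the if-chain encodes (same size, same alignment, AV bank).

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a vector register class; not an LLVM type.
enum class Bank { VGPR, AGPR, AV };

struct RegClassDesc {
  Bank bank;      // which register file the class allocates from
  unsigned bits;  // tuple size in bits: 32, 64, 96, ..., 1024
  bool align2;    // true for the *_Align2 variants
};

// Hypothetical helper: map a VGPR or AGPR class to the AV class with the
// same size and alignment, mirroring the if-chain in the patch.
constexpr RegClassDesc getEquivalentAVClass(RegClassDesc RC) {
  return {Bank::AV, RC.bits, RC.align2};
}

// Rebuild the class name following the spelling used in the diff
// (VGPR_32 / AGPR_32 for single registers, VReg_N / AReg_N for tuples).
std::string className(const RegClassDesc &RC) {
  std::string Prefix;
  switch (RC.bank) {
  case Bank::VGPR: Prefix = RC.bits == 32 ? "VGPR_" : "VReg_"; break;
  case Bank::AGPR: Prefix = RC.bits == 32 ? "AGPR_" : "AReg_"; break;
  case Bank::AV:   Prefix = "AV_"; break;
  }
  return Prefix + std::to_string(RC.bits) + (RC.align2 ? "_Align2" : "");
}

// Compile-time check: size and alignment carry over unchanged.
static_assert(getEquivalentAVClass({Bank::VGPR, 64, true}).align2,
              "aligned classes map to aligned AV classes");
```

In this model, `className(getEquivalentAVClass({Bank::AReg..., ...}))` style lookups replace the pairwise comparisons; whether a real helper could be used here depends on the verifier issue the FIXME describes.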

Register SIRegisterInfo::getFrameRegister(const MachineFunction &MF) const {		Register SIRegisterInfo::getFrameRegister(const MachineFunction &MF) const {
const SIFrameLowering *TFI =		const SIFrameLowering *TFI =
MF.getSubtarget<GCNSubtarget>().getFrameLowering();		MF.getSubtarget<GCNSubtarget>().getFrameLowering();
const SIMachineFunctionInfo *FuncInfo = MF.getInfo<SIMachineFunctionInfo>();		const SIMachineFunctionInfo *FuncInfo = MF.getInfo<SIMachineFunctionInfo>();
// During ISel lowering we always reserve the stack pointer in entry		// During ISel lowering we always reserve the stack pointer in entry
// functions, but never actually want to reference it when accessing our own		// functions, but never actually want to reference it when accessing our own
// frame. If we need a frame pointer we use it, but otherwise we can just use		// frame. If we need a frame pointer we use it, but otherwise we can just use
// an immediate "0" which we represent by returning NoRegister.		// an immediate "0" which we represent by returning NoRegister.
▲ Show 20 Lines • Show All 576 Lines • ▼ Show 20 Lines	if (Reg == AMDGPU::NoRegister)
return MachineInstrBuilder();		return MachineInstrBuilder();

bool IsStore = MI->mayStore();		bool IsStore = MI->mayStore();
MachineRegisterInfo &MRI = MF->getRegInfo();		MachineRegisterInfo &MRI = MF->getRegInfo();
auto TRI = static_cast<const SIRegisterInfo *>(MRI.getTargetRegisterInfo());		auto TRI = static_cast<const SIRegisterInfo *>(MRI.getTargetRegisterInfo());

unsigned Dst = IsStore ? Reg : ValueReg;		unsigned Dst = IsStore ? Reg : ValueReg;
unsigned Src = IsStore ? ValueReg : Reg;		unsigned Src = IsStore ? ValueReg : Reg;
unsigned Opc = (IsStore ^ TRI->isVGPR(MRI, Reg)) ? AMDGPU::V_ACCVGPR_WRITE_B32_e64		bool IsVGPR = TRI->isVGPR(MRI, Reg);
		DebugLoc DL = MI->getDebugLoc();
		if (IsVGPR == TRI->isVGPR(MRI, ValueReg)) {
		// Spiller during regalloc may restore a spilled register to its superclass.
		// It could result in AGPR spills restored to VGPRs or the other way around,
		// making the src and dst with identical regclasses at this point. It just
		// needs a copy in such cases.
		auto CopyMIB = BuildMI(MBB, MI, DL, TII->get(AMDGPU::COPY), Dst)
		rampitec: We are already spilling, i.e. we ran out of registers of ValueReg's class and we know it. Why can't we just copy into a different class instead of introducing an ambiguous AV class?
		cdevadas (Author): I didn't follow the question. Are you, in the first place, doubting the need for the AV class as a superclass?
		rampitec: No, I just mean it is not needed right here. In particular, that is the reason for the regression in spill-to-agpr-partial.mir.
		.addReg(Src, getKillRegState(IsKill));
		CopyMIB->setAsmPrinterFlag(MachineInstr::ReloadReuse);
		return CopyMIB;
		}
		unsigned Opc = (IsStore ^ IsVGPR) ? AMDGPU::V_ACCVGPR_WRITE_B32_e64
: AMDGPU::V_ACCVGPR_READ_B32_e64;		: AMDGPU::V_ACCVGPR_READ_B32_e64;

auto MIB = BuildMI(MBB, MI, MI->getDebugLoc(), TII->get(Opc), Dst)		auto MIB = BuildMI(MBB, MI, DL, TII->get(Opc), Dst)
.addReg(Src, getKillRegState(IsKill));		.addReg(Src, getKillRegState(IsKill));
MIB->setAsmPrinterFlag(MachineInstr::ReloadReuse);		MIB->setAsmPrinterFlag(MachineInstr::ReloadReuse);
return MIB;		return MIB;
}		}
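
The choice between a plain COPY and the two ACCVGPR moves in the hunk above can be summarized in a small standalone sketch. It is only a model of the decision logic, not LLVM code: `SpillOp` and `selectSpillOp` are hypothetical names, and the three booleans stand in for `IsStore`, `TRI->isVGPR(MRI, Reg)` and `TRI->isVGPR(MRI, ValueReg)`.

```cpp
#include <cassert>

// Hypothetical labels for the three instructions the hunk can emit.
enum class SpillOp { Copy, AccVGPRWrite, AccVGPRRead };

// IsStore:        true for a spill (Dst = Reg), false for a reload (Dst = ValueReg).
// RegIsVGPR:      whether the register backing the spill slot is a VGPR.
// ValueRegIsVGPR: whether the value being spilled or reloaded lives in a VGPR.
constexpr SpillOp selectSpillOp(bool IsStore, bool RegIsVGPR,
                                bool ValueRegIsVGPR) {
  // Both registers in the same bank: a plain COPY is enough.
  if (RegIsVGPR == ValueRegIsVGPR)
    return SpillOp::Copy;
  // Otherwise the destination bank decides: writing an AGPR needs
  // V_ACCVGPR_WRITE, writing a VGPR needs V_ACCVGPR_READ.
  return (IsStore != RegIsVGPR) ? SpillOp::AccVGPRWrite
                                : SpillOp::AccVGPRRead;
}

// Spilling a VGPR value into an AGPR writes the AGPR.
static_assert(selectSpillOp(true, false, true) == SpillOp::AccVGPRWrite, "");
// Reloading that value back into a VGPR reads the AGPR.
static_assert(selectSpillOp(false, false, true) == SpillOp::AccVGPRRead, "");
// Same bank on both sides falls back to COPY.
static_assert(selectSpillOp(true, true, true) == SpillOp::Copy, "");
```

The `IsStore != RegIsVGPR` test is the boolean equivalent of the `IsStore ^ IsVGPR` expression in the patch.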

// This differs from buildSpillLoadStore by only scavenging a VGPR. It does not		// This differs from buildSpillLoadStore by only scavenging a VGPR. It does not
// need to handle the case where an SGPR may need to be spilled while spilling.		// need to handle the case where an SGPR may need to be spilled while spilling.
static bool buildMUBUFOffsetLoadStore(const GCNSubtarget &ST,		static bool buildMUBUFOffsetLoadStore(const GCNSubtarget &ST,

llvm/test/CodeGen/AMDGPU/extend-phi-subrange-not-in-parent.mir

machineFunctionInfo:
stackPtrOffsetReg: '$sgpr32'		stackPtrOffsetReg: '$sgpr32'
occupancy: 7		occupancy: 7
body: |		body: |
; CHECK-LABEL: name: subrange_for_this_mask_not_found		; CHECK-LABEL: name: subrange_for_this_mask_not_found
; CHECK: bb.0:		; CHECK: bb.0:
; CHECK: successors: %bb.1(0x80000000)		; CHECK: successors: %bb.1(0x80000000)
; CHECK: [[DEF:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF		; CHECK: [[DEF:%[0-9]+]]:vgpr_32 = IMPLICIT_DEF
; CHECK: [[DEF1:%[0-9]+]]:vreg_1024_align2 = IMPLICIT_DEF		; CHECK: [[DEF1:%[0-9]+]]:vreg_1024_align2 = IMPLICIT_DEF
; CHECK: SI_SPILL_V1024_SAVE [[DEF1]], %stack.0, $sgpr32, 0, implicit $exec :: (store (s1024) into %stack.0, align 4, addrspace 5)		; CHECK: [[COPY:%[0-9]+]]:av_1024_align2 = COPY [[DEF1]]
; CHECK: bb.1:		; CHECK: bb.1:
; CHECK: successors: %bb.1(0x40000000), %bb.2(0x40000000)		; CHECK: successors: %bb.1(0x40000000), %bb.2(0x40000000)
; CHECK: S_NOP 0, implicit [[DEF1]]		; CHECK: S_NOP 0, implicit [[DEF1]]
; CHECK: S_NOP 0, implicit [[DEF1]]		; CHECK: S_NOP 0, implicit [[DEF1]]
; CHECK: [[DEF2:%[0-9]+]]:vreg_1024_align2 = IMPLICIT_DEF		; CHECK: [[DEF2:%[0-9]+]]:vreg_1024_align2 = IMPLICIT_DEF
; CHECK: S_CBRANCH_VCCNZ %bb.1, implicit undef $vcc		; CHECK: S_CBRANCH_VCCNZ %bb.1, implicit undef $vcc
; CHECK: bb.2:		; CHECK: bb.2:
; CHECK: successors: %bb.3(0x80000000)		; CHECK: successors: %bb.3(0x80000000)
; CHECK: [[SI_SPILL_V1024_RESTORE:%[0-9]+]]:vreg_1024_align2 = SI_SPILL_V1024_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s1024) from %stack.0, align 4, addrspace 5)		; CHECK: undef %5.sub1_sub2_sub3_sub4_sub5_sub6_sub7_sub8_sub9_sub10_sub11_sub12_sub13_sub14_sub15_sub16:av_1024_align2 = COPY [[COPY]].sub1_sub2_sub3_sub4_sub5_sub6_sub7_sub8_sub9_sub10_sub11_sub12_sub13_sub14_sub15_sub16 {
; CHECK: undef %5.sub1_sub2_sub3_sub4_sub5_sub6_sub7_sub8_sub9_sub10_sub11_sub12_sub13_sub14_sub15_sub16:vreg_1024_align2 = COPY [[SI_SPILL_V1024_RESTORE]].sub1_sub2_sub3_sub4_sub5_sub6_sub7_sub8_sub9_sub10_sub11_sub12_sub13_sub14_sub15_sub16 {		; CHECK: internal %5.sub16_sub17_sub18_sub19_sub20_sub21_sub22_sub23_sub24_sub25_sub26_sub27_sub28_sub29_sub30_sub31:av_1024_align2 = COPY [[COPY]].sub16_sub17_sub18_sub19_sub20_sub21_sub22_sub23_sub24_sub25_sub26_sub27_sub28_sub29_sub30_sub31
; CHECK: internal %5.sub16_sub17_sub18_sub19_sub20_sub21_sub22_sub23_sub24_sub25_sub26_sub27_sub28_sub29_sub30_sub31:vreg_1024_align2 = COPY [[SI_SPILL_V1024_RESTORE]].sub16_sub17_sub18_sub19_sub20_sub21_sub22_sub23_sub24_sub25_sub26_sub27_sub28_sub29_sub30_sub31
; CHECK: }		; CHECK: }
; CHECK: %5.sub0:vreg_1024_align2 = IMPLICIT_DEF		; CHECK: %5.sub0:av_1024_align2 = IMPLICIT_DEF
; CHECK: S_NOP 0, implicit %5.sub0		; CHECK: S_NOP 0, implicit %5.sub0
; CHECK: bb.3:		; CHECK: bb.3:
; CHECK: successors: %bb.4(0x80000000)		; CHECK: successors: %bb.4(0x80000000)
; CHECK: S_NOP 0, implicit %5		; CHECK: S_NOP 0, implicit %5
; CHECK: bb.4:		; CHECK: bb.4:
; CHECK: successors: %bb.3(0x40000000), %bb.5(0x40000000)		; CHECK: successors: %bb.3(0x40000000), %bb.5(0x40000000)
; CHECK: [[DEF2:%[0-9]+]]:vreg_1024_align2 = IMPLICIT_DEF		; CHECK: [[DEF2:%[0-9]+]]:av_1024_align2 = IMPLICIT_DEF
; CHECK: S_CBRANCH_VCCNZ %bb.3, implicit undef $vcc		; CHECK: S_CBRANCH_VCCNZ %bb.3, implicit undef $vcc
; CHECK: bb.5:		; CHECK: bb.5:
; CHECK: undef %3.sub0:vreg_1024_align2 = COPY [[DEF]]		; CHECK: undef %3.sub0:vreg_1024_align2 = COPY [[DEF]]
; CHECK: S_NOP 0, implicit %3		; CHECK: S_NOP 0, implicit %3
bb.0:		bb.0:
%0:vgpr_32 = IMPLICIT_DEF		%0:vgpr_32 = IMPLICIT_DEF
%1:vreg_1024_align2 = IMPLICIT_DEF		%1:vreg_1024_align2 = IMPLICIT_DEF
%2:vreg_1024_align2 = COPY %1		%2:vreg_1024_align2 = COPY %1

llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll

This file was added.

				;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --stop-after=greedy,1 -verify-machineinstrs < %s | FileCheck -check-prefix=REGALLOC %s
				;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 --stop-after=prologepilog -verify-machineinstrs < %s | FileCheck -check-prefix=PEI %s

				; Partial reg copy and spill missed during regalloc handled later at frame lowering.
				define amdgpu_kernel void @partial_copy(<4 x i32> %arg) #0 {
				; REGALLOC-LABEL: name: partial_copy
				; REGALLOC: bb.0 (%ir-block.0):
				; REGALLOC: INLINEASM &"; def $0", 1 /* sideeffect attdialect */, 2949130 /* regdef:VReg_64 */, def [[VREG_64:%[0-9]+]]
				; REGALLOC: SI_SPILL_V64_SAVE [[VREG_64]], %stack.0
				; REGALLOC: [[V_MFMA_I32_4X4X4I8:%[0-9]+]]:areg_128 = V_MFMA_I32_4X4X4I8_e64
				; REGALLOC: [[SI_SPILL_V64_RESTORE:%[0-9]+]]:vreg_64 = SI_SPILL_V64_RESTORE %stack.0
				; REGALLOC: GLOBAL_STORE_DWORDX2 undef %{{[0-9]+}}:vreg_64, [[SI_SPILL_V64_RESTORE]]
				; REGALLOC: [[COPY3:%[0-9]+]]:vreg_128 = COPY [[V_MFMA_I32_4X4X4I8]]
				rampitec: Can you add a gfx90a run line and/or test? This copy should not happen on gfx90a; it can store AGPRs directly.
				cdevadas (Author): Will add a RUN line for gfx90a.
				; REGALLOC: GLOBAL_STORE_DWORDX4 undef %{{[0-9]+}}:vreg_64, [[COPY3]]

				; PEI-LABEL: name: partial_copy
				; PEI: bb.0 (%ir-block.0):
				; PEI: INLINEASM &"; def $0", 1 /* sideeffect attdialect */, 2949130 /* regdef:VReg_64 */, def renamable $vgpr0_vgpr1
				; PEI: BUFFER_STORE_DWORD_OFFSET killed $vgpr0
				; PEI: $agpr4 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr1
				; PEI: renamable $agpr0_agpr1_agpr2_agpr3 = V_MFMA_I32_4X4X4I8_e64
				; PEI: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET
				; PEI: $vgpr1 = V_ACCVGPR_READ_B32_e64 $agpr4
				; PEI: GLOBAL_STORE_DWORDX2 undef renamable ${{.*}}, killed renamable $vgpr0_vgpr1
				; PEI: renamable $vgpr0_vgpr1_vgpr2_vgpr3 = COPY killed renamable $agpr0_agpr1_agpr2_agpr3, implicit $exec
				; PEI: GLOBAL_STORE_DWORDX4 undef renamable ${{.*}}, killed renamable $vgpr0_vgpr1_vgpr2_vgpr3
				%v0 = call <4 x i32> asm sideeffect "; def $0", "=v" ()
				%v1 = call <2 x i32> asm sideeffect "; def $0", "=v" ()
				%mai = tail call <4 x i32> @llvm.amdgcn.mfma.i32.4x4x4i8(i32 1, i32 2, <4 x i32> %arg, i32 0, i32 0, i32 0)
				store volatile <4 x i32> %v0, <4 x i32> addrspace(1)* undef
				store volatile <2 x i32> %v1, <2 x i32> addrspace(1)* undef
				store volatile <4 x i32> %mai, <4 x i32> addrspace(1)* undef
				ret void
				}

				declare <4 x i32> @llvm.amdgcn.mfma.i32.4x4x4i8(i32, i32, <4 x i32>, i32, i32, i32)

				attributes #0 = { nounwind "amdgpu-num-vgpr"="5" }

llvm/test/CodeGen/AMDGPU/spill-agpr.ll

	; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX908,A2V %s			; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX908 %s
	; RUN: llc -march=amdgcn -mcpu=gfx908 -amdgpu-spill-vgpr-to-agpr=0 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX908,GFX908-A2M,A2M %s			; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX90A %s
	; RUN: llc -march=amdgcn -mcpu=gfx90a -amdgpu-spill-vgpr-to-agpr=0 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX90A,GFX90A-A2M,A2M %s

	; GCN-LABEL: {{^}}max_24regs_32a_used:			; GCN-LABEL: {{^}}max_24regs_32a_used:
	; A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
	; A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
	; GCN-DAG: v_mfma_f32_16x16x1f32			; GCN-DAG: v_mfma_f32_16x16x1f32
	; GCN-DAG: v_mfma_f32_16x16x1f32			; GCN-DAG: v_mfma_f32_16x16x1f32
	; A2V-NOT: SCRATCH_RSRC			; GCN-DAG: v_accvgpr_read_b32
	; GFX908-A2M-DAG: v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a0 ; Reload Reuse			; GCN-NOT: buffer_store_dword
				rampitec: Looks like the test does not test AGPR spills anymore, while we need to test it at least for gfx908.
				rampitec: > Looks like the test does not test AGPR spills anymore, while we need to test it at least for gfx908.
				There shall be a test somewhere to do real spilling into memory on both gfx908 and gfx90a.
				cdevadas (Author): Will do.
	; A2V: v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a0 ; Reload Reuse			; GCN-NOT: buffer_load_dword
	; GFX908-A2M: buffer_store_dword v[[VSPILL]], off, s[{{[0-9:]+}}], 0 offset:[[FI:[0-9]+]] ; 4-byte Folded Spill			; GFX908-NOT: v_accvgpr_write_b32
	; GFX908-A2M: buffer_load_dword v[[VSPILL:[0-9]+]], off, s[{{[0-9:]+}}], 0 offset:[[FI]] ; 4-byte Folded Reload			; GFX90A: v_accvgpr_write_b32
	; GFX90A-NOT: v_accvgpr_read_b32			; GCN: ScratchSize: 0
	; GFX90A-A2M: buffer_store_dword a{{[0-9]+}}, off, s[{{[0-9:]+}}], 0 offset:[[FI:[0-9]+]] ; 4-byte Folded Spill
	; GFX90A-A2M: buffer_load_dword a{{[0-9]+}}, off, s[{{[0-9:]+}}], 0 offset:[[FI]] ; 4-byte Folded Reload
	; GFX908: v_accvgpr_write_b32 a{{[0-9]+}}, v[[VSPILL]] ; Reload Reuse
	; GFX90A-NOT: v_accvgpr_write_b32
	; A2V: ScratchSize: 0
	define amdgpu_kernel void @max_24regs_32a_used(<16 x float> addrspace(1)* %arg, float addrspace(1)* %out) #0 {			define amdgpu_kernel void @max_24regs_32a_used(<16 x float> addrspace(1)* %arg, float addrspace(1)* %out) #0 {
	bb:			bb:
	%in.1 = load <16 x float>, <16 x float> addrspace(1)* %arg			%in.1 = load <16 x float>, <16 x float> addrspace(1)* %arg
	%mai.1 = tail call <16 x float> @llvm.amdgcn.mfma.f32.16x16x1f32(float 1.0, float 1.0, <16 x float> %in.1, i32 0, i32 0, i32 0)			%mai.1 = tail call <16 x float> @llvm.amdgcn.mfma.f32.16x16x1f32(float 1.0, float 1.0, <16 x float> %in.1, i32 0, i32 0, i32 0)
	%mai.2 = tail call <16 x float> @llvm.amdgcn.mfma.f32.16x16x1f32(float 1.0, float 1.0, <16 x float> %mai.1, i32 0, i32 0, i32 0)			%mai.2 = tail call <16 x float> @llvm.amdgcn.mfma.f32.16x16x1f32(float 1.0, float 1.0, <16 x float> %mai.1, i32 0, i32 0, i32 0)
	%elt1 = extractelement <16 x float> %mai.2, i32 0			%elt1 = extractelement <16 x float> %mai.2, i32 0
	%elt2 = extractelement <16 x float> %mai.1, i32 15			%elt2 = extractelement <16 x float> %mai.1, i32 15
	%elt3 = extractelement <16 x float> %mai.1, i32 14			%elt3 = extractelement <16 x float> %mai.1, i32 14
	%elt4 = extractelement <16 x float> %mai.2, i32 1			%elt4 = extractelement <16 x float> %mai.2, i32 1
	store float %elt1, float addrspace(1)* %out			store float %elt1, float addrspace(1)* %out
	%gep1 = getelementptr float, float addrspace(1)* %out, i64 1			%gep1 = getelementptr float, float addrspace(1)* %out, i64 1
	store float %elt2, float addrspace(1)* %gep1			store float %elt2, float addrspace(1)* %gep1
	%gep2 = getelementptr float, float addrspace(1)* %out, i64 2			%gep2 = getelementptr float, float addrspace(1)* %out, i64 2
	store float %elt3, float addrspace(1)* %gep2			store float %elt3, float addrspace(1)* %gep2
	%gep3 = getelementptr float, float addrspace(1)* %out, i64 3			%gep3 = getelementptr float, float addrspace(1)* %out, i64 3
	store float %elt4, float addrspace(1)* %gep3			store float %elt4, float addrspace(1)* %gep3

	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}max_12regs_13a_used:			; GCN-LABEL: {{^}}max_12regs_13a_used:
	; A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
	; A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
	; A2V-NOT: SCRATCH_RSRC			; GCN: v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a{{[0-9]+}}
	; GFX908-DAG: v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a{{[0-9]+}} ; Reload Reuse			; GCN-NOT: buffer_store_dword
	; GFX908-A2M: buffer_store_dword v[[VSPILL]], off, s[{{[0-9:]+}}], 0 offset:[[FI:[0-9]+]] ; 4-byte Folded Spill			; GCN-NOT: buffer_load_dword
	; GFX908-A2M: buffer_load_dword v[[VSPILL:[0-9]+]], off, s[{{[0-9:]+}}], 0 offset:[[FI]] ; 4-byte Folded Reload			; GCN: v_accvgpr_write_b32 a{{[0-9]+}}, v[[VSPILL]]
	; GFX90A-A2M: buffer_store_dword a{{[0-9]+}}, off, s[{{[0-9:]+}}], 0 offset:[[FI:[0-9]+]] ; 4-byte Folded Spill			; GCN: ScratchSize: 0
	; GFX90A-A2M: buffer_load_dword a{{[0-9]+}}, off, s[{{[0-9:]+}}], 0 offset:[[FI]] ; 4-byte Folded Reload
	; A2V: v_accvgpr_write_b32 a{{[0-9]+}}, v[[VSPILL]] ; Reload Reuse
	; A2V: ScratchSize: 0
	define amdgpu_kernel void @max_12regs_13a_used(i32 %cond, <4 x float> addrspace(1)* %arg, <4 x float> addrspace(1)* %out) #2 {			define amdgpu_kernel void @max_12regs_13a_used(i32 %cond, <4 x float> addrspace(1)* %arg, <4 x float> addrspace(1)* %out) #2 {
	bb:			bb:
	%in.1 = load <4 x float>, <4 x float> addrspace(1)* %arg			%in.1 = load <4 x float>, <4 x float> addrspace(1)* %arg
	%mai.1 = tail call <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float 1.0, float 1.0, <4 x float> %in.1, i32 0, i32 0, i32 0)			%mai.1 = tail call <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float 1.0, float 1.0, <4 x float> %in.1, i32 0, i32 0, i32 0)
	%mai.2 = tail call <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float 1.0, float 1.0, <4 x float> %mai.1, i32 0, i32 0, i32 0)			%mai.2 = tail call <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float 1.0, float 1.0, <4 x float> %mai.1, i32 0, i32 0, i32 0)
	%cmp = icmp eq i32 %cond, 0			%cmp = icmp eq i32 %cond, 0
	br i1 %cmp, label %use, label %st			br i1 %cmp, label %use, label %st

	use:			use:
	call void asm sideeffect "", "a,a,a,a,a"(i32 1, i32 2, i32 3, i32 4, i32 5)			call void asm sideeffect "", "a,a,a,a,a"(i32 1, i32 2, i32 3, i32 4, i32 5)
	store volatile <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, <4 x float> addrspace(1)* %out			store volatile <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, <4 x float> addrspace(1)* %out
	br label %st			br label %st

	st:			st:
	%gep1 = getelementptr <4 x float>, <4 x float> addrspace(1)* %out, i64 16			%gep1 = getelementptr <4 x float>, <4 x float> addrspace(1)* %out, i64 16
	%gep2 = getelementptr <4 x float>, <4 x float> addrspace(1)* %out, i64 32			%gep2 = getelementptr <4 x float>, <4 x float> addrspace(1)* %out, i64 32
	call void asm sideeffect "", "a,a"(<4 x float> %mai.1, <4 x float> %mai.2)			call void asm sideeffect "", "a,a"(<4 x float> %mai.1, <4 x float> %mai.2)
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}max_10_vgprs_used_9a:			; GCN-LABEL: {{^}}max_10_vgprs_used_9a:
	; GFX908-A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
	; GFX908-A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
	; A2V-NOT: SCRATCH_RSRC			; GCN: v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a{{[0-9]+}}
				; GCN-NOT: buffer_store_dword
	; A2V: v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a{{[0-9]+}} ; Reload Reuse			; GCN-NOT: buffer_load_dword
	; A2V: v_accvgpr_write_b32 a{{[0-9]+}}, v[[VSPILL]] ; Reload Reuse			; GCN: v_accvgpr_write_b32 a{{[0-9]+}}, v[[VSPILL]]
	; A2V: ScratchSize: 0			; GCN: ScratchSize: 0

	; GFX908-A2M: buffer_store_dword v[[VSPILLSTORE:[0-9]+]], off, s[{{[0-9:]+}}], 0 offset:[[FI:[0-9]+]] ; 4-byte Folded Spill
	; GFX908-A2M: buffer_load_dword v[[VSPILL_RELOAD:[0-9]+]], off, s[{{[0-9:]+}}], 0 offset:[[FI]] ; 4-byte Folded Reload
	; GFX908-A2M: v_accvgpr_write_b32 a{{[0-9]+}}, v[[VSPILL_RELOAD]] ; Reload Reuse
	define amdgpu_kernel void @max_10_vgprs_used_9a() #1 {			define amdgpu_kernel void @max_10_vgprs_used_9a() #1 {
	%a1 = call <4 x i32> asm sideeffect "", "=a"()			%a1 = call <4 x i32> asm sideeffect "", "=a"()
	%a2 = call <4 x i32> asm sideeffect "", "=a"()			%a2 = call <4 x i32> asm sideeffect "", "=a"()
	%a3 = call i32 asm sideeffect "", "=a"()			%a3 = call i32 asm sideeffect "", "=a"()
	%a4 = call <2 x i32> asm sideeffect "", "=a"()			%a4 = call <2 x i32> asm sideeffect "", "=a"()
	call void asm sideeffect "", "a,a,a"(<4 x i32> %a1, <4 x i32> %a2, i32 %a3)			call void asm sideeffect "", "a,a,a"(<4 x i32> %a1, <4 x i32> %a2, i32 %a3)
	call void asm sideeffect "", "a"(<2 x i32> %a4)			call void asm sideeffect "", "a"(<2 x i32> %a4)
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}max_32regs_mfma32:			; GCN-LABEL: {{^}}max_32regs_mfma32:
	; A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
	; A2M-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1			; GCN-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
	; A2V-NOT: SCRATCH_RSRC			; GCN-NOT: buffer_store_dword
	; GFX908-DAG: v_accvgpr_read_b32 v[[VSPILL:[0-9]+]], a0 ; Reload Reuse			; GCN: v_accvgpr_read_b32
	; GFX908-A2M: buffer_store_dword v[[VSPILL]], off, s[{{[0-9:]+}}], 0 offset:[[FI:[0-9]+]] ; 4-byte Folded Spill			; GCN: v_mfma_f32_32x32x1f32
	; GFX90A-NOT: v_accvgpr_read_b32			; GCN-NOT: buffer_load_dword
	; GFX90A: v_mfma_f32_32x32x1f32			; GCN: v_accvgpr_write_b32
	; GFX908-A2M: buffer_load_dword v[[VSPILL:[0-9]+]], off, s[{{[0-9:]+}}], 0 offset:[[FI]] ; 4-byte Folded Reload			; GCN: ScratchSize: 0
	; GFX908: v_accvgpr_write_b32 a{{[0-9]+}}, v[[VSPILL]] ; Reload Reuse
	; GFX90A-NOT: v_accvgpr_write_b32
	; A2V: ScratchSize: 0
	define amdgpu_kernel void @max_32regs_mfma32(float addrspace(1)* %arg) #3 {			define amdgpu_kernel void @max_32regs_mfma32(float addrspace(1)* %arg) #3 {
	bb:			bb:
	%v = call i32 asm sideeffect "", "=a"()			%v = call i32 asm sideeffect "", "=a"()
	br label %use			br label %use

	use:			use:
	%mai.1 = tail call <32 x float> @llvm.amdgcn.mfma.f32.32x32x1f32(float 1.0, float 1.0, <32 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0, float 17.0, float 18.0, float 19.0, float 20.0, float 21.0, float 22.0, float 23.0, float 24.0, float 25.0, float 26.0, float 27.0, float 28.0, float 29.0, float 30.0, float 31.0, float 2.0>, i32 0, i32 0, i32 0)			%mai.1 = tail call <32 x float> @llvm.amdgcn.mfma.f32.32x32x1f32(float 1.0, float 1.0, <32 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0, float 17.0, float 18.0, float 19.0, float 20.0, float 21.0, float 22.0, float 23.0, float 24.0, float 25.0, float 26.0, float 27.0, float 28.0, float 29.0, float 30.0, float 31.0, float 2.0>, i32 0, i32 0, i32 0)
	call void asm sideeffect "", "a"(i32 %v)			call void asm sideeffect "", "a"(i32 %v)
	%elt1 = extractelement <32 x float> %mai.1, i32 0			%elt1 = extractelement <32 x float> %mai.1, i32 0
	store float %elt1, float addrspace(1)* %arg			store float %elt1, float addrspace(1)* %arg
	ret void			ret void
	}			}

	declare i32 @llvm.amdgcn.workitem.id.x()			declare i32 @llvm.amdgcn.workitem.id.x()
	declare <16 x float> @llvm.amdgcn.mfma.f32.16x16x1f32(float, float, <16 x float>, i32, i32, i32)			declare <16 x float> @llvm.amdgcn.mfma.f32.16x16x1f32(float, float, <16 x float>, i32, i32, i32)
	declare <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float, float, <4 x float>, i32, i32, i32)			declare <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float, float, <4 x float>, i32, i32, i32)
	declare <32 x float> @llvm.amdgcn.mfma.f32.32x32x1f32(float, float, <32 x float>, i32, i32, i32)			declare <32 x float> @llvm.amdgcn.mfma.f32.32x32x1f32(float, float, <32 x float>, i32, i32, i32)

				rampitec: It is not clear what's stored and what's loaded in this test, and what the register classes are here. I think this is important for this patch.
	attributes #0 = { nounwind "amdgpu-num-vgpr"="24" }			attributes #0 = { nounwind "amdgpu-num-vgpr"="24" }
	attributes #1 = { nounwind "amdgpu-num-vgpr"="10" }			attributes #1 = { nounwind "amdgpu-num-vgpr"="10" }
	attributes #2 = { nounwind "amdgpu-num-vgpr"="12" }			attributes #2 = { nounwind "amdgpu-num-vgpr"="12" }
	attributes #3 = { nounwind "amdgpu-num-vgpr"="32" }			attributes #3 = { nounwind "amdgpu-num-vgpr"="32" }

llvm/test/CodeGen/AMDGPU/spill-to-agpr-partial.mir

bb.0:
liveins: $vgpr0_vgpr1_vgpr2_vgpr3, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr24_agpr25_agpr26_agpr27, $agpr28_agpr29, $agpr30		liveins: $vgpr0_vgpr1_vgpr2_vgpr3, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr24_agpr25_agpr26_agpr27, $agpr28_agpr29, $agpr30

; GCN-LABEL: name: partial_spill_v128_1_of_4		; GCN-LABEL: name: partial_spill_v128_1_of_4
; GCN: liveins: $agpr30, $agpr31, $agpr24_agpr25_agpr26_agpr27, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr28_agpr29, $vgpr0_vgpr1_vgpr2_vgpr3		; GCN: liveins: $agpr30, $agpr31, $agpr24_agpr25_agpr26_agpr27, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr28_agpr29, $vgpr0_vgpr1_vgpr2_vgpr3
; GCN: $agpr31 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr3, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $vgpr0_vgpr1_vgpr2_vgpr3		; GCN: $agpr31 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr3, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $vgpr0_vgpr1_vgpr2_vgpr3
; GCN: SCRATCH_STORE_DWORDX3_SADDR killed $vgpr0_vgpr1_vgpr2, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3 :: (store (s96) into %stack.0, align 4, addrspace 5)		; GCN: SCRATCH_STORE_DWORDX3_SADDR killed $vgpr0_vgpr1_vgpr2, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3 :: (store (s96) into %stack.0, align 4, addrspace 5)
; GCN: $vgpr3 = V_ACCVGPR_READ_B32_e64 $agpr31, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3		; GCN: $vgpr3 = V_ACCVGPR_READ_B32_e64 $agpr31, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3
; GCN: $vgpr0_vgpr1_vgpr2 = SCRATCH_LOAD_DWORDX3_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3 :: (load (s96) from %stack.0, align 4, addrspace 5)		; GCN: $vgpr0_vgpr1_vgpr2 = SCRATCH_LOAD_DWORDX3_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3 :: (load (s96) from %stack.0, align 4, addrspace 5)
; GCN: S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25_agpr26_agpr27, implicit $agpr28_agpr29, implicit $agpr30		; GCN: S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25_agpr26_agpr27, implicit $agpr28_agpr29, implicit $agpr30
		rampitecUnsubmitted Not Done Reply Inline Actions This is greedy, not fastra, the same regression. rampitec: This is greedy, not fastra, the same regression.
		arsenmUnsubmitted Not Done Reply Inline Actions This isn't allocated at all, this is running just PEI. For CSRs I think we should either not have CSR AGPRs, or use splitCSR arsenm: This isn't allocated at all, this is running just PEI. For CSRs I think we should either not…
		rampitecUnsubmitted Not Done Reply Inline Actions OK, can we see partial cross class copy anywhere? rampitec: OK, can we see partial cross class copy anywhere?
		arsenmUnsubmitted Not Done Reply Inline Actions Not really, but this is just a general problem with the allocator which I hope to look into soon. It doesn't know how to introduce new subranges to avoid conflicts or to spill the minimum set of required lanes arsenm: Not really, but this is just a general problem with the allocator which I hope to look into…
		rampitecUnsubmitted Not Done Reply Inline Actions Look, this is tolerable unless these are AGPRs with their 32 register tuples. rampitec: Look, this is tolerable unless these are AGPRs with their 32 register tuples.
		cdevadasAuthorUnsubmitted Done Reply Inline Actions It is true, the Spiller during regalloc doesn't support partial/sub-range tuple spills. But it's not the case with copy. Greedy did handle partial tuple copy into a superclass. `max_256_vgprs_spill_9x32_2bb` in spill-vgpr-to-agpr.ll, for instance, used to get entire 1024 tuple spill stores and restores. Now that has become minimal copies. I see similar copies inserted during greedy allocator. %1200.sub16_sub17_sub18_sub19:av_1024 = COPY %1201.sub16_sub17_sub18_sub19:vreg_1024 // copy to the super class, AV. %1202.sub16_sub17_sub18_sub19:vreg_1024 = COPY %1200.sub16_sub17_sub18_sub19:av_1024 // later moving it back to V. We currently don't have a test that captures it. I can include one. cdevadas: It is true, the Spiller during regalloc doesn't support partial/sub-range tuple spills. But…
		rampitec: This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes.
		cdevadas (Author):
		> This test would be super helpful. We need to make sure we do not copy a whole tuple. Moreover, we need it all ways: v to a, a to v, and all tuple sizes.
		I take back what I said about regalloc inserting sub-range tuple copies during `tryInstructionSplit`. It isn't entirely true. The register coalescer inserts copies, which now become copies to the equivalent superclasses because we defined `getLargestLegalSuperClass` for the vector classes. The `trySplit` function doesn't do anything new to introduce subrange copies. I failed to come up with a test case that introduces tuple subrange copies for all supported AMDGPU tuple sizes. Like Matt said, regalloc needs a fix to better handle the subranges.
SI_SPILL_V128_SAVE killed $vgpr0_vgpr1_vgpr2_vgpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)		SI_SPILL_V128_SAVE killed $vgpr0_vgpr1_vgpr2_vgpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
$vgpr0_vgpr1_vgpr2_vgpr3 = SI_SPILL_V128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)		$vgpr0_vgpr1_vgpr2_vgpr3 = SI_SPILL_V128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25_agpr26_agpr27, implicit $agpr28_agpr29, implicit $agpr30		S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25_agpr26_agpr27, implicit $agpr28_agpr29, implicit $agpr30
...		...

---		---
name: partial_spill_v128_2_of_4		name: partial_spill_v128_2_of_4
tracksRegLiveness: true		tracksRegLiveness: true
[176 unchanged lines not shown]		bb.0:
; GCN: $agpr2 = V_ACCVGPR_WRITE_B32_e64 $vgpr1, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3		; GCN: $agpr2 = V_ACCVGPR_WRITE_B32_e64 $vgpr1, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3
; GCN: $agpr1 = V_ACCVGPR_WRITE_B32_e64 $vgpr2, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3		; GCN: $agpr1 = V_ACCVGPR_WRITE_B32_e64 $vgpr2, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3
; GCN: $agpr0 = V_ACCVGPR_WRITE_B32_e64 $vgpr3, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3		; GCN: $agpr0 = V_ACCVGPR_WRITE_B32_e64 $vgpr3, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3
; GCN: S_ENDPGM 0		; GCN: S_ENDPGM 0
SI_SPILL_A128_SAVE killed $agpr0_agpr1_agpr2_agpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)		SI_SPILL_A128_SAVE killed $agpr0_agpr1_agpr2_agpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
$agpr0_agpr1_agpr2_agpr3 = SI_SPILL_A128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)		$agpr0_agpr1_agpr2_agpr3 = SI_SPILL_A128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
S_ENDPGM 0		S_ENDPGM 0
...		...

		# A spilled register can be restored to its superclass during regalloc.
		# As a result, we might see AGPR spills restored to VGPRs or the other way around.

		---
		name: partial_spill_a128_restore_to_v128_1_of_4
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $vgpr52, $vgpr53, $vgpr54, $agpr0_agpr1_agpr2_agpr3, $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47

		; GCN-LABEL: name: partial_spill_a128_restore_to_v128_1_of_4
		; GCN: liveins: $vgpr52, $vgpr53, $vgpr54, $vgpr55, $agpr0_agpr1_agpr2_agpr3, $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		; GCN: {{ $}}
		; GCN: $vgpr55 = V_ACCVGPR_READ_B32_e64 killed $agpr3, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3, implicit $agpr0_agpr1_agpr2_agpr3
		rampitec: This should not happen. This is gfx90a and agpr has to be stored directly. This should happen on gfx908 though.
		cdevadas (Author): You are right. The original test was for gfx90a. The new patterns should be in a separate test for gfx908. I will create one.
		rampitec: But this copy still should not happen on gfx90a as it does.
		; GCN: SCRATCH_STORE_DWORDX3_SADDR killed $agpr0_agpr1_agpr2, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $agpr0_agpr1_agpr2_agpr3 :: (store (s96) into %stack.0, align 4, addrspace 5)
		; GCN: $vgpr51 = COPY $vgpr55, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51
		; GCN: $vgpr48_vgpr49_vgpr50 = SCRATCH_LOAD_DWORDX3_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51 :: (load (s96) from %stack.0, align 4, addrspace 5)
		; GCN: S_ENDPGM 0, implicit $vgpr52, implicit $vgpr53, implicit $vgpr54, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		SI_SPILL_A128_SAVE killed $agpr0_agpr1_agpr2_agpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$vgpr48_vgpr49_vgpr50_vgpr51 = SI_SPILL_V128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0, implicit $vgpr52, implicit $vgpr53, implicit $vgpr54, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		...

		---
		name: partial_spill_a128_restore_to_v128_2_of_4
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $vgpr52, $vgpr53, $agpr0_agpr1_agpr2_agpr3, $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47

		; GCN-LABEL: name: partial_spill_a128_restore_to_v128_2_of_4
		; GCN: liveins: $vgpr52, $vgpr53, $vgpr54, $vgpr55, $agpr0_agpr1_agpr2_agpr3, $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		; GCN: {{ $}}
		; GCN: $vgpr54 = V_ACCVGPR_READ_B32_e64 killed $agpr3, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: $vgpr55 = V_ACCVGPR_READ_B32_e64 killed $agpr2, implicit $exec, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: SCRATCH_STORE_DWORDX2_SADDR killed $agpr0_agpr1, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $agpr0_agpr1_agpr2_agpr3 :: (store (s64) into %stack.0, align 4, addrspace 5)
		; GCN: $vgpr51 = COPY $vgpr54, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51
		; GCN: $vgpr50 = COPY $vgpr55, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51
		; GCN: $vgpr48_vgpr49 = SCRATCH_LOAD_DWORDX2_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51 :: (load (s64) from %stack.0, align 4, addrspace 5)
		; GCN: S_ENDPGM 0, implicit $vgpr52, implicit $vgpr53, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		SI_SPILL_A128_SAVE killed $agpr0_agpr1_agpr2_agpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$vgpr48_vgpr49_vgpr50_vgpr51 = SI_SPILL_V128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0, implicit $vgpr52, implicit $vgpr53, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		...

		---
		name: partial_spill_a128_restore_to_v128_3_of_4
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $vgpr52, $agpr0_agpr1_agpr2_agpr3, $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47

		; GCN-LABEL: name: partial_spill_a128_restore_to_v128_3_of_4
		; GCN: liveins: $vgpr52, $vgpr53, $vgpr54, $vgpr55, $agpr0_agpr1_agpr2_agpr3, $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		; GCN: {{ $}}
		; GCN: $vgpr53 = V_ACCVGPR_READ_B32_e64 killed $agpr3, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: $vgpr54 = V_ACCVGPR_READ_B32_e64 killed $agpr2, implicit $exec, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: $vgpr55 = V_ACCVGPR_READ_B32_e64 killed $agpr1, implicit $exec, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: SCRATCH_STORE_DWORD_SADDR killed $agpr0, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $agpr0_agpr1_agpr2_agpr3 :: (store (s32) into %stack.0, addrspace 5)
		; GCN: $vgpr51 = COPY $vgpr53, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51
		; GCN: $vgpr50 = COPY $vgpr54, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51
		; GCN: $vgpr49 = COPY $vgpr55, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51
		; GCN: $vgpr48 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $vgpr48_vgpr49_vgpr50_vgpr51 :: (load (s32) from %stack.0, addrspace 5)
		; GCN: S_ENDPGM 0, implicit $vgpr52, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		SI_SPILL_A128_SAVE killed $agpr0_agpr1_agpr2_agpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$vgpr48_vgpr49_vgpr50_vgpr51 = SI_SPILL_V128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0, implicit $vgpr52, implicit $vgpr0_vgpr1_vgpr2_vgpr3_vgpr4_vgpr5_vgpr6_vgpr7_vgpr8_vgpr9_vgpr10_vgpr11_vgpr12_vgpr13_vgpr14_vgpr15, implicit $vgpr16_vgpr17_vgpr18_vgpr19_vgpr20_vgpr21_vgpr22_vgpr23_vgpr24_vgpr25_vgpr26_vgpr27_vgpr28_vgpr29_vgpr30_vgpr31, implicit $vgpr32_vgpr33_vgpr34_vgpr35_vgpr36_vgpr37_vgpr38_vgpr39_vgpr40_vgpr41_vgpr42_vgpr43_vgpr44_vgpr45_vgpr46_vgpr47
		...

		---
		name: full_spill_a128_restore_to_v128
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $agpr0_agpr1_agpr2_agpr3

		; GCN-LABEL: name: full_spill_a128_restore_to_v128
		; GCN: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $agpr0_agpr1_agpr2_agpr3
		; GCN: {{ $}}
		; GCN: $vgpr0 = V_ACCVGPR_READ_B32_e64 killed $agpr3, implicit $exec, implicit-def $agpr0_agpr1_agpr2_agpr3, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: $vgpr1 = V_ACCVGPR_READ_B32_e64 killed $agpr2, implicit $exec, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: $vgpr2 = V_ACCVGPR_READ_B32_e64 killed $agpr1, implicit $exec, implicit $agpr0_agpr1_agpr2_agpr3
		; GCN: $vgpr3 = V_ACCVGPR_READ_B32_e64 killed $agpr0, implicit $exec, implicit killed $agpr0_agpr1_agpr2_agpr3
		; GCN: $vgpr55 = COPY $vgpr0, implicit-def $vgpr52_vgpr53_vgpr54_vgpr55
		; GCN: $vgpr54 = COPY $vgpr1, implicit-def $vgpr52_vgpr53_vgpr54_vgpr55
		; GCN: $vgpr53 = COPY $vgpr2, implicit-def $vgpr52_vgpr53_vgpr54_vgpr55
		; GCN: $vgpr52 = COPY $vgpr3, implicit-def $vgpr52_vgpr53_vgpr54_vgpr55
		; GCN: S_ENDPGM 0
		SI_SPILL_A128_SAVE killed $agpr0_agpr1_agpr2_agpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$vgpr52_vgpr53_vgpr54_vgpr55 = SI_SPILL_V128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0
		...

		---
		name: partial_spill_v128_restore_to_a128_1_of_4
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $agpr31, $vgpr0_vgpr1_vgpr2_vgpr3, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr24_agpr25

		; GCN-LABEL: name: partial_spill_v128_restore_to_a128_1_of_4
		; GCN: liveins: $agpr30, $agpr31, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr24_agpr25, $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: {{ $}}
		; GCN: $agpr30 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr3, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: SCRATCH_STORE_DWORDX3_SADDR killed $vgpr0_vgpr1_vgpr2, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3 :: (store (s96) into %stack.0, align 4, addrspace 5)
		; GCN: $agpr29 = COPY $agpr30, implicit-def $agpr26_agpr27_agpr28_agpr29
		; GCN: $agpr26_agpr27_agpr28 = SCRATCH_LOAD_DWORDX3_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr26_agpr27_agpr28_agpr29 :: (load (s96) from %stack.0, align 4, addrspace 5)
		; GCN: S_ENDPGM 0, implicit $agpr31, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25
		SI_SPILL_V128_SAVE killed $vgpr0_vgpr1_vgpr2_vgpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$agpr26_agpr27_agpr28_agpr29 = SI_SPILL_A128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0, implicit $agpr31, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25
		...

		---
		name: partial_spill_v128_restore_to_a128_2_of_4
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $vgpr0_vgpr1_vgpr2_vgpr3, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr24_agpr25

		; GCN-LABEL: name: partial_spill_v128_restore_to_a128_2_of_4
		; GCN: liveins: $agpr30, $agpr31, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr24_agpr25, $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: {{ $}}
		; GCN: $agpr30 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr3, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: $agpr31 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr2, implicit $exec, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: SCRATCH_STORE_DWORDX2_SADDR killed $vgpr0_vgpr1, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3 :: (store (s64) into %stack.0, align 4, addrspace 5)
		; GCN: $agpr29 = COPY $agpr30, implicit-def $agpr26_agpr27_agpr28_agpr29
		; GCN: $agpr28 = COPY $agpr31, implicit-def $agpr26_agpr27_agpr28_agpr29
		; GCN: $agpr26_agpr27 = SCRATCH_LOAD_DWORDX2_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr26_agpr27_agpr28_agpr29 :: (load (s64) from %stack.0, align 4, addrspace 5)
		; GCN: S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25
		SI_SPILL_V128_SAVE killed $vgpr0_vgpr1_vgpr2_vgpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$agpr26_agpr27_agpr28_agpr29 = SI_SPILL_A128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24_agpr25
		...

		---
		name: partial_spill_v128_restore_to_a128_3_of_4
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $vgpr0_vgpr1_vgpr2_vgpr3, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr24

		; GCN-LABEL: name: partial_spill_v128_restore_to_a128_3_of_4
		; GCN: liveins: $agpr24, $agpr25, $agpr30, $agpr31, $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: {{ $}}
		; GCN: $agpr25 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr3, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: $agpr30 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr2, implicit $exec, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: $agpr31 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr1, implicit $exec, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: SCRATCH_STORE_DWORD_SADDR killed $vgpr0, $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3 :: (store (s32) into %stack.0, addrspace 5)
		; GCN: $agpr29 = COPY $agpr25, implicit-def $agpr26_agpr27_agpr28_agpr29
		; GCN: $agpr28 = COPY $agpr30, implicit-def $agpr26_agpr27_agpr28_agpr29
		; GCN: $agpr27 = COPY $agpr31, implicit-def $agpr26_agpr27_agpr28_agpr29
		; GCN: $agpr26 = SCRATCH_LOAD_DWORD_SADDR $sgpr32, 0, 0, implicit $exec, implicit $flat_scr, implicit-def $agpr26_agpr27_agpr28_agpr29 :: (load (s32) from %stack.0, addrspace 5)
		; GCN: S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24
		SI_SPILL_V128_SAVE killed $vgpr0_vgpr1_vgpr2_vgpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$agpr26_agpr27_agpr28_agpr29 = SI_SPILL_A128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0, implicit $agpr0_agpr1_agpr2_agpr3_agpr4_agpr5_agpr6_agpr7_agpr8_agpr9_agpr10_agpr11_agpr12_agpr13_agpr14_agpr15, implicit $agpr16_agpr17_agpr18_agpr19_agpr20_agpr21_agpr22_agpr23, implicit $agpr24
		...

		---
		name: full_spill_v128_restore_to_a128
		tracksRegLiveness: true
		stack:
		- { id: 0, type: spill-slot, size: 16, alignment: 4 }
		machineFunctionInfo:
		hasSpilledVGPRs: true
		stackPtrOffsetReg: '$sgpr32'
		body: |
		bb.0:
		liveins: $vgpr0_vgpr1_vgpr2_vgpr3

		; GCN-LABEL: name: full_spill_v128_restore_to_a128
		; GCN: liveins: $agpr4, $agpr5, $agpr6, $agpr7, $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: {{ $}}
		; GCN: $agpr4 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr3, implicit $exec, implicit-def $vgpr0_vgpr1_vgpr2_vgpr3, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: $agpr5 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr2, implicit $exec, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: $agpr6 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr1, implicit $exec, implicit $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: $agpr7 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec, implicit killed $vgpr0_vgpr1_vgpr2_vgpr3
		; GCN: $agpr3 = COPY $agpr4, implicit-def $agpr0_agpr1_agpr2_agpr3
		; GCN: $agpr2 = COPY $agpr5, implicit-def $agpr0_agpr1_agpr2_agpr3
		; GCN: $agpr1 = COPY $agpr6, implicit-def $agpr0_agpr1_agpr2_agpr3
		; GCN: $agpr0 = COPY $agpr7, implicit-def $agpr0_agpr1_agpr2_agpr3
		; GCN: S_ENDPGM 0
		SI_SPILL_V128_SAVE killed $vgpr0_vgpr1_vgpr2_vgpr3, %stack.0, $sgpr32, 0, implicit $exec :: (store (s64) into %stack.0, addrspace 5)
		$agpr0_agpr1_agpr2_agpr3 = SI_SPILL_A128_RESTORE %stack.0, $sgpr32, 0, implicit $exec :: (load (s128) from %stack.0, align 4, addrspace 5)
		S_ENDPGM 0
		...

llvm/test/CodeGen/AMDGPU/spill-vector-superclass.ll

This file was added.

				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx908 -stop-after=greedy,1 -verify-machineinstrs -o - %s | FileCheck -check-prefix=GCN %s
				; Convert AV spills into VGPR spills by introducing appropriate copies in between.

				define amdgpu_kernel void @test_spill_av_class(<4 x i32> %arg) #0 {
				; GCN-LABEL: name: test_spill_av_class
				; GCN: INLINEASM &"; def $0", 1 /* sideeffect attdialect */, 1835018 /* regdef:VGPR_32 */, def undef %21.sub0
				; GCN-NEXT: undef %23.sub0:av_64 = COPY %21.sub0
				; GCN-NEXT: [[COPY1:%[0-9]+]]:vreg_64 = COPY %23
				; GCN-NEXT: SI_SPILL_V64_SAVE [[COPY1]], %stack.0, $sgpr32, 0, implicit $exec
				; GCN: [[SI_SPILL_V64_RESTORE:%[0-9]+]]:vreg_64 = SI_SPILL_V64_RESTORE %stack.0, $sgpr32, 0, implicit $exec
				; GCN-NEXT: [[COPY3:%[0-9]+]]:av_64 = COPY [[SI_SPILL_V64_RESTORE]]
				; GCN-NEXT: undef %22.sub0:vreg_64 = COPY [[COPY3]].sub0
				%v0 = call i32 asm sideeffect "; def $0", "=v"()
				%tmp = insertelement <2 x i32> undef, i32 %v0, i32 0
				%mai = tail call <4 x i32> @llvm.amdgcn.mfma.i32.4x4x4i8(i32 1, i32 2, <4 x i32> %arg, i32 0, i32 0, i32 0)
				store volatile <4 x i32> %mai, <4 x i32> addrspace(1)* undef
				call void asm sideeffect "; use $0", "v"(<2 x i32> %tmp);
				ret void
				}

				declare <4 x i32> @llvm.amdgcn.mfma.i32.4x4x4i8(i32, i32, <4 x i32>, i32, i32, i32)

				attributes #0 = { nounwind "amdgpu-num-vgpr"="5" }

llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll

; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX908 %s		; RUN: llc -march=amdgcn -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX908 %s
; RUN: not llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s 2>&1 | FileCheck -check-prefixes=GCN,GFX900 %s		; RUN: not llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s 2>&1 | FileCheck -check-prefixes=GCN,GFX900 %s

; GCN-LABEL: {{^}}max_10_vgprs:		; GCN-LABEL: {{^}}max_10_vgprs:
; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0		; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1		; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
; GFX908-NOT: SCRATCH_RSRC		; GFX908-NOT: SCRATCH_RSRC
; GFX908-DAG: v_accvgpr_write_b32 a0, v{{[0-9]}} ; Reload Reuse		; GFX908-DAG: v_accvgpr_write_b32 [[A_REG:a[0-9]+]], v{{[0-9]}}
; GFX908-DAG: v_accvgpr_write_b32 a1, v{{[0-9]}} ; Reload Reuse
; GFX900: buffer_store_dword v{{[0-9]}},		; GFX900: buffer_store_dword v{{[0-9]}},
; GFX900: buffer_store_dword v{{[0-9]}},		; GFX900: buffer_store_dword v{{[0-9]}},
; GFX900: buffer_load_dword v{{[0-9]}},		; GFX900: buffer_load_dword v{{[0-9]}},
; GFX900: buffer_load_dword v{{[0-9]}},		; GFX900: buffer_load_dword v{{[0-9]}},
; GFX908-NOT: buffer_		; GFX908-NOT: buffer_
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a0 ; Reload Reuse		; GFX908-DAG: v_mov_b32_e32 v{{[0-9]}}, [[V_REG:v[0-9]+]]
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a1 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 [[V_REG]], [[A_REG]]

; GCN: NumVgprs: 10		; GCN: NumVgprs: 10
; GFX900: ScratchSize: 12		; GFX900: ScratchSize: 12
; GFX908: ScratchSize: 0		; GFX908: ScratchSize: 0
; GCN: VGPRBlocks: 2		; GCN: VGPRBlocks: 2
; GCN: NumVGPRsForWavesPerEU: 10		; GCN: NumVGPRsForWavesPerEU: 10
define amdgpu_kernel void @max_10_vgprs(i32 addrspace(1)* %p) #0 {		define amdgpu_kernel void @max_10_vgprs(i32 addrspace(1)* %p) #0 {
%tid = load volatile i32, i32 addrspace(1)* undef		%tid = load volatile i32, i32 addrspace(1)* undef
[27 unchanged lines not shown]	define amdgpu_kernel void @max_10_vgprs(i32 addrspace(1)* %p) #0 {
store volatile i32 %v7, i32 addrspace(1)* undef		store volatile i32 %v7, i32 addrspace(1)* undef
store volatile i32 %v8, i32 addrspace(1)* undef		store volatile i32 %v8, i32 addrspace(1)* undef
store volatile i32 %v9, i32 addrspace(1)* undef		store volatile i32 %v9, i32 addrspace(1)* undef
store volatile i32 %v10, i32 addrspace(1)* undef		store volatile i32 %v10, i32 addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}max_10_vgprs_used_9a:		; GCN-LABEL: {{^}}max_10_vgprs_used_9a:
; GFX908-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0		; GFX908-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
; GFX908-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1		; GFX908-NOT: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
; GFX908-DAG: v_accvgpr_write_b32 a9, v{{[0-9]}} ; Reload Reuse		; GFX908-DAG: v_accvgpr_write_b32 [[A_REG:a[0-9]+]], v{{[0-9]}}
; GFX908: buffer_store_dword v{{[0-9]}},		; GFX908-NOT: buffer_store_dword v{{[0-9]}},
; GFX908-NOT: buffer_		; GFX908-NOT: buffer_
; GFX908: v_accvgpr_read_b32 v{{[0-9]}}, a9 ; Reload Reuse		; GFX908: v_mov_b32_e32 v{{[0-9]}}, [[V_REG:v[0-9]+]]
; GFX908: buffer_load_dword v{{[0-9]}},		; GFX908: v_accvgpr_read_b32 [[V_REG]], [[A_REG]]
; GFX908-NOT: buffer_		; GFX908-NOT: buffer_

; GFX900: couldn't allocate input reg for constraint 'a'		; GFX900: couldn't allocate input reg for constraint 'a'

; GFX908: NumVgprs: 10		; GFX908: NumVgprs: 10
; GFX908: ScratchSize: 8		; GFX908: ScratchSize: 0
; GFX908: VGPRBlocks: 2		; GFX908: VGPRBlocks: 2
; GFX908: NumVGPRsForWavesPerEU: 10		; GFX908: NumVGPRsForWavesPerEU: 10
define amdgpu_kernel void @max_10_vgprs_used_9a(i32 addrspace(1)* %p) #0 {		define amdgpu_kernel void @max_10_vgprs_used_9a(i32 addrspace(1)* %p) #0 {
%tid = load volatile i32, i32 addrspace(1)* undef		%tid = load volatile i32, i32 addrspace(1)* undef
call void asm sideeffect "", "a,a,a,a,a,a,a,a,a"(i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)		call void asm sideeffect "", "a,a,a,a,a,a,a,a,a"(i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9)
%p1 = getelementptr inbounds i32, i32 addrspace(1)* %p, i32 %tid		%p1 = getelementptr inbounds i32, i32 addrspace(1)* %p, i32 %tid
%p2 = getelementptr inbounds i32, i32 addrspace(1)* %p1, i32 4		%p2 = getelementptr inbounds i32, i32 addrspace(1)* %p1, i32 4
%p3 = getelementptr inbounds i32, i32 addrspace(1)* %p2, i32 8		%p3 = getelementptr inbounds i32, i32 addrspace(1)* %p2, i32 8
[27 unchanged lines not shown]	define amdgpu_kernel void @max_10_vgprs_used_9a(i32 addrspace(1)* %p) #0 {
store volatile i32 %v10, i32 addrspace(1)* undef		store volatile i32 %v10, i32 addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}max_10_vgprs_used_1a_partial_spill:		; GCN-LABEL: {{^}}max_10_vgprs_used_1a_partial_spill:
; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0		; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1		; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
; GFX908-DAG: v_accvgpr_write_b32 a0, 1		; GFX908-DAG: v_accvgpr_write_b32 a0, 1
; GFX908-DAG: v_accvgpr_write_b32 a1, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a2, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a3, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a4, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a5, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a6, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a7, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a8, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a9, v{{[0-9]}} ; Reload Reuse
; GFX900: buffer_store_dword v{{[0-9]}},
; GCN-DAG: buffer_store_dword v{{[0-9]}},		; GCN-DAG: buffer_store_dword v{{[0-9]}},
; GFX900: buffer_load_dword v{{[0-9]}},		; GCN-DAG: buffer_store_dword v{{[0-9]}},
		; GFX908-DAG: v_accvgpr_write_b32 a1, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a2, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a3, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a4, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a5, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a6, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a7, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a8, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a9, v{{[0-9]}}
		; GCN-DAG: buffer_load_dword v{{[0-9]}},
; GCN-DAG: buffer_load_dword v{{[0-9]}},		; GCN-DAG: buffer_load_dword v{{[0-9]}},
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a1 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a1
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a2 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a2
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a3 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a3
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a4 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a4
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a5 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a5
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a6 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a6
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a7 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a7
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a8 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a8
; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a9 ; Reload Reuse		; GFX908-DAG: v_accvgpr_read_b32 v{{[0-9]}}, a9

; GCN: NumVgprs: 10		; GCN: NumVgprs: 10
; GFX900: ScratchSize: 44		; GFX900: ScratchSize: 44
; GFX908: ScratchSize: 12		; GFX908: ScratchSize: 12
; GCN: VGPRBlocks: 2		; GCN: VGPRBlocks: 2
; GCN: NumVGPRsForWavesPerEU: 10		; GCN: NumVGPRsForWavesPerEU: 10
define amdgpu_kernel void @max_10_vgprs_used_1a_partial_spill(i64 addrspace(1)* %p) #0 {		define amdgpu_kernel void @max_10_vgprs_used_1a_partial_spill(i64 addrspace(1)* %p) #0 {
%tid = load volatile i32, i32 addrspace(1)* undef		%tid = load volatile i32, i32 addrspace(1)* undef
[15 unchanged lines not shown]
store volatile i64 %v4, i64 addrspace(1)* %p5		store volatile i64 %v4, i64 addrspace(1)* %p5
store volatile i64 %v5, i64 addrspace(1)* %p1		store volatile i64 %v5, i64 addrspace(1)* %p1
ret void		ret void
}		}

; GCN-LABEL: {{^}}max_10_vgprs_spill_v32:		; GCN-LABEL: {{^}}max_10_vgprs_spill_v32:
; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0		; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1		; GCN-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
; GFX908-DAG: v_accvgpr_write_b32 a0, v{{[0-9]}} ; Reload Reuse
; GFX908-DAG: v_accvgpr_write_b32 a9, v{{[0-9]}} ; Reload Reuse
; GCN-NOT: a10
; GCN: buffer_store_dword v{{[0-9]}},		; GCN: buffer_store_dword v{{[0-9]}},
		; GFX908-DAG: v_accvgpr_write_b32 a0, v{{[0-9]}}
		; GFX908-DAG: v_accvgpr_write_b32 a9, v{{[0-9]}}
		; GCN-NOT: a10

; GFX908: NumVgprs: 10		; GFX908: NumVgprs: 10
; GFX900: ScratchSize: 100		; GFX900: ScratchSize: 100
; GFX908: ScratchSize: 68		; GFX908: ScratchSize: 68
; GFX908: VGPRBlocks: 2		; GFX908: VGPRBlocks: 2
; GFX908: NumVGPRsForWavesPerEU: 10		; GFX908: NumVGPRsForWavesPerEU: 10
define amdgpu_kernel void @max_10_vgprs_spill_v32(<32 x float> addrspace(1)* %p) #0 {		define amdgpu_kernel void @max_10_vgprs_spill_v32(<32 x float> addrspace(1)* %p) #0 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
[45 unchanged lines not shown; the lines below continue in: define amdgpu_kernel void @max_256_vgprs_spill_9x32(<32 x float> addrspace(1)* %p) #1 {]
store volatile <32 x float> %v5, <32 x float> addrspace(1)* undef		store volatile <32 x float> %v5, <32 x float> addrspace(1)* undef
store volatile <32 x float> %v6, <32 x float> addrspace(1)* undef		store volatile <32 x float> %v6, <32 x float> addrspace(1)* undef
store volatile <32 x float> %v7, <32 x float> addrspace(1)* undef		store volatile <32 x float> %v7, <32 x float> addrspace(1)* undef
store volatile <32 x float> %v8, <32 x float> addrspace(1)* undef		store volatile <32 x float> %v8, <32 x float> addrspace(1)* undef
store volatile <32 x float> %v9, <32 x float> addrspace(1)* undef		store volatile <32 x float> %v9, <32 x float> addrspace(1)* undef
ret void		ret void
}		}

; FIXME: adding an AReg_1024 register class for v32f32 and v32i32		; FIXME: adding an AReg_1024 register class for v32f32 and v32i32
		rampitec: This comment is probably not relevant anymore?
; produces unnecessary copies and we still have some amount		; produces unnecessary copies and we still have some amount
; of conventional spilling.		; of conventional spilling.

; GCN-LABEL: {{^}}max_256_vgprs_spill_9x32_2bb:		; GCN-LABEL: {{^}}max_256_vgprs_spill_9x32_2bb:
; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0		; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD0
; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1		; GFX900-DAG: s_mov_b32 s{{[0-9]+}}, SCRATCH_RSRC_DWORD1
; GFX908-FIXME-NOT: SCRATCH_RSRC		; GFX908-NOT: SCRATCH_RSRC
; GFX908-DAG: v_accvgpr_write_b32 a0, v		; GFX908: v_accvgpr_write_b32
		; GFX908: global_load_
; GFX900: buffer_store_dword v		; GFX900: buffer_store_dword v
; GFX900: buffer_load_dword v		; GFX900: buffer_load_dword v
; GFX908-FIXME-NOT: buffer_		; GFX908-NOT: buffer_
; GFX908-DAG: v_accvgpr_read_b32		; GFX908-DAG: v_accvgpr_read_b32

; GCN: NumVgprs: 256		; GCN: NumVgprs: 256
; GFX900: ScratchSize: 2052		; GFX900: ScratchSize: 2052
; GFX908-FIXME: ScratchSize: 0		; GFX908: ScratchSize: 0
; GCN: VGPRBlocks: 63		; GCN: VGPRBlocks: 63
; GCN: NumVGPRsForWavesPerEU: 256		; GCN: NumVGPRsForWavesPerEU: 256
define amdgpu_kernel void @max_256_vgprs_spill_9x32_2bb(<32 x float> addrspace(1)* %p) #1 {		define amdgpu_kernel void @max_256_vgprs_spill_9x32_2bb(<32 x float> addrspace(1)* %p) #1 {
%tid = call i32 @llvm.amdgcn.workitem.id.x()		%tid = call i32 @llvm.amdgcn.workitem.id.x()
%p1 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p, i32 %tid		%p1 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p, i32 %tid
%p2 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p1, i32 %tid		%p2 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p1, i32 %tid
%p3 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p2, i32 %tid		%p3 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p2, i32 %tid
%p4 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p3, i32 %tid		%p4 = getelementptr inbounds <32 x float>, <32 x float> addrspace(1)* %p3, i32 %tid
[33 unchanged lines not shown]