This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] On gfx908, reserve VGPR for AGPR copy based on register budget.
ClosedPublic

Authored by hsmhsm on Apr 11 2022, 10:43 AM.

Details

Summary

Based on available register budget, reserve highest available VGPR for
AGPR copy before RA. After RA, shift it to lowest unused VGPR if the one
exist.

Fixes SWDEV-330006.

Diff Detail

Event Timeline

hsmhsm created this revision.Apr 11 2022, 10:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 10:43 AM
hsmhsm requested review of this revision.Apr 11 2022, 10:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 10:43 AM
hsmhsm edited the summary of this revision. (Show Details)Apr 11 2022, 11:03 AM
rampitec requested changes to this revision.Apr 11 2022, 11:22 AM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
628 ↗(On Diff #421964)

It will never get replaced if RegNo == 0.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
520

I do not know real world situation with such low budget.

521

And then 32 is over the budget. This is a bug.

This revision now requires changes to proceed.Apr 11 2022, 11:22 AM
hsmhsm added inline comments.Apr 11 2022, 7:09 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
628 ↗(On Diff #421964)

Yes, I noticed it. I think, we need to take a relook at it.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
520

Even, I had my own doubt about such a low budget before submitting the patch, But, I thought - let me post a patch, discus it open, and revise it based on the outcome of the discussion.

521

Really, I have no idea what to do in case of constrained register budget - Stealing one more register from already constrained set would make RegAlloc to fail because of "no register".

However, I think, we need to discuss it, and refine it based on common consensus.

hsmhsm added inline comments.Apr 11 2022, 7:49 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
521

Also note that the bug that you noticed is not a new one from this patch, it is already there in the code since https://reviews.llvm.org/D118415 landed. Indeed, we should fix this bug now.

hsmhsm updated this revision to Diff 422130.Apr 11 2022, 11:58 PM

Refine the logic within indirectCopyToAGPR() while chosing lowest available
VGPR for AGPR copy. Basically, find out the total avaialble unused free
VGPRs (after RegAlloc), and call scavenger (only) that many times to find
the lowest available unused free VGPR.

hsmhsm updated this revision to Diff 422144.Apr 12 2022, 2:05 AM

Here is the update where we always reserve highest avaialble VGPR irrespective of the
register constraint.

With this update, below two lit tests fail to compile because RegAlloc fails.

spill-agpr.ll
-------------

define amdgpu_kernel void @max_5regs_used_8a(<4 x float> addrspace(1)* %arg) #4 {
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %v0 = call float asm sideeffect "; def $0", "=v"()
  %a4 = call <4 x float> asm sideeffect "; def $0", "=a"()
  %gep = getelementptr inbounds <4 x float>, <4 x float> addrspace(1)* %arg, i32 %tid
  %mai.in = load <4 x float>, <4 x float> addrspace(1)* %gep
  %mai.out = tail call <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float 1.0, float 1.0, <4 x float> %mai.in, i32 0, i32 0, i32 0)
  store <4 x float> %mai.out, <4 x float> addrspace(1)* %gep
  store volatile <4 x float> %a4, <4 x float> addrspace(1)* undef
  call void asm sideeffect "; use $0", "v"(float %v0);
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()
declare <4 x float> @llvm.amdgcn.mfma.f32.4x4x1f32(float, float, <4 x float>, i32, i32, i32)

attributes #4 = { nounwind "amdgpu-num-vgpr"="5" }
spill-vgpr-to-agpr.ll
---------------------

define amdgpu_kernel void @max_10_vgprs_used_1a_partial_spill(i64 addrspace(1)* %p) #0 {
  %tid = load volatile i32, i32 addrspace(1)* undef
  call void asm sideeffect "", "a"(i32 1)
  %p1 = getelementptr inbounds i64, i64 addrspace(1)* %p, i32 %tid
  %p2 = getelementptr inbounds i64, i64 addrspace(1)* %p1, i32 8
  %p3 = getelementptr inbounds i64, i64 addrspace(1)* %p2, i32 16
  %p4 = getelementptr inbounds i64, i64 addrspace(1)* %p3, i32 24
  %p5 = getelementptr inbounds i64, i64 addrspace(1)* %p4, i32 32
  %v1 = load volatile i64, i64 addrspace(1)* %p1
  %v2 = load volatile i64, i64 addrspace(1)* %p2
  %v3 = load volatile i64, i64 addrspace(1)* %p3
  %v4 = load volatile i64, i64 addrspace(1)* %p4
  %v5 = load volatile i64, i64 addrspace(1)* %p5
  call void asm sideeffect "", "v,v,v,v,v"(i64 %v1, i64 %v2, i64 %v3, i64 %v4, i64 %v5)
  store volatile i64 %v1, i64 addrspace(1)* %p2
  store volatile i64 %v2, i64 addrspace(1)* %p3
  store volatile i64 %v3, i64 addrspace(1)* %p4
  store volatile i64 %v4, i64 addrspace(1)* %p5
  store volatile i64 %v5, i64 addrspace(1)* %p1
  ret void
}

declare i32 @llvm.amdgcn.workitem.id.x()

attributes #0 = { nounwind "amdgpu-num-vgpr"="10" }

Let's begin with the question: what problem are you trying to solve?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
628 ↗(On Diff #421964)

This completely breaks the original logic and idea. This loop exists because there are 2 waitstates needed to reuse a VGPR used in the v_accvgpr_write on gfx908. When you are copying a wide register, say a 16 dword tuple, we would have to insert 30 nops is we are using a single register. Therefore this code attempts to use another register each time until we have copied 3 dwords. Hence the 'DestReg % 3'. This code just does not make sense anymore.

Let's begin with the question: what problem are you trying to solve?

The problem, I am basically trying to solve is a case, where we have following situation - there is a vector ALU operation which requires 1024 wide VGPR, and the register budget in this case is 64. Since v32 is reserved, all the 1024 bit vectors starting from v1...v32 until v32...v63 are NOT usable. Only possible to use 1024 bit register is v0...v31. But, unfortunately, there is SGPR spills to v0 happening, and hence, v0...v31 is also NOT usable. So, RA fails.

Hence, we cannot (always) choose v32, probably the better way of handling it is:

(1) While reserving the registers before RA, reserve highest available VGPR based on register budget, and then later
(2) Shift it to lowest available VGPR after RA

Now, we have of couple of questions to answer

Q1. Is it safe to always reserve highest available VGPR irrespective of constrained register budget?
Q2. What is the best place to handle (2) ?
Q3. Is SIInstrInfo::indirectCopyToAGPR() always doing right thing as expected?

Answer to Q1 is - we have no other choice at the moment.
Answer to Q2 is - I thought SIInstrInfo::indirectCopyToAGPR() is a right place to handle it. But, later based on offline discussion with Matt, looks like framelowering is probably the right place.
Answer to Q3 is - Based on my own debugging of SIInstrInfo::indirectCopyToAGPR() and discussing my findings with Matt, this function is actually broken. I better not touch it at the moment, at least for this patch.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
628 ↗(On Diff #421964)

As I mentioned in my other reply (to your comment), this function is actually broken. I better not touch it at the moment, at least for this patch. And, I am going to handle the required change within framelowering.

hsmhsm updated this revision to Diff 422794.Apr 14 2022, 3:01 AM

Move code related to shifting of reserved VGPR to a lower range within FrameLowering.

hsmhsm edited the summary of this revision. (Show Details)Apr 14 2022, 3:07 AM
rampitec added inline comments.Apr 14 2022, 9:48 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1242

Do we have to freeze it here?

llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
1093–1097

Could you please pre-commit changes like that?

hsmhsm added inline comments.Apr 14 2022, 9:55 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1242

Actually, here we need to update the set of reserved registers so that it will unreserve the earlier reserved register, and reserve new one. The only way to achieve it is by calling freezeReservedRegs(). We already do it in SILowerSGPRSpills.

hsmhsm updated this revision to Diff 423060.Apr 15 2022, 2:51 AM

Rebase to latest trunk and to D123809.

Since the lit test for the function @max_5regs_used_8a() within spill-agpr.ll asserts for
gfx908 when used only 5 vgprs, I could not find any other way, but had to move this function
from spill-agpr.ll to spill-agpr-to-memory-gfx90a.ll and spill-agpr-to-memory-gfx908.ll.

rampitec added inline comments.Apr 15 2022, 11:11 AM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
72

No reason to split this file. You can just add new function here.

llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll
3

What exactly do we test here now? Plus this is an obvious regression.

hsmhsm added inline comments.Apr 16 2022, 11:02 PM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
72

Actually, I am bit confused here - I am having difficulty in understanding about fixing it.

Basically, with change in this patch (for gfx908), @max_5regs_used_8a() crashes (asserts) while compiling for gfx908, and hence no output will be generated for the file while compiling for gfx908.

To fix this crash, we need to increase num vgprs to 6 (from 5) only for gfx908, but, not for gfx90a, which means, we need to keep two versions of the same function - one for gfx90a (existing one - @max_5regs_used_8a), and another for gfx908 (new one - @max_6regs_used_8a).

But, the problem is - since the function @max_5regs_used_8a still exist in the file, the compilation for gfx908 still crashes, and hence lit still fails for gfx908.

How to fix this without spilting the file?

llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll
3

I am confused here, not getting, what exactly is happening with this lit test.

First, forget this change, consider the original file, and focus on the compilation for gfx900. - the expectation here is that the functions @max_10_vgprs_used_9a and @max_10_vgprs_used_1a_partial_spill should fail to compile for gfx900 since there is agpr usage here, which we test in the below check line.

; GFX900:     couldn't allocate input reg for constraint 'a'

But, the quirk here is, - we also test for expected assembly output for other functions (say, max_10_vgprs) while compiling for gfx900.

This is a mystery to me, because, I expected no assembly output when the compiler throws an error. However, llc has a quirk where on error, it will print the asm output if it's defaulting to stdout output from stdin, but won't if you specify -o (courtesy Matt), and we are taking advantage of this quirk nature of llc while testing this file for gfx900.

Whether above testing by taking advantage of the quirk nature of llc is right thing to do? Whether this quirk llc behavior is consistent with all types of compiler errors? which I am not sure, for example, with asserts, looks like llc will not exbibit this quirk behavior, meaning stops compilation when asserts.

Now, coming to new behavior of testing this file for gfx908 with this patch - basically llc throws below error since 10 vgprs is not enough.

register allocation failed: maximum depth for recoloring reached

To fix it, we need to increase the num of vgprs from 10 to 11 only for gfx908, but we need to keep 10 while compiling for gfx900. So, we need to keep two versions of same function one for gfx900, and another for gfx908. But, if we keep both the versions in the same file, then compilation for gfx908 still fails because vpgr 10 version is still exist in the file.

How to fix it by maintaining everything in a single file without spliting the file? For now, I took advantage of the quirk nature llc as explained above.

rampitec added inline comments.Apr 18 2022, 11:17 AM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
72

Isn't it regression and bug by itself in the first place?

llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll
3

Run line for gfx900 can be really just removed and a much simpler and smaller test written to check we are rejecting 'a' constraint before gfx908. It does not require a full copy of the test though.

The second is the real problem, because it can be compiled today w/o your patch.

hsmhsm added inline comments.Apr 18 2022, 11:42 AM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
72

I agree that the current update (from this patch) to this file has to be properly corrected.

But, what I do not understand is - are you saying this patch introducing new bug, and is a regression? If so how?

Without this patch - what actually happening is - we use 5 vgprs from v0 to v4 for vector operations which is the minimum required budget, and v32 for agpr copy. So, we are still using 6 vgprs here, but v32 is not getting accounted here.

With this patch - we reserve v4 (highest available one) for agpr copy, hence we only left with four vgprs from v0 to v3, and hence RegAlloc issue. It requires to officially mention num required vgprs to 6 in that case v5 is used for agpr copy.

llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll
3

for gfx900, I had previously attempted to move the test of rejecting 'a' constraint to a separate file, but then again had to restore it due to review comments. So, I am going to do the same now again, of course by including only required function, not a full copy of the file.

The second thing is making my job difficult - without splitting the file, it seems not possible to keep both gfx908 and gfx90a test within the same file.

rampitec added inline comments.Apr 18 2022, 11:51 AM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
72

OK, it was using v32 violating the budget requested. Now that makes sense to me. Then you probably should just increase the budget to 6.

llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr.ll
3

Yes, just a really small separate test for the constraint refusal.

hsmhsm added inline comments.Apr 18 2022, 11:56 AM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
72

But, for gfx90a, still 5 vgprs are minimum requirement. Is it OK, if I make it 6 common for both gfx908 and gfx90a, in that case there is NO need to split the file?

rampitec added inline comments.Apr 18 2022, 12:05 PM
llvm/test/CodeGen/AMDGPU/spill-agpr.ll
72

Yes, as long as it still spills to memory (and it should if I understand correctly).

hsmhsm updated this revision to Diff 423545.Apr 19 2022, 12:20 AM

Rebase to D123973 and update lit tests accordingly.

rampitec accepted this revision.Apr 20 2022, 1:47 PM
This revision is now accepted and ready to land.Apr 20 2022, 1:47 PM
arsenm added inline comments.Apr 20 2022, 1:50 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1233

Just use Register instead of auto for these

cdevadas added inline comments.Apr 20 2022, 6:45 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
506

Keep the return type just void.

hsmhsm updated this revision to Diff 424080.Apr 20 2022, 7:26 PM

Fix minor comments by Matt and CD.

This revision was landed with ongoing or failed builds.Apr 20 2022, 7:28 PM
This revision was automatically updated to reflect the committed changes.