[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Needs Review · Public

Authored by cdevadas on Apr 21 2022, 12:18 PM.

Details

Summary

Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes, and the remaining
VGPRs are used by regalloc for vector regclass allocation.
This imposes many restrictions, and we ended up with
unsuccessful SGPR spilling when there were not enough
VGPRs, forcing us to spill the leftover SGPRs into
memory during PEI. The custom spill handling during PEI
has many edge cases and breaks the compiler from time
to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spilling to virtual registers will always succeed,
even in high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer, which is an unproblematic case.

This patch also implements whole wave spills, which
might occur if RA spills any live range of virtual registers
involved in whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now, with SGPR spills going into virtual VGPR lanes,
we are exposing them to the allocator.
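To illustrate the lowering the summary describes, here is a hypothetical MIR sketch. The opcode names (SI_SPILL_S32_SAVE, V_WRITELANE_B32, V_READLANE_B32) are real AMDGPU instructions, but the register numbers, lane index, and operand details are illustrative and not taken from the patch:

```
; Before: an SGPR spill pseudo that PEI would have to lower, possibly
; to memory when no physical VGPR lane is free
SI_SPILL_S32_SAVE killed $sgpr5, %stack.0, implicit $exec
$sgpr5 = SI_SPILL_S32_RESTORE %stack.0, implicit $exec

; After SILowerSGPRSpills with this patch: the value is written into a
; lane of a *virtual* VGPR, which the later VGPR regalloc run assigns
%1:vgpr_32 = IMPLICIT_DEF
%1:vgpr_32 = V_WRITELANE_B32 killed $sgpr5, 0, %1(tied-def 0)
$sgpr5 = V_READLANE_B32 %1, 0
```

Because %1 is virtual, the VGPR allocation pass that runs after SGPR allocation can always find (or spill to make) a home for it, which is what makes the spill unconditionally successful.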

Diff Detail

Event Timeline

arsenm accepted this revision.Jun 27 2022, 5:29 PM

LGTM. Might want to introduce an asm printer flag on the implicit_def to mark in the comment that it's for SGPR spills

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
258

Remove "Is there a better way to handle it?"

298

Extra ()s

This revision is now accepted and ready to land.Jun 27 2022, 5:29 PM

Should also remove the SpillSGPRToVGPR option and handling

I don't want to do it separately. A follow-up patch?

Typo in my earlier comment. I want to do that as a separate patch.
I've identified a few more cleanups that can be done while removing the SpillSGPRToVGPR option.

Yes, that's fine

cdevadas updated this revision to Diff 440735.Jun 28 2022, 12:58 PM
cdevadas edited the summary of this revision. (Show Details)

Code rebase.

arsenm accepted this revision.Jun 28 2022, 3:39 PM

What happens when the register allocator decides to split a live range of virtual registers here, i.e. if it introduces a COPY?

arsenm requested changes to this revision.Jun 29 2022, 1:28 PM

This is totally broken as soon as any of these spill. We need WWM spills if they do. We should boost their priority and they need a guaranteed register to save and restore exec. I’m not sure the best way to go about this
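The guaranteed exec save/restore sequence under discussion looks roughly like the following sketch (the SGPR pair and the particular spill instruction are illustrative):

```
; Save exec into a reserved SGPR pair and enable all lanes
$sgpr10_sgpr11 = S_OR_SAVEEXEC_B64 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
; Whole wave move of the VGPR holding spilled SGPR lanes
$agpr4 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
; Restore the original exec mask
$exec = S_MOV_B64 killed $sgpr10_sgpr11
```

The SGPR pair must be guaranteed to be free at the spill point, which is why it needs to be reserved rather than scavenged.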

This revision now requires changes to proceed.Jun 29 2022, 1:28 PM
cdevadas updated this revision to Diff 464220.Sep 30 2022, 5:15 AM
cdevadas edited the summary of this revision. (Show Details)

Implemented WWM register spills. Reserved the SGPR(s) needed for saving EXEC while manipulating the WWM spills, and included serialization of the reserved SGPRs.
I couldn't reproduce the WWM COPY situation yet, even after running the internal PSDB tests, and I hope this patch is good to go.
Working on a follow-up patch to implement WWM COPY.

ruiling added a subscriber: ruiling.EditedOct 1 2022, 8:14 AM

AFAIK, the WWM register has some unmodeled liveness behavior, which makes it impossible to allocate WWM registers together with normal vector registers in one pass now.
For example (a typical if-then):

bb0:
  %0 = ...
  s_cbranch_execz %bb2

bb1:
  %1 = wwm_operation
  ... = %1
  %0 = ...

bb2:
  ... = %0

VGPR %0 is dead in bb1, and WWM-VGPR %1 is defined and used in bb1. As there is no live-range conflict between them, they have a chance to be assigned the same physical register. If this happens, certain lanes of %0 might be overwritten when writing to %1. I am not sure if moving SIPreAllocateWWMRegs between the SGPR allocation and the VGPR allocation might help your case? The key point is to have SIPreAllocateWWMRegs allocate the WWM register usage introduced in SILowerSGPRSpills.
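Concretely, if the allocator did assign %0 and %1 the same physical register, the hazard would look like this (the assignment to $vgpr0 is illustrative):

```
bb1:
  ; was %1: the WWM write runs with all lanes enabled, so it also
  ; clobbers the lanes of $vgpr0 belonging to threads that took the
  ; s_cbranch_execz and still need %0's bb0 value at the use in bb2
  $vgpr0 = wwm_operation ...
  ...    = $vgpr0
  $vgpr0 = ...   ; was %0: rewritten for active lanes only
```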

Agreed, allowing WWM-register allocation together with the regular vector registers would be error prone with miscomputed liveness data.
But I guess they are edge cases. Unlike the other WWM operations, the writelane/readlane instructions for SGPR spill stores/restores would, in most cases, span different blocks, and such a liveness miscomputation would be a rare combination.

IIRC, SIPreAllocateWWMRegs can help allocate only when we have enough free VGPRs. There is no live-range spill/split incorporated in this custom pass, so it won't help in the case of large functions with many SGPR spills.
The best approach would be to introduce another regalloc pipeline between the existing SGPR and VGPR allocations. The new pipeline should allocate only the WWM registers.
It would, however, increase compile-time complexity further. But I'm not sure we have a better choice.

I think we need to make sure the idea is correct in all possible cases we can think of. The writelane/readlane operations share the same behavior as WWM operations with regard to the issue here: they may write to a VGPR lane whose corresponding thread is inactive. "Spanning different blocks" won't help with the problem. Even if the writelane/readlane operations span more than a thousand blocks, they can still be nested in an outer if-then structure.

Yes, we should fix this case. And we don't see a better way other than introducing a new regalloc pipeline for WWM registers alone. The effort for that is yet to be estimated, and we are planning a follow-up patch to split the VGPR allocation.

cdevadas updated this revision to Diff 470483.Oct 25 2022, 7:29 AM

Moved VRegFlags into AMDGPU files. Introduced the MRI delegate callbacks and used the delegate method to propagate the virtual register flags.

cdevadas updated this revision to Diff 470714.Oct 25 2022, 10:51 PM

Simplified addDelegate function to reflect the recent changes made in D134950.

Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1992

Why does SCC need to be dead? What happens if another instruction right after uses it?

cdevadas added inline comments.Oct 26 2022, 1:43 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1992

The code here only manipulates the exec mask, and no other instruction depends on the SCC it produces, so we should mark it dead to avoid unwanted side effects. We don't have an alternative instruction that doesn't clobber SCC.

Pierre-vh added inline comments.Oct 26 2022, 1:50 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1992

Ah, that makes sense, but shouldn't this check that it's not inserting in a place where SCC is live?
I was trying out this patch and I have a case where it's causing issues:

S_CMP_EQ_U32 killed renamable $sgpr6, killed renamable $sgpr7, implicit-def $scc
renamable $vgpr0 = V_WRITELANE_B32 killed $sgpr4, 4, $vgpr0(tied-def 0), implicit-def $sgpr4_sgpr5, implicit $sgpr4_sgpr5
renamable $vgpr0 = V_WRITELANE_B32 killed $sgpr5, 5, $vgpr0(tied-def 0), implicit killed $sgpr4_sgpr5
$sgpr10_sgpr11 = S_OR_SAVEEXEC_B64 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
$agpr4 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
$exec = S_MOV_B64 killed $sgpr10_sgpr11
S_CBRANCH_SCC1 %bb.5, implicit killed $scc

Insertion is between the S_CMP and the S_CBRANCH.

cdevadas added inline comments.Oct 26 2022, 1:59 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1992

Yes, the check is already in place. See the code above: the if condition inserts two separate move instructions when SCC is live, and the else part uses SCC when it is free.
Not sure why the RegScavenger returned false; it should have reported SCC as clobbered.

cdevadas added inline comments.Oct 26 2022, 2:10 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1992

See the test llvm/test/CodeGen/AMDGPU/cf-loop-on-constant.ll.
A similar situation is handled there, and RS returned the correct liveness info for SCC.

cdevadas updated this revision to Diff 470903.Oct 26 2022, 12:42 PM
cdevadas edited the summary of this revision. (Show Details)

Rebase after recent changes in D134950.

cdevadas updated this revision to Diff 472291.Nov 1 2022, 7:09 AM

Code rebase.

Pierre-vh added inline comments.Nov 2 2022, 3:37 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1351 ↗(On Diff #472479)

I think this is missing, and it's what's causing the "Using an undefined physical register" verification errors I was talking about.
The current code just tells the scavenger to enter that block, but it doesn't advance it to the right instruction, so eliminateFrameIndex is working with information from the start of the BB, not from the MI it's dealing with.

cdevadas added inline comments.Nov 2 2022, 4:05 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1351 ↗(On Diff #472479)

That is an entirely different problem and needs to be fixed separately. The code that handles the register liveness update is implemented in PEI::replaceFrameIndices; it tracks the loops and invokes RS->forward() appropriately to update the liveness info. I guess we should bring this code into the VGPR-to-AGPR spill path.

cdevadas updated this revision to Diff 472666.Nov 2 2022, 10:18 AM

Included the patch provided by @Pierre-vh to correctly update the register liveness in the RegisterScavenger during VGPR -> AGPR spilling.
This avoids a crash that occurred when SGPR spilling to virtual VGPR lanes was enabled.

diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
--- a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1511,52 +1511,52 @@ void SIFrameLowering::processFunctionBeforeFrameFinalized(
                        && EnableSpillVGPRToAGPR;

       if (FuncInfo->allocateVGPRSpillToAGPR(MF, FI,
                                             TRI->isAGPR(MRI, VReg))) {
-        // FIXME: change to enterBasicBlockEnd()
-        RS->enterBasicBlock(MBB);
+        RS->enterBasicBlockEnd(MBB);
+        RS->backward(MI);
         TRI->eliminateFrameIndex(MI, 0, FIOp, RS);
         SpillFIs.set(FI);
         continue;
Included the new test llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr-update-regscavenger.ll.

arsenm added inline comments.Mon, Nov 14, 1:32 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1349–1350 ↗(On Diff #472666)

D137574 is in flight to invert the direction, should we land that first / separately?

llvm/lib/Target/AMDGPU/SIInstrInfo.h
628 ↗(On Diff #472666)

static?

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
59

Is this introducing a new computation in the pass pipeline? (I assume not, since I don't see a pass pipeline test update.)

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
490–495 ↗(On Diff #472666)

I don't like having state here for a single operation that's happening in one pass and isn't valid for multiple uses. I don't really understand how this is being set and passed around

688 ↗(On Diff #472666)

Reg.isVirtual()

694 ↗(On Diff #472666)

Reg.isVirtual()

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
651

Isn't this always required?

cdevadas added inline comments.Tue, Nov 15, 10:30 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1349–1350 ↗(On Diff #472666)

Alex's patch has landed. But this code is still needed to update the liveness for each instruction as eliminateFrameIndex is called here.

llvm/lib/Target/AMDGPU/SIInstrInfo.h
628 ↗(On Diff #472666)

Will change.

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
59

It isn't.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
490–495 ↗(On Diff #472666)

CurrentVRegSpilled is needed to track the virtual register (live range) for which the physical register was assigned, and it is needed only for the fast regalloc. We need this mapping to correctly track the WWM spills, as RegAllocFast spills/restores physical registers directly since there is no VRM. It will be set appropriately via the MRI_NoteVirtualRegisterSpill delegate, which is invoked from the RegAllocFast spill/reload functions.

SIMachineFunctionInfo is where the delegates are currently handled, and I don't have a better place to move it.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
651

No. They are reserved only if RA inserts any whole wave spill.

cdevadas updated this revision to Diff 475518.Tue, Nov 15, 10:44 AM

Rebase + Suggestions incorporated.

cdevadas updated this revision to Diff 477532.Wed, Nov 23, 9:25 AM

Rebase + incorporated changes after D138515 to move the handling of the physReg-to-current-virtReg mapping entirely into the generic design.

cdevadas updated this revision to Diff 477752.Thu, Nov 24, 5:02 AM

Implemented the WWM spill during RegAllocFast using the additional argument to the spiller interface introduced with patch D138656.