This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
AcceptedPublic

Authored by cdevadas on Apr 21 2022, 12:18 PM.

Details

Summary

Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.

By spilling CSRs into virtual VGPR lanes, we might end up
with broken CFIs that can potentially corrupt the frame
unwinding in the debugger causing either a crash or a
terrible debugging experience. This occurs when regalloc
tries to spill or split the liverange of these virtual VGPRs.
The CFIs should also be inserted at these intermediate
points to correctly propagate the CFI entries. It is not
currently implemented in the compiler. As a short-term fix,
we continue to spill CSR SGPRs into physical VGPR lanes for
the debugger to correctly compute the unwind information.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
cdevadas added inline comments.Apr 26 2022, 9:22 AM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
416

I don't think we can do any special handling for them at assembly emission.
IMPLICIT_DEF is handled/printed by the generic part of AsmPrinter and it won't reach the target-specific emitInstruction at all.

446–447

Will do.

llvm/test/CodeGen/AMDGPU/csr-sgpr-spill-live-ins.mir
17–19 ↗(On Diff #424263)

Will do.

llvm/test/CodeGen/AMDGPU/spill-sgpr-to-virtual-vgpr.mir
27

The auto-generator didn't generate the tied-operands correctly. I can hand-modify this test to show the tied operand. It's the simplest case.

59

This test is already hand-modified to check the tied operands.

195

I couldn't write one successfully. Will try some unstructured flow to force one.

cdevadas added inline comments.Apr 27 2022, 3:53 AM
llvm/test/CodeGen/AMDGPU/spill-sgpr-to-virtual-vgpr.mir
195

I don't think such a case exists. A fall-through block will have only one successor and that becomes the nearest dominator for its children.
It would be true even for any unstructured flow.

cdevadas updated this revision to Diff 425478.Apr 27 2022, 4:32 AM

Fixed the review comments.
Moved UpdateLaneVGPRDomInstr lambda into a separate function.
Implemented getClearedProperties to clear certain MF properties.
Tes pre-commit + rebase.
Fixed the tied operand cases in certain tests.

As a follow up I think we need to address the loss of being able to share VGPR lanes for unrelated spills

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
272–273

Typo "the the". It's also not necessarily unstructured

301

IsDominatesChecked is confusingly named and expresses the code not the intent. SeenSpillInBlock?

llvm/test/CodeGen/AMDGPU/need-fp-from-vgpr-spills.ll
197–199

This is an unfortunate regression but what I expected

arsenm added inline comments.Apr 27 2022, 2:07 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
319

As part of the follow up to allow spill slot sharing, I think we can move all of this allocation stuff out of SIMachineFunctionInfo and into SILowerSGPRSpills

cdevadas added inline comments.Apr 27 2022, 8:24 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
319

Ya, will try to move it entirely out of SIMachineFunctionInfo.

cdevadas updated this revision to Diff 425689.Apr 27 2022, 8:30 PM

Addressed the review comments.

cdevadas updated this revision to Diff 440294.Jun 27 2022, 10:13 AM
cdevadas edited the summary of this revision. (Show Details)

Code rebase.

arsenm accepted this revision.Jun 27 2022, 5:29 PM

LGTM. Might want to introduce an asm printer flag on the implicit_def to mark it's for SGPR spills in the comment

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
275

Remove "Is there a better way to handle it?"

315

Extra ()s

This revision is now accepted and ready to land.Jun 27 2022, 5:29 PM

Should also remove the SpillSGPRToVGPR option and handling

Should also remove the SpillSGPRToVGPR option and handling

I don't want to do it separately. A follow-up patch?

Should also remove the SpillSGPRToVGPR option and handling

I don't want to do it separately. A follow-up patch?

Typo in my earlier comment. I want to do that as a separate patch.
I've identified a few more clean up that can be done while removing SpillSGPRToVGPR option.

Should also remove the SpillSGPRToVGPR option and handling

I don't want to do it separately. A follow-up patch?

Yes, that's fine

cdevadas updated this revision to Diff 440735.Jun 28 2022, 12:58 PM
cdevadas edited the summary of this revision. (Show Details)

Code rebase.

arsenm accepted this revision.Jun 28 2022, 3:39 PM

What happens when the register allocator decides to split a live range of virtual registers here, i.e. if it introduces a COPY?

arsenm requested changes to this revision.Jun 29 2022, 1:28 PM

What happens when the register allocator decides to split a live range of virtual registers here, i.e. if it introduces a COPY?

This is totally broken as soon as any of these spill. We need WWM spills if they do. We should boost their priority and they need a guaranteed register to save and restore exec. I’m not sure the best way to go about this

This revision now requires changes to proceed.Jun 29 2022, 1:28 PM
cdevadas updated this revision to Diff 464220.Sep 30 2022, 5:15 AM
cdevadas edited the summary of this revision. (Show Details)

Implemented WWM register spill. Reserved SGPR(s) needed for saving EXEC while manipulating the WWM spills. Included the reserved SGPRs serialization.
I couldn't reproduce the WWM COPY situation yet even after running the internal PSDB tests and hoping this patch is good to go.
Working on a follow-up patch to implement WWM Copy.

ruiling added a subscriber: ruiling.EditedOct 1 2022, 8:14 AM

AFAIK, the WWM register has some unmodeled liveness behavior, which makes it impossible to allocate wwm register together with normal vector register in one pass now.
For example(a typical if-then):

bb0:
  %0 = ...
  s_cbranch_execz %bb2

bb1:
  %1 = wwm_operation
  ... = %1
  %0 = ...

bb2:
  ... = %0

VGPR %0 was dead in bb1 and WWM-VGPR %1 was defined and used in bb1. As there is no live-range conflict between them, they have a chance to get assigned the same physical register. If this happens, certain lane of %0 might be overwritten when writing to %1. I am not sure if moving the SIPreAllocateWWMRegs between the sgpr allocation and the vgpr allocation might help your case? The key point is to request the SIPreAllocateWWMRegs allocate the wwm register usage introduced in SILowerSGPRSpills.

AFAIK, the WWM register has some unmodeled liveness behavior, which makes it impossible to allocate wwm register together with normal vector register in one pass now.
For example(a typical if-then):

bb0:
  %0 = ...
  s_cbranch_execz %bb2

bb1:
  %1 = wwm_operation
  ... = %1
  %0 = ...

bb2:
  ... = %0

VGPR %0 was dead in bb1 and WWM-VGPR %1 was defined and used in bb1. As there is no live-range conflict between them, they have a chance to get assigned the same physical register. If this happens, certain lane of %0 might be overwritten when writing to %1. I am not sure if moving the SIPreAllocateWWMRegs between the sgpr allocation and the vgpr allocation might help your case? The key point is to request the SIPreAllocateWWMRegs allocate the wwm register usage introduced in SILowerSGPRSpills.

Agree, allowing wwm-register allocation together with the regular vector registers would be error prone with miscomputed liveness data.
But I guess, they are edge cases . Unlike the other WWM operations, the writelane/readlane for SGPR spill stores/restores, in most cases, would span across different blocks and such a liveness miscomputation would be a rare combination.

IIRC, SIPreAllocateWWMRegs can help allocate only when we have enough free VGPRs. There is no live-range spill/split incorporated in this custom pass. It won’t help in the case of large functions with more SGPR spills.
The best approach would be to introduce another regalloc pipeline between the existing SGPR and VGPR allocations. The new pipeline should allocate only the WWM-registers.
It would, however, increase the compile time complexity further. But I’m not sure we have a better choice.

Agree, allowing wwm-register allocation together with the regular vector registers would be error prone with miscomputed liveness data.
But I guess, they are edge cases . Unlike the other WWM operations, the writelane/readlane for SGPR spill stores/restores, in most cases, would span across different blocks and such a liveness miscomputation would be a rare combination.

I think we need to make sure the idea is correct in all possible cases we can think of. The writelane/readlane shares the same behavior with WWM operation regarding to the issue here. That is: they may write to a VGPR lane that the corresponding thread is inactive. "spanning across different blocks" won't help on the problem. Even the writelane/readlane operations span across more than one thousand blocks, it can still be nested in an outer if-then structure.

Agree, allowing wwm-register allocation together with the regular vector registers would be error prone with miscomputed liveness data.
But I guess, they are edge cases . Unlike the other WWM operations, the writelane/readlane for SGPR spill stores/restores, in most cases, would span across different blocks and such a liveness miscomputation would be a rare combination.

I think we need to make sure the idea is correct in all possible cases we can think of. The writelane/readlane shares the same behavior with WWM operation regarding to the issue here. That is: they may write to a VGPR lane that the corresponding thread is inactive. "spanning across different blocks" won't help on the problem. Even the writelane/readlane operations span across more than one thousand blocks, it can still be nested in an outer if-then structure.

Yes, we should fix this case. And we don't see a better way other than introducing a new regalloc pipeline for wwm registers alone. The effort for that is yet to be accounted and planning a follow-up patch to split the vgpr allocation.

cdevadas updated this revision to Diff 470483.Oct 25 2022, 7:29 AM

Moved VRegFlags into AMDGPU files. Introduced the MRI delegate callbacks and used the delegate method to propagate the virtual register flags.

cdevadas updated this revision to Diff 470714.Oct 25 2022, 10:51 PM

Simplified addDelegate function to reflect the recent changes made in D134950.

Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2052

Why does SCC need to be dead? What happens if another instruction right after uses it?

cdevadas added inline comments.Oct 26 2022, 1:43 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2052

The code here is only to manipulate exec mask and no other instruction depends on the SCC that it produces, and we should mark it dead to avoid unwanted side effects. We don't have an alternate instruction that doesn't clobber SCC.

Pierre-vh added inline comments.Oct 26 2022, 1:50 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2052

Ah that makes sense, but shouldn't this check that it's not inserting in a place where SCC is alive?
I was trying out this patch and I have a case where it's causing issues:

S_CMP_EQ_U32 killed renamable $sgpr6, killed renamable $sgpr7, implicit-def $scc
renamable $vgpr0 = V_WRITELANE_B32 killed $sgpr4, 4, $vgpr0(tied-def 0), implicit-def $sgpr4_sgpr5, implicit $sgpr4_sgpr5
renamable $vgpr0 = V_WRITELANE_B32 killed $sgpr5, 5, $vgpr0(tied-def 0), implicit killed $sgpr4_sgpr5
$sgpr10_sgpr11 = S_OR_SAVEEXEC_B64 -1, implicit-def $exec, implicit-def dead $scc, implicit $exec
$agpr4 = V_ACCVGPR_WRITE_B32_e64 killed $vgpr0, implicit $exec
$exec = S_MOV_B64 killed $sgpr10_sgpr11
S_CBRANCH_SCC1 %bb.5, implicit killed $scc

Insertion is between the S_CMP and the S_CBRANCH.

cdevadas added inline comments.Oct 26 2022, 1:59 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2052

Yes, the check is already in place. See the code above, the if condition, that inserts two separate move instructions when SCC is live and the else part uses SCC when it is free.
Not sure why RegScavenger returned false. It should have returned SCC as clobbered.

cdevadas added inline comments.Oct 26 2022, 2:10 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2052

See test llvm/test/CodeGen/AMDGPU/cf-loop-on-constant.ll
A similar situation is handled. RS returned the correct liveness info for SCC.

cdevadas updated this revision to Diff 470903.Oct 26 2022, 12:42 PM
cdevadas edited the summary of this revision. (Show Details)

Rebase after recent changes in D134950.

cdevadas updated this revision to Diff 472291.Nov 1 2022, 7:09 AM

Code rebase.

Pierre-vh added inline comments.Nov 2 2022, 3:37 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1359

I think this is missing and it's what's causing verification errors with "Using an undefined physical register" that I was talking about.
The current code just tells the scavenger to enter that block but it doesn't update it to the right instruction, so eliminateFrameIndex is working with information from the start of the BB, not from the MI it's dealing with

cdevadas added inline comments.Nov 2 2022, 4:05 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1359

An entirely different problem and needs to be implemented separately. The code that handles the register liveness update is implemented in PEI::replaceFrameIndices and it tracks the loops and invokes RS->forward() appropriately to update the liveness info. I guess we should bring this code into VGPR to AGPR spill path.

cdevadas updated this revision to Diff 472666.Nov 2 2022, 10:18 AM

Included the patch provided by @Pierre-vh to correctly update the register liveness in the RegisterScavenger during VGPR -> AGPR spilling.
This patch avoids a crash that occurred when enabled SGPR spill to virtual VGPR lanes.

diff --git a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp

  • a/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp

+++ b/llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
@@ -1511,52 +1511,52 @@ void SIFrameLowering::processFunctionBeforeFrameFinalized(

                     && EnableSpillVGPRToAGPR;
                     
if (FuncInfo->allocateVGPRSpillToAGPR(MF, FI,
                                      TRI->isAGPR(MRI, VReg))) {
  • // FIXME: change to enterBasicBlockEnd()
  • RS->enterBasicBlock(MBB);

+ RS->enterBasicBlockEnd(MBB);
+ RS->backward(MI);

TRI->eliminateFrameIndex(MI, 0, FIOp, RS);
SpillFIs.set(FI);
continue;

Included the new test llvm/test/CodeGen/AMDGPU/spill-vgpr-to-agpr-update-regscavenger.ll.

arsenm added inline comments.Nov 14 2022, 1:32 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1357–1359

D137574 is in flight to invert the direction, should we land that first / separately?

llvm/lib/Target/AMDGPU/SIInstrInfo.h
628 ↗(On Diff #472666)

static?

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
64

Is this introducing a new computation in the pass pipeline (I assume not since I don't see a pass pipeline test update)

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
659

Reg.isVirtual()

665

Reg.isVirtual()

arsenm added inline comments.Nov 14 2022, 1:32 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
463–468

I don't like having state here for a single operation that's happening in one pass and isn't valid for multiple uses. I don't really understand how this is being set and passed around

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
658

Isn't this always required?

cdevadas added inline comments.Nov 15 2022, 10:30 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1357–1359

Alex's patch has landed. But this code is still needed to update the liveness for each instruction as eliminateFrameIndex is called here.

llvm/lib/Target/AMDGPU/SIInstrInfo.h
628 ↗(On Diff #472666)

Will change.

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
64

It isn't.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
463–468

CurrentVRegSpilled is needed to track the virtual register (Liverange) for which the physical register was assigned. And it is needed only for fast regalloc . We need this mapping to correctly track the WWM spills as RegAllocFast spills/restore the physical registers directly as there is no VRM. This will be appropriately set with the delegate MRI_NoteVirtualRegisterSpill which is inserted in the RegAllocFast spill/reload functions.

SIMachineFunctionInfo is where the delegates are currently handled and I don't have a better place to move it.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
658

No. They are reserved only if RA inserts any whole wave spill.

cdevadas updated this revision to Diff 475518.Nov 15 2022, 10:44 AM

Rebase + Suggestions incorporated.

cdevadas updated this revision to Diff 477532.Nov 23 2022, 9:25 AM

Rebase + Incorporated changes after D138515 to move the handling of physReg to current VirtReg mapping entirely into the generic design.

cdevadas updated this revision to Diff 477752.Nov 24 2022, 5:02 AM

Implemented the WWM spill during RegAllocFast using the additional argument to the spiller interface introduced with patch D138656.

arsenm accepted this revision.Dec 15 2022, 10:45 AM
This revision is now accepted and ready to land.Dec 15 2022, 10:45 AM
This revision was landed with ongoing or failed builds.Dec 16 2022, 10:27 PM
This revision was automatically updated to reflect the committed changes.
jdoerfert reopened this revision.Dec 19 2022, 11:14 PM
jdoerfert added subscribers: jtramm, jhuber6, ronlieb, jdoerfert.

This patch causes OpenMC offloaded via OpenMP on AMDGPUs to crash at runtime. It looks like some corruption in the memory address.
You can find build instructions here: https://github.com/jtramm/openmc_offloading_builder

The commit before this one works fine though, assuming you cherry picked https://reviews.llvm.org/rGee1d000d43321590771a2f047c8c55d07d09ad28 first as it landed after.
I assume other codes will be impacted too.

@jtramm @ronlieb @jhuber6 FYI

This revision is now accepted and ready to land.Dec 19 2022, 11:14 PM

This patch causes OpenMC offloaded via OpenMP on AMDGPUs to crash at runtime. It looks like some corruption in the memory address.
You can find build instructions here: https://github.com/jtramm/openmc_offloading_builder

The commit before this one works fine though, assuming you cherry picked https://reviews.llvm.org/rGee1d000d43321590771a2f047c8c55d07d09ad28 first as it landed after.
I assume other codes will be impacted too.

@jtramm @ronlieb @jhuber6 FYI

Thanks. Going to take a look.

cdevadas updated this revision to Diff 496532.Feb 10 2023, 9:56 AM

Rebased after whole-wave copy implementation.

yassingh updated this revision to Diff 523333.May 18 2023, 4:09 AM
yassingh added a subscriber: yassingh.
  • Rebased
  • Incorporated the downstream code
cdevadas edited the summary of this revision. (Show Details)May 18 2023, 5:06 AM
arsenm added inline comments.Jun 21 2023, 5:27 PM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
71

It shouldn't have been SSA to begin with ad this doesn't de-SSA

72

Add a comment explaining the new vregs?

416

You don't need to specially handle the instruction, see AsmPrinterFlags

arsenm requested changes to this revision.Jun 22 2023, 10:55 AM

Just a few more nits

This revision now requires changes to proceed.Jun 22 2023, 10:55 AM
yassingh added inline comments.Jun 26 2023, 4:53 AM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
71

Removing this line causes machine-verifier to crash in few tests. Any hints @cdevadas ?

416

Tried adding a new flag here D153754

yassingh updated this revision to Diff 534590.Jun 26 2023, 8:57 AM

Review comments

yassingh added inline comments.Jun 26 2023, 9:07 AM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
71

Removing this line works fine when running the whole pipeline as the compiler knows the code here is not in SSA form. However, when SILowerSGPRSpills and related passes are run in isolation the verifier assumes the code to be in SSA form(possibly a bug there, also we are introducing virtual vgprs maybe that's the reason).

I can leave the line as it is or is there some way to update the test files to let the compiler know the input isn't SSA? I tried "isSSA: false", didn't work.

cdevadas added inline comments.Jun 26 2023, 9:41 AM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
71

Seems reasonable to retain this line for now. The compiler might not be able to decide that this pass is run post phi-elimination and assume SSA form by default. There must be a serialized option to control it for MIR tests.

yassingh added inline comments.Jun 26 2023, 9:23 PM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
71

Yeah, MIRParser::isSSA recomputes the SSA information and sets it to true, also it doesn't expose a way to override it.

yassingh updated this revision to Diff 535257.Jun 28 2023, 12:08 AM

Rebase over ancestor patch changes.

arsenm accepted this revision.Jun 28 2023, 9:25 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1357–1358

This is a pre-existing issue that should be fixed, but we should not be scanning the entire block from the end on every spill. The block iteration should be reversed and we should lazily call enterBasicBlockEnd on the first seen spill

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
71

this is kind of a mir parser bug

This revision is now accepted and ready to land.Jun 28 2023, 9:25 AM
yassingh updated this revision to Diff 537989.Jul 6 2023, 10:31 PM

fix comment

yassingh updated this revision to Diff 538200.Jul 7 2023, 10:40 AM

Rebase before merge

This revision was landed with ongoing or failed builds.Jul 7 2023, 10:46 AM
This revision was automatically updated to reflect the committed changes.
This revision is now accepted and ready to land.Jul 20 2023, 9:29 AM
llvm/test/CodeGen/AMDGPU/gfx-call-non-gfx-func.ll