
[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Needs ReviewPublic

Authored by cdevadas on Apr 21 2022, 12:18 PM.

Details

Summary

Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes, and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions: SGPR spilling fails
when there aren't enough free VGPRs, and we are forced
to spill the leftovers into memory during PEI. The
custom spill handling during PEI has many edge cases
and breaks the compiler from time to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes get allocated automatically in the
subsequent regalloc invocation for VGPRs.

Spilling to virtual registers always succeeds, even in
high-pressure situations, and hence avoids most of the
edge cases during PEI. We are now left with custom SGPR
spills during PEI only for special registers like the
frame pointer, which is an unproblematic case.

This patch also implements whole wave spills, which
might occur if RA spills any live range of the virtual
registers involved in whole wave operations. Earlier,
we had been hand-picking registers for such machine
operands. Now, with SGPR spills going into virtual
VGPR lanes, we expose them to the allocator.
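For illustration, a rough MIR-style sketch of the lowering (the register number, lane index, and exact operands are made up here, not taken verbatim from the patch):

  ; Before: SGPR spill pseudos referencing a stack slot.
  SI_SPILL_S32_SAVE killed $sgpr5, %stack.0, ...
  ...
  $sgpr5 = SI_SPILL_S32_RESTORE %stack.0, ...

  ; After SILowerSGPRSpills: the slot becomes a lane of a virtual VGPR,
  ; which the subsequent VGPR regalloc invocation assigns automatically.
  %45:vgpr_32 = IMPLICIT_DEF
  %45:vgpr_32 = V_WRITELANE_B32 killed $sgpr5, 0, %45:vgpr_32(tied-def 0)
  ...
  $sgpr5 = V_READLANE_B32 %45:vgpr_32, 0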

Diff Detail

Event Timeline

cdevadas created this revision.Apr 21 2022, 12:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 12:18 PM
cdevadas requested review of this revision.Apr 21 2022, 12:18 PM
arsenm added inline comments.Apr 21 2022, 1:58 PM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
352–353

Seems worthwhile for this to be its own real function instead of a lambda

376

This could be the end iterator

379

It might be worth adding a target comment flag for this implicit def to note that it's for SGPR spilling

382

Typo " in case if multiple spills"

407–408

Should implement MachineFunctionPass::getClearedProperties instead of clearing these here

llvm/test/CodeGen/AMDGPU/csr-sgpr-spill-live-ins.mir
17–19

Can you precommit a change to add the -NEXTs here

llvm/test/CodeGen/AMDGPU/spill-sgpr-to-virtual-vgpr.mir
27

The test checks seem to not capture that these operands are tied

195

Needs a case where the insert block has no terminators

cdevadas added inline comments.Apr 26 2022, 9:22 AM
llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
352–353

Yes indeed. Will do.

379

I don't think we can do any special handling for them at assembly emission.
IMPLICIT_DEF is handled/printed by the generic part of AsmPrinter and it won't reach the target-specific emitInstruction at all.

407–408

Will do.

llvm/test/CodeGen/AMDGPU/csr-sgpr-spill-live-ins.mir
17–19

Will do.

llvm/test/CodeGen/AMDGPU/spill-sgpr-to-virtual-vgpr.mir
27

The auto-generator didn't generate the tied operands correctly. I can hand-modify this test to show the tied operand. It's the simplest case.

59

This test is already hand-modified to check the tied operands.

195

I couldn't write one successfully. Will try some unstructured flow to force one.

cdevadas added inline comments.Apr 27 2022, 3:53 AM
llvm/test/CodeGen/AMDGPU/spill-sgpr-to-virtual-vgpr.mir
195

I don't think such a case exists. A fall-through block will have only one successor, and that successor becomes the nearest dominator for its children.
This holds even for unstructured flow.

cdevadas updated this revision to Diff 425478.Apr 27 2022, 4:32 AM

Addressed the review comments.
Moved the UpdateLaneVGPRDomInstr lambda into a separate function.
Implemented getClearedProperties to clear certain MF properties.
Tests pre-commit + rebase.
Fixed the tied operand cases in certain tests.

As a follow-up, I think we need to address the loss of the ability to share VGPR lanes between unrelated spills

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
264–265

Typo "the the". It's also not necessarily unstructured

293

IsDominatesChecked is confusingly named and expresses the code not the intent. SeenSpillInBlock?

llvm/test/CodeGen/AMDGPU/need-fp-from-vgpr-spills.ll
202–204

This is an unfortunate regression, but it's what I expected

arsenm added inline comments.Apr 27 2022, 2:07 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
283

As part of the follow up to allow spill slot sharing, I think we can move all of this allocation stuff out of SIMachineFunctionInfo and into SILowerSGPRSpills

cdevadas added inline comments.Apr 27 2022, 8:24 PM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
283

Ya, will try to move it entirely out of SIMachineFunctionInfo.

cdevadas updated this revision to Diff 425689.Apr 27 2022, 8:30 PM

Addressed the review comments.

cdevadas updated this revision to Diff 440294.Jun 27 2022, 10:13 AM
cdevadas edited the summary of this revision. (Show Details)

Code rebase.

arsenm accepted this revision.Jun 27 2022, 5:29 PM

LGTM. Might want to introduce an asm printer flag on the implicit_def to note in the comment that it's for SGPR spills

llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp
267

Remove "Is there a better way to handle it?"

307

Extra ()s

This revision is now accepted and ready to land.Jun 27 2022, 5:29 PM

Should also remove the SpillSGPRToVGPR option and handling

> Should also remove the SpillSGPRToVGPR option and handling

I don't want to do it separately. A follow-up patch?

> > Should also remove the SpillSGPRToVGPR option and handling
>
> I don't want to do it separately. A follow-up patch?

Typo in my earlier comment. I meant: I want to do that as a separate patch.
I've identified a few more cleanups that can be done while removing the SpillSGPRToVGPR option.

> > Should also remove the SpillSGPRToVGPR option and handling
>
> I don't want to do it separately. A follow-up patch?

Yes, that's fine

cdevadas updated this revision to Diff 440735.Jun 28 2022, 12:58 PM
cdevadas edited the summary of this revision. (Show Details)

Code rebase.

arsenm accepted this revision.Jun 28 2022, 3:39 PM

What happens when the register allocator decides to split a live range of virtual registers here, i.e. if it introduces a COPY?

arsenm requested changes to this revision.Jun 29 2022, 1:28 PM

> What happens when the register allocator decides to split a live range of virtual registers here, i.e. if it introduces a COPY?

This is totally broken as soon as any of these spill. We need WWM spills if they do. We should boost their priority, and they need a guaranteed register to save and restore exec. I'm not sure of the best way to go about this.

This revision now requires changes to proceed.Jun 29 2022, 1:28 PM
cdevadas updated this revision to Diff 464220.Fri, Sep 30, 5:15 AM
cdevadas edited the summary of this revision. (Show Details)

Implemented WWM register spills. Reserved SGPR(s) needed for saving EXEC while manipulating the WWM spills. Included the reserved-SGPR serialization.
I couldn't reproduce the WWM COPY situation even after running the internal PSDB tests, so I hope this patch is good to go.
Working on a follow-up patch to implement WWM COPY.
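Roughly, a WWM spill needs every lane of the VGPR, so EXEC must be saved into the reserved SGPR(s), forced to all-ones around the memory access, and then restored. A hand-written wave64 sketch (the register choices are made up):

  $sgpr34_sgpr35 = S_OR_SAVEEXEC_B64 -1       ; save EXEC, enable all lanes
  BUFFER_STORE_DWORD_OFFSET $vgpr2, ...       ; store the whole VGPR, all lanes
  $exec = S_MOV_B64 killed $sgpr34_sgpr35     ; restore the original EXEC mask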

ruiling added a subscriber: ruiling.EditedSat, Oct 1, 8:14 AM

AFAIK, the WWM register has some unmodeled liveness behavior, which makes it impossible to allocate WWM registers together with normal vector registers in one pass now.
For example (a typical if-then):

bb0:
  %0 = ...
  s_cbranch_execz %bb2

bb1:
  %1 = wwm_operation
  ... = %1
  %0 = ...

bb2:
  ... = %0

VGPR %0 is dead in bb1, and WWM-VGPR %1 is defined and used in bb1. As there is no live-range conflict between them, they have a chance to get assigned the same physical register. If this happens, certain lanes of %0 might be overwritten when writing to %1. I am not sure if moving SIPreAllocateWWMRegs between the SGPR allocation and the VGPR allocation might help your case? The key point is to have SIPreAllocateWWMRegs allocate the WWM register usage introduced in SILowerSGPRSpills.

> AFAIK, the WWM register has some unmodeled liveness behavior, which makes it impossible to allocate WWM registers together with normal vector registers in one pass now.
> For example (a typical if-then):
>
>   bb0:
>     %0 = ...
>     s_cbranch_execz %bb2
>
>   bb1:
>     %1 = wwm_operation
>     ... = %1
>     %0 = ...
>
>   bb2:
>     ... = %0
>
> VGPR %0 is dead in bb1, and WWM-VGPR %1 is defined and used in bb1. As there is no live-range conflict between them, they have a chance to get assigned the same physical register. If this happens, certain lanes of %0 might be overwritten when writing to %1. I am not sure if moving SIPreAllocateWWMRegs between the SGPR allocation and the VGPR allocation might help your case? The key point is to have SIPreAllocateWWMRegs allocate the WWM register usage introduced in SILowerSGPRSpills.

Agree, allowing WWM-register allocation together with the regular vector registers would be error prone due to miscomputed liveness data.
But I guess these are edge cases. Unlike the other WWM operations, the writelane/readlane for SGPR spill stores/restores would, in most cases, span different blocks, and such a liveness miscomputation would be a rare combination.

IIRC, SIPreAllocateWWMRegs can help allocate only when we have enough free VGPRs. There is no live-range spill/split incorporated in this custom pass. It won't help in the case of large functions with many SGPR spills.
The best approach would be to introduce another regalloc pipeline between the existing SGPR and VGPR allocations. The new pipeline should allocate only the WWM registers.
It would, however, further increase compile time. But I'm not sure we have a better choice.

> Agree, allowing WWM-register allocation together with the regular vector registers would be error prone due to miscomputed liveness data.
> But I guess these are edge cases. Unlike the other WWM operations, the writelane/readlane for SGPR spill stores/restores would, in most cases, span different blocks, and such a liveness miscomputation would be a rare combination.

I think we need to make sure the idea is correct in all possible cases we can think of. The writelane/readlane shares the same behavior as WWM operations regarding the issue here: they may write to a VGPR lane whose corresponding thread is inactive. "Spanning across different blocks" won't help with the problem. Even if the writelane/readlane operations span more than a thousand blocks, they can still be nested in an outer if-then structure.

> > Agree, allowing WWM-register allocation together with the regular vector registers would be error prone due to miscomputed liveness data.
> > But I guess these are edge cases. Unlike the other WWM operations, the writelane/readlane for SGPR spill stores/restores would, in most cases, span different blocks, and such a liveness miscomputation would be a rare combination.
>
> I think we need to make sure the idea is correct in all possible cases we can think of. The writelane/readlane shares the same behavior as WWM operations regarding the issue here: they may write to a VGPR lane whose corresponding thread is inactive. "Spanning across different blocks" won't help with the problem. Even if the writelane/readlane operations span more than a thousand blocks, they can still be nested in an outer if-then structure.

Yes, we should fix this case. And we don't see a better way other than introducing a new regalloc pipeline for WWM registers alone. The effort for that is yet to be accounted for; planning a follow-up patch to split the VGPR allocation.
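Roughly, the proposed allocation pipeline would then be ordered as follows (a sketch of the pass ordering only; the name of the new stage is a placeholder):

  SGPR regalloc
  SILowerSGPRSpills       ; introduces virtual VGPRs for spill lanes and WWM operands
  WWM-register regalloc   ; new pipeline, allocates only these WWM virtual registers
  VGPR regalloc           ; existing vector regclass allocation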