Page MenuHomePhabricator

[AMDGPU] Support LDS spilling
Needs ReviewPublic

Authored by piotr on Jul 29 2022, 9:00 AM.

Details

Reviewers
arsenm
rampitec
foad
dstuttard
sebastian-ne
RamNalamothu
Group Reviewers
Restricted Project
Summary

Add experimental support for LDS spilling on targets >= gfx9.

The amount of LDS is controlled by the attribute amdgpu-lds-spill-limit-dwords.
The default value of 0 means that LDS spilling is disabled.

The implementation utilizes DS_READ_ADDTID/DS_WRITE_ADDTID instructions.

For cases where workgroup size is larger than wave size, MultiDispatchInfo
(user sgpr in PAL front-end) is used to offset the address accordingly.
With some extra work, compute could use WorkGroupInfo to drive the spill
in the backend. Sadly, the way the values are preloaded is different between
graphics and compute (user sgpr versus system sgpr).

Tested on real-world graphics content (compute and pixel shaders).

Diff Detail

Event Timeline

piotr created this revision.Jul 29 2022, 9:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 29 2022, 9:00 AM
piotr requested review of this revision.Jul 29 2022, 9:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 29 2022, 9:00 AM
piotr added reviewers: arsenm, rampitec, foad, dstuttard, sebastian-ne, Restricted Project.Jul 29 2022, 9:03 AM

I thought we spilled registers to the stack and promoted some stack memory to LDS. Is there any interaction with promoteAlloca? In particular that pass tries to use LDS up to some occupancy boundary, after which allocating more here should take us over that boundary

I thought we spilled registers to the stack and promoted some stack memory to LDS. Is there any interaction with promoteAlloca? In particular that pass tries to use LDS up to some occupancy boundary, after which allocating more here should take us over that boundary

These are orthogonal things. The only relation would be promote alloca reduces the available LDS budget available for later spilling

rampitec added inline comments.Jul 29 2022, 2:07 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
2560

Why CS only?

2565

Matching argument by name is bad, especially if such name can be used by an user.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
741

This is the most expensive check, it should go last.

piotr added inline comments.Aug 1 2022, 6:35 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
2560

This is a left-over from an older version of the patch, will remove.

2565

This is controlled by the front-end - graphics compute shaders do not have inputs that could be specified by a shader writer and end up landing here.

Having said that, I am not 100% happy with the way this is handled here either. I guess a better idea could be to rely on a function attribute set by the front-end that would say - "the user sgpr you want to use for multi dispatch info is at nth location".

Just to note, input sgprs are handled differently between graphics and compute. I need to "just" preload WorkGroupInfo, so that I could later use that sgpr in the spill code. In kernels, it is treated as a system sgpr (allocateSystemSGPRs), so preloading it should not be a problem. However, in graphics we treat it as a user sgpr and pass in the list of arguments (with the name "MultiDispatchInfo").

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
741

Ok, will re-order. (I placed this condition at the beginning, because it would be true (== return early) more often the the other checks).

piotr added a comment.Aug 1 2022, 6:39 AM

I thought we spilled registers to the stack and promoted some stack memory to LDS. Is there any interaction with promoteAlloca? In particular that pass tries to use LDS up to some occupancy boundary, after which allocating more here should take us over that boundary

These are orthogonal things. The only relation would be promote alloca reduces the available LDS budget available for later spilling

Exactly, if the amount of LDS available at the point of frame lowering is zero, no spill slots will use LDS.

We will need to generate CFI describing these spills downstream; @RamNalamothu do you see anything that would be an issue, or should it be pretty straightforward?

arsenm added inline comments.Aug 4 2022, 7:16 PM
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
890–894

Rename to MaxWorkGroupSize

893

Does this need alignment padding up to 4?

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
799–800

I don't see why we would need a new limit for this and just rely on the remaining LDS in the occupancy budget

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
406

Could we do this in a post-RA pass before LiveIntervals is discarded? I was thinking we should copy what SC does and reserve more registers, and try to reallocate them in such a pass. The same place could have smarter management of m0

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
2565

Argument names are totally meaningless. opt -strip will now break this, so you need something that doesn't rely on the name

@scott.linder

I don't think there are any issues.

Looks like it's possible to calculate LDS offset of the spill and buildCFIForVGPRToVMEMSpill() needs to be specialized to include DW_OP_LLVM_form_aspace_address for LDS.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1290

s/the are no/there are no

piotr updated this revision to Diff 450305.Aug 5 2022, 8:47 AM

Addressed review comments.

scchan added a subscriber: scchan.Aug 5 2022, 9:03 AM

Just briefly glanced through the code and it seems that it doesn't look at whether a kernel has dynamic LDS or not because LDS spilling has to be disabled in that case, can you confirm?

piotr marked 6 inline comments as done.Aug 5 2022, 9:03 AM
piotr added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
893

Not sure (and the total size is a multiply of 4 by construction).

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
799–800

For pixel shaders we would also need to make room for pixel parameters as they also reside in the same CU.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
406

Doing it in a separate pass may work, but will need to explore it - need to check if I would have access to everything I need here.

Good point about extensibility - I did not intend to do the smart m0 thing in the first implementation (left a FIXME), but it is true that to do that properly we would need kind of a data flow analysis so a separate pass would make sense in the long run.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
2565

Added an extra attr instead of matching by name.

sebastian-ne added inline comments.Aug 22 2022, 4:57 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
824–830

Maybe this should be below getStackPtrOffsetReg? Currently, it divides setStackPtrOffsetReg and getStackPtrOffsetReg.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1295

I guess M0 could have a kill flag here?

2195

The non-lds restore code does not call addToSpilledVGPRs. Is this intentional here?

piotr marked an inline comment as done.Aug 22 2022, 5:57 AM
piotr added inline comments.
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
824–830

Maybe this should be below getStackPtrOffsetReg? Currently, it divides setStackPtrOffsetReg and getStackPtrOffsetReg.

Yes, good idea.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1295

I guess M0 could have a kill flag here?

Yes, thanks.

2195

The non-lds restore code does not call addToSpilledVGPRs. Is this intentional here?

Good catch - copy & pasta error on my part.