This is an archive of the discontinued LLVM Phabricator instance.

[WIP][AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions
ClosedPublic

Authored by scott.linder on Feb 25 2020, 1:34 PM.

Details

Summary
[AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions

Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in
the entry function prologue. This frees up the preloaded scratch wave
offset register after the entry function prologue and removes the
scratch wave offset register from the calling convention ABI.

As part of this change, allow the use of an inline constant zero for the
SOffset of MUBUF instructions accessing the stack in entry functions
when a frame pointer is not requested/required. Entry functions with
calls still need to set up the calling convention ABI stack pointer
register, and reference it in order to address arguments of called
functions. The ABI stack pointer register remains unswizzled, but is now
wave-relative instead of queue-relative.

Non-entry functions also use an inline constant zero SOffset for
wave-relative scratch access, but continue to use the stack and frame
pointers as before. When the stack or frame pointer is converted to a
swizzled offset it is now scaled directly, as the scratch wave offset no
longer needs to be subtracted first.

Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling
convention.

Diff Detail

Event Timeline

scott.linder created this revision.Feb 25 2020, 1:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 25 2020, 1:34 PM

I'm having trouble working out the best way forward on this patch, with the core issue relating to the fact that we no longer need anything equivalent to a frame pointer in the entry function when there is no stack usage. This is complicated by the fact that hasFP is broken in some of the places it is called, including reservePrivateMemoryRegs. I'm not sure I completely understand where the best place to handle this is, but without addressing it I can't avoid gratuitously initializing the SP and/or FP in many cases, including a trivial kernel with no body.

I'm also not sure if my ISA for initializing the SRSRC is optimal and wanted to get feedback. I do think that in at least some cases we will need to save a DWORD out of the SRSRC while updating it, and in those cases I'm not certain scavenging one is infallible (see the cc-update-scavenge-fail.ll test case). Is there a better approach here?

I assume this is missing a lot of test updates?

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
554–555

Do we actually need these bits? I'm fairly confident these are always 0 in the HSA resource descriptor (or at least are a known constant we can just reproduce later)

558

I think just 0xffff0000 would be clearer here

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
340

These should be switched to Register at some point

llvm/test/CodeGen/AMDGPU/cc-update-scavenge-fail.ll
5 ↗(On Diff #246557)

I would move this to the first line, and check the error message to make sure it fails for the right reason

I'm having trouble working out the best way forward on this patch, with the core issue relating to the fact that we no longer need anything equivalent to a frame pointer in the entry function when there is no stack usage. This is complicated by the fact that hasFP is broken in some of the places it is called, including reservePrivateMemoryRegs. I'm not sure I completely understand where the best place to handle this is, but without addressing it I can't avoid gratuitously initializing the SP and/or FP in many cases, including a trivial kernel with no body.

Why is this a problem exactly? I only vaguely remember what kind of problems this would cause. hasFP has always been broken depending on what time it's called, so in some places we do have to guess if it's needed

arsenm added inline comments.Feb 25 2020, 2:13 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
554–555

According to this it's hardcoded: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/core/runtime/amd_aql_queue.cpp#L1015

We just need to worry about SWIZZLE_ENABLE being set to 1. This is the high bit, so all it can do is trigger a carry on the second add. So I think that means you can get away with just doing the add, and then using s_bitset1_b32 to ensure it wasn't carried away

arsenm added inline comments.Feb 25 2020, 2:17 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
554–555

Actually, I don't think any add that fits in the 48-bit address space should ever touch the high bits (although I usually manage to be wrong about known bits optimizations with adds)

arsenm added inline comments.Feb 25 2020, 3:12 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
554–555

I think this means it's OK to just not worry about the high bits: https://rise4fun.com/Alive/i24

arsenm added inline comments.Feb 25 2020, 3:19 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
554–555

As long as we know bit 48 is 0, this seems fine. As this is hardcoded in the driver, this is probably OK https://rise4fun.com/Alive/KmH

scott.linder marked an inline comment as done.Feb 25 2020, 7:12 PM

Yes, there are a lot of test updates and likely more new tests needed, but I just posted some tests that exercise the bits I'm currently stuck on for now.

I will try to articulate the issue with hasFP better tomorrow morning, but currently we are making the decision about whether to have a distinct FP (i.e. S34) before we actually know if we use the stack. If we have a call, but no stack use early, and then later we need to reference the stack we end up in a situation where at PEI time we are updating the same register both for the ABI SP and for the entry function FP, which obviously isn't right.

The right thing seems to be to not have any stack or frame pointer at all, but I am not sure how to implement that and wanted to ask for some help estimating how reasonable that would be.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
554–555

That make sense to me, and this would simplify things a lot. I don't quite understand if we need to ensure [48:62] are 0, though? If the addc carries into bit 48 is that an issue? I.e. https://rise4fun.com/Alive/qsv

At the very least, it seems like we can avoid the need to save anything and just mask in a constant, but if it is possible to avoid that too it removes a couple additional instructions from nearly every kernel prologue.

scott.linder edited reviewers, added: RamNalamothu; removed: ramana-nvr.Feb 25 2020, 7:25 PM

I'm going to go ahead with trying to eliminate the need for an FP completely in entry functions and then update the review with a more complete set of test updates. I'm sure the issue I was having with defining and using hasFP consistently between ISel and PEI could be worked around, but putting that effort into eliminating the FP entirely in entry functions seems more productive.

Update/add tests and eliminate use of FP in entry functions

scott.linder retitled this revision from [WIP][AMDGPU] Eliminate the ScratchWaveOffset register from the calling convention to [WIP][AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions.Mar 4 2020, 4:23 PM
scott.linder edited the summary of this revision. (Show Details)
arsenm added inline comments.Mar 4 2020, 4:30 PM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
631–632

Should demorgan this

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
290–292

This should not need to inspect the original IR. Why can't this just read it directly from MFI? They should be accounted there already?

293

This will be inaccurate for any struct type, this should have been computed during lowering that knows the type split

scott.linder marked 7 inline comments as done.Mar 4 2020, 4:46 PM

There are still a reasonable amount of FIXME/TODO in this patch, and I left some additional comments on each to highlight them and ask for feedback on them. I am not entirely comfortable with the way I went about implementing the special-casing for having no FP in the entry function. I would prefer not having all of the isEntryFunction checks everywhere, but I'm not sure how else to represent it?

I also would rather break this patch up more, but I don't think doing so will make it easier to understand or reduce the size of the test diffs. The only pieces I could break off naturally were some NFC changes in https://reviews.llvm.org/D75092 and switching to s33 for FP in https://reviews.llvm.org/D75657

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
290

@arsenm @nhaehnle I don't think I understand how inreg currently works relative to "preloaded" SGPRs; is/should inreg be recorded somewhere in the machine function info so this isn't necessary?

309

Similar question here, should there be a change in SITargetLowering so the preloaded count is correct?

554–555

I went the route of just always doing the 64-bit add of the scratch wave offset into the SRsrc rather than saving anything or using known constants for some of the bits. From some other discussion this should always be correct.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
340

I haven't gotten around to this yet, but I'll do this in another NFC patch.

llvm/test/CodeGen/AMDGPU/attr-amdgpu-num-sgpr.ll
8

Can anyone help me understand what we are trying to test here? It seems likely the amount of live SGPRs and the amount of available SGPRs needs to be adjusted to have this test continue to be meaningful, but in trying to correct it I realized I wasn't sure what it was testing in the first place.

llvm/test/CodeGen/AMDGPU/scratch-simple.ll
100

@arsenm @nhaehnle Similar question as above wrt. how inreg should work. Is the %swo argument in these expected to actually be allowed to coincide with the scratch wave offset?

llvm/test/CodeGen/AMDGPU/spill-offset-calculation.ll
52

Is it OK for us to fail here? This is a consequence of not having a frame pointer in entry functions and not being able to e.g. restart RA after we realize we really need it in this case.

arsenm added inline comments.Mar 9 2020, 1:09 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
290

Not directly. There shouldn't be any repeating of the calling convention logic here. Either the number of SGPR arguments should be recorded, or it should be inferred from the machine code.

It might be correct to just count the number of SGPR in the function live-in list. I think live in registers can be deleted from the list if they are proven to be unused, so this might be fragile. Finding the highest live in SGPR number may also work.

scott.linder marked 2 inline comments as done.

Support FP in entry functions by reverting most of the changes needed
before PEI in the previous patches. Now the entry function always
allocates S32 for the SP, and optionally allocates S34 as the FP.

There are still a couple tests to be updated, but they are just due
to RA noise.

scott.linder edited the summary of this revision. (Show Details)Mar 10 2020, 4:53 PM
arsenm accepted this revision.Mar 10 2020, 6:46 PM

LGTM with nits

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
362–365

Braces

388

s/unsigned/Register

395

Ditto

541

Braces

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
99

s/NoRegister/Register()

This revision is now accepted and ready to land.Mar 10 2020, 6:46 PM
t-tye added a comment.Mar 10 2020, 7:27 PM

I think commit comment "The ABI stack pointer register remains unswizzled, but is now wave-relative instead of dispatch-relative." shuld chage to "The ABI stack pointer register remains unswizzled, but is now wave-relative instead of queue-relative." since for the HSAABI the scratch base is the queue base and not per dispatch. The PALABI may use per dispatch scratch allocation.

t-tye added inline comments.Mar 10 2020, 7:59 PM
llvm/docs/AMDGPUUsage.rst
8557–8559 ↗(On Diff #249521)

"This can be done without having to perform register allocation again, which is necessary as register allocation may introduce spills." Suggest moving this to a separate bullet and reword to make clear why this approach is done:

"- Note: this approach of using a tentative scratch SRD and shifting the register numbers if used, avoids having to perform register allocation a second time if the tentative SRD is eliminated. This is more efficient and avoids the problem that the second register allocation may perform spilling which will fail as here is no longer a scratch SRD."

For consistency, should SRD be changed to V# to match the usage in the next section?

8572 ↗(On Diff #249521)

Should the manner that the kernel prolog sets the scratch V# be specified? The compiler requests that the scratch V# and wave scratch offset be passed in using the kernel descriptor (reference the section), The wave scratch offset is added to the queue base address in the scratch V#and moved to SGPR0-3.

Also specify how the kernel must set the FLAT_SCRATCH. The compiler requests that the flat scratch and wave scratch offset be passed in using the kernel descriptor (reference the section), The wave scratch offset is added to the flat scratch base and moved to FLAT_SCRATCH.

Should setup up of M0 also be defined here. For GFX6-??? it is set to the LDS size, otherwise it is set to ???.

Any other setup that hs to be done in the kernel prolog?

8614 ↗(On Diff #249521)

"private address" -> "private address space address"

8653 ↗(On Diff #249521)

Is this necessary to say since the following bullet states all SGPS except 4-31 which means SGPR0-3 aare preserved?

scott.linder edited the summary of this revision. (Show Details)Mar 11 2020, 11:04 AM

I think commit comment "The ABI stack pointer register remains unswizzled, but is now wave-relative instead of dispatch-relative." shuld chage to "The ABI stack pointer register remains unswizzled, but is now wave-relative instead of queue-relative." since for the HSAABI the scratch base is the queue base and not per dispatch. The PALABI may use per dispatch scratch allocation.

I updated the commit message, but I didn't include mention of the possibility of the PALABI differing here. Is there a more generic way to describe the old behavior for every ABI the compiler supports? As far as the compiler is concerned it is only important that the SRSRC base + the scratch wave offset gets it to the base for the scratch allocation for the wave.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
290

In switching back to supporting an FP I no longer see the need for this manifest, but there may still be a need to update this in the future. I don't think my change is making this any more fragile so I'm leaving it as it was.

scott.linder marked 9 inline comments as done.

Address feedback

scott.linder added inline comments.Mar 11 2020, 4:45 PM
llvm/docs/AMDGPUUsage.rst
8572 ↗(On Diff #249521)

I didn't notice originally that we have a section "Code Conventions > AMDHSA > Kernel Prolog" which already describes some of this. It seemed odd to put some of that here and some of that there, so I ended up trying to just move all the relevant bits to the Kernel Prolog section and reference it here. It ends up being a bit circular in that the Kernel Prolog section defers to the Calling Convention section for the definition of the ABI stack pointer, and the Calling Convention section defers to the Kernel Prolog section for the description of the properties of M0/FlatScratch/V# and how they are initialized. I think it is OK, but maybe you have some suggestions?

Finish updating remaining tests. Remove Kill from last use of scratch wave
offset in prologue, as it is used in at least some Mesa shaders.

arsenm accepted this revision.Mar 17 2020, 3:27 PM
arsenm added inline comments.
llvm/test/CodeGen/AMDGPU/scratch-simple.ll
95–96

Can you add a comment elaborating on what this tests

scott.linder marked an inline comment as done.Mar 17 2020, 3:58 PM
scott.linder added inline comments.
llvm/test/CodeGen/AMDGPU/scratch-simple.ll
95–96

From discussion with @mareko my understanding is that Mesa GS and HS shaders have the preloaded scratch wave offset SGPR fixed at SGPR5, and the inreg implementation is used to reference it in the IR. So here, the shader snippet inserted after the SI_RETURN_TO_EPILOG wants to use the scratch wave offset, and the IR passes it along by padding out the inreg arguments until it gets to where the scratch wave offset is, and then using it in the return value. I'll add something to that effect in the test.

This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/MIR/AMDGPU/machine-function-info.ll