[AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in the entry function prologue. This frees up the preloaded scratch wave offset register after the entry function prologue and removes the scratch wave offset register from the calling convention ABI. As part of this change, allow the use of an inline constant zero for the SOffset of MUBUF instructions accessing the stack in entry functions when a frame pointer is not requested/required. Entry functions with calls still need to set up the calling convention ABI stack pointer register, and reference it in order to address arguments of called functions. The ABI stack pointer register remains unswizzled, but is now wave-relative instead of queue-relative. Non-entry functions also use an inline constant zero SOffset for wave-relative scratch access, but continue to use the stack and frame pointers as before. When the stack or frame pointer is converted to a swizzled offset it is now scaled directly, as the scratch wave offset no longer needs to be subtracted first. Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling convention.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I'm having trouble working out the best way forward on this patch, with the core issue relating to the fact that we no longer need anything equivalent to a frame pointer in the entry function when there is no stack usage. This is complicated by the fact that hasFP is broken in some of the places it is called, including reservePrivateMemoryRegs. I'm not sure I completely understand where the best place to handle this is, but without addressing it I can't avoid gratuitously initializing the SP and/or FP in many cases, including a trivial kernel with no body.
I'm also not sure if my ISA for initializing the SRSRC is optimal and wanted to get feedback. I do think that in at least some cases we will need to save a DWORD out of the SRSRC while updating it, and in those cases I'm not certain scavenging one is infallible (see the cc-update-scavenge-fail.ll test case). Is there a better approach here?
I assume this is missing a lot of test updates?
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
555–556 | Do we actually need these bits? I'm fairly confident these are always 0 in the HSA resource descriptor (or at least are a known constant we can just reproduce later) | |
559 | I think just 0xffff0000 would be clearer here | |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
339 | These should be switched to Register at some point | |
llvm/test/CodeGen/AMDGPU/cc-update-scavenge-fail.ll | ||
5 ↗ | (On Diff #246557) | I would move this to the first line, and check the error message to make sure it fails for the right reason |
Why is this a problem exactly? I only vaguely remember what kind of problems this would cause. hasFP has always been broken depending on what time it's called, so in some places we do have to guess if it's needed
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
555–556 | According to this it's hardcoded: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/core/runtime/amd_aql_queue.cpp#L1015 We just need to worry about SWIZZLE_ENABLE being set to 1. This is the high bit, so all it can do is trigger a carry on the second add. So I think that means you can get away with just doing the add, and then using s_bitset1_b32 to ensure it wasn't carried away |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
555–556 | Actually, I don't think any add that fits in the 48-bit address space should ever touch the high bits (although I usually manage to be wrong about known bits optimizations with adds) |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
555–556 | I think this means it's OK to just not worry about the high bits: https://rise4fun.com/Alive/i24 |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
555–556 | As long as we know bit 48 is 0, this seems fine. As this is hardcoded in the driver, this is probably OK https://rise4fun.com/Alive/KmH |
Yes, there are a lot of test updates and likely more new tests needed, but I just posted some tests that exercise the bits I'm currently stuck on for now.
I will try to articulate the issue with hasFP better tomorrow morning, but currently we are making the decision about whether to have a distinct FP (i.e. S34) before we actually know if we use the stack. If we have a call, but no stack use early, and then later we need to reference the stack we end up in a situation where at PEI time we are updating the same register both for the ABI SP and for the entry function FP, which obviously isn't right.
The right thing seems to be to not have any stack or frame pointer at all, but I am not sure how to implement that and wanted to ask for some help estimating how reasonable that would be.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
555–556 | That make sense to me, and this would simplify things a lot. I don't quite understand if we need to ensure [48:62] are 0, though? If the addc carries into bit 48 is that an issue? I.e. https://rise4fun.com/Alive/qsv At the very least, it seems like we can avoid the need to save anything and just mask in a constant, but if it is possible to avoid that too it removes a couple additional instructions from nearly every kernel prologue. |
I'm going to go ahead with trying to eliminate the need for an FP completely in entry functions and then update the review with a more complete set of test updates. I'm sure the issue I was having with defining and using hasFP consistently between ISel and PEI could be worked around, but putting that effort into eliminating the FP entirely in entry functions seems more productive.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp | ||
---|---|---|
626–627 | Should demorgan this | |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
290–292 | This should not need to inspect the original IR. Why can't this just read it directly from MFI? They should be accounted there already? | |
293 | This will be inaccurate for any struct type, this should have been computed during lowering that knows the type split |
There are still a reasonable amount of FIXME/TODO in this patch, and I left some additional comments on each to highlight them and ask for feedback on them. I am not entirely comfortable with the way I went about implementing the special-casing for having no FP in the entry function. I would prefer not having all of the isEntryFunction checks everywhere, but I'm not sure how else to represent it?
I also would rather break this patch up more, but I don't think doing so will make it easier to understand or reduce the size of the test diffs. The only pieces I could break off naturally were some NFC changes in https://reviews.llvm.org/D75092 and switching to s33 for FP in https://reviews.llvm.org/D75657
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
290 | ||
300 | Similar question here, should there be a change in SITargetLowering so the preloaded count is correct? | |
555–556 | I went the route of just always doing the 64-bit add of the scratch wave offset into the SRsrc rather than saving anything or using known constants for some of the bits. From some other discussion this should always be correct. | |
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h | ||
339 | I haven't gotten around to this yet, but I'll do this in another NFC patch. | |
llvm/test/CodeGen/AMDGPU/attr-amdgpu-num-sgpr.ll | ||
8 | Can anyone help me understand what we are trying to test here? It seems likely the amount of live SGPRs and the amount of available SGPRs needs to be adjusted to have this test continue to be meaningful, but in trying to correct it I realized I wasn't sure what it was testing in the first place. | |
llvm/test/CodeGen/AMDGPU/scratch-simple.ll | ||
143 | ||
llvm/test/CodeGen/AMDGPU/spill-offset-calculation.ll | ||
52 | Is it OK for us to fail here? This is a consequence of not having a frame pointer in entry functions and not being able to e.g. restart RA after we realize we really need it in this case. |
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
290 | Not directly. There shouldn't be any repeating of the calling convention logic here. Either the number of SGPR arguments should be recorded, or it should be inferred from the machine code. It might be correct to just count the number of SGPR in the function live-in list. I think live in registers can be deleted from the list if they are proven to be unused, so this might be fragile. Finding the highest live in SGPR number may also work. |
Support FP in entry functions by reverting most of the changes needed
before PEI in the previous patches. Now the entry function always
allocates S32 for the SP, and optionally allocates S34 as the FP.
There are still a couple tests to be updated, but they are just due
to RA noise.
I think commit comment "The ABI stack pointer register remains unswizzled, but is now wave-relative instead of dispatch-relative." shuld chage to "The ABI stack pointer register remains unswizzled, but is now wave-relative instead of queue-relative." since for the HSAABI the scratch base is the queue base and not per dispatch. The PALABI may use per dispatch scratch allocation.
llvm/docs/AMDGPUUsage.rst | ||
---|---|---|
8626–8627 | Should the manner that the kernel prolog sets the scratch V# be specified? The compiler requests that the scratch V# and wave scratch offset be passed in using the kernel descriptor (reference the section), The wave scratch offset is added to the queue base address in the scratch V#and moved to SGPR0-3. Also specify how the kernel must set the FLAT_SCRATCH. The compiler requests that the flat scratch and wave scratch offset be passed in using the kernel descriptor (reference the section), The wave scratch offset is added to the flat scratch base and moved to FLAT_SCRATCH. Should setup up of M0 also be defined here. For GFX6-??? it is set to the LDS size, otherwise it is set to ???. Any other setup that hs to be done in the kernel prolog? | |
8637–8639 | "This can be done without having to perform register allocation again, which is necessary as register allocation may introduce spills." Suggest moving this to a separate bullet and reword to make clear why this approach is done: "- Note: this approach of using a tentative scratch SRD and shifting the register numbers if used, avoids having to perform register allocation a second time if the tentative SRD is eliminated. This is more efficient and avoids the problem that the second register allocation may perform spilling which will fail as here is no longer a scratch SRD." For consistency, should SRD be changed to V# to match the usage in the next section? | |
8670 | "private address" -> "private address space address" | |
8709 | Is this necessary to say since the following bullet states all SGPS except 4-31 which means SGPR0-3 aare preserved? |
I updated the commit message, but I didn't include mention of the possibility of the PALABI differing here. Is there a more generic way to describe the old behavior for every ABI the compiler supports? As far as the compiler is concerned it is only important that the SRSRC base + the scratch wave offset gets it to the base for the scratch allocation for the wave.
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp | ||
---|---|---|
290 | In switching back to supporting an FP I no longer see the need for this manifest, but there may still be a need to update this in the future. I don't think my change is making this any more fragile so I'm leaving it as it was. |
llvm/docs/AMDGPUUsage.rst | ||
---|---|---|
8626–8627 | I didn't notice originally that we have a section "Code Conventions > AMDHSA > Kernel Prolog" which already describes some of this. It seemed odd to put some of that here and some of that there, so I ended up trying to just move all the relevant bits to the Kernel Prolog section and reference it here. It ends up being a bit circular in that the Kernel Prolog section defers to the Calling Convention section for the definition of the ABI stack pointer, and the Calling Convention section defers to the Kernel Prolog section for the description of the properties of M0/FlatScratch/V# and how they are initialized. I think it is OK, but maybe you have some suggestions? |
Finish updating remaining tests. Remove Kill from last use of scratch wave
offset in prologue, as it is used in at least some Mesa shaders.
llvm/test/CodeGen/AMDGPU/scratch-simple.ll | ||
---|---|---|
101 | Can you add a comment elaborating on what this tests |
llvm/test/CodeGen/AMDGPU/scratch-simple.ll | ||
---|---|---|
101 | From discussion with @mareko my understanding is that Mesa GS and HS shaders have the preloaded scratch wave offset SGPR fixed at SGPR5, and the inreg implementation is used to reference it in the IR. So here, the shader snippet inserted after the SI_RETURN_TO_EPILOG wants to use the scratch wave offset, and the IR passes it along by padding out the inreg arguments until it gets to where the scratch wave offset is, and then using it in the return value. I'll add something to that effect in the test. |
Should the manner that the kernel prolog sets the scratch V# be specified? The compiler requests that the scratch V# and wave scratch offset be passed in using the kernel descriptor (reference the section), The wave scratch offset is added to the queue base address in the scratch V#and moved to SGPR0-3.
Also specify how the kernel must set the FLAT_SCRATCH. The compiler requests that the flat scratch and wave scratch offset be passed in using the kernel descriptor (reference the section), The wave scratch offset is added to the flat scratch base and moved to FLAT_SCRATCH.
Should setup up of M0 also be defined here. For GFX6-??? it is set to the LDS size, otherwise it is set to ???.
Any other setup that hs to be done in the kernel prolog?