This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix the inconsistency in soffset for MUBUF stack accesses.
ClosedPublic

Authored by cdevadas on Jan 20 2021, 10:33 AM.

Details

Summary

During instruction selection, there is an inconsistency in choosing
the initial soffset value. With certain early passes, this value is
getting modified and that brought additional fixup during
eliminateFrameIndex to work for all cases. This whole transformation
looks trivial and can be handled better.

This patch clearly defines the initial value for soffset and keeps it
unchanged before eliminateFrameIndex. The initial value must be zero
for MUBUF with a frame index. The non-frame index MUBUF forms that
use a raw offset from SP will have the stack register for soffset.
During frame elimination, the soffset remains zero for entry functions
with zero dynamic allocas and no callsites, or else is updated to the
appropriate frame/stack register.

Also, did some code clean up and made all asserts around soffset
stricter to match.

Diff Detail

Event Timeline

cdevadas created this revision.Jan 20 2021, 10:33 AM
cdevadas requested review of this revision.Jan 20 2021, 10:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2021, 10:33 AM

Thank you for the patch, I definitely think this is a step in the right direction! However, I think we are still broken in common cases, the first of which can be seen by simply removing amdgpu_kernel from llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll and noticing that the soffset of the buffer_load_dwords in %bb.1 do not correctly refer to the frame pointer, they are constant 0 just like in the kernel case:

$ sed 's/amdgpu_kernel//' llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll | release/bin/llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs | grep '%bb' -A3
; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_waitcnt_vscnt null, 0x0
        s_or_saveexec_b32 s4, -1
--
; %bb.1:                                ; %if.then4.i
        s_clause 0x1
        buffer_load_dword v0, v40, s[0:3], 0 offen
        buffer_load_dword v1, v40, s[0:3], 0 offen offset:4

This was true before this patch, and also true with a slightly different result back when the FIXME was first added:

$ git switch --detach 60b1967c3933c && ninja -C release/ llc && sed 's/amdgpu_kernel//' llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll | release/bin/llc -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs | grep '%bb' -A3
HEAD is now at 60b1967c3933 [AMDGPU] Add Scratch Wave Offset to Scratch Buffer Descriptor in entry functions
ninja: Entering directory `release/'
ninja: no work to do.
; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_waitcnt_vscnt null, 0x0
        s_or_saveexec_b32 s4, -1
--
; %bb.1:                                ; %if.then4.i
        buffer_load_dword v0, v32, s[0:3], s32 offen
        buffer_load_dword v1, v32, s[0:3], s32 offen offset:4
        s_waitcnt vmcnt(0)

The fundamental issue is in all of these cases is that we still are relying on eliminateFrameIndex to correct the soffset of an MUBUF instruction, when it is not guaranteed that the frame index will survive to this point in it's original place (i.e. as an operand of the MUBUF we care about updating). From the comment, it seems like LocalStackSlotAllocation is one such place that can move the frame index. One mitigating factor is that we also try to fold the frame index back into the MUBUF when possible in another pass, but as we can't do this in all cases, we can't rely on it for generating correct code.

I think using a pseudo for these MUBUF cases early, and lowering them around the same time as eliminateFrameIndex (i.e. some time after we know how to populate the soffset field of the corresponding "real" MUBUF instruction) should work in all cases. IIRC this was the approach Matt liked back when the FIXME was added.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
497–504

Not exactly related to this patch, but I feel like I must be reading this wrong. Why is there a different assert and an early return for DEBUG builds? If isLegalFLATOffset implies isLegalMUBUFImmOffset then this is just an additional, stricter check to what is below, but it still seems odd to have a return here then. Why not just fall through in the DEBUG case instead of copy-pasting the code?

llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
19 ↗(On Diff #317931)

I don't see the connection to the patch at hand, is this just unrelated cleanup or does this patch affect the dead code here?

Thank you for the patch, I definitely think this is a step in the right direction! However, I think we are still broken in common cases, the first of which can be seen by simply removing amdgpu_kernel from llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll and noticing that the soffset of the buffer_load_dwords in %bb.1 do not correctly refer to the frame pointer, they are constant 0 just like in the kernel case:

The frame index being separated from the memory operation is fine. 0 is always an acceptable soffset, kernel or not. The vaddr will be interpreted as the absolute address. If the frame index is materialized by a move, it will be expanded such that the address placed in the mubuf vaddr should work

Thank you for the patch, I definitely think this is a step in the right direction! However, I think we are still broken in common cases, the first of which can be seen by simply removing amdgpu_kernel from llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll and noticing that the soffset of the buffer_load_dwords in %bb.1 do not correctly refer to the frame pointer, they are constant 0 just like in the kernel case:

The frame index being separated from the memory operation is fine. 0 is always an acceptable soffset, kernel or not. The vaddr will be interpreted as the absolute address. If the frame index is materialized by a move, it will be expanded such that the address placed in the mubuf vaddr should work

OK, that makes sense, sorry for the noise!

I'm not sure why I didn't consider the code right above the MUBUF handling in eliminateFrameIndex before assuming this wasn't correct. I'm also not sure why I didn't just read the ISA for the function case more carefully.

I'm satisfied that starting with 0 for the frame-index case and requiring that be the true in eliminateFrameIndex is a sound approach, then.

cdevadas added inline comments.Jan 21 2021, 3:02 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
497–504

The flat scratch code was an add-on to the existing code that earlier handled MUBUF instructions alone. Looks like the early return was meant to skip the soffset validation.

llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
19 ↗(On Diff #317931)

This is completely an unrelated cleanup. This test was added as part of D87472. I noticed the dead argument and thought I can remove it.
Let me know if this should go as a separate NFC commit.

scott.linder accepted this revision.Jan 21 2021, 9:01 AM

I'd recommend committing the NFC changes before this, no need for a separate review.

Otherwise this LGTM, thank you!

This revision is now accepted and ready to land.Jan 21 2021, 9:01 AM