This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add legalization case for PTR_ADD on buffer pointers
AbandonedPublic

Authored by krzysz00 on Feb 13 2023, 1:29 PM.

Details

Reviewers
arsenm
Group Reviewers
Restricted Project
Summary

On buffer pointers (address space 7), we want PTR_ADD to take 32-bit
offsets, and so, unlike all other address spaces, we do not want to
enforce the condition that those G_PTR_ADD instructions have an offset
of the same size as their pointer operand.

(Also, fix some comments and missing pattern fragment definitions
while we're here.)

Depends on D143526

Diff Detail

Event Timeline

krzysz00 created this revision.Feb 13 2023, 1:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 13 2023, 1:29 PM
krzysz00 requested review of this revision.Feb 13 2023, 1:29 PM
krzysz00 added reviewers: arsenm, Restricted Project.Feb 13 2023, 1:30 PM

I think we might want to consider whether we should be using G_PTR_ADD at all, or if a different instruction would be more appropriate for a fat pointer.

@arsenm Yeah, I think there's room for a broader design conversation here. My theory was that we could pattern-match G_PTR_ADD and G_{LOAD,STORE,...} into the relevant buffer instructions at some point in the lowering process, but I didn't have a good sense of where - I figured it'd be after the legalization pass, though.

My sense of the goal is to try and keep the fat pointers as p7 values as long as we can manage it, although that does mean we'd need to do things like the voffset/imm split somewhere further down the codegen pipeline than I think they currently are.

The other option would be to lower these G_LOAD operations into the relevant intrinsic instructions pretty early on and then to pattern-match all the address calculation onto them, which might have the downside that it loses alias information?

(I'm also suspecting that the definitions of a lot of the buffer ops might need to be loosened from s4i32/v4i32 to SReg_128/VReg_128, since that's the actual constraint, but that's not this patch.)

arsenm accepted this revision.Feb 13 2023, 4:52 PM

I guess we can go with this for now. I think this doesn't really work the same, though - e.g. what happens on overflow of the 48-bit pointer part?

We do need to be able to handle these standalone, independent of a buffer instruction, which may require emulating any bounds checking the buffer instruction performs.

This revision is now accepted and ready to land.Feb 13 2023, 4:52 PM
piotr added inline comments.
llvm/lib/Target/AMDGPU/AMDGPU.h
382

This is now in conflict with the description in AMDGPUUsage.rst, which says 160-bit.

In fact, making it 160-bit (including offset) was a deliberate choice, see discussion in https://reviews.llvm.org/D58957?id=189289#inline-522534. But perhaps now it's good time to revisit the discussion, fyi @nhaehnle.

@piotr @nhaehnle @sheredom My proposal here - and in the patch stack in general - is that we do revisit that 160-bit idea.

How I imagine this working is that, if you GEP a buffer descriptor, the operation that that lowers to - here G_PTR_ADD - does not modify the input buffer descriptor value at all. Instead, all these offsets we add are accumulated into the voffset and imm fields of the relevant MUBUF instruction.

That means that

%y = gep i32, ptr addrspace(7) %x, i32 1
%z = load i32, ptr addrspace(7) %y

lowers to

[%z] = BUFFER_LOAD_DWORD [%x], offset:4

and any pattern of code that can't be translated into offsets on a load/store/atomic is unsupported.
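To make the folding concrete, here is a minimal Python sketch of the proposed model (the names and the tuple-style representation are illustrative only, not actual LLVM code): the descriptor flows through GEPs unchanged, and only the accumulated byte offset reaches the buffer instruction.

```python
# Hypothetical model of the proposal above: GEPs on an addrspace(7)
# pointer do not change the descriptor value; their byte offsets are
# folded into the offset operand of the final buffer load/store.

def fold_geps_into_load(descriptor, gep_byte_offsets):
    """Return the (descriptor, offset) pair a BUFFER_LOAD would use."""
    # The descriptor flows through untouched; only the accumulated
    # byte offset reaches the MUBUF instruction.
    total = sum(gep_byte_offsets)
    return descriptor, total

# gep i32, ptr addrspace(7) %x, i32 1 adds 4 bytes (sizeof(i32)).
desc, off = fold_geps_into_load("rsrc", [4])
```

Under this model, a chain of GEPs simply contributes a sum of offsets to the one memory instruction that consumes the pointer.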

Now, this does mean that, for example

%y = gep i32, ptr addrspace(7) %x, i32 1
%z = ptrtoint ptr addrspace(7) %y to i128

produces ... who the heck knows what ... in %z, but buffer descriptors are a very weird kind of pointer and anyone emitting such code either knows exactly what they're doing or is Wrong. (My gut tells me that, if I _had_ to pick a behavior, ptrtoint %y == ptrtoint %x)

Now, in the case that the buffer descriptor itself is non-uniform, I'm willing to initially declare that case an "instruction selection failed" case and perhaps add emulation in the future.

@piotr @nhaehnle @sheredom My proposal here - and in the patch stack in general - is that we do revisit that 160-bit idea.

How I imagine this working is that, if you GEP a buffer descriptor, the operation that that lowers to - here G_PTR_ADD - does not modify the input buffer descriptor value at all. Instead, all these offsets we add are accumulated into the voffset and imm fields of the relevant MUBUF instruction.

Yes. In other words, the GEP doesn't modify the 128 bits of buffer descriptor that are in the pointer, but it does modify the 32 bits of offset that are in the pointer.
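As a toy illustration of this 160-bit view (the packed integer layout here is an assumption made for the sketch, not the real in-register format), GEP arithmetic touches only the 32 offset bits and wraps modulo 2^32:

```python
# Toy model of a 160-bit buffer fat pointer: 128 descriptor bits
# packed above a 32-bit offset. The layout is illustrative only.

MASK32 = (1 << 32) - 1

def gep_p7(ptr160, byte_offset):
    """GEP leaves the descriptor bits alone and wraps the 32-bit offset."""
    desc = ptr160 >> 32           # 128 descriptor bits, untouched
    off = ptr160 & MASK32         # 32 offset bits, modified by GEP
    off = (off + byte_offset) & MASK32
    return (desc << 32) | off
```

This also gives ptrtoint/inttoptr an obvious meaning: they just expose or consume the full 160-bit quantity.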

That means that

%y = gep i32, ptr addrspace(7) %x, i32 1
%z = load i32, ptr addrspace(7) %y

lowers to

[%z] = BUFFER_LOAD_DWORD [%x], offset:4

That doesn't quite work, though, because %x itself might already have an offset. It really needs to translate to something like:

%y_offset = G_ADD %x_offset, 4
%z = G_AMDGPU_BUFFER_LOAD_DWORD %y_offset, %x_buffer

and %y_offset can then be split into a voffset and an inst_offset part later, during instruction selection.
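A rough sketch of that later split, assuming a small unsigned immediate field in the instruction (the 12-bit width below is an assumption for illustration; the exact field width varies by hardware generation):

```python
# Hypothetical voffset/inst_offset split during instruction
# selection: only the low bits of a known constant offset fit in the
# instruction's immediate field; the rest must go in a register.

IMM_BITS = 12  # assumed width of the MUBUF immediate offset field

def split_offset(const_offset):
    """Split a known byte offset into (voffset, inst_offset)."""
    inst_offset = const_offset & ((1 << IMM_BITS) - 1)
    voffset = const_offset - inst_offset  # remainder carried in a VGPR
    return voffset, inst_offset
```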

Now, this does mean that, for example

%y = gep i32, ptr addrspace(7) %x, i32 1
%z = ptrtoint ptr addrspace(7) %y to i128

produces ... who the heck knows what ... in %z, but buffer descriptors are a very weird kind of pointer and anyone emitting such code either knows exactly what they're doing or is Wrong. (My gut tells me that, if I _had_ to pick a behavior, ptrtoint %y == ptrtoint %x)

This problem is very easily solved by making the pointers into 160-bit quantities as initially intended.

Note that @lgc.late.launder.fat.pointer is not a full inttoptr for fat pointers. It so happens that in LLPC we never need an initial offset different from 0, and so that's why this lgc "intrinsic" doesn't have the offset part as input even though it sort of plays the role of inttoptr. But if you look at how the operation is lowered in our PatchBufferOp pass, you'll see clearly that it produces a 160-bit quantity: 128 bits of buffer descriptor + 32 bits of offset (which just happen to always be 0).

Now, in the case that the buffer descriptor itself is non-uniform, I'm willing to initially declare that case an "instruction selection failed" case and perhaps add emulation in the future.

That's not acceptable for the graphics use cases.

(quick notes from phone)

I'm not convinced you need to store the offset in the actual value of the pointer itself. That is, because address space 7 is non-integral, the only operations that can modify the address a buffer descriptor points to (unless the user really wants to do weird things) are GEPs.

This means that, to get the offset argument to the load operation, all we have to do is trace back through the G_PTR_ADD operations until we reach a G_PTR_ADD input that didn't come from a GEP (ok, there is the complexity that we'd need to replicate conditionals if we really wanted to do this trace right), take the sum of those offset values, and use it as the offset to the load/store.
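As a sketch of that back-trace (using a made-up tuple encoding of instructions, and ignoring control flow entirely, which is exactly where the idea runs into trouble):

```python
# Sketch of the back-trace idea: walk G_PTR_ADD definitions from a
# pointer register back to a non-GEP root, summing constant offsets.
# `defs` maps a register name to a ("G_PTR_ADD", base, offset) tuple;
# this encoding is illustrative, not real MIR.

def trace_offset(defs, reg):
    """Sum G_PTR_ADD offsets from `reg` back to its non-GEP root."""
    total = 0
    while reg in defs and defs[reg][0] == "G_PTR_ADD":
        _, base, off = defs[reg]
        total += off
        reg = base
    return reg, total

defs = {"y": ("G_PTR_ADD", "x", 4), "z": ("G_PTR_ADD", "y", 8)}
```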

What worries me about the i160 approach is that it'll mess up things like allocating function arguments: if I want to pass a buffer descriptor to a function, that takes four registers and not five.

foad added a comment.Feb 17 2023, 1:44 AM

I'm not convinced you need to store the offset in the actual value of the pointer itself. That is, because address space 7 is non-integral, the only operations that can modify the address a buffer descriptor points to (unless the user really wants to do weird things) are GEPs.

This means that, to get the offset argument to the load operation, all we have to do is trace through the G_PTR_ADD operations and add those together until we reach a G_PTR_ADD input that didn't come from a GEP

That doesn't work for a couple of reasons. First, the G_PTR_ADDs could be in a loop or other complex control flow, so you can't just statically trace them back through the program. Second, there are other things you can do with arbitrary pointers in IR that the backend needs to be able to implement: you can store them to memory and load them later, you can pass them into and out of function calls, you can use them in phi and select instructions, and so on.

(ok, there is the complexity that we'd need to replicate conditionals if we really wanted to do this trace right)

Good point, I'd forgotten about the conditionals :)

Ok, to elaborate: the proposal - which is wonky enough that going i160 might be the right call here, even though we don't have an s_load_dwordx5 and so we'd need something like that buffer pointer launder intrinsic - is that (given that we replace all G_LOAD with a G_OFFSET_LOAD) we replace

%y = G_PTR_ADD p7 %x, s32 %a
%v = G_OFFSET_LOAD p7 %y, s32 %b

with

%newOff = G_ADD s32 %a, s32 %b
%v = G_OFFSET_LOAD p7 %x, s32 %newOff

If we get to a select/phi, we duplicate it into a select/phi on the offset and a select/phi on the base pointer, and if we start storing these pointers to memory, we materialize the offset by incrementing the base address in the descriptor and then decrementing the buffer extent by the same amount.
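A toy version of that rewrite, with instructions encoded as illustrative tuples (this is a model of the proposed combine, not actual combiner code):

```python
# Proposed combine: a G_OFFSET_LOAD through a G_PTR_ADD result is
# re-based onto the GEP's input pointer, and the two offsets are
# added. Tuple encodings of the instructions are illustrative only.

def combine_load(ptr_add, offset_load):
    """(G_PTR_ADD x, a) feeding (G_OFFSET_LOAD y, b) -> G_OFFSET_LOAD x, a+b."""
    _, base, a = ptr_add         # %y = G_PTR_ADD %x, %a
    _, _ptr, b = offset_load     # %v = G_OFFSET_LOAD %y, %b
    return ("G_OFFSET_LOAD", base, a + b)
```

Applied repeatedly, this drains all GEP offsets into the load, leaving the base pointer untouched.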

But, having written that out, i160 and having a <4 x i32>/i128/... to address space 7 intrinsic - or even not needing one, since inttoptr exists - may well be the better approach here.

The catch, I think, will be *when* we split away the offset part and the pointer part during ISel - and whether then it'll be too late to run uniformity analysis on the pointer part.

... on the third hand, the "increment address, decrement extent" fallback for GEPs that don't merge into a load or store isn't _that_ absurd

krzysz00 abandoned this revision.Apr 5 2023, 6:17 PM

This isn't a path we're going down anymore, closing.