Page MenuHomePhabricator

[AMDGPU] Add intrinsics llvm.amdgcn.{raw|struct}.buffer.load.lds
ClosedPublic

Authored by rampitec on May 3 2022, 3:01 PM.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2022, 3:01 PM
rampitec requested review of this revision.May 3 2022, 3:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2022, 3:01 PM
Herald added a subscriber: wdng. · View Herald Transcript

It uses a single mem operand with both load and store in addrspace(4). The addrspace(4) is common for all buffer intrinsics memops. In fact neither MemSDNode nor MemIntrinsicSDNode can have 2 mem ops. Only MachineSDNode and final MI can. I certainly do now want to create a MachineSDNode here and duplicate a lot of buffer operations logic. If we believe we really want 2 mem ops these can be split in the FinalizeLowering.

It will also need to be rebased on top of D124550 to handle hazard between M0 initialization and LDS DMA.

arsenm added a comment.May 3 2022, 3:12 PM

Why does this need an intrinsic? I thought the whole point of the LDS DMA thing was an optimization the backend would perform and doesn't need to be exposed directly

Why does this need an intrinsic? I thought the whole point of the LDS DMA thing was an optimization the backend would perform and doesn't need to be exposed directly

We cannot match this pattern. If you look at the addressing mode this is byzantine. Yet another addtid instruction on steroids.

arsenm added inline comments.May 3 2022, 3:44 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4289 ↗(On Diff #426850)

Why is this needed if it's in the MMO?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

Should use the return / data type?

rampitec added inline comments.May 3 2022, 4:11 PM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4289 ↗(On Diff #426850)

I believe to select based on the MMO I would need to write a complex pattern.

rampitec updated this revision to Diff 426876.May 3 2022, 4:15 PM
rampitec marked an inline comment as done.
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

Changed to return. What do you mean by 'use data type'?

arsenm added inline comments.May 3 2022, 4:19 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

You're looking at the pointer element type instead of the return / data type. i.e. we would have i8/i16/i32 return types and you don't need to look at the pointer

1199–1200

I just noticed there is no return type so this is just introducing a dependency on typed pointers which is a no-go.

I don't actually see why can't we match these from the buffer intrinsic plus LDS access?

rampitec added inline comments.May 3 2022, 4:20 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

It does not return anything. This instruction does not have vdata. The only way to know the size is by looking at the overloaded LDS base pointer pointee.

rampitec added inline comments.May 3 2022, 4:24 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

LDS address = LDS_base + LDS_offset + inst_offset + (TIDinWave * 4)

We do not have TIDinWave, certainly not after selection. Even before selection it is extremely problematic.

Why typed pointer is a no-go if that works?

rampitec added inline comments.May 3 2022, 4:33 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

On top of that MEM_ADDR also depends on the TID. It not the same address as a normal buffer_load would use with the same operands.

arsenm added inline comments.May 3 2022, 4:34 PM
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1277

This should be addrspace 3 pointers only also

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

Pointee types have been removed from the IR. If this really needs the type it would need to use an attribute on the parameter to carry it which may be new territory

rampitec added inline comments.May 3 2022, 4:44 PM
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1277

It cannot be overloaded on the pointee type, infrastructure limitation. We are using llvm_anyptr_ty everywhere in such context.

If in turn we cannot use pointee types at all then this could be non overloaded pointer to void in addrspace 3.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
1199–1200

It does not really need type but it needs size. I can add immediate to the intrinsic and switch to void* for LDS base.

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

Or maybe after the pointer, although it is less convenient for lowering.

arsenm added a comment.May 4 2022, 2:50 AM

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

arsenm added a comment.May 4 2022, 2:57 AM

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

Apparently this isn't well developed but works. The verifier is hardcoding these intrinsics (it's also looking at the call site instead of the intrinsic declaration attributes)

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

Apparently this isn't well developed but works. The verifier is hardcoding these intrinsics (it's also looking at the call site instead of the intrinsic declaration attributes)

What's wrong with the idea of an i32 imm %size argument? That seems to me more in line with the philosophy of caring less about types.

arsenm added a comment.May 4 2022, 4:26 AM

What's wrong with the idea of an i32 imm %size argument? That seems to me more in line with the philosophy of caring less about types.

I'd prefer to keep any intrinsics that look like a load or store to look more like the regular load or store instructions. All of these arbitrary immediate parameters are uglier (e.g. the memory ordering arguments that don't actually work on some of the atomics)

To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.

There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though

This will need a clang builtin to produce the attribute. I am not sure we really want to expose it as a clang builtin.

What's wrong with the idea of an i32 imm %size argument? That seems to me more in line with the philosophy of caring less about types.

I'd prefer to keep any intrinsics that look like a load or store to look more like the regular load or store instructions. All of these arbitrary immediate parameters are uglier (e.g. the memory ordering arguments that don't actually work on some of the atomics)

It is more or less similar to memcpy, and memcpy uses size argument.

asroy added a subscriber: asroy.May 4 2022, 2:30 PM
asroy added inline comments.
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17

m0 holds the size of LDS, should we save the value of m0 before overwriting it, and write the value back before issuing ds_read?

arsenm added inline comments.May 4 2022, 2:33 PM
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17

Every user of m0 is supposed to set it itself, and we hopefully clean up the redundant rewrites. It's not something that's generally saved and restored per operation

rampitec added inline comments.May 4 2022, 2:34 PM
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17

DS_* do not read M0 since gfx9. These intrinsics are only available since gfx9.
Moreover, on gfx8 and earlier selection of DS opcodes takes care about M0 initialization right before the opcode.

rampitec updated this revision to Diff 427150.May 4 2022, 3:03 PM
rampitec marked 6 inline comments as done.
  • Removed the overload and added i32 %size operand instead.
  • LDS pointer is i8 addrspace(3) now qualified with the address space.
  • Rebased on the change to handle hazards between m0 initialization and these operations.
ramjana added a subscriber: ramjana.May 6 2022, 8:03 AM

Why does this need an intrinsic? I thought the whole point of the LDS DMA thing was an optimization the backend would perform and doesn't need to be exposed directly

  • Removed the overload and added i32 %size operand instead.
  • LDS pointer is i8 addrspace(3) now qualified with the address space.
  • Rebased on the change to handle hazards between m0 initialization and these operations.
  • Removed the overload and added i32 %size operand instead.
  • LDS pointer is i8 addrspace(3) now qualified with the address space.
  • Rebased on the change to handle hazards between m0 initialization and these operations.

Just to be clear , Is your expectation that intrinsic user to save and restore m0 before calling buffer_load lds intrinsic?

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll
17

Just to be clear , Is your expectation that intrinsic user to save and restore m0 before calling buffer_load lds intrinsic?

Just to be clear , Is your expectation that intrinsic user to save and restore m0 before calling buffer_load lds intrinsic?

No, you do not have to.

rampitec updated this revision to Diff 428454.May 10 2022, 11:46 AM

Do not split voffset because inst_offset is applied to both VMEM and LDS address and voffset is not. Add a separate operand instead.

rampitec updated this revision to Diff 428759.May 11 2022, 1:39 PM

Removed support for wide than DWORD ops. See D125409.

piotr added a subscriber: piotr.May 13 2022, 2:31 AM
piotr added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4277–4279 ↗(On Diff #428759)

What is preventing this from clobbering M0? There are intrinsics like int_amdgcn_interp* that have a dependency on M0. Shouldn't the code save the existing M0 and restore it after the load?

arsenm added inline comments.May 13 2022, 2:55 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4277–4279 ↗(On Diff #428759)

m0 isn't treated as a preserved value. Each user is supposed to initialize m0 itself

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8261

I don't see how / where this preserves the LDS bit

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
389

If you're going to rely on the memory operand, the verifier needs to start enforcing these have one memory operand (well, 2 actually with the same sizes)

rampitec updated this revision to Diff 429278.May 13 2022, 10:00 AM

Return false from getMemOperandsWithOffsetWidth() instead of checking mem operand.

rampitec updated this revision to Diff 429347.May 13 2022, 1:56 PM
rampitec retitled this revision from [AMDGPU] Add intrinsic llvm.amdgcn.raw.buffer.load.lds to [AMDGPU] Add intrinsics llvm.amdgcn.{raw|struct}.buffer.load.lds.

Switched to direct select which allows to use 2 separate memory operands.
The patch now handles both raw and struct intrinsics.

rampitec marked 2 inline comments as done.May 13 2022, 1:58 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8261

It has different number of operands comparing to the SIbuffer_load, so selects into _LDS versions of opcodes.

In fact after I have removed offset split because we cannot do it on one pointer only, and dropped multi-dword support I start thinking it might be better to drop SIbuffer_load_lds, patterns, and produce MachineSDNode right here (like in the D125279 for global load), it will not be so much code anymore and I will be able to produce 2 separate memory operands.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
389

On a second thought it is better to just return false here. We cannot have a reasonable pointer here on either side anyway, and in fact even 2 memory operands which it should ideally have should be of a different size for a sub-dword operations. A load can be sub-dword, but the store is always extended to a dword.

arsenm added inline comments.May 16 2022, 2:10 PM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3080

Should return false, the verifier isn't enforcing this

3130

The verifier should probably be enforcing MMO ordering if you're going to rely on that

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
8254

Ditto, verifier isn't enforcing this so shouldn't assert

rampitec marked 2 inline comments as done.May 16 2022, 2:13 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3130

I do not rely on their order. Not anymore.

rampitec updated this revision to Diff 429847.May 16 2022, 2:17 PM
rampitec marked 2 inline comments as done.

Changed asserts to cannot select.

arsenm accepted this revision.May 17 2022, 9:03 AM
This revision is now accepted and ready to land.May 17 2022, 9:03 AM
This revision was landed with ongoing or failed builds.May 17 2022, 10:32 AM
This revision was automatically updated to reflect the committed changes.