Details
Diff Detail
Event Timeline
It uses a single mem operand with both load and store in addrspace(4). The addrspace(4) is common for all buffer intrinsics memops. In fact neither MemSDNode nor MemIntrinsicSDNode can have 2 mem ops. Only MachineSDNode and final MI can. I certainly do now want to create a MachineSDNode here and duplicate a lot of buffer operations logic. If we believe we really want 2 mem ops these can be split in the FinalizeLowering.
It will also need to be rebased on top of D124550 to handle hazard between M0 initialization and LDS DMA.
Why does this need an intrinsic? I thought the whole point of the LDS DMA thing was an optimization the backend would perform and doesn't need to be exposed directly
We cannot match this pattern. If you look at the addressing mode this is byzantine. Yet another addtid instruction on steroids.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
4289 | I believe to select based on the MMO I would need to write a complex pattern. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
1199–1200 | Changed to return. What do you mean by 'use data type'? |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
1199–1200 | You're looking at the pointer element type instead of the return / data type. i.e. we would have i8/i16/i32 return types and you don't need to look at the pointer | |
1199–1200 | I just noticed there is no return type so this is just introducing a dependency on typed pointers which is a no-go. I don't actually see why can't we match these from the buffer intrinsic plus LDS access? |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
1199–1200 | It does not return anything. This instruction does not have vdata. The only way to know the size is by looking at the overloaded LDS base pointer pointee. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
1199–1200 | LDS address = LDS_base + LDS_offset + inst_offset + (TIDinWave * 4) We do not have TIDinWave, certainly not after selection. Even before selection it is extremely problematic. Why typed pointer is a no-go if that works? |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
1199–1200 | On top of that MEM_ADDR also depends on the TID. It not the same address as a normal buffer_load would use with the same operands. |
llvm/include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
1268 | This should be addrspace 3 pointers only also | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
1199–1200 | Pointee types have been removed from the IR. If this really needs the type it would need to use an attribute on the parameter to carry it which may be new territory |
llvm/include/llvm/IR/IntrinsicsAMDGPU.td | ||
---|---|---|
1268 | It cannot be overloaded on the pointee type, infrastructure limitation. We are using llvm_anyptr_ty everywhere in such context. If in turn we cannot use pointee types at all then this could be non overloaded pointer to void in addrspace 3. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
1199–1200 | It does not really need type but it needs size. I can add immediate to the intrinsic and switch to void* for LDS base. |
To confirm: is that OK to add yet another imm to the end of operands of the intrinsic to select a byte size? And then remove the overload. If yes I will do it tomorrow.
There's the elementtype attribute for this case which some arm intrinsics seem to be using. Not sure how you're supposed to define an intrinsic to use it though
Apparently this isn't well developed but works. The verifier is hardcoding these intrinsics (it's also looking at the call site instead of the intrinsic declaration attributes)
What's wrong with the idea of an i32 imm %size argument? That seems to me more in line with the philosophy of caring less about types.
I'd prefer to keep any intrinsics that look like a load or store to look more like the regular load or store instructions. All of these arbitrary immediate parameters are uglier (e.g. the memory ordering arguments that don't actually work on some of the atomics)
This will need a clang builtin to produce the attribute. I am not sure we really want to expose it as a clang builtin.
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll | ||
---|---|---|
16 | m0 holds the size of LDS, should we save the value of m0 before overwriting it, and write the value back before issuing ds_read? |
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll | ||
---|---|---|
16 | Every user of m0 is supposed to set it itself, and we hopefully clean up the redundant rewrites. It's not something that's generally saved and restored per operation |
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll | ||
---|---|---|
16 | DS_* do not read M0 since gfx9. These intrinsics are only available since gfx9. |
- Removed the overload and added i32 %size operand instead.
- LDS pointer is i8 addrspace(3) now qualified with the address space.
- Rebased on the change to handle hazards between m0 initialization and these operations.
Just to be clear , Is your expectation that intrinsic user to save and restore m0 before calling buffer_load lds intrinsic?
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.lds.ll | ||
---|---|---|
16 | Just to be clear , Is your expectation that intrinsic user to save and restore m0 before calling buffer_load lds intrinsic? |
Do not split voffset because inst_offset is applied to both VMEM and LDS address and voffset is not. Add a separate operand instead.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
4277–4279 | What is preventing this from clobbering M0? There are intrinsics like int_amdgcn_interp* that have a dependency on M0. Shouldn't the code save the existing M0 and restore it after the load? |
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
4277–4279 | m0 isn't treated as a preserved value. Each user is supposed to initialize m0 itself | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
8250 | I don't see how / where this preserves the LDS bit | |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
389 | If you're going to rely on the memory operand, the verifier needs to start enforcing these have one memory operand (well, 2 actually with the same sizes) |
Switched to direct select which allows to use 2 separate memory operands.
The patch now handles both raw and struct intrinsics.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
8250 | It has different number of operands comparing to the SIbuffer_load, so selects into _LDS versions of opcodes. In fact after I have removed offset split because we cannot do it on one pointer only, and dropped multi-dword support I start thinking it might be better to drop SIbuffer_load_lds, patterns, and produce MachineSDNode right here (like in the D125279 for global load), it will not be so much code anymore and I will be able to produce 2 separate memory operands. | |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
389 | On a second thought it is better to just return false here. We cannot have a reasonable pointer here on either side anyway, and in fact even 2 memory operands which it should ideally have should be of a different size for a sub-dword operations. A load can be sub-dword, but the store is always extended to a dword. |
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp | ||
---|---|---|
3080 ↗ | (On Diff #429347) | Should return false, the verifier isn't enforcing this |
3130 ↗ | (On Diff #429347) | The verifier should probably be enforcing MMO ordering if you're going to rely on that |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
8243 | Ditto, verifier isn't enforcing this so shouldn't assert |
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp | ||
---|---|---|
3130 ↗ | (On Diff #429347) | I do not rely on their order. Not anymore. |
This should be addrspace 3 pointers only also