Differential D42885 (ClosedPublic)
[AMDGPU] intrinsics for byte/short load/store
Authored by rtaylor on Feb 3 2018, 10:35 AM.

Summary
Added intrinsics for the instructions:
- buffer_load_ubyte
- buffer_load_ushort
- buffer_store_byte
- buffer_store_short

Added test cases to the existing buffer load/store tests.
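For orientation, here is a minimal IR sketch of how such an intrinsic pair might be exercised in a test. The intrinsic names, the i32 data type, and the operand list (modeled on the existing llvm.amdgcn.buffer.load/store signature of rsrc, vindex, voffset, glc, slc) are assumptions for illustration only, not taken from this diff; the actual return type is debated in the comments below.

```
; Sketch only: assumed intrinsic names and signatures, modeled on the
; existing llvm.amdgcn.buffer.load/store intrinsics.
declare i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32>, i32, i32, i1, i1)
declare void @llvm.amdgcn.buffer.store.byte(i32, <4 x i32>, i32, i32, i1, i1)

define amdgpu_ps void @copy_ubyte(<4 x i32> inreg %rsrc, i32 %voffset) {
  ; Load one byte; buffer_load_ubyte zero-extends it into a 32-bit VGPR.
  %val = call i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  ; Store the low byte of the 32-bit value back through the same descriptor.
  call void @llvm.amdgcn.buffer.store.byte(i32 %val, <4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  ret void
}
```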
Event Timeline

Herald added subscribers: llvm-commits, t-tye, tpr and 6 others. (Feb 3 2018, 10:35 AM)

Comment: I don't think we need intrinsics for these. At most we should add a mangled type to the existing buffer intrinsics.

This revision now requires changes to proceed. (Feb 3 2018, 11:01 AM)

Comment: Matt, the instructions zero-extend the data to i32, so the return type of the int, ushort, and ubyte variants is the same, and overloading would not work.

Comment: Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is of course my motivation for implementing them). As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also, I think changing it would require a disproportionate amount of effort.

Comment: Couldn't we also optimize the loads at least based on used bits, like a normal load?

Comment: Yes, they do, but the intrinsic doesn't need to care.

Comment: If we define an overloaded intrinsic with a return type of i8, and the IR using it wants the value zero-extended to i32, the frontend would then have to emit a separate zext. I guess we could optimize that into the zero-extending instruction during instruction selection, but wouldn't it be better to have the intrinsic match what the ISA instruction does by returning the zero-extended i32?

Comment: Maybe a dumb question, but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt, I believe).

Comment: Well, we can, but using the loads without conversion can be faster; see https://gpuopen.com/gcn-memory-coalescing/.
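To make the return-type question above concrete, here is a sketch of the two IR shapes being weighed: an overloaded intrinsic that returns the narrow type and relies on an explicit zext, versus one that returns the zero-extended i32 directly. Both declarations are hypothetical and shown only to illustrate the difference; neither is asserted to be what the patch implements.

```
; Option A (hypothetical): overloaded intrinsic returning i8; the frontend
; emits a separate zext, which instruction selection would need to fold
; back into the zero-extending buffer_load_ubyte.
declare i8 @llvm.amdgcn.buffer.load.i8(<4 x i32>, i32, i32, i1, i1)

define i32 @load_i8_then_zext(<4 x i32> %rsrc, i32 %voffset) {
  %b = call i8 @llvm.amdgcn.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  %w = zext i8 %b to i32
  ret i32 %w
}

; Option B (hypothetical): the intrinsic itself returns the zero-extended
; i32, matching what the hardware instruction leaves in the VGPR.
declare i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32>, i32, i32, i1, i1)

define i32 @load_ubyte_zext_builtin(<4 x i32> %rsrc, i32 %voffset) {
  %w = call i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  ret i32 %w
}
```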
Comment: Mostly LGTM except some nits. The major one is avoiding the repeated lowering code for each of these cases.

This revision is now accepted and ready to land. (Mar 18 2019, 7:37 AM)
Revision Contents
Diff 187824
include/llvm/IR/IntrinsicsAMDGPU.td
lib/Target/AMDGPU/AMDGPUISelLowering.h
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
lib/Target/AMDGPU/BUFInstructions.td
lib/Target/AMDGPU/SIISelLowering.cpp
lib/Target/AMDGPU/SIInstrInfo.td
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.ll
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.ll
test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll
test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.store.ll
test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.ll
test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.store.ll
Inline comment: float return type doesn't make sense