This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] intrinsics for byte/short load/store
ClosedPublic

Authored by rtaylor on Feb 3 2018, 10:35 AM.

Details

Reviewers
mareko
arsenm
nhaehnle
timcorringham
Group Reviewers
Restricted Project
Summary

Added intrinsics for the instructions:

  • buffer_load_ubyte
  • buffer_load_ushort
  • buffer_store_byte
  • buffer_store_short

Added test cases to existing buffer load/store tests.
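
For reference, a minimal IR sketch of how these operations end up being exercised, assuming the type-mangled raw buffer overloads that the review below converges on (function names here are illustrative):

declare i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32>, i32, i32, i32)
declare void @llvm.amdgcn.raw.buffer.store.i8(i8, <4 x i32>, i32, i32, i32)

; the zext is expected to fold into a single buffer_load_ubyte
define amdgpu_ps float @load_ubyte(<4 x i32> inreg %rsrc) {
  %b = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  %ext = zext i8 %b to i32
  %f = uitofp i32 %ext to float
  ret float %f
}

; the truncating store selects to buffer_store_byte
define amdgpu_ps void @store_byte(<4 x i32> inreg %rsrc, i32 %v) {
  %b = trunc i32 %v to i8
  call void @llvm.amdgcn.raw.buffer.store.i8(i8 %b, <4 x i32> %rsrc, i32 0, i32 0, i32 0)
  ret void
}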

Diff Detail

Event Timeline

timcorringham created this revision. Feb 3 2018, 10:35 AM
timcorringham added a reviewer: Restricted Project. Feb 3 2018, 10:44 AM
arsenm requested changes to this revision. Feb 3 2018, 11:01 AM

I don't think we need intrinsics for these. At most we should add a mangled type to the existing buffer intrinsics.

include/llvm/IR/IntrinsicsAMDGPU.td
494

float return type doesn't make sense

This revision now requires changes to proceed. Feb 3 2018, 11:01 AM
tpr added a comment. Feb 5 2018, 1:03 AM

Matt, the instructions zero-extend the data to i32, so the return types of the int, ushort, and ubyte variants are the same, and overloading would not work.

Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is, of course, my motivation for implementing them).

As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also, I think changing it would require a disproportionate amount of effort.
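
To illustrate the overloading problem (a sketch, assuming the legacy buffer.load signature): if every variant returned the zero-extended i32, the overload mangling would give the dword, ubyte, and ushort loads the same name, and the element size would be unrecoverable:

; dword, ubyte, and ushort loads would all have to share this declaration:
declare i32 @llvm.amdgcn.buffer.load.i32(<4 x i32>, i32, i32, i1, i1)
; nothing in the overloaded type says which instruction is meant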

mareko added a comment. Feb 5 2018, 7:18 AM

It should be easy to change the return type to i32.

mareko added a comment. Feb 5 2018, 7:18 AM

And, of course, the vdata type.

arsenm added a comment. Feb 5 2018, 8:04 AM

Couldn't we also optimize the loads at least based on used bits like a normal load?

> Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is, of course, my motivation for implementing them).

> As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also, I think changing it would require a disproportionate amount of effort.

Yes, they do, but the intrinsic doesn't need to care.

tpr added a comment. Feb 6 2018, 2:19 PM

If we define an overloaded intrinsic with a return type of i8, and the IR using it wants the value zero-extended to i32, the frontend would then have to emit a separate zext. I guess we could optimize that to the zero-extending instruction during instruction selection, but wouldn't it be better to have the intrinsic match what the ISA instruction does by returning the zero-extended i32?
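
Concretely, a sketch assuming the type-mangled i8 overload:

%load = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
%val = zext i8 %load to i32   ; separate zext that selection must fold back into buffer_load_ubyte

whereas an i32-returning intrinsic would already match what the ISA instruction produces.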

Maybe a dumb question - but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt I believe).

> Maybe a dumb question - but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt I believe).

Well, we can, but using the loads without conversion can be faster. See https://gpuopen.com/gcn-memory-coalescing/:

  • 32-bit (or smaller) single-channel buffer loads / wave = 4 clocks (under specific cases)
  • 32-bit (or smaller) filtered texels / wave = 16 clocks
rtaylor updated this revision to Diff 187824. Feb 21 2019, 10:53 AM
rtaylor added a subscriber: rtaylor.

Updating to include requested changes

Herald added a project: Restricted Project. Feb 21 2019, 10:53 AM
arsenm added inline comments. Feb 21 2019, 5:15 PM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

You shouldn't be inspecting the users. You can just unconditionally use one or the other. You're going to have to insert a truncate back to the original type at the end anyway. You can then add a separate optimization to fold the sext_inreg or mask into the buffer operation, as is done for normal loads.
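
A sketch of the suggested two-step approach (the machine opcode names in the comments are illustrative):

%load = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
%sext = sext i8 %load to i32
; step 1: lowering unconditionally emits the unsigned form plus a truncate:
;   t0 = BUFFER_LOAD_UBYTE ...   ; i32 result, zero-extended by the hardware
;   t1 = truncate t0 to i8
; step 2: the sign-extending use legalizes to a sext_inreg, and a separate
; DAG combine folds it into the load:
;   (sext_inreg (BUFFER_LOAD_UBYTE ...), i8) -> (BUFFER_LOAD_SBYTE ...)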

rtaylor added inline comments. Feb 22 2019, 7:30 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

There are four potential options, so what do you mean by one or the other? There is BUFFER_LOAD_ubyte/ushort/short/byte for the Opc.

arsenm added inline comments. Feb 22 2019, 7:49 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

You can just unconditionally use load_ubyte/load_ushort. Folding the sign extend in is then a separate optimization on a sext (or more likely a sext_inreg)

rtaylor added inline comments. Feb 22 2019, 8:02 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

So outputting byte/short based on sign_extend in the TableGen pattern? This won't allow re-use of the existing multiclass without changes.

I think I remember there being a reason Nicolai and I decided not to do this.

arsenm added inline comments. Feb 26 2019, 7:56 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4944–4952

This doesn't change the selection. This is an optimization done in the DAGCombiner

rtaylor updated this revision to Diff 188597. Feb 27 2019, 12:09 PM

Requested changes

rtaylor updated this revision to Diff 188598. Feb 27 2019, 12:17 PM

Rename function to better reflect what it does

arsenm added inline comments. Mar 5 2019, 8:08 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4935

Capitalize

4936–4937

This looks identical to the other part, which is kind of surprising to me, but it should be factored into something common

4937

Repeated again

4938–4942

Ternary operator

4939

This comment can be removed

6349

Formatting

6352

Should have a hasOneUse check

6353

Leftover debugging

6354

This is missing a check on the source type. If you want to be fancier, you can split out the remainder bits into a new sign extend, but there probably isn't much reason to.

6359–6361

"will be set by" part doesn't make sense here

test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll
275 (On Diff #188598)

The base test case shouldn't have an extend of the use; it should use the value directly. You should also have one with an explicit zext.
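
For instance, a sketch of the two shapes being asked for (function names are illustrative; CHECK lines omitted):

; base case: the loaded i8 is used directly, with no extension
define amdgpu_ps void @raw_buffer_load_i8(<4 x i32> inreg %rsrc) {
  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  call void @llvm.amdgcn.raw.buffer.store.i8(i8 %val, <4 x i32> %rsrc, i32 0, i32 0, i32 0)
  ret void
}

; explicit zext: should still select a single buffer_load_ubyte
define amdgpu_ps float @raw_buffer_load_i8_zext(<4 x i32> inreg %rsrc) {
  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  %ext = zext i8 %val to i32
  %f = uitofp i32 %ext to float
  ret float %f
}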

307 (On Diff #188598)

Could use a testcase with a second non-extended use
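
Something along these lines (illustrative):

; %val has a second, non-extended use; the combine should not duplicate the
; load, so expect one buffer_load_ubyte plus a separate sign-extension
define amdgpu_ps float @raw_buffer_load_i8_sext_multi_use(<4 x i32> inreg %rsrc) {
  %val = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
  call void @llvm.amdgcn.raw.buffer.store.i8(i8 %val, <4 x i32> %rsrc, i32 0, i32 0, i32 0)
  %ext = sext i8 %val to i32
  %f = sitofp i32 %ext to float
  ret float %f
}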

rtaylor added inline comments. Mar 5 2019, 8:44 AM
lib/Target/AMDGPU/SIISelLowering.cpp
4936–4937

Do you mean that there is similar/identical code in the different cases of buffer_load (i.e. struct, raw, normal), and that these should be factored into something common (i.e. a function call)?

6352

You mean there should be a hasOneUse check on the SIGN_EXTEND_INREG, right?

6354

Src is the BUFFER_LOAD_XXX. The only way this code is executed is if the Src is a BUFFER_LOAD_XXX. I'm not sure we need a redundant check here, do we?

arsenm added inline comments. Mar 5 2019, 9:02 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6352

No, the buffer operation. If there are multiple uses, you will end up creating multiple loads.
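
A sketch of the hazard (illustrative fragment):

%load = call i8 @llvm.amdgcn.raw.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 0, i32 0)
%sext = sext i8 %load to i32   ; the combine wants to rewrite this into buffer_load_sbyte
%zext = zext i8 %load to i32   ; but this use still needs the zero-extending load
; without a hasOneUse() check on the buffer operation, the rewrite leaves
; each use with its own load, duplicating the memory access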

6354

The number of bits in the sext_inreg may not match the load's from-8/16-bit source. You can test this with something like:
%load = call i8 @llvm.amdgcn.buffer.load.i8(...)
%ext = zext i8 %load to i32
%shl = shl i32 %ext, 27
%shr = ashr i32 %shl, 27

More shifts will be needed to clear the extra bits in the loaded value.

rtaylor added inline comments. Mar 11 2019, 10:05 AM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

This should produce a buffer_load_sbyte, right? That is what it does currently.

arsenm added inline comments. Mar 11 2019, 1:50 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

But it needs additional shifts even after that. Right now you're not clearing the extra bits in the low 8 that need to be cleared.

arsenm added inline comments. Mar 11 2019, 1:53 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

You can either leave it as
x = buffer_load_ubyte
sext_inreg x, i27

or
x = buffer_load_sbyte
sext_inreg x, i5

I'm not sure there's much practical difference between them

rtaylor added inline comments. Mar 11 2019, 1:57 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

Right, I don't think there is; I'm working on doing the former. Thanks.

arsenm added inline comments. Mar 11 2019, 2:04 PM
lib/Target/AMDGPU/SIISelLowering.cpp
6354

There might be a small advantage to canonicalizing the sext_inreg sizes. The constant masks that will be emitted for BFI might be more likely to be reusable.

rtaylor commandeered this revision. Mar 12 2019, 9:50 AM
rtaylor added a reviewer: timcorringham.

Changing ownership

rtaylor updated this revision to Diff 190286. Mar 12 2019, 9:50 AM

Add requested changes

Mostly LGTM except for some nits. The major one is avoiding the repeated lowering code for each of these cases.

lib/Target/AMDGPU/SIISelLowering.cpp
4934

Capitalize

4936

Capitalize

4936–4937

Yes

4941

You can just hardcode this to MVT::Other

4941–4944

Ternary operator

5327–5330

Ternary operator

5341

Capitalize

5347–5350

Ternary operator

6356

Extra space before ==

6358

Extra space before ==

rtaylor updated this revision to Diff 190349. Mar 12 2019, 3:29 PM

Requested Changes

arsenm accepted this revision. Mar 18 2019, 7:37 AM

LGTM except formatting

lib/Target/AMDGPU/SIISelLowering.cpp
5349

Brace placement

5368

Brace placement

This revision is now accepted and ready to land. Mar 18 2019, 7:37 AM
rtaylor closed this revision. Mar 20 2019, 7:11 AM