Differential D42885 (ClosedPublic)
[AMDGPU] intrinsics for byte/short load/store
Authored by rtaylor on Feb 3 2018, 10:35 AM.

Summary
Added intrinsics for the instructions:
- buffer_load_ubyte
- buffer_load_ushort
- buffer_store_byte
- buffer_store_short

Added test cases to the existing buffer load/store tests.
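For orientation, here is a minimal IR sketch of how such an intrinsic pair might be exercised in a test. The intrinsic names, the i32 data type, and the operand list (modeled on the existing llvm.amdgcn.buffer.load/store signature of rsrc, vindex, voffset, glc, slc) are assumptions for illustration only, not taken from this diff; the actual return type is debated in the comments below.

```
; Sketch only: assumed intrinsic names and signatures, modeled on the
; existing llvm.amdgcn.buffer.load/store intrinsics.
declare i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32>, i32, i32, i1, i1)
declare void @llvm.amdgcn.buffer.store.byte(i32, <4 x i32>, i32, i32, i1, i1)

define amdgpu_ps void @copy_ubyte(<4 x i32> inreg %rsrc, i32 %voffset) {
  ; Load one byte; buffer_load_ubyte zero-extends it into a 32-bit VGPR.
  %val = call i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  ; Store the low byte of the 32-bit value back through the same descriptor.
  call void @llvm.amdgcn.buffer.store.byte(i32 %val, <4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  ret void
}
```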
Event Timeline

Herald added subscribers: llvm-commits, t-tye, tpr and 6 others. (Feb 3 2018, 10:35 AM)

Comment: I don't think we need intrinsics for these. At most we should add a mangled type to the existing buffer intrinsics.

This revision now requires changes to proceed. (Feb 3 2018, 11:01 AM)

Comment: Matt, the instructions zero-extend the data to i32, so the return type of the int, ushort, and ubyte variants is the same, and overloading would not work.

Comment: Matt, we do actually need these intrinsics, as we have an urgent requirement for them in Open Vulkan (which is of course my motivation for implementing them). As Tim commented, the load ubyte and load short instructions extend to 32 bits. While float is a little odd, it does match the behavior of the other buffer_load instructions. Also, I think changing it would require a disproportionate amount of effort.

Comment: Couldn't we also optimize the loads at least based on used bits, like a normal load?

Comment: Yes, they do, but the intrinsic doesn't need to care.

Comment: If we define an overloaded intrinsic with a return type of i8, and the IR using it wants the value zero-extended to i32, the frontend would then have to emit a separate zext. I guess we could optimize that into the zero-extending instruction during instruction selection, but wouldn't it be better to have the intrinsic match what the ISA instruction does by returning the zero-extended i32?

Comment: Maybe a dumb question, but why can't we just use the tbuffer load/store instead of these? It already upcasts for you (the zext/sext is built in depending on the nfmt, I believe).

Comment: Well, we can, but using the loads without conversion can be faster; see https://gpuopen.com/gcn-memory-coalescing/.
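To make the return-type question above concrete, here is a sketch of the two IR shapes being weighed: an overloaded intrinsic that returns the narrow type and relies on an explicit zext, versus one that returns the zero-extended i32 directly. Both declarations are hypothetical and shown only to illustrate the difference; neither is asserted to be what the patch implements.

```
; Option A (hypothetical): overloaded intrinsic returning i8; the frontend
; emits a separate zext, which instruction selection would need to fold
; back into the zero-extending buffer_load_ubyte.
declare i8 @llvm.amdgcn.buffer.load.i8(<4 x i32>, i32, i32, i1, i1)

define i32 @load_i8_then_zext(<4 x i32> %rsrc, i32 %voffset) {
  %b = call i8 @llvm.amdgcn.buffer.load.i8(<4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  %w = zext i8 %b to i32
  ret i32 %w
}

; Option B (hypothetical): the intrinsic itself returns the zero-extended
; i32, matching what the hardware instruction leaves in the VGPR.
declare i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32>, i32, i32, i1, i1)

define i32 @load_ubyte_zext_builtin(<4 x i32> %rsrc, i32 %voffset) {
  %w = call i32 @llvm.amdgcn.buffer.load.ubyte(<4 x i32> %rsrc, i32 0, i32 %voffset, i1 false, i1 false)
  ret i32 %w
}
```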
Comment: Mostly LGTM except some nits. The major one is avoiding the repeated lowering code for each of these cases.

This revision is now accepted and ready to land. (Mar 18 2019, 7:37 AM)
Revision Contents
Diff 187824
include/llvm/IR/IntrinsicsAMDGPU.td
lib/Target/AMDGPU/AMDGPUISelLowering.h
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
lib/Target/AMDGPU/BUFInstructions.td
lib/Target/AMDGPU/SIISelLowering.cpp
lib/Target/AMDGPU/SIInstrInfo.td
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.ll
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.store.ll
test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.load.ll
test/CodeGen/AMDGPU/llvm.amdgcn.raw.buffer.store.ll
test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.ll
test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.store.ll
Inline comment: float return type doesn't make sense