This is an archive of the discontinued LLVM Phabricator instance.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5660	Should get the alignment from the ABI type alignment? On second thought though, the MMO should already exist at this point so I’m not sure why this is reconstructing one

Using the alignment from the ABI type alignment.

The s_buffer_load intrinsic is not marked with SDNPMemOperand, so I think
that is why we need to create MMO here.
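
For reference, the diff shown further down this page (Diff 228854) passes PowerOf2Ceil(VT.getStoreSize()) as the MMO alignment rather than the raw store size. A plausible reading is that a vec3 of 32-bit elements has a 12-byte store size, which is not a power of two. The sketch below only illustrates that arithmetic. The names powerOf2Ceil, storeSizeBytes and roundedAlignment are hypothetical stand-ins, not LLVM APIs, and the ABI type alignment mentioned in the comment above may give a different value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for LLVM's PowerOf2Ceil: the smallest power of two
// that is >= value. This sketch only handles 1 <= value <= 2^63.
std::uint64_t powerOf2Ceil(std::uint64_t value) {
  assert(value != 0 && value <= (std::uint64_t{1} << 63));
  std::uint64_t result = 1;
  while (result < value)
    result <<= 1;
  return result;
}

// Store size in bytes of a vector of numElts 32-bit elements (i32 or f32).
std::uint64_t storeSizeBytes(unsigned numElts) {
  return std::uint64_t{4} * numElts;
}

// Alignment as computed in this sketch: store size rounded up to a power of two.
std::uint64_t roundedAlignment(unsigned numElts) {
  return powerOf2Ceil(storeSizeBytes(numElts));
}
```

Under these assumptions a vec3 load gets a 16-byte alignment, the same as a vec4 load, while scalar and vec2 loads keep their 4-byte and 8-byte store sizes.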

Harbormaster completed remote builds in B40872: Diff 229028. Nov 13 2019, 1:46 AM

In D70118#1743561, @piotr wrote:

Using the alignment from the ABI type alignment.

The s_buffer_load intrinsic is not marked with SDNPMemOperand, so I think
that is why we need to create MMO here.

It probably should be marked with SDNPMemOperand, and the fact that it's IntrNoMem is another problem that should eventually be solved

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
27–42	Most of these test changes look unrelated?
97–98	There is no load dwordx3, so I'm slightly confused about why you need this, but I would expect this to widen to 4x loads?

piotr marked 2 inline comments as done. Nov 13 2019, 3:27 AM

piotr added inline comments.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
27–42	I added the v3 test which exercises the code I am modifying (divergent index): s_buffer_loadx3_index_divergent. Also added analogous s_buffer_load_index_divergent and s_buffer_loadx2_index_divergent for consistency.
97–98	The big picture is that I am working on cutting down the number of components loaded by the various buffer loads. I have another change in instcombine (soon to be uploaded for review) that trims loads based on the components used. With that patch, a vec3 s_buffer_load crashes in the lowering, so I am adding support for it. It is useful to have s_buffer_load.v3 for the case with a divergent index, where s_buffer_load cannot be used and buffer_load_dword is generated instead. On newer GPUs (VI and later) buffer_load_dwordx3 is present; only on SI do we generate buffer_load_dwordx4 for that case (see the s_buffer_loadx3_index_divergent test). As for whether it is better to split or widen the s_buffer_load (non-divergent index), the advantage of splitting is that the split loads can be merged with an adjacent load more easily, but I do not have a strong opinion on that.

piotr marked an inline comment as done. Nov 15 2019, 12:42 AM

piotr added inline comments.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
97–98	On second thought widening seems to make more sense, will update the patch.

Widening instead of splitting.
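
The two strategies discussed in this thread can be compared with a small model. This is a sketch under stated assumptions, not the lowering code: the buffer is modelled as a plain array of dwords, and loadDwords, loadVec3Split and loadVec3Widened are hypothetical names. Splitting issues a 2-dword load plus a 1-dword load 8 bytes further on (as Diff 228854 on this page does), while widening issues one 4-dword load and discards the last component.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Buffer = std::vector<std::uint32_t>;

// Hypothetical model of a buffer load: read N consecutive dwords starting at
// a dword-aligned byte offset.
template <std::size_t N>
std::array<std::uint32_t, N> loadDwords(const Buffer &buf,
                                        std::size_t byteOffset) {
  assert(byteOffset % 4 == 0 && byteOffset / 4 + N <= buf.size());
  std::array<std::uint32_t, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = buf[byteOffset / 4 + i];
  return out;
}

// Splitting: one 2-dword load, then one 1-dword load at offset + 8 bytes.
std::array<std::uint32_t, 3> loadVec3Split(const Buffer &buf,
                                           std::size_t byteOffset) {
  auto lo = loadDwords<2>(buf, byteOffset);
  auto hi = loadDwords<1>(buf, byteOffset + 8);
  return {lo[0], lo[1], hi[0]};
}

// Widening: one 4-dword load; the fourth component is dropped.
std::array<std::uint32_t, 3> loadVec3Widened(const Buffer &buf,
                                             std::size_t byteOffset) {
  auto wide = loadDwords<4>(buf, byteOffset);
  return {wide[0], wide[1], wide[2]};
}
```

Both functions return the same three components. In this model the widened form reads one extra dword, so the buffer must contain it; the split form issues two loads instead of one.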

Harbormaster completed remote builds in B41016: Diff 229506. Nov 15 2019, 4:30 AM

LGTM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5679	Hardcoding the index type to i32 is fine in target specific code

This revision is now accepted and ready to land. Nov 15 2019, 4:47 AM

Closed by commit rG02419ab5c739: [AMDGPU] Lower llvm.amdgcn.s.buffer.load.v3[i|f]32 (authored by piotr). Nov 15 2019, 6:04 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	49 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll	163 lines

Diff 228854

llvm/lib/Target/AMDGPU/SIISelLowering.cpp


SDValue SITargetLowering::lowerImage(SDValue Op,

return SDValue(NewNode, 0);		return SDValue(NewNode, 0);
}		}

SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc,		SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc,
SDValue Offset, SDValue GLC, SDValue DLC,		SDValue Offset, SDValue GLC, SDValue DLC,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();		MachineFunction &MF = DAG.getMachineFunction();

MachineMemOperand *MMO = MF.getMachineMemOperand(		MachineMemOperand *MMO = MF.getMachineMemOperand(
MachinePointerInfo(),		MachinePointerInfo(),
MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |		MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
MachineMemOperand::MOInvariant,		MachineMemOperand::MOInvariant,
VT.getStoreSize(), VT.getStoreSize());		VT.getStoreSize(), PowerOf2Ceil(VT.getStoreSize()));

if (!Offset->isDivergent()) {		if (!Offset->isDivergent()) {
SDValue Ops[] = {		SDValue Ops[] = {
Rsrc,		Rsrc,
Offset, // Offset		Offset, // Offset
GLC,		GLC,
DLC,		DLC,
};		};

		// Split vec3 load into vec2 and single component loads.
		if (VT.isVector() && VT.getVectorNumElements() == 3) {

		EVT LoVT, HiVT;
		std::tie(LoVT, HiVT) = getSplitDestVTs(VT, DAG);

		auto LoLoad = DAG.getMemIntrinsicNode(
		AMDGPUISD::SBUFFER_LOAD, DL, DAG.getVTList(LoVT), Ops, LoVT, MMO);

		unsigned HiLoadOffset = LoVT.getStoreSize();
		if (auto ConstOffset = dyn_cast<ConstantSDNode>(Offset)) {
		Ops[1] =
		DAG.getTargetConstant(ConstOffset->getSExtValue() + HiLoadOffset,
		DL, Ops[1].getValueType());
		} else {
		auto HiLoadOffsetNode =
		DAG.getConstant(HiLoadOffset, DL, Offset.getValueType());
		Ops[1] = DAG.getNode(ISD::ADD, DL, Offset.getValueType(), Offset,
		HiLoadOffsetNode);
		}

		auto HiLoad = DAG.getMemIntrinsicNode(
		AMDGPUISD::SBUFFER_LOAD, DL, DAG.getVTList(HiVT), Ops, HiVT,
		MF.getMachineMemOperand(MMO, HiLoadOffset, HiVT.getStoreSize()));

		auto IdxTy = getVectorIdxTy(DAG.getDataLayout());
		SDValue Join;
		Join = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, VT, DAG.getUNDEF(VT),
		LoLoad, DAG.getConstant(0, DL, IdxTy));
		Join =
		DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, VT, Join, HiLoad,
		DAG.getConstant(LoVT.getVectorNumElements(), DL, IdxTy));

		return Join;
		}

return DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,		return DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
DAG.getVTList(VT), Ops, VT, MMO);		DAG.getVTList(VT), Ops, VT, MMO);
}		}

// We have a divergent offset. Emit a MUBUF buffer load instead. We can		// We have a divergent offset. Emit a MUBUF buffer load instead. We can
// assume that the buffer is unswizzled.		// assume that the buffer is unswizzled.
SmallVector<SDValue, 4> Loads;		SmallVector<SDValue, 4> Loads;
unsigned NumLoads = 1;		unsigned NumLoads = 1;
MVT LoadVT = VT.getSimpleVT();		MVT LoadVT = VT.getSimpleVT();
unsigned NumElts = LoadVT.isVector() ? LoadVT.getVectorNumElements() : 1;		unsigned NumElts = LoadVT.isVector() ? LoadVT.getVectorNumElements() : 1;
assert((LoadVT.getScalarType() == MVT::i32 ||		assert((LoadVT.getScalarType() == MVT::i32 ||
LoadVT.getScalarType() == MVT::f32) &&		LoadVT.getScalarType() == MVT::f32));
isPowerOf2_32(NumElts));

if (NumElts == 8 || NumElts == 16) {		if (NumElts == 8 || NumElts == 16) {
NumLoads = NumElts == 16 ? 4 : 2;		NumLoads = NumElts / 4;
LoadVT = MVT::v4i32;		LoadVT = MVT::v4i32;
}		}

SDVTList VTList = DAG.getVTList({LoadVT, MVT::Glue});		SDVTList VTList = DAG.getVTList({LoadVT, MVT::Glue});
unsigned CachePolicy = cast<ConstantSDNode>(GLC)->getZExtValue();		unsigned CachePolicy = cast<ConstantSDNode>(GLC)->getZExtValue();
SDValue Ops[] = {		SDValue Ops[] = {
DAG.getEntryNode(), // Chain		DAG.getEntryNode(), // Chain
Rsrc, // rsrc		Rsrc, // rsrc
DAG.getConstant(0, DL, MVT::i32), // vindex		DAG.getConstant(0, DL, MVT::i32), // vindex
{}, // voffset		{}, // voffset
{}, // soffset		{}, // soffset
{}, // offset		{}, // offset
DAG.getTargetConstant(CachePolicy, DL, MVT::i32), // cachepolicy		DAG.getTargetConstant(CachePolicy, DL, MVT::i32), // cachepolicy
DAG.getTargetConstant(0, DL, MVT::i1), // idxen		DAG.getTargetConstant(0, DL, MVT::i1), // idxen
};		};

// Use the alignment to ensure that the required offsets will fit into the		// Use the alignment to ensure that the required offsets will fit into the
// immediate offsets.		// immediate offsets.
setBufferOffsets(Offset, DAG, &Ops[3], NumLoads > 1 ? 16 * NumLoads : 4);		setBufferOffsets(Offset, DAG, &Ops[3], NumLoads > 1 ? 16 * NumLoads : 4);

uint64_t InstOffset = cast<ConstantSDNode>(Ops[5])->getZExtValue();		uint64_t InstOffset = cast<ConstantSDNode>(Ops[5])->getZExtValue();
for (unsigned i = 0; i < NumLoads; ++i) {		for (unsigned i = 0; i < NumLoads; ++i) {
Ops[5] = DAG.getTargetConstant(InstOffset + 16 * i, DL, MVT::i32);		Ops[5] = DAG.getTargetConstant(InstOffset + 16 * i, DL, MVT::i32);
Loads.push_back(DAG.getMemIntrinsicNode(AMDGPUISD::BUFFER_LOAD, DL, VTList,		Loads.push_back(getMemIntrinsicNode(AMDGPUISD::BUFFER_LOAD, DL, VTList, Ops,
Ops, LoadVT, MMO));		LoadVT, MMO, DAG));
}		}

if (VT == MVT::v8i32 || VT == MVT::v16i32)		if (VT == MVT::v8i32 || VT == MVT::v16i32)
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Loads);		return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Loads);

return Loads[0];		return Loads[0];
}		}
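
The divergent-offset path above can be summarised with a small sketch: 8- and 16-element loads are broken into 4-element loads 16 bytes apart, and any other element count is issued as a single load. This only models the arithmetic visible in the diff. LoadPlan and planDivergentLoads are hypothetical names, and instOffset stands for the immediate offset that setBufferOffsets has already produced.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical summary of how one divergent buffer load is broken up.
struct LoadPlan {
  unsigned numLoads;                  // number of buffer loads emitted
  unsigned eltsPerLoad;               // 32-bit elements fetched by each load
  std::vector<std::uint64_t> offsets; // immediate byte offset of each load
};

// Mirrors the NumLoads / InstOffset arithmetic in the diff: 8 and 16 element
// loads become 2 or 4 four-element loads; everything else stays one load.
LoadPlan planDivergentLoads(unsigned numElts, std::uint64_t instOffset) {
  LoadPlan plan{1, numElts, {}};
  if (numElts == 8 || numElts == 16) {
    plan.numLoads = numElts / 4;
    plan.eltsPerLoad = 4;
  }
  for (unsigned i = 0; i < plan.numLoads; ++i)
    plan.offsets.push_back(instOffset + 16 * i);
  return plan;
}
```

With these assumptions a 16-element load at immediate offset 0 yields four loads at offsets 0, 16, 32 and 48, while a 3-element load yields a single 3-element load at the given offset.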


llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll

	;RUN: llc < %s -march=amdgcn -mcpu=tonga -verify-machineinstrs | FileCheck %s			;RUN: llc < %s -march=amdgcn -mcpu=tahiti -verify-machineinstrs | FileCheck %s -check-prefixes=GCN,SI
				;RUN: llc < %s -march=amdgcn -mcpu=tonga -verify-machineinstrs | FileCheck %s -check-prefixes=GCN,VI

	;CHECK-LABEL: {{^}}s_buffer_load_imm:			;GCN-LABEL: {{^}}s_buffer_load_imm:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x4			;SI: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x1
				;VI: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x4
	define amdgpu_ps void @s_buffer_load_imm(<4 x i32> inreg %desc) {			define amdgpu_ps void @s_buffer_load_imm(<4 x i32> inreg %desc) {
	main_body:			main_body:
	%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 4, i32 0)			%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 4, i32 0)
	%bitcast = bitcast i32 %load to float			%bitcast = bitcast i32 %load to float
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_load_index:			;GCN-LABEL: {{^}}s_buffer_load_index:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}			;GCN: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
	define amdgpu_ps void @s_buffer_load_index(<4 x i32> inreg %desc, i32 inreg %index) {			define amdgpu_ps void @s_buffer_load_index(<4 x i32> inreg %desc, i32 inreg %index) {
	main_body:			main_body:
	%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %index, i32 0)			%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %index, i32 0)
	%bitcast = bitcast i32 %load to float			%bitcast = bitcast i32 %load to float
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_loadx2_imm:			;GCN-LABEL: {{^}}s_buffer_load_index_divergent:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x40			;GCN: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
				define amdgpu_ps void @s_buffer_load_index_divergent(<4 x i32> inreg %desc, i32 %index) {
				main_body:
				%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast i32 %load to float
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)
				ret void
				}

				;GCN-LABEL: {{^}}s_buffer_loadx2_imm:
				;GCN-NOT: s_waitcnt;
				;SI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x10
				;VI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x40
	define amdgpu_ps void @s_buffer_loadx2_imm(<4 x i32> inreg %desc) {			define amdgpu_ps void @s_buffer_loadx2_imm(<4 x i32> inreg %desc) {
	main_body:			main_body:
	%load = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %desc, i32 64, i32 0)			%load = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %desc, i32 64, i32 0)
	%bitcast = bitcast <2 x i32> %load to <2 x float>			%bitcast = bitcast <2 x i32> %load to <2 x float>
	%x = extractelement <2 x float> %bitcast, i32 0			%x = extractelement <2 x float> %bitcast, i32 0
	%y = extractelement <2 x float> %bitcast, i32 1			%y = extractelement <2 x float> %bitcast, i32 1
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_loadx2_index:			;GCN-LABEL: {{^}}s_buffer_loadx2_index:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}			;GCN: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
	define amdgpu_ps void @s_buffer_loadx2_index(<4 x i32> inreg %desc, i32 inreg %index) {			define amdgpu_ps void @s_buffer_loadx2_index(<4 x i32> inreg %desc, i32 inreg %index) {
	main_body:			main_body:
	%load = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %desc, i32 %index, i32 0)			%load = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %desc, i32 %index, i32 0)
	%bitcast = bitcast <2 x i32> %load to <2 x float>			%bitcast = bitcast <2 x i32> %load to <2 x float>
	%x = extractelement <2 x float> %bitcast, i32 0			%x = extractelement <2 x float> %bitcast, i32 0
	%y = extractelement <2 x float> %bitcast, i32 1			%y = extractelement <2 x float> %bitcast, i32 1
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_loadx4_imm:			;GCN-LABEL: {{^}}s_buffer_loadx2_index_divergent:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0xc8			;GCN: buffer_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
				define amdgpu_ps void @s_buffer_loadx2_index_divergent(<4 x i32> inreg %desc, i32 %index) {
				main_body:
				%load = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast <2 x i32> %load to <2 x float>
				%x = extractelement <2 x float> %bitcast, i32 0
				%y = extractelement <2 x float> %bitcast, i32 1
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)
				ret void
				}

				;GCN-LABEL: {{^}}s_buffer_loadx3_imm:
				;GCN-NOT: s_waitcnt;
				;SI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x10
				;SI: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x12
				;VI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x40
				;VI: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0x48
				define amdgpu_ps void @s_buffer_loadx3_imm(<4 x i32> inreg %desc) {
				main_body:
				%load = call <3 x i32> @llvm.amdgcn.s.buffer.load.v3i32(<4 x i32> %desc, i32 64, i32 0)
				%bitcast = bitcast <3 x i32> %load to <3 x float>
				%x = extractelement <3 x float> %bitcast, i32 0
				%y = extractelement <3 x float> %bitcast, i32 1
				%z = extractelement <3 x float> %bitcast, i32 2
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float undef, i1 true, i1 true)
				ret void
				}

				;GCN-LABEL: {{^}}s_buffer_loadx3_index:
				;GCN-NOT: s_waitcnt;
				;GCN: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
				;GCN: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
				define amdgpu_ps void @s_buffer_loadx3_index(<4 x i32> inreg %desc, i32 inreg %index) {
				main_body:
				%load = call <3 x i32> @llvm.amdgcn.s.buffer.load.v3i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast <3 x i32> %load to <3 x float>
				%x = extractelement <3 x float> %bitcast, i32 0
				%y = extractelement <3 x float> %bitcast, i32 1
				%z = extractelement <3 x float> %bitcast, i32 2
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float undef, i1 true, i1 true)
				ret void
				}

				;GCN-LABEL: {{^}}s_buffer_loadx3_index_divergent:
				;GCN-NOT: s_waitcnt;
				;SI: buffer_load_dwordx4 v[{{[0-9]+:[0-9]+}}], v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
				;VI: buffer_load_dwordx3 v[{{[0-9]+:[0-9]+}}], v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
				define amdgpu_ps void @s_buffer_loadx3_index_divergent(<4 x i32> inreg %desc, i32 %index) {
				main_body:
				%load = call <3 x i32> @llvm.amdgcn.s.buffer.load.v3i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast <3 x i32> %load to <3 x float>
				%x = extractelement <3 x float> %bitcast, i32 0
				%y = extractelement <3 x float> %bitcast, i32 1
				%z = extractelement <3 x float> %bitcast, i32 2
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float undef, i1 true, i1 true)
				ret void
				}

				;GCN-LABEL: {{^}}s_buffer_loadx4_imm:
				;GCN-NOT: s_waitcnt;
				;SI: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x32
				;VI: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0xc8
	define amdgpu_ps void @s_buffer_loadx4_imm(<4 x i32> inreg %desc) {			define amdgpu_ps void @s_buffer_loadx4_imm(<4 x i32> inreg %desc) {
	main_body:			main_body:
	%load = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %desc, i32 200, i32 0)			%load = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %desc, i32 200, i32 0)
	%bitcast = bitcast <4 x i32> %load to <4 x float>			%bitcast = bitcast <4 x i32> %load to <4 x float>
	%x = extractelement <4 x float> %bitcast, i32 0			%x = extractelement <4 x float> %bitcast, i32 0
	%y = extractelement <4 x float> %bitcast, i32 1			%y = extractelement <4 x float> %bitcast, i32 1
	%z = extractelement <4 x float> %bitcast, i32 2			%z = extractelement <4 x float> %bitcast, i32 2
	%w = extractelement <4 x float> %bitcast, i32 3			%w = extractelement <4 x float> %bitcast, i32 3
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_loadx4_index:			;GCN-LABEL: {{^}}s_buffer_loadx4_index:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}			;GCN: buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
	define amdgpu_ps void @s_buffer_loadx4_index(<4 x i32> inreg %desc, i32 inreg %index) {			define amdgpu_ps void @s_buffer_loadx4_index(<4 x i32> inreg %desc, i32 inreg %index) {
	main_body:			main_body:
	%load = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %desc, i32 %index, i32 0)			%load = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %desc, i32 %index, i32 0)
	%bitcast = bitcast <4 x i32> %load to <4 x float>			%bitcast = bitcast <4 x i32> %load to <4 x float>
	%x = extractelement <4 x float> %bitcast, i32 0			%x = extractelement <4 x float> %bitcast, i32 0
	%y = extractelement <4 x float> %bitcast, i32 1			%y = extractelement <4 x float> %bitcast, i32 1
	%z = extractelement <4 x float> %bitcast, i32 2			%z = extractelement <4 x float> %bitcast, i32 2
	%w = extractelement <4 x float> %bitcast, i32 3			%w = extractelement <4 x float> %bitcast, i32 3
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_load_imm_mergex2:			;GCN-LABEL: {{^}}s_buffer_loadx4_index_divergent:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x4			;GCN: buffer_load_dwordx4 v[{{[0-9]+:[0-9]+}}], v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
				define amdgpu_ps void @s_buffer_loadx4_index_divergent(<4 x i32> inreg %desc, i32 %index) {
				main_body:
				%load = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %desc, i32 %index, i32 0)
				%bitcast = bitcast <4 x i32> %load to <4 x float>
				%x = extractelement <4 x float> %bitcast, i32 0
				%y = extractelement <4 x float> %bitcast, i32 1
				%z = extractelement <4 x float> %bitcast, i32 2
				%w = extractelement <4 x float> %bitcast, i32 3
				call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
				ret void
				}

				;GCN-LABEL: {{^}}s_buffer_load_imm_mergex2:
				;GCN-NOT: s_waitcnt;
				;SI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x1
				;VI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x4
	define amdgpu_ps void @s_buffer_load_imm_mergex2(<4 x i32> inreg %desc) {			define amdgpu_ps void @s_buffer_load_imm_mergex2(<4 x i32> inreg %desc) {
	main_body:			main_body:
	%load0 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 4, i32 0)			%load0 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 4, i32 0)
	%load1 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 8, i32 0)			%load1 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 8, i32 0)
	%x = bitcast i32 %load0 to float			%x = bitcast i32 %load0 to float
	%y = bitcast i32 %load1 to float			%y = bitcast i32 %load1 to float
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float undef, float undef, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_load_imm_mergex4:			;GCN-LABEL: {{^}}s_buffer_load_imm_mergex4:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x8			;SI: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x2
				;VI: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], 0x8
	define amdgpu_ps void @s_buffer_load_imm_mergex4(<4 x i32> inreg %desc) {			define amdgpu_ps void @s_buffer_load_imm_mergex4(<4 x i32> inreg %desc) {
	main_body:			main_body:
	%load0 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 8, i32 0)			%load0 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 8, i32 0)
	%load1 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 12, i32 0)			%load1 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 12, i32 0)
	%load2 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 16, i32 0)			%load2 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 16, i32 0)
	%load3 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 20, i32 0)			%load3 = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 20, i32 0)
	%x = bitcast i32 %load0 to float			%x = bitcast i32 %load0 to float
	%y = bitcast i32 %load1 to float			%y = bitcast i32 %load1 to float
	%z = bitcast i32 %load2 to float			%z = bitcast i32 %load2 to float
	%w = bitcast i32 %load3 to float			%w = bitcast i32 %load3 to float
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %x, float %y, float %z, float %w, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_load_index_across_bb:			;GCN-LABEL: {{^}}s_buffer_load_index_across_bb:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: v_or_b32			;GCN: v_or_b32
	;CHECK: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen			;GCN: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
	define amdgpu_ps void @s_buffer_load_index_across_bb(<4 x i32> inreg %desc, i32 %index) {			define amdgpu_ps void @s_buffer_load_index_across_bb(<4 x i32> inreg %desc, i32 %index) {
	main_body:			main_body:
	%tmp = shl i32 %index, 4			%tmp = shl i32 %index, 4
	br label %bb1			br label %bb1

	bb1: ; preds = %main_body			bb1: ; preds = %main_body
	%tmp1 = or i32 %tmp, 8			%tmp1 = or i32 %tmp, 8
	%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %tmp1, i32 0)			%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %tmp1, i32 0)
	%bitcast = bitcast i32 %load to float			%bitcast = bitcast i32 %load to float
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float undef, float undef, float undef, i1 true, i1 true)
	ret void			ret void
	}			}

	;CHECK-LABEL: {{^}}s_buffer_load_index_across_bb_merged:			;GCN-LABEL: {{^}}s_buffer_load_index_across_bb_merged:
	;CHECK-NOT: s_waitcnt;			;GCN-NOT: s_waitcnt;
	;CHECK: v_or_b32			;GCN: v_or_b32
	;CHECK: v_or_b32			;GCN: v_or_b32
	;CHECK: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen			;GCN: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
	;CHECK: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen			;GCN: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
	define amdgpu_ps void @s_buffer_load_index_across_bb_merged(<4 x i32> inreg %desc, i32 %index) {			define amdgpu_ps void @s_buffer_load_index_across_bb_merged(<4 x i32> inreg %desc, i32 %index) {
	main_body:			main_body:
	%tmp = shl i32 %index, 4			%tmp = shl i32 %index, 4
	br label %bb1			br label %bb1

	bb1: ; preds = %main_body			bb1: ; preds = %main_body
	%tmp1 = or i32 %tmp, 8			%tmp1 = or i32 %tmp, 8
	%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %tmp1, i32 0)			%load = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %tmp1, i32 0)
	%tmp2 = or i32 %tmp1, 4			%tmp2 = or i32 %tmp1, 4
	%load2 = tail call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %tmp2, i32 0)			%load2 = tail call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %desc, i32 %tmp2, i32 0)
	%bitcast = bitcast i32 %load to float			%bitcast = bitcast i32 %load to float
	%bitcast2 = bitcast i32 %load2 to float			%bitcast2 = bitcast i32 %load2 to float
	call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float %bitcast2, float undef, float undef, i1 true, i1 true)			call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %bitcast, float %bitcast2, float undef, float undef, i1 true, i1 true)
	ret void			ret void
	}			}

	declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1)			declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1)
	declare i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32>, i32, i32)			declare i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32>, i32, i32)
	declare <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32>, i32, i32)			declare <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32>, i32, i32)
				declare <3 x i32> @llvm.amdgcn.s.buffer.load.v3i32(<4 x i32>, i32, i32)
	declare <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32>, i32, i32)			declare <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32>, i32, i32)
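
The SI and VI check lines in the test above differ only in the immediate offset: a byte offset of 4 is checked as 0x1 on SI and 0x4 on VI, 64 as 0x10 and 0x40, and 200 as 0x32 and 0xc8. This matches the SI scalar-load immediate being counted in dwords and the VI one in bytes. The sketch below reproduces that relationship; Generation and encodeImmOffset are hypothetical names used only for illustration, and the rule is an assumption inferred from the check lines.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tag for the two -mcpu targets used by the RUN lines
// (tahiti for the SI prefix, tonga for the VI prefix).
enum class Generation { SI, VI };

// Assumption drawn from the check lines: SI encodes the immediate offset in
// dwords, VI encodes it in bytes. The byte offset must be dword-aligned.
std::uint32_t encodeImmOffset(Generation gen, std::uint32_t byteOffset) {
  assert(byteOffset % 4 == 0);
  return gen == Generation::SI ? byteOffset / 4 : byteOffset;
}
```

The same rule accounts for the split vec3 test: the second load sits at byte offset 64 + 8 = 72, which is 0x12 in dwords for SI and 0x48 in bytes for VI.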


[AMDGPU] Lower llvm.amdgcn.s.buffer.load.v3[i|f]32 (Closed, Public)
