This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add support for multi-dword s.buffer.load intrinsic
ClosedPublic

Authored by tpr on Aug 22 2018, 5:37 AM.

Details

Reviewers

mareko
nhaehnle
arsenm

Commits

rG904343f879b3: [AMDGPU] Add support for multi-dword s.buffer.load intrinsic
rL340684: [AMDGPU] Add support for multi-dword s.buffer.load intrinsic

Summary

Patch by Marek Olsen and David Stuttard, both of AMD.

This adds a new amdgcn intrinsic for s.buffer.load, including the
multi-dword variants, which are convenient to use from some front-end
implementations.

The existing llvm.SI.load.const intrinsic is also modified so that both
intrinsics share the underlying implementation.

This change also requires correct lowering to non-uniform loads: dword
variants larger than the non-uniform versions of the load support are split
into supported sizes.
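
The splitting described above can be sketched in isolation. The following is a minimal standalone illustration, not code from the patch: `LoadChunk` and `splitWideLoad` are hypothetical names, and the sketch assumes, as the SIInstrInfo.cpp change in this revision does, that 8- and 16-dword loads are broken into 4-dword pieces placed 16 bytes apart.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type, not part of the patch: one 4-dword piece of a
// wide scalar buffer load after it has been split.
struct LoadChunk {
  uint32_t ByteOffset; // extra byte offset for this piece (mirrors `i << 4`)
  unsigned FirstDword; // index of the first result dword this piece fills
};

// Split an 8- or 16-dword load into 4-dword chunks of 16 bytes each.
std::vector<LoadChunk> splitWideLoad(unsigned NumDwords) {
  assert((NumDwords == 8 || NumDwords == 16) && "only x8 and x16 are split");
  std::vector<LoadChunk> Chunks;
  for (unsigned I = 0; I < NumDwords / 4; ++I)
    Chunks.push_back({I << 4, I * 4});
  return Chunks;
}
```

Under these assumptions an 8-dword load yields two chunks at byte offsets 0 and 16, and a 16-dword load yields four, at 0, 16, 32 and 48.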

Change-Id: I83a6e00681158bb243591a94a51c7baa445f169b

Diff Detail

Repository: rL LLVM

Event Timeline

tpr created this revision. Aug 22 2018, 5:37 AM

Herald added subscribers: llvm-commits, t-tye, dstuttard and 6 others. Aug 22 2018, 5:37 AM

Harbormaster completed remote builds in B21776: Diff 161929. Aug 22 2018, 5:37 AM

tpr added reviewers: mareko, nhaehnle, arsenm. Aug 22 2018, 5:40 AM

arsenm added inline comments. Aug 22 2018, 10:29 AM

lib/Target/AMDGPU/SIInstrInfo.cpp
4492 ↗	(On Diff #161929)	const reference
4533 ↗	(On Diff #161929)	I think value copies of MachineOperand don't behave like you would expect, so I usually avoid them

V2: Addressed minor review comments.

Harbormaster completed remote builds in B21802: Diff 162015. Aug 22 2018, 11:54 AM

Marek, I will correct the spelling of your name in the commit message when I land this. :-)

In D51098#1209699, @tpr wrote:

Marek, I will correct the spelling of your name in the commit message when I land this. :-)

Thanks. I was wondering who Marek Olsen was. I'd never heard of that guy. :)

nhaehnle added inline comments. Aug 23 2018, 1:16 AM

include/llvm/IR/IntrinsicsAMDGPU.td
809 ↗	(On Diff #162015)	Can we make this consistent with the new (vector) buffer intrinsics?
lib/Target/AMDGPU/SIISelLowering.cpp
4947–4951 ↗	(On Diff #162015)	clang-format?

V3: i1 glc is now i32 cachepolicy for consistency with buffer and
tbuffer intrinsics, plus fixed formatting issue.
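
The V3 change can be pictured with a small sketch. It assumes only what the intrinsic definition in this revision states, namely that bit 0 of the i32 cachepolicy operand is glc. The names `CachePolicyGlcBit` and `isGlcSet` are hypothetical and do not appear in the patch, and nothing is assumed about any other bits.

```cpp
#include <cstdint>

// Hypothetical name: bit 0 of the cachepolicy operand carries glc.
constexpr uint32_t CachePolicyGlcBit = 1u << 0;

// Returns true when the glc bit is set in a cachepolicy value.
// Other bits are ignored because this sketch makes no assumption about them.
constexpr bool isGlcSet(uint32_t CachePolicy) {
  return (CachePolicy & CachePolicyGlcBit) != 0;
}
```

With this reading, a cachepolicy of 1 requests glc and 0 does not, which matches the `Cache & 1` extraction in the SIISelLowering.cpp change.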

Harbormaster completed remote builds in B21844: Diff 162239. Aug 23 2018, 11:34 AM

Thanks. I don't see a test that actually sets glc, please add one before committing.

Apart from that, LGTM.

This revision is now accepted and ready to land. Aug 24 2018, 4:08 AM

Thanks, will do.

Closed by commit rL340684: [AMDGPU] Add support for multi-dword s.buffer.load intrinsic (authored by tpr). Aug 25 2018, 7:54 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/include/llvm/IR/IntrinsicsAMDGPU.td	8 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUISelLowering.h	1 line
llvm/trunk/lib/Target/AMDGPU/AMDGPUISelLowering.cpp	1 line
[path missing]	26 lines
[path missing]	2 lines
[path missing]	99 lines
[path missing]	6 lines
[path missing]	48 lines
llvm/trunk/test/CodeGen/AMDGPU/smrd.ll	196 lines
llvm/trunk/test/Transforms/EarlyCSE/intrinsics.ll	36 lines

Diff 162548

llvm/trunk/include/llvm/IR/IntrinsicsAMDGPU.td

[...]	class AMDGPUBufferLoad : Intrinsic <
llvm_i32_ty, // offset(SGPR/VGPR/imm)		llvm_i32_ty, // offset(SGPR/VGPR/imm)
llvm_i1_ty, // glc(imm)		llvm_i1_ty, // glc(imm)
llvm_i1_ty], // slc(imm)		llvm_i1_ty], // slc(imm)
[IntrReadMem], "", [SDNPMemOperand]>,		[IntrReadMem], "", [SDNPMemOperand]>,
AMDGPURsrcIntrinsic<0>;		AMDGPURsrcIntrinsic<0>;
def int_amdgcn_buffer_load_format : AMDGPUBufferLoad;		def int_amdgcn_buffer_load_format : AMDGPUBufferLoad;
def int_amdgcn_buffer_load : AMDGPUBufferLoad;		def int_amdgcn_buffer_load : AMDGPUBufferLoad;

		def int_amdgcn_s_buffer_load : Intrinsic <
		[llvm_anyint_ty],
		[llvm_v4i32_ty, // rsrc(SGPR)
		llvm_i32_ty, // byte offset(SGPR/VGPR/imm)
		llvm_i32_ty], // cachepolicy(imm; bit 0 = glc)
		[IntrNoMem]>,
		AMDGPURsrcIntrinsic<0>;

class AMDGPUBufferStore : Intrinsic <		class AMDGPUBufferStore : Intrinsic <
[],		[],
[llvm_anyfloat_ty, // vdata(VGPR) -- can currently only select f32, v2f32, v4f32		[llvm_anyfloat_ty, // vdata(VGPR) -- can currently only select f32, v2f32, v4f32
llvm_v4i32_ty, // rsrc(SGPR)		llvm_v4i32_ty, // rsrc(SGPR)
llvm_i32_ty, // vindex(VGPR)		llvm_i32_ty, // vindex(VGPR)
llvm_i32_ty, // offset(SGPR/VGPR/imm)		llvm_i32_ty, // offset(SGPR/VGPR/imm)
llvm_i1_ty, // glc(imm)		llvm_i1_ty, // glc(imm)
llvm_i1_ty], // slc(imm)		llvm_i1_ty], // slc(imm)

llvm/trunk/lib/Target/AMDGPU/AMDGPUISelLowering.h

[...]	enum NodeType : unsigned {
ATOMIC_INC,		ATOMIC_INC,
ATOMIC_DEC,		ATOMIC_DEC,
ATOMIC_LOAD_FADD,		ATOMIC_LOAD_FADD,
ATOMIC_LOAD_FMIN,		ATOMIC_LOAD_FMIN,
ATOMIC_LOAD_FMAX,		ATOMIC_LOAD_FMAX,
BUFFER_LOAD,		BUFFER_LOAD,
BUFFER_LOAD_FORMAT,		BUFFER_LOAD_FORMAT,
BUFFER_LOAD_FORMAT_D16,		BUFFER_LOAD_FORMAT_D16,
		SBUFFER_LOAD,
BUFFER_STORE,		BUFFER_STORE,
BUFFER_STORE_FORMAT,		BUFFER_STORE_FORMAT,
BUFFER_STORE_FORMAT_D16,		BUFFER_STORE_FORMAT_D16,
BUFFER_ATOMIC_SWAP,		BUFFER_ATOMIC_SWAP,
BUFFER_ATOMIC_ADD,		BUFFER_ATOMIC_ADD,
BUFFER_ATOMIC_SUB,		BUFFER_ATOMIC_SUB,
BUFFER_ATOMIC_SMIN,		BUFFER_ATOMIC_SMIN,
BUFFER_ATOMIC_UMIN,		BUFFER_ATOMIC_UMIN,

llvm/trunk/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

[...]	const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
NODE_NAME_CASE(ATOMIC_INC)		NODE_NAME_CASE(ATOMIC_INC)
NODE_NAME_CASE(ATOMIC_DEC)		NODE_NAME_CASE(ATOMIC_DEC)
NODE_NAME_CASE(ATOMIC_LOAD_FADD)		NODE_NAME_CASE(ATOMIC_LOAD_FADD)
NODE_NAME_CASE(ATOMIC_LOAD_FMIN)		NODE_NAME_CASE(ATOMIC_LOAD_FMIN)
NODE_NAME_CASE(ATOMIC_LOAD_FMAX)		NODE_NAME_CASE(ATOMIC_LOAD_FMAX)
NODE_NAME_CASE(BUFFER_LOAD)		NODE_NAME_CASE(BUFFER_LOAD)
NODE_NAME_CASE(BUFFER_LOAD_FORMAT)		NODE_NAME_CASE(BUFFER_LOAD_FORMAT)
NODE_NAME_CASE(BUFFER_LOAD_FORMAT_D16)		NODE_NAME_CASE(BUFFER_LOAD_FORMAT_D16)
		NODE_NAME_CASE(SBUFFER_LOAD)
NODE_NAME_CASE(BUFFER_STORE)		NODE_NAME_CASE(BUFFER_STORE)
NODE_NAME_CASE(BUFFER_STORE_FORMAT)		NODE_NAME_CASE(BUFFER_STORE_FORMAT)
NODE_NAME_CASE(BUFFER_STORE_FORMAT_D16)		NODE_NAME_CASE(BUFFER_STORE_FORMAT_D16)
NODE_NAME_CASE(BUFFER_ATOMIC_SWAP)		NODE_NAME_CASE(BUFFER_ATOMIC_SWAP)
NODE_NAME_CASE(BUFFER_ATOMIC_ADD)		NODE_NAME_CASE(BUFFER_ATOMIC_ADD)
NODE_NAME_CASE(BUFFER_ATOMIC_SUB)		NODE_NAME_CASE(BUFFER_ATOMIC_SUB)
NODE_NAME_CASE(BUFFER_ATOMIC_SMIN)		NODE_NAME_CASE(BUFFER_ATOMIC_SMIN)
NODE_NAME_CASE(BUFFER_ATOMIC_UMIN)		NODE_NAME_CASE(BUFFER_ATOMIC_UMIN)

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp

[...]	return loadInputValue(DAG, &AMDGPU::VGPR_32RegClass, MVT::i32,
MFI->getArgInfo().WorkItemIDY);		MFI->getArgInfo().WorkItemIDY);
case Intrinsic::amdgcn_workitem_id_z:		case Intrinsic::amdgcn_workitem_id_z:
case Intrinsic::r600_read_tidig_z:		case Intrinsic::r600_read_tidig_z:
return loadInputValue(DAG, &AMDGPU::VGPR_32RegClass, MVT::i32,		return loadInputValue(DAG, &AMDGPU::VGPR_32RegClass, MVT::i32,
SDLoc(DAG.getEntryNode()),		SDLoc(DAG.getEntryNode()),
MFI->getArgInfo().WorkItemIDZ);		MFI->getArgInfo().WorkItemIDZ);
case AMDGPUIntrinsic::SI_load_const: {		case AMDGPUIntrinsic::SI_load_const: {
SDValue Ops[] = {		SDValue Ops[] = {
Op.getOperand(1),		Op.getOperand(1), // Ptr
Op.getOperand(2)		Op.getOperand(2), // Offset
		DAG.getTargetConstant(0, DL, MVT::i1) // glc
};		};

MachineMemOperand *MMO = MF.getMachineMemOperand(		MachineMemOperand *MMO = MF.getMachineMemOperand(
MachinePointerInfo(),		MachinePointerInfo(),
MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |		MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
MachineMemOperand::MOInvariant,		MachineMemOperand::MOInvariant,
VT.getStoreSize(), 4);		VT.getStoreSize(), 4);
return DAG.getMemIntrinsicNode(AMDGPUISD::LOAD_CONSTANT, DL,		SDVTList VTList = DAG.getVTList(MVT::i32);
		SDValue Load = DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
		VTList, Ops, MVT::i32, MMO);

		return DAG.getNode(ISD::BITCAST, DL, MVT::f32, Load);
		}
		case Intrinsic::amdgcn_s_buffer_load: {
		unsigned Cache = cast<ConstantSDNode>(Op.getOperand(3))->getZExtValue();
		SDValue Ops[] = {
		Op.getOperand(1), // Ptr
		Op.getOperand(2), // Offset
		DAG.getTargetConstant(Cache & 1, DL, MVT::i1) // glc
		};

		MachineMemOperand *MMO = MF.getMachineMemOperand(
		MachinePointerInfo(),
		MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
		MachineMemOperand::MOInvariant,
		VT.getStoreSize(), VT.getStoreSize());
		return DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
Op->getVTList(), Ops, VT, MMO);		Op->getVTList(), Ops, VT, MMO);
}		}
case Intrinsic::amdgcn_fdiv_fast:		case Intrinsic::amdgcn_fdiv_fast:
return lowerFDIV_FAST(Op, DAG);		return lowerFDIV_FAST(Op, DAG);
case Intrinsic::amdgcn_interp_mov: {		case Intrinsic::amdgcn_interp_mov: {
SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(4));		SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(4));
SDValue Glue = M0.getValue(1);		SDValue Glue = M0.getValue(1);
return DAG.getNode(AMDGPUISD::INTERP_MOV, DL, MVT::f32, Op.getOperand(1),		return DAG.getNode(AMDGPUISD::INTERP_MOV, DL, MVT::f32, Op.getOperand(1),

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h

[...]	private:

void splitScalar64BitBinaryOp(SetVectorType &Worklist,		void splitScalar64BitBinaryOp(SetVectorType &Worklist,
MachineInstr &Inst, unsigned Opcode) const;		MachineInstr &Inst, unsigned Opcode) const;

void splitScalar64BitBCNT(SetVectorType &Worklist,		void splitScalar64BitBCNT(SetVectorType &Worklist,
MachineInstr &Inst) const;		MachineInstr &Inst) const;
void splitScalar64BitBFE(SetVectorType &Worklist,		void splitScalar64BitBFE(SetVectorType &Worklist,
MachineInstr &Inst) const;		MachineInstr &Inst) const;
		void splitScalarBuffer(SetVectorType &Worklist,
		MachineInstr &Inst) const;
void movePackToVALU(SetVectorType &Worklist,		void movePackToVALU(SetVectorType &Worklist,
MachineRegisterInfo &MRI,		MachineRegisterInfo &MRI,
MachineInstr &Inst) const;		MachineInstr &Inst) const;

void addUsersToMoveToVALUWorklist(unsigned Reg, MachineRegisterInfo &MRI,		void addUsersToMoveToVALUWorklist(unsigned Reg, MachineRegisterInfo &MRI,
SetVectorType &Worklist) const;		SetVectorType &Worklist) const;

void		void

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp

[...]	case AMDGPU::S_XNOR_B32:
Inst.eraseFromParent();		Inst.eraseFromParent();
continue;		continue;

case AMDGPU::S_XNOR_B64:		case AMDGPU::S_XNOR_B64:
splitScalar64BitBinaryOp(Worklist, Inst, AMDGPU::S_XNOR_B32);		splitScalar64BitBinaryOp(Worklist, Inst, AMDGPU::S_XNOR_B32);
Inst.eraseFromParent();		Inst.eraseFromParent();
continue;		continue;

case AMDGPU::S_BUFFER_LOAD_DWORD_SGPR: {		case AMDGPU::S_BUFFER_LOAD_DWORD_SGPR:
unsigned VDst = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);		case AMDGPU::S_BUFFER_LOAD_DWORDX2_SGPR:
		case AMDGPU::S_BUFFER_LOAD_DWORDX4_SGPR:
		case AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR:
		case AMDGPU::S_BUFFER_LOAD_DWORDX16_SGPR: {
		unsigned VDst;
		unsigned NewOpcode;

		switch(Opcode) {
		case AMDGPU::S_BUFFER_LOAD_DWORD_SGPR:
		NewOpcode = AMDGPU::BUFFER_LOAD_DWORD_OFFEN;
		VDst = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
		break;
		case AMDGPU::S_BUFFER_LOAD_DWORDX2_SGPR:
		NewOpcode = AMDGPU::BUFFER_LOAD_DWORDX2_OFFEN;
		VDst = MRI.createVirtualRegister(&AMDGPU::VReg_64RegClass);
		break;
		case AMDGPU::S_BUFFER_LOAD_DWORDX4_SGPR:
		NewOpcode = AMDGPU::BUFFER_LOAD_DWORDX4_OFFEN;
		VDst = MRI.createVirtualRegister(&AMDGPU::VReg_128RegClass);
		break;
		case AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR:
		case AMDGPU::S_BUFFER_LOAD_DWORDX16_SGPR:
		splitScalarBuffer(Worklist, Inst);
		Inst.eraseFromParent();
		continue;
		}

const MachineOperand *VAddr = getNamedOperand(Inst, AMDGPU::OpName::soff);		const MachineOperand *VAddr = getNamedOperand(Inst, AMDGPU::OpName::soff);
auto Add = MRI.getUniqueVRegDef(VAddr->getReg());		auto Add = MRI.getUniqueVRegDef(VAddr->getReg());
unsigned Offset = 0;		unsigned Offset = 0;

// FIXME: This isn't safe because the addressing mode doesn't work		// FIXME: This isn't safe because the addressing mode doesn't work
// correctly if vaddr is negative.		// correctly if vaddr is negative.
//		//
// FIXME: Should probably be done somewhere else, maybe SIFoldOperands.		// FIXME: Should probably be done somewhere else, maybe SIFoldOperands.
[...]	case AMDGPU::S_BUFFER_LOAD_DWORDX16_SGPR: {
}		}

Offset = 0;		Offset = 0;
}		}
}		}

MachineInstr *NewInstr =		MachineInstr *NewInstr =
BuildMI(*MBB, Inst, Inst.getDebugLoc(),		BuildMI(*MBB, Inst, Inst.getDebugLoc(),
get(AMDGPU::BUFFER_LOAD_DWORD_OFFEN), VDst)		get(NewOpcode), VDst)
.add(*VAddr) // vaddr		.add(*VAddr) // vaddr
.add(*getNamedOperand(Inst, AMDGPU::OpName::sbase)) // srsrc		.add(*getNamedOperand(Inst, AMDGPU::OpName::sbase)) // srsrc
.addImm(0) // soffset		.addImm(0) // soffset
.addImm(Offset) // offset		.addImm(Offset) // offset
.addImm(getNamedOperand(Inst, AMDGPU::OpName::glc)->getImm())		.addImm(getNamedOperand(Inst, AMDGPU::OpName::glc)->getImm())
.addImm(0) // slc		.addImm(0) // slc
.addImm(0) // tfe		.addImm(0) // tfe
.cloneMemRefs(Inst)		.cloneMemRefs(Inst)
[...]	BuildMI(MBB, MII, DL, get(TargetOpcode::REG_SEQUENCE), ResultReg)
.addImm(AMDGPU::sub0)		.addImm(AMDGPU::sub0)
.addReg(TmpReg)		.addReg(TmpReg)
.addImm(AMDGPU::sub1);		.addImm(AMDGPU::sub1);

MRI.replaceRegWith(Dest.getReg(), ResultReg);		MRI.replaceRegWith(Dest.getReg(), ResultReg);
addUsersToMoveToVALUWorklist(ResultReg, MRI, Worklist);		addUsersToMoveToVALUWorklist(ResultReg, MRI, Worklist);
}		}

		void SIInstrInfo::splitScalarBuffer(SetVectorType &Worklist,
		MachineInstr &Inst) const {
		MachineBasicBlock &MBB = *Inst.getParent();
		MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();

		MachineBasicBlock::iterator MII = Inst;
		auto &DL = Inst.getDebugLoc();

		MachineOperand &Dest = *getNamedOperand(Inst, AMDGPU::OpName::sdst);
		MachineOperand &Rsrc = *getNamedOperand(Inst, AMDGPU::OpName::sbase);
		MachineOperand &Offset = *getNamedOperand(Inst, AMDGPU::OpName::soff);
		MachineOperand &Glc = *getNamedOperand(Inst, AMDGPU::OpName::glc);

		unsigned Opcode = Inst.getOpcode();
		unsigned NewOpcode = AMDGPU::BUFFER_LOAD_DWORDX4_OFFEN;
		unsigned Count = 0;
		const TargetRegisterClass *DestRC = MRI.getRegClass(Dest.getReg());
		const TargetRegisterClass *NewDestRC = RI.getEquivalentVGPRClass(DestRC);

		switch(Opcode) {
		default:
		return;
		case AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR:
		Count = 2;
		break;
		case AMDGPU::S_BUFFER_LOAD_DWORDX16_SGPR:
		Count = 4;
		break;
		}

		// FIXME: Should also attempt to build VAddr and Offset like the non-split
		// case (see call site for this function)

		// Create a vector of result registers
		SmallVector<unsigned, 8> ResultRegs;
		for (unsigned i = 0; i < Count ; ++i) {
		unsigned ResultReg = MRI.createVirtualRegister(&AMDGPU::VReg_128RegClass);
		MachineInstr &NewMI = *BuildMI(MBB, MII, DL, get(NewOpcode), ResultReg)
		.addReg(Offset.getReg()) // offset
		.addReg(Rsrc.getReg()) // rsrc
		.addImm(0) // soffset
		.addImm(i << 4) // inst_offset
		.addImm(Glc.getImm()) // glc
		.addImm(0) // slc
		.addImm(0) // tfe
		.addMemOperand(*Inst.memoperands_begin());
		// Extract the 4 32 bit sub-registers from the result to add into the final REG_SEQUENCE
		auto &NewDestOp = NewMI.getOperand(0);
		for (unsigned i = 0 ; i < 4 ; i++)
		ResultRegs.push_back(buildExtractSubReg(MII, MRI, NewDestOp, &AMDGPU::VReg_128RegClass,
		RI.getSubRegFromChannel(i), &AMDGPU::VGPR_32RegClass));
		}
		// Create a new combined result to replace original with
		unsigned FullDestReg = MRI.createVirtualRegister(NewDestRC);
		MachineInstrBuilder CombinedResBuilder = BuildMI(MBB, MII, DL,
		get(TargetOpcode::REG_SEQUENCE), FullDestReg);

		for (unsigned i = 0 ; i < Count * 4 ; ++i) {
		CombinedResBuilder
		.addReg(ResultRegs[i])
		.addImm(RI.getSubRegFromChannel(i));
		}

		MRI.replaceRegWith(Dest.getReg(), FullDestReg);
		addUsersToMoveToVALUWorklist(FullDestReg, MRI, Worklist);
		}

void SIInstrInfo::addUsersToMoveToVALUWorklist(		void SIInstrInfo::addUsersToMoveToVALUWorklist(
unsigned DstReg,		unsigned DstReg,
MachineRegisterInfo &MRI,		MachineRegisterInfo &MRI,
SetVectorType &Worklist) const {		SetVectorType &Worklist) const {
for (MachineRegisterInfo::use_iterator I = MRI.use_begin(DstReg),		for (MachineRegisterInfo::use_iterator I = MRI.use_begin(DstReg),
E = MRI.use_end(); I != E;) {		E = MRI.use_end(); I != E;) {
MachineInstr &UseMI = *I->getParent();		MachineInstr &UseMI = *I->getParent();
if (!canReadVGPR(UseMI, I.getOperandNo())) {		if (!canReadVGPR(UseMI, I.getOperandNo())) {

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.td

	[...]
	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// SI DAG Nodes			// SI DAG Nodes
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	def AMDGPUclamp : SDNode<"AMDGPUISD::CLAMP", SDTFPUnaryOp>;			def AMDGPUclamp : SDNode<"AMDGPUISD::CLAMP", SDTFPUnaryOp>;

	def SIload_constant : SDNode<"AMDGPUISD::LOAD_CONSTANT",			def SIsbuffer_load : SDNode<"AMDGPUISD::SBUFFER_LOAD",
	SDTypeProfile<1, 2, [SDTCisVT<0, f32>, SDTCisVT<1, v4i32>, SDTCisVT<2, i32>]>,			SDTypeProfile<1, 3, [SDTCisVT<1, v4i32>, SDTCisVT<2, i32>, SDTCisVT<3, i1>]>,
	[SDNPMayLoad, SDNPMemOperand]			[SDNPMayLoad, SDNPMemOperand]
	>;			>;

	def SIatomic_inc : SDNode<"AMDGPUISD::ATOMIC_INC", SDTAtomic2,			def SIatomic_inc : SDNode<"AMDGPUISD::ATOMIC_INC", SDTAtomic2,
	[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]			[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]
	>;			>;

	def SIatomic_dec : SDNode<"AMDGPUISD::ATOMIC_DEC", SDTAtomic2,			def SIatomic_dec : SDNode<"AMDGPUISD::ATOMIC_DEC", SDTAtomic2,
	[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]			[SDNPMayLoad, SDNPMayStore, SDNPMemOperand, SDNPHasChain]

llvm/trunk/lib/Target/AMDGPU/SMInstructions.td

[...]	multiclass SMRD_Pattern <string Instr, ValueType vt> {

// 2. SGPR offset		// 2. SGPR offset
def : GCNPat <		def : GCNPat <
(smrd_load (SMRDSgpr i64:$sbase, i32:$offset)),		(smrd_load (SMRDSgpr i64:$sbase, i32:$offset)),
(vt (!cast<SM_Pseudo>(Instr#"_SGPR") $sbase, $offset, 0))		(vt (!cast<SM_Pseudo>(Instr#"_SGPR") $sbase, $offset, 0))
>;		>;
}		}

		multiclass SMLoad_Pattern <string Instr, ValueType vt> {
		// 1. Offset as an immediate
		// name this pattern to reuse AddedComplexity on CI
		def _IMM : GCNPat <
		(SIsbuffer_load v4i32:$sbase, (SMRDBufferImm i32:$offset), i1:$glc),
		(vt (!cast<SM_Pseudo>(Instr#"_IMM") $sbase, $offset, (as_i1imm $glc)))
		>;

		// 2. Offset loaded in an 32bit SGPR
		def : GCNPat <
		(SIsbuffer_load v4i32:$sbase, i32:$offset, i1:$glc),
		(vt (!cast<SM_Pseudo>(Instr#"_SGPR") $sbase, $offset, (as_i1imm $glc)))
		>;
		}


let OtherPredicates = [isSICI] in {		let OtherPredicates = [isSICI] in {
def : GCNPat <		def : GCNPat <
(i64 (readcyclecounter)),		(i64 (readcyclecounter)),
(S_MEMTIME)		(S_MEMTIME)
>;		>;
}		}

// Global and constant loads can be selected to either MUBUF or SMRD		// Global and constant loads can be selected to either MUBUF or SMRD
// instructions, but SMRD instructions are faster so we want the instruction		// instructions, but SMRD instructions are faster so we want the instruction
// selector to prefer those.		// selector to prefer those.
let AddedComplexity = 100 in {		let AddedComplexity = 100 in {

defm : SMRD_Pattern <"S_LOAD_DWORD", i32>;		defm : SMRD_Pattern <"S_LOAD_DWORD", i32>;
defm : SMRD_Pattern <"S_LOAD_DWORDX2", v2i32>;		defm : SMRD_Pattern <"S_LOAD_DWORDX2", v2i32>;
defm : SMRD_Pattern <"S_LOAD_DWORDX4", v4i32>;		defm : SMRD_Pattern <"S_LOAD_DWORDX4", v4i32>;
defm : SMRD_Pattern <"S_LOAD_DWORDX8", v8i32>;		defm : SMRD_Pattern <"S_LOAD_DWORDX8", v8i32>;
defm : SMRD_Pattern <"S_LOAD_DWORDX16", v16i32>;		defm : SMRD_Pattern <"S_LOAD_DWORDX16", v16i32>;

// 1. Offset as an immediate		// Name the pattern to reuse AddedComplexity on CI
def SM_LOAD_PATTERN : GCNPat < // name this pattern to reuse AddedComplexity on CI		defm SM_LOAD_PATTERN : SMLoad_Pattern <"S_BUFFER_LOAD_DWORD", i32>;
(SIload_constant v4i32:$sbase, (SMRDBufferImm i32:$offset)),		defm : SMLoad_Pattern <"S_BUFFER_LOAD_DWORDX2", v2i32>;
(S_BUFFER_LOAD_DWORD_IMM $sbase, $offset, 0)		defm : SMLoad_Pattern <"S_BUFFER_LOAD_DWORDX4", v4i32>;
>;		defm : SMLoad_Pattern <"S_BUFFER_LOAD_DWORDX8", v8i32>;
		defm : SMLoad_Pattern <"S_BUFFER_LOAD_DWORDX16", v16i32>;
// 2. Offset loaded in an 32bit SGPR
def : GCNPat <
(SIload_constant v4i32:$sbase, i32:$offset),
(S_BUFFER_LOAD_DWORD_SGPR $sbase, $offset, 0)
>;

} // End let AddedComplexity = 100		} // End let AddedComplexity = 100

let OtherPredicates = [isVI] in {		let OtherPredicates = [isVI] in {

def : GCNPat <		def : GCNPat <
(i64 (readcyclecounter)),		(i64 (readcyclecounter)),
(S_MEMREALTIME)		(S_MEMREALTIME)
>;		>;
[...]	class SMRD_Real_ci <bits<5> op, SM_Pseudo ps>
let Inst{14-9} = !if(ps.has_sbase, sbase{6-1}, ?);		let Inst{14-9} = !if(ps.has_sbase, sbase{6-1}, ?);
let Inst{21-15} = !if(ps.has_sdst, sdst{6-0}, ?);		let Inst{21-15} = !if(ps.has_sdst, sdst{6-0}, ?);
let Inst{26-22} = op;		let Inst{26-22} = op;
let Inst{31-27} = 0x18; //encoding		let Inst{31-27} = 0x18; //encoding
}		}

def S_DCACHE_INV_VOL_ci : SMRD_Real_ci <0x1d, S_DCACHE_INV_VOL>;		def S_DCACHE_INV_VOL_ci : SMRD_Real_ci <0x1d, S_DCACHE_INV_VOL>;

let AddedComplexity = SM_LOAD_PATTERN.AddedComplexity in {		let AddedComplexity = SM_LOAD_PATTERN_IMM.AddedComplexity in {

class SMRD_Pattern_ci <string Instr, ValueType vt> : GCNPat <		class SMRD_Pattern_ci <string Instr, ValueType vt> : GCNPat <
(smrd_load (SMRDImm32 i64:$sbase, i32:$offset)),		(smrd_load (SMRDImm32 i64:$sbase, i32:$offset)),
(vt (!cast<InstSI>(Instr#"_IMM_ci") $sbase, $offset, 0))> {		(vt (!cast<InstSI>(Instr#"_IMM_ci") $sbase, $offset, 0))> {
let OtherPredicates = [isCIOnly];		let OtherPredicates = [isCIOnly];
}		}

def : SMRD_Pattern_ci <"S_LOAD_DWORD", i32>;		def : SMRD_Pattern_ci <"S_LOAD_DWORD", i32>;
def : SMRD_Pattern_ci <"S_LOAD_DWORDX2", v2i32>;		def : SMRD_Pattern_ci <"S_LOAD_DWORDX2", v2i32>;
def : SMRD_Pattern_ci <"S_LOAD_DWORDX4", v4i32>;		def : SMRD_Pattern_ci <"S_LOAD_DWORDX4", v4i32>;
def : SMRD_Pattern_ci <"S_LOAD_DWORDX8", v8i32>;		def : SMRD_Pattern_ci <"S_LOAD_DWORDX8", v8i32>;
def : SMRD_Pattern_ci <"S_LOAD_DWORDX16", v16i32>;		def : SMRD_Pattern_ci <"S_LOAD_DWORDX16", v16i32>;

def : GCNPat <		class SMLoad_Pattern_ci <string Instr, ValueType vt> : GCNPat <
(SIload_constant v4i32:$sbase, (SMRDBufferImm32 i32:$offset)),		(vt (SIsbuffer_load v4i32:$sbase, (SMRDBufferImm32 i32:$offset), i1:$glc)),
(S_BUFFER_LOAD_DWORD_IMM_ci $sbase, $offset, 0)> {		(!cast<InstSI>(Instr) $sbase, $offset, (as_i1imm $glc))> {
let OtherPredicates = [isCI]; // should this be isCIOnly?		let OtherPredicates = [isCI]; // should this be isCIOnly?
}		}

		def : SMLoad_Pattern_ci <"S_BUFFER_LOAD_DWORD_IMM_ci", i32>;
		def : SMLoad_Pattern_ci <"S_BUFFER_LOAD_DWORDX2_IMM_ci", v2i32>;
		def : SMLoad_Pattern_ci <"S_BUFFER_LOAD_DWORDX4_IMM_ci", v4i32>;
		def : SMLoad_Pattern_ci <"S_BUFFER_LOAD_DWORDX8_IMM_ci", v8i32>;
		def : SMLoad_Pattern_ci <"S_BUFFER_LOAD_DWORDX16_IMM_ci", v16i32>;

} // End let AddedComplexity = SM_LOAD_PATTERN.AddedComplexity		} // End let AddedComplexity = SM_LOAD_PATTERN.AddedComplexity

llvm/trunk/test/CodeGen/AMDGPU/smrd.ll

[...]	main_body:
%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %d3, i32 0)		%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %d3, i32 0)
ret float %r		ret float %r
}		}

; SMRD load using the load.const.v4i32 intrinsic with an immediate offset		; SMRD load using the load.const.v4i32 intrinsic with an immediate offset
; GCN-LABEL: {{^}}smrd_load_const0:		; GCN-LABEL: {{^}}smrd_load_const0:
; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x4 ; encoding: [0x04		; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x4 ; encoding: [0x04
; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x10		; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x10
define amdgpu_ps void @smrd_load_const0(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19) #0 {		define amdgpu_ps void @smrd_load_const0(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
main_body:		main_body:
%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 16)		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 16)
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %tmp21, i1 true, i1 true) #0		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %tmp21, i1 true, i1 true) #0
ret void		ret void
}		}

; SMRD load using the load.const.v4i32 intrinsic with the largest possible immediate		; SMRD load using the load.const.v4i32 intrinsic with the largest possible immediate
; offset.		; offset.
; GCN-LABEL: {{^}}smrd_load_const1:		; GCN-LABEL: {{^}}smrd_load_const1:
; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xff ; encoding: [0xff		; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xff ; encoding: [0xff
		; SICI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xff glc ; encoding: [0xff
; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3fc		; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3fc
define amdgpu_ps void @smrd_load_const1(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19) #0 {		; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3fc glc
		define amdgpu_ps void @smrd_load_const1(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
main_body:		main_body:
%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1020)		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1020)
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %tmp21, i1 true, i1 true) #0		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %tmp22, i32 1020, i32 1)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
ret void		ret void
}		}

; SMRD load using the load.const.v4i32 intrinsic with an offset greater than the		; SMRD load using the load.const.v4i32 intrinsic with an offset greater than the
; largets possible immediate.		; largets possible immediate.
; immediate offset.		; immediate offset.
; GCN-LABEL: {{^}}smrd_load_const2:		; GCN-LABEL: {{^}}smrd_load_const2:
; SI: s_movk_i32 s[[OFFSET:[0-9]]], 0x400		; SI: s_movk_i32 s[[OFFSET:[0-9]]], 0x400
; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], s[[OFFSET]] ; encoding: [0x0[[OFFSET]]		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], s[[OFFSET]] ; encoding: [0x0[[OFFSET]]
		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], s[[OFFSET]] ; encoding: [0x0[[OFFSET]]
		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x100
; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x100		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x100
; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x400		; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x400
define amdgpu_ps void @smrd_load_const2(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19) #0 {		; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x400
		define amdgpu_ps void @smrd_load_const2(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
main_body:		main_body:
%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1024)		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1024)
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %tmp21, i1 true, i1 true) #0		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %tmp22, i32 1024, i32 0)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
ret void		ret void
}		}

; SMRD load with the largest possible immediate offset on VI		; SMRD load with the largest possible immediate offset on VI
; GCN-LABEL: {{^}}smrd_load_const3:		; GCN-LABEL: {{^}}smrd_load_const3:
; SI: s_mov_b32 [[OFFSET:s[0-9]+]], 0xffffc		; SI: s_mov_b32 [[OFFSET:s[0-9]+]], 0xffffc
; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; SI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3ffff
; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3ffff		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x3ffff
; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xffffc		; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xffffc
define amdgpu_ps void @smrd_load_const3(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19) #0 {		; VIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0xffffc
		define amdgpu_ps void @smrd_load_const3(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
main_body:		main_body:
%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1048572)		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1048572)
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %tmp21, i1 true, i1 true) #0		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %tmp22, i32 1048572, i32 0)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
ret void		ret void
}		}

; SMRD load with an offset greater than the largest possible immediate on VI		; SMRD load with an offset greater than the largest possible immediate on VI
; GCN-LABEL: {{^}}smrd_load_const4:		; GCN-LABEL: {{^}}smrd_load_const4:
; SIVIGFX9: s_mov_b32 [[OFFSET:s[0-9]+]], 0x100000		; SIVIGFX9: s_mov_b32 [[OFFSET:s[0-9]+]], 0x100000
; SIVIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]		; SIVIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; SIVIGFX9: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], [[OFFSET]]
		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x40000
; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x40000		; CI: s_buffer_load_dword s{{[0-9]}}, s[{{[0-9]:[0-9]}}], 0x40000
; GCN: s_endpgm		; GCN: s_endpgm
define amdgpu_ps void @smrd_load_const4(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19) #0 {		define amdgpu_ps void @smrd_load_const4(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
main_body:		main_body:
%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1048576)		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 1048576)
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %tmp21, i1 true, i1 true) #0		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %tmp22, i32 1048576, i32 0)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
		ret void
		}

		; dwordx2 s.buffer.load
		; GCN-LABEL: {{^}}s_buffer_load_dwordx2:
		; VIGFX9: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x80
		; SICI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x20
		define amdgpu_ps void @s_buffer_load_dwordx2(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
		main_body:
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %tmp22, i32 128, i32 0)
		%s.buffer.0 = extractelement <2 x i32> %s.buffer, i32 0
		%s.buffer.0.float = bitcast i32 %s.buffer.0 to float
		%s.buffer.1 = extractelement <2 x i32> %s.buffer, i32 1
		%s.buffer.1.float = bitcast i32 %s.buffer.1 to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %s.buffer.0.float, float %s.buffer.1.float, float %s.buffer.0.float, float %s.buffer.1.float, i1 true, i1 true) #0
		ret void
		}

		; dwordx4 s.buffer.load
		; GCN-LABEL: {{^}}s_buffer_load_dwordx4:
		; VIGFX9: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x80
		; SICI: s_buffer_load_dwordx4 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x20
		define amdgpu_ps void @s_buffer_load_dwordx4(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
		main_body:
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32> %tmp22, i32 128, i32 0)
		%s.buffer.0 = extractelement <4 x i32> %s.buffer, i32 0
		%s.buffer.0.float = bitcast i32 %s.buffer.0 to float
		%s.buffer.1 = extractelement <4 x i32> %s.buffer, i32 1
		%s.buffer.1.float = bitcast i32 %s.buffer.1 to float
		%s.buffer.2 = extractelement <4 x i32> %s.buffer, i32 2
		%s.buffer.2.float = bitcast i32 %s.buffer.2 to float
		%s.buffer.3 = extractelement <4 x i32> %s.buffer, i32 3
		%s.buffer.3.float = bitcast i32 %s.buffer.3 to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %s.buffer.0.float, float %s.buffer.1.float, float %s.buffer.2.float, float %s.buffer.3.float, i1 true, i1 true) #0
		ret void
		}

		; dwordx8 s.buffer.load
		; GCN-LABEL: {{^}}s_buffer_load_dwordx8:
		; VIGFX9: s_buffer_load_dwordx8 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x80
		; SICI: s_buffer_load_dwordx8 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x20
		define amdgpu_ps void @s_buffer_load_dwordx8(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
		main_body:
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call <8 x i32> @llvm.amdgcn.s.buffer.load.v8i32(<4 x i32> %tmp22, i32 128, i32 0)
		%s.buffer.0 = extractelement <8 x i32> %s.buffer, i32 0
		%s.buffer.0.float = bitcast i32 %s.buffer.0 to float
		%s.buffer.1 = extractelement <8 x i32> %s.buffer, i32 2
		%s.buffer.1.float = bitcast i32 %s.buffer.1 to float
		%s.buffer.2 = extractelement <8 x i32> %s.buffer, i32 5
		%s.buffer.2.float = bitcast i32 %s.buffer.2 to float
		%s.buffer.3 = extractelement <8 x i32> %s.buffer, i32 7
		%s.buffer.3.float = bitcast i32 %s.buffer.3 to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %s.buffer.0.float, float %s.buffer.1.float, float %s.buffer.2.float, float %s.buffer.3.float, i1 true, i1 true) #0
		ret void
		}

		; dwordx16 s.buffer.load
		; GCN-LABEL: {{^}}s_buffer_load_dwordx16:
		; VIGFX9: s_buffer_load_dwordx16 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x80
		; SICI: s_buffer_load_dwordx16 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]:[0-9]}}], 0x20
		define amdgpu_ps void @s_buffer_load_dwordx16(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in) #0 {
		main_body:
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call <16 x i32> @llvm.amdgcn.s.buffer.load.v16i32(<4 x i32> %tmp22, i32 128, i32 0)
		%s.buffer.0 = extractelement <16 x i32> %s.buffer, i32 0
		%s.buffer.0.float = bitcast i32 %s.buffer.0 to float
		%s.buffer.1 = extractelement <16 x i32> %s.buffer, i32 3
		%s.buffer.1.float = bitcast i32 %s.buffer.1 to float
		%s.buffer.2 = extractelement <16 x i32> %s.buffer, i32 12
		%s.buffer.2.float = bitcast i32 %s.buffer.2 to float
		%s.buffer.3 = extractelement <16 x i32> %s.buffer, i32 15
		%s.buffer.3.float = bitcast i32 %s.buffer.3 to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %s.buffer.0.float, float %s.buffer.1.float, float %s.buffer.2.float, float %s.buffer.3.float, i1 true, i1 true) #0
ret void		ret void
}		}
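
The constant-offset tests above expect a different offset operand per hardware generation for the same byte offset passed to the intrinsic. The Python sketch below is a toy model of that mapping, not LLVM code. It assumes an 8-bit immediate counted in dwords on SI and CI, an additional 32-bit literal dword offset on CI, and a 20-bit immediate counted in bytes on VI/GFX9. The function name `smrd_offset_operand` and the `kind` labels are hypothetical.

```python
# Toy model of the offset operand checked by the tests above.
# Assumptions (not taken from the patch itself): SI/CI encode an 8-bit
# immediate in dwords, CI can also use a 32-bit literal in dwords, and
# VI/GFX9 encode a 20-bit immediate in bytes.
SI_CI_IMM_BITS = 8
VI_IMM_BITS = 20


def smrd_offset_operand(byte_offset, gen):
    """Return (kind, value) for a constant, dword-aligned byte offset.

    kind is "imm" (encoded in the instruction), "literal" (CI only) or
    "sgpr" (the offset is first moved into an SGPR with s_mov_b32).
    """
    if byte_offset < 0 or byte_offset % 4 != 0:
        raise ValueError("this sketch only handles dword-aligned offsets")

    if gen in ("VI", "GFX9"):
        if byte_offset < (1 << VI_IMM_BITS):
            return ("imm", byte_offset)
        return ("sgpr", byte_offset)

    if gen in ("SI", "CI"):
        dwords = byte_offset // 4
        if dwords < (1 << SI_CI_IMM_BITS):
            return ("imm", dwords)
        if gen == "CI":
            return ("literal", dwords)
        return ("sgpr", byte_offset)

    raise ValueError("unknown generation: " + gen)


if __name__ == "__main__":
    for off in (128, 1024, 1048572, 1048576):
        for gen in ("SI", "CI", "VI"):
            kind, value = smrd_offset_operand(off, gen)
            print(off, gen, kind, hex(value))
```

Under these assumptions the model reproduces the operands in the CHECK lines: 128 gives 0x80 on VI and 0x20 on SI/CI, 1048572 gives 0xffffc on VI and 0x3ffff on CI, and 1048576 needs an SGPR on SI and VI while CI still encodes 0x40000.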

; GCN-LABEL: {{^}}smrd_sgpr_offset:		; GCN-LABEL: {{^}}smrd_sgpr_offset:
; GCN: s_buffer_load_dword s{{[0-9]}}, s[0:3], s4		; GCN: s_buffer_load_dword s{{[0-9]}}, s[0:3], s4
define amdgpu_ps float @smrd_sgpr_offset(<4 x i32> inreg %desc, i32 inreg %offset) #0 {		define amdgpu_ps float @smrd_sgpr_offset(<4 x i32> inreg %desc, i32 inreg %offset) #0 {
main_body:		main_body:
%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %offset)		%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %offset)
[150 lines not shown]

.outer_loop_body:		.outer_loop_body:
%offset = shl i32 %loopctr.2, 6		%offset = shl i32 %loopctr.2, 6
%load2result = call float @llvm.SI.load.const.v4i32(<4 x i32> %descriptor, i32 %offset)		%load2result = call float @llvm.SI.load.const.v4i32(<4 x i32> %descriptor, i32 %offset)
%outer_br = fcmp ueq float %load2result, 0x0		%outer_br = fcmp ueq float %load2result, 0x0
br i1 %outer_br, label %.outer_loop_header, label %ret_block		br i1 %outer_br, label %.outer_loop_header, label %ret_block
}		}

		; SMRD load with a non-const offset
		; GCN-LABEL: {{^}}smrd_load_nonconst0:
		; SIVIGFX9: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
		; SIVIGFX9: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
		; CI: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
		; CI: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
		; GCN: s_endpgm
		define amdgpu_ps void @smrd_load_nonconst0(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in, i32 inreg %ncoff) #0 {
		main_body:
		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 %ncoff)
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
		ret void
		}

		; SMRD load with a non-const non-uniform offset
		; GCN-LABEL: {{^}}smrd_load_nonconst1:
		; SIVIGFX9: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; SIVIGFX9: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; CI: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; CI: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; GCN: s_endpgm
		define amdgpu_ps void @smrd_load_nonconst1(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in, i32 %ncoff) #0 {
		main_body:
		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 %ncoff)
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)
		%s.buffer.float = bitcast i32 %s.buffer to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
		ret void
		}

		; SMRD load of more than 4 dwords with a non-const, non-uniform offset (requires splitting)
		; GCN-LABEL: {{^}}smrd_load_nonconst2:
		; SIVIGFX9: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; SIVIGFX9: buffer_load_dwordx4 v[{{[0-9]+:[0-9]+}}], v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; CI: buffer_load_dword v{{[0-9]+}}, v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; CI: buffer_load_dwordx4 v[{{[0-9]+:[0-9]+}}], v{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], 0 offen
		; GCN: s_endpgm
		define amdgpu_ps void @smrd_load_nonconst2(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in, i32 %ncoff) #0 {
		main_body:
		%tmp = getelementptr <4 x i32>, <4 x i32> addrspace(4)* %arg, i32 0
		%tmp20 = load <4 x i32>, <4 x i32> addrspace(4)* %tmp
		%tmp21 = call float @llvm.SI.load.const.v4i32(<4 x i32> %tmp20, i32 %ncoff)
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call <8 x i32> @llvm.amdgcn.s.buffer.load.v8i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)
		%s.buffer.elt = extractelement <8 x i32> %s.buffer, i32 1
		%s.buffer.float = bitcast i32 %s.buffer.elt to float
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
		ret void
		}

		; SMRD load dwordx2
		; GCN-LABEL: {{^}}smrd_load_dwordx2:
		; SIVIGFX9: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
		; CI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
		; GCN: s_endpgm
		define amdgpu_ps void @smrd_load_dwordx2(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in, i32 inreg %ncoff) #0 {
		main_body:
		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
		%s.buffer = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)
		%s.buffer.float = bitcast <2 x i32> %s.buffer to <2 x float>
		%r.1 = extractelement <2 x float> %s.buffer.float, i32 0
		%r.2 = extractelement <2 x float> %s.buffer.float, i32 1
		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r.1, float %r.1, float %r.1, float %r.2, i1 true, i1 true) #0
		ret void
		}
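
The non-uniform tests above fall back to `buffer_load_*` instructions, and the widest one they check is `buffer_load_dwordx4`. The sketch below illustrates the splitting idea from the summary: a load wider than the non-uniform variants support is broken into pieces of at most four dwords at increasing byte offsets. It is a simplified model, not the lowering code from the patch. The name `split_nonuniform_load` and the 4-dword limit as a parameter are assumptions made for illustration.

```python
# Toy model of splitting a wide s.buffer.load into pieces that a
# non-uniform (VMEM) buffer load can handle. The 4-dword limit is an
# assumption based on buffer_load_dwordx4 being the widest form checked
# in the tests above.
MAX_VMEM_DWORDS = 4


def split_nonuniform_load(num_dwords, max_dwords=MAX_VMEM_DWORDS):
    """Return a list of (extra_byte_offset, dwords) pieces covering the load."""
    if num_dwords <= 0 or max_dwords <= 0:
        raise ValueError("sizes must be positive")

    pieces = []
    done = 0
    while done < num_dwords:
        n = min(max_dwords, num_dwords - done)
        # Each dword is 4 bytes, so the next piece starts done * 4 bytes in.
        pieces.append((done * 4, n))
        done += n
    return pieces


if __name__ == "__main__":
    for width in (1, 2, 4, 8, 16):
        print(width, split_nonuniform_load(width))
```

With these assumptions a `v8i32` load becomes two 4-dword pieces at byte offsets 0 and 16, and a `v16i32` load becomes four pieces at 0, 16, 32 and 48.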


declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #0		declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #0
declare float @llvm.SI.load.const.v4i32(<4 x i32>, i32) #1		declare float @llvm.SI.load.const.v4i32(<4 x i32>, i32) #1
declare float @llvm.amdgcn.interp.p1(float, i32, i32, i32) #2		declare float @llvm.amdgcn.interp.p1(float, i32, i32, i32) #2
declare float @llvm.amdgcn.interp.p2(float, float, i32, i32, i32) #2		declare float @llvm.amdgcn.interp.p2(float, float, i32, i32, i32) #2
		declare i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32>, i32, i32)
		declare <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32>, i32, i32)
		declare <4 x i32> @llvm.amdgcn.s.buffer.load.v4i32(<4 x i32>, i32, i32)
		declare <8 x i32> @llvm.amdgcn.s.buffer.load.v8i32(<4 x i32>, i32, i32)
		declare <16 x i32> @llvm.amdgcn.s.buffer.load.v16i32(<4 x i32>, i32, i32)

attributes #0 = { nounwind }		attributes #0 = { nounwind }
attributes #1 = { nounwind readnone }		attributes #1 = { nounwind readnone }
attributes #2 = { nounwind readnone speculatable }		attributes #2 = { nounwind readnone speculatable }

!0 = !{}		!0 = !{}

llvm/trunk/test/Transforms/EarlyCSE/intrinsics.ll

				; RUN: opt < %s -S -mtriple=amdgcn-- -early-cse | FileCheck %s

				; CHECK-LABEL: @no_cse
				; CHECK: call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 0, i32 0)
				; CHECK: call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 4, i32 0)
				define void @no_cse(i32 addrspace(1)* %out, <4 x i32> %in) {
				%a = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 0, i32 0)
				%b = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 4, i32 0)
				%c = add i32 %a, %b
				store i32 %c, i32 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @cse_zero_offset
				; CHECK: [[CSE:%[a-z0-9A-Z]+]] = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 0, i32 0)
				; CHECK: add i32 [[CSE]], [[CSE]]
				define void @cse_zero_offset(i32 addrspace(1)* %out, <4 x i32> %in) {
				%a = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 0, i32 0)
				%b = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 0, i32 0)
				%c = add i32 %a, %b
				store i32 %c, i32 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @cse_nonzero_offset
				; CHECK: [[CSE:%[a-z0-9A-Z]+]] = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 4, i32 0)
				; CHECK: add i32 [[CSE]], [[CSE]]
				define void @cse_nonzero_offset(i32 addrspace(1)* %out, <4 x i32> %in) {
				%a = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 4, i32 0)
				%b = call i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> %in, i32 4, i32 0)
				%c = add i32 %a, %b
				store i32 %c, i32 addrspace(1)* %out
				ret void
				}

				declare i32 @llvm.amdgcn.s.buffer.load.i32(<4 x i32> nocapture, i32, i32)
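
The three EarlyCSE tests above check that two calls to the intrinsic are merged only when all operands match. The Python sketch below is a toy model of that behaviour, not LLVM's EarlyCSE pass: it treats each call as a pure function of its operands, which assumes the intrinsic has no side effects and reads no memory visible to the optimizer. The function name `toy_cse` and the tuple representation of a call are hypothetical.

```python
# Toy common-subexpression elimination over a straight-line list of calls.
# Assumption: every call is pure, so two calls with the same callee and
# the same operands always produce the same value.


def toy_cse(calls):
    """calls is a list of (result, callee, args) tuples.

    Returns (kept, replaced): the calls that survive, and a mapping from
    each eliminated result name to the earlier result that replaces it.
    """
    first_seen = {}
    kept = []
    replaced = {}
    for result, callee, args in calls:
        key = (callee, tuple(args))
        if key in first_seen:
            replaced[result] = first_seen[key]
        else:
            first_seen[key] = result
            kept.append((result, callee, args))
    return kept, replaced


if __name__ == "__main__":
    load = "llvm.amdgcn.s.buffer.load.i32"
    # Mirrors @no_cse: offsets 0 and 4 differ, so both calls stay.
    print(toy_cse([("%a", load, ["%in", 0, 0]), ("%b", load, ["%in", 4, 0])]))
    # Mirrors @cse_nonzero_offset: identical operands, so %b reuses %a.
    print(toy_cse([("%a", load, ["%in", 4, 0]), ("%b", load, ["%in", 4, 0])]))
```

In the merged case the later `add` uses the same value twice, which matches the `add i32 [[CSE]], [[CSE]]` CHECK lines.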