This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
ClosedPublic

Authored by nhaehnle on Oct 15 2018, 4:40 AM.

Details

Reviewers

arsenm
alex-t
rampitec
tpr

Commits

rGa7b00058e05f: AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
rGc4a2ff095078: AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
rL348050: AMDGPU: Divergence-driven selection of scalar buffer load intrinsics
rL344696: AMDGPU: Divergence-driven selection of scalar buffer load intrinsics

Summary

Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if
the load is really uniform. So select the scalar load intrinsics directly
to either VMEM or SMRD buffer loads based on divergence analysis.

If an offset happens to end up in a VGPR -- either because a floating
point calculation was involved, or due to other remaining deficiencies
in SIFixSGPRCopies -- we use v_readfirstlane.

There is some unrelated churn in tests since we now select MUBUF offsets
in a unified way with non-scalar buffer loads.

Change-Id: I170e6816323beb1348677b358c9d380865cd1a19
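
The selection rule described in the summary can be modeled outside of LLVM. The following is a minimal sketch, not the patch's code: `LoadKind`, `LoadPlan` and `planSBufferLoad` are hypothetical names, and the split into dwordx4 loads mirrors what `lowerSBuffer` in this diff does for v8i32 and v16i32.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two instruction families; not LLVM types.
enum class LoadKind { ScalarSMRD, VectorMUBUF };

struct LoadPlan {
  LoadKind Kind;
  unsigned NumLoads;      // number of machine loads emitted
  unsigned DwordsPerLoad; // width of each load in 32-bit dwords
};

// Plan the lowering of a scalar buffer load of `Dwords` dwords (1, 2, 4, 8
// or 16). `OffsetIsDivergent` models the divergence analysis result for the
// offset operand.
inline LoadPlan planSBufferLoad(unsigned Dwords, bool OffsetIsDivergent) {
  // Uniform offset: a single scalar load of the full width.
  if (!OffsetIsDivergent)
    return {LoadKind::ScalarSMRD, 1u, Dwords};
  // Divergent offset, wide result: split into several dwordx4 vector loads.
  if (Dwords > 4)
    return {LoadKind::VectorMUBUF, Dwords / 4, 4u};
  // Divergent offset, narrow result: one vector load of the same width.
  return {LoadKind::VectorMUBUF, 1u, Dwords};
}
```

Under these assumptions, `planSBufferLoad(16, true)` yields four dwordx4 vector loads, while `planSBufferLoad(16, false)` stays a single scalar load.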

Diff Detail

Repository: rL LLVM

Event Timeline

nhaehnle created this revision. Oct 15 2018, 4:40 AM

Herald added subscribers: t-tye, dstuttard, yaxunl and 3 others. Oct 15 2018, 4:40 AM

LGTM

This revision is now accepted and ready to land. Oct 16 2018, 12:26 AM

nhaehnle added a child revision: D53316: StructurizeCFG: Simplify inserted PHI nodes. Oct 16 2018, 2:04 AM

arsenm added inline comments. Oct 16 2018, 10:40 AM

lib/Target/AMDGPU/SIISelLowering.cpp
4824–4827 ↗	(On Diff #169679)	I don't love the hardcoded types here. Can you assert on the sizes, in case we fix casting all mem operations to int?

add an assertion for the VT

Harbormaster completed remote builds in B23843: Diff 169868. Oct 16 2018, 11:28 AM

arsenm accepted this revision. Oct 16 2018, 11:34 AM

arsenm added inline comments.

lib/Target/AMDGPU/SIInstrInfo.cpp
3582 ↗	(On Diff #169868)	This should probably be called soffset? I guess that's a preexisting condition

nhaehnle added inline comments. Oct 16 2018, 11:39 AM

lib/Target/AMDGPU/SIInstrInfo.cpp
3582 ↗	(On Diff #169868)	Yeah, changing the OpName seems like a separate thing :)

Closed by commit rL344696: AMDGPU: Divergence-driven selection of scalar buffer load intrinsics (authored by nha). Oct 17 2018, 8:39 AM

This revision was automatically updated to reflect the committed changes.

Hi Nicolai,

FYI, this introduced a regression in Mass Effect Andromeda with DXVK and RADV on Polaris10. See https://bugs.freedesktop.org/show_bug.cgi?id=108611

Thanks for the heads up. I'll take a look.

In D53283#1282151, @hakzsam wrote:

Hi Nicolai,

FYI, this introduced a regression in Mass Effect Andromeda with DXVK and RADV on Polaris10. See https://bugs.freedesktop.org/show_bug.cgi?id=108611

I have root-caused the problem, which is that divergence info isn't passed correctly through the SelectionDAG in all cases. I'm going to look into a fix, but I'll likely have to touch common code to do so and have reverted this commit for now.

nhaehnle mentioned this in D54340: AMDGPU: Fix various issues around the VirtReg2Value mapping. Nov 9 2018, 11:51 AM

Diffusion mentioned this in rL348049: AMDGPU: Fix various issues around the VirtReg2Value mapping. Nov 30 2018, 2:58 PM

Hi Nicolai,

I'm sorry, but this change (actually r348050) also breaks World Of Tanks. Here's the apitrace: https://mega.nz/#!MOg2mSrD!aJHdSrimJBrVzv6c0ParmHDKIsduMq55CKPJjk0OgRI and the DXVK issue is here: https://github.com/doitsujin/dxvk/issues/884.
Note that it's an apitrace rather than a RenderDoc capture because we failed to record one with that game. The caterpillar issue can be reproduced with both WineD3D/RadeonSI and DXVK/RADV.

The first bad commit is cc436fd26637b0629b95fd8e60fde61cec4b421f, then it works with 69f971eb1814487fc23ee092a69532a8d152c80d (because you reverted the original change) and e3924b1c15606bb5bf98392e0c20e731b4965311 breaks it again.

Can you look into this?
Thanks,
Samuel.

Herald added a project: Restricted Project. Jun 14 2019, 5:23 AM

Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/ (4 files)	4 lines, 107 lines, 2 lines, 185 lines
llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h	5 lines
llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp	7 lines
llvm/trunk/test/CodeGen/AMDGPU/smrd-fold-offset.mir	8 lines
llvm/trunk/test/CodeGen/AMDGPU/smrd.ll	67 lines

Diff 170013

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.h

Show First 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	SDValue getPreloadedValue(SelectionDAG &DAG,
AMDGPUFunctionArgInfo::PreloadedValue) const;		AMDGPUFunctionArgInfo::PreloadedValue) const;

SDValue LowerGlobalAddress(AMDGPUMachineFunction *MFI, SDValue Op,		SDValue LowerGlobalAddress(AMDGPUMachineFunction *MFI, SDValue Op,
SelectionDAG &DAG) const override;		SelectionDAG &DAG) const override;
SDValue lowerImplicitZextParam(SelectionDAG &DAG, SDValue Op,		SDValue lowerImplicitZextParam(SelectionDAG &DAG, SDValue Op,
MVT VT, unsigned Offset) const;		MVT VT, unsigned Offset) const;
SDValue lowerImage(SDValue Op, const AMDGPU::ImageDimIntrinsicInfo *Intr,		SDValue lowerImage(SDValue Op, const AMDGPU::ImageDimIntrinsicInfo *Intr,
SelectionDAG &DAG) const;		SelectionDAG &DAG) const;
		SDValue lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc, SDValue Offset,
		SDValue GLC, SelectionDAG &DAG) const;

SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerINTRINSIC_W_CHAIN(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerINTRINSIC_W_CHAIN(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerINTRINSIC_VOID(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerINTRINSIC_VOID(SDValue Op, SelectionDAG &DAG) const;

// The raw.tbuffer and struct.tbuffer intrinsics have two offset args: offset		// The raw.tbuffer and struct.tbuffer intrinsics have two offset args: offset
// (the offset that is included in bounds checking and swizzling, to be split		// (the offset that is included in bounds checking and swizzling, to be split
// between the instruction's voffset and immoffset fields) and soffset (the		// between the instruction's voffset and immoffset fields) and soffset (the
▲ Show 20 Lines • Show All 114 Lines • ▼ Show 20 Lines	private:
/// \returns True if PC-relative relocation needs to be emitted for given		/// \returns True if PC-relative relocation needs to be emitted for given
/// global value \p GV, false otherwise.		/// global value \p GV, false otherwise.
bool shouldEmitPCReloc(const GlobalValue *GV) const;		bool shouldEmitPCReloc(const GlobalValue *GV) const;

// Analyze a combined offset from an amdgcn_buffer_ intrinsic and store the		// Analyze a combined offset from an amdgcn_buffer_ intrinsic and store the
// three offsets (voffset, soffset and instoffset) into the SDValue[3] array		// three offsets (voffset, soffset and instoffset) into the SDValue[3] array
// pointed to by Offsets.		// pointed to by Offsets.
void setBufferOffsets(SDValue CombinedOffset, SelectionDAG &DAG,		void setBufferOffsets(SDValue CombinedOffset, SelectionDAG &DAG,
SDValue *Offsets) const;		SDValue *Offsets, unsigned Align = 4) const;

public:		public:
SITargetLowering(const TargetMachine &tm, const GCNSubtarget &STI);		SITargetLowering(const TargetMachine &tm, const GCNSubtarget &STI);

const GCNSubtarget *getSubtarget() const;		const GCNSubtarget *getSubtarget() const;

bool isFPExtFoldable(unsigned Opcode, EVT DestVT, EVT SrcVT) const override;		bool isFPExtFoldable(unsigned Opcode, EVT DestVT, EVT SrcVT) const override;

▲ Show 20 Lines • Show All 150 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp

Show First 20 Lines • Show All 4,790 Lines • ▼ Show 20 Lines	if (BaseOpcode->AtomicX2) {
SDValue Adjusted = adjustLoadValueTypeImpl(		SDValue Adjusted = adjustLoadValueTypeImpl(
SDValue(NewNode, 0), LoadVT, DL, DAG, Subtarget->hasUnpackedD16VMem());		SDValue(NewNode, 0), LoadVT, DL, DAG, Subtarget->hasUnpackedD16VMem());
return DAG.getMergeValues({Adjusted, SDValue(NewNode, 1)}, DL);		return DAG.getMergeValues({Adjusted, SDValue(NewNode, 1)}, DL);
}		}

return SDValue(NewNode, 0);		return SDValue(NewNode, 0);
}		}

		SDValue SITargetLowering::lowerSBuffer(EVT VT, SDLoc DL, SDValue Rsrc,
		SDValue Offset, SDValue GLC,
		SelectionDAG &DAG) const {
		MachineFunction &MF = DAG.getMachineFunction();
		MachineMemOperand *MMO = MF.getMachineMemOperand(
		MachinePointerInfo(),
		MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
		MachineMemOperand::MOInvariant,
		VT.getStoreSize(), VT.getStoreSize());

		if (!Offset->isDivergent()) {
		SDValue Ops[] = {
		Rsrc,
		Offset, // Offset
		GLC // glc
		};
		return DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
		DAG.getVTList(VT), Ops, VT, MMO);
		}

		// We have a divergent offset. Emit a MUBUF buffer load instead. We can
		// assume that the buffer is unswizzled.
		SmallVector<SDValue, 4> Loads;
		unsigned NumLoads = 1;
		MVT LoadVT = VT.getSimpleVT();

		assert(LoadVT == MVT::i32 || LoadVT == MVT::v2i32 || LoadVT == MVT::v4i32 ||
		LoadVT == MVT::v8i32 || LoadVT == MVT::v16i32);

		if (VT == MVT::v8i32 || VT == MVT::v16i32) {
		NumLoads = VT == MVT::v16i32 ? 4 : 2;
		LoadVT = MVT::v4i32;
		}

		SDVTList VTList = DAG.getVTList({LoadVT, MVT::Glue});
		unsigned CachePolicy = cast<ConstantSDNode>(GLC)->getZExtValue();
		SDValue Ops[] = {
		DAG.getEntryNode(), // Chain
		Rsrc, // rsrc
		DAG.getConstant(0, DL, MVT::i32), // vindex
		{}, // voffset
		{}, // soffset
		{}, // offset
		DAG.getConstant(CachePolicy, DL, MVT::i32), // cachepolicy
		DAG.getConstant(0, DL, MVT::i1), // idxen
		};

		// Use the alignment to ensure that the required offsets will fit into the
		// immediate offsets.
		setBufferOffsets(Offset, DAG, &Ops[3], NumLoads > 1 ? 16 * NumLoads : 4);

		uint64_t InstOffset = cast<ConstantSDNode>(Ops[5])->getZExtValue();
		for (unsigned i = 0; i < NumLoads; ++i) {
		Ops[5] = DAG.getConstant(InstOffset + 16 * i, DL, MVT::i32);
		Loads.push_back(DAG.getMemIntrinsicNode(AMDGPUISD::BUFFER_LOAD, DL, VTList,
		Ops, LoadVT, MMO));
		}

		if (VT == MVT::v8i32 || VT == MVT::v16i32)
		return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Loads);

		return Loads[0];
		}
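
The comment about using the alignment so that the required offsets fit into the immediate field can be checked with a little arithmetic. The sketch below is an illustration only: `alignDownTo`, `requestedAlign` and `worstCaseLastOffset` are hypothetical helpers, and the constants 4095 and 16 are taken from the `alignDown(4095, Align)` and `InstOffset + 16 * i` expressions in this diff. It computes the largest immediate that the last of the split loads could receive.

```cpp
#include <cassert>
#include <cstdint>

// Largest MUBUF immediate offset, as used by alignDown(4095, Align) in the diff.
constexpr uint32_t kMaxMubufImm = 4095;

// Round Value down to a multiple of Align (hypothetical helper).
constexpr uint32_t alignDownTo(uint32_t Value, uint32_t Align) {
  return Value - (Value % Align);
}

// Alignment that lowerSBuffer passes to setBufferOffsets.
constexpr uint32_t requestedAlign(unsigned NumLoads) {
  return NumLoads > 1 ? 16 * NumLoads : 4;
}

// Largest immediate the final load can end up with: the aligned maximum
// for the first load plus 16 bytes for every additional load.
constexpr uint32_t worstCaseLastOffset(unsigned NumLoads) {
  return alignDownTo(kMaxMubufImm, requestedAlign(NumLoads)) +
         16 * (NumLoads - 1);
}

// In every case the last load still fits into the 0..4095 range.
static_assert(worstCaseLastOffset(1) == 4092, "single dword-aligned load");
static_assert(worstCaseLastOffset(2) == 4080, "v8i32 split into 2 loads");
static_assert(worstCaseLastOffset(4) == 4080, "v16i32 split into 4 loads");
```

The single-load case tops out at 4092, which appears consistent with the smrd.ll test in this revision switching from `offset:4095` to `offset:4092`.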

SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,		SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();		MachineFunction &MF = DAG.getMachineFunction();
auto MFI = MF.getInfo<SIMachineFunctionInfo>();		auto MFI = MF.getInfo<SIMachineFunctionInfo>();

EVT VT = Op.getValueType();		EVT VT = Op.getValueType();
SDLoc DL(Op);		SDLoc DL(Op);
unsigned IntrinsicID = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();		unsigned IntrinsicID = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
▲ Show 20 Lines • Show All 139 Lines • ▼ Show 20 Lines	return loadInputValue(DAG, &AMDGPU::VGPR_32RegClass, MVT::i32,
SDLoc(DAG.getEntryNode()),		SDLoc(DAG.getEntryNode()),
MFI->getArgInfo().WorkItemIDY);		MFI->getArgInfo().WorkItemIDY);
case Intrinsic::amdgcn_workitem_id_z:		case Intrinsic::amdgcn_workitem_id_z:
case Intrinsic::r600_read_tidig_z:		case Intrinsic::r600_read_tidig_z:
return loadInputValue(DAG, &AMDGPU::VGPR_32RegClass, MVT::i32,		return loadInputValue(DAG, &AMDGPU::VGPR_32RegClass, MVT::i32,
SDLoc(DAG.getEntryNode()),		SDLoc(DAG.getEntryNode()),
MFI->getArgInfo().WorkItemIDZ);		MFI->getArgInfo().WorkItemIDZ);
case AMDGPUIntrinsic::SI_load_const: {		case AMDGPUIntrinsic::SI_load_const: {
SDValue Ops[] = {		SDValue Load =
Op.getOperand(1), // Ptr		lowerSBuffer(MVT::i32, DL, Op.getOperand(1), Op.getOperand(2),
Op.getOperand(2), // Offset		DAG.getTargetConstant(0, DL, MVT::i1), DAG);
DAG.getTargetConstant(0, DL, MVT::i1) // glc
};

MachineMemOperand *MMO = MF.getMachineMemOperand(
MachinePointerInfo(),
MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
MachineMemOperand::MOInvariant,
VT.getStoreSize(), 4);
SDVTList VTList = DAG.getVTList(MVT::i32);
SDValue Load = DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
VTList, Ops, MVT::i32, MMO);

return DAG.getNode(ISD::BITCAST, DL, MVT::f32, Load);		return DAG.getNode(ISD::BITCAST, DL, MVT::f32, Load);
}		}
case Intrinsic::amdgcn_s_buffer_load: {		case Intrinsic::amdgcn_s_buffer_load: {
unsigned Cache = cast<ConstantSDNode>(Op.getOperand(3))->getZExtValue();		unsigned Cache = cast<ConstantSDNode>(Op.getOperand(3))->getZExtValue();
SDValue Ops[] = {		return lowerSBuffer(VT, DL, Op.getOperand(1), Op.getOperand(2),
Op.getOperand(1), // Ptr		DAG.getTargetConstant(Cache & 1, DL, MVT::i1), DAG);
Op.getOperand(2), // Offset
DAG.getTargetConstant(Cache & 1, DL, MVT::i1) // glc
};

MachineMemOperand *MMO = MF.getMachineMemOperand(
MachinePointerInfo(),
MachineMemOperand::MOLoad | MachineMemOperand::MODereferenceable |
MachineMemOperand::MOInvariant,
VT.getStoreSize(), VT.getStoreSize());
return DAG.getMemIntrinsicNode(AMDGPUISD::SBUFFER_LOAD, DL,
Op->getVTList(), Ops, VT, MMO);
}		}
case Intrinsic::amdgcn_fdiv_fast:		case Intrinsic::amdgcn_fdiv_fast:
return lowerFDIV_FAST(Op, DAG);		return lowerFDIV_FAST(Op, DAG);
case Intrinsic::amdgcn_interp_mov: {		case Intrinsic::amdgcn_interp_mov: {
SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(4));		SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(4));
SDValue Glue = M0.getValue(1);		SDValue Glue = M0.getValue(1);
return DAG.getNode(AMDGPUISD::INTERP_MOV, DL, MVT::f32, Op.getOperand(1),		return DAG.getNode(AMDGPUISD::INTERP_MOV, DL, MVT::f32, Op.getOperand(1),
Op.getOperand(2), Op.getOperand(3), Glue);		Op.getOperand(2), Op.getOperand(3), Glue);
▲ Show 20 Lines • Show All 1,018 Lines • ▼ Show 20 Lines	if (!C1)
C1 = cast<ConstantSDNode>(DAG.getConstant(0, DL, MVT::i32));		C1 = cast<ConstantSDNode>(DAG.getConstant(0, DL, MVT::i32));
return {N0, SDValue(C1, 0)};		return {N0, SDValue(C1, 0)};
}		}

// Analyze a combined offset from an amdgcn_buffer_ intrinsic and store the		// Analyze a combined offset from an amdgcn_buffer_ intrinsic and store the
// three offsets (voffset, soffset and instoffset) into the SDValue[3] array		// three offsets (voffset, soffset and instoffset) into the SDValue[3] array
// pointed to by Offsets.		// pointed to by Offsets.
void SITargetLowering::setBufferOffsets(SDValue CombinedOffset,		void SITargetLowering::setBufferOffsets(SDValue CombinedOffset,
SelectionDAG &DAG,		SelectionDAG &DAG, SDValue *Offsets,
SDValue *Offsets) const {		unsigned Align) const {
SDLoc DL(CombinedOffset);		SDLoc DL(CombinedOffset);
if (auto C = dyn_cast<ConstantSDNode>(CombinedOffset)) {		if (auto C = dyn_cast<ConstantSDNode>(CombinedOffset)) {
uint32_t Imm = C->getZExtValue();		uint32_t Imm = C->getZExtValue();
uint32_t SOffset, ImmOffset;		uint32_t SOffset, ImmOffset;
if (AMDGPU::splitMUBUFOffset(Imm, SOffset, ImmOffset, Subtarget)) {		if (AMDGPU::splitMUBUFOffset(Imm, SOffset, ImmOffset, Subtarget, Align)) {
Offsets[0] = DAG.getConstant(0, DL, MVT::i32);		Offsets[0] = DAG.getConstant(0, DL, MVT::i32);
Offsets[1] = DAG.getConstant(SOffset, DL, MVT::i32);		Offsets[1] = DAG.getConstant(SOffset, DL, MVT::i32);
Offsets[2] = DAG.getConstant(ImmOffset, DL, MVT::i32);		Offsets[2] = DAG.getConstant(ImmOffset, DL, MVT::i32);
return;		return;
}		}
}		}
if (DAG.isBaseWithConstantOffset(CombinedOffset)) {		if (DAG.isBaseWithConstantOffset(CombinedOffset)) {
SDValue N0 = CombinedOffset.getOperand(0);		SDValue N0 = CombinedOffset.getOperand(0);
SDValue N1 = CombinedOffset.getOperand(1);		SDValue N1 = CombinedOffset.getOperand(1);
uint32_t SOffset, ImmOffset;		uint32_t SOffset, ImmOffset;
int Offset = cast<ConstantSDNode>(N1)->getSExtValue();		int Offset = cast<ConstantSDNode>(N1)->getSExtValue();
if (Offset >= 0		if (Offset >= 0 && AMDGPU::splitMUBUFOffset(Offset, SOffset, ImmOffset,
&& AMDGPU::splitMUBUFOffset(Offset, SOffset, ImmOffset, Subtarget)) {		Subtarget, Align)) {
Offsets[0] = N0;		Offsets[0] = N0;
Offsets[1] = DAG.getConstant(SOffset, DL, MVT::i32);		Offsets[1] = DAG.getConstant(SOffset, DL, MVT::i32);
Offsets[2] = DAG.getConstant(ImmOffset, DL, MVT::i32);		Offsets[2] = DAG.getConstant(ImmOffset, DL, MVT::i32);
return;		return;
}		}
}		}
Offsets[0] = CombinedOffset;		Offsets[0] = CombinedOffset;
Offsets[1] = DAG.getConstant(0, DL, MVT::i32);		Offsets[1] = DAG.getConstant(0, DL, MVT::i32);
▲ Show 20 Lines • Show All 3,211 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.h

Show First 20 Lines • Show All 97 Lines • ▼ Show 20 Lines	private:
void splitScalar64BitBinaryOp(SetVectorType &Worklist, MachineInstr &Inst,		void splitScalar64BitBinaryOp(SetVectorType &Worklist, MachineInstr &Inst,
unsigned Opcode,		unsigned Opcode,
MachineDominatorTree *MDT = nullptr) const;		MachineDominatorTree *MDT = nullptr) const;

void splitScalar64BitBCNT(SetVectorType &Worklist,		void splitScalar64BitBCNT(SetVectorType &Worklist,
MachineInstr &Inst) const;		MachineInstr &Inst) const;
void splitScalar64BitBFE(SetVectorType &Worklist,		void splitScalar64BitBFE(SetVectorType &Worklist,
MachineInstr &Inst) const;		MachineInstr &Inst) const;
void splitScalarBuffer(SetVectorType &Worklist,
MachineInstr &Inst) const;
void movePackToVALU(SetVectorType &Worklist,		void movePackToVALU(SetVectorType &Worklist,
MachineRegisterInfo &MRI,		MachineRegisterInfo &MRI,
MachineInstr &Inst) const;		MachineInstr &Inst) const;

void addUsersToMoveToVALUWorklist(unsigned Reg, MachineRegisterInfo &MRI,		void addUsersToMoveToVALUWorklist(unsigned Reg, MachineRegisterInfo &MRI,
SetVectorType &Worklist) const;		SetVectorType &Worklist) const;

void		void
▲ Show 20 Lines • Show All 876 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/SIInstrInfo.cpp

Show First 20 Lines • Show All 3,570 Lines • ▼ Show 20 Lines	void SIInstrInfo::legalizeOperandsSMRD(MachineRegisterInfo &MRI,
MachineInstr &MI) const {		MachineInstr &MI) const {

// If the pointer is store in VGPRs, then we need to move them to		// If the pointer is store in VGPRs, then we need to move them to
// SGPRs using v_readfirstlane. This is safe because we only select		// SGPRs using v_readfirstlane. This is safe because we only select
// loads with uniform pointers to SMRD instruction so we know the		// loads with uniform pointers to SMRD instruction so we know the
// pointer value is uniform.		// pointer value is uniform.
MachineOperand *SBase = getNamedOperand(MI, AMDGPU::OpName::sbase);		MachineOperand *SBase = getNamedOperand(MI, AMDGPU::OpName::sbase);
if (SBase && !RI.isSGPRClass(MRI.getRegClass(SBase->getReg()))) {		if (SBase && !RI.isSGPRClass(MRI.getRegClass(SBase->getReg()))) {
unsigned SGPR = readlaneVGPRToSGPR(SBase->getReg(), MI, MRI);		unsigned SGPR = readlaneVGPRToSGPR(SBase->getReg(), MI, MRI);
SBase->setReg(SGPR);		SBase->setReg(SGPR);
}		}
		MachineOperand *SOff = getNamedOperand(MI, AMDGPU::OpName::soff);
		if (SOff && !RI.isSGPRClass(MRI.getRegClass(SOff->getReg()))) {
		unsigned SGPR = readlaneVGPRToSGPR(SOff->getReg(), MI, MRI);
		SOff->setReg(SGPR);
		}
}		}
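
Why `v_readfirstlane` is only safe for uniform values can be shown with a small software model. This is a sketch under stated assumptions, not hardware documentation: it assumes a 64-lane wave, and `readFirstLane` and `isUniform` are hypothetical names. The model returns the value held by the lowest-numbered active lane and, as an assumption, falls back to lane 0 when no lane is active.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kWaveSize = 64; // assumption: wave64

using LaneValues = std::array<uint32_t, kWaveSize>;

// Return the value of the lowest-numbered lane whose bit is set in Exec.
// With no active lanes this model returns lane 0 (an assumption).
inline uint32_t readFirstLane(const LaneValues &Vgpr, uint64_t Exec) {
  for (unsigned Lane = 0; Lane < kWaveSize; ++Lane)
    if (Exec & (uint64_t{1} << Lane))
      return Vgpr[Lane];
  return Vgpr[0];
}

// True if every active lane holds the same value, i.e. replacing the
// per-lane value with readFirstLane() loses no information.
inline bool isUniform(const LaneValues &Vgpr, uint64_t Exec) {
  const uint32_t First = readFirstLane(Vgpr, Exec);
  for (unsigned Lane = 0; Lane < kWaveSize; ++Lane)
    if ((Exec & (uint64_t{1} << Lane)) && Vgpr[Lane] != First)
      return false;
  return true;
}
```

If the offset is wrongly treated as uniform, every lane other than the first active one silently loads from the first lane's offset. That is the kind of miscompile one would expect when divergence information is not passed correctly through the SelectionDAG.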

void SIInstrInfo::legalizeGenericOperand(MachineBasicBlock &InsertMBB,		void SIInstrInfo::legalizeGenericOperand(MachineBasicBlock &InsertMBB,
MachineBasicBlock::iterator I,		MachineBasicBlock::iterator I,
const TargetRegisterClass *DstRC,		const TargetRegisterClass *DstRC,
MachineOperand &Op,		MachineOperand &Op,
MachineRegisterInfo &MRI,		MachineRegisterInfo &MRI,
const DebugLoc &DL) const {		const DebugLoc &DL) const {
▲ Show 20 Lines • Show All 611 Lines • ▼ Show 20 Lines	case AMDGPU::S_XNOR_B32:
lowerScalarXnor(Worklist, Inst);		lowerScalarXnor(Worklist, Inst);
Inst.eraseFromParent();		Inst.eraseFromParent();
continue;		continue;

case AMDGPU::S_XNOR_B64:		case AMDGPU::S_XNOR_B64:
splitScalar64BitBinaryOp(Worklist, Inst, AMDGPU::S_XNOR_B32, MDT);		splitScalar64BitBinaryOp(Worklist, Inst, AMDGPU::S_XNOR_B32, MDT);
Inst.eraseFromParent();		Inst.eraseFromParent();
continue;		continue;

case AMDGPU::S_BUFFER_LOAD_DWORD_SGPR:
case AMDGPU::S_BUFFER_LOAD_DWORDX2_SGPR:
case AMDGPU::S_BUFFER_LOAD_DWORDX4_SGPR:
case AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR:
case AMDGPU::S_BUFFER_LOAD_DWORDX16_SGPR: {
unsigned VDst;
unsigned NewOpcode;

switch(Opcode) {
case AMDGPU::S_BUFFER_LOAD_DWORD_SGPR:
NewOpcode = AMDGPU::BUFFER_LOAD_DWORD_OFFEN;
VDst = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);
break;
case AMDGPU::S_BUFFER_LOAD_DWORDX2_SGPR:
NewOpcode = AMDGPU::BUFFER_LOAD_DWORDX2_OFFEN;
VDst = MRI.createVirtualRegister(&AMDGPU::VReg_64RegClass);
break;
case AMDGPU::S_BUFFER_LOAD_DWORDX4_SGPR:
NewOpcode = AMDGPU::BUFFER_LOAD_DWORDX4_OFFEN;
VDst = MRI.createVirtualRegister(&AMDGPU::VReg_128RegClass);
break;
case AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR:
case AMDGPU::S_BUFFER_LOAD_DWORDX16_SGPR:
splitScalarBuffer(Worklist, Inst);
Inst.eraseFromParent();
continue;
}

const MachineOperand *VAddr = getNamedOperand(Inst, AMDGPU::OpName::soff);
auto Add = MRI.getUniqueVRegDef(VAddr->getReg());
unsigned Offset = 0;

// FIXME: This isn't safe because the addressing mode doesn't work
// correctly if vaddr is negative.
//
// FIXME: Should probably be done somewhere else, maybe SIFoldOperands.
//
// See if we can extract an immediate offset by recognizing one of these:
// V_ADD_I32_e32 dst, imm, src1
// V_ADD_I32_e32 dst, (S_MOV_B32 imm), src1
// V_ADD will be removed by "Remove dead machine instructions".
if (Add &&
(Add->getOpcode() == AMDGPU::V_ADD_I32_e32 ||
Add->getOpcode() == AMDGPU::V_ADD_U32_e32 ||
Add->getOpcode() == AMDGPU::V_ADD_U32_e64)) {
static const unsigned SrcNames[2] = {
AMDGPU::OpName::src0,
AMDGPU::OpName::src1,
};

// Find a literal offset in one of source operands.
for (int i = 0; i < 2; i++) {
const MachineOperand *Src =
getNamedOperand(*Add, SrcNames[i]);

if (Src->isReg()) {
MachineInstr *Def = MRI.getUniqueVRegDef(Src->getReg());
if (Def) {
if (Def->isMoveImmediate())
Src = &Def->getOperand(1);
else if (Def->isCopy()) {
auto Mov = MRI.getUniqueVRegDef(Def->getOperand(1).getReg());
if (Mov && Mov->isMoveImmediate()) {
Src = &Mov->getOperand(1);
}
}
}
}

if (Src) {
if (Src->isImm())
Offset = Src->getImm();
else if (Src->isCImm())
Offset = Src->getCImm()->getZExtValue();
}

if (Offset && isLegalMUBUFImmOffset(Offset)) {
VAddr = getNamedOperand(*Add, SrcNames[!i]);
break;
}

Offset = 0;
}
}

MachineInstr *NewInstr =
BuildMI(*MBB, Inst, Inst.getDebugLoc(),
get(NewOpcode), VDst)
.add(*VAddr) // vaddr
.add(*getNamedOperand(Inst, AMDGPU::OpName::sbase)) // srsrc
.addImm(0) // soffset
.addImm(Offset) // offset
.addImm(getNamedOperand(Inst, AMDGPU::OpName::glc)->getImm())
.addImm(0) // slc
.addImm(0) // tfe
.cloneMemRefs(Inst)
.getInstr();

MRI.replaceRegWith(getNamedOperand(Inst, AMDGPU::OpName::sdst)->getReg(),
VDst);
addUsersToMoveToVALUWorklist(VDst, MRI, Worklist);
Inst.eraseFromParent();

// Legalize all operands other than the offset. Notably, convert the srsrc
// into SGPRs using v_readfirstlane if needed.
legalizeOperands(*NewInstr, MDT);
continue;
}
}		}

if (NewOpcode == AMDGPU::INSTRUCTION_LIST_END) {		if (NewOpcode == AMDGPU::INSTRUCTION_LIST_END) {
// We cannot move this instruction to the VALU, so we should try to		// We cannot move this instruction to the VALU, so we should try to
// legalize its operands instead.		// legalize its operands instead.
legalizeOperands(Inst, MDT);		legalizeOperands(Inst, MDT);
continue;		continue;
}		}
▲ Show 20 Lines • Show All 465 Lines • ▼ Show 20 Lines	BuildMI(MBB, MII, DL, get(TargetOpcode::REG_SEQUENCE), ResultReg)
.addImm(AMDGPU::sub0)		.addImm(AMDGPU::sub0)
.addReg(TmpReg)		.addReg(TmpReg)
.addImm(AMDGPU::sub1);		.addImm(AMDGPU::sub1);

MRI.replaceRegWith(Dest.getReg(), ResultReg);		MRI.replaceRegWith(Dest.getReg(), ResultReg);
addUsersToMoveToVALUWorklist(ResultReg, MRI, Worklist);		addUsersToMoveToVALUWorklist(ResultReg, MRI, Worklist);
}		}

void SIInstrInfo::splitScalarBuffer(SetVectorType &Worklist,
MachineInstr &Inst) const {
MachineBasicBlock &MBB = *Inst.getParent();
MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();

MachineBasicBlock::iterator MII = Inst;
auto &DL = Inst.getDebugLoc();

MachineOperand &Dest = *getNamedOperand(Inst, AMDGPU::OpName::sdst);;
MachineOperand &Rsrc = *getNamedOperand(Inst, AMDGPU::OpName::sbase);
MachineOperand &Offset = *getNamedOperand(Inst, AMDGPU::OpName::soff);
MachineOperand &Glc = *getNamedOperand(Inst, AMDGPU::OpName::glc);

unsigned Opcode = Inst.getOpcode();
unsigned NewOpcode = AMDGPU::BUFFER_LOAD_DWORDX4_OFFEN;
unsigned Count = 0;
const TargetRegisterClass *DestRC = MRI.getRegClass(Dest.getReg());
const TargetRegisterClass *NewDestRC = RI.getEquivalentVGPRClass(DestRC);

switch(Opcode) {
default:
return;
case AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR:
Count = 2;
break;
case AMDGPU::S_BUFFER_LOAD_DWORDX16_SGPR:
Count = 4;
break;
}

// FIXME: Should also attempt to build VAddr and Offset like the non-split
// case (see call site for this function)

// Create a vector of result registers
SmallVector<unsigned, 8> ResultRegs;
for (unsigned i = 0; i < Count ; ++i) {
unsigned ResultReg = MRI.createVirtualRegister(&AMDGPU::VReg_128RegClass);
MachineInstr &NewMI = *BuildMI(MBB, MII, DL, get(NewOpcode), ResultReg)
.addReg(Offset.getReg()) // offset
.addReg(Rsrc.getReg()) // rsrc
.addImm(0) // soffset
.addImm(i << 4) // inst_offset
.addImm(Glc.getImm()) // glc
.addImm(0) // slc
.addImm(0) // tfe
.addMemOperand(*Inst.memoperands_begin());
// Extract the 4 32 bit sub-registers from the result to add into the final REG_SEQUENCE
auto &NewDestOp = NewMI.getOperand(0);
for (unsigned i = 0 ; i < 4 ; i++)
ResultRegs.push_back(buildExtractSubReg(MII, MRI, NewDestOp, &AMDGPU::VReg_128RegClass,
RI.getSubRegFromChannel(i), &AMDGPU::VGPR_32RegClass));
}
// Create a new combined result to replace original with
unsigned FullDestReg = MRI.createVirtualRegister(NewDestRC);
MachineInstrBuilder CombinedResBuilder = BuildMI(MBB, MII, DL,
get(TargetOpcode::REG_SEQUENCE), FullDestReg);

for (unsigned i = 0 ; i < Count * 4 ; ++i) {
CombinedResBuilder
.addReg(ResultRegs[i])
.addImm(RI.getSubRegFromChannel(i));
}

MRI.replaceRegWith(Dest.getReg(), FullDestReg);
addUsersToMoveToVALUWorklist(FullDestReg, MRI, Worklist);
}

void SIInstrInfo::addUsersToMoveToVALUWorklist(		void SIInstrInfo::addUsersToMoveToVALUWorklist(
unsigned DstReg,		unsigned DstReg,
MachineRegisterInfo &MRI,		MachineRegisterInfo &MRI,
SetVectorType &Worklist) const {		SetVectorType &Worklist) const {
for (MachineRegisterInfo::use_iterator I = MRI.use_begin(DstReg),		for (MachineRegisterInfo::use_iterator I = MRI.use_begin(DstReg),
E = MRI.use_end(); I != E;) {		E = MRI.use_end(); I != E;) {
MachineInstr &UseMI = *I->getParent();		MachineInstr &UseMI = *I->getParent();
if (!canReadVGPR(UseMI, I.getOperandNo())) {		if (!canReadVGPR(UseMI, I.getOperandNo())) {
▲ Show 20 Lines • Show All 602 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.h

	Show First 20 Lines • Show All 434 Lines • ▼ Show 20 Lines
	/// offset field.			/// offset field.
	int64_t getSMRDEncodedOffset(const MCSubtargetInfo &ST, int64_t ByteOffset);			int64_t getSMRDEncodedOffset(const MCSubtargetInfo &ST, int64_t ByteOffset);

	/// \returns true if this offset is small enough to fit in the SMRD			/// \returns true if this offset is small enough to fit in the SMRD
	/// offset field. \p ByteOffset should be the offset in bytes and			/// offset field. \p ByteOffset should be the offset in bytes and
	/// not the encoded offset.			/// not the encoded offset.
	bool isLegalSMRDImmOffset(const MCSubtargetInfo &ST, int64_t ByteOffset);			bool isLegalSMRDImmOffset(const MCSubtargetInfo &ST, int64_t ByteOffset);

	// Given Imm, split it into the values to put into the SOffset and ImmOffset
	// fields in an MUBUF instruction. Return false if it is not possible (due to a
	// hardware bug needing a workaround).
	bool splitMUBUFOffset(uint32_t Imm, uint32_t &SOffset, uint32_t &ImmOffset,			bool splitMUBUFOffset(uint32_t Imm, uint32_t &SOffset, uint32_t &ImmOffset,
	const GCNSubtarget *Subtarget);			const GCNSubtarget *Subtarget, uint32_t Align = 4);

	/// \returns true if the intrinsic is divergent			/// \returns true if the intrinsic is divergent
	bool isIntrinsicSourceOfDivergence(unsigned IntrID);			bool isIntrinsicSourceOfDivergence(unsigned IntrID);

	} // end namespace AMDGPU			} // end namespace AMDGPU
	} // end namespace llvm			} // end namespace llvm

	#endif // LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPUBASEINFO_H			#endif // LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPUBASEINFO_H

llvm/trunk/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp

Show First 20 Lines • Show All 882 Lines • ▼ Show 20 Lines	bool isLegalSMRDImmOffset(const MCSubtargetInfo &ST, int64_t ByteOffset) {
int64_t EncodedOffset = getSMRDEncodedOffset(ST, ByteOffset);		int64_t EncodedOffset = getSMRDEncodedOffset(ST, ByteOffset);
return isGCN3Encoding(ST) ?		return isGCN3Encoding(ST) ?
isUInt<20>(EncodedOffset) : isUInt<8>(EncodedOffset);		isUInt<20>(EncodedOffset) : isUInt<8>(EncodedOffset);
}		}

// Given Imm, split it into the values to put into the SOffset and ImmOffset		// Given Imm, split it into the values to put into the SOffset and ImmOffset
// fields in an MUBUF instruction. Return false if it is not possible (due to a		// fields in an MUBUF instruction. Return false if it is not possible (due to a
// hardware bug needing a workaround).		// hardware bug needing a workaround).
		//
		// The required alignment ensures that individual address components remain
		// aligned if they are aligned to begin with. It also ensures that additional
		// offsets within the given alignment can be added to the resulting ImmOffset.
bool splitMUBUFOffset(uint32_t Imm, uint32_t &SOffset, uint32_t &ImmOffset,		bool splitMUBUFOffset(uint32_t Imm, uint32_t &SOffset, uint32_t &ImmOffset,
const GCNSubtarget *Subtarget) {		const GCNSubtarget *Subtarget, uint32_t Align) {
const uint32_t Align = 4;
const uint32_t MaxImm = alignDown(4095, Align);		const uint32_t MaxImm = alignDown(4095, Align);
uint32_t Overflow = 0;		uint32_t Overflow = 0;

if (Imm > MaxImm) {		if (Imm > MaxImm) {
if (Imm <= MaxImm + 64) {		if (Imm <= MaxImm + 64) {
// Use an SOffset inline constant for 4..64		// Use an SOffset inline constant for 4..64
Overflow = Imm - MaxImm;		Overflow = Imm - MaxImm;
Imm = MaxImm;		Imm = MaxImm;
▲ Show 20 Lines • Show All 46 Lines • Show Last 20 Lines
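
The visible part of `splitMUBUFOffset` can be sketched as a standalone function. This is a simplified model, not the LLVM implementation: `splitSmallOffset` and `roundDown` are hypothetical names, the subtarget-specific hardware-bug check mentioned in the comment is ignored, and offsets beyond `MaxImm + 64` are reported as unhandled because that branch lies in the lines not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Round Value down to a multiple of Align (hypothetical helper).
constexpr uint32_t roundDown(uint32_t Value, uint32_t Align) {
  return Value - (Value % Align);
}

// Split Imm into SOffset + ImmOffset, covering only the two cases visible
// in the excerpt. Returns false for larger offsets, which this sketch does
// not model.
inline bool splitSmallOffset(uint32_t Imm, uint32_t Align, uint32_t &SOffset,
                             uint32_t &ImmOffset) {
  const uint32_t MaxImm = roundDown(4095, Align);
  if (Imm <= MaxImm) {
    // Fits entirely into the immediate field.
    SOffset = 0;
    ImmOffset = Imm;
    return true;
  }
  if (Imm <= MaxImm + 64) {
    // Put the small overflow into SOffset as an inline constant.
    SOffset = Imm - MaxImm;
    ImmOffset = MaxImm;
    return true;
  }
  return false; // not modeled: handled by code outside this excerpt
}
```

With `Align = 64`, an offset of 4090 splits into `ImmOffset = 4032` and `SOffset = 58`, which leaves room to add 16, 32 and 48 to the immediate for the follow-up loads of a split v16i32 load.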

llvm/trunk/test/CodeGen/AMDGPU/smrd-fold-offset.mir

	# RUN: llc -march=amdgcn -run-pass si-fix-sgpr-copies -o - %s | FileCheck -check-prefix=GCN %s			# RUN: llc -march=amdgcn -run-pass si-fix-sgpr-copies -o - %s | FileCheck -check-prefix=GCN %s

	# GCN: BUFFER_LOAD_DWORD_OFFEN %{{[0-9]+}}, killed %{{[0-9]+}}, 0, 4095			# GCN-LABEL: name: smrd_vgpr_offset_imm
				# GCN: V_READFIRSTLANE_B32
				# GCN: S_BUFFER_LOAD_DWORD_SGPR
	---			---
	name: smrd_vgpr_offset_imm			name: smrd_vgpr_offset_imm
	body: |			body: |
	bb.0:			bb.0:
	liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0			liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0

	%4:vgpr_32 = COPY $vgpr0			%4:vgpr_32 = COPY $vgpr0
	%3:sgpr_32 = COPY $sgpr3			%3:sgpr_32 = COPY $sgpr3
	%2:sgpr_32 = COPY $sgpr2			%2:sgpr_32 = COPY $sgpr2
	%1:sgpr_32 = COPY $sgpr1			%1:sgpr_32 = COPY $sgpr1
	%0:sgpr_32 = COPY $sgpr0			%0:sgpr_32 = COPY $sgpr0
	%5:sgpr_128 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1, %2, %subreg.sub2, %3, %subreg.sub3			%5:sgpr_128 = REG_SEQUENCE %0, %subreg.sub0, %1, %subreg.sub1, %2, %subreg.sub2, %3, %subreg.sub3
	%6:sreg_32_xm0 = S_MOV_B32 4095			%6:sreg_32_xm0 = S_MOV_B32 4095
	%8:vgpr_32 = COPY %6			%8:vgpr_32 = COPY %6
	%7:vgpr_32 = V_ADD_I32_e32 %4, killed %8, implicit-def dead $vcc, implicit $exec			%7:vgpr_32 = V_ADD_I32_e32 %4, killed %8, implicit-def dead $vcc, implicit $exec
	%10:sreg_32 = COPY %7			%10:sreg_32 = COPY %7
	%9:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_SGPR killed %5, killed %10, 0			%9:sreg_32_xm0_xexec = S_BUFFER_LOAD_DWORD_SGPR killed %5, killed %10, 0
	$vgpr0 = COPY %9			$vgpr0 = COPY %9
	SI_RETURN_TO_EPILOG $vgpr0			SI_RETURN_TO_EPILOG $vgpr0
	...			...

	# GCN: BUFFER_LOAD_DWORD_OFFEN %{{[0-9]+}}, killed %{{[0-9]+}}, 0, 4095			# GCN-LABEL: name: smrd_vgpr_offset_imm_add_u32
				# GCN: V_READFIRSTLANE_B32
				# GCN: S_BUFFER_LOAD_DWORD_SGPR
	---			---
	name: smrd_vgpr_offset_imm_add_u32			name: smrd_vgpr_offset_imm_add_u32
	body: |			body: |
	bb.0:			bb.0:
	liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0			liveins: $sgpr0, $sgpr1, $sgpr2, $sgpr3, $vgpr0

	%4:vgpr_32 = COPY $vgpr0			%4:vgpr_32 = COPY $vgpr0
	%3:sgpr_32 = COPY $sgpr3			%3:sgpr_32 = COPY $sgpr3
	[13 lines not shown]

llvm/trunk/test/CodeGen/AMDGPU/smrd.ll

[286 lines not shown]
define amdgpu_ps float @smrd_vgpr_offset(<4 x i32> inreg %desc, i32 %offset) #0 {		define amdgpu_ps float @smrd_vgpr_offset(<4 x i32> inreg %desc, i32 %offset) #0 {
main_body:		main_body:
%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %offset)		%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %offset)
ret float %r		ret float %r
}		}

; GCN-LABEL: {{^}}smrd_vgpr_offset_imm:		; GCN-LABEL: {{^}}smrd_vgpr_offset_imm:
; GCN-NEXT: %bb.		; GCN-NEXT: %bb.
; GCN-NEXT: buffer_load_dword v{{[0-9]}}, v0, s[0:3], 0 offen offset:4095 ;		; GCN-NEXT: buffer_load_dword v{{[0-9]}}, v0, s[0:3], 0 offen offset:4092 ;
define amdgpu_ps float @smrd_vgpr_offset_imm(<4 x i32> inreg %desc, i32 %offset) #0 {		define amdgpu_ps float @smrd_vgpr_offset_imm(<4 x i32> inreg %desc, i32 %offset) #0 {
main_body:		main_body:
%off = add i32 %offset, 4095		%off = add i32 %offset, 4092
%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %off)		%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %off)
ret float %r		ret float %r
}		}

; GCN-LABEL: {{^}}smrd_vgpr_offset_imm_too_large:		; GCN-LABEL: {{^}}smrd_vgpr_offset_imm_too_large:
; GCN-NEXT: %bb.		; GCN-NEXT: %bb.
; GCN-NEXT: v_add_{{i|u}}32_e32 v0, {{(vcc, )?}}0x1000, v0		; SICI-NEXT: v_add_{{i|u}}32_e32 v0, {{(vcc, )?}}0x1000, v0
; GCN-NEXT: buffer_load_dword v{{[0-9]}}, v0, s[0:3], 0 offen ;		; SICI-NEXT: buffer_load_dword v{{[0-9]}}, v0, s[0:3], 0 offen ;
		; VIGFX9-NEXT: buffer_load_dword v{{[0-9]}}, v0, s[0:3], 4 offen offset:4092 ;
define amdgpu_ps float @smrd_vgpr_offset_imm_too_large(<4 x i32> inreg %desc, i32 %offset) #0 {		define amdgpu_ps float @smrd_vgpr_offset_imm_too_large(<4 x i32> inreg %desc, i32 %offset) #0 {
main_body:		main_body:
%off = add i32 %offset, 4096		%off = add i32 %offset, 4096
%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %off)		%r = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %off)
ret float %r		ret float %r
}		}

; GCN-LABEL: {{^}}smrd_imm_merged:		; GCN-LABEL: {{^}}smrd_imm_merged:
[175 lines not shown]	main_body:
%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
%s.buffer = call <8 x i32> @llvm.amdgcn.s.buffer.load.v8i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)		%s.buffer = call <8 x i32> @llvm.amdgcn.s.buffer.load.v8i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)
%s.buffer.elt = extractelement <8 x i32> %s.buffer, i32 1		%s.buffer.elt = extractelement <8 x i32> %s.buffer, i32 1
%s.buffer.float = bitcast i32 %s.buffer.elt to float		%s.buffer.float = bitcast i32 %s.buffer.elt to float
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp21, float %tmp21, float %tmp21, float %s.buffer.float, i1 true, i1 true) #0
ret void		ret void
}		}

		; SMRD load with a non-const non-uniform offset of > 4 dwords (requires splitting)
		; GCN-LABEL: {{^}}smrd_load_nonconst3:
		; GCN-DAG: buffer_load_dwordx4 v[0:3], v{{[0-9]+}}, s[0:3], 0 offen ;
		; GCN-DAG: buffer_load_dwordx4 v[4:7], v{{[0-9]+}}, s[0:3], 0 offen offset:16 ;
		; GCN-DAG: buffer_load_dwordx4 v[8:11], v{{[0-9]+}}, s[0:3], 0 offen offset:32 ;
		; GCN-DAG: buffer_load_dwordx4 v[12:15], v{{[0-9]+}}, s[0:3], 0 offen offset:48 ;
		; GCN: ; return to shader part epilog
		define amdgpu_ps <16 x float> @smrd_load_nonconst3(<4 x i32> inreg %rsrc, i32 %off) #0 {
		main_body:
		%ld = call <16 x i32> @llvm.amdgcn.s.buffer.load.v16i32(<4 x i32> %rsrc, i32 %off, i32 0)
		%bc = bitcast <16 x i32> %ld to <16 x float>
		ret <16 x float> %bc
		}

		; GCN-LABEL: {{^}}smrd_load_nonconst4:
		; SICI: v_add_i32_e32 v{{[0-9]+}}, vcc, 0xff8, v0 ;
		; SICI-DAG: buffer_load_dwordx4 v[0:3], v{{[0-9]+}}, s[0:3], 0 offen ;
		; SICI-DAG: buffer_load_dwordx4 v[4:7], v{{[0-9]+}}, s[0:3], 0 offen offset:16 ;
		; SICI-DAG: buffer_load_dwordx4 v[8:11], v{{[0-9]+}}, s[0:3], 0 offen offset:32 ;
		; SICI-DAG: buffer_load_dwordx4 v[12:15], v{{[0-9]+}}, s[0:3], 0 offen offset:48 ;
		; VIGFX9-DAG: buffer_load_dwordx4 v[0:3], v{{[0-9]+}}, s[0:3], 56 offen offset:4032 ;
		; VIGFX9-DAG: buffer_load_dwordx4 v[4:7], v{{[0-9]+}}, s[0:3], 56 offen offset:4048 ;
		; VIGFX9-DAG: buffer_load_dwordx4 v[8:11], v{{[0-9]+}}, s[0:3], 56 offen offset:4064 ;
		; VIGFX9-DAG: buffer_load_dwordx4 v[12:15], v{{[0-9]+}}, s[0:3], 56 offen offset:4080 ;
		; GCN: ; return to shader part epilog
		define amdgpu_ps <16 x float> @smrd_load_nonconst4(<4 x i32> inreg %rsrc, i32 %off) #0 {
		main_body:
		%off.2 = add i32 %off, 4088
		%ld = call <16 x i32> @llvm.amdgcn.s.buffer.load.v16i32(<4 x i32> %rsrc, i32 %off.2, i32 0)
		%bc = bitcast <16 x i32> %ld to <16 x float>
		ret <16 x float> %bc
		}

		; GCN-LABEL: {{^}}smrd_load_nonconst5:
		; SICI: v_add_i32_e32 v{{[0-9]+}}, vcc, 0x1004, v0
		; SICI-DAG: buffer_load_dwordx4 v[0:3], v{{[0-9]+}}, s[0:3], 0 offen ;
		; SICI-DAG: buffer_load_dwordx4 v[4:7], v{{[0-9]+}}, s[0:3], 0 offen offset:16 ;
		; SICI-DAG: buffer_load_dwordx4 v[8:11], v{{[0-9]+}}, s[0:3], 0 offen offset:32 ;
		; SICI-DAG: buffer_load_dwordx4 v[12:15], v{{[0-9]+}}, s[0:3], 0 offen offset:48 ;
		; VIGFX9: s_movk_i32 s4, 0xfc0
		; VIGFX9-DAG: buffer_load_dwordx4 v[0:3], v{{[0-9]+}}, s[0:3], s4 offen offset:68 ;
		; VIGFX9-DAG: buffer_load_dwordx4 v[4:7], v{{[0-9]+}}, s[0:3], s4 offen offset:84 ;
		; VIGFX9-DAG: buffer_load_dwordx4 v[8:11], v{{[0-9]+}}, s[0:3], s4 offen offset:100 ;
		; VIGFX9-DAG: buffer_load_dwordx4 v[12:15], v{{[0-9]+}}, s[0:3], s4 offen offset:116 ;
		; GCN: ; return to shader part epilog
		define amdgpu_ps <16 x float> @smrd_load_nonconst5(<4 x i32> inreg %rsrc, i32 %off) #0 {
		main_body:
		%off.2 = add i32 %off, 4100
		%ld = call <16 x i32> @llvm.amdgcn.s.buffer.load.v16i32(<4 x i32> %rsrc, i32 %off.2, i32 0)
		%bc = bitcast <16 x i32> %ld to <16 x float>
		ret <16 x float> %bc
		}

; SMRD load dwordx2		; SMRD load dwordx2
; GCN-LABEL: {{^}}smrd_load_dwordx2:		; GCN-LABEL: {{^}}smrd_load_dwordx2:
; SIVIGFX9: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}		; SIVIGFX9: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
; CI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}		; CI: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
; GCN: s_endpgm		; GCN: s_endpgm
define amdgpu_ps void @smrd_load_dwordx2(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in, i32 inreg %ncoff) #0 {		define amdgpu_ps void @smrd_load_dwordx2(<4 x i32> addrspace(4)* inreg %arg, <4 x i32> addrspace(4)* inreg %arg1, <32 x i8> addrspace(4)* inreg %arg2, i32 inreg %arg3, <2 x i32> %arg4, <2 x i32> %arg5, <2 x i32> %arg6, <3 x i32> %arg7, <2 x i32> %arg8, <2 x i32> %arg9, <2 x i32> %arg10, float %arg11, float %arg12, float %arg13, float %arg14, float %arg15, float %arg16, float %arg17, float %arg18, float %arg19, <4 x i32> addrspace(4)* inreg %in, i32 inreg %ncoff) #0 {
main_body:		main_body:
%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in		%tmp22 = load <4 x i32>, <4 x i32> addrspace(4)* %in
%s.buffer = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)		%s.buffer = call <2 x i32> @llvm.amdgcn.s.buffer.load.v2i32(<4 x i32> %tmp22, i32 %ncoff, i32 0)
%s.buffer.float = bitcast <2 x i32> %s.buffer to <2 x float>		%s.buffer.float = bitcast <2 x i32> %s.buffer to <2 x float>
%r.1 = extractelement <2 x float> %s.buffer.float, i32 0		%r.1 = extractelement <2 x float> %s.buffer.float, i32 0
%r.2 = extractelement <2 x float> %s.buffer.float, i32 1		%r.2 = extractelement <2 x float> %s.buffer.float, i32 1
call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r.1, float %r.1, float %r.1, float %r.2, i1 true, i1 true) #0		call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r.1, float %r.1, float %r.1, float %r.2, i1 true, i1 true) #0
ret void		ret void
}		}

; GCN-LABEL: {{^}}smrd_uniform_loop:		; GCN-LABEL: {{^}}smrd_uniform_loop:
;		;
; TODO: this should use an s_buffer_load		; TODO: we should keep the loop counter in an SGPR
;		;
; GCN: buffer_load_dword		; GCN: v_readfirstlane_b32
		; GCN: s_buffer_load_dword
define amdgpu_ps float @smrd_uniform_loop(<4 x i32> inreg %desc, i32 %bound) #0 {		define amdgpu_ps float @smrd_uniform_loop(<4 x i32> inreg %desc, i32 %bound) #0 {
main_body:		main_body:
br label %loop		br label %loop

loop:		loop:
%counter = phi i32 [ 0, %main_body ], [ %counter.next, %loop ]		%counter = phi i32 [ 0, %main_body ], [ %counter.next, %loop ]
%sum = phi float [ 0.0, %main_body ], [ %sum.next, %loop ]		%sum = phi float [ 0.0, %main_body ], [ %sum.next, %loop ]
%offset = shl i32 %counter, 2		%offset = shl i32 %counter, 2
[60 lines not shown]