This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Produce flat|global_dwordx3 instructions
Abandoned, Public

Authored by rampitec on Jul 14 2017, 1:07 PM.

Details

Reviewers

vpykhtin
alex-t

Summary

This patch allows dwordx3 loads to be produced from v3i32/v3f32 loads.
Future work is still needed to allow the vectorizer to create such
vec3 loads.
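
As a rough illustration of which loads the patch targets, the conditions used by the combine can be modeled in standalone C++. This is only a sketch under stated assumptions: `AddrSpace`, `LoadInfo`, `TargetInfo` and `canUseDwordX3` are hypothetical names, not LLVM API, and the actual check in the diff operates on a `LoadSDNode` and the subtarget.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the properties the combine inspects.
// None of these names exist in LLVM.
enum class AddrSpace { Global, Constant, Flat, Other };

struct LoadInfo {
  unsigned NumElts;      // number of vector elements in the memory type
  unsigned EltStoreSize; // store size of one element, in bytes
  bool IsExtLoad;        // true for sign/zero/any-extending loads
  AddrSpace AS;          // address space of the load
  bool SelectedAsScalar; // uniform load that would become a scalar load
};

struct TargetInfo {
  bool HasDwordX3;       // the patch requires SEA_ISLANDS or newer
  bool UseFlatForGlobal; // flat instructions are used for global accesses
};

// Returns true when a load matches the shape the patch turns into a
// dwordx3 load: three dword elements, no extension, a flat-addressable
// address space, and not a load that will be selected as scalar.
bool canUseDwordX3(const LoadInfo &L, const TargetInfo &T) {
  assert(L.NumElts > 0 && "expected a vector load");
  bool GlobalOrConstant =
      L.AS == AddrSpace::Global || L.AS == AddrSpace::Constant;
  bool AddrSpaceOk =
      (T.UseFlatForGlobal && GlobalOrConstant) || L.AS == AddrSpace::Flat;
  return T.HasDwordX3 && L.NumElts == 3 && L.EltStoreSize == 4 &&
         !L.IsExtLoad && AddrSpaceOk && !L.SelectedAsScalar;
}
```

In this model a v3i32 global load qualifies only when flat-for-global is enabled, while a v3f16 load never qualifies because its elements are smaller than a dword.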

Event Timeline

rampitec created this revision. Jul 14 2017, 1:07 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Jul 14 2017, 1:07 PM

This is the wrong way to handle this. I did most of the work to avoid having to select the machine nodes so early a long time ago. I have the patches to add v3* to MVT. Short of that a new LOAD_V3 node would be better than going direct to the instruction here

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2554	There's a global offset subtarget feature. I also have the patch to start selecting global, but I haven't committed it yet

In D35435#812242, @arsenm wrote:

This is the wrong way to handle this. I did most of the work to avoid having to select the machine nodes so early a long time ago. I have the patches to add v3* to MVT. Short of that a new LOAD_V3 node would be better than going direct to the instruction here

V3 is alien to LLVM, so I had to do it this way. It works.
What about the MVT patches to support V3? Are they anywhere near ready to be submitted?

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2554	This is not about offsets; it is about support for global_load instructions, which starts with GFX9. flat_load instructions also have offsets there, and that is checked above via Subtarget->hasFlatInstOffsets(). A global offset is actually a feature to support offsetting workitem ids, so it is unrelated.
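
The immediate-offset ranges stated in the patch comments (a 13-bit signed byte offset for global, a 12-bit unsigned offset for flat on GFX9) can be sketched as a standalone range check. `fitsSigned`, `fitsUnsigned` and `offsetFitsImmediate` are hypothetical helpers written to mirror what `llvm::isInt<N>` and `llvm::isUInt<N>` do. The sketch follows the ranges as the comments state them; it is not the patch's exact condition.

```cpp
#include <cstdint>

// Standalone equivalent of llvm::isInt<N>: X fits in an N-bit signed field.
template <unsigned N> constexpr bool fitsSigned(int64_t X) {
  static_assert(N > 0 && N < 64, "width out of range");
  return X >= -(int64_t(1) << (N - 1)) && X < (int64_t(1) << (N - 1));
}

// Standalone equivalent of llvm::isUInt<N>: X fits in an N-bit unsigned field.
template <unsigned N> constexpr bool fitsUnsigned(int64_t X) {
  static_assert(N > 0 && N < 64, "width out of range");
  return X >= 0 && X < (int64_t(1) << N);
}

// Hypothetical helper: can Off be encoded in the instruction, or must it be
// added to the base pointer first? A zero offset always fits.
bool offsetFitsImmediate(int64_t Off, bool HasFlatInstOffsets, bool IsGlobal) {
  if (Off == 0)
    return true;
  if (!HasFlatInstOffsets)
    return false; // no immediate offset field on this subtarget
  return IsGlobal ? fitsSigned<13>(Off) : fitsUnsigned<12>(Off);
}
```

When the helper returns false, the offset would be folded into the pointer with an add and the immediate field set to zero, as the patch does.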

The implementation of this approach looks good to me. The only question is which way to go to implement v3 vector.

In D35435#817495, @vpykhtin wrote:

The implementation of this approach looks good to me. The only question is which way to go to implement v3 vector.

Making v3 a generally legal, simple type is probably the right thing to do. This would solve not only the problem with loads but also compute on the resulting vector. Currently such compute is done on a 4-component vector that legalization creates by promoting v3 to v4.

However, it seems to be a long way off, because a lot of places are designed to work only with power-of-2 vectors, halving and doubling them. I would do this style of vec3 load in the short term and target legal v3 in the long term.
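
To make the v3 to v4 promotion concrete, here is a minimal sketch that assumes widening rounds the element count up to the next power of two, which matches the v3 -> v4 case described above. `widenedNumElts` and `paddingLanes` are hypothetical helpers, not LLVM functions.

```cpp
#include <cassert>

// Hypothetical helper: round an element count up to the next power of two,
// e.g. 3 -> 4. Counts that are already a power of two are unchanged.
unsigned widenedNumElts(unsigned NumElts) {
  assert(NumElts > 0 && "empty vector");
  unsigned N = 1;
  while (N < NumElts)
    N <<= 1;
  return N;
}

// Hypothetical helper: number of extra lanes that arithmetic on the widened
// vector computes but whose results are never used.
unsigned paddingLanes(unsigned NumElts) {
  return widenedNumElts(NumElts) - NumElts;
}
```

For a 3-element vector this gives one padding lane, so every arithmetic operation on the promoted value works on four components.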

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

In D35435#817506, @arsenm wrote:

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

If legal v3 is around that is certainly preferable.

In D35435#817511, @rampitec wrote:

In D35435#817506, @arsenm wrote:

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

If legal v3 is around that is certainly preferable.

https://github.com/arsenm/llvm/tree/legal-vector3-v2

The first 2 commits here seem to work without test failures (but a few cost model regressions). They succeed in adding the basic types, the few after that need some more work

In D35435#817806, @arsenm wrote:

In D35435#817511, @rampitec wrote:

In D35435#817506, @arsenm wrote:

I could probably get the v3 patch in. IIRC I had all tests passing with a hack to keep the legalization unchanged and then got stuck fixing all cases with proper legalization

If legal v3 is around that is certainly preferable.

https://github.com/arsenm/llvm/tree/legal-vector3-v2

The first 2 commits here seem to work without test failures (but a few cost model regressions). They succeed in adding the basic types, the few after that need some more work

I wonder what happens to passes which like to split a vector in half? I can also see that the vectorizers (load/store and SLP), for instance, are written in a way that does not support non-power-of-2 vectors. I hope that, once v3* is reported as legal, some passes will at least silently bail instead of silently failing.

Anyway, if you are planning to fix these patches and merge them, I will hold the current review. It also does not solve the problem of v3 operations other than loads and potentially stores, because v3 will be promoted on any arithmetic, so it is far from perfect.
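
The concern about passes that halve vectors can be illustrated with a small sketch. `isPow2` and `trySplitInHalf` are hypothetical helpers, not LLVM functions. The split bails out (returns false) for element counts that do not divide into two equal power-of-2 halves, which is the behavior hoped for above.

```cpp
// Hypothetical helper: true if N is a nonzero power of two.
bool isPow2(unsigned N) { return N != 0 && (N & (N - 1)) == 0; }

// Hypothetical helper: split a vector of NumElts elements into two equal
// halves. Returns false and leaves Lo/Hi untouched when that is impossible,
// e.g. for 3 elements, instead of producing an uneven 1 + 2 split.
bool trySplitInHalf(unsigned NumElts, unsigned &Lo, unsigned &Hi) {
  if (NumElts < 2 || !isPow2(NumElts))
    return false; // bail out rather than mis-handle the vector
  Lo = Hi = NumElts / 2;
  return true;
}
```

A pass written this way would skip a legal v3 vector, whereas one that computes `NumElts / 2` unconditionally would drop an element.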

rampitec abandoned this revision. Apr 15 2021, 2:10 PM

Herald added subscribers: kerbowa, jvesely. Apr 15 2021, 2:10 PM

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPUISelLowering.h	3 lines
lib/Target/AMDGPU/AMDGPUISelLowering.cpp	87 lines
lib/Target/AMDGPU/SIISelLowering.h	2 lines
lib/Target/AMDGPU/SIISelLowering.cpp	13 lines
test/CodeGen/AMDGPU/load-global-f32.ll	2 lines
test/CodeGen/AMDGPU/load-global-i32.ll	2 lines
test/CodeGen/AMDGPU/load-vec3.ll	110 lines

Diff 106695

lib/Target/AMDGPU/AMDGPUISelLowering.h

[... 248 lines not shown ...]
public:

AMDGPUAS getAMDGPUAS() const {		AMDGPUAS getAMDGPUAS() const {
return AMDGPUASI;		return AMDGPUASI;
}		}

MVT getFenceOperandTy(const DataLayout &DL) const override {		MVT getFenceOperandTy(const DataLayout &DL) const override {
return MVT::i32;		return MVT::i32;
}		}

		bool isMemOpUniform(const SDNode *N) const;
		bool isMemOpHasNoClobberedMemOperand(const SDNode *N) const;
};		};

namespace AMDGPUISD {		namespace AMDGPUISD {

enum NodeType : unsigned {		enum NodeType : unsigned {
// AMDIL ISD Opcodes		// AMDIL ISD Opcodes
FIRST_NUMBER = ISD::BUILTIN_OP_END,		FIRST_NUMBER = ISD::BUILTIN_OP_END,
UMUL, // 32bit unsigned multiplication		UMUL, // 32bit unsigned multiplication
[... 157 lines not shown ...]

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

[... 2,474 lines not shown ...]
if (LN->isVolatile() || !ISD::isNormalLoad(LN) || hasVolatileUser(LN))
return SDValue();		return SDValue();

SDLoc SL(N);		SDLoc SL(N);
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
EVT VT = LN->getMemoryVT();		EVT VT = LN->getMemoryVT();

unsigned Size = VT.getStoreSize();		unsigned Size = VT.getStoreSize();
unsigned Align = LN->getAlignment();		unsigned Align = LN->getAlignment();
		unsigned AS = LN->getAddressSpace();
if (Align < Size && isTypeLegal(VT)) {		if (Align < Size && isTypeLegal(VT)) {
bool IsFast;		bool IsFast;
unsigned AS = LN->getAddressSpace();

// Expand unaligned loads earlier than legalization. Due to visitation order		// Expand unaligned loads earlier than legalization. Due to visitation order
// problems during legalization, the emitted instructions to pack and unpack		// problems during legalization, the emitted instructions to pack and unpack
// the bytes again are not eliminated in the case of an unaligned copy.		// the bytes again are not eliminated in the case of an unaligned copy.
if (!allowsMisalignedMemoryAccesses(VT, AS, Align, &IsFast)) {		if (!allowsMisalignedMemoryAccesses(VT, AS, Align, &IsFast)) {
if (VT.isVector())		if (VT.isVector())
return scalarizeVectorLoad(LN, DAG);		return scalarizeVectorLoad(LN, DAG);

SDValue Ops[2];		SDValue Ops[2];
std::tie(Ops[0], Ops[1]) = expandUnalignedLoad(LN, DAG);		std::tie(Ops[0], Ops[1]) = expandUnalignedLoad(LN, DAG);
return DAG.getMergeValues(Ops, SDLoc(N));		return DAG.getMergeValues(Ops, SDLoc(N));
}		}

if (!IsFast)		if (!IsFast)
return SDValue();		return SDValue();
}		}

		// Create DWORDX3 loads. We cannot create it later because legalizer will
		// split it and there is no way to specify custom lowering.
		bool IsGlobal = (AS == AMDGPUASI.GLOBAL_ADDRESS);
		bool IsConstant = (AS == AMDGPUASI.CONSTANT_ADDRESS);
		bool IsGlobalOrConstant = IsGlobal || IsConstant;
		// TODO: support vec3 stores and move the logic of this condition into
		// shouldCombineMemoryType().
		if (Subtarget->getGeneration() >= AMDGPUSubtarget::SEA_ISLANDS &&
		VT.isExtended() && VT.isVector() && VT.getVectorNumElements() == 3 &&
		// There are no sub-dword vector loads.
		VT.getVectorElementType().getStoreSize() == 4 &&
		// There are no vector extloads.
		LN->getExtensionType() == ISD::LoadExtType::NON_EXTLOAD &&
		((Subtarget->useFlatForGlobal() && IsGlobalOrConstant) ||
		AS == AMDGPUASI.FLAT_ADDRESS) &&
		// Uniform const loads will be selected to scalar loads, which do not have
		// DWORDX3 form.
		!((IsConstant || (IsGlobal && Subtarget->getScalarizeGlobalBehavior() &&
		isMemOpHasNoClobberedMemOperand(LN))) &&
		isMemOpUniform(LN))) {
		SDValue ZeroFlag = DAG.getTargetConstant(0, SL, MVT::i1); // GLC/SLC
		SDValue Ptr = LN->getBasePtr();
		SDValue Offset = LN->getOffset();

		int64_t OffVal = 0;
		if (auto OffC = dyn_cast<ConstantSDNode>(Offset))
		OffVal = OffC->getSExtValue();
		// GFX9: Imm offset: Scratch, Global: 13-bit signed byte offset
		// FLAT: 12-bit unsigned offset (MSB is ignored)
		// TODO: It does not seem to be possible to get any offset after
		// SelectionDAGBuilder.
		if ((OffVal && (!Subtarget->hasFlatInstOffsets() ||
		(IsGlobalOrConstant && !isInt<13>(OffVal)) ||
		!isUInt<12>(OffVal))) ||
		// Is that possible to get non-constant offset recorded in LoadSDNode?
		(!OffVal && !Offset.isUndef())) {
		Ptr = DAG.getNode(ISD::ADD, SL, Ptr.getValueType(), Ptr, Offset);
		OffVal = 0;
		}
		Offset = DAG.getTargetConstant(OffVal, SL, MVT::i16);

		// TODO: introduce AMDGPUISD::LOAD3 returning v4i32 and select it later
		// to allow proper non-constant offset folding with GFX9 flat/global
		// instructions and with buffer_load_dwordx3.
		// That is in case if we are interested in supporting MUBUF or
		// VGPR offsets with SGPR base on GFX9. Both are unclear.
		// However, SelectionDAGBuilder does not really record an offset
		// even if constant, so we still want to get that constant offset
		// and we do not want to replicate SelectADDR/MUBUFOffset code here.
		unsigned Opc = AMDGPU::FLAT_LOAD_DWORDX3;

		if (Subtarget->getGeneration() >= AMDGPUSubtarget::GFX9 &&
		IsGlobalOrConstant)
		Opc = AMDGPU::GLOBAL_LOAD_DWORDX3;

		// We must return a legal v4 type because DAG legalizer cannot widen machine
		// node results, but knows how to widen BUILD_VECTOR.
		EVT V4VT = EVT::getVectorVT(*DAG.getContext(), VT.getVectorElementType(), 4);
		auto NewLoad = DAG.getMachineNode(Opc, SL, V4VT, N->getValueType(1),
		{ Ptr, Offset, ZeroFlag, ZeroFlag });

		auto MMOs = DAG.getMachineFunction().allocateMemRefsArray(1);
		*MMOs = LN->getMemOperand();
		NewLoad->setMemRefs(MMOs, MMOs + 1);

		SmallVector<SDValue, 3> Elts;
		DAG.ExtractVectorElements(SDValue(NewLoad, 0), Elts, 0, 3);
		SDValue V3 = DAG.getBuildVector(VT, SL, { Elts[0], Elts[1], Elts[2] });
		return DAG.getMergeValues({ V3, SDValue(NewLoad, 1) }, SL);
		}

if (!shouldCombineMemoryType(VT))		if (!shouldCombineMemoryType(VT))
return SDValue();		return SDValue();

EVT NewVT = getEquivalentMemType(*DAG.getContext(), VT);		EVT NewVT = getEquivalentMemType(*DAG.getContext(), VT);

SDValue NewLoad		SDValue NewLoad
= DAG.getLoad(NewVT, SL, LN->getChain(),		= DAG.getLoad(NewVT, SL, LN->getChain(),
LN->getBasePtr(), LN->getMemOperand());		LN->getBasePtr(), LN->getMemOperand());
[... 1,277 lines not shown ...]
case AMDGPUISD::BORROW:
return 31;		return 31;
case AMDGPUISD::FP_TO_FP16:		case AMDGPUISD::FP_TO_FP16:
case AMDGPUISD::FP16_ZEXT:		case AMDGPUISD::FP16_ZEXT:
return 16;		return 16;
default:		default:
return 1;		return 1;
}		}
}		}

		bool AMDGPUTargetLowering::isMemOpUniform(const SDNode *N) const {
		const MemSDNode *MemNode = cast<MemSDNode>(N);

		return AMDGPU::isUniformMMO(MemNode->getMemOperand());
		}

		bool AMDGPUTargetLowering::isMemOpHasNoClobberedMemOperand(const SDNode *N)
		const {
		const MemSDNode *MemNode = cast<MemSDNode>(N);
		const Value *Ptr = MemNode->getMemOperand()->getValue();
		const Instruction *I = dyn_cast<Instruction>(Ptr);
		return I && I->getMetadata("amdgpu.noclobber");
		}

lib/Target/AMDGPU/SIISelLowering.h

[... 160 lines not shown ...]
bool allowsMisalignedMemoryAccesses(EVT VT, unsigned AS,
bool *IsFast) const override;		bool *IsFast) const override;

EVT getOptimalMemOpType(uint64_t Size, unsigned DstAlign,		EVT getOptimalMemOpType(uint64_t Size, unsigned DstAlign,
unsigned SrcAlign, bool IsMemset,		unsigned SrcAlign, bool IsMemset,
bool ZeroMemset,		bool ZeroMemset,
bool MemcpyStrSrc,		bool MemcpyStrSrc,
MachineFunction &MF) const override;		MachineFunction &MF) const override;

bool isMemOpUniform(const SDNode *N) const;
bool isMemOpHasNoClobberedMemOperand(const SDNode *N) const;
bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override;		bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override;
bool isCheapAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override;		bool isCheapAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override;

TargetLoweringBase::LegalizeTypeAction		TargetLoweringBase::LegalizeTypeAction
getPreferredVectorAction(EVT VT) const override;		getPreferredVectorAction(EVT VT) const override;

bool shouldConvertConstantLoadToIntImm(const APInt &Imm,		bool shouldConvertConstantLoadToIntImm(const APInt &Imm,
Type *Ty) const override;		Type *Ty) const override;
[... 61 lines not shown ...]

lib/Target/AMDGPU/SIISelLowering.cpp

	[... 813 lines not shown ...]
	}			}

	bool SITargetLowering::isNoopAddrSpaceCast(unsigned SrcAS,			bool SITargetLowering::isNoopAddrSpaceCast(unsigned SrcAS,
	unsigned DestAS) const {			unsigned DestAS) const {
	return isFlatGlobalAddrSpace(SrcAS, AMDGPUASI) &&			return isFlatGlobalAddrSpace(SrcAS, AMDGPUASI) &&
	isFlatGlobalAddrSpace(DestAS, AMDGPUASI);			isFlatGlobalAddrSpace(DestAS, AMDGPUASI);
	}			}

	bool SITargetLowering::isMemOpHasNoClobberedMemOperand(const SDNode *N) const {
	const MemSDNode *MemNode = cast<MemSDNode>(N);
	const Value *Ptr = MemNode->getMemOperand()->getValue();
	const Instruction *I = dyn_cast<Instruction>(Ptr);
	return I && I->getMetadata("amdgpu.noclobber");
	}

	bool SITargetLowering::isCheapAddrSpaceCast(unsigned SrcAS,			bool SITargetLowering::isCheapAddrSpaceCast(unsigned SrcAS,
	unsigned DestAS) const {			unsigned DestAS) const {
	// Flat -> private/local is a simple truncate.			// Flat -> private/local is a simple truncate.
	// Flat -> global is no-op			// Flat -> global is no-op
	if (SrcAS == AMDGPUASI.FLAT_ADDRESS)			if (SrcAS == AMDGPUASI.FLAT_ADDRESS)
	return true;			return true;

	return isNoopAddrSpaceCast(SrcAS, DestAS);			return isNoopAddrSpaceCast(SrcAS, DestAS);
	}			}

	bool SITargetLowering::isMemOpUniform(const SDNode *N) const {
	const MemSDNode *MemNode = cast<MemSDNode>(N);

	return AMDGPU::isUniformMMO(MemNode->getMemOperand());
	}

	TargetLoweringBase::LegalizeTypeAction			TargetLoweringBase::LegalizeTypeAction
	SITargetLowering::getPreferredVectorAction(EVT VT) const {			SITargetLowering::getPreferredVectorAction(EVT VT) const {
	if (VT.getVectorNumElements() != 1 && VT.getScalarType().bitsLE(MVT::i16))			if (VT.getVectorNumElements() != 1 && VT.getScalarType().bitsLE(MVT::i16))
	return TypeSplitVector;			return TypeSplitVector;

	return TargetLoweringBase::getPreferredVectorAction(VT);			return TargetLoweringBase::getPreferredVectorAction(VT);
	}			}

	[... 4,974 lines not shown ...]

test/CodeGen/AMDGPU/load-global-f32.ll

	[... 25 lines not shown ...]
	entry:			entry:
	%tmp0 = load <2 x float>, <2 x float> addrspace(1)* %in			%tmp0 = load <2 x float>, <2 x float> addrspace(1)* %in
	store <2 x float> %tmp0, <2 x float> addrspace(1)* %out			store <2 x float> %tmp0, <2 x float> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}global_load_v3f32:			; FUNC-LABEL: {{^}}global_load_v3f32:
	; GCN-NOHSA: buffer_load_dwordx4			; GCN-NOHSA: buffer_load_dwordx4
	; GCN-HSA: flat_load_dwordx4			; GCN-HSA: flat_load_dwordx3

	; R600: VTX_READ_128			; R600: VTX_READ_128
	define amdgpu_kernel void @global_load_v3f32(<3 x float> addrspace(1)* %out, <3 x float> addrspace(1)* %in) #0 {			define amdgpu_kernel void @global_load_v3f32(<3 x float> addrspace(1)* %out, <3 x float> addrspace(1)* %in) #0 {
	entry:			entry:
	%tmp0 = load <3 x float>, <3 x float> addrspace(1)* %in			%tmp0 = load <3 x float>, <3 x float> addrspace(1)* %in
	store <3 x float> %tmp0, <3 x float> addrspace(1)* %out			store <3 x float> %tmp0, <3 x float> addrspace(1)* %out
	ret void			ret void
	}			}
	[... 51 lines not shown ...]

test/CodeGen/AMDGPU/load-global-i32.ll

	[... 24 lines not shown ...]
	entry:			entry:
	%ld = load <2 x i32>, <2 x i32> addrspace(1)* %in			%ld = load <2 x i32>, <2 x i32> addrspace(1)* %in
	store <2 x i32> %ld, <2 x i32> addrspace(1)* %out			store <2 x i32> %ld, <2 x i32> addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}global_load_v3i32:			; FUNC-LABEL: {{^}}global_load_v3i32:
	; GCN-NOHSA: buffer_load_dwordx4			; GCN-NOHSA: buffer_load_dwordx4
	; GCN-HSA: flat_load_dwordx4			; GCN-HSA: flat_load_dwordx3

	; EG: VTX_READ_128			; EG: VTX_READ_128
	define amdgpu_kernel void @global_load_v3i32(<3 x i32> addrspace(1)* %out, <3 x i32> addrspace(1)* %in) #0 {			define amdgpu_kernel void @global_load_v3i32(<3 x i32> addrspace(1)* %out, <3 x i32> addrspace(1)* %in) #0 {
	entry:			entry:
	%ld = load <3 x i32>, <3 x i32> addrspace(1)* %in			%ld = load <3 x i32>, <3 x i32> addrspace(1)* %in
	store <3 x i32> %ld, <3 x i32> addrspace(1)* %out			store <3 x i32> %ld, <3 x i32> addrspace(1)* %out
	ret void			ret void
	}			}
	[... 480 lines not shown ...]

test/CodeGen/AMDGPU/load-vec3.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=tonga -mattr=+flat-for-global < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s
				; RUN: llc -march=amdgcn -mcpu=bonaire -mattr=-flat-for-global < %s | FileCheck -check-prefix=GCN -check-prefix=GCN-MUBUF %s
				; RUN: llc -march=amdgcn -mcpu=gfx901 < %s | FileCheck -check-prefix=GCN -check-prefix=GFX9 %s

				; GCN-LABEL: {{^}}load_global_v3i32:
				; VI: flat_load_dwordx3
				; GFX9: global_load_dwordx3
				; GCN-MUBUF-DAG: buffer_load_dwordx2 v
				; GCN-MUBUF-DAG: buffer_load_dword v
				define amdgpu_kernel void @load_global_v3i32(float addrspace(1)* nocapture readonly %in, <3 x float> addrspace(1)* nocapture %out) {
				%id = tail call i32 @llvm.amdgcn.workitem.id.x()
				%gep_in = getelementptr inbounds float, float addrspace(1)* %in, i32 %id
				%gep_in_v3 = bitcast float addrspace(1)* %gep_in to <3 x i32> addrspace(1)*
				%load = load <3 x i32>, <3 x i32> addrspace(1)* %gep_in_v3, align 4
				%gep_out = getelementptr inbounds <3 x float>, <3 x float> addrspace(1)* %out, i32 %id
				%vec3i = bitcast <3 x i32> %load to <3 x float>
				store <3 x float> %vec3i, <3 x float> addrspace(1)* %gep_out, align 16
				ret void
				}

				; GCN-LABEL: {{^}}load_global_v3f32:
				; VI: flat_load_dwordx3
				; GFX9: global_load_dwordx3
				; GCN-MUBUF-DAG: buffer_load_dwordx2 v
				; GCN-MUBUF-DAG: buffer_load_dword v
				define amdgpu_kernel void @load_global_v3f32(float addrspace(1)* nocapture readonly %in, <3 x float> addrspace(1)* nocapture %out) {
				%id = tail call i32 @llvm.amdgcn.workitem.id.x()
				%gep_in = getelementptr inbounds float, float addrspace(1)* %in, i32 %id
				%gep_in_v3 = bitcast float addrspace(1)* %gep_in to <3 x float> addrspace(1)*
				%load = load <3 x float>, <3 x float> addrspace(1)* %gep_in_v3, align 4
				%val = fadd <3 x float> %load, %load
				%gep_out = getelementptr inbounds <3 x float>, <3 x float> addrspace(1)* %out, i32 %id
				store <3 x float> %val, <3 x float> addrspace(1)* %gep_out, align 16
				ret void
				}

				; GCN-LABEL: {{^}}load_constant_v3i32:
				; VI: flat_load_dwordx3
				; GFX9: global_load_dwordx3
				; GCN-MUBUF-DAG: buffer_load_dwordx2 v
				; GCN-MUBUF-DAG: buffer_load_dword v
				define amdgpu_kernel void @load_constant_v3i32(i32 addrspace(2)* nocapture readonly %in, <3 x i32> addrspace(1)* nocapture %out) {
				%id = tail call i32 @llvm.amdgcn.workitem.id.x()
				%gep_in = getelementptr inbounds i32, i32 addrspace(2)* %in, i32 %id
				%gep_in_v3 = bitcast i32 addrspace(2)* %gep_in to <3 x i32> addrspace(2)*
				%load = load <3 x i32>, <3 x i32> addrspace(2)* %gep_in_v3, align 4
				%gep_out = getelementptr inbounds <3 x i32>, <3 x i32> addrspace(1)* %out, i32 %id
				store <3 x i32> %load, <3 x i32> addrspace(1)* %gep_out, align 16
				ret void
				}

				; GCN-LABEL: {{^}}load_flat_v3i32:
				; GCN: flat_load_dwordx3
				define amdgpu_kernel void @load_flat_v3i32(i32 addrspace(4)* nocapture readonly %in, <3 x i32> addrspace(1)* nocapture %out) {
				%id = tail call i32 @llvm.amdgcn.workitem.id.x()
				%gep_in = getelementptr inbounds i32, i32 addrspace(4)* %in, i32 %id
				%gep_in_v3 = bitcast i32 addrspace(4)* %gep_in to <3 x i32> addrspace(4)*
				%load = load <3 x i32>, <3 x i32> addrspace(4)* %gep_in_v3, align 4
				%gep_out = getelementptr inbounds <3 x i32>, <3 x i32> addrspace(1)* %out, i32 %id
				store <3 x i32> %load, <3 x i32> addrspace(1)* %gep_out, align 16
				ret void
				}

				; GCN-LABEL: {{^}}load_global_v3f16:
				; GCN: {{buffer|flat|global}}_load_ushort v
				; GCN: {{buffer|flat|global}}_load_ushort v
				; GCN: {{buffer|flat|global}}_load_ushort v
				; GCN-NOT: load_dwordx3
				define amdgpu_kernel void @load_global_v3f16(half addrspace(1)* nocapture readonly %in, <3 x half> addrspace(1)* nocapture %out) {
				%id = tail call i32 @llvm.amdgcn.workitem.id.x()
				%gep_in = getelementptr inbounds half, half addrspace(1)* %in, i32 %id
				%gep_in_v3 = bitcast half addrspace(1)* %gep_in to <3 x half> addrspace(1)*
				%load = load <3 x half>, <3 x half> addrspace(1)* %gep_in_v3, align 2
				%val = fadd <3 x half> %load, %load
				%gep_out = getelementptr inbounds <3 x half>, <3 x half> addrspace(1)* %out, i32 %id
				store <3 x half> %val, <3 x half> addrspace(1)* %gep_out, align 8
				ret void
				}

				; GCN-LABEL: {{^}}load_global_v3i16_to_v3i32:
				; GCN: {{buffer|flat|global}}_load_ushort v
				; GCN: {{buffer|flat|global}}_load_ushort v
				; GCN: {{buffer|flat|global}}_load_ushort v
				; GCN-NOT: load_dwordx3
				define amdgpu_kernel void @load_global_v3i16_to_v3i32(i16 addrspace(1)* nocapture readonly %in, <3 x i32> addrspace(1)* nocapture %out) {
				%id = tail call i32 @llvm.amdgcn.workitem.id.x()
				%gep_in = getelementptr inbounds i16, i16 addrspace(1)* %in, i32 %id
				%gep_in_v3 = bitcast i16 addrspace(1)* %gep_in to <3 x i16> addrspace(1)*
				%load = load <3 x i16>, <3 x i16> addrspace(1)* %gep_in_v3, align 2
				%val = zext <3 x i16> %load to <3 x i32>
				%gep_out = getelementptr inbounds <3 x i32>, <3 x i32> addrspace(1)* %out, i32 %id
				store <3 x i32> %val, <3 x i32> addrspace(1)* %gep_out, align 8
				ret void
				}

				; GCN-LABEL: {{^}}load_global_v3i32_scalar:
				; GCN-DAG: s_load_dwordx2 s[{{[0-9:]+}}], s[{{[0-9:]+}}], 0x0
				; GCN-DAG: s_load_dword s{{[0-9]+}}, s[{{[0-9:]+}}], 0x{{2|8}}
				; GCN-NOT: load_dwordx3
				define amdgpu_kernel void @load_global_v3i32_scalar(float addrspace(1)* nocapture readonly %in, <3 x i32> addrspace(1)* nocapture %out) {
				%gep_in = getelementptr inbounds float, float addrspace(1)* %in, i32 0
				%gep_in_v3 = bitcast float addrspace(1)* %gep_in to <3 x i32> addrspace(1)*
				%load = load <3 x i32>, <3 x i32> addrspace(1)* %gep_in_v3, align 4
				store <3 x i32> %load, <3 x i32> addrspace(1)* %out, align 16
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x() #1

				attributes #1 = { nounwind readnone speculatable }
