This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: buffer.load.format intrinsic and si-load-shrink pass
Abandoned · Public

Authored by nhaehnle on Oct 20 2015, 6:32 AM.


Details

Reviewers

• tstellarAMD
mareko

Summary

This is a work in progress; please provide feedback. It supersedes http://reviews.llvm.org/D13586 based on comments there.

At a high level, the motivation of these changes is:

Add llvm.amdgcn.buffer.load.format intrinsic to expose (almost) the full range of what the hardware can do (minus addr64 mode and D16 variants, both of which should arguably get their own intrinsics).

For both image loads/samples and buffer loads, split the (simple) optimization of determining the appropriate size/dmask of the load into an IR-level CodeGenPrepare pass, while the selection of the appropriate machine instruction stays in SIISelLowering and in the patterns in the TableGen files, respectively.
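The dmask part of this optimization can be illustrated outside of LLVM. The following is only a minimal sketch, assuming a 4-bit dmask (bit 0 = X through bit 3 = W) and packed result lanes as described in the adjustWritemask comments of the diff; the names ShrunkDMask and shrinkDMask are hypothetical and do not appear in the patch.

```cpp
#include <array>
#include <cassert>

// Hypothetical result type, for illustration only (not part of the patch).
struct ShrunkDMask {
  unsigned NewDMask = 0;
  // NewLane[i] is the packed lane that old packed lane i moves to,
  // or -1 if old lane i is not read by anyone.
  std::array<int, 4> NewLane{{-1, -1, -1, -1}};
};

// OldDMask:  which components (X, Y, Z, W) the load currently writes.
// UsedLanes: bit i is set if packed result lane i is actually read.
ShrunkDMask shrinkDMask(unsigned OldDMask, unsigned UsedLanes) {
  assert(OldDMask < 16 && "dmask has only four component bits");
  ShrunkDMask R;
  unsigned Lane = 0; // packed lane index in the old result
  unsigned Next = 0; // next free packed lane in the new result
  for (unsigned Comp = 0; Comp < 4; ++Comp) {
    if (!(OldDMask & (1u << Comp)))
      continue; // component is not loaded, so it occupies no lane
    if (UsedLanes & (1u << Lane)) {
      R.NewDMask |= 1u << Comp;                   // keep this component
      R.NewLane[Lane] = static_cast<int>(Next++); // and repack its lane
    }
    ++Lane;
  }
  return R;
}
```

For example, with an old dmask of 0xA (Y and W loaded) and only packed lane 1 read, this sketch yields a new dmask of 0x8 and moves old lane 1 to new lane 0.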

Known issues / questions:

Some regressions in image/sample-related tests. Also, buffer.load.format tests obviously need to be expanded as noted.

It is annoying to match the MIMG-related intrinsics by name at the IR level, but this seems to be necessary as long as those intrinsics are defined in the target .td files, and so no IntrinsicID is assigned.

The big one: the v3f32 variant (BUFFER_LOAD_FORMAT_XYZ) is currently not supported because v3f32 is not a MachineValueType, and the type legalization step of codegen bails out. IMO, the clean solution would be to argue that since amdgcn is a real existing target that genuinely has 3-element vector instructions for non-crazy reasons, v3f32 (and v3i32) should be added to MachineValueType. But of course, this is a pretty core change and the image path in SIISelLowering did not go that route and hacked around it instead.

I have not actually looked into an alternative design without modifying MVT. Since at the IR level the desired size must be represented by the return type of the intrinsic, any workaround would somehow involve convincing the type legalization step to accept v3f32 (unlike in the image case, where the IR level always uses v4f32 and implicitly stores the size in the dmask).

Perhaps there is a way of telling the target-independent codegen to accept v3f32s without actually adding it to MVT?

So... going the route of adding v3f32 and v3i32 to MVT is the best path that I can see, but before I do that, I want to get feedback on the plan.

Eventually, Mesa and other clients are intended to emit the buffer.load.format intrinsic directly. In the meantime, SI.load.input should be transformed early on. Where is the best place to do that from a design POV? The SILoadShrink pass is an obvious candidate, but it clashes with the name given to that pass.
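The return-type shrinking for buffer.load.format, including the v3f32 limitation raised above, can be sketched without LLVM as follows. This is an illustrative approximation of what adjustReturnType in the diff does, assuming constant extract indices; the function name shrunkWidth is hypothetical. Unlike the dmask case, elements keep their indices, so only trailing unused elements can be dropped.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the number of elements the load could be narrowed to, given the
// original vector width and the element indices that are actually read.
// A result equal to OrigWidth means "no narrowing".
unsigned shrunkWidth(unsigned OrigWidth,
                     const std::vector<unsigned> &UsedIndices) {
  assert(OrigWidth >= 1 && OrigWidth <= 4);
  unsigned Highest = 0;
  for (unsigned Idx : UsedIndices) {
    assert(Idx < OrigWidth && "extract index out of range");
    Highest = std::max(Highest, Idx);
  }
  unsigned NewWidth = Highest + 1;
  // Assumption taken from the summary: there is no v3f32 machine value type,
  // so a 3-element result is rounded up to 4 elements for now.
  if (NewWidth == 3)
    NewWidth = 4;
  return std::min(NewWidth, OrigWidth);
}
```

With a v4f32 load, reading only element 1 would narrow to two elements, while reading element 2 leaves the load at four elements until an XYZ variant can be selected.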


Event Timeline

nhaehnle updated this revision to Diff 37871. Oct 20 2015, 6:32 AM

nhaehnle retitled this revision to AMDGPU: buffer.load.format intrinsic and si-load-shrink pass.

nhaehnle updated this object.

nhaehnle added reviewers: • tstellarAMD, mareko.

Herald added a subscriber: arsenm. Oct 20 2015, 6:32 AM

mareko added a comment.

It can be difficult to select the non-vec4 loads in TGSI. There is not enough readily-available information for that there.

ADDR64 loads shouldn't get their own intrinsics, because:

They are already matched by the generic loads
VI and later don't support them (FLAT opcodes should be used instead, but I think those need more work outside of LLVM)

Also, Mesa doesn't use MUBUF ADDR64.

I wouldn't worry about D16. We don't need it.

include/llvm/IR/IntrinsicsAMDGPU.td
114	A small nit: We define such intrinsics in SIIntrinsics.td

• tstellarAMD added inline comments. Oct 20 2015, 8:22 AM

include/llvm/IR/IntrinsicsAMDGPU.td
114	IntrinsicsAMDGPU.td is the correct place for new intrinsics. At some point we should move all the intrinsics there.

nhaehnle added a comment.

In D13894#271140, @mareko wrote:

It can be difficult to select the non-vec4 loads in TGSI. There is not enough readily-available information for that there.

True.

The underlying reason for supporting the full range of intrinsics is to be able to implement this optimization as an IR pass as Matt suggested. The fact that it will then theoretically be possible to choose non-vec4 loads in Mesa directly is only a nice side-effect.

ADDR64 loads shouldn't get their own intrinsics, because:

They are already matched by the generic loads

VI and later don't support them (FLAT opcodes should be used instead, but I think those need more work outside of LLVM)

Also, Mesa doesn't use MUBUF ADDR64.

Ok.

I wouldn't worry about D16. We don't need it.

Ok.

Any thoughts about the v3f32 issue?

arsenm added inline comments. Oct 22 2015, 11:10 AM

include/llvm/IR/IntrinsicsAMDGPU.td
114	Only the supported ones. Backend-only ones like the control flow intrinsics should stay private in the backend.

arsenm added a comment.

In D13894#271176, @nhaehnle wrote:

Any thoughts about the v3f32 issue?

This is a pretty big problem everywhere. You can use a 3 vector intrinsic type, but it will require some special handling to lower the intrinsic with the artificially illegal type.

arsenm added a comment.

In D13894#279284, @arsenm wrote:

In D13894#271176, @nhaehnle wrote:

Any thoughts about the v3f32 issue?

This is a pretty big problem everywhere. You can use a 3 vector intrinsic type, but it will require some special handling to lower the intrinsic with the artificially illegal type.

By special handling I mean you probably need to directly select the instruction during lowering, before the type legalizer tries to do anything with this.

nhaehnle abandoned this revision. Feb 21 2018, 6:55 AM

Herald added subscribers: t-tye, tpr, dstuttard and 4 others. Feb 21 2018, 6:55 AM

Revision Contents

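Path	Size
include/llvm/IR/IntrinsicsAMDGPU.td	13 lines
lib/Target/AMDGPU/AMDGPU.h	4 lines
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	7 lines
lib/Target/AMDGPU/CMakeLists.txt	1 line
lib/Target/AMDGPU/SIISelLowering.h	2 lines
lib/Target/AMDGPU/SIISelLowering.cpp	112 lines
lib/Target/AMDGPU/SIInstructions.td	50 lines
lib/Target/AMDGPU/SILoadShrink.cpp	246 lines
test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.format.ll	41 lines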

Diff 37871

include/llvm/IR/IntrinsicsAMDGPU.td

	[... 93 lines not shown ...]

	let TargetPrefix = "amdgcn" in {			let TargetPrefix = "amdgcn" in {

	// SI only			// SI only
	def int_amdgcn_buffer_wbinvl1_sc :			def int_amdgcn_buffer_wbinvl1_sc :
	GCCBuiltin<"__builtin_amdgcn_buffer_wbinvl1_sc">,			GCCBuiltin<"__builtin_amdgcn_buffer_wbinvl1_sc">,
	Intrinsic<[], [], []>;			Intrinsic<[], [], []>;

				// SI+
				def int_amdgcn_buffer_load_format :
				Intrinsic<[llvm_anyfloat_ty],
				[llvm_v4i32_ty, // rsrc(SGPR)
				llvm_i32_ty, // sgpr_offset(SGPR or 0)
				llvm_i32_ty, // inst_offset(imm)
				llvm_i32_ty, // vgpr_offset(VGPR or 0)
				llvm_i32_ty, // vgpr_index(VGPR or 0)
				llvm_i1_ty, // glc
				llvm_i1_ty, // slc
				llvm_i1_ty], // tfe
				[IntrNoMem]>;

				[Inline comment] mareko: A small nit: We define such intrinsics in SIIntrinsics.td
				[Inline comment] tstellarAMD: IntrinsicsAMDGPU.td is the correct place for new intrinsics. At some point we should move all the intrinsics there.
				[Inline comment] arsenm: Only the supported ones. Backend-only ones like the control flow intrinsics should stay private in the backend.
	// On CI+			// On CI+
	def int_amdgcn_buffer_wbinvl1_vol :			def int_amdgcn_buffer_wbinvl1_vol :
	GCCBuiltin<"__builtin_amdgcn_buffer_wbinvl1_vol">,			GCCBuiltin<"__builtin_amdgcn_buffer_wbinvl1_vol">,
	Intrinsic<[], [], []>;			Intrinsic<[], [], []>;

	def int_amdgcn_buffer_wbinvl1 :			def int_amdgcn_buffer_wbinvl1 :
	GCCBuiltin<"__builtin_amdgcn_buffer_wbinvl1">,			GCCBuiltin<"__builtin_amdgcn_buffer_wbinvl1">,
	Intrinsic<[], [], []>;			Intrinsic<[], [], []>;
	[... 21 lines not shown ...]

lib/Target/AMDGPU/AMDGPU.h

	[... 35 lines not shown ...]
	FunctionPass *createAMDGPUCFGStructurizerPass();			FunctionPass *createAMDGPUCFGStructurizerPass();

	// SI Passes			// SI Passes
	FunctionPass *createSITypeRewriter();			FunctionPass *createSITypeRewriter();
	FunctionPass *createSIAnnotateControlFlowPass();			FunctionPass *createSIAnnotateControlFlowPass();
	FunctionPass *createSIFoldOperandsPass();			FunctionPass *createSIFoldOperandsPass();
	FunctionPass *createSILowerI1CopiesPass();			FunctionPass *createSILowerI1CopiesPass();
	FunctionPass *createSIShrinkInstructionsPass();			FunctionPass *createSIShrinkInstructionsPass();
				FunctionPass *createSILoadShrinkPass();
	FunctionPass *createSILoadStoreOptimizerPass(TargetMachine &tm);			FunctionPass *createSILoadStoreOptimizerPass(TargetMachine &tm);
	FunctionPass *createSILowerControlFlowPass(TargetMachine &tm);			FunctionPass *createSILowerControlFlowPass(TargetMachine &tm);
	FunctionPass *createSIFixControlFlowLiveIntervalsPass();			FunctionPass *createSIFixControlFlowLiveIntervalsPass();
	FunctionPass *createSIFixSGPRCopiesPass(TargetMachine &tm);			FunctionPass *createSIFixSGPRCopiesPass(TargetMachine &tm);
	FunctionPass *createSIFixSGPRLiveRangesPass();			FunctionPass *createSIFixSGPRLiveRangesPass();
	FunctionPass *createSICodeEmitterPass(formatted_raw_ostream &OS);			FunctionPass *createSICodeEmitterPass(formatted_raw_ostream &OS);
	FunctionPass *createSIInsertWaits(TargetMachine &tm);			FunctionPass *createSIInsertWaits(TargetMachine &tm);
	FunctionPass *createSIPrepareScratchRegs();			FunctionPass *createSIPrepareScratchRegs();

	void initializeSIFoldOperandsPass(PassRegistry &);			void initializeSIFoldOperandsPass(PassRegistry &);
	extern char &SIFoldOperandsID;			extern char &SIFoldOperandsID;

	void initializeSILowerI1CopiesPass(PassRegistry &);			void initializeSILowerI1CopiesPass(PassRegistry &);
	extern char &SILowerI1CopiesID;			extern char &SILowerI1CopiesID;

				void initializeSILoadShrinkPass(PassRegistry &);
				extern char &SILoadShrinkPassID;

	void initializeSILoadStoreOptimizerPass(PassRegistry &);			void initializeSILoadStoreOptimizerPass(PassRegistry &);
	extern char &SILoadStoreOptimizerID;			extern char &SILoadStoreOptimizerID;

	// Passes common to R600 and SI			// Passes common to R600 and SI
	FunctionPass *createAMDGPUPromoteAlloca(const AMDGPUSubtarget &ST);			FunctionPass *createAMDGPUPromoteAlloca(const AMDGPUSubtarget &ST);
	Pass *createAMDGPUStructurizeCFGPass();			Pass *createAMDGPUStructurizeCFGPass();
	FunctionPass *createAMDGPUISelDag(TargetMachine &tm);			FunctionPass *createAMDGPUISelDag(TargetMachine &tm);
	ModulePass *createAMDGPUAlwaysInlinePass();			ModulePass *createAMDGPUAlwaysInlinePass();
	[... 83 lines not shown ...]

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[... 42 lines not shown ...]	extern "C" void LLVMInitializeAMDGPUTarget() {
RegisterTargetMachine<R600TargetMachine> X(TheAMDGPUTarget);		RegisterTargetMachine<R600TargetMachine> X(TheAMDGPUTarget);
RegisterTargetMachine<GCNTargetMachine> Y(TheGCNTarget);		RegisterTargetMachine<GCNTargetMachine> Y(TheGCNTarget);

PassRegistry *PR = PassRegistry::getPassRegistry();		PassRegistry *PR = PassRegistry::getPassRegistry();
initializeSILowerI1CopiesPass(*PR);		initializeSILowerI1CopiesPass(*PR);
initializeSIFoldOperandsPass(*PR);		initializeSIFoldOperandsPass(*PR);
initializeSIFixSGPRLiveRangesPass(*PR);		initializeSIFixSGPRLiveRangesPass(*PR);
initializeSIFixControlFlowLiveIntervalsPass(*PR);		initializeSIFixControlFlowLiveIntervalsPass(*PR);
		initializeSILoadShrinkPass(*PR);
initializeSILoadStoreOptimizerPass(*PR);		initializeSILoadStoreOptimizerPass(*PR);
}		}

static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {		static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {
if (TT.getOS() == Triple::AMDHSA)		if (TT.getOS() == Triple::AMDHSA)
return make_unique<AMDGPUHSATargetObjectFile>();		return make_unique<AMDGPUHSATargetObjectFile>();

return make_unique<TargetLoweringObjectFileELF>();		return make_unique<TargetLoweringObjectFileELF>();
[... 101 lines not shown ...]	public:
void addPreSched2() override;		void addPreSched2() override;
void addPreEmitPass() override;		void addPreEmitPass() override;
};		};

class GCNPassConfig : public AMDGPUPassConfig {		class GCNPassConfig : public AMDGPUPassConfig {
public:		public:
GCNPassConfig(TargetMachine *TM, PassManagerBase &PM)		GCNPassConfig(TargetMachine *TM, PassManagerBase &PM)
: AMDGPUPassConfig(TM, PM) { }		: AMDGPUPassConfig(TM, PM) { }
		void addCodeGenPrepare() override;
bool addPreISel() override;		bool addPreISel() override;
bool addInstSelector() override;		bool addInstSelector() override;
void addFastRegAlloc(FunctionPass *RegAllocPass) override;		void addFastRegAlloc(FunctionPass *RegAllocPass) override;
void addOptimizedRegAlloc(FunctionPass *RegAllocPass) override;		void addOptimizedRegAlloc(FunctionPass *RegAllocPass) override;
void addPreRegAlloc() override;		void addPreRegAlloc() override;
void addPostRegAlloc() override;		void addPostRegAlloc() override;
void addPreSched2() override;		void addPreSched2() override;
void addPreEmitPass() override;		void addPreEmitPass() override;
[... 84 lines not shown ...]
TargetPassConfig *R600TargetMachine::createPassConfig(PassManagerBase &PM) {		TargetPassConfig *R600TargetMachine::createPassConfig(PassManagerBase &PM) {
return new R600PassConfig(this, PM);		return new R600PassConfig(this, PM);
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// GCN Pass Setup		// GCN Pass Setup
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		void GCNPassConfig::addCodeGenPrepare() {
		addPass(createSILoadShrinkPass());
		AMDGPUPassConfig::addCodeGenPrepare();
		}

bool GCNPassConfig::addPreISel() {		bool GCNPassConfig::addPreISel() {
AMDGPUPassConfig::addPreISel();		AMDGPUPassConfig::addPreISel();
addPass(createSinkingPass());		addPass(createSinkingPass());
addPass(createSITypeRewriter());		addPass(createSITypeRewriter());
addPass(createSIAnnotateControlFlowPass());		addPass(createSIAnnotateControlFlowPass());
return false;		return false;
}		}

[... 60 lines not shown ...]

lib/Target/AMDGPU/CMakeLists.txt

[... 44 lines not shown ...]	add_llvm_target(AMDGPUCodeGen
SIAnnotateControlFlow.cpp		SIAnnotateControlFlow.cpp
SIFixControlFlowLiveIntervals.cpp		SIFixControlFlowLiveIntervals.cpp
SIFixSGPRCopies.cpp		SIFixSGPRCopies.cpp
SIFixSGPRLiveRanges.cpp		SIFixSGPRLiveRanges.cpp
SIFoldOperands.cpp		SIFoldOperands.cpp
SIInsertWaits.cpp		SIInsertWaits.cpp
SIInstrInfo.cpp		SIInstrInfo.cpp
SIISelLowering.cpp		SIISelLowering.cpp
		SILoadShrink.cpp
SILoadStoreOptimizer.cpp		SILoadStoreOptimizer.cpp
SILowerControlFlow.cpp		SILowerControlFlow.cpp
SILowerI1Copies.cpp		SILowerI1Copies.cpp
SIMachineFunctionInfo.cpp		SIMachineFunctionInfo.cpp
SIPrepareScratchRegs.cpp		SIPrepareScratchRegs.cpp
SIRegisterInfo.cpp		SIRegisterInfo.cpp
SIShrinkInstructions.cpp		SIShrinkInstructions.cpp
SITypeRewriter.cpp		SITypeRewriter.cpp
)		)

add_subdirectory(AsmParser)		add_subdirectory(AsmParser)
add_subdirectory(InstPrinter)		add_subdirectory(InstPrinter)
add_subdirectory(TargetInfo)		add_subdirectory(TargetInfo)
add_subdirectory(MCTargetDesc)		add_subdirectory(MCTargetDesc)
add_subdirectory(Utils)		add_subdirectory(Utils)

lib/Target/AMDGPU/SIISelLowering.h

[... 36 lines not shown ...]	class SITargetLowering : public AMDGPUTargetLowering {
SDValue LowerFDIV32(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFDIV32(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFDIV64(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFDIV64(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFDIV(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFDIV(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG, bool Signed) const;		SDValue LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG, bool Signed) const;
SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerTrig(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerTrig(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerBRCOND(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerBRCOND(SDValue Op, SelectionDAG &DAG) const;

void adjustWritemask(MachineSDNode *&N, SelectionDAG &DAG) const;

SDValue performUCharToFloatCombine(SDNode *N,		SDValue performUCharToFloatCombine(SDNode *N,
DAGCombinerInfo &DCI) const;		DAGCombinerInfo &DCI) const;
SDValue performSHLPtrCombine(SDNode *N,		SDValue performSHLPtrCombine(SDNode *N,
unsigned AS,		unsigned AS,
DAGCombinerInfo &DCI) const;		DAGCombinerInfo &DCI) const;
SDValue performAndCombine(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue performAndCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performOrCombine(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue performOrCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performClassCombine(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue performClassCombine(SDNode *N, DAGCombinerInfo &DCI) const;
[... 74 lines not shown ...]

lib/Target/AMDGPU/SIISelLowering.cpp

[... 1,993 lines not shown ...]	if (Node->getValueType(0) == MVT::f32)
return FloatToBits(Node->getValueAPF().convertToFloat());		return FloatToBits(Node->getValueAPF().convertToFloat());

return -1;		return -1;
}		}

return -1;		return -1;
}		}

/// \brief Helper function for adjustWritemask
static unsigned SubIdx2Lane(unsigned Idx) {
switch (Idx) {
default: return 0;
case AMDGPU::sub0: return 0;
case AMDGPU::sub1: return 1;
case AMDGPU::sub2: return 2;
case AMDGPU::sub3: return 3;
}
}

/// \brief Adjust the writemask of MIMG instructions
void SITargetLowering::adjustWritemask(MachineSDNode *&Node,
SelectionDAG &DAG) const {
SDNode *Users[4] = { };
unsigned Lane = 0;
unsigned OldDmask = Node->getConstantOperandVal(0);
unsigned NewDmask = 0;

// Try to figure out the used register components
for (SDNode::use_iterator I = Node->use_begin(), E = Node->use_end();
I != E; ++I) {

// Abort if we can't understand the usage
if (!I->isMachineOpcode() ||
I->getMachineOpcode() != TargetOpcode::EXTRACT_SUBREG)
return;

// Lane means which subreg of %VGPRa_VGPRb_VGPRc_VGPRd is used.
// Note that subregs are packed, i.e. Lane==0 is the first bit set
// in OldDmask, so it can be any of X,Y,Z,W; Lane==1 is the second bit
// set, etc.
Lane = SubIdx2Lane(I->getConstantOperandVal(1));

// Set which texture component corresponds to the lane.
unsigned Comp;
for (unsigned i = 0, Dmask = OldDmask; i <= Lane; i++) {
assert(Dmask);
Comp = countTrailingZeros(Dmask);
Dmask &= ~(1 << Comp);
}

// Abort if we have more than one user per component
if (Users[Lane])
return;

Users[Lane] = *I;
NewDmask |= 1 << Comp;
}

// Abort if there's no change
if (NewDmask == OldDmask)
return;

// Adjust the writemask in the node
std::vector<SDValue> Ops;
Ops.push_back(DAG.getTargetConstant(NewDmask, SDLoc(Node), MVT::i32));
Ops.insert(Ops.end(), Node->op_begin() + 1, Node->op_end());
Node = (MachineSDNode*)DAG.UpdateNodeOperands(Node, Ops);

// If we only got one lane, replace it with a copy
// (if NewDmask has only one bit set...)
if (NewDmask && (NewDmask & (NewDmask-1)) == 0) {
SDValue RC = DAG.getTargetConstant(AMDGPU::VGPR_32RegClassID, SDLoc(),
MVT::i32);
SDNode *Copy = DAG.getMachineNode(TargetOpcode::COPY_TO_REGCLASS,
SDLoc(), Users[Lane]->getValueType(0),
SDValue(Node, 0), RC);
DAG.ReplaceAllUsesWith(Users[Lane], Copy);
return;
}

// Update the users of the node with the new indices
for (unsigned i = 0, Idx = AMDGPU::sub0; i < 4; ++i) {

SDNode *User = Users[i];
if (!User)
continue;

SDValue Op = DAG.getTargetConstant(Idx, SDLoc(User), MVT::i32);
DAG.UpdateNodeOperands(User, User->getOperand(0), Op);

switch (Idx) {
default: break;
case AMDGPU::sub0: Idx = AMDGPU::sub1; break;
case AMDGPU::sub1: Idx = AMDGPU::sub2; break;
case AMDGPU::sub2: Idx = AMDGPU::sub3; break;
}
}
}

static bool isFrameIndexOp(SDValue Op) {		static bool isFrameIndexOp(SDValue Op) {
if (Op.getOpcode() == ISD::AssertZext)		if (Op.getOpcode() == ISD::AssertZext)
Op = Op.getOperand(0);		Op = Op.getOperand(0);

return isa<FrameIndexSDNode>(Op);		return isa<FrameIndexSDNode>(Op);
}		}

/// \brief Legalize target independent instructions (e.g. INSERT_SUBREG)		/// \brief Legalize target independent instructions (e.g. INSERT_SUBREG)
[... 19 lines not shown ...]
}		}

/// \brief Fold the instructions after selecting them.		/// \brief Fold the instructions after selecting them.
SDNode *SITargetLowering::PostISelFolding(MachineSDNode *Node,		SDNode *SITargetLowering::PostISelFolding(MachineSDNode *Node,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
const SIInstrInfo *TII =		const SIInstrInfo *TII =
static_cast<const SIInstrInfo *>(Subtarget->getInstrInfo());		static_cast<const SIInstrInfo *>(Subtarget->getInstrInfo());

if (TII->isMIMG(Node->getMachineOpcode()))		if (TII->isMIMG(Node->getMachineOpcode())) {
adjustWritemask(Node, DAG);		unsigned NumWritten = countPopulation(Node->getConstantOperandVal(0));
		if (NumWritten == 1) {
		SDValue RC =
		DAG.getTargetConstant(AMDGPU::VGPR_32RegClassID, SDLoc(), MVT::i32);
		SDNode *Copy = DAG.getMachineNode(
		TargetOpcode::COPY_TO_REGCLASS, SDLoc(),
		Node->getValueType(0).getVectorElementType(), SDValue(Node, 0), RC);

		for (SDNode *Use : Node->uses()) {
		if (Use != Copy) {
		assert(Use->isMachineOpcode() &&
		Use->getMachineOpcode() == TargetOpcode::EXTRACT_SUBREG);
		DAG.ReplaceAllUsesWith(Use, Copy);
		}
		}
		}
		return Node;
		}

if (Node->getMachineOpcode() == AMDGPU::INSERT_SUBREG ||		if (Node->getMachineOpcode() == AMDGPU::INSERT_SUBREG ||
Node->getMachineOpcode() == AMDGPU::REG_SEQUENCE) {		Node->getMachineOpcode() == AMDGPU::REG_SEQUENCE) {
legalizeTargetIndependentNode(Node, DAG);		legalizeTargetIndependentNode(Node, DAG);
return Node;		return Node;
}		}
return Node;		return Node;
}		}
[... 170 lines not shown ...]

lib/Target/AMDGPU/SIInstructions.td

	[... 2,976 lines not shown ...]

	defm : MUBUF_Load_Dword <i32, BUFFER_LOAD_DWORD_OFFSET, BUFFER_LOAD_DWORD_OFFEN,			defm : MUBUF_Load_Dword <i32, BUFFER_LOAD_DWORD_OFFSET, BUFFER_LOAD_DWORD_OFFEN,
	BUFFER_LOAD_DWORD_IDXEN, BUFFER_LOAD_DWORD_BOTHEN>;			BUFFER_LOAD_DWORD_IDXEN, BUFFER_LOAD_DWORD_BOTHEN>;
	defm : MUBUF_Load_Dword <v2i32, BUFFER_LOAD_DWORDX2_OFFSET, BUFFER_LOAD_DWORDX2_OFFEN,			defm : MUBUF_Load_Dword <v2i32, BUFFER_LOAD_DWORDX2_OFFSET, BUFFER_LOAD_DWORDX2_OFFEN,
	BUFFER_LOAD_DWORDX2_IDXEN, BUFFER_LOAD_DWORDX2_BOTHEN>;			BUFFER_LOAD_DWORDX2_IDXEN, BUFFER_LOAD_DWORDX2_BOTHEN>;
	defm : MUBUF_Load_Dword <v4i32, BUFFER_LOAD_DWORDX4_OFFSET, BUFFER_LOAD_DWORDX4_OFFEN,			defm : MUBUF_Load_Dword <v4i32, BUFFER_LOAD_DWORDX4_OFFSET, BUFFER_LOAD_DWORDX4_OFFEN,
	BUFFER_LOAD_DWORDX4_IDXEN, BUFFER_LOAD_DWORDX4_BOTHEN>;			BUFFER_LOAD_DWORDX4_IDXEN, BUFFER_LOAD_DWORDX4_BOTHEN>;

				multiclass MUBUF_Load_Format <ValueType vt, MUBUF offset, MUBUF offen,
				MUBUF idxen, MUBUF bothen> {
				def : Pat <
				(vt (int_amdgcn_buffer_load_format v4i32:$rsrc, i32:$soffset, imm:$offset,
				0, 0, imm:$glc, imm:$slc, imm:$tfe)),
				(offset $rsrc, $soffset, (as_i16imm $offset), (as_i1imm $glc),
				(as_i1imm $slc), (as_i1imm $tfe))
				>;

				def : Pat <
				(vt (int_amdgcn_buffer_load_format v4i32:$rsrc, i32:$soffset, imm:$offset,
				i32:$voffset, 0, imm:$glc, imm:$slc,
				imm:$tfe)),
				(offen $voffset, $rsrc, $soffset, (as_i16imm $offset), (as_i1imm $glc),
				(as_i1imm $slc), (as_i1imm $tfe))
				>;

				def : Pat <
				(vt (int_amdgcn_buffer_load_format v4i32:$rsrc, i32:$soffset, imm:$offset,
				0, i32:$vindex, imm:$glc, imm:$slc,
				imm:$tfe)),
				(idxen $vindex, $rsrc, $soffset, (as_i16imm $offset), (as_i1imm $glc),
				(as_i1imm $slc), (as_i1imm $tfe))
				>;

				def : Pat <
				(vt (int_amdgcn_buffer_load_format v4i32:$rsrc, i32:$soffset, imm:$offset,
				i32:$voffset, i32:$vindex, imm:$glc,
				imm:$slc, imm:$tfe)),
				(bothen (REG_SEQUENCE VReg_64, $vindex, sub0, $voffset, sub1), $rsrc,
				$soffset, (as_i16imm $offset), (as_i1imm $glc), (as_i1imm $slc),
				(as_i1imm $tfe))
				>;
				}

				defm : MUBUF_Load_Format <f32, BUFFER_LOAD_FORMAT_X_OFFSET,
				BUFFER_LOAD_FORMAT_X_OFFEN,
				BUFFER_LOAD_FORMAT_X_IDXEN,
				BUFFER_LOAD_FORMAT_X_BOTHEN>;

				defm : MUBUF_Load_Format <v2f32, BUFFER_LOAD_FORMAT_XY_OFFSET,
				BUFFER_LOAD_FORMAT_XY_OFFEN,
				BUFFER_LOAD_FORMAT_XY_IDXEN,
				BUFFER_LOAD_FORMAT_XY_BOTHEN>;

				defm : MUBUF_Load_Format <v4f32, BUFFER_LOAD_FORMAT_XYZW_OFFSET,
				BUFFER_LOAD_FORMAT_XYZW_OFFEN,
				BUFFER_LOAD_FORMAT_XYZW_IDXEN,
				BUFFER_LOAD_FORMAT_XYZW_BOTHEN>;

	class MUBUFScratchStorePat <MUBUF Instr, ValueType vt, PatFrag st> : Pat <			class MUBUFScratchStorePat <MUBUF Instr, ValueType vt, PatFrag st> : Pat <
	(st vt:$value, (MUBUFScratch v4i32:$srsrc, i32:$vaddr, i32:$soffset,			(st vt:$value, (MUBUFScratch v4i32:$srsrc, i32:$vaddr, i32:$soffset,
	u16imm:$offset)),			u16imm:$offset)),
	(Instr $value, $vaddr, $srsrc, $soffset, $offset, 0, 0, 0)			(Instr $value, $vaddr, $srsrc, $soffset, $offset, 0, 0, 0)
	>;			>;

	def : MUBUFScratchStorePat <BUFFER_STORE_BYTE_OFFEN, i32, truncstorei8_private>;			def : MUBUFScratchStorePat <BUFFER_STORE_BYTE_OFFEN, i32, truncstorei8_private>;
	def : MUBUFScratchStorePat <BUFFER_STORE_SHORT_OFFEN, i32, truncstorei16_private>;			def : MUBUFScratchStorePat <BUFFER_STORE_SHORT_OFFEN, i32, truncstorei16_private>;
	[... 294 lines not shown ...]

lib/Target/AMDGPU/SILoadShrink.cpp

This file was added.

				//===-- SILoadShrink.cpp - Shrink load intrinsics -------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// Target-specific intrinsics that perform loads, such as buffer.load.format.*
				// and TODO (tex)
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "llvm/IR/Function.h"
				#include "llvm/IR/InstVisitor.h"
				#include "llvm/IR/Intrinsics.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/Support/Debug.h"

				#define DEBUG_TYPE "si-load-shrink"

				using namespace llvm;

				namespace {

				class SILoadShrink : public FunctionPass, public InstVisitor<SILoadShrink> {
				public:
				static char ID;

				SILoadShrink() : FunctionPass(ID) {}

				bool runOnFunction(Function &F) override;

				const char *getPassName() const override { return "SI Load Shrink"; }

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.setPreservesCFG();
				FunctionPass::getAnalysisUsage(AU);
				}

				void visitCallInst(CallInst &C);
				void visitIntrinsicInst(IntrinsicInst &I);

				private:
				void adjustReturnType(IntrinsicInst &I);
				void adjustWritemask(CallInst &Call, int DMaskIndex);

				bool AnyChanges;
				std::vector<Instruction *> Replaced;
				};

				} // End anonymous namespace.

				INITIALIZE_PASS_BEGIN(SILoadShrink, DEBUG_TYPE, "SI Load Shrink", false, false)
				INITIALIZE_PASS_END(SILoadShrink, DEBUG_TYPE, "SI Load Shrink", false, false)

				char SILoadShrink::ID = 0;

				FunctionPass *llvm::createSILoadShrinkPass() { return new SILoadShrink; }

				bool SILoadShrink::runOnFunction(Function &F) {
				AnyChanges = false;

				visit(F);

				for (Instruction *I : Replaced) {
				I->eraseFromParent();
				}
				Replaced.clear();

				return AnyChanges;
				}

				void SILoadShrink::visitCallInst(CallInst &I) {
				Function *Callee = I.getCalledFunction();
				if (!Callee)
				return;

				// TODO Move SI intrinsics to global list so IntrinsicID works?
				static const char SI_image_sample[] = "llvm.SI.image.sample.";
				static const char SI_gather4[] = "llvm.SI.gather4.";
				if (Callee->getName().startswith(SI_image_sample) ||
				Callee->getName().startswith(SI_gather4)) {
				adjustWritemask(I, 3);
				}
				}

				void SILoadShrink::visitIntrinsicInst(IntrinsicInst &I) {
				Function *Callee = I.getCalledFunction();

				if (Callee->getIntrinsicID() == Intrinsic::amdgcn_buffer_load_format) {
				adjustReturnType(I);
				}
				}

				void SILoadShrink::adjustReturnType(IntrinsicInst &I) {
				Type *OrigType = I.getType();
				if (!OrigType->isVectorTy())
				return;

				const unsigned OrigNumWritten = OrigType->getVectorNumElements();
				uint64_t HighestIndex = 0;

				SmallVector<ExtractElementInst *, 4> Uses;
				for (const auto &use : I.uses()) {
				ExtractElementInst *EE = dyn_cast<ExtractElementInst>(use.getUser());
				if (!EE) {
				return;
				}

				ConstantInt *IndexConstant = dyn_cast<ConstantInt>(EE->getIndexOperand());
				if (!IndexConstant) {
				return;
				}

				Uses.push_back(EE);

				if (EE->use_empty())
				continue;

				HighestIndex = std::max(HighestIndex, IndexConstant->getZExtValue());
				if (HighestIndex + 1 >= OrigNumWritten)
				return;
				}

				unsigned NewNumWritten = HighestIndex + 1;
				if (NewNumWritten == 3) {
				NewNumWritten = 4; // TODO XYZ codegen
				if (NewNumWritten >= OrigNumWritten) {
				return;
				}
				}

				DEBUG(dbgs() << "SILoadShrink: from " << OrigNumWritten << " to "
				<< NewNumWritten << "\n");

				IRBuilder<> Builder(&I);

				Type *NewTypes[1];
				if (NewNumWritten == 1) {
				NewTypes[0] = Type::getFloatTy(I.getContext());
				} else {
				NewTypes[0] =
				VectorType::get(OrigType->getVectorElementType(), NewNumWritten);
				}

				Function *NewCallee =
				Intrinsic::getDeclaration(I.getModule(), I.getIntrinsicID(), NewTypes);
				SmallVector<Value *, 8> Args;
				for (const Use &Arg : I.arg_operands())
				Args.push_back(Arg.get());
				CallInst *NewCall = Builder.CreateCall(NewCallee, Args);

				for (ExtractElementInst *EE : Uses) {
				if (!EE->use_empty()) {
				if (NewNumWritten == 1) {
				EE->replaceAllUsesWith(NewCall);
				} else {
				EE->replaceAllUsesWith(
				Builder.CreateExtractElement(NewCall, EE->getIndexOperand()));
				}
				}
				Replaced.push_back(EE);
				}

				Replaced.push_back(&I);

				AnyChanges = true;
				}

				void SILoadShrink::adjustWritemask(CallInst &Call, int DMaskIndex) {
				ConstantInt *DMaskConstant =
				dyn_cast<ConstantInt>(Call.getArgOperand(DMaskIndex));
				if (!DMaskConstant)
				return;

				const unsigned OrigDMask = DMaskConstant->getZExtValue();
				const unsigned OrigNumWritten = countPopulation(OrigDMask);

				// Collect all uses, bailing out early when everything is used
				const unsigned UseAllMask = (1 << OrigNumWritten) - 1;
				unsigned UseMask = 0;
				SmallVector<std::pair<unsigned, Value *>, 4> Uses;

				for (const auto &use : Call.uses()) {
				ExtractElementInst *EE = dyn_cast<ExtractElementInst>(use.getUser());
				if (!EE) {
				return;
				}

				ConstantInt *IndexConstant = dyn_cast<ConstantInt>(EE->getIndexOperand());
				if (!IndexConstant) {
				return;
				}

				if (EE->use_empty())
				continue;

				const unsigned Index = IndexConstant->getZExtValue();
				UseMask |= 1 << Index;
				if (UseMask == UseAllMask)
				return;

				Uses.emplace_back(Index, EE);
				}

				// If not all written channels are used, compute index remapping
				Type *Int32Ty = Type::getInt32Ty(Call.getContext());
				SmallVector<Value *, 4> Remapped;
				unsigned NewDMask = 0;
				unsigned NewIndex = 0;
				unsigned Component = 0;
				for (unsigned OrigIndex = 0; OrigIndex < OrigNumWritten;
				++OrigIndex, ++Component) {
				while ((OrigDMask & (1 << Component)) == 0)
				++Component;

				ExtractElementInst *NewEE = nullptr;
				if (UseMask & (1 << OrigIndex)) {
				if (OrigIndex != NewIndex) {
				errs() << " Remap " << OrigIndex << " to " << NewIndex << "\n";
				NewEE = ExtractElementInst::Create(&Call,
				ConstantInt::get(Int32Ty, NewIndex));
				NewEE->insertAfter(&Call);
				}

				NewDMask |= 1 << Component;
				++NewIndex;
				}

				Remapped.push_back(NewEE);
				}

				// Commit remapping
				Call.setArgOperand(DMaskIndex, ConstantInt::get(Int32Ty, NewDMask));

				for (const auto &OldIndexUse : Uses) {
				if (Value *RemappedValue = Remapped[OldIndexUse.first]) {
				OldIndexUse.second->replaceAllUsesWith(RemappedValue);
				}
				}

				AnyChanges = true;
				}

test/CodeGen/AMDGPU/llvm.amdgcn.buffer.load.format.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=verde -show-mc-encoding -verify-machineinstrs < %s | FileCheck %s
				; RUN: llc -march=amdgcn -mcpu=tonga -show-mc-encoding -verify-machineinstrs < %s | FileCheck %s

				; TODO:
				; - check soffset & (immediate) offset
				; - check glc/slc
				; - check vector versions
				; - v3f32 version: how?

				; CHECK-LABEL: {{^}}main:
				; CHECK: buffer_load_format_x {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 ; encoding
				; CHECK: buffer_load_format_x {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 idxen ; encoding
				; CHECK: buffer_load_format_x {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen ; encoding
				; CHECK: buffer_load_format_x {{v[0-9]+}}, {{v\[[0-9]+:[0-9]+\]}}, {{s\[[0-9]+:[0-9]+\]}}, 0 idxen offen ; encoding
				; CHECK: s_endpgm

				define void @main([9 x <16 x i8>] addrspace(2)* byval, [17 x <16 x i8>] addrspace(2)* byval, [17 x <4 x i32>] addrspace(2)* byval, [34 x <8 x i32>] addrspace(2)* byval, [16 x <16 x i8>] addrspace(2)* byval, i32 inreg, i32 inreg, i32, i32, i32, i32) #0 {
				main_body:
				%11 = getelementptr [16 x <16 x i8>], [16 x <16 x i8>] addrspace(2)* %4, i64 0, i64 0
				%12 = load <16 x i8>, <16 x i8> addrspace(2)* %11, align 16, !tbaa !0
				%rsrc = bitcast <16 x i8> %12 to <4 x i32>
				%vgpr0 = add i32 %5, %7
				%r0 = call float @llvm.amdgcn.buffer.load.format.f32(<4 x i32> %rsrc, i32 0, i32 0, i32 0, i32 0, i1 0, i1 0, i1 0)
				%r1 = call float @llvm.amdgcn.buffer.load.format.f32(<4 x i32> %rsrc, i32 0, i32 0, i32 0, i32 %5, i1 0, i1 0, i1 0)
				%r2 = call float @llvm.amdgcn.buffer.load.format.f32(<4 x i32> %rsrc, i32 0, i32 0, i32 %7, i32 0, i1 0, i1 0, i1 0)
				%r3 = call float @llvm.amdgcn.buffer.load.format.f32(<4 x i32> %rsrc, i32 0, i32 0, i32 %7, i32 %5, i1 0, i1 0, i1 0)
				call void @llvm.SI.export(i32 15, i32 0, i32 1, i32 12, i32 0, float %r0, float %r1, float %r2, float %r3)
				ret void
				}

				; Function Attrs: nounwind readnone
				declare float @llvm.amdgcn.buffer.load.format.f32(<4 x i32>, i32, i32, i32, i32, i1, i1, i1) #1
				declare <2 x float> @llvm.amdgcn.buffer.load.format.v2f32(<4 x i32>, i32, i32, i32, i32, i1, i1, i1) #1
				declare <4 x float> @llvm.amdgcn.buffer.load.format.v4f32(<4 x i32>, i32, i32, i32, i32, i1, i1, i1) #1

				declare void @llvm.SI.export(i32, i32, i32, i32, i32, float, float, float, float)

				attributes #0 = { "ShaderType"="1" "enable-no-nans-fp-math"="true" }
				attributes #1 = { nounwind readnone }

				!0 = !{!"const", null, i32 1}