This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Tune inlining parameters for AMDGPU target
Closed, Public

Authored by dfukalov on Jul 12 2019, 9:03 AM.

Details

Reviewers

arsenm
rampitec

Commits

rGd912a9ba9b16: [AMDGPU] Tune inlining parameters for AMDGPU target
rL366348: [AMDGPU] Tune inlining parameters for AMDGPU target

Summary

Since the target gains no significant advantage from vectorization,
the threshold bonus for vector instructions should be optional.

The default value of the amdgpu-inline-arg-alloca-cost parameter and the
target's InliningThresholdMultiplier value are tuned accordingly.

Diff Detail

Repository: rL LLVM

Event Timeline

dfukalov created this revision. Jul 12 2019, 9:03 AM

Herald added a project: Restricted Project. Jul 12 2019, 9:03 AM

Herald added subscribers: haicheng, hiraditya, eraman and 8 others.

dfukalov added a project: Restricted Project. Jul 12 2019, 9:04 AM

Harbormaster completed remote builds in B34874: Diff 209505. Jul 12 2019, 9:04 AM

arsenm added inline comments. Jul 12 2019, 9:10 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
276 ↗	(On Diff #209505)	I think this needs a name indicating it's an inliner control. getInlinerVectorBonusPercent?
llvm/lib/Analysis/InlineCost.cpp
883 ↗	(On Diff #209505)	How does it decide what "vector dense" means? We already report costs that approximately say to scalarize everything, and scalarization is free.
llvm/test/CodeGen/AMDGPU/amdgpu-inline.ll
28–41 ↗	(On Diff #209505)	Why was this test changed? I would expect a separate version without the control flow.

Agree with Matt on the callback name change. Otherwise LGTM.

dfukalov marked 2 inline comments as done. Jul 15 2019, 8:58 AM

dfukalov added inline comments.

llvm/lib/Analysis/InlineCost.cpp
883 ↗	(On Diff #209505)	The inliner estimates "dense" as the percentage of LLVM IR instructions with vector arguments. If more than 50% of a function's instructions are vector instructions, the full bonus is added to the threshold. For functions with 10% to 50% vector instructions, half of the bonus is added. I guess this bonus logic is based on x86 extensions such as MMX and others.
llvm/test/CodeGen/AMDGPU/amdgpu-inline.ll
28–41 ↗	(On Diff #209505)	Without the modification, the test @test_inliner_multi_pvt_ptr_cutoff starts to fail, because I decreased the threshold multiplier and the cost of the function became slightly higher. The test is not about control flow; it should check the amdgpu-inline-arg-alloca-cutoff value: foo_private_ptr2 should be inlined in test_inliner_multi_pvt_ptr and should not be inlined in test_inliner_multi_pvt_ptr_cutoff.
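
The vector-density rule described in the InlineCost.cpp reply above can be sketched as follows. This is a minimal illustration, not the code in InlineCost.cpp: the function name `vectorBonus` and its parameters are hypothetical, and it assumes the bonus is derived from the threshold as a plain percentage.

```cpp
// Hypothetical helper (not LLVM's actual code): returns the vector bonus
// granted to a callee, given its instruction counts.
//   threshold              inline threshold the bonus is derived from
//   vectorBonusPercent     target-provided percentage (150 by default, 0 on AMDGPU)
//   numInstructions        total IR instructions analyzed in the callee
//   numVectorInstructions  how many of those operate on vector values
inline int vectorBonus(int threshold, int vectorBonusPercent,
                       unsigned numInstructions,
                       unsigned numVectorInstructions) {
  const int fullBonus = threshold * vectorBonusPercent / 100;
  // More than 50% vector instructions: the full bonus applies.
  if (numVectorInstructions > numInstructions / 2)
    return fullBonus;
  // More than 10% vector instructions: half of the bonus applies.
  if (numVectorInstructions > numInstructions / 10)
    return fullBonus / 2;
  // Otherwise the callee is not considered vector-dense.
  return 0;
}
```

With a percentage of 0, as the AMDGPU override in this revision returns, every tier collapses to a bonus of 0, which is what makes the bonus effectively optional per target.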

Diff updated as requested

Harbormaster completed remote builds in B35009: Diff 209878. Jul 15 2019, 8:59 AM

dfukalov marked an inline comment as done. Jul 15 2019, 8:59 AM

LGTM

This revision is now accepted and ready to land. Jul 15 2019, 9:16 AM

eraman added inline comments. Jul 16 2019, 9:26 PM

llvm/lib/Analysis/InlineCost.cpp
883 ↗	(On Diff #209878)	The comment block explaining vector bonuses is still relevant after this change. Instead of removing it, you should modify it to say the bonus percentage is target dependent.

dfukalov marked 2 inline comments as done. Jul 17 2019, 7:20 AM

dfukalov added inline comments.

llvm/lib/Analysis/InlineCost.cpp
883 ↗	(On Diff #209878)	The comment was not removed; it was moved to TargetTransformInfo.h, next to the new function. A note about target dependence was also added.

Closed by commit rL366348: [AMDGPU] Tune inlining parameters for AMDGPU target (authored by dfukalov). Jul 17 2019, 9:55 AM

This revision was automatically updated to reflect the committed changes.

dfukalov marked an inline comment as done.

Revision Contents

Path	Size
llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h	16 lines
llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h	2 lines
llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h	2 lines
llvm/trunk/lib/Analysis/InlineCost.cpp	11 lines
llvm/trunk/lib/Analysis/TargetTransformInfo.cpp	4 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUInline.cpp	2 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h	4 lines
llvm/trunk/test/CodeGen/AMDGPU/amdgpu-inline.ll	7 lines
llvm/trunk/test/Transforms/Inline/AMDGPU/inline-amdgpu-vecbonus.ll	31 lines

Diff 210356

llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h

@@ 257 lines hidden @@ public:
   /// \returns A value by which our inlining threshold should be multiplied.
   /// This is primarily used to bump up the inlining threshold wholesale on
   /// targets where calls are unusually expensive.
   ///
   /// TODO: This is a rather blunt instrument. Perhaps altering the costs of
   /// individual classes of instructions would be better.
   unsigned getInliningThresholdMultiplier() const;

+  /// \returns Vector bonus in percent.
+  ///
+  /// Vector bonuses: We want to more aggressively inline vector-dense kernels
+  /// and apply this bonus based on the percentage of vector instructions. A
+  /// bonus is applied if the vector instructions exceed 50% and half that amount
+  /// is applied if it exceeds 10%. Note that these bonuses are some what
+  /// arbitrary and evolved over time by accident as much as because they are
+  /// principled bonuses.
+  /// FIXME: It would be nice to base the bonus values on something more
+  /// scientific. A target may has no bonus on vector instructions.
+  int getInlinerVectorBonusPercent() const;
+
   /// Estimate the cost of an intrinsic when lowered.
   ///
   /// Mirrors the \c getCallCost method but uses an intrinsic identifier.
   int getIntrinsicCost(Intrinsic::ID IID, Type *RetTy,
                        ArrayRef<Type *> ParamTys,
                        const User *U = nullptr) const;

   /// Estimate the cost of an intrinsic when lowered.
@@ 849 lines hidden @@ public:
   virtual int getGEPCost(Type *PointeeType, const Value *Ptr,
                          ArrayRef<const Value *> Operands) = 0;
   virtual int getExtCost(const Instruction *I, const Value *Src) = 0;
   virtual int getCallCost(FunctionType *FTy, int NumArgs, const User *U) = 0;
   virtual int getCallCost(const Function *F, int NumArgs, const User *U) = 0;
   virtual int getCallCost(const Function *F,
                           ArrayRef<const Value *> Arguments, const User *U) = 0;
   virtual unsigned getInliningThresholdMultiplier() = 0;
+  virtual int getInlinerVectorBonusPercent() = 0;
   virtual int getIntrinsicCost(Intrinsic::ID IID, Type *RetTy,
                                ArrayRef<Type *> ParamTys, const User *U) = 0;
   virtual int getIntrinsicCost(Intrinsic::ID IID, Type *RetTy,
                                ArrayRef<const Value *> Arguments,
                                const User *U) = 0;
   virtual int getMemcpyCost(const Instruction *I) = 0;
   virtual unsigned getEstimatedNumberOfCaseClusters(const SwitchInst &SI,
                                                     unsigned &JTSize) = 0;
@@ 207 lines hidden @@ public:
   }
   int getCallCost(const Function *F,
                   ArrayRef<const Value *> Arguments, const User *U) override {
     return Impl.getCallCost(F, Arguments, U);
   }
   unsigned getInliningThresholdMultiplier() override {
     return Impl.getInliningThresholdMultiplier();
   }
+  int getInlinerVectorBonusPercent() override {
+    return Impl.getInlinerVectorBonusPercent();
+  }
   int getIntrinsicCost(Intrinsic::ID IID, Type *RetTy,
                        ArrayRef<Type *> ParamTys, const User *U = nullptr) override {
     return Impl.getIntrinsicCost(IID, RetTy, ParamTys, U);
   }
   int getIntrinsicCost(Intrinsic::ID IID, Type *RetTy,
                        ArrayRef<const Value *> Arguments,
                        const User *U = nullptr) override {
     return Impl.getIntrinsicCost(IID, RetTy, Arguments, U);
@@ 502 lines hidden @@
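
The three hunks above add the same hook at three layers: the public wrapper, the abstract interface, and the forwarding model. A stripped-down sketch of that forwarding pattern is shown below. All class names here (`TargetInfo`, `DefaultImpl`, `NoVectorBonusImpl`) are hypothetical stand-ins that only mirror the shape visible in the diff; they are not the LLVM classes.

```cpp
#include <memory>
#include <utility>

// Hypothetical target implementations. Each one only has to provide the hook.
struct DefaultImpl {
  int getInlinerVectorBonusPercent() { return 150; }
};
struct NoVectorBonusImpl {
  int getInlinerVectorBonusPercent() { return 0; }
};

// Hypothetical type-erasing wrapper in the style of the diff above.
class TargetInfo {
  // Abstract interface: one pure virtual function per hook.
  struct Concept {
    virtual ~Concept() = default;
    virtual int getInlinerVectorBonusPercent() = 0;
  };

  // Forwards each virtual call to the concrete implementation object.
  template <typename T> struct Model final : Concept {
    T Impl;
    explicit Model(T Impl) : Impl(std::move(Impl)) {}
    int getInlinerVectorBonusPercent() override {
      return Impl.getInlinerVectorBonusPercent();
    }
  };

  std::unique_ptr<Concept> TTIImpl;

public:
  template <typename T>
  explicit TargetInfo(T Impl)
      : TTIImpl(std::make_unique<Model<T>>(std::move(Impl))) {}

  // Public entry point used by clients such as an inline cost analysis.
  int getInlinerVectorBonusPercent() const {
    return TTIImpl->getInlinerVectorBonusPercent();
  }
};
```

In this sketch a client holding a `TargetInfo` never names the concrete implementation type, so a target can change its answer without any change on the caller's side.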

llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h

@@ 134 lines hidden @@ if (NumArgs < 0)
       // function.
       NumArgs = FTy->getNumParams();

     return TTI::TCC_Basic * (NumArgs + 1);
   }

   unsigned getInliningThresholdMultiplier() { return 1; }

+  int getInlinerVectorBonusPercent() { return 150; }
+
   unsigned getMemcpyCost(const Instruction *I) {
     return TTI::TCC_Expensive;
   }

   bool hasBranchDivergence() { return false; }

   bool isSourceOfDivergence(const Value *V) { return false; }

@@ 750 lines hidden @@

llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h

@@ 421 lines hidden @@ case Instruction::AddrSpaceCast:
       return TargetTransformInfo::TCC_Basic;
     }

     return BaseT::getOperationCost(Opcode, Ty, OpTy);
   }

   unsigned getInliningThresholdMultiplier() { return 1; }

+  int getInlinerVectorBonusPercent() { return 150; }
+
   void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                TTI::UnrollingPreferences &UP) {
     // This unrolling functionality is target independent, but to provide some
     // motivation for its intended use, for x86:

     // According to the Intel 64 and IA-32 Architectures Optimization Reference
     // Manual, Intel Core models and later have a loop stream detector (and
     // associated uop queue) that can benefit from partial unrolling.
@@ 1,269 lines hidden @@

llvm/trunk/lib/Analysis/InlineCost.cpp

@@ 874 lines hidden @@ void CallAnalyzer::updateThreshold(CallBase &Call, Function &Callee) {
   };

   // Various bonus percentages. These are multiplied by Threshold to get the
   // bonus values.
   // SingleBBBonus: This bonus is applied if the callee has a single reachable
   // basic block at the given callsite context. This is speculatively applied
   // and withdrawn if more than one basic block is seen.
   //
-  // Vector bonuses: We want to more aggressively inline vector-dense kernels
-  // and apply this bonus based on the percentage of vector instructions. A
-  // bonus is applied if the vector instructions exceed 50% and half that amount
-  // is applied if it exceeds 10%. Note that these bonuses are some what
-  // arbitrary and evolved over time by accident as much as because they are
-  // principled bonuses.
-  // FIXME: It would be nice to base the bonus values on something more
-  // scientific.
-  //
   // LstCallToStaticBonus: This large bonus is applied to ensure the inlining
   // of the last call to a static function as inlining such functions is
   // guaranteed to reduce code size.
   //
   // These bonus percentages may be set to 0 based on properties of the caller
   // and the callsite.
   int SingleBBBonusPercent = 50;
-  int VectorBonusPercent = 150;
+  int VectorBonusPercent = TTI.getInlinerVectorBonusPercent();
   int LastCallToStaticBonus = InlineConstants::LastCallToStaticBonus;

   // Lambda to set all the above bonus and bonus percentages to 0.
   auto DisallowAllBonuses = [&]() {
     SingleBBBonusPercent = 0;
     VectorBonusPercent = 0;
     LastCallToStaticBonus = 0;
   };
@@ 1,336 lines hidden @@

llvm/trunk/lib/Analysis/TargetTransformInfo.cpp

@@ 170 lines hidden @@ int TargetTransformInfo::getCallCost(const Function *F,
   assert(Cost >= 0 && "TTI should not produce negative costs!");
   return Cost;
 }

 unsigned TargetTransformInfo::getInliningThresholdMultiplier() const {
   return TTIImpl->getInliningThresholdMultiplier();
 }

+int TargetTransformInfo::getInlinerVectorBonusPercent() const {
+  return TTIImpl->getInlinerVectorBonusPercent();
+}
+
 int TargetTransformInfo::getGEPCost(Type *PointeeType, const Value *Ptr,
                                     ArrayRef<const Value *> Operands) const {
   return TTIImpl->getGEPCost(PointeeType, Ptr, Operands);
 }

 int TargetTransformInfo::getExtCost(const Instruction *I,
                                     const Value *Src) const {
   return TTIImpl->getExtCost(I, Src);
@@ 1,188 lines hidden @@

llvm/trunk/lib/Target/AMDGPU/AMDGPUInline.cpp

@@ 33 lines hidden @@
 #include "llvm/Support/Debug.h"
 #include "llvm/Transforms/IPO/Inliner.h"

 using namespace llvm;

 #define DEBUG_TYPE "inline"

 static cl::opt<int>
-ArgAllocaCost("amdgpu-inline-arg-alloca-cost", cl::Hidden, cl::init(2200),
+ArgAllocaCost("amdgpu-inline-arg-alloca-cost", cl::Hidden, cl::init(1500),
               cl::desc("Cost of alloca argument"));

 // If the amount of scratch memory to eliminate exceeds our ability to allocate
 // it into registers we gain nothing by aggressively inlining functions for that
 // heuristic.
 static cl::opt<unsigned>
 ArgAllocaCutoff("amdgpu-inline-arg-alloca-cutoff", cl::Hidden, cl::init(256),
                 cl::desc("Maximum alloca size to use for inline cost"));
@@ 178 lines hidden @@
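
The part of AMDGPUInline.cpp that consumes these two options is hidden in this view, so the following is only a sketch of how they plausibly interact, based on the option descriptions and the comment above. The function `allocaThresholdBonus` and its parameters are hypothetical; the defaults 1500 and 256 are the values shown in the hunk.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (an assumption, not the code in AMDGPUInline.cpp):
// returns the extra inline threshold granted because the call passes
// pointers to private (scratch) allocas.
//   allocaSizes      sizes in bytes of the allocas reachable from the arguments
//   argAllocaCost    value of amdgpu-inline-arg-alloca-cost
//   argAllocaCutoff  value of amdgpu-inline-arg-alloca-cutoff
inline int allocaThresholdBonus(const std::vector<std::uint64_t> &allocaSizes,
                                int argAllocaCost = 1500,
                                std::uint64_t argAllocaCutoff = 256) {
  std::uint64_t total = 0;
  for (std::uint64_t size : allocaSizes) {
    total += size;
    // Too much scratch to promote into registers: inlining for this reason
    // gains nothing, so no bonus is granted.
    if (total > argAllocaCutoff)
      return 0;
  }
  // Grant the bonus only if at least one alloca was actually passed.
  return total != 0 ? argAllocaCost : 0;
}
```

Under this reading, two 64-byte allocas stay below the 256-byte cutoff and earn the bonus, while a 200-byte and a 100-byte alloca together exceed it and earn nothing. That is the distinction the test pair test_inliner_multi_pvt_ptr and test_inliner_multi_pvt_ptr_cutoff is described as checking.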

llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

@@ 185 lines hidden @@ public:
   unsigned getVectorSplitCost() { return 0; }

   unsigned getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index,
                           Type *SubTp);

   bool areInlineCompatible(const Function *Caller,
                            const Function *Callee) const;

-  unsigned getInliningThresholdMultiplier() { return 9; }
+  unsigned getInliningThresholdMultiplier() { return 7; }
+
+  int getInlinerVectorBonusPercent() { return 0; }

   int getArithmeticReductionCost(unsigned Opcode,
                                  Type *Ty,
                                  bool IsPairwise);
   int getMinMaxReductionCost(Type *Ty, Type *CondTy,
                              bool IsPairwiseForm,
                              bool IsUnsigned);
 };
@@ 44 lines hidden @@
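
To get a feel for the combined effect of the two AMDGPU overrides above, the sketch below compares the largest threshold a fully vector-dense callee could reach before and after this revision. It rests on two assumptions that are not visible in this diff: the multiplier is applied to the base threshold first, and the vector bonus is then a percentage of the scaled threshold. All other bonuses are ignored. The struct and function names are hypothetical; the numeric settings are the ones shown in the diffs.

```cpp
// Hypothetical bundle of the two target hooks touched by this revision.
struct InlineTuning {
  unsigned thresholdMultiplier; // getInliningThresholdMultiplier()
  int vectorBonusPercent;       // getInlinerVectorBonusPercent()
};

// Values taken from the diffs on this page.
constexpr InlineTuning GenericDefault{1, 150};
constexpr InlineTuning AMDGPUBefore{9, 150}; // bonus was hard-coded to 150
constexpr InlineTuning AMDGPUAfter{7, 0};

// Assumed formula: scale the base threshold, then add the full vector bonus.
constexpr int maxVectorDenseThreshold(int baseThreshold, InlineTuning t) {
  const int scaled = baseThreshold * static_cast<int>(t.thresholdMultiplier);
  return scaled + scaled * t.vectorBonusPercent / 100;
}

// For a base threshold of 100 under these assumptions:
static_assert(maxVectorDenseThreshold(100, GenericDefault) == 250, "");
static_assert(maxVectorDenseThreshold(100, AMDGPUBefore) == 2250, "");
static_assert(maxVectorDenseThreshold(100, AMDGPUAfter) == 700, "");
```

For callees without vector instructions, the same assumptions give 900 before and 700 after, so the change is much larger for vector-heavy code than for scalar code.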

llvm/trunk/test/CodeGen/AMDGPU/amdgpu-inline.ll

@@ 22 lines hidden @@

 if.end: ; preds = %if.then, %entry
   ret void
 }

 define coldcc void @foo_private_ptr2(float addrspace(5)* nocapture %p1, float addrspace(5)* nocapture %p2) {
 entry:
   %tmp1 = load float, float addrspace(5)* %p1, align 4
-  %cmp = fcmp ogt float %tmp1, 1.000000e+00
-  br i1 %cmp, label %if.then, label %if.end
-
-if.then: ; preds = %entry
   %div = fdiv float 2.000000e+00, %tmp1
   store float %div, float addrspace(5)* %p2, align 4
-  br label %if.end
-
-if.end: ; preds = %if.then, %entry
   ret void
 }

 define coldcc float @sin_wrapper(float %x) {
 bb:
   %call = tail call float @_Z3sinf(float %x)
   ret float %call
 }
@@ 105 lines hidden @@

llvm/trunk/test/Transforms/Inline/AMDGPU/inline-amdgpu-vecbonus.ll

+; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -amdgpu-inline --inline-threshold=1 < %s | FileCheck %s
+
+define hidden <16 x i32> @div_vecbonus(<16 x i32> %x, <16 x i32> %y) {
+entry:
+  %div.1 = udiv <16 x i32> %x, %y
+  %div.2 = udiv <16 x i32> %div.1, %y
+  %div.3 = udiv <16 x i32> %div.2, %y
+  %div.4 = udiv <16 x i32> %div.3, %y
+  %div.5 = udiv <16 x i32> %div.4, %y
+  %div.6 = udiv <16 x i32> %div.5, %y
+  %div.7 = udiv <16 x i32> %div.6, %y
+  %div.8 = udiv <16 x i32> %div.7, %y
+  %div.9 = udiv <16 x i32> %div.8, %y
+  %div.10 = udiv <16 x i32> %div.9, %y
+  %div.11 = udiv <16 x i32> %div.10, %y
+  %div.12 = udiv <16 x i32> %div.11, %y
+  ret <16 x i32> %div.12
+}
+
+; CHECK-LABEL: define amdgpu_kernel void @caller_vecbonus
+; CHECK-NOT: udiv
+; CHECK: tail call <16 x i32> @div_vecbonus
+; CHECK: ret void
+define amdgpu_kernel void @caller_vecbonus(<16 x i32> addrspace(1)* nocapture %x, <16 x i32> addrspace(1)* nocapture readonly %y) {
+entry:
+  %tmp = load <16 x i32>, <16 x i32> addrspace(1)* %x
+  %tmp1 = load <16 x i32>, <16 x i32> addrspace(1)* %y
+  %div.i = tail call <16 x i32> @div_vecbonus(<16 x i32> %tmp, <16 x i32> %tmp1)
+  store <16 x i32> %div.i, <16 x i32> addrspace(1)* %x
+  ret void
+}
