This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][Inliner] Remove amdgpu-inline and add a new TTI inline hook
ClosedPublic

Authored by aeubanks on Jan 5 2021, 9:25 PM.

Details

Reviewers

arsenm
rnk
mtrofin
rampitec

Commits

rGa11bf9a7fbd3: [AMDGPU][Inliner] Remove amdgpu-inline and add a new TTI inline hook

Summary

Having a custom inliner doesn't really fit in with the new PM's
pipeline. It's also extra technical debt.

amdgpu-inline only does a couple of custom things compared to the normal
inliner:

- It disables inlining if the number of BBs in a function would exceed some limit
- It increases the threshold if there are pointers to private arrays(?)

These can all be handled as TTI inliner hooks.
There already exists a hook for backends to multiply the inlining
threshold.

This way we can remove the custom amdgpu-inline pass.
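
As a rough illustration of how the two hooks combine (a sketch, not the actual LLVM code): the multiplier scales the base threshold, and the new hook then adds a per-call-site amount. The names `CallSiteInfo`, `DefaultHooks`, `GpuLikeHooks` and `computeThreshold` are made up for this example, and the base value used in the usage note is only illustrative. The multiplier of 11, the bonus of 4000, and the defaults of 1 and 0 are the values that appear in this revision's diff.

```cpp
#include <cassert>

// Hypothetical stand-in for a call site. A real hook would inspect the
// CallBase; here the relevant fact is assumed to be precomputed.
struct CallSiteInfo {
  bool PassesSmallPrivateAlloca;
};

// Default behaviour: multiply by 1, add 0.
struct DefaultHooks {
  unsigned getInliningThresholdMultiplier() const { return 1; }
  unsigned adjustInliningThreshold(const CallSiteInfo &) const { return 0; }
};

// AMDGPU-like behaviour: multiplier of 11 and a bonus of 4000 when a pointer
// to a small private array is passed.
struct GpuLikeHooks {
  unsigned getInliningThresholdMultiplier() const { return 11; }
  unsigned adjustInliningThreshold(const CallSiteInfo &CS) const {
    return CS.PassesSmallPrivateAlloca ? 4000u : 0u;
  }
};

// Same order as the two statements in InlineCost.cpp:
// multiply first, then add the per-call-site adjustment.
template <typename Hooks>
int computeThreshold(int Base, const Hooks &H, const CallSiteInfo &CS) {
  int Threshold = Base;
  Threshold *= static_cast<int>(H.getInliningThresholdMultiplier());
  Threshold += static_cast<int>(H.adjustInliningThreshold(CS));
  return Threshold;
}
```

With an illustrative base of 225, `computeThreshold(225, GpuLikeHooks{}, {true})` gives 225 * 11 + 4000, while the default hooks leave the base unchanged. Because the adjustment is added after the multiplication, the bonus itself is not scaled by the multiplier.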

This caused inline-hint.ll to fail, and after some investigation, it
looks like getInliningThresholdMultiplier() was previously getting
applied twice in amdgpu-inline (https://reviews.llvm.org/D62707 fixed it
not applying at all, so some later inliner change must have fixed
something), so I had to change the threshold in the test.
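
To make the arithmetic behind the test change concrete, here is a small sketch under the assumption that the summary's reading is right: the old pass pre-multiplied the hint threshold, and the generic cost analysis multiplied it again. The function names and the hint value of 325 are assumptions for illustration only. The multiplier of 11 is the AMDGPU value in this revision's diff. The alloca bonus is ignored here.

```cpp
// AMDGPU multiplier from the diff.
constexpr int kMultiplier = 11;

// Assumed old behaviour: amdgpu-inline scaled the hint threshold once, and
// the generic inline cost analysis then scaled the result a second time.
constexpr int oldEffectiveHintThreshold(int Hint) {
  return Hint * kMultiplier * kMultiplier;
}

// New behaviour: only the generic analysis applies the multiplier.
constexpr int newEffectiveHintThreshold(int Hint) {
  return Hint * kMultiplier;
}

// With an illustrative hint threshold of 325 the two differ by a factor of 11.
static_assert(oldEffectiveHintThreshold(325) ==
                  11 * newEffectiveHintThreshold(325),
              "old path applies the multiplier twice");
```

A test that relied on the doubly scaled value would see a much smaller effective threshold after this change, which is consistent with the threshold in inline-hint.ll having to be adjusted.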

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aeubanks created this revision. Jan 5 2021, 9:25 PM

Herald added subscribers: nikic, kerbowa, haicheng and 11 others. Jan 5 2021, 9:25 PM

aeubanks requested review of this revision. Jan 5 2021, 9:25 PM

Herald added a project: Restricted Project. Jan 5 2021, 9:25 PM

Herald added subscribers: llvm-commits, wdng.

aeubanks added reviewers: arsenm, rnk, mtrofin, rampitec. Jan 5 2021, 9:28 PM

Harbormaster completed remote builds in B84158: Diff 314794. Jan 5 2021, 10:00 PM

arsenm added inline comments. Jan 6 2021, 8:43 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1195–1199	I'm not convinced we ever really needed this. I believe the standard inline heuristic will always do this anyway
1204	I'm not sure I like having a target hook for a hack like this

aeubanks added inline comments. Jan 6 2021, 9:03 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1195–1199	There was one case where this forced a coldcc call to be inlined where normally it wouldn't be. Of course, why put coldcc on some random function? Will remove.
1204	Would you rather just remove this altogether?

arsenm added subscribers: vpykhtin, dfukalov. Jan 6 2021, 9:31 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1204	I don't remember the story here. @rampitec @vpykhtin @dfukalov ?

aeubanks added inline comments. Jan 6 2021, 9:54 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1204	It was added in https://reviews.llvm.org/D62917. Apparently it's a hack to help with compile times. That's a bit surprising, is this an AMDGPU specific issue?

rampitec added inline comments. Jan 6 2021, 10:24 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1195–1199	This was added to unwrap several layers of wrappers we have in device lib. Without this heuristic it was not always properly handled by the standard inliner.
1204	It is AMDGPU specific only in a sense. We tend to inline a lot, much more than other targets. Therefore we can have drastic compile time issues. In particular there are several pretty big codes which take hours to compile instead of one or two minutes without this (suboptimal) cutoff.

arsenm added inline comments. Jan 6 2021, 10:28 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1204	Is it possible this was only due to a bug in the subtarget feature compatibility handling? I believe the standard inline analysis specifically looks for wrappers as well

rampitec added inline comments. Jan 6 2021, 10:33 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1204	It did not work this way at least back when it was added. I think we should not change the logic with this patch. This patch is infrastructural, if we want to tune heuristics that would be a separate work.

aeubanks added inline comments. Jan 6 2021, 10:41 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1204	That's fair. I'll keep everything for now. @arsenm Any thoughts on alternatives to target hooks?

arsenm added inline comments. Jan 6 2021, 10:43 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1204	I think we should first try to remove this part and see how it goes

rnk added inline comments. Jan 6 2021, 10:59 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1204	I think it makes sense to avoid building infrastructure (new TTI hooks) if it turns out that it is never needed. It's much harder to remove hooks than it is to add them. However, we want to make sure that Arthur can make progress flipping the default pass manager without doing any AMDGPU tuning work. That's clearly out of scope for him. We could proceed with a minimal set of inliner hooks, flip the default pass manager, and allow downstream AMD folks to sort out any performance problems later. AMD or any other vendor can configure CMake to use the old pass manager if they aren't ready to triage the performance impacts of the NPM. Would that be OK?

Removing the wrapper check seems to work fine: D94187

rebase past https://reviews.llvm.org/D94187

aeubanks edited the summary of this revision. Jan 12 2021, 9:08 PM

Harbormaster completed remote builds in B84970: Diff 316316. Jan 12 2021, 9:34 PM

arsenm added inline comments. Jan 13 2021, 6:47 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1184–1194	I still don't really want to expand this compile time hack to a target hook

foad added a subscriber: foad. Jan 13 2021, 7:05 AM

foad added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
292	"A value to be added to the inlining threshold"? You can't add something by something.

foad added inline comments. Jan 13 2021, 7:11 AM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
74	"Maximum number of BBs allowed ..."

comment fixups
move max BB check into areInlineCompatible
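
For readers skimming the thread, the compile-time cutoff being moved can be modeled without any LLVM types roughly as follows. This is a sketch only: `withinBlockBudget` and its parameters are hypothetical names, and the default limit of 1100 mirrors the `amdgpu-inline-max-bb` option in the diff.

```cpp
#include <cstddef>

// Returns true if inlining stays within the basic-block budget.
// Assumes both functions have bodies, i.e. at least one basic block each.
bool withinBlockBudget(std::size_t CallerBBs, std::size_t CalleeBBs,
                       bool CalleeHasInlineHint, std::size_t MaxBB = 1100) {
  // A limit of 0 disables the check, and inlinehint callees are exempt.
  if (MaxBB == 0 || CalleeHasInlineHint)
    return true;
  // A single-block callee does not increase the total number of blocks,
  // hence the subtraction of 1.
  return CallerBBs + CalleeBBs - 1 <= MaxBB;
}
```

In this model the check is a pure function of the caller and callee sizes and needs no call-site information, which is why it can live in a caller/callee compatibility query rather than in a per-call-site hook.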

aeubanks marked 2 inline comments as done. Jan 14 2021, 11:40 AM

aeubanks added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
1184–1194	is stuffing this into areInlineCompatible() better?

Harbormaster completed remote builds in B85204: Diff 316719. Jan 14 2021, 12:19 PM

ping

LGTM

This revision is now accepted and ready to land. Jan 21 2021, 11:47 AM

Closed by commit rGa11bf9a7fbd3: [AMDGPU][Inliner] Remove amdgpu-inline and add a new TTI inline hook (authored by aeubanks). Jan 21 2021, 8:29 PM

This revision was automatically updated to reflect the committed changes.

aeubanks added a commit: rGa11bf9a7fbd3: [AMDGPU][Inliner] Remove amdgpu-inline and add a new TTI inline hook.

rampitec mentioned this in D98362: [AMDGPU] Fix -amdgpu-inline-arg-alloca-cost. Mar 10 2021, 10:21 AM

rampitec mentioned this in rGb7b99b0799fa: [AMDGPU] Fix -amdgpu-inline-arg-alloca-cost. Mar 12 2021, 10:20 AM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	7 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	1 line
llvm/include/llvm/CodeGen/BasicTTIImpl.h	1 line
llvm/lib/Analysis/InlineCost.cpp	1 line
llvm/lib/Analysis/TargetTransformInfo.cpp	5 lines
llvm/lib/Target/AMDGPU/AMDGPU.h	3 lines
llvm/lib/Target/AMDGPU/AMDGPUInline.cpp	(deleted)
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	3 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h	1 line
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	60 lines
llvm/lib/Target/AMDGPU/CMakeLists.txt	1 line
llvm/test/CodeGen/AMDGPU/amdgpu-inline.ll	2 lines
llvm/test/CodeGen/AMDGPU/inline-maxbb.ll	6 lines
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll	8 lines
llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument.ll	3 lines
llvm/test/Transforms/Inline/AMDGPU/inline-amdgpu-vecbonus.ll	3 lines
llvm/test/Transforms/Inline/AMDGPU/inline-hint.ll	3 lines
llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/BUILD.gn	1 line

Diff 318408

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 283 Lines • ▼ Show 20 Lines	public:
/// \returns A value by which our inlining threshold should be multiplied.		/// \returns A value by which our inlining threshold should be multiplied.
/// This is primarily used to bump up the inlining threshold wholesale on		/// This is primarily used to bump up the inlining threshold wholesale on
/// targets where calls are unusually expensive.		/// targets where calls are unusually expensive.
///		///
/// TODO: This is a rather blunt instrument. Perhaps altering the costs of		/// TODO: This is a rather blunt instrument. Perhaps altering the costs of
/// individual classes of instructions would be better.		/// individual classes of instructions would be better.
unsigned getInliningThresholdMultiplier() const;		unsigned getInliningThresholdMultiplier() const;

		/// \returns A value to be added to the inlining threshold.
		unsigned adjustInliningThreshold(const CallBase *CB) const;

/// \returns Vector bonus in percent.		/// \returns Vector bonus in percent.
///		///
/// Vector bonuses: We want to more aggressively inline vector-dense kernels		/// Vector bonuses: We want to more aggressively inline vector-dense kernels
/// and apply this bonus based on the percentage of vector instructions. A		/// and apply this bonus based on the percentage of vector instructions. A
/// bonus is applied if the vector instructions exceed 50% and half that		/// bonus is applied if the vector instructions exceed 50% and half that
/// amount is applied if it exceeds 10%. Note that these bonuses are some what		/// amount is applied if it exceeds 10%. Note that these bonuses are some what
/// arbitrary and evolved over time by accident as much as because they are		/// arbitrary and evolved over time by accident as much as because they are
/// principled bonuses.		/// principled bonuses.
▲ Show 20 Lines • Show All 1,090 Lines • ▼ Show 20 Lines
class TargetTransformInfo::Concept {		class TargetTransformInfo::Concept {
public:		public:
virtual ~Concept() = 0;		virtual ~Concept() = 0;
virtual const DataLayout &getDataLayout() const = 0;		virtual const DataLayout &getDataLayout() const = 0;
virtual int getGEPCost(Type *PointeeType, const Value *Ptr,		virtual int getGEPCost(Type *PointeeType, const Value *Ptr,
ArrayRef<const Value *> Operands,		ArrayRef<const Value *> Operands,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual unsigned getInliningThresholdMultiplier() = 0;		virtual unsigned getInliningThresholdMultiplier() = 0;
		virtual unsigned adjustInliningThreshold(const CallBase *CB) = 0;
virtual int getInlinerVectorBonusPercent() = 0;		virtual int getInlinerVectorBonusPercent() = 0;
virtual int getMemcpyCost(const Instruction *I) = 0;		virtual int getMemcpyCost(const Instruction *I) = 0;
virtual unsigned		virtual unsigned
getEstimatedNumberOfCaseClusters(const SwitchInst &SI, unsigned &JTSize,		getEstimatedNumberOfCaseClusters(const SwitchInst &SI, unsigned &JTSize,
ProfileSummaryInfo *PSI,		ProfileSummaryInfo *PSI,
BlockFrequencyInfo *BFI) = 0;		BlockFrequencyInfo *BFI) = 0;
virtual int getUserCost(const User *U, ArrayRef<const Value *> Operands,		virtual int getUserCost(const User *U, ArrayRef<const Value *> Operands,
TargetCostKind CostKind) = 0;		TargetCostKind CostKind) = 0;
▲ Show 20 Lines • Show All 268 Lines • ▼ Show 20 Lines	public:
int getGEPCost(Type *PointeeType, const Value *Ptr,		int getGEPCost(Type *PointeeType, const Value *Ptr,
ArrayRef<const Value *> Operands,		ArrayRef<const Value *> Operands,
enum TargetTransformInfo::TargetCostKind CostKind) override {		enum TargetTransformInfo::TargetCostKind CostKind) override {
return Impl.getGEPCost(PointeeType, Ptr, Operands);		return Impl.getGEPCost(PointeeType, Ptr, Operands);
}		}
unsigned getInliningThresholdMultiplier() override {		unsigned getInliningThresholdMultiplier() override {
return Impl.getInliningThresholdMultiplier();		return Impl.getInliningThresholdMultiplier();
}		}
		unsigned adjustInliningThreshold(const CallBase *CB) override {
		return Impl.adjustInliningThreshold(CB);
		}
int getInlinerVectorBonusPercent() override {		int getInlinerVectorBonusPercent() override {
return Impl.getInlinerVectorBonusPercent();		return Impl.getInlinerVectorBonusPercent();
}		}
int getMemcpyCost(const Instruction *I) override {		int getMemcpyCost(const Instruction *I) override {
return Impl.getMemcpyCost(I);		return Impl.getMemcpyCost(I);
}		}
int getUserCost(const User *U, ArrayRef<const Value *> Operands,		int getUserCost(const User *U, ArrayRef<const Value *> Operands,
TargetCostKind CostKind) override {		TargetCostKind CostKind) override {
▲ Show 20 Lines • Show All 619 Lines • Show Last 20 Lines

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 61 Lines • ▼ Show 20 Lines	unsigned getEstimatedNumberOfCaseClusters(const SwitchInst &SI,
BlockFrequencyInfo *BFI) const {		BlockFrequencyInfo *BFI) const {
(void)PSI;		(void)PSI;
(void)BFI;		(void)BFI;
JTSize = 0;		JTSize = 0;
return SI.getNumCases();		return SI.getNumCases();
}		}

unsigned getInliningThresholdMultiplier() const { return 1; }		unsigned getInliningThresholdMultiplier() const { return 1; }
		unsigned adjustInliningThreshold(const CallBase *CB) const { return 0; }

int getInlinerVectorBonusPercent() const { return 150; }		int getInlinerVectorBonusPercent() const { return 150; }

unsigned getMemcpyCost(const Instruction *I) const {		unsigned getMemcpyCost(const Instruction *I) const {
return TTI::TCC_Expensive;		return TTI::TCC_Expensive;
}		}

bool hasBranchDivergence() const { return false; }		bool hasBranchDivergence() const { return false; }
▲ Show 20 Lines • Show All 1,036 Lines • Show Last 20 Lines

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 395 Lines • ▼ Show 20 Lines	unsigned getFPOpCost(Type *Ty) {
const TargetLoweringBase *TLI = getTLI();		const TargetLoweringBase *TLI = getTLI();
EVT VT = TLI->getValueType(DL, Ty);		EVT VT = TLI->getValueType(DL, Ty);
if (TLI->isOperationLegalOrCustomOrPromote(ISD::FADD, VT))		if (TLI->isOperationLegalOrCustomOrPromote(ISD::FADD, VT))
return TargetTransformInfo::TCC_Basic;		return TargetTransformInfo::TCC_Basic;
return TargetTransformInfo::TCC_Expensive;		return TargetTransformInfo::TCC_Expensive;
}		}

unsigned getInliningThresholdMultiplier() { return 1; }		unsigned getInliningThresholdMultiplier() { return 1; }
		unsigned adjustInliningThreshold(const CallBase *CB) { return 0; }

int getInlinerVectorBonusPercent() { return 150; }		int getInlinerVectorBonusPercent() { return 150; }

void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
// This unrolling functionality is target independent, but to provide some		// This unrolling functionality is target independent, but to provide some
// motivation for its intended use, for x86:		// motivation for its intended use, for x86:

▲ Show 20 Lines • Show All 1,651 Lines • Show Last 20 Lines

llvm/lib/Analysis/InlineCost.cpp

Show First 20 Lines • Show All 1,574 Lines • ▼ Show 20 Lines	if (!Caller->hasOptSize() && HotCallSiteThreshold) {
Threshold = MinIfValid(Threshold, Params.ColdThreshold);		Threshold = MinIfValid(Threshold, Params.ColdThreshold);
}		}
}		}
}		}

// Finally, take the target-specific inlining threshold multiplier into		// Finally, take the target-specific inlining threshold multiplier into
// account.		// account.
Threshold *= TTI.getInliningThresholdMultiplier();		Threshold *= TTI.getInliningThresholdMultiplier();
		Threshold += TTI.adjustInliningThreshold(&Call);

SingleBBBonus = Threshold * SingleBBBonusPercent / 100;		SingleBBBonus = Threshold * SingleBBBonusPercent / 100;
VectorBonus = Threshold * VectorBonusPercent / 100;		VectorBonus = Threshold * VectorBonusPercent / 100;

bool OnlyOneCallAndLocalLinkage =		bool OnlyOneCallAndLocalLinkage =
F.hasLocalLinkage() && F.hasOneUse() && &F == Call.getCalledFunction();		F.hasLocalLinkage() && F.hasOneUse() && &F == Call.getCalledFunction();
// If there is only one call of the function, and it has internal linkage,		// If there is only one call of the function, and it has internal linkage,
// the cost of inlining it drops dramatically. It may seem odd to update		// the cost of inlining it drops dramatically. It may seem odd to update
▲ Show 20 Lines • Show All 1,204 Lines • Show Last 20 Lines

llvm/lib/Analysis/TargetTransformInfo.cpp

Show First 20 Lines • Show All 241 Lines • ▼ Show 20 Lines	TargetTransformInfo &TargetTransformInfo::operator=(TargetTransformInfo &&RHS) {
TTIImpl = std::move(RHS.TTIImpl);		TTIImpl = std::move(RHS.TTIImpl);
return *this;		return *this;
}		}

unsigned TargetTransformInfo::getInliningThresholdMultiplier() const {		unsigned TargetTransformInfo::getInliningThresholdMultiplier() const {
return TTIImpl->getInliningThresholdMultiplier();		return TTIImpl->getInliningThresholdMultiplier();
}		}

		unsigned
		TargetTransformInfo::adjustInliningThreshold(const CallBase *CB) const {
		return TTIImpl->adjustInliningThreshold(CB);
		}

int TargetTransformInfo::getInlinerVectorBonusPercent() const {		int TargetTransformInfo::getInlinerVectorBonusPercent() const {
return TTIImpl->getInlinerVectorBonusPercent();		return TTIImpl->getInlinerVectorBonusPercent();
}		}

int TargetTransformInfo::getGEPCost(Type *PointeeType, const Value *Ptr,		int TargetTransformInfo::getGEPCost(Type *PointeeType, const Value *Ptr,
ArrayRef<const Value *> Operands,		ArrayRef<const Value *> Operands,
TTI::TargetCostKind CostKind) const {		TTI::TargetCostKind CostKind) const {
return TTIImpl->getGEPCost(PointeeType, Ptr, Operands, CostKind);		return TTIImpl->getGEPCost(PointeeType, Ptr, Operands, CostKind);
▲ Show 20 Lines • Show All 1,211 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPU.h

	Show First 20 Lines • Show All 321 Lines • ▼ Show 20 Lines

	ImmutablePass *createAMDGPUAAWrapperPass();			ImmutablePass *createAMDGPUAAWrapperPass();
	void initializeAMDGPUAAWrapperPassPass(PassRegistry&);			void initializeAMDGPUAAWrapperPassPass(PassRegistry&);
	ImmutablePass *createAMDGPUExternalAAWrapperPass();			ImmutablePass *createAMDGPUExternalAAWrapperPass();
	void initializeAMDGPUExternalAAWrapperPass(PassRegistry&);			void initializeAMDGPUExternalAAWrapperPass(PassRegistry&);

	void initializeAMDGPUArgumentUsageInfoPass(PassRegistry &);			void initializeAMDGPUArgumentUsageInfoPass(PassRegistry &);

	Pass *createAMDGPUFunctionInliningPass();
	void initializeAMDGPUInlinerPass(PassRegistry&);

	ModulePass *createAMDGPUOpenCLEnqueuedBlockLoweringPass();			ModulePass *createAMDGPUOpenCLEnqueuedBlockLoweringPass();
	void initializeAMDGPUOpenCLEnqueuedBlockLoweringPass(PassRegistry &);			void initializeAMDGPUOpenCLEnqueuedBlockLoweringPass(PassRegistry &);
	extern char &AMDGPUOpenCLEnqueuedBlockLoweringID;			extern char &AMDGPUOpenCLEnqueuedBlockLoweringID;

	void initializeGCNRegBankReassignPass(PassRegistry &);			void initializeGCNRegBankReassignPass(PassRegistry &);
	extern char &GCNRegBankReassignID;			extern char &GCNRegBankReassignID;

	void initializeGCNNSAReassignPass(PassRegistry &);			void initializeGCNNSAReassignPass(PassRegistry &);
	▲ Show 20 Lines • Show All 82 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUInline.cpp

This file was deleted.

	//===- AMDGPUInline.cpp - Code to perform simple function inlining --------===//
	//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//
	//===----------------------------------------------------------------------===//
	//
	/// \file
	/// This is AMDGPU specific replacement of the standard inliner.
	/// The main purpose is to account for the fact that calls not only expensive
	/// on the AMDGPU, but much more expensive if a private memory pointer is
	/// passed to a function as an argument. In this situation, we are unable to
	/// eliminate private memory in the caller unless inlined and end up with slow
	/// and expensive scratch access. Thus, we boost the inline threshold for such
	/// functions here.
	///
	//===----------------------------------------------------------------------===//

	#include "AMDGPU.h"
	#include "llvm/Analysis/TargetTransformInfo.h"
	#include "llvm/Analysis/ValueTracking.h"
	#include "llvm/IR/Instructions.h"
	#include "llvm/InitializePasses.h"
	#include "llvm/Support/CommandLine.h"
	#include "llvm/Transforms/IPO/Inliner.h"

	using namespace llvm;

	#define DEBUG_TYPE "inline"

	static cl::opt<int>
	ArgAllocaCost("amdgpu-inline-arg-alloca-cost", cl::Hidden, cl::init(4000),
	cl::desc("Cost of alloca argument"));

	// If the amount of scratch memory to eliminate exceeds our ability to allocate
	// it into registers we gain nothing by aggressively inlining functions for that
	// heuristic.
	static cl::opt<unsigned>
	ArgAllocaCutoff("amdgpu-inline-arg-alloca-cutoff", cl::Hidden, cl::init(256),
	cl::desc("Maximum alloca size to use for inline cost"));

	// Inliner constraint to achieve reasonable compilation time
	static cl::opt<size_t>
	MaxBB("amdgpu-inline-max-bb", cl::Hidden, cl::init(1100),
	cl::desc("Maximum BB number allowed in a function after inlining"
	" (compile time constraint)"));

	namespace {

	class AMDGPUInliner : public LegacyInlinerBase {

	public:
	AMDGPUInliner() : LegacyInlinerBase(ID) {
	initializeAMDGPUInlinerPass(*PassRegistry::getPassRegistry());
	Params = getInlineParams();
	}

	static char ID; // Pass identification, replacement for typeid

	unsigned getInlineThreshold(CallBase &CB) const;

	InlineCost getInlineCost(CallBase &CB) override;

	bool runOnSCC(CallGraphSCC &SCC) override;

	void getAnalysisUsage(AnalysisUsage &AU) const override;

	private:
	TargetTransformInfoWrapperPass *TTIWP;

	InlineParams Params;
	};

	} // end anonymous namespace

	char AMDGPUInliner::ID = 0;
	INITIALIZE_PASS_BEGIN(AMDGPUInliner, "amdgpu-inline",
	"AMDGPU Function Integration/Inlining", false, false)
	INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
	INITIALIZE_PASS_DEPENDENCY(CallGraphWrapperPass)
	INITIALIZE_PASS_DEPENDENCY(ProfileSummaryInfoWrapperPass)
	INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
	INITIALIZE_PASS_DEPENDENCY(TargetLibraryInfoWrapperPass)
	INITIALIZE_PASS_END(AMDGPUInliner, "amdgpu-inline",
	"AMDGPU Function Integration/Inlining", false, false)

	Pass *llvm::createAMDGPUFunctionInliningPass() { return new AMDGPUInliner(); }

	bool AMDGPUInliner::runOnSCC(CallGraphSCC &SCC) {
	TTIWP = &getAnalysis<TargetTransformInfoWrapperPass>();
	return LegacyInlinerBase::runOnSCC(SCC);
	}

	void AMDGPUInliner::getAnalysisUsage(AnalysisUsage &AU) const {
	AU.addRequired<TargetTransformInfoWrapperPass>();
	LegacyInlinerBase::getAnalysisUsage(AU);
	}

	unsigned AMDGPUInliner::getInlineThreshold(CallBase &CB) const {
	int Thres = Params.DefaultThreshold;

	Function *Caller = CB.getCaller();
	// Listen to the inlinehint attribute when it would increase the threshold
	// and the caller does not need to minimize its size.
	Function *Callee = CB.getCalledFunction();
	bool InlineHint = Callee && !Callee->isDeclaration() &&
	Callee->hasFnAttribute(Attribute::InlineHint);
	if (InlineHint && Params.HintThreshold && Params.HintThreshold > Thres
	&& !Caller->hasFnAttribute(Attribute::MinSize))
	Thres = Params.HintThreshold.getValue() *
	TTIWP->getTTI(*Callee).getInliningThresholdMultiplier();

	const DataLayout &DL = Caller->getParent()->getDataLayout();
	if (!Callee)
	return (unsigned)Thres;

	// If we have a pointer to private array passed into a function
	// it will not be optimized out, leaving scratch usage.
	// Increase the inline threshold to allow inliniting in this case.
	uint64_t AllocaSize = 0;
	SmallPtrSet<const AllocaInst *, 8> AIVisited;
	for (Value *PtrArg : CB.args()) {
	PointerType *Ty = dyn_cast<PointerType>(PtrArg->getType());
	if (!Ty || (Ty->getAddressSpace() != AMDGPUAS::PRIVATE_ADDRESS &&
	Ty->getAddressSpace() != AMDGPUAS::FLAT_ADDRESS))
	continue;

	PtrArg = getUnderlyingObject(PtrArg);
	if (const AllocaInst *AI = dyn_cast<AllocaInst>(PtrArg)) {
	if (!AI->isStaticAlloca() || !AIVisited.insert(AI).second)
	continue;
	AllocaSize += DL.getTypeAllocSize(AI->getAllocatedType());
	// If the amount of stack memory is excessive we will not be able
	// to get rid of the scratch anyway, bail out.
	if (AllocaSize > ArgAllocaCutoff) {
	AllocaSize = 0;
	break;
	}
	}
	}
	if (AllocaSize)
	Thres += ArgAllocaCost;

	return (unsigned)Thres;
	}

	InlineCost AMDGPUInliner::getInlineCost(CallBase &CB) {
	Function *Callee = CB.getCalledFunction();
	Function *Caller = CB.getCaller();

	if (!Callee || Callee->isDeclaration())
	return llvm::InlineCost::getNever("undefined callee");

	if (CB.isNoInline())
	return llvm::InlineCost::getNever("noinline");

	TargetTransformInfo &TTI = TTIWP->getTTI(*Callee);
	if (!TTI.areInlineCompatible(Caller, Callee))
	return llvm::InlineCost::getNever("incompatible");

	if (CB.hasFnAttr(Attribute::AlwaysInline)) {
	auto IsViable = isInlineViable(*Callee);
	if (IsViable.isSuccess())
	return llvm::InlineCost::getAlways("alwaysinline viable");
	return llvm::InlineCost::getNever(IsViable.getFailureReason());
	}

	InlineParams LocalParams = Params;
	LocalParams.DefaultThreshold = (int)getInlineThreshold(CB);
	bool RemarksEnabled = false;
	const auto &BBs = Caller->getBasicBlockList();
	if (!BBs.empty()) {
	auto DI = OptimizationRemark(DEBUG_TYPE, "", DebugLoc(), &BBs.front());
	if (DI.isEnabled())
	RemarksEnabled = true;
	}

	OptimizationRemarkEmitter ORE(Caller);
	auto GetAssumptionCache = [this](Function &F) -> AssumptionCache & {
	return ACT->getAssumptionCache(F);
	};

	auto IC = llvm::getInlineCost(CB, Callee, LocalParams, TTI,
	GetAssumptionCache, GetTLI, nullptr, PSI,
	RemarksEnabled ? &ORE : nullptr);

	if (IC && !IC.isAlways() && !Callee->hasFnAttribute(Attribute::InlineHint)) {
	// Single BB does not increase total BB amount, thus subtract 1
	size_t Size = Caller->size() + Callee->size() - 1;
	if (MaxBB && Size > MaxBB)
	return llvm::InlineCost::getNever("max number of bb exceeded");
	}
	return IC;
	}
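
The alloca heuristic in `getInlineThreshold` of the deleted file can be modeled without LLVM types roughly as below. This is a sketch under assumptions: `PtrArg` and `allocaBonus` are hypothetical names, the address-space filter and the underlying-object lookup are folded into `AllocaId` (negative meaning "not a static private alloca"), and the defaults of 256 and 4000 mirror the `amdgpu-inline-arg-alloca-cutoff` and `amdgpu-inline-arg-alloca-cost` options.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical description of one pointer argument at a call site.
struct PtrArg {
  int AllocaId;       // identity of the underlying static alloca, or -1
  std::uint64_t Size; // allocation size in bytes
};

// Returns the amount to add to the inline threshold for this call site.
unsigned allocaBonus(const std::vector<PtrArg> &Args,
                     std::uint64_t Cutoff = 256, unsigned Cost = 4000) {
  std::uint64_t Total = 0;
  std::set<int> Visited;
  for (const PtrArg &A : Args) {
    // Skip arguments that do not point at an alloca, and count each
    // alloca only once even if it is passed several times.
    if (A.AllocaId < 0 || !Visited.insert(A.AllocaId).second)
      continue;
    Total += A.Size;
    // Too much stack memory to promote to registers anyway: no bonus.
    if (Total > Cutoff)
      return 0;
  }
  return Total ? Cost : 0;
}
```

The bonus is all-or-nothing: any nonzero total at or below the cutoff yields the full cost, and exceeding the cutoff yields none.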

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Show First 20 Lines • Show All 249 Lines • ▼ Show 20 Lines	extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeSIPreAllocateWWMRegsPass(*PR);		initializeSIPreAllocateWWMRegsPass(*PR);
initializeSIFormMemoryClausesPass(*PR);		initializeSIFormMemoryClausesPass(*PR);
initializeSIPostRABundlerPass(*PR);		initializeSIPostRABundlerPass(*PR);
initializeAMDGPUUnifyDivergentExitNodesPass(*PR);		initializeAMDGPUUnifyDivergentExitNodesPass(*PR);
initializeAMDGPUAAWrapperPassPass(*PR);		initializeAMDGPUAAWrapperPassPass(*PR);
initializeAMDGPUExternalAAWrapperPass(*PR);		initializeAMDGPUExternalAAWrapperPass(*PR);
initializeAMDGPUUseNativeCallsPass(*PR);		initializeAMDGPUUseNativeCallsPass(*PR);
initializeAMDGPUSimplifyLibCallsPass(*PR);		initializeAMDGPUSimplifyLibCallsPass(*PR);
initializeAMDGPUInlinerPass(*PR);
initializeAMDGPUPrintfRuntimeBindingPass(*PR);		initializeAMDGPUPrintfRuntimeBindingPass(*PR);
initializeGCNRegBankReassignPass(*PR);		initializeGCNRegBankReassignPass(*PR);
initializeGCNNSAReassignPass(*PR);		initializeGCNNSAReassignPass(*PR);
initializeSIAddIMGInitPass(*PR);		initializeSIAddIMGInitPass(*PR);
}		}

static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {		static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {
return std::make_unique<AMDGPUTargetObjectFile>();		return std::make_unique<AMDGPUTargetObjectFile>();
▲ Show 20 Lines • Show All 151 Lines • ▼ Show 20 Lines	void AMDGPUTargetMachine::adjustPassManager(PassManagerBuilder &Builder) {
bool EnableOpt = getOptLevel() > CodeGenOpt::None;		bool EnableOpt = getOptLevel() > CodeGenOpt::None;
bool Internalize = InternalizeSymbols;		bool Internalize = InternalizeSymbols;
bool EarlyInline = EarlyInlineAll && EnableOpt && !EnableFunctionCalls;		bool EarlyInline = EarlyInlineAll && EnableOpt && !EnableFunctionCalls;
bool AMDGPUAA = EnableAMDGPUAliasAnalysis && EnableOpt;		bool AMDGPUAA = EnableAMDGPUAliasAnalysis && EnableOpt;
bool LibCallSimplify = EnableLibCallSimplify && EnableOpt;		bool LibCallSimplify = EnableLibCallSimplify && EnableOpt;

if (EnableFunctionCalls) {		if (EnableFunctionCalls) {
delete Builder.Inliner;		delete Builder.Inliner;
Builder.Inliner = createAMDGPUFunctionInliningPass();		Builder.Inliner = createFunctionInliningPass();
}		}

Builder.addExtension(		Builder.addExtension(
PassManagerBuilder::EP_ModuleOptimizerEarly,		PassManagerBuilder::EP_ModuleOptimizerEarly,
[Internalize, EarlyInline, AMDGPUAA, this](const PassManagerBuilder &,		[Internalize, EarlyInline, AMDGPUAA, this](const PassManagerBuilder &,
legacy::PassManagerBase &PM) {		legacy::PassManagerBase &PM) {
if (AMDGPUAA) {		if (AMDGPUAA) {
PM.add(createAMDGPUAAWrapperPass());		PM.add(createAMDGPUAAWrapperPass());
▲ Show 20 Lines • Show All 936 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

Show First 20 Lines • Show All 197 Lines • ▼ Show 20 Lines	public:

unsigned getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp, int Index,		unsigned getShuffleCost(TTI::ShuffleKind Kind, VectorType *Tp, int Index,
VectorType *SubTp);		VectorType *SubTp);

bool areInlineCompatible(const Function *Caller,		bool areInlineCompatible(const Function *Caller,
const Function *Callee) const;		const Function *Callee) const;

unsigned getInliningThresholdMultiplier() { return 11; }		unsigned getInliningThresholdMultiplier() { return 11; }
		unsigned adjustInliningThreshold(const CallBase *CB) const;

int getInlinerVectorBonusPercent() { return 0; }		int getInlinerVectorBonusPercent() { return 0; }

int getArithmeticReductionCost(		int getArithmeticReductionCost(
unsigned Opcode,		unsigned Opcode,
VectorType *Ty,		VectorType *Ty,
bool IsPairwise,		bool IsPairwise,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput);
▲ Show 20 Lines • Show All 47 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp

Show First 20 Lines • Show All 50 Lines • ▼ Show 20 Lines	static cl::opt<bool> UseLegacyDA(
cl::desc("Enable legacy divergence analysis for AMDGPU"),		cl::desc("Enable legacy divergence analysis for AMDGPU"),
cl::init(false), cl::Hidden);		cl::init(false), cl::Hidden);

static cl::opt<unsigned> UnrollMaxBlockToAnalyze(		static cl::opt<unsigned> UnrollMaxBlockToAnalyze(
"amdgpu-unroll-max-block-to-analyze",		"amdgpu-unroll-max-block-to-analyze",
cl::desc("Inner loop block size threshold to analyze in unroll for AMDGPU"),		cl::desc("Inner loop block size threshold to analyze in unroll for AMDGPU"),
cl::init(32), cl::Hidden);		cl::init(32), cl::Hidden);

		static cl::opt<unsigned> ArgAllocaCost("amdgpu-inline-arg-alloca-cost",
		cl::Hidden, cl::init(4000),
		cl::desc("Cost of alloca argument"));

		// If the amount of scratch memory to eliminate exceeds our ability to allocate
		// it into registers we gain nothing by aggressively inlining functions for that
		// heuristic.
		static cl::opt<unsigned>
		ArgAllocaCutoff("amdgpu-inline-arg-alloca-cutoff", cl::Hidden,
		cl::init(256),
		cl::desc("Maximum alloca size to use for inline cost"));

		// Inliner constraint to achieve reasonable compilation time.
		static cl::opt<size_t> InlineMaxBB(
		"amdgpu-inline-max-bb", cl::Hidden, cl::init(1100),
		cl::desc("Maximum number of BBs allowed in a function after inlining"
		" (compile time constraint)"));

static bool dependsOnLocalPhi(const Loop *L, const Value *Cond,		static bool dependsOnLocalPhi(const Loop *L, const Value *Cond,
unsigned Depth = 0) {		unsigned Depth = 0) {
const Instruction *I = dyn_cast<Instruction>(Cond);		const Instruction *I = dyn_cast<Instruction>(Cond);
if (!I)		if (!I)
return false;		return false;

for (const Value *V : I->operand_values()) {		for (const Value *V : I->operand_values()) {
if (!L->contains(I))		if (!L->contains(I))
[1,048 lines not shown]	bool GCNTTIImpl::areInlineCompatible(const Function *Caller,
FeatureBitset RealCalleeBits = CalleeBits & ~InlineFeatureIgnoreList;		FeatureBitset RealCalleeBits = CalleeBits & ~InlineFeatureIgnoreList;
if ((RealCallerBits & RealCalleeBits) != RealCalleeBits)		if ((RealCallerBits & RealCalleeBits) != RealCalleeBits)
return false;		return false;

// FIXME: dx10_clamp can just take the caller setting, but there seems to be		// FIXME: dx10_clamp can just take the caller setting, but there seems to be
// no way to support merge for backend defined attributes.		// no way to support merge for backend defined attributes.
AMDGPU::SIModeRegisterDefaults CallerMode(*Caller);		AMDGPU::SIModeRegisterDefaults CallerMode(*Caller);
AMDGPU::SIModeRegisterDefaults CalleeMode(*Callee);		AMDGPU::SIModeRegisterDefaults CalleeMode(*Callee);
return CallerMode.isInlineCompatible(CalleeMode);		if (!CallerMode.isInlineCompatible(CalleeMode))
		return false;

		// Hack to make compile times reasonable.
		if (InlineMaxBB && !Callee->hasFnAttribute(Attribute::InlineHint)) {
		// Single BB does not increase total BB amount, thus subtract 1.
		size_t BBSize = Caller->size() + Callee->size() - 1;
		return BBSize <= InlineMaxBB;
		}

		return true;
		}

		unsigned GCNTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
		// If we have a pointer to private array passed into a function
		// it will not be optimized out, leaving scratch usage.
		// Increase the inline threshold to allow inlining in this case.
		uint64_t AllocaSize = 0;
		SmallPtrSet<const AllocaInst *, 8> AIVisited;
		for (Value *PtrArg : CB->args()) {
		PointerType *Ty = dyn_cast<PointerType>(PtrArg->getType());
		if (!Ty || (Ty->getAddressSpace() != AMDGPUAS::PRIVATE_ADDRESS &&
		Ty->getAddressSpace() != AMDGPUAS::FLAT_ADDRESS))
		continue;

		PtrArg = getUnderlyingObject(PtrArg);
		if (const AllocaInst *AI = dyn_cast<AllocaInst>(PtrArg)) {
		if (!AI->isStaticAlloca() || !AIVisited.insert(AI).second)
		continue;
		AllocaSize += DL.getTypeAllocSize(AI->getAllocatedType());
		// If the amount of stack memory is excessive we will not be able
		// to get rid of the scratch anyway, bail out.
		if (AllocaSize > ArgAllocaCutoff) {
		AllocaSize = 0;
		break;
		}
		}
		}
		if (AllocaSize)
		return ArgAllocaCost;
		return 0;
}		}

void GCNTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void GCNTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
CommonTTI.getUnrollingPreferences(L, SE, UP);		CommonTTI.getUnrollingPreferences(L, SE, UP);
}		}

void GCNTTIImpl::getPeelingPreferences(Loop *L, ScalarEvolution &SE,		void GCNTTIImpl::getPeelingPreferences(Loop *L, ScalarEvolution &SE,
TTI::PeelingPreferences &PP) {		TTI::PeelingPreferences &PP) {
CommonTTI.getPeelingPreferences(L, SE, PP);		CommonTTI.getPeelingPreferences(L, SE, PP);
}		}

int GCNTTIImpl::get64BitInstrCost(TTI::TargetCostKind CostKind) const {		int GCNTTIImpl::get64BitInstrCost(TTI::TargetCostKind CostKind) const {
		arsenm: I still don't really want to expand this compile time hack to a target hook
		aeubanks: is stuffing this into areInlineCompatible() better?
return ST->hasHalfRate64Ops() ? getHalfRateInstrCost(CostKind)		return ST->hasHalfRate64Ops() ? getHalfRateInstrCost(CostKind)
: getQuarterRateInstrCost(CostKind);		: getQuarterRateInstrCost(CostKind);
}		}

R600TTIImpl::R600TTIImpl(const AMDGPUTargetMachine *TM, const Function &F)		R600TTIImpl::R600TTIImpl(const AMDGPUTargetMachine *TM, const Function &F)
		arsenm: I'm not convinced we ever really needed this. I believe the standard inline heuristic will always do this anyway
		aeubanks: There was one case where this forced a coldcc call to be inlined where normally it wouldn't be. Of course, why put coldcc on some random function? Will remove.
		rampitec: This was added to unwrap several layers of wrappers we have in device lib. Without this heuristic it was not always properly handled by the standard inliner.
: BaseT(TM, F.getParent()->getDataLayout()),		: BaseT(TM, F.getParent()->getDataLayout()),
ST(static_cast<const R600Subtarget *>(TM->getSubtargetImpl(F))),		ST(static_cast<const R600Subtarget *>(TM->getSubtargetImpl(F))),
TLI(ST->getTargetLowering()), CommonTTI(TM, F) {}		TLI(ST->getTargetLowering()), CommonTTI(TM, F) {}

unsigned R600TTIImpl::getHardwareNumberOfRegisters(bool Vec) const {		unsigned R600TTIImpl::getHardwareNumberOfRegisters(bool Vec) const {
		arsenm: I'm not sure I like having a target hook for a hack like this
		aeubanks: Would you rather just remove this altogether?
		arsenm: I don't remember the story here. @rampitec @vpykhtin @dfukalov ?
		aeubanks: It was added in https://reviews.llvm.org/D62917. Apparently it's a hack to help with compile times. That's a bit surprising, is this an AMDGPU specific issue?
		rampitec: It is AMDGPU specific only in a sense. We tend to inline a lot, much more than other targets. Therefore we can have drastic compile time issues. In particular there are several pretty big codes which take hours to compile instead of one or two minutes without this (suboptimal) cutoff.
		arsenm: Is it possible this was only due to a bug in the subtarget feature compatibility handling? I believe the standard inline analysis specifically looks for wrappers as well
		rampitec: It did not work this way at least back when it was added. I think we should not change the logic with this patch. This patch is infrastructural, if we want to tune heuristics that would be a separate work.
		aeubanks: That's fair. I'll keep everything for now. @arsenm Any thoughts on alternatives to target hooks?
		arsenm: I think we should first try to remove this part and see how it goes
		rnk: I think it makes sense to avoid building infrastructure (new TTI hooks) if it turns out that it is never needed. It's much harder to remove hooks than it is to add them. However, we want to make sure that Arthur can make progress flipping the default pass manager without doing any AMDGPU tuning work. That's clearly out of scope for him. We could proceed with a minimal set of inliner hooks, flip the default pass manager, and allow downstream AMD folks to sort out any performance problems later. AMD or any other vendor can configure CMake to use the old pass manager if they aren't ready to triage the performance impacts of the NPM. Would that be OK?
return 4 * 128; // XXX - 4 channels. Should these count as vector instead?		return 4 * 128; // XXX - 4 channels. Should these count as vector instead?
}		}

unsigned R600TTIImpl::getNumberOfRegisters(bool Vec) const {		unsigned R600TTIImpl::getNumberOfRegisters(bool Vec) const {
return getHardwareNumberOfRegisters(Vec);		return getHardwareNumberOfRegisters(Vec);
}		}

unsigned R600TTIImpl::getRegisterBitWidth(bool Vector) const {		unsigned R600TTIImpl::getRegisterBitWidth(bool Vector) const {
[102 lines not shown]
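
The basic-block cutoff that this diff places in GCNTTIImpl::areInlineCompatible() can be modeled outside LLVM. What follows is a minimal sketch, not the patch's code: FnInfo and passesBlockLimit are hypothetical names, and the NumBlocks fields stand in for Caller->size() and Callee->size().

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the facts the hook reads from a Function.
struct FnInfo {
  std::size_t NumBlocks; // number of basic blocks in the function
  bool HasInlineHint;    // models Attribute::InlineHint on the function
};

// Returns true if the block-count cutoff still permits inlining.
// MaxBB == 0 disables the check, and callees carrying an inline hint
// bypass it, as in the condition shown in the diff.
bool passesBlockLimit(const FnInfo &Caller, const FnInfo &Callee,
                      std::size_t MaxBB) {
  if (MaxBB == 0 || Callee.HasInlineHint)
    return true;
  // Assumption of this sketch: at least one block overall, so the
  // unsigned subtraction below cannot wrap around.
  assert(Caller.NumBlocks + Callee.NumBlocks >= 1);
  // A single-block callee does not increase the total, hence the -1.
  std::size_t Total = Caller.NumBlocks + Callee.NumBlocks - 1;
  return Total <= MaxBB;
}
```

With a two-block caller and a two-block callee, a limit of 2 rejects the merge (2 + 2 - 1 = 3) and a limit of 3 accepts it. These numbers are illustrative only and are not taken from inline-maxbb.ll.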

llvm/lib/Target/AMDGPU/CMakeLists.txt

[80 lines not shown]	add_llvm_target(AMDGPUCodeGen
AMDGPURegisterBankInfo.cpp		AMDGPURegisterBankInfo.cpp
AMDGPURewriteOutArguments.cpp		AMDGPURewriteOutArguments.cpp
AMDGPUSubtarget.cpp		AMDGPUSubtarget.cpp
AMDGPUTargetMachine.cpp		AMDGPUTargetMachine.cpp
AMDGPUTargetObjectFile.cpp		AMDGPUTargetObjectFile.cpp
AMDGPUTargetTransformInfo.cpp		AMDGPUTargetTransformInfo.cpp
AMDGPUUnifyDivergentExitNodes.cpp		AMDGPUUnifyDivergentExitNodes.cpp
AMDGPUUnifyMetadata.cpp		AMDGPUUnifyMetadata.cpp
AMDGPUInline.cpp
AMDGPUPerfHintAnalysis.cpp		AMDGPUPerfHintAnalysis.cpp
AMDILCFGStructurizer.cpp		AMDILCFGStructurizer.cpp
AMDGPUPrintfRuntimeBinding.cpp		AMDGPUPrintfRuntimeBinding.cpp
GCNHazardRecognizer.cpp		GCNHazardRecognizer.cpp
GCNIterativeScheduler.cpp		GCNIterativeScheduler.cpp
GCNMinRegStrategy.cpp		GCNMinRegStrategy.cpp
GCNRegPressure.cpp		GCNRegPressure.cpp
GCNSchedStrategy.cpp		GCNSchedStrategy.cpp
[80 lines not shown]

llvm/test/CodeGen/AMDGPU/amdgpu-inline.ll

	; RUN: opt -mtriple=amdgcn--amdhsa -data-layout=A5 -O3 -S -inline-threshold=1 < %s | FileCheck -check-prefix=GCN -check-prefix=GCN-INL1 %s			; RUN: opt -mtriple=amdgcn--amdhsa -data-layout=A5 -O3 -S -inline-threshold=1 < %s | FileCheck -check-prefix=GCN -check-prefix=GCN-INL1 %s
	; RUN: opt -mtriple=amdgcn--amdhsa -data-layout=A5 -O3 -S < %s | FileCheck -check-prefix=GCN -check-prefix=GCN-INLDEF %s			; RUN: opt -mtriple=amdgcn--amdhsa -data-layout=A5 -O3 -S < %s | FileCheck -check-prefix=GCN -check-prefix=GCN-INLDEF %s
				; RUN: opt -mtriple=amdgcn--amdhsa -data-layout=A5 -passes='default<O3>' -S -inline-threshold=1 < %s | FileCheck -check-prefix=GCN -check-prefix=GCN-INL1 %s
				; RUN: opt -mtriple=amdgcn--amdhsa -data-layout=A5 -passes='default<O3>' -S < %s | FileCheck -check-prefix=GCN -check-prefix=GCN-INLDEF %s

	define coldcc float @foo(float %x, float %y) {			define coldcc float @foo(float %x, float %y) {
	entry:			entry:
	%cmp = fcmp ogt float %x, 0.000000e+00			%cmp = fcmp ogt float %x, 0.000000e+00
	%div = fdiv float %y, %x			%div = fdiv float %y, %x
	%mul = fmul float %x, %y			%mul = fmul float %x, %y
	%cond = select i1 %cmp, float %div, float %mul			%cond = select i1 %cmp, float %div, float %mul
	ret float %cond			ret float %cond
	[142 lines not shown]

llvm/test/CodeGen/AMDGPU/inline-maxbb.ll

	; RUN: opt -mtriple=amdgcn-- --amdgpu-inline -S -amdgpu-inline-max-bb=2 %s | FileCheck %s --check-prefix=NOINL			; RUN: opt -mtriple=amdgcn-- -inline -S -amdgpu-inline-max-bb=2 %s | FileCheck %s --check-prefix=NOINL
	; RUN: opt -mtriple=amdgcn-- --amdgpu-inline -S -amdgpu-inline-max-bb=3 %s | FileCheck %s --check-prefix=INL			; RUN: opt -mtriple=amdgcn-- -inline -S -amdgpu-inline-max-bb=3 %s | FileCheck %s --check-prefix=INL
				; RUN: opt -mtriple=amdgcn-- -passes=inline -S -amdgpu-inline-max-bb=2 %s | FileCheck %s --check-prefix=NOINL
				; RUN: opt -mtriple=amdgcn-- -passes=inline -S -amdgpu-inline-max-bb=3 %s | FileCheck %s --check-prefix=INL

	define i32 @callee(i32 %x) {			define i32 @callee(i32 %x) {
	entry:			entry:
	%cc = icmp eq i32 %x, 1			%cc = icmp eq i32 %x, 1
	br i1 %cc, label %ret_res, label %mulx			br i1 %cc, label %ret_res, label %mulx

	mulx:			mulx:
	%mul1 = mul i32 %x, %x			%mul1 = mul i32 %x, %x
	[58 lines not shown]

llvm/test/CodeGen/AMDGPU/opt-pipeline.ll

	[17 lines not shown]
	; GCN-O0-NEXT: Target Pass Configuration			; GCN-O0-NEXT: Target Pass Configuration
	; GCN-O0-NEXT: Assumption Cache Tracker			; GCN-O0-NEXT: Assumption Cache Tracker
	; GCN-O0-NEXT: Profile summary info			; GCN-O0-NEXT: Profile summary info
	; GCN-O0-NEXT: ModulePass Manager			; GCN-O0-NEXT: ModulePass Manager
	; GCN-O0-NEXT: Annotation2Metadata			; GCN-O0-NEXT: Annotation2Metadata
	; GCN-O0-NEXT: Force set function attributes			; GCN-O0-NEXT: Force set function attributes
	; GCN-O0-NEXT: CallGraph Construction			; GCN-O0-NEXT: CallGraph Construction
	; GCN-O0-NEXT: Call Graph SCC Pass Manager			; GCN-O0-NEXT: Call Graph SCC Pass Manager
	; GCN-O0-NEXT: AMDGPU Function Integration/Inlining			; GCN-O0-NEXT: Function Integration/Inlining
	; GCN-O0-NEXT: A No-Op Barrier Pass			; GCN-O0-NEXT: A No-Op Barrier Pass


	; GCN-O1: Pass Arguments:			; GCN-O1: Pass Arguments:
	; GCN-O1-NEXT: Target Transform Information			; GCN-O1-NEXT: Target Transform Information
	; GCN-O1-NEXT: AMDGPU Address space based Alias Analysis			; GCN-O1-NEXT: AMDGPU Address space based Alias Analysis
	; GCN-O1-NEXT: External Alias Analysis			; GCN-O1-NEXT: External Alias Analysis
	; GCN-O1-NEXT: Assumption Cache Tracker			; GCN-O1-NEXT: Assumption Cache Tracker
	[57 lines not shown]
	; GCN-O1-NEXT: Lazy Block Frequency Analysis			; GCN-O1-NEXT: Lazy Block Frequency Analysis
	; GCN-O1-NEXT: Optimization Remark Emitter			; GCN-O1-NEXT: Optimization Remark Emitter
	; GCN-O1-NEXT: Combine redundant instructions			; GCN-O1-NEXT: Combine redundant instructions
	; GCN-O1-NEXT: Simplify the CFG			; GCN-O1-NEXT: Simplify the CFG
	; GCN-O1-NEXT: CallGraph Construction			; GCN-O1-NEXT: CallGraph Construction
	; GCN-O1-NEXT: Globals Alias Analysis			; GCN-O1-NEXT: Globals Alias Analysis
	; GCN-O1-NEXT: Call Graph SCC Pass Manager			; GCN-O1-NEXT: Call Graph SCC Pass Manager
	; GCN-O1-NEXT: Remove unused exception handling info			; GCN-O1-NEXT: Remove unused exception handling info
	; GCN-O1-NEXT: AMDGPU Function Integration/Inlining			; GCN-O1-NEXT: Function Integration/Inlining
	; GCN-O1-NEXT: Deduce function attributes			; GCN-O1-NEXT: Deduce function attributes
	; GCN-O1-NEXT: FunctionPass Manager			; GCN-O1-NEXT: FunctionPass Manager
	; GCN-O1-NEXT: Infer address spaces			; GCN-O1-NEXT: Infer address spaces
	; GCN-O1-NEXT: AMDGPU Kernel Attributes			; GCN-O1-NEXT: AMDGPU Kernel Attributes
	; GCN-O1-NEXT: FunctionPass Manager			; GCN-O1-NEXT: FunctionPass Manager
	; GCN-O1-NEXT: AMDGPU Promote Alloca to vector			; GCN-O1-NEXT: AMDGPU Promote Alloca to vector
	; GCN-O1-NEXT: Dominator Tree Construction			; GCN-O1-NEXT: Dominator Tree Construction
	; GCN-O1-NEXT: SROA			; GCN-O1-NEXT: SROA
	[294 lines not shown]
	; GCN-O2-NEXT: Lazy Block Frequency Analysis			; GCN-O2-NEXT: Lazy Block Frequency Analysis
	; GCN-O2-NEXT: Optimization Remark Emitter			; GCN-O2-NEXT: Optimization Remark Emitter
	; GCN-O2-NEXT: Combine redundant instructions			; GCN-O2-NEXT: Combine redundant instructions
	; GCN-O2-NEXT: Simplify the CFG			; GCN-O2-NEXT: Simplify the CFG
	; GCN-O2-NEXT: CallGraph Construction			; GCN-O2-NEXT: CallGraph Construction
	; GCN-O2-NEXT: Globals Alias Analysis			; GCN-O2-NEXT: Globals Alias Analysis
	; GCN-O2-NEXT: Call Graph SCC Pass Manager			; GCN-O2-NEXT: Call Graph SCC Pass Manager
	; GCN-O2-NEXT: Remove unused exception handling info			; GCN-O2-NEXT: Remove unused exception handling info
	; GCN-O2-NEXT: AMDGPU Function Integration/Inlining			; GCN-O2-NEXT: Function Integration/Inlining
	; GCN-O2-NEXT: OpenMP specific optimizations			; GCN-O2-NEXT: OpenMP specific optimizations
	; GCN-O2-NEXT: Deduce function attributes			; GCN-O2-NEXT: Deduce function attributes
	; GCN-O2-NEXT: FunctionPass Manager			; GCN-O2-NEXT: FunctionPass Manager
	; GCN-O2-NEXT: Infer address spaces			; GCN-O2-NEXT: Infer address spaces
	; GCN-O2-NEXT: AMDGPU Kernel Attributes			; GCN-O2-NEXT: AMDGPU Kernel Attributes
	; GCN-O2-NEXT: FunctionPass Manager			; GCN-O2-NEXT: FunctionPass Manager
	; GCN-O2-NEXT: AMDGPU Promote Alloca to vector			; GCN-O2-NEXT: AMDGPU Promote Alloca to vector
	; GCN-O2-NEXT: Dominator Tree Construction			; GCN-O2-NEXT: Dominator Tree Construction
	[345 lines not shown]
	; GCN-O3-NEXT: Lazy Block Frequency Analysis			; GCN-O3-NEXT: Lazy Block Frequency Analysis
	; GCN-O3-NEXT: Optimization Remark Emitter			; GCN-O3-NEXT: Optimization Remark Emitter
	; GCN-O3-NEXT: Combine redundant instructions			; GCN-O3-NEXT: Combine redundant instructions
	; GCN-O3-NEXT: Simplify the CFG			; GCN-O3-NEXT: Simplify the CFG
	; GCN-O3-NEXT: CallGraph Construction			; GCN-O3-NEXT: CallGraph Construction
	; GCN-O3-NEXT: Globals Alias Analysis			; GCN-O3-NEXT: Globals Alias Analysis
	; GCN-O3-NEXT: Call Graph SCC Pass Manager			; GCN-O3-NEXT: Call Graph SCC Pass Manager
	; GCN-O3-NEXT: Remove unused exception handling info			; GCN-O3-NEXT: Remove unused exception handling info
	; GCN-O3-NEXT: AMDGPU Function Integration/Inlining			; GCN-O3-NEXT: Function Integration/Inlining
	; GCN-O3-NEXT: OpenMP specific optimizations			; GCN-O3-NEXT: OpenMP specific optimizations
	; GCN-O3-NEXT: Deduce function attributes			; GCN-O3-NEXT: Deduce function attributes
	; GCN-O3-NEXT: Promote 'by reference' arguments to scalars			; GCN-O3-NEXT: Promote 'by reference' arguments to scalars
	; GCN-O3-NEXT: FunctionPass Manager			; GCN-O3-NEXT: FunctionPass Manager
	; GCN-O3-NEXT: Infer address spaces			; GCN-O3-NEXT: Infer address spaces
	; GCN-O3-NEXT: AMDGPU Kernel Attributes			; GCN-O3-NEXT: AMDGPU Kernel Attributes
	; GCN-O3-NEXT: FunctionPass Manager			; GCN-O3-NEXT: FunctionPass Manager
	; GCN-O3-NEXT: AMDGPU Promote Alloca to vector			; GCN-O3-NEXT: AMDGPU Promote Alloca to vector
	[284 lines not shown]

llvm/test/Transforms/Inline/AMDGPU/amdgpu-inline-alloca-argument.ll

	; RUN: opt -mtriple=amdgcn--amdhsa -S -amdgpu-inline -inline-threshold=0 < %s | FileCheck %s			; RUN: opt -mtriple=amdgcn--amdhsa -S -inline -inline-threshold=0 < %s | FileCheck %s
				; RUN: opt -mtriple=amdgcn--amdhsa -S -passes=inline -inline-threshold=0 < %s | FileCheck %s

	target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5"			target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5"

	define void @use_flat_ptr_arg(float* nocapture %p) {			define void @use_flat_ptr_arg(float* nocapture %p) {
	entry:			entry:
	%tmp1 = load float, float* %p, align 4			%tmp1 = load float, float* %p, align 4
	%div = fdiv float 1.000000e+00, %tmp1			%div = fdiv float 1.000000e+00, %tmp1
	%add0 = fadd float %div, 1.0			%add0 = fadd float %div, 1.0
	[61 lines not shown]
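
The alloca heuristic in GCNTTIImpl::adjustInliningThreshold() (shown earlier in this diff) sums the sizes of distinct static allocas reachable through private or flat pointer arguments and grants a fixed threshold bonus unless the sum exceeds a cutoff. The following is a simplified sketch of that accumulation without LLVM types. PtrArg, thresholdBonus and exampleThresholdBonus are hypothetical names; the defaults 256 and 4000 are the cl::init values of amdgpu-inline-arg-alloca-cutoff and amdgpu-inline-arg-alloca-cost from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical model of one pointer argument at a call site.
struct PtrArg {
  int AllocaId;            // identity of the underlying static alloca, -1 if none
  std::uint64_t AllocSize; // allocation size of that alloca in bytes
  bool PrivateOrFlat;      // pointer is in the private or flat address space
};

// Returns the amount to add to the inline threshold for this call site.
unsigned thresholdBonus(const std::vector<PtrArg> &Args,
                        std::uint64_t Cutoff = 256, unsigned Cost = 4000) {
  std::uint64_t Total = 0;
  std::set<int> Visited;
  for (const PtrArg &A : Args) {
    if (!A.PrivateOrFlat || A.AllocaId < 0)
      continue;
    if (!Visited.insert(A.AllocaId).second)
      continue; // count each alloca only once
    Total += A.AllocSize;
    // Too much scratch to eliminate anyway: give no bonus at all.
    if (Total > Cutoff) {
      Total = 0;
      break;
    }
  }
  return Total ? Cost : 0;
}

// Small usage example for the sketch above.
void exampleThresholdBonus() {
  // The same 64-byte alloca passed twice is counted once.
  assert(thresholdBonus({{0, 64, true}, {0, 64, true}}) == 4000);
  // 200 + 100 bytes exceeds the 256-byte cutoff, so no bonus.
  assert(thresholdBonus({{0, 200, true}, {1, 100, true}}) == 0);
}
```

Resetting the total to zero once the cutoff is exceeded means a call site with too much scratch gets no bonus, rather than a bonus based on the part that happened to fit.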

llvm/test/Transforms/Inline/AMDGPU/inline-amdgpu-vecbonus.ll

	; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -amdgpu-inline --inline-threshold=1 < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -inline --inline-threshold=1 < %s | FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -passes=inline --inline-threshold=1 < %s | FileCheck %s

	define hidden <16 x i32> @div_vecbonus(<16 x i32> %x, <16 x i32> %y) {			define hidden <16 x i32> @div_vecbonus(<16 x i32> %x, <16 x i32> %y) {
	entry:			entry:
	%div.1 = udiv <16 x i32> %x, %y			%div.1 = udiv <16 x i32> %x, %y
	%div.2 = udiv <16 x i32> %div.1, %y			%div.2 = udiv <16 x i32> %div.1, %y
	%div.3 = udiv <16 x i32> %div.2, %y			%div.3 = udiv <16 x i32> %div.2, %y
	%div.4 = udiv <16 x i32> %div.3, %y			%div.4 = udiv <16 x i32> %div.3, %y
	%div.5 = udiv <16 x i32> %div.4, %y			%div.5 = udiv <16 x i32> %div.4, %y
	[22 lines not shown]

llvm/test/Transforms/Inline/AMDGPU/inline-hint.ll

	; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -amdgpu-inline --inline-threshold=1 --inlinehint-threshold=2 < %s | FileCheck %s			; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -inline --inline-threshold=1 --inlinehint-threshold=4 < %s | FileCheck %s
				; RUN: opt -S -mtriple=amdgcn-unknown-amdhsa -passes=inline --inline-threshold=1 --inlinehint-threshold=4 < %s | FileCheck %s

	define hidden <16 x i32> @div_hint(<16 x i32> %x, <16 x i32> %y) #0 {			define hidden <16 x i32> @div_hint(<16 x i32> %x, <16 x i32> %y) #0 {
	entry:			entry:
	%div.1 = udiv <16 x i32> %x, %y			%div.1 = udiv <16 x i32> %x, %y
	%div.2 = udiv <16 x i32> %div.1, %y			%div.2 = udiv <16 x i32> %div.1, %y
	%div.3 = udiv <16 x i32> %div.2, %y			%div.3 = udiv <16 x i32> %div.2, %y
	%div.4 = udiv <16 x i32> %div.3, %y			%div.4 = udiv <16 x i32> %div.3, %y
	%div.5 = udiv <16 x i32> %div.4, %y			%div.5 = udiv <16 x i32> %div.4, %y
	[68 lines not shown]

llvm/utils/gn/secondary/llvm/lib/Target/AMDGPU/BUILD.gn

[131 lines not shown]	sources = [
"AMDGPUCodeGenPrepare.cpp",		"AMDGPUCodeGenPrepare.cpp",
"AMDGPUExportClustering.cpp",		"AMDGPUExportClustering.cpp",
"AMDGPUFixFunctionBitcasts.cpp",		"AMDGPUFixFunctionBitcasts.cpp",
"AMDGPUFrameLowering.cpp",		"AMDGPUFrameLowering.cpp",
"AMDGPUGlobalISelUtils.cpp",		"AMDGPUGlobalISelUtils.cpp",
"AMDGPUHSAMetadataStreamer.cpp",		"AMDGPUHSAMetadataStreamer.cpp",
"AMDGPUISelDAGToDAG.cpp",		"AMDGPUISelDAGToDAG.cpp",
"AMDGPUISelLowering.cpp",		"AMDGPUISelLowering.cpp",
"AMDGPUInline.cpp",
"AMDGPUInstCombineIntrinsic.cpp",		"AMDGPUInstCombineIntrinsic.cpp",
"AMDGPUInstrInfo.cpp",		"AMDGPUInstrInfo.cpp",
"AMDGPUInstructionSelector.cpp",		"AMDGPUInstructionSelector.cpp",
"AMDGPULateCodeGenPrepare.cpp",		"AMDGPULateCodeGenPrepare.cpp",
"AMDGPULegalizerInfo.cpp",		"AMDGPULegalizerInfo.cpp",
"AMDGPULibCalls.cpp",		"AMDGPULibCalls.cpp",
"AMDGPULibFunc.cpp",		"AMDGPULibFunc.cpp",
"AMDGPULowerIntrinsics.cpp",		"AMDGPULowerIntrinsics.cpp",
[97 lines not shown]

Revision Contents

Diff 318408
