This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Port of HSAIL inliner
ClosedPublic

Authored by rampitec on Aug 17 2017, 2:01 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Aug 17 2017, 2:01 PM
rampitec updated this revision to Diff 112191.Aug 22 2017, 10:06 AM

Added OptimizationRemarkEmitter as now in master.

Looks fine. Cosmetic questions from me. Thanks!

lib/Target/AMDGPU/AMDGPUInline.cpp
117 ↗(On Diff #112191)

Accumulate the sizes of all pointer args to private arrays; if the total size exceeds ArgAllocaCutoff, then bail out. Looks fine. I was just initially confused by the comment.

186 ↗(On Diff #112191)

CS.getCaller() doesn't need to be called again, does it? Caller was set previously.

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
329 ↗(On Diff #112191)

Does this mean that if EnableAMDGPUFunctionCalls is set, we will not have early inlining anymore?

test/CodeGen/AMDGPU/amdgpu-inline.ll
43 ↗(On Diff #112191)

Does the presence of "tail call @foo" mean @foo is considered inlined? @foo should be inlined, right?

rampitec updated this revision to Diff 112498.Aug 23 2017, 9:07 PM
rampitec marked 3 inline comments as done.

Removed extra CS.getCaller() call.

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
329 ↗(On Diff #112191)

We will not have early inline *all*. We always have selective early inlining because we run opt with the FE before linking, but without this new inlining pass it was the pristine LLVM simple inliner, which did not inline a lot of functions where we would like to see them inlined. Now the default inliner is replaced with our more aggressive version. It still runs twice with both invocations of opt, but it does not need a special pass to mark every function always_inline (which is, in a nutshell, what -amdgpu-early-inline-all does).

Moreover, without this change, even if we enable calls we will get no calls, as all of the functions will be marked always_inline and then inlined. Eventually the whole AMDGPUAlwaysInlinePass shall be removed when we enable calls.

test/CodeGen/AMDGPU/amdgpu-inline.ll
43 ↗(On Diff #112191)

It is the GCN-INL1 check, which corresponds to the run with -inline-threshold=1. I.e., for test purposes I'm practically asking not to inline conventional functions unless they have a really good reason to be inlined (like use of scratch).

The GCN-INLDEF check below represents the default threshold, under which @foo is supposed to be inlined.

alfred.j.huang accepted this revision.Aug 23 2017, 9:55 PM
This revision is now accepted and ready to land.Aug 23 2017, 9:55 PM
arsenm requested changes to this revision.Aug 23 2017, 11:02 PM

I don't like the idea of having a custom inliner, and thinking about this is very premature. We haven't attempted any benchmarking of calls or looked at the current inlining behavior or whether it could be improved. I would rather not commit this until it is clear it is absolutely necessary.

This revision now requires changes to proceed.Aug 23 2017, 11:02 PM

I don't like the idea of having a custom inliner, and thinking about this is very premature. We haven't attempted any benchmarking of calls or looked at the current inlining behavior or whether it could be improved. I would rather not commit this until it is clear it is absolutely necessary.

We did a lot of benchmarking with HSAIL. LLVM's general inliner has not changed much since then and does not factor in things that are important for us.

It's not really meaningful to compare inlining of HSAIL to AMDGPU. HSAIL never implemented any of the TTI cost model (and actually completely broke it), plus there are other ABI differences. HSAIL always used the default ABI with sret / byval, among other differences.

What in the default inliner will bump the threshold if a private pointer is passed?

HSAIL never implemented any of the TTI cost model

And this is also a completely false statement.

arsenm added inline comments.Aug 28 2017, 10:24 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

We should avoid having a custom version of the exact same flag that the regular inliner uses. This would result in some surprises.

53 ↗(On Diff #112498)

Hardcoded threshold. We should rely on the default opt-level based thresholds and override getInliningThresholdMultiplier to get the larger target default. That way the standard flags will work as expected.

112 ↗(On Diff #112498)

Early return if there is no callee rather than wrapping the rest of the function in an if.

120 ↗(On Diff #112498)

Skip the potentially costly GetUnderlyingObject call if the address space isn't private.

123–124 ↗(On Diff #112498)

I'm pretty sure you aren't allowed to alloca an unsized type, so this check isn't needed.

However you do need to check / skip dynamically sized allocas.

166–174 ↗(On Diff #112498)

Call the base implementation and see if it says always or never first rather than reproducing the logic?

176–177 ↗(On Diff #112498)

Wrapper calls will (ignoring noinline) always be inlined by the stock inline heuristics, so this shouldn't be necessary.

180–182 ↗(On Diff #112498)

This heuristic doesn't make any sense to me. Why does the block count matter? Just use the default cost.

test/CodeGen/AMDGPU/amdgpu-inline.ll
75 ↗(On Diff #112498)

Needs tests with multiple private pointers and with the alloca size threshold exceeded.

rampitec updated this revision to Diff 112989.Aug 28 2017, 4:36 PM
rampitec edited edge metadata.
rampitec marked 5 inline comments as done.

Addressed review comments.
Rebased to master.
Added detection of the case when a subscript of the same alloca is passed multiple times; it is now counted as just a single alloca (found while writing the new test).

lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

The stock inline hint threshold is only 44% higher than the regular threshold. Here I have it 6 times higher. The exact number may and will change, but if I return 9 from getInliningThresholdMultiplier() and multiply the hint by that very same number, it will still be a very low number, far less than we would really like to have.

123–124 ↗(On Diff #112498)

That was for opaque types, which we do not use now, and we had allocas of them.

166–174 ↗(On Diff #112498)

The base implementation is pure virtual. I could instead call llvm::getInlineCost() directly two times (it is called from SimpleInliner::getInlineCost() anyway), but I want to avoid the expensive CallAnalyzer.

176–177 ↗(On Diff #112498)

That holds unless we hit MaxBB, in which case we still want to inline it.

180–182 ↗(On Diff #112498)

That is to prevent huge compilation times for some programs. Not an ideal heuristic, but better than nothing.

arsenm added inline comments.Sep 1 2017, 10:31 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

If that is really a problem, we should find a different solution. I really don't want the set of flags used for debugging this to change.

rampitec added inline comments.Sep 1 2017, 10:59 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
30–32 ↗(On Diff #112498)

OK, when we hit it next time, I guess we could expose something from TTI.

rampitec updated this revision to Diff 113563.Sep 1 2017, 11:31 AM
rampitec marked 4 inline comments as done.

Removed custom inline hint threshold.

arsenm added inline comments.Sep 11 2017, 10:33 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

Compilation time from what? I doubt it was the inliner itself. That it requires this custom wrapper-function checking sounds like additional motivation to drop it.

176–177 ↗(On Diff #113563)

By doing this you could possibly be allowing inlining with incompatible function attributes. llvm::getInlineCost has the additional functionsHaveCompatibleAttributes check, which covers more than the areInlineCompatible check above.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
165–166 ↗(On Diff #113563)

How did you decide 9 for this?

rampitec added inline comments.Sep 11 2017, 11:14 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

The actual test case which required that was VRay, AFAIR. When we increase the inline threshold (or inline everything, like now) we are vulnerable to extremely high compilation times for huge codebases. In the case of VRay it was hours, and this decreased it to minutes.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
165–166 ↗(On Diff #113563)

This is the rounded ratio of the standard inline threshold to the one tuned for HSAIL. We will need to retune it later.

rampitec added inline comments.Sep 11 2017, 11:32 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
176–177 ↗(On Diff #113563)

areInlineCompatible() is called earlier, at line 167. If the functions are not compatible, we will not get to this point.

There should be an explanation of what this pass does and why it is better than LLVM's default inliner, and some benchmark data showing which applications / games this helps.

rampitec updated this revision to Diff 114648.Sep 11 2017, 11:52 AM

Added a file brief and more comments for the thresholds.

There should be an explanation of what this pass does and why it is better than LLVM's default inliner, and some benchmark data showing which applications / games this helps.

Explanation added. As for benchmarks, the one I used was Luxmark, where it gave a ~12% better result compared to the standard inliner. For the others, it does not look like it makes a lot of sense to try to recollect old HSAIL numbers.
This pass is ported because we know in advance there is an issue like this. As said before, the cost model is different, and some tuning will also be required later when we turn call support on.

arsenm added inline comments.Sep 12 2017, 8:15 PM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

I meant: where was the compile time spent? I doubt it was the inliner itself.

rampitec added inline comments.Sep 12 2017, 9:40 PM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

As usual: optimizing the inflated code, scheduling, and RA.

arsenm added inline comments.Sep 15 2017, 10:34 AM
lib/Target/AMDGPU/AMDGPUInline.cpp
180–182 ↗(On Diff #112498)

I'd rather drop this workaround for now at least.

rampitec updated this revision to Diff 115435.Sep 15 2017, 11:23 AM
rampitec marked 7 inline comments as done.

Removed MaxBB limit.

lib/Target/AMDGPU/AMDGPUInline.cpp
176–177 ↗(On Diff #112498)

AFAIR I've seen cases where our library call wrappers were not inlined even in relatively small programs.

180–182 ↗(On Diff #112498)

OK

arsenm accepted this revision.Sep 19 2017, 6:21 PM

LGTM. It does look like the default inline heuristic considers the SROA-ability of the passed arguments to some degree.

This revision is now accepted and ready to land.Sep 19 2017, 6:21 PM
rampitec updated this revision to Diff 115959.Sep 19 2017, 9:23 PM

Rebase to master.

This revision was automatically updated to reflect the committed changes.