Looks fine. Cosmetic questions from me. Thanks!
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
118 | Accumulation of all pointer args to private arrays: if the total size exceeds ArgAllocaCutoff, then bail out (a sketch of this follows these tables). Looks fine; I was just initially confused by the comment.
187 | CS.getCaller() doesn't need to be called again, does it? Caller was set previously.

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Line | Comment
---|---
336 | Does this mean that if EnableAMDGPUFunctionCalls is set, we will not have early inlining anymore?

test/CodeGen/AMDGPU/amdgpu-inline.ll

Line | Comment
---|---
44 | Is the existence of "tail call @foo" considered as @foo being inlined? @foo should be inlined, right?
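To make the heuristic described in the line-118 comment concrete, here is a minimal hedged sketch; it is not the patch's actual code. ArgAllocaCutoff is the option discussed above, while PrivateAS, CS, and DL are stand-ins for the target's private address space number, the call site, and the DataLayout:

```cpp
#include "llvm/Analysis/ValueTracking.h" // GetUnderlyingObject
#include "llvm/IR/CallSite.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Sum the sizes of the static allocas reachable through the call's
// private-pointer arguments; give up (no inlining bonus) once the total
// exceeds the cutoff. PrivateAS and ArgAllocaCutoff are stand-in names.
static bool exceedsArgAllocaCutoff(CallSite CS, const DataLayout &DL,
                                   unsigned PrivateAS,
                                   uint64_t ArgAllocaCutoff) {
  uint64_t AllocaSize = 0;
  for (Value *PtrArg : CS.args()) {
    auto *PtrTy = dyn_cast<PointerType>(PtrArg->getType());
    if (!PtrTy || PtrTy->getAddressSpace() != PrivateAS)
      continue; // only private (scratch) pointers are interesting
    auto *AI = dyn_cast<AllocaInst>(GetUnderlyingObject(PtrArg, DL));
    if (!AI || !AI->isStaticAlloca())
      continue; // dynamically sized allocas cannot be costed here
    AllocaSize += DL.getTypeAllocSize(AI->getAllocatedType());
    if (AllocaSize > ArgAllocaCutoff)
      return true; // too much private memory involved: bail out
  }
  return false;
}
```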
Removed extra CS.getCaller() call.
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Line | Comment
---|---
336 | We will not have early inline *all*. We always have selective early inlining because we run opt with the FE before linking, but without this new inlining pass it was the pristine LLVM simple inliner, which did not inline a lot of functions where we would like to see them inlined. Now the default inliner is replaced with our more aggressive version. It still runs twice, with both invocations of opt, but it does not need a special pass to mark every function always_inline (which is, in a nutshell, what -amdgpu-early-inline-all does; see the sketch right after these tables). Moreover, without this change, even if we enable calls we will get no calls, as all of the functions will be marked always_inline and then inlined. Eventually, when we enable calls, the whole AMDGPUAlwaysInlinePass shall be removed.

test/CodeGen/AMDGPU/amdgpu-inline.ll

Line | Comment
---|---
44 | It is the GCN-INL1 check, which corresponds to the run with -inline-threshold=1. I.e., for the test's purposes I'm practically asking not to inline conventional functions unless they have a really good reason to be inlined (like use of scratch). The GCN-INLDEF check below represents the default threshold, under which foo is supposed to be inlined.
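As a rough, simplified stand-in for what "mark every function always_inline" amounts to (the real AMDGPUAlwaysInlinePass has more exceptions than this sketch):

```cpp
#include "llvm/IR/Module.h"
using namespace llvm;

// Force-mark every defined function always_inline so the stock inliner
// then inlines all of them. Simplified illustration only.
static bool markAllAlwaysInline(Module &M) {
  bool Changed = false;
  for (Function &F : M) {
    // Skip declarations and functions where the attribute is illegal/moot.
    if (F.isDeclaration() || F.hasFnAttribute(Attribute::NoInline) ||
        F.hasFnAttribute(Attribute::AlwaysInline))
      continue;
    F.addFnAttr(Attribute::AlwaysInline);
    Changed = true;
  }
  return Changed;
}
```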
I don't like the idea of having a custom inliner, and this is very premature. We haven't attempted any benchmarking of calls, or looked at the current inlining behavior or whether it could be improved. I would rather not commit this until it is clear it is absolutely necessary.
We did a lot of benchmarking with HSAIL. The LLVM general inliner has not changed much since then and does not factor in things that are important for us.
It's not really meaningful to compare inlining for HSAIL to AMDGPU. HSAIL never implemented any of the TTI cost model (and actually completely broke it), plus there are other ABI differences. HSAIL was always using the default ABI, with sret / byval among other differences.
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
31–33 | We should avoid having a custom version of the exact same flag that the regular inliner uses. This would result in some surprises.
54 | Hardcoded threshold. We should rely on the default opt-level-based thresholds and override getInliningThresholdMultiplier to get the larger target default (sketched right after these tables). That way the standard flags will work as expected.
113 | Early-return if there is no callee, rather than wrapping the rest of the function in an if.
121 | Skip the potentially costly GetUnderlyingObject call if the address space isn't private.
124–125 | I'm pretty sure you aren't allowed to alloca an unsized type, so this check isn't needed. However, you do need to check for / skip dynamically sized allocas.
167–175 | Call the base implementation and see if it says always or never first, rather than reproducing the logic?
177–178 | Wrapper calls will (ignoring noinline) always be inlined by the stock inline heuristics, so this shouldn't be necessary.
181–183 | This heuristic doesn't make any sense to me. Why does the block count matter? Just use the default cost.

test/CodeGen/AMDGPU/amdgpu-inline.ll

Line | Comment
---|---
76 | Needs tests with multiple private pointers and with the alloca size threshold exceeded.
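A sketch of the row-54 suggestion, as a member of the target's TTI implementation class (this mirrors the hook that AMDGPUTargetTransformInfo.h is being asked to override; the concrete value 9 is debated further down in this review):

```cpp
// Scale the generic, opt-level-based inline thresholds instead of
// hardcoding one, so -inline-threshold and friends keep behaving as
// expected while the target still gets a larger default.
unsigned getInliningThresholdMultiplier() { return 9; }
```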
Addressed review comments.
Rebased to master.
Added detection of the case when a subscript of the same alloca is passed multiple times: count it as just a single alloca (found while writing the new test).
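A hedged sketch of that dedup, under the same stand-in assumptions as the earlier fragment (CS and DL in scope): a SmallPtrSet of the underlying objects keeps the accounting per-alloca rather than per-argument.

```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/CallSite.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Total the allocas behind the call's pointer args, counting each
// underlying alloca only once even when several subscripts of the same
// alloca are passed as separate arguments.
static uint64_t uniqueArgAllocaSize(CallSite CS, const DataLayout &DL) {
  SmallPtrSet<const AllocaInst *, 8> SeenAllocas;
  uint64_t AllocaSize = 0;
  for (Value *PtrArg : CS.args()) {
    auto *AI = dyn_cast<AllocaInst>(GetUnderlyingObject(PtrArg, DL));
    if (!AI || !AI->isStaticAlloca())
      continue;                          // not a statically sized alloca
    if (!SeenAllocas.insert(AI).second)
      continue;                          // this alloca was already counted
    AllocaSize += DL.getTypeAllocSize(AI->getAllocatedType());
  }
  return AllocaSize;
}
```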
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
31–33 | The stock inline-hint threshold is only 44% higher than the regular threshold; here I have it 6 times higher. The exact number may and will change, but if I return 9 from getInliningThresholdMultiplier() and the hint threshold is multiplied by that very same number, it will still be a very low number, far less than we would really like to have (the arithmetic is spelled out after this table).
124–125 | That was for opaque types, which we do not use now; we used to have allocas of them.
167–175 | The base implementation is pure virtual. I could call llvm::getInlineCost() directly instead (it is called from SimpleInliner::getInlineCost() anyway) two times, but I want to avoid the expensive CallAnalyzer.
177–178 | That is unless we hit MaxBB, in which case we still want to inline it.
181–183 | That is to prevent the huge compilation time of some programs. Not an ideal heuristic, but better than nothing.
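For concreteness, assuming the stock LLVM defaults of the era (-inline-threshold=225 and -inlinehint-threshold=325, neither of which is stated in this thread), the arithmetic behind that reply works out as follows: scaling both thresholds by the same multiplier preserves the stock ratio, so the hint never approaches the roughly 6x relationship the patch wants.

```latex
\[
\frac{325}{225} \approx 1.44,
\qquad
\frac{9 \times 325}{9 \times 225} = \frac{2925}{2025} \approx 1.44 \ll 6
\]
```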
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
31–33 | If that is really a problem, we should find a different solution. I really don't want the set of flags used for debugging this to change.
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
31–33 | OK, when we hit it next time I guess we could expose something from TTI.
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
177–178 | By doing this you could possibly be allowing inlining with incompatible function attributes. llvm::getInlineCost has the additional functionsHaveCompatibleAttributes check, which checks more than the areInlineCompatible call above.
181–183 | Compilation time from what? That it requires this custom wrapper-function checking sounds like additional motivation to drop it.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

Line | Comment
---|---
165–166 | How did you decide on 9 for this?
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
181–183 | The actual test case which required that was VRay, AFAIR. When we increase the inline threshold (or inline everything, like now) we are vulnerable to extremely high compilation times for huge codebases. In the case of VRay it was hours, and this decreased it to minutes.

lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

Line | Comment
---|---
165–166 | This is the rounded ratio of the standard inline threshold to the one tuned for HSAIL; we will need to retune it later (the implied arithmetic is sketched below).
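The HSAIL-tuned threshold itself is never given in this thread, but assuming the stock default of 225, the "rounded ratio" implies roughly:

```latex
\[
\text{multiplier} = \operatorname{round}\!\left(\frac{T_{\mathrm{HSAIL}}}{T_{\mathrm{stock}}}\right) = 9,
\quad T_{\mathrm{stock}} = 225
\;\Longrightarrow\;
T_{\mathrm{HSAIL}} \approx 9 \times 225 = 2025
\]
```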
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
177–178 | areInlineCompatible() is called earlier, at line 167. If the attributes are not compatible we will not get to this point.
There should be an explanation of what this pass does, why it is better than LLVM's default inliner, and some benchmark data showing which applications / games this helps.
Explanation added. For the benchmarks, the one I used was Luxmark, where it gave a ~12% better result compared to the standard inliner. For the others, it does not look like it makes a lot of sense to try to recollect the old HSAIL numbers.
This pass is being ported because we know in advance there is an issue like this. As said before, the cost model is different, and some tuning will also be required later when we turn call support on.
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
181–183 | I meant: where was the compile time actually spent? I doubt it was the inliner itself.
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
181–183 | As usual: optimizing the inflated code, scheduling, and RA.
lib/Target/AMDGPU/AMDGPUInline.cpp

Line | Comment
---|---
181–183 | I'd rather drop this workaround, for now at least.
LGTM. It does look like the default inline heuristic considers the SROA-ability of the passed arguments to some degree.