This is an archive of the discontinued LLVM Phabricator instance.

Estimate speedup due to inlining and use that to adjust threshold.
Abandoned · Public

Authored by eraman on Feb 16 2017, 5:31 PM.

Details

Reviewers
chandlerc
davidxl
Summary

This adds a heuristic that estimates the speedup due to inlining a callsite and increases the threshold for callsites whose estimated speedup exceeds a configurable minimum. The speedup is defined as WeightedSavings / (WeightedCost + WeightedSavings). WeightedCost is obtained by weighting the cost (already computed by the analysis) of each basic block with the block's frequency. To compute WeightedSavings, we keep track of the savings due to inlining, which have three components:

  a. the cost savings from eliminating the call overhead and argument setup overhead,
  b. the cost savings due to SROA, and
  c. the cost savings from instructions eliminated after inlining.

Note that the savings do not include the cost of blocks that are unreachable in the call context: irrespective of inlining, those blocks are never reached dynamically. Like WeightedCost, WeightedSavings is calculated by weighting the savings of each block with its frequency.
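As a rough illustration of the arithmetic (a hand-written sketch, not the patch's code; BlockInfo, estimateSpeedupPercent, and the per-block bookkeeping are hypothetical):

  #include <cstdint>
  #include <vector>

  // Hypothetical per-block bookkeeping: Cost and Savings stand in for the
  // values the inline cost analysis accumulates; Freq comes from BFI.
  struct BlockInfo {
    int Cost;
    int Savings; // call/arg-setup elimination + SROA + simplified instructions
    uint64_t Freq;
  };

  // Speedup = WeightedSavings / (WeightedCost + WeightedSavings), in percent.
  int estimateSpeedupPercent(const std::vector<BlockInfo> &Blocks) {
    uint64_t WeightedCost = 0, WeightedSavings = 0;
    for (const BlockInfo &B : Blocks) {
      // Blocks unreachable in the call context carry no savings: they never
      // execute dynamically, inlined or not.
      WeightedCost += (uint64_t)B.Cost * B.Freq;
      WeightedSavings += (uint64_t)B.Savings * B.Freq;
    }
    uint64_t Total = WeightedCost + WeightedSavings;
    return Total ? (int)(100 * WeightedSavings / Total) : 0;
  }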

This analysis is triggered only when the callsite's frequency relative to the caller's entry exceeds a configurable parameter. In my experiments, I noticed that not having this filter results in increased code size in the cold regions.
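A minimal sketch of that gate, assuming a hypothetical MinCallSiteFreqRatio parameter and raw BFI frequencies:

  #include <cstdint>

  // Hypothetical gate: run the speedup analysis only when the callsite's
  // frequency is at least MinCallSiteFreqRatio times the caller's entry
  // frequency, so cold callsites don't get the bonus (or the code growth).
  bool speedupAnalysisEnabled(uint64_t CallSiteFreq, uint64_t CallerEntryFreq,
                              uint64_t MinCallSiteFreqRatio) {
    // Equivalent to CallSiteFreq / CallerEntryFreq >= MinCallSiteFreqRatio,
    // without integer-division truncation.
    return CallSiteFreq >= MinCallSiteFreqRatio * CallerEntryFreq;
  }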

I have done some parameter tuning using a hacked-up version of this code (made to work with the old PM) on a set of internal benchmarks. I expect more tuning to be done on a broader set of benchmarks. Since this is active only in the new-PM pipeline, I haven't added a flag to disable it, but I'll add one if you prefer.


Event Timeline

efriedma added inline comments.
lib/Analysis/InlineCost.cpp
1429

CalleeBFI is unused?

Thanks for the comment!

lib/Analysis/InlineCost.cpp
1429

Ouch. This is badly broken. In two places where I should have used CalleeBFI, I've used CallerBFI. getBlockFreq returns 0 if the BB is not in the function whose BFI is used. The tests still pass because I've used getEntryFreq of the Caller (instead of the Callee), which still returns a valid value, and the weighted savings associated with the entry block (elimination of the call and argument setup) are enough to produce the speedup. I'll send a revised patch tomorrow.
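To illustrate the failure mode (sketch only): querying the wrong function's BFI does not assert, it just returns a zero frequency.

  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/IR/BasicBlock.h"
  using namespace llvm;

  // BB belongs to the *callee*. getBlockFreq() returns 0 for a block that is
  // not in the function the BFI was computed for, so using CallerBFI here
  // silently zeroes the weighted savings instead of failing loudly.
  uint64_t calleeBlockFreq(BlockFrequencyInfo &CallerBFI,
                           BlockFrequencyInfo &CalleeBFI,
                           const BasicBlock *BB) {
    // Wrong: CallerBFI.getBlockFreq(BB).getFrequency() == 0
    return CalleeBFI.getBlockFreq(BB).getFrequency();
  }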

eraman updated this revision to Diff 88946. Feb 17 2017, 1:12 PM
  • Use CalleeBFI to get the current block's frequency.
  • Move accumulateCost into this patch instead of it being part of the dependent patch.
davide added a subscriber: davide. Feb 17 2017, 6:57 PM
davidxl added inline comments. Feb 21 2017, 12:02 PM
lib/Analysis/InlineCost.cpp
89

Does this mean globally hot callsites may not get the speedup bonus if their frequency relative to the caller entry is smaller than 8, while a less hot callsite may get both the hot-callsite bump and the big-speedup bump?

254

Document the method.

255

hasLargeSpeedup ?

561

Should this be called inside the lambda?

592

Why does this need to be guarded here?

1053

Why is the guard needed here?

1160

Perhaps at least add parameter passing to the savings?

eraman marked 5 inline comments as done. Feb 22 2017, 11:14 AM

Thanks for the comments.

lib/Analysis/InlineCost.cpp
89

That is possible. The easy fix is to check whether the callsite is hot or its relative frequency exceeds this threshold. I have made this change.
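A sketch of the revised condition (hypothetical names; IsGloballyHot stands in for whatever hotness check the patch uses):

  #include <cstdint>

  // Hypothetical revised gate: a globally hot callsite qualifies even when
  // its frequency relative to the caller's entry is below the ratio cutoff.
  bool qualifiesForSpeedupBonus(bool IsGloballyHot, uint64_t CallSiteFreq,
                                uint64_t CallerEntryFreq, uint64_t MinRatio) {
    return IsGloballyHot || CallSiteFreq >= MinRatio * CallerEntryFreq;
  }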

592

We want to give a bonus only if the speedup happens due to inlining. In most cases, if Evaluate returns a constant, we would have looked up the operands in SimplifiedValues. But it is also possible that the IR actually has an instruction with two constant operands that was not cleaned up by an earlier simplification pass, and we don't want to count that as an inlining benefit (since a later simplification pass will simplify it anyway). I don't think this is very likely in a reasonable pass pipeline, but having the guard explicitly makes the intent clearer.
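Roughly, the guard being described (a sketch with illustrative names, not the patch's code):

  #include <set>

  // SimplifiedByInlining stands in for the analysis's SimplifiedValues map:
  // callee values known to fold to constants *because of this call's
  // arguments*. Only folds that depend on such a value count as savings.
  using ValueId = int;
  bool creditFoldAsInliningSavings(
      const std::set<ValueId> &SimplifiedByInlining, ValueId Op0, ValueId Op1,
      bool FoldsToConstant) {
    if (!FoldsToConstant)
      return false;
    // Two literal constant operands fold with or without inlining; a later
    // simplification pass will clean that up, so it earns no inlining bonus.
    return SimplifiedByInlining.count(Op0) || SimplifiedByInlining.count(Op1);
  }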

1053

Same explanation as above.

eraman updated this revision to Diff 89385. Feb 22 2017, 11:15 AM
eraman marked an inline comment as done.

Rebase and address David's comments.

davidxl added inline comments. Feb 24 2017, 11:52 AM
lib/Analysis/InlineCost.cpp
592

Can you add a brief comment about this?

615

Better to filter cold callsites out too.

1053

A brief comment.

test/Transforms/Inline/speedup-analysis.ll
4

What do the savings come from in this case? Just the call overhead?

chandlerc edited edge metadata. Feb 24 2017, 12:34 PM

Some initial (minor) comments. It'd also help, I think, to rebase this as some of the refactorings land.

lib/Analysis/InlineCost.cpp
80

nit: Please use vertical whitespace before the comment for the second flag.

A more substantive comment: what unit is this? The "desc" string only says "speedup", and it isn't clear whether you mean "10% faster after inlining". If so, that seems a surprisingly low threshold...

90–91

The flag seems to indicate this is the value of the frequency itself... the comment seems to indicate it is relative... Makes it hard for me to understand the value used...

186

I think it'd be better to have the argument always passed and match the accumulateCost API?

Prazek added a subscriber: Prazek. Feb 26 2017, 1:29 AM
Prazek added inline comments.
lib/Analysis/InlineCost.cpp
543

Range-based for loop? There should be something like .operands().

chandlerc added inline comments. Feb 26 2017, 3:02 PM
lib/Analysis/InlineCost.cpp
543

I looked and there isn't really. I can add one though.

Prazek added inline comments. Feb 26 2017, 4:13 PM
lib/Analysis/InlineCost.cpp
543

That would be great, thanks

chandlerc added inline comments. Feb 26 2017, 5:10 PM
lib/Analysis/InlineCost.cpp
543

Huh, I must have missed it. Danny added an 'indices' method back in 2015. =D

eraman marked 5 inline comments as done. Feb 27 2017, 3:16 PM
eraman added inline comments.
lib/Analysis/InlineCost.cpp
80

I've tried to expand this a bit. PTAL.

Regarding the threshold being low: when I tuned it originally, it was before r286814, which added -30 to the cost of every inlining. At the least, this needs to be revised upward to account for that. In any case, I'll collect more numbers on the size/performance tradeoff and follow up.

90–91

It is relative. I've updated the variable name, flag name and the description to make this clear.

543

I see indices() in some other classes, but not in GetElementPtrInst.

test/Transforms/Inline/speedup-analysis.ll
4

Yes, the savings come from eliminating the call.

eraman updated this revision to Diff 89947. Feb 27 2017, 3:17 PM
eraman marked an inline comment as done.

Address review comments.

Sorry for the delay in collecting performance numbers. I now have some data to share. First, some details on the methodology. I used ~400 microbenchmarks that we use internally at Google. I built them with the following percentage values of min-speedup-for-bonus: 0%, 5%, 10%, and 15%. I ran each benchmark 10 times in each configuration. Speedup/slowdown for a benchmark is calculated only when the p-value <= 0.05 (and thus the results might include a different subset of benchmarks for different configs). The numbers presented below are the geomean across all benchmarks.

Config | #Benchmarks | Geomean | #Slowdowns | #Speedups | Size increase
0%     | 134         | 2.92%   | 51         | 83        | 2.45%
5%     | 121         | 1.05%   | 41         | 80        | 1.58%
10%    | 115         | 0.8%    | 51         | 64        | 1.32%
15%    | 160         | 1.03%   | 44         | 116       | 1.02%

Some observations:

  • The best geomean performance comes when the min-speedup-for-bonus is set to 0%. I interpret this to mean that it is generally a performance win to increase the threshold for hot callsites, and the speedup estimation is a way to control the size growth.
  • The performance when min-speedup-for-bonus is set to 10% sits between that of 5% and 15%. As I mentioned above, these are not apples-to-apples comparisons because we compute the geomean on a different set of benchmarks. Even for the same benchmark, it is possible (and it does happen) that the performance numbers are not monotonically decreasing as min-speedup-for-bonus is increased.
  • For comparison, I calculated the size growth if we simply apply a 3X multiplier to the threshold irrespective of the callsite frequency. The size increase is 9.7%.

I'm collecting SPEC numbers now. I've also fixed a bug in the code and will update the patch shortly.

eraman updated this revision to Diff 95490. Apr 17 2017, 2:29 PM

Fix a bug and rebase.

davidxl edited edge metadata. Apr 18 2017, 12:18 PM

Given the performance data, I suggest dropping the min-speedup requirement to 0 at O3.

The data here is really interesting, but I'm not sure about using the 0% threshold...

What I mean by that is: if we use a 0% min-speedup-for-bonus, then we essentially aren't using the speedup computation at all, are we? It seems like this would be roughly the same as just applying a similar multiplier to the threshold based on callsite hotness. Maybe I just don't understand what the result of this is (sorry if I'm failing to page back in all of the details)? If my understanding is correct, though, then I would focus on that first and get it in, and then return to the speedup heuristic to see if there are wins to be found by doing a speedup analysis to bonus less hot callsites, or by giving an *even higher* threshold when a callsite is both hot *and* gives a speedup on inlining.
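To make that concrete, here is a sketch of the adjustment under discussion (names and the multiplier are illustrative): with a 0% minimum the speedup test always passes, so the bonus reduces to a pure hotness-based multiplier.

  // Illustrative only: the estimated speedup lies in [0, 100], so with
  // MinSpeedupForBonus == 0 the second condition is always true and the
  // speedup analysis no longer filters anything.
  int adjustedThreshold(int Threshold, bool RelativelyHot, int SpeedupPercent,
                        int MinSpeedupForBonus, int Multiplier = 3) {
    if (RelativelyHot && SpeedupPercent >= MinSpeedupForBonus)
      return Threshold * Multiplier;
    return Threshold;
  }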

Please also collect LLVM test suite numbers with the SPEC numbers.

One thing that would be particularly important, though, is to collect larger application *size* numbers. I don't think the size growth numbers from microbenchmarks are really going to tell us what we need to know to make good threshold decisions where size is a factor (especially O2 vs. O3).

eraman abandoned this revision. Apr 21 2017, 5:36 PM