This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Perform caller store analysis
Needs Review · Public

Authored by mpdenton on Aug 29 2019, 5:51 PM.

Details

Summary

Currently, the only information the inline cost analysis uses from the caller is the call's arguments and the attributes used on those arguments. It does not do any analysis on constants stored by the caller and later loaded by the callee.

This causes a problem at -Oz: certain functions are not inlined despite being no-ops at runtime, whereas with a larger inline threshold the function might be inlined and completely eliminated by further optimization passes.

For context, Chrome compiles the Android binary with -Oz. However, this means that many no-op destructors remain un-inlined. Chief among them are ~unique_ptr and ~scoped_refptr (Chrome's refcounted pointer wrapper). This happens even when an rvalue of type unique_ptr<T> is immediately moved and destructed. For example,
global_unique_ptr = std::make_unique<int>();
or
function_that_takes_unique_ptr(std::make_unique<int>());
emits a destructor call on Android, despite the fact that the move constructor sets the unique_ptr's internal pointer to nullptr, and the destructor performs something similar to:

if (ptr_)
  get_deleter()(ptr_);

Here's a relatively simple example: https://godbolt.org/z/YD3U7S

These extra no-op function calls are bad for binary size, and bad for performance. I tried to think of a way to fix this at a higher level, such as a Clang plugin that fixes the largest offenders (~unique_ptr and ~scoped_refptr), but the AST is too high-level to fix this issue. And increasing the inline threshold at all is very bad for performance (not to mention that fiddling with the inline threshold will either cause *all* ~unique_ptr destructors to be inlined or none of them, when it would be better to inline them on a case-by-case basis).

So, add a flag called -inline-perform-caller-analysis. For now, this flag causes an analysis of stores to the caller's stack in the callsite's basic block to run during calculation of the inline cost. Then, any loads in the callee from the caller's stack are simplified if possible (i.e., if the stored values could not already have been clobbered).

This change causes a 20KB decrease in Android binary size for Chrome. In addition, a planned Chrome change (increasing usage of std::make_unique and base::MakeRefCounted) that would increase Android binary size by ~100KB now causes no change in binary size. Chrome build time with and without this feature does not change (a trial compilation actually ran marginally faster with -inline-perform-caller-analysis).

The comments in InlineCost.cpp on CandidateCall indicate that we should not be performing analysis in the caller, so the results are more easily cacheable. However, inline analysis is already dependent on the call's arguments and parameter attributes, which probably often prevents caching from being useful. I also don't see anywhere in the LLVM codebase where results are being cached. And, in the future if caching will be added, we can return an "uncacheable" cost if this analysis affects the results. Even so, this feature is added behind a flag.

Diff Detail

Event Timeline

mpdenton created this revision. Aug 29 2019, 5:51 PM
mpdenton marked an inline comment as done. Aug 29 2019, 6:00 PM

I mostly would like feedback on the idea--there are one or two TODOs in the code and I spotted a bug. But, the change has some useful consequences for Chrome nonetheless.

llvm/lib/Analysis/InlineCost.cpp
1268

Ah, spotted a bug as soon as I posted this--a store to a stack location that overlaps (but is not equal to) an existing constant won't invalidate the original constant. I'll have to figure out how to fix that.
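For what it's worth, one way to handle the overlap (a hypothetical sketch with invented names; the actual CallerStackConstants map in the patch is structured differently) is an interval test against each recorded store:

```cpp
// Hypothetical sketch: a new store to the caller's stack must invalidate
// every recorded constant whose byte range overlaps it, not just an entry
// at the exact same offset.
#include <cstdint>
#include <map>

struct StoredConstant {
  int64_t Value;  // the known constant
  uint64_t Size;  // width of the original store, in bytes
};

// Offset into the caller's alloca -> constant known to live there.
using ConstantMap = std::map<int64_t, StoredConstant>;

// Erase every entry whose range [Begin, End) overlaps [Off, Off + Size).
void invalidateOverlapping(ConstantMap &M, int64_t Off, uint64_t Size) {
  for (auto It = M.begin(); It != M.end();) {
    int64_t Begin = It->first;
    int64_t End = Begin + static_cast<int64_t>(It->second.Size);
    if (Begin < Off + static_cast<int64_t>(Size) && Off < End)
      It = M.erase(It);
    else
      ++It;
  }
}
```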

This idea isn't fundamentally flawed; it's a good idea and something we've discussed many times.

The tricky thing I think will be getting it right.

I think you're going to need a much more powerful form of analysis, specifically something like MemorySSA, to do this kind of thing well. Integrating MemorySSA into the inliner and using it to do powerful store->load forwarding would be *awesome*, but it is also going to be quite a lot of work, I fear.

Doing something simpler is likely possible along the lines of what you're starting to do here, but I fear we will have to chase a very long tail of subtle corner cases even with these kinds of simplifications (only look at stores in the caller, only look at constant offsets, etc.).

Not sure how far down you want to go on this rabbit hole? That'll somewhat dictate the direction I suggest...

Indeed, I saw the MemorySSA comment in the source. :) I'm interested in the rabbit hole.

As for this simple case, I tried to make the patch heavily integrated with the current inlining infrastructure, so that corner cases are primarily handled by the existing logic. For example, most of the information used comes from the ConstantOffsetPtrs map and the SROAArgValues map. The SimplifiedValues map is used for propagating constants, just like for the function arguments. And clearing the map of "CallerStackConstants" is handled by the existing disableLoadElimination(), which is called whenever future loads may be clobbered.

So, the only non-trivial things added here are:
(1) the CallerStackConstants map, which has the corner case of overlapping stores (not yet dealt with);
(2) the caller "store search" performed in searchForConstantAtOffset. It's quite simple and doesn't go past many calls, nor past the callsite's basic block. These simplifications eliminate most of the corner cases.

So (1) has a corner case that should be solvable, and (2) might have some corner cases--can anything modify memory or clobber stores other than CallBases and stores?
There are plenty of corner cases in the inliner altogether, but I think the additions here don't necessarily *add* many extra.

As for efficacy, (2) is the part that is sad to do without MemorySSA. However, it covers some fairly significant cases that come up all the time in C++, as with the no-op destructors.

It seems it would be cool to do something like this first, and then dive down the MemorySSA rabbit hole to see how much it can add.

Thoughts?

mpdenton added a comment. Edited Sep 6 2019, 3:31 PM

So, looking at MemorySSA, it looks like getClobberingMemoryAccess is the API. I assume the first argument would be the MemoryAccess that corresponds to the callsite, and the MemoryLocation would be the caller pointer returned from calleePtrToCallerAllocaOffset. Is there a more sophisticated use here, or do I still need all the restrictions from the current inlining code (e.g. constant offsets)?

Would I need MemorySSA from both the callee and the caller? Is MemorySSA intra-procedural or do I still need to manually perform the translation from caller ptr -> callee ptr with calleePtrToCallerAllocaOffset?

Also, getClobberingMemoryAccess, I believe, just skips no-alias Defs but can return may-alias. But don't I want something more definitive? I.e. I either want a definitive "yes, this writes exactly the memory location you're querying about" or no result at all. So maybe I'd use the walker and then do a manual check to see whether we've found a definitive overwrite. Which would mean I'd be dealing with the same edge cases as my current code (overlapping stores?).
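To make the intended query concrete, roughly this shape (a non-runnable sketch against LLVM's MemorySSA interface; findClobber and the surrounding plumbing are invented for illustration, and whether the MemoryLocation-taking overload is the right fit is exactly my question):

```cpp
// Sketch: assumes MemorySSA has been built for the *caller*, and that
// CallerLoc is the MemoryLocation for the pointer returned by
// calleePtrToCallerAllocaOffset.
static MemoryAccess *findClobber(MemorySSA &MSSA, CallBase &CandidateCall,
                                 const MemoryLocation &CallerLoc) {
  MemorySSAWalker *Walker = MSSA.getWalker();
  MemoryAccess *CallAccess = MSSA.getMemoryAccess(&CandidateCall);
  // Walks upward past accesses known not to alias CallerLoc; the result can
  // still be a may-alias clobber, so a manual must-overwrite check follows.
  return Walker->getClobberingMemoryAccess(CallAccess, CallerLoc);
}
```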

Also, if MemorySSA is not inter-procedural, the "search" for stores of constants wouldn't continue past calls anyway, just like my current code. (Calls to readonly functions won't show up as clobbers, but my current code handles that already.)

So I'd think MemorySSA would probably buy me "stores from other basic blocks in the function, not just the callsite's basic block", but I'd still probably be dealing with the same edge cases, and MemorySSA brings its own can of worms.

But I'm new to this MemorySSA thing, so I probably missed something above and there might be some additional benefit to using MemorySSA? (There's the "load elimination" thing from the current inliner code, but I'm considering that orthogonal to this change)

I'm not familiar with the inliner, but I'll try to answer some of the MemorySSA questions.

So, looking at MemorySSA, it looks like getClobberingMemoryAccess is the API. I assume the first argument would be the MemoryAccess that corresponds to the callsite, and the MemoryLocation would be the caller pointer returned from calleePtrToCallerAllocaOffset. Is there a more sophisticated use here, or do I still need all the restrictions from the current inlining code (e.g. constant offsets)?

Would I need MemorySSA from both the callee and the caller? Is MemorySSA intra-procedural or do I still need to manually perform the translation from caller ptr -> callee ptr with calleePtrToCallerAllocaOffset?

MemorySSA applies to a single function (it is intra-procedural, but not inter-procedural, I believe). I am guessing yes, you'll need to manually perform the translation.

Also, getClobberingMemoryAccess, I believe, just skips no-alias Defs but can return may-alias. But don't I want something more definitive? I.e. I either want a definitive "yes, this writes exactly the memory location you're querying about" or no result at all. So maybe I'd use the walker and then do a manual check to see whether we've found a definitive overwrite. Which would mean I'd be dealing with the same edge cases as my current code (overlapping stores?).

Yes. It will also set a "Must" bit for a known must-alias. If it is set to May, it's probably not a definitive overwrite.

Also, if MemorySSA is not inter-procedural, the "search" for stores of constants wouldn't continue past calls anyway, just like my current code. (Calls to readonly functions won't show up as clobbers, but my current code handles that already.)

That's right, it's not inter-procedural, single function only. Again, not familiar with this. I will guess you could skip more than just readonly calls, but it cannot continue past any call.

So I'd think MemorySSA would probably buy me "stores from other basic blocks in the function, not just the callsite's basic block", but I'd still probably be dealing with the same edge cases, and MemorySSA brings its own can of worms.

It will probably buy you some more than that but I will guess there will be remaining edge cases and potentially a big can of worms :).

But I'm new to this MemorySSA thing, so I probably missed something above and there might be some additional benefit to using MemorySSA? (There's the "load elimination" thing from the current inliner code, but I'm considering that orthogonal to this change)

I'll try to look over this in more detail next week to understand the use case. Happy to chat as well.

MemorySSA applies to a single function (it is intra-procedural, but not inter-procedural, I believe). I am guessing yes, you'll need to manually perform the translation.

Ah yes, of course I meant inter-procedural, thanks for the clarification.

Also, getClobberingMemoryAccess, I believe, just skips no-alias Defs but can return may-alias. But don't I want something more definitive? I.e. I either want a definitive "yes, this writes exactly the memory location you're querying about" or no result at all. So maybe I'd use the walker and then do a manual check to see whether we've found a definitive overwrite. Which would mean I'd be dealing with the same edge cases as my current code (overlapping stores?).

Yes. It will also set a "Must" bit for a known must-alias. If it is set to May, it's probably not a definitive overwrite.

Where does it set this "Must" bit? I took a look at the MemorySSA code, and from what I can tell, it doesn't do anything with the Alias result, at least in the "MemoryLoc" version?

So I'd think MemorySSA would probably buy me "stores from other basic blocks in the function, not just the callsite's basic block", but I'd still probably be dealing with the same edge cases, and MemorySSA brings its own can of worms.

It will probably buy you some more than that but I will guess there will be remaining edge cases and potentially a big can of worms :).

Well, hopefully it can buy us a lot without too much trouble.

I'll try to look over this in more detail next week to understand the use case. Happy to chat as well.

Thanks!