This is an archive of the discontinued LLVM Phabricator instance.

[Inliner] Attempt to more accurately model the cost of loops at minsize
Changes Planned · Public

Authored by dmgreen on Oct 18 2018, 10:34 AM.

Details

Reviewers

efriedma
chandlerc
fhahn
eraman
javed.absar
zzheng

Summary

This is three separate things, which when applied together come to much the same effect as D52716. They are:

1. Reduce the call penalty at minsize to 15. In my testing on Arm and X86, this seemed to be the best spot for reducing codesize. It is still not 0 because of inaccuracies in the inliner cost modelling, but it can be gradually decreased over time as things get better. (I did not change the Threshold as it seemed sensible to also decrease the cost of sub-calls at minsize).
2. If a block has more than one unconditional predecessor, mark each one past the first as non-free. This models the extra branch costs for loops (but can cause problems for cases where the blocks are not in the form they will appear in assembly).
3. Geps that are used by phis are not free. This attempts to capture the geps in loop IVs. (I'm not sure if there is a better way to capture that.)

All together they make us much less likely to inline small loops at minsize. The patch in D52716 is still better for total codesize in my testing, but this attempts to model things more precisely.
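As a rough illustration of the first point, the sketch below reproduces the non-byval arithmetic of getCallsiteCost from this diff with the two penalty values. The helper name callsiteCost and its bool parameter are made up for this example; only the three constants come from the patch.

```cpp
#include <cassert>

// Values copied from InlineConstants as shown in this diff.
constexpr int InstrCost = 5;
constexpr int CallPenalty = 25;
constexpr int CallMinSizePenalty = 15;

// Hypothetical helper (not part of LLVM): cost credited for removing a call
// with NumArgs non-byval arguments. One instruction per argument, one for the
// call itself, plus the penalty that applies at the current size level.
constexpr int callsiteCost(int NumArgs, bool MinSize) {
  return NumArgs * InstrCost + InstrCost +
         (MinSize ? CallMinSizePenalty : CallPenalty);
}

// A three-argument call: 15 + 5 + 25 normally, 15 + 5 + 15 at minsize.
static_assert(callsiteCost(3, false) == 45, "default penalty");
static_assert(callsiteCost(3, true) == 35, "minsize penalty");
```

Under these assumptions each call is worth 10 units less at minsize, both as a cost inside the callee and as a saving at the call site being inlined.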

Diff Detail

Event Timeline

dmgreen created this revision. Oct 18 2018, 10:34 AM

Herald added a reviewer: javed.absar. Oct 18 2018, 10:34 AM

Herald added subscribers: haicheng, kristof.beyls.

dmgreen mentioned this in D52716: [Inliner] Penalise inlining of calls with loops at Oz. Oct 24 2018, 9:11 AM

(but can cause problems for cases where the blocks are not in the form they will appear in assembly).

I'm not sure what sort of issue you're running into here?

Reducing the call penalty seems like it's a good idea if our inline modeling has improved since it was set. But I'm a little cautious about messing with it on its own. If you decreased both the call penalty and the overall threshold at the same time, it would have the effect of encouraging inlining for functions which call other functions; if that has a good effect on codesize, we should do that. If that's not the effect we want, we should probably just change the overall threshold instead.
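The interaction described above can be sketched with a toy cost model. Everything here is a simplification for illustration: bodyCost and wouldInline are hypothetical helpers, argument setup and call-site bonuses are ignored, and the thresholds 40 and 35 are invented numbers rather than LLVM's real values.

```cpp
#include <cassert>

constexpr int InstrCost = 5; // as in InlineConstants in this diff

// Hypothetical toy model: every non-call instruction costs InstrCost and
// every call costs InstrCost plus the call penalty.
constexpr int bodyCost(int Instrs, int Calls, int Penalty) {
  return Instrs * InstrCost + Calls * (InstrCost + Penalty);
}

// Assumed decision rule for this sketch: inline when cost is below threshold.
constexpr bool wouldInline(int Cost, int Threshold) { return Cost < Threshold; }

// A leaf with 8 instructions and a wrapper with 2 instructions plus 1 call
// cost the same (40) at penalty 25, so a threshold of 40 rejects both.
static_assert(!wouldInline(bodyCost(8, 0, 25), 40), "leaf rejected");
static_assert(!wouldInline(bodyCost(2, 1, 25), 40), "wrapper rejected");

// Lowering the penalty to 15 and the threshold to 35 together admits the
// wrapper (30) but still rejects the leaf (40).
static_assert(wouldInline(bodyCost(2, 1, 15), 35), "wrapper admitted");
static_assert(!wouldInline(bodyCost(8, 0, 15), 35), "leaf still rejected");
```

In this model, lowering both knobs shifts the balance towards callees that themselves make calls, which is the effect the comment asks to be measured before committing to it.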

The PHI handling change is kind of ugly, but I guess it's roughly right. Not sure we want to guard it with minsize, specifically; it should do the right thing at higher inlining levels.

If we're going to start spending time making changes like this that slightly tweak the inline cost for the sake of minsize, I think we need to be a bit more methodical in terms of testing, so we don't end up playing tug-of-war with inliner tweaks that affect different codebases in different ways. For example, having a series of testcases with calls to small functions near the various thresholds, accompanied by actual measurements of the codesize on a couple targets, so we can be confident that tweaks are actually profitable. For other optimization levels, inlining is more about capturing big optimization opportunities, as opposed to the exact cost of various instructions, so it doesn't matter as much, but I think spending more time to construct tests will pay off for -Oz in particular. We currently have very little test coverage for minsize inlining.

(but can cause problems for cases where the blocks are not in the form they will appear in assembly).

I'm not sure what sort of issue you're running into here?

IIRC, one example I looked at was code like:

bb1:
  ..
  br end
bb2:
  ..
  br end
end:
  ret

The final assembly will fold the rets into the preceding blocks, so there are no extra branches (it may have returned a value, with a phi in the end block). Generally we have to work with what we expect the final form to be, not what the form is right now.
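To make the mismatch concrete, here is a standalone sketch of the counting rule from the patch, applied to plain data instead of LLVM basic blocks. PredEdge and extraBranchCost are hypothetical names for this example, and KnownDead stands in for the DeadBlocks and KnownSuccessors checks in the real code.

```cpp
#include <cassert>
#include <vector>

constexpr int InstrCost = 5; // as in InlineConstants in this diff

// Hypothetical stand-in for one incoming edge of a block.
struct PredEdge {
  bool IsUnconditionalBranch; // predecessor ends in an unconditional br
  bool KnownDead;             // edge is known not to be taken
};

// Charge one instruction for each live unconditional predecessor past the
// first, following the loop added to analyzeCall in the patch.
int extraBranchCost(const std::vector<PredEdge> &Preds) {
  int Cost = 0;
  bool FoundUnconditional = false;
  for (const PredEdge &P : Preds) {
    if (P.KnownDead || !P.IsUnconditionalBranch)
      continue;
    if (FoundUnconditional)
      Cost += InstrCost;
    FoundUnconditional = true;
  }
  return Cost;
}
```

For the bb1/bb2/end shape above, both predecessors are unconditional, so `extraBranchCost({{true, false}, {true, false}})` charges one InstrCost even though the returns would be folded and no branch emitted. A block with one conditional and one unconditional predecessor is charged nothing.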

Reducing the call penalty seems like it's a good idea if our inline modeling has improved since it was set. But I'm a little cautious about messing with it on its own. If you decreased both the call penalty and the overall threshold at the same time, it would have the effect of encouraging inlining for functions which call other functions; if that has a good effect on codesize, we should do that. If that's not the effect we want, we should probably just change the overall threshold instead.

I wasn't sure why a subcall should cost more than a single instruction at minsize (barring obvious register setup). There may be some knock-on effects I'm not thinking of, but if it's only going to be a single instruction, it only needs to be counted as such. Extra spilled registers around calls perhaps?

The PHI handling change is kind of ugly, but I guess it's roughly right. Not sure we want to guard it with minsize, specifically; it should do the right thing at higher inlining levels.

Yeah, I was a little wary of changing anything when optimising for performance (having to run a lot of benchmarking, and the potential reverts). Are register moves "free" there or not? Depends on the target I would guess, but we do count register setup for calls. There was also a test in AArch64/phi.ll (outer13 I think, but it may have been one of the ones before/after), that was failing. Looking again at the test though, it may just be that the test isn't really right, if instcombine can simplify it already.
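The PHI handling under discussion reduces to a small predicate, sketched here outside LLVM. UserKind and isConstantGEPFree are hypothetical names for this illustration; the real check walks I.users() and tests isa<PHINode>, as the InlineCost.cpp hunk in this revision shows.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical classification of the instructions that use a GEP.
enum class UserKind { Load, Store, Phi, Other };

// A constant-offset GEP is modelled as free, except that at minsize a GEP
// feeding any phi is counted, since it is likely a loop induction update.
bool isConstantGEPFree(const std::vector<UserKind> &Users, bool MinSize) {
  if (MinSize &&
      std::any_of(Users.begin(), Users.end(),
                  [](UserKind K) { return K == UserKind::Phi; }))
    return false;
  return true;
}
```

With this rule, a GEP used by both a phi and a load is no longer free at minsize, while the same GEP outside minsize keeps its current treatment. Dropping the MinSize guard, as suggested in the review, would amount to removing the first operand of the condition.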

If we're going to start spending time making changes like this that slightly tweak the inline cost for the sake of minsize, I think we need to be a bit more methodical in terms of testing, so we don't end up playing tug-of-war with inliner tweaks that affect different codebases in different ways. For example, having a series of testcases with calls to small functions near the various thresholds, accompanied by actual measurements of the codesize on a couple targets, so we can be confident that tweaks are actually profitable. For other optimization levels, inlining is more about capturing big optimization opportunities, as opposed to the exact cost of various instructions, so it doesn't matter as much, but I think spending more time to construct tests will pay off for -Oz in particular. We currently have very little test coverage for minsize inlining.

That sounds smart. I have been compiling lots of codebases and seeing what looks best, adding interesting cases as testcases over time. More small testcases would be better, but we can easily hit cases where we are just not modelling the correct thing. It may be difficult to get all the test cases "correct": some that should be inlined may not be, and some that are inlined should not be. It's sometimes hard to change a test, even if the end result is better overall.

With D52716 in, my main motivation for the code here will disappear. I will hopefully try to rerun the numbers on top of that patch, seeing how well we do now and how much of this is still an improvement.

zzheng resigned from this revision. Feb 25 2021, 9:53 AM

Herald added a subscriber: pengfei. Feb 25 2021, 9:53 AM

Revision Contents

Path                                           Size
include/llvm/Analysis/InlineCost.h             3 lines
lib/Analysis/InlineCost.cpp                    68 lines
lib/Transforms/IPO/PartialInlining.cpp         6 lines
test/Transforms/Inline/AArch64/phi.ll          2 lines
test/Transforms/Inline/ARM/loop-add.ll         95 lines
test/Transforms/Inline/ARM/loop-memcpy.ll      87 lines
test/Transforms/Inline/ARM/loop-noinline.ll    49 lines

Diff 170102

include/llvm/Analysis/InlineCost.h

	[38 lines not shown]

	/// Use when -O3 is specified.			/// Use when -O3 is specified.
	const int OptAggressiveThreshold = 250;			const int OptAggressiveThreshold = 250;

	// Various magic constants used to adjust heuristics.			// Various magic constants used to adjust heuristics.
	const int InstrCost = 5;			const int InstrCost = 5;
	const int IndirectCallThreshold = 100;			const int IndirectCallThreshold = 100;
	const int CallPenalty = 25;			const int CallPenalty = 25;
				const int CallMinSizePenalty = 15;
	const int LastCallToStaticBonus = 15000;			const int LastCallToStaticBonus = 15000;
	const int ColdccPenalty = 2000;			const int ColdccPenalty = 2000;
	const int NoreturnPenalty = 10000;			const int NoreturnPenalty = 10000;
	/// Do not inline functions which allocate this many bytes on the stack			/// Do not inline functions which allocate this many bytes on the stack
	/// when the caller is recursive.			/// when the caller is recursive.
	const unsigned TotalAllocaSizeRecursiveCaller = 1024;			const unsigned TotalAllocaSizeRecursiveCaller = 1024;
	}			}

	[141 lines not shown]
	/// the default threshold is computed from \p OptLevel and \p SizeOptLevel.			/// the default threshold is computed from \p OptLevel and \p SizeOptLevel.
	/// An \p OptLevel value above 3 is considered an aggressive optimization mode.			/// An \p OptLevel value above 3 is considered an aggressive optimization mode.
	/// \p SizeOptLevel of 1 corresponds to the -Os flag and 2 corresponds to			/// \p SizeOptLevel of 1 corresponds to the -Os flag and 2 corresponds to
	/// the -Oz flag.			/// the -Oz flag.
	InlineParams getInlineParams(unsigned OptLevel, unsigned SizeOptLevel);			InlineParams getInlineParams(unsigned OptLevel, unsigned SizeOptLevel);

	/// Return the cost associated with a callsite, including parameter passing			/// Return the cost associated with a callsite, including parameter passing
	/// and the call/return instruction.			/// and the call/return instruction.
	int getCallsiteCost(CallSite CS, const DataLayout &DL);			int getCallsiteCost(CallSite CS, const DataLayout &DL, int CallPenalty);

	/// Get an InlineCost object representing the cost of inlining this			/// Get an InlineCost object representing the cost of inlining this
	/// callsite.			/// callsite.
	///			///
	/// Note that a default threshold is passed into this function. This threshold			/// Note that a default threshold is passed into this function. This threshold
	/// could be modified based on callsite's properties and only costs below this			/// could be modified based on callsite's properties and only costs below this
	/// new threshold are computed with any accuracy. The new threshold can be			/// new threshold are computed with any accuracy. The new threshold can be
	/// used to bound the computation necessary to determine whether the cost is			/// used to bound the computation necessary to determine whether the cost is
	[27 lines not shown]

lib/Analysis/InlineCost.cpp

[123 lines not shown]	class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {

/// Tunable parameters that control the analysis.		/// Tunable parameters that control the analysis.
const InlineParams &Params;		const InlineParams &Params;

int Threshold;		int Threshold;
int Cost;		int Cost;
bool ComputeFullInlineCost;		bool ComputeFullInlineCost;

		/// The penalty for making calls. Defaults to InlineCost:CallPenalty, but
		/// lower at minsize (still not zero due to the inaccuracies in cost
		/// modelling).
		int CallPenalty;

bool IsCallerRecursive;		bool IsCallerRecursive;
bool IsRecursiveCall;		bool IsRecursiveCall;
bool ExposesReturnsTwice;		bool ExposesReturnsTwice;
bool HasDynamicAlloca;		bool HasDynamicAlloca;
bool ContainsNoDuplicateCall;		bool ContainsNoDuplicateCall;
bool HasReturn;		bool HasReturn;
bool HasIndirectBr;		bool HasIndirectBr;
bool HasUninlineableIntrinsic;		bool HasUninlineableIntrinsic;
[135 lines not shown]	CallAnalyzer(const TargetTransformInfo &TTI,
Optional<function_ref<BlockFrequencyInfo &(Function &)>> &GetBFI,		Optional<function_ref<BlockFrequencyInfo &(Function &)>> &GetBFI,
ProfileSummaryInfo PSI, OptimizationRemarkEmitter ORE,		ProfileSummaryInfo PSI, OptimizationRemarkEmitter ORE,
Function &Callee, CallSite CSArg, const InlineParams &Params)		Function &Callee, CallSite CSArg, const InlineParams &Params)
: TTI(TTI), GetAssumptionCache(GetAssumptionCache), GetBFI(GetBFI),		: TTI(TTI), GetAssumptionCache(GetAssumptionCache), GetBFI(GetBFI),
PSI(PSI), F(Callee), DL(F.getParent()->getDataLayout()), ORE(ORE),		PSI(PSI), F(Callee), DL(F.getParent()->getDataLayout()), ORE(ORE),
CandidateCS(CSArg), Params(Params), Threshold(Params.DefaultThreshold),		CandidateCS(CSArg), Params(Params), Threshold(Params.DefaultThreshold),
Cost(0), ComputeFullInlineCost(OptComputeFullInlineCost ||		Cost(0), ComputeFullInlineCost(OptComputeFullInlineCost ||
Params.ComputeFullInlineCost || ORE),		Params.ComputeFullInlineCost || ORE),
IsCallerRecursive(false), IsRecursiveCall(false),		CallPenalty(InlineConstants::CallPenalty), IsCallerRecursive(false),
ExposesReturnsTwice(false), HasDynamicAlloca(false),		IsRecursiveCall(false), ExposesReturnsTwice(false),
ContainsNoDuplicateCall(false), HasReturn(false), HasIndirectBr(false),		HasDynamicAlloca(false), ContainsNoDuplicateCall(false),
HasUninlineableIntrinsic(false), InitsVargArgs(false), AllocatedSize(0),		HasReturn(false), HasIndirectBr(false), HasUninlineableIntrinsic(false),
NumInstructions(0), NumVectorInstructions(0), VectorBonus(0),		InitsVargArgs(false), AllocatedSize(0), NumInstructions(0),
SingleBBBonus(0), EnableLoadElimination(true), LoadEliminationCost(0),		NumVectorInstructions(0), VectorBonus(0), SingleBBBonus(0),
NumConstantArgs(0), NumConstantOffsetPtrArgs(0), NumAllocaArgs(0),		EnableLoadElimination(true), LoadEliminationCost(0), NumConstantArgs(0),
NumConstantPtrCmps(0), NumConstantPtrDiffs(0),		NumConstantOffsetPtrArgs(0), NumAllocaArgs(0), NumConstantPtrCmps(0),
NumInstructionsSimplified(0), SROACostSavings(0),		NumConstantPtrDiffs(0), NumInstructionsSimplified(0),
SROACostSavingsLost(0) {}		SROACostSavings(0), SROACostSavingsLost(0) {}

InlineResult analyzeCall(CallSite CS);		InlineResult analyzeCall(CallSite CS);

int getThreshold() { return Threshold; }		int getThreshold() { return Threshold; }
int getCost() { return Cost; }		int getCost() { return Cost; }

// Keep a bunch of stats about the cost savings found so we can print them		// Keep a bunch of stats about the cost savings found so we can print them
// out when debugging.		// out when debugging.
[279 lines not shown]	auto IsGEPOffsetConstant = [&](GetElementPtrInst &GEP) {
return true;		return true;
};		};

if ((I.isInBounds() && canFoldInboundsGEP(I)) || IsGEPOffsetConstant(I)) {		if ((I.isInBounds() && canFoldInboundsGEP(I)) || IsGEPOffsetConstant(I)) {
if (SROACandidate)		if (SROACandidate)
SROAArgValues[&I] = SROAArg;		SROAArgValues[&I] = SROAArg;

// Constant GEPs are modeled as free.		// Constant GEPs are modeled as free.
		// Except at minsize, we don't count those used in PHI nodes.
		if (I.getParent()->getParent()->optForMinSize()) {
		for (auto U : I.users())
		if (isa<PHINode>(U))
		return false;
		}
return true;		return true;
}		}

// Variable GEPs will require math and will disable SROA.		// Variable GEPs will require math and will disable SROA.
if (SROACandidate)		if (SROACandidate)
disableSROA(CostIt);		disableSROA(CostIt);
return isGEPFree(I);		return isGEPFree(I);
}		}
[118 lines not shown]	bool CallAnalyzer::visitCastInst(CastInst &I) {
switch (I.getOpcode()) {		switch (I.getOpcode()) {
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::UIToFP:		case Instruction::UIToFP:
case Instruction::SIToFP:		case Instruction::SIToFP:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
if (TTI.getFPOpCost(I.getType()) == TargetTransformInfo::TCC_Expensive)		if (TTI.getFPOpCost(I.getType()) == TargetTransformInfo::TCC_Expensive)
Cost += InlineConstants::CallPenalty;		Cost += CallPenalty;
default:		default:
break;		break;
}		}

return TargetTransformInfo::TCC_Free == TTI.getUserCost(&I);		return TargetTransformInfo::TCC_Free == TTI.getUserCost(&I);
}		}

bool CallAnalyzer::visitUnaryInstruction(UnaryInstruction &I) {		bool CallAnalyzer::visitUnaryInstruction(UnaryInstruction &I) {
[161 lines not shown]	auto DisallowAllBonuses = [&]() {
VectorBonusPercent = 0;		VectorBonusPercent = 0;
LastCallToStaticBonus = 0;		LastCallToStaticBonus = 0;
};		};

// Use the OptMinSizeThreshold or OptSizeThreshold knob if they are available		// Use the OptMinSizeThreshold or OptSizeThreshold knob if they are available
// and reduce the threshold if the caller has the necessary attribute.		// and reduce the threshold if the caller has the necessary attribute.
if (Caller->optForMinSize()) {		if (Caller->optForMinSize()) {
Threshold = MinIfValid(Threshold, Params.OptMinSizeThreshold);		Threshold = MinIfValid(Threshold, Params.OptMinSizeThreshold);
		CallPenalty = InlineConstants::CallMinSizePenalty;
// For minsize, we want to disable the single BB bonus and the vector		// For minsize, we want to disable the single BB bonus and the vector
// bonuses, but not the last-call-to-static bonus. Inlining the last call to		// bonuses, but not the last-call-to-static bonus. Inlining the last call to
// a static function will, at the minimum, eliminate the parameter setup and		// a static function will, at the minimum, eliminate the parameter setup and
// call/return instructions.		// call/return instructions.
SingleBBBonusPercent = 0;		SingleBBBonusPercent = 0;
VectorBonusPercent = 0;		VectorBonusPercent = 0;
} else if (Caller->optForSize())		} else if (Caller->optForSize())
Threshold = MinIfValid(Threshold, Params.OptSizeThreshold);		Threshold = MinIfValid(Threshold, Params.OptSizeThreshold);
[174 lines not shown]	bool CallAnalyzer::visitBinaryOperator(BinaryOperator &I) {
disableSROA(LHS);		disableSROA(LHS);
disableSROA(RHS);		disableSROA(RHS);

// If the instruction is floating point, and the target says this operation		// If the instruction is floating point, and the target says this operation
// is expensive, this may eventually become a library call. Treat the cost		// is expensive, this may eventually become a library call. Treat the cost
// as such.		// as such.
if (I.getType()->isFloatingPointTy() &&		if (I.getType()->isFloatingPointTy() &&
TTI.getFPOpCost(I.getType()) == TargetTransformInfo::TCC_Expensive)		TTI.getFPOpCost(I.getType()) == TargetTransformInfo::TCC_Expensive)
Cost += InlineConstants::CallPenalty;		Cost += CallPenalty;

return false;		return false;
}		}

bool CallAnalyzer::visitLoad(LoadInst &I) {		bool CallAnalyzer::visitLoad(LoadInst &I) {
Value *SROAArg;		Value *SROAArg;
DenseMap<Value *, int>::iterator CostIt;		DenseMap<Value *, int>::iterator CostIt;
if (lookupSROAArgAndCost(I.getPointerOperand(), SROAArg, CostIt)) {		if (lookupSROAArgAndCost(I.getPointerOperand(), SROAArg, CostIt)) {
[155 lines not shown]	if (Function *F = CS.getCalledFunction()) {
if (TTI.isLoweredToCall(F)) {		if (TTI.isLoweredToCall(F)) {
// We account for the average 1 instruction per call argument setup		// We account for the average 1 instruction per call argument setup
// here.		// here.
Cost += CS.arg_size() * InlineConstants::InstrCost;		Cost += CS.arg_size() * InlineConstants::InstrCost;

// Everything other than inline ASM will also have a significant cost		// Everything other than inline ASM will also have a significant cost
// merely from making the call.		// merely from making the call.
if (!isa<InlineAsm>(CS.getCalledValue()))		if (!isa<InlineAsm>(CS.getCalledValue()))
Cost += InlineConstants::CallPenalty;		Cost += CallPenalty;
}		}

if (!CS.onlyReadsMemory())		if (!CS.onlyReadsMemory())
disableLoadElimination();		disableLoadElimination();
return Base::visitCallSite(CS);		return Base::visitCallSite(CS);
}		}

// Otherwise we're in a very special case -- an indirect function call. See		// Otherwise we're in a very special case -- an indirect function call. See
[459 lines not shown]	InlineResult CallAnalyzer::analyzeCall(CallSite CS) {

// Speculatively apply all possible bonuses to Threshold. If cost exceeds		// Speculatively apply all possible bonuses to Threshold. If cost exceeds
// this Threshold any time, and cost cannot decrease, we can stop processing		// this Threshold any time, and cost cannot decrease, we can stop processing
// the rest of the function body.		// the rest of the function body.
Threshold += (SingleBBBonus + VectorBonus);		Threshold += (SingleBBBonus + VectorBonus);

// Give out bonuses for the callsite, as the instructions setting them up		// Give out bonuses for the callsite, as the instructions setting them up
// will be gone after inlining.		// will be gone after inlining.
Cost -= getCallsiteCost(CS, DL);		Cost -= getCallsiteCost(CS, DL, CallPenalty);

// If this function uses the coldcc calling convention, prefer not to inline		// If this function uses the coldcc calling convention, prefer not to inline
// it.		// it.
if (F.getCallingConv() == CallingConv::Cold)		if (F.getCallingConv() == CallingConv::Cold)
Cost += InlineConstants::ColdccPenalty;		Cost += InlineConstants::ColdccPenalty;

// Check if we're done. This can happen due to bonuses and penalties.		// Check if we're done. This can happen due to bonuses and penalties.
if (Cost >= Threshold && !ComputeFullInlineCost)		if (Cost >= Threshold && !ComputeFullInlineCost)
[73 lines not shown]	for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
// behavior for an indirect branch in the same function, and we do not		// behavior for an indirect branch in the same function, and we do not
// currently support inlining indirect branches. But, the inliner may not		// currently support inlining indirect branches. But, the inliner may not
// see an indirect branch that ends up being dead code at a particular call		// see an indirect branch that ends up being dead code at a particular call
// site. If the blockaddress escapes the function, e.g., via a global		// site. If the blockaddress escapes the function, e.g., via a global
// variable, inlining may lead to an invalid cross-function reference.		// variable, inlining may lead to an invalid cross-function reference.
if (BB->hasAddressTaken())		if (BB->hasAddressTaken())
return "blockaddress";		return "blockaddress";

		// Count the number of unconditional branches leading to this block. Add an
		// instruction cost for each past the first. This models the cost of
		// branches to the block.
		if (Caller->optForMinSize()) {
		bool FoundUnconditional = false;
		for (auto *PredBB : predecessors(BB)) {
		// Discount edges we know to not be taken
		if (DeadBlocks.count(PredBB))
		continue;
		BasicBlock *KnownSuccessor = KnownSuccessors[PredBB];
		if (KnownSuccessor && KnownSuccessor != BB)
		continue;

		Instruction *TI = PredBB->getTerminator();
		if (BranchInst *BI = dyn_cast<BranchInst>(TI)) {
		if (!BI->isConditional()) {
		if (FoundUnconditional)
		Cost += InlineConstants::InstrCost;
		FoundUnconditional = true;
		}
		}
		}
		}

// Analyze the cost of this block. If we blow through the threshold, this		// Analyze the cost of this block. If we blow through the threshold, this
// returns false, and we can bail on out.		// returns false, and we can bail on out.
InlineResult IR = analyzeBlock(BB, EphValues);		InlineResult IR = analyzeBlock(BB, EphValues);
if (!IR)		if (!IR)
return IR;		return IR;

TerminatorInst *TI = BB->getTerminator();		TerminatorInst *TI = BB->getTerminator();

[84 lines not shown]
/// that prevent inlining.		/// that prevent inlining.
static bool functionsHaveCompatibleAttributes(Function *Caller,		static bool functionsHaveCompatibleAttributes(Function *Caller,
Function *Callee,		Function *Callee,
TargetTransformInfo &TTI) {		TargetTransformInfo &TTI) {
return TTI.areInlineCompatible(Caller, Callee) &&		return TTI.areInlineCompatible(Caller, Callee) &&
AttributeFuncs::areInlineCompatible(Caller, Callee);		AttributeFuncs::areInlineCompatible(Caller, Callee);
}		}

int llvm::getCallsiteCost(CallSite CS, const DataLayout &DL) {		int llvm::getCallsiteCost(CallSite CS, const DataLayout &DL, int CallPenalty) {
int Cost = 0;		int Cost = 0;
for (unsigned I = 0, E = CS.arg_size(); I != E; ++I) {		for (unsigned I = 0, E = CS.arg_size(); I != E; ++I) {
if (CS.isByValArgument(I)) {		if (CS.isByValArgument(I)) {
// We approximate the number of loads and stores needed by dividing the		// We approximate the number of loads and stores needed by dividing the
// size of the byval type by the target's pointer size.		// size of the byval type by the target's pointer size.
PointerType *PTy = cast<PointerType>(CS.getArgument(I)->getType());		PointerType *PTy = cast<PointerType>(CS.getArgument(I)->getType());
unsigned TypeSize = DL.getTypeSizeInBits(PTy->getElementType());		unsigned TypeSize = DL.getTypeSizeInBits(PTy->getElementType());
unsigned AS = PTy->getAddressSpace();		unsigned AS = PTy->getAddressSpace();
[12 lines not shown]	if (CS.isByValArgument(I)) {
Cost += 2 * NumStores * InlineConstants::InstrCost;		Cost += 2 * NumStores * InlineConstants::InstrCost;
} else {		} else {
// For non-byval arguments subtract off one instruction per call		// For non-byval arguments subtract off one instruction per call
// argument.		// argument.
Cost += InlineConstants::InstrCost;		Cost += InlineConstants::InstrCost;
}		}
}		}
// The call instruction also disappears after inlining.		// The call instruction also disappears after inlining.
Cost += InlineConstants::InstrCost + InlineConstants::CallPenalty;		Cost += InlineConstants::InstrCost + CallPenalty;
return Cost;		return Cost;
}		}

InlineCost llvm::getInlineCost(		InlineCost llvm::getInlineCost(
CallSite CS, const InlineParams &Params, TargetTransformInfo &CalleeTTI,		CallSite CS, const InlineParams &Params, TargetTransformInfo &CalleeTTI,
std::function<AssumptionCache &(Function &)> &GetAssumptionCache,		std::function<AssumptionCache &(Function &)> &GetAssumptionCache,
Optional<function_ref<BlockFrequencyInfo &(Function &)>> GetBFI,		Optional<function_ref<BlockFrequencyInfo &(Function &)>> GetBFI,
ProfileSummaryInfo PSI, OptimizationRemarkEmitter ORE) {		ProfileSummaryInfo PSI, OptimizationRemarkEmitter ORE) {
[207 lines not shown]

lib/Transforms/IPO/PartialInlining.cpp

[792 lines not shown]	ORE.emit([&]() {
<< NV("Cost", IC.getCost()) << ", threshold="		<< NV("Cost", IC.getCost()) << ", threshold="
<< NV("Threshold", IC.getCostDelta() + IC.getCost()) << ")";		<< NV("Threshold", IC.getCostDelta() + IC.getCost()) << ")";
});		});
return false;		return false;
}		}
const DataLayout &DL = Caller->getParent()->getDataLayout();		const DataLayout &DL = Caller->getParent()->getDataLayout();

// The savings of eliminating the call:		// The savings of eliminating the call:
int NonWeightedSavings = getCallsiteCost(CS, DL);		int NonWeightedSavings = getCallsiteCost(CS, DL, InlineConstants::CallPenalty);
BlockFrequency NormWeightedSavings(NonWeightedSavings);		BlockFrequency NormWeightedSavings(NonWeightedSavings);

// Weighted saving is smaller than weighted cost, return false		// Weighted saving is smaller than weighted cost, return false
if (NormWeightedSavings < WeightedOutliningRcost) {		if (NormWeightedSavings < WeightedOutliningRcost) {
ORE.emit([&]() {		ORE.emit([&]() {
return OptimizationRemarkAnalysis(DEBUG_TYPE, "OutliningCallcostTooHigh",		return OptimizationRemarkAnalysis(DEBUG_TYPE, "OutliningCallcostTooHigh",
Call)		Call)
<< NV("Callee", Cloner.OrigFunc) << " not partially inlined into "		<< NV("Callee", Cloner.OrigFunc) << " not partially inlined into "
[45 lines not shown]	for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E; ++I) {
IntrinsicInst *IntrInst = dyn_cast<IntrinsicInst>(I);		IntrinsicInst *IntrInst = dyn_cast<IntrinsicInst>(I);
if (IntrInst) {		if (IntrInst) {
if (IntrInst->getIntrinsicID() == Intrinsic::lifetime_start ||		if (IntrInst->getIntrinsicID() == Intrinsic::lifetime_start ||
IntrInst->getIntrinsicID() == Intrinsic::lifetime_end)		IntrInst->getIntrinsicID() == Intrinsic::lifetime_end)
continue;		continue;
}		}

if (CallInst *CI = dyn_cast<CallInst>(I)) {		if (CallInst *CI = dyn_cast<CallInst>(I)) {
InlineCost += getCallsiteCost(CallSite(CI), DL);		InlineCost += getCallsiteCost(CallSite(CI), DL, InlineConstants::CallPenalty);
continue;		continue;
}		}

if (InvokeInst *II = dyn_cast<InvokeInst>(I)) {		if (InvokeInst *II = dyn_cast<InvokeInst>(I)) {
InlineCost += getCallsiteCost(CallSite(II), DL);		InlineCost += getCallsiteCost(CallSite(II), DL, InlineConstants::CallPenalty);
continue;		continue;
}		}

if (SwitchInst *SI = dyn_cast<SwitchInst>(I)) {		if (SwitchInst *SI = dyn_cast<SwitchInst>(I)) {
InlineCost += (SI->getNumCases() + 1) * InlineConstants::InstrCost;		InlineCost += (SI->getNumCases() + 1) * InlineConstants::InstrCost;
continue;		continue;
}		}
InlineCost += InlineConstants::InstrCost;		InlineCost += InlineConstants::InstrCost;
[633 lines not shown]

test/Transforms/Inline/AArch64/phi.ll

[339 lines not shown]	entry:
%gep2 = getelementptr inbounds i32, i32* %ptr, i32 1		%gep2 = getelementptr inbounds i32, i32* %ptr, i32 1
br i1 %cond, label %if_true, label %exit		br i1 %cond, label %if_true, label %exit

if_true:		if_true:
%gep3 = getelementptr inbounds i32, i32* %gep2, i32 1		%gep3 = getelementptr inbounds i32, i32* %gep2, i32 1
br label %exit		br label %exit

exit:		exit:
%phi = phi i32* [%gep1, %entry], [%gep3, %if_true] ; Simplifeid to %gep1		%phi = phi i32* [%gep1, %entry], [%gep3, %if_true] ; Simplified to %gep1
%load = load i32, i32* %phi		%load = load i32, i32* %phi
call void @pad()		call void @pad()
ret i32 %load		ret i32 %load
}		}


define i32 @outer14(i1 %cond) {		define i32 @outer14(i1 %cond) {
; CHECK-LABEL: @outer14(		; CHECK-LABEL: @outer14(
[148 lines not shown]

test/Transforms/Inline/ARM/loop-add.ll

This file was added.

				; RUN: opt -inline %s -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv7m-arm-none-eabi"

				; CHECK-LABEL: void @doCalls
				define void @doCalls(i8* nocapture %p1, i8* nocapture %p2, i32 %n) #0 {
				entry:
				%div = lshr i32 %n, 1
				; CHECK: call void @LoopCall
				tail call void @LoopCall(i8* %p1, i8* %p2, i32 %div) #0

				%div2 = lshr i32 %n, 2
				; CHECK: call void @LoopCall
				tail call void @LoopCall(i8* %p1, i8* %p2, i32 %div2) #0

				; CHECK-NOT: call void @LoopCall
				tail call void @LoopCall(i8* %p2, i8* %p1, i32 0) #0

				; CHECK-NOT: call void @LoopCall_internal
				tail call void @LoopCall_internal(i8* %p1, i8* %p2, i32 %div2) #0

				%div3 = lshr i32 %n, 4
				; CHECK-NOT: call void @SimpleCall
				tail call void @SimpleCall(i8* %p2, i8* %p1, i32 %div3) #0
				ret void
				}

				; CHECK-LABEL: define void @LoopCall
				define void @LoopCall(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				%c = icmp ne i32 %num, 0
				br i1 %c, label %while.cond, label %while.end

				while.cond: ; preds = %while.body, %entry
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %while.body ]
				%p_dest.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr2, %while.body ]
				%p_source.0 = phi i8* [ %source, %entry ], [ %incdec.ptr, %while.body ]
				%cmp = icmp eq i32 %num.addr.0, 0
				br i1 %cmp, label %while.end, label %while.body

				while.body: ; preds = %while.cond
				%incdec.ptr = getelementptr inbounds i8, i8* %p_source.0, i32 1
				%0 = load i8, i8* %p_source.0, align 1
				%1 = trunc i32 %num.addr.0 to i8
				%conv1 = add i8 %0, %1
				%incdec.ptr2 = getelementptr inbounds i8, i8* %p_dest.0, i32 1
				store i8 %conv1, i8* %p_dest.0, align 1
				%dec = add i32 %num.addr.0, -1
				br label %while.cond

				while.end: ; preds = %while.cond, %entry
				ret void
				}

				; CHECK-NOT: define internal void @LoopCall_internal
				define internal void @LoopCall_internal(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				%c = icmp ne i32 %num, 0
				br i1 %c, label %while.cond, label %while.end

				while.cond: ; preds = %while.body, %entry
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %while.body ]
				%p_dest.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr2, %while.body ]
				%p_source.0 = phi i8* [ %source, %entry ], [ %incdec.ptr, %while.body ]
				%cmp = icmp eq i32 %num.addr.0, 0
				br i1 %cmp, label %while.end, label %while.body

				while.body: ; preds = %while.cond
				%incdec.ptr = getelementptr inbounds i8, i8* %p_source.0, i32 1
				%0 = load i8, i8* %p_source.0, align 1
				%1 = trunc i32 %num.addr.0 to i8
				%conv1 = add i8 %0, %1
				%incdec.ptr2 = getelementptr inbounds i8, i8* %p_dest.0, i32 1
				store i8 %conv1, i8* %p_dest.0, align 1
				%dec = add i32 %num.addr.0, -1
				br label %while.cond

				while.end: ; preds = %while.cond, %entry
				ret void
				}

				; CHECK-LABEL: define void @SimpleCall
				define void @SimpleCall(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				%arrayidx = getelementptr inbounds i8, i8* %source, i32 %num
				%0 = load i8, i8* %arrayidx, align 1
				%1 = xor i8 %0, 127
				%arrayidx2 = getelementptr inbounds i8, i8* %dest, i32 %num
				store i8 %1, i8* %arrayidx2, align 1
				ret void
				}

				attributes #0 = { minsize optsize }
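For readers less used to IR, the loop body shared by @LoopCall and @LoopCall_internal corresponds roughly to the C++ below. This is a reconstruction from the IR in this test, not the source the test was generated from; the name `loop_call` and the use of `std::uint8_t` are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical source-level equivalent of @LoopCall: for each of `num`
// bytes, store source[i] plus the low byte of the remaining count.
void loop_call(std::uint8_t* dest, const std::uint8_t* source,
               std::uint32_t num) {
  while (num != 0) {
    // Matches: load, trunc i32 -> i8, add i8, store.
    *dest = static_cast<std::uint8_t>(*source + static_cast<std::uint8_t>(num));
    ++source;  // %incdec.ptr
    ++dest;    // %incdec.ptr2
    --num;     // %dec
  }
}
```

The two pointer increments and the counter are the loop-carried values that appear as phis in while.cond, which is where the gep-used-by-phi cost in this patch applies.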

test/Transforms/Inline/ARM/loop-memcpy.ll

This file was added.

				; RUN: opt -inline %s -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv7m-arm-none-eabi"

				; CHECK-LABEL: define void @matcpy
				define void @matcpy(i8* %dest, i8* %source, i32 %num) #0 {
				entry:
				%0 = ptrtoint i8* %dest to i32
				%1 = ptrtoint i8* %source to i32
				%2 = xor i32 %0, %1
				%3 = and i32 %2, 3
				%cmp = icmp eq i32 %3, 0
				br i1 %cmp, label %if.then, label %if.else20

				if.then: ; preds = %entry
				%sub = sub i32 0, %0
				%and2 = and i32 %sub, 3
				%add = or i32 %and2, 4
				%cmp3 = icmp ugt i32 %add, %num
				br i1 %cmp3, label %if.else, label %if.then4

				if.then4: ; preds = %if.then
				%sub5 = sub i32 %num, %and2
				%shr = and i32 %sub5, -4
				%sub7 = sub i32 %sub5, %shr
				%tobool = icmp eq i32 %and2, 0
				br i1 %tobool, label %if.end, label %if.then8

				if.then8: ; preds = %if.then4
				; CHECK: call fastcc void @memcpy
				call fastcc void @memcpy(i8* %dest, i8* %source, i32 %and2) #0
				%add.ptr = getelementptr inbounds i8, i8* %dest, i32 %and2
				%add.ptr9 = getelementptr inbounds i8, i8* %source, i32 %and2
				br label %if.end

				if.end: ; preds = %if.then4, %if.then8
				%p_dest.0 = phi i8* [ %add.ptr, %if.then8 ], [ %dest, %if.then4 ]
				%p_source.0 = phi i8* [ %add.ptr9, %if.then8 ], [ %source, %if.then4 ]
				%tobool14 = icmp eq i32 %sub7, 0
				br i1 %tobool14, label %if.end22, label %if.then15

				if.then15: ; preds = %if.end
				%add.ptr13 = getelementptr inbounds i8, i8* %p_source.0, i32 %shr
				%add.ptr11 = getelementptr inbounds i8, i8* %p_dest.0, i32 %shr
				; CHECK: call fastcc void @memcpy
				call fastcc void @memcpy(i8* %add.ptr11, i8* %add.ptr13, i32 %sub7) #0
				br label %if.end22

				if.else: ; preds = %if.then
				call fastcc void @memcpy(i8* %dest, i8* %source, i32 %num) #0
				br label %if.end22

				if.else20: ; preds = %entry
				call fastcc void @memcpy(i8* %dest, i8* %source, i32 %num) #0
				br label %if.end22

				if.end22: ; preds = %if.then15, %if.end, %if.else, %if.else20
				ret void
				}

				; CHECK-LABEL: define internal void @memcpy
				define internal void @memcpy(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				br label %while.cond

				while.cond: ; preds = %while.body, %entry
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %while.body ]
				%p_dest.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr1, %while.body ]
				%p_source.0 = phi i8* [ %source, %entry ], [ %incdec.ptr, %while.body ]
				%cmp = icmp eq i32 %num.addr.0, 0
				br i1 %cmp, label %while.end, label %while.body

				while.body: ; preds = %while.cond
				%incdec.ptr = getelementptr inbounds i8, i8* %p_source.0, i32 1
				%0 = load i8, i8* %p_source.0, align 1
				%incdec.ptr1 = getelementptr inbounds i8, i8* %p_dest.0, i32 1
				store i8 %0, i8* %p_dest.0, align 1
				%dec = add i32 %num.addr.0, -1
				br label %while.cond

				while.end: ; preds = %while.cond
				ret void
				}

				attributes #0 = { minsize optsize }

test/Transforms/Inline/ARM/loop-noinline.ll

This file was added.

				; RUN: opt -inline %s -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv7m-arm-none-eabi"

				; Check we don't inline loops at -Oz. They tend to be larger than we
				; expect.

				; CHECK: define i8* @H
				@digits = constant [16 x i8] c"0123456789ABCDEF", align 1
				define i8* @H(i8* %p, i32 %val, i32 %num) #0 {
				entry:
				br label %do.body

				do.body: ; preds = %do.body, %entry
				%p.addr.0 = phi i8* [ %p, %entry ], [ %incdec.ptr, %do.body ]
				%val.addr.0 = phi i32 [ %val, %entry ], [ %shl, %do.body ]
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %do.body ]
				%shr = lshr i32 %val.addr.0, 28
				%arrayidx = getelementptr inbounds [16 x i8], [16 x i8]* @digits, i32 0, i32 %shr
				%0 = load i8, i8* %arrayidx, align 1
				%incdec.ptr = getelementptr inbounds i8, i8* %p.addr.0, i32 1
				store i8 %0, i8* %p.addr.0, align 1
				%shl = shl i32 %val.addr.0, 4
				%dec = add i32 %num.addr.0, -1
				%tobool = icmp eq i32 %dec, 0
				br i1 %tobool, label %do.end, label %do.body

				do.end: ; preds = %do.body
				%scevgep = getelementptr i8, i8* %p, i32 %num
				ret i8* %scevgep
				}

				define nonnull i8* @call1(i8* %p, i32 %val, i32 %num) #0 {
				entry:
				; CHECK: tail call i8* @H
				%call = tail call i8* @H(i8* %p, i32 %val, i32 %num) #0
				ret i8* %call
				}

				define nonnull i8* @call2(i8* %p, i32 %val) #0 {
				entry:
				; CHECK: tail call i8* @H
				%call = tail call i8* @H(i8* %p, i32 %val, i32 32) #0
				ret i8* %call
				}

				attributes #0 = { minsize optsize }