This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost, -Oz] Don't take into account the penalty of a fast call of frequently used functions
Needs Review · Public

Authored by eastig on Mar 6 2017, 7:04 AM.

Details

Reviewers

Summary

This patch introduces an inline heuristic that is applied when the cost of inlining is calculated for calls at the '-Oz' optimization level.

When the inlining cost is calculated for a call that uses the fast calling convention and targets a frequently used function, the call penalty is not taken into account. By default a function is considered frequently used if the number of its users is greater than 3. This value can be changed via the command-line option '-inline-freq-func-threshold'. The heuristic still allows inlining of very small functions, e.g. ones consisting of a few instructions.

The rationale for the heuristic:

Fast calls are usually optimized to minimize the call penalty, e.g. parameters are passed through registers.
Inlining of frequently used functions increases the code size.
It's based on real examples of code for microcontrollers (v6-M). We see a code size improvement of ~6.5% on these examples.
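
As a reading aid, the heuristic described in the summary can be sketched in isolation. The snippet below is a minimal standalone model, not the patch itself: `CalleeModel`, `isFrequentlyUsed` and `callRemovalCredit` are hypothetical names, the "greater than 3" rule follows the wording of the summary, and the values 5 and 25 for `InstrCost` and `CallPenalty` are assumptions consistent with figures quoted later in this review.

```cpp
// Assumed values; the names mirror InlineConstants but nothing here is LLVM code.
constexpr int InstrCost = 5;
constexpr int CallPenalty = 25;

// Hypothetical stand-in for the properties of a callee that the heuristic reads.
struct CalleeModel {
  bool IsFastCC;     // the call uses the fastcc calling convention
  unsigned NumUsers; // number of users of the function
};

// "Frequently used" as worded in the summary: more than Threshold users.
constexpr bool isFrequentlyUsed(CalleeModel C, unsigned Threshold = 3) {
  return C.NumUsers > Threshold;
}

// Amount subtracted from the inline cost because the call goes away.
// Under the heuristic, the call penalty is not credited for a fastcc call of
// a frequently used function when the caller is optimized for minimum size.
constexpr int callRemovalCredit(CalleeModel C, bool CallerIsMinSize,
                                unsigned Threshold = 3) {
  return (CallerIsMinSize && C.IsFastCC && isFrequentlyUsed(C, Threshold))
             ? InstrCost
             : InstrCost + CallPenalty;
}
```

Under these assumptions, a fastcc callee with four users called from a minsize function gets a credit of 5 instead of 30, which makes inlining it less attractive unless the body is very small.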

Benchmark results of the LNT testsuite:

The code size of only three benchmarks was affected by the heuristic.

x86 (i7-4770):

Benchmark	Code size improvement	Perf regression
MultiSource/Applications/SPASS/SPASS	6.67%	~0%
MultiSource/Benchmarks/FreeBench/distray/distray	2.46%	~0%
MultiSource/Applications/sqlite3/sqlite3	1.34%	~0%

v7-A (Cortex-A53, Thumb)

Benchmark	Code size improvement	Perf regression
MultiSource/Applications/SPASS/SPASS	4.93%	~3%
MultiSource/Applications/sqlite3/sqlite3	1.12%	~1.5%

AArch64 (Cortex-A57)

Benchmark	Code size improvement	Perf regression
MultiSource/Applications/SPASS/SPASS	3.95%	~3%
MultiSource/Benchmarks/FreeBench/distray/distray	1.17%	~0%
MultiSource/Applications/sqlite3/sqlite3	1.07%	~0%

Diff Detail

Build Status

Buildable 4604
Build 4604: arc lint + arc unit

Event Timeline

eastig created this revision. Mar 6 2017, 7:04 AM

Herald added subscribers: eraman, aemerson. Mar 6 2017, 7:04 AM

Special-casing calls marked "fast" doesn't really make sense in general. For example, on ARM v6-M, "fast" is exactly the same as the C calling convention.

That said, adding an inliner special-case to inline internal functions with two or three callers more aggressively might make sense, similar to the existing bonus for inlining internal functions with exactly one caller. (This might have a similar effect to your patch because calls marked "fast" generally are calling functions with internal linkage.)

Err, to be more clear, I should say "fastcc" is the same as the C calling convention. (Calls aren't marked fast, they have calling convention fastcc.)

Gah, I should read more carefully before I post. I meant to say, on ARM v6-M, fastcc is exactly the same as the C calling convention.

In D30651#693418, @efriedma wrote:

Special-casing calls marked "fast" doesn't really make sense in general. For example, on ARM v6-M, "fast" is exactly the same as the C calling convention.

More precisely, the code from GlobalOpt:

static bool isProfitableToMakeFastCC(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  // FIXME: Is it worth transforming x86_stdcallcc and x86_fastcallcc?
  return CC == CallingConv::C || CC == CallingConv::X86_ThisCall;
}

You are mixing up the idea of fast calls and what is marked as a fast call. According to http://llvm.org/docs/LangRef.html#calling-conventions:

“fastcc” - The fast calling convention
This calling convention attempts to make calls as fast as possible (e.g. by passing things in registers). This calling convention allows the target to use whatever tricks it wants to produce fast code for the target, without having to conform to an externally specified ABI (Application Binary Interface). Tail calls can only be optimized when this, the GHC or the HiPE convention is used. This calling convention does not support varargs and requires the prototype of all callees to exactly match the prototype of the function definition.

Why not use this information? Based on this we can expect fast calls will have minimized call overhead.

That said, adding an inliner special-case to inline internal functions with two or three callers more aggressively might make sense, similar to the existing bonus for inlining internal functions with exactly one caller. (This might have a similar effect to your patch because calls marked "fast" generally are calling functions with internal linkage.)

If a call is not marked as a call with a low call overhead we should not assume this.
Also we can mark calls as fastcc based on additional heuristics or target knowledge.

In D30651#693418, @efriedma wrote:

Special-casing calls marked "fast" doesn't really make sense in general.

If it makes no sense, why are functions marked as fastcc so aggressively? I think not all calls to them can actually be made fastcc.

Based on this we can expect fast calls will have minimized call overhead.

That's true, in some sense... but that's also true for the C calling convention in most cases.

If it makes no sense, why are functions marked as fastcc so aggressively? I think not all calls to them can actually be made fastcc.

It's better to think of fastcc as the default calling convention for everything where the LLVM optimizer controls the calling convention. It passes values in registers where it makes sense, and tries to strike a reasonable balance between caller-save and callee-save registers. We use it over the C calling convention where we can so that we aren't stuck doing stupid things, like passing everything on the stack on x86-32, or shuffling floating-point values between integer and VFP registers on ARM with a softfp abi. It doesn't really indicate anything about the absolute overhead of a call.

lib/Analysis/InlineCost.cpp
1219	getNumUses() isn't going to return anything useful unless the function has internal linkage; even if a function only has one or two uses in the current file, it's likely to have more uses in other files the compiler can't see.

In D30651#693607, @efriedma wrote:

Based on this we can expect fast calls will have minimized call overhead.

That's true, in some sense... but that's also true for the C calling convention in most cases.

If it makes no sense, why are functions marked as fastcc so aggressively? I think not all calls to them can actually be made fastcc.

It's better to think of fastcc as the default calling convention for everything where the LLVM optimizer controls the calling convention. It passes values in registers where it makes sense, and tries to strike a reasonable balance between caller-save and callee-save registers. We use it over the C calling convention where we can so that we aren't stuck doing stupid things, like passing everything on the stack on x86-32, or shuffling floating-point values between integer and VFP registers on ARM with a softfp abi. It doesn't really indicate anything about the absolute overhead of a call.

Agreed.

I suspect that the underlying problem is some combination of imprecise modeling of call overhead and failing to account for low numbers of call sites that aren't exactly one.

I suspect that there would be more general (and still quite easy) fixes for the call overhead modeling, and some tuning of heuristics around the number of call sites seems like a good idea much as Eli suggested.

Hi Eli,

Thank you for the explanation. You are right: given the model you've described, checking fastcc makes no sense.
I'll replace it with hasLocalLinkage. I'll also do runs with FrequentFuncThreshold==2 and FrequentFuncThreshold==3.

Thanks,
Evgeny

Updated according to Eli's comments:

The heuristic is changed to check internal functions with 3+ callers.

LNT data I've got at the moment:

FrequentFuncThreshold: 2+ callers

Benchmark	Code size change	Exe time
MultiSource/Applications/SPASS/SPASS	-6.21%	+0..+1%
MultiSource/Benchmarks/FreeBench/distray/distray	-2.79%	0%
SingleSource/Benchmarks/Misc/perlin	-2.32%	0%
SingleSource/Benchmarks/BenchmarkGame/Large/fasta	-1.75%	0%
MultiSource/Applications/sqlite3/sqlite3	-1.09%	+0..1%
SingleSource/Benchmarks/Misc-C++/Large/sphereflake	+1.54%	0%
MultiSource/Benchmarks/Olden/tsp/tsp	+1.16%	0%

FrequentFuncThreshold: 3+ callers

Benchmark	Code size change	Exe time
MultiSource/Applications/SPASS/SPASS	-6.29%	+0..+1%
MultiSource/Benchmarks/FreeBench/distray/distray	-2.79%	0%
MultiSource/Applications/sqlite3/sqlite3	-1.27%	+0..1%

AArch64

FrequentFuncThreshold: 2+ callers

Benchmark	Code size change	Exe time
MultiSource/Benchmarks/Olden/perimeter/perimeter	-4.0%	+5%
MultiSource/Applications/SPASS/SPASS	-3.97%	+4%
MultiSource/Benchmarks/FreeBench/distray/distray	-1.17%	0%
MultiSource/Applications/sqlite3/sqlite3	-1.05%	+2%

FrequentFuncThreshold: 3+ callers

Benchmark	Code size change	Exe time
MultiSource/Applications/SPASS/SPASS	-4.04%	+4%
MultiSource/Benchmarks/Olden/perimeter/perimeter	-4.0%	4%
MultiSource/Benchmarks/FreeBench/distray/distray	-1.17%	0%
MultiSource/Applications/sqlite3/sqlite3	-1.11%	+3%

Fixed code comments

Harbormaster completed remote builds in B4605: Diff 90945. Mar 7 2017, 2:45 PM

efriedma added inline comments. Mar 7 2017, 3:54 PM

lib/Analysis/InlineCost.cpp
1216	This comparison seems backwards... generally, we assume non-local functions have many callers, and therefore treat them the same way as local functions with many callers.

eastig added inline comments. Mar 8 2017, 10:45 AM

lib/Analysis/InlineCost.cpp
1216	If I understand you correctly you mean this: if (!Callee || !Callee->hasLocalStorage()) return true; Do I?

efriedma added inline comments. Mar 8 2017, 10:52 AM

lib/Analysis/InlineCost.cpp
1216	Yes.
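
The predicate agreed on in this exchange can be sketched as follows. This is a standalone model under stated assumptions, not the code from the updated diff: `FunctionModel` and `hasManyCallers` are hypothetical names, a null pointer stands for a call whose callee is unknown, and the check is assumed to be on local linkage.

```cpp
// Hypothetical stand-in for the two properties of a function the check reads.
struct FunctionModel {
  bool HasLocalLinkage; // true for internal/private functions
  unsigned NumUses;     // uses visible in the current module
};

// Unknown callees and functions with non-local linkage are assumed to have
// many callers, because uses in other modules cannot be seen. Only for local
// functions is the visible use count compared against the threshold.
constexpr bool hasManyCallers(const FunctionModel *Callee, unsigned Threshold) {
  return (!Callee || !Callee->HasLocalLinkage) ? true
                                               : Callee->NumUses >= Threshold;
}
```

With a threshold of 3, a local function with two uses is the only kind of callee in this model that is not treated as having many callers.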

Updated according to Eli's comment.

Harbormaster completed remote builds in B4642: Diff 91095. Mar 8 2017, 3:50 PM

Probably needs new benchmarking numbers with that change.

The functionality mostly makes sense now, but the comments/naming are really confusing. "Frequent" makes it sound like a check based on runtime profiling rather than on the absolute number of callers. And there isn't any comment explaining why we want to more aggressively inline local functions with 2 callers.

lib/Analysis/InlineCost.cpp
1296	You might want to give a bonus to functions with few callers rather than a penalty to functions with many callers; as-is, you're basically decreasing the default inline threshold from 5 to -20.
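
The "5 to -20" remark can be checked with a little arithmetic. The sketch below assumes a simplified decision rule `Cost < Threshold`, a minsize threshold of 5 and a call penalty of 25; the function names are hypothetical and the real analysis may differ in details that are not modeled here.

```cpp
constexpr int OptMinSizeThreshold = 5; // value mentioned in the comment above
constexpr int CallPenalty = 25;        // assumption: 5 - 25 gives the quoted -20

// Decision when the call penalty is added back onto the cost, as the patch does.
constexpr bool inlineWithPenaltyAddedBack(int Cost) {
  return Cost + CallPenalty < OptMinSizeThreshold;
}

// The same decision expressed as an unchanged cost and a lowered threshold.
constexpr bool inlineWithLoweredThreshold(int Cost) {
  return Cost < OptMinSizeThreshold - CallPenalty;
}

// Check that both formulations agree for every cost in [Lo, Hi].
constexpr bool decisionsAgree(int Lo, int Hi) {
  for (int Cost = Lo; Cost <= Hi; ++Cost)
    if (inlineWithPenaltyAddedBack(Cost) != inlineWithLoweredThreshold(Cost))
      return false;
  return true;
}
```

Adding 25 to the cost of a many-caller callee is therefore the same as comparing its unchanged cost against a threshold of -20, which is why a bonus for few-caller callees is suggested instead.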

"Fast calls are usually optimized to minimize the call penalty, e.g. parameters are passed through registers."

BTW: In terms of code size this statement isn't convincing to me as calls still impose constraints on the register allocator so we end up producing moves: Parameters need to go in specific registers, results come out in specific registers and live-through values have to be moved to callee-saved registers.

In D30651#696071, @MatzeB wrote:

"Fast calls are usually optimized to minimize the call penalty, e.g. parameters are passed through registers."

BTW: In terms of code size this statement isn't convincing to me as calls still impose constraints on the register allocator so we end up producing moves: Parameters need to go in specific registers, results come out in specific registers and live-through values have to be moved to callee-saved registers.

If this were a common case, we would see code size increases in LNT.

eastig added inline comments. Mar 9 2017, 2:59 PM

lib/Analysis/InlineCost.cpp
1296	As far as I can see, Cost = Cost_we_cannot_remove - Cost_we_can_remove, where Cost_we_can_remove = Call + Ret + Arguments + Call_Penalty. Could you please explain the values of the threshold to me? Why is it 5 for Oz? Maybe it's simpler to adjust the threshold than to add my code.

OptMinSizeThreshold is basically set experimentally; http://reviews.llvm.org/rL288024 is the last time it was adjusted.

I don't think anyone's looked at CallPenalty in a long time; it's possible it's too large at -Oz.

In D30651#697010, @efriedma wrote:

OptMinSizeThreshold is basically set experimentally; http://reviews.llvm.org/rL288024 is the last time it was adjusted.

I don't think anyone's looked at CallPenalty in a long time; it's possible it's too large at -Oz.

I ran LNT for the patch. There are big code-size regressions in benchmarks written in C++. So the idea to treat functions with non-local linkage as having many users needs some research.

In D30651#697010, @efriedma wrote:

OptMinSizeThreshold is basically set experimentally; http://reviews.llvm.org/rL288024 is the last time it was adjusted.

I don't think anyone's looked at CallPenalty in a long time; it's possible it's too large at -Oz.

Hi Eli,
What do you think about the idea of adding a knob for CallPenalty? Playing with it and the threshold knob in the driver can do what I want to do with the patch.

I've found an interesting thing related to the cost of getelementptr:

while.cond:                                       ; preds = %while.body, %entry
  %dest.addr.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr1, %while.body ]
  %src.addr.0 = phi i8* [ %src, %entry ], [ %incdec.ptr, %while.body ]
  %tobool = icmp eq i32 %size.addr.0, 0
  br i1 %tobool, label %while.end, label %while.body

while.body:                                       ; preds = %while.cond
  %dec = add nsw i32 %size.addr.0, -1
  %incdec.ptr = getelementptr inbounds i8, i8* %src.addr.0, i32 1
  %0 = load i8, i8* %src.addr.0, align 1, !tbaa !12
  %incdec.ptr1 = getelementptr inbounds i8, i8* %dest.addr.0, i32 1
  store i8 %0, i8* %dest.addr.0, align 1, !tbaa !12
  br label %while.cond

while.end:                                        ; preds = %while.cond

The getelementptr instructions have inter-iteration dependencies. CallAnalyzer::visitGetElementPtr returns true for them, which means the instructions are assumed to be simplified away. As a result they don't contribute to the cost of inlining. In practice the instructions are not simplified and are lowered to instructions that do have a cost.
In fact, instructions with inter-iteration dependencies are not so easy to simplify.
Should we have a check of inter-iteration dependencies in CallAnalyzer::visitGetElementPtr?

I ran LNT for the patch. There are big code-size regressions in benchmarks written in C++. So the idea to treat functions with non-local linkage as having many users needs some research.

Hmm. Maybe you're just turning the inline threshold too low to eliminate C++ abstractions. Or maybe in C++, the average number of callers for a linkonce function is lower than we might expect. Not sure.

What do you think about the idea of adding a knob for CallPenalty? Playing with it and the threshold knob in the driver can do what I want to do with the patch.

Makes sense, at least to make it easier to experiment with it.

Should we have a check of inter-iteration dependencies in CallAnalyzer::visitGetElementPtr?

The problem isn't really inter-iteration dependencies, exactly... it's that the computation which declares a GEP "free" (see getGEPCost in TargetTransformInfoImpl.h) isn't paying attention to the instructions which use the GEP. A GEP can only be free if we can fold the addressing mode into a load or store instruction. Not sure how relevant that is in practice, though.

In D30651#697984, @efriedma wrote:

Should we have a check of inter-iteration dependencies in CallAnalyzer::visitGetElementPtr?

The problem isn't really inter-iteration dependencies, exactly... it's that the computation which declares a GEP "free" (see getGEPCost in TargetTransformInfoImpl.h) isn't paying attention to the instructions which use the GEP. A GEP can only be free if we can fold the addressing mode into a load or store instruction.

'getGEPCost' is not called. Execution ends here in InlineCost.cpp because an offset is a constant:

  if (IsGEPOffsetConstant(I)) {
    if (SROACandidate)
      SROAArgValues[&I] = SROAArg;

    // Constant GEPs are modeled as free.
    return true;
  }

I guess if it were not in a loop, the GEP would have been folded into a load/store instruction. Anyway, the cost calculation for the GEP is not correct.
Maybe instead of 'return true' it should be 'return isGEPFree(I);'. Then TTI::getGEPCost would report that it is not free.

Not sure how relevant that is in practice, though.

Do you mean the IR code I provided? This is what a naive memcopy function is translated into.

I guess if it were not in a loop, the GEP would have been folded into a load/store instruction. Anyway, the cost calculation for the GEP is not correct.
Maybe instead of 'return true' it should be 'return isGEPFree(I);'. Then TTI::getGEPCost would report that it is not free.

Oh, the inliner has its own equivalent? Anyway, that isn't really the point; the point is that we're concluding the GEP is free because we're assuming the users of the GEP are loads and stores, rather than PHI nodes/calls/etc.
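
The point about GEP users can be illustrated with a small standalone model. It is only a sketch under stated assumptions: `UserKind` and `gepFoldsIntoAllUsers` are hypothetical names, and a real check would also depend on the target's addressing modes and on whether the GEP is the address operand of the memory access.

```cpp
#include <cstddef>

// Hypothetical classification of the instructions that use a GEP result.
enum class UserKind { Load, Store, Phi, Call, Other };

// In this model a GEP is "free" only if every user is a load or store that
// could absorb the address computation into its addressing mode.
template <std::size_t N>
constexpr bool gepFoldsIntoAllUsers(const UserKind (&Users)[N]) {
  for (std::size_t I = 0; I < N; ++I)
    if (Users[I] != UserKind::Load && Users[I] != UserKind::Store)
      return false;
  return true;
}
```

In the loop IR posted earlier, `%incdec.ptr` and `%incdec.ptr1` are used only by the PHI nodes in `while.cond`, so this model would not treat them as free.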

Not sure how relevant that is in practice, though.

Do you mean the IR code I provided? This is what a naive memcopy function is translated into.

Sorry, wasn't clear there; I mean, it obviously triggers in a substantial number of cases, but it probably doesn't actually change the inline cost enough to matter in most cases.

In D30651#698341, @efriedma wrote:

I guess if it were not in a loop, the GEP would have been folded into a load/store instruction. Anyway, the cost calculation for the GEP is not correct.
Maybe instead of 'return true' it should be 'return isGEPFree(I);'. Then TTI::getGEPCost would report that it is not free.

Oh, the inliner has its own equivalent? Anyway, that isn't really the point; the point is that we're concluding the GEP is free because we're assuming the users of the GEP are loads and stores, rather than PHI nodes/calls/etc.

My understanding of GEP was exactly as you wrote. I've browsed the LangRef and http://llvm.org/docs/GetElementPtr.html and I have not found anything prohibiting the use of GEP in such a way. I think this representation of a pointer increment in a loop is better than:

inc.offset.0 = phi(offset, inc.offset)
inc.offset = inc.offset.0 + 1
ptr = GEP(base, inc.offset.0)

However, the latter gives the correct cost.
BTW, I did an experiment in which such GEPs are not free and the call penalty has lower values: 5 and 10. A value of 5 is the cost of one instruction; 10 is the cost of two instructions. This model of the call penalty might be good for ARM, where a call just saves the PC into a register, so it's something like a MOV + BR sequence.
I didn't use the many-callers heuristic. The inline threshold value was not changed; it was 5.
The functions I didn't want to inline were not inlined. The code-size regressions in the C++ benchmarks improved, but there are still some small regressions.

I checked if the GEPs from the provided IR are free on x86. No, they are not:

        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movl    %edx, -20(%rbp)
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movl    -20(%rbp), %eax
        movl    %eax, %ecx
        addl    $-1, %ecx
        movl    %ecx, -20(%rbp)
        cmpl    $0, %eax
        je      .LBB0_3
# BB#2:                                 #   in Loop: Header=BB0_1 Depth=1
        movq    -16(%rbp), %rax
        movq    %rax, %rcx
        addq    $1, %rcx
        movq    %rcx, -16(%rbp)
        movb    (%rax), %dl
        movq    -8(%rbp), %rax
        movq    %rax, %rcx
        addq    $1, %rcx
        movq    %rcx, -8(%rbp)
        movb    %dl, (%rax)
        jmp     .LBB0_1
.LBB0_3:

MOVQ + ADDQ are generated for them. I am writing a patch.

Revision Contents

Path	Size
include/llvm/Analysis/InlineCost.h	3 lines
lib/Analysis/InlineCost.cpp	38 lines
test/Transforms/Inline/minsize-fastcc-frq-func.ll	70 lines

Diff 90941

include/llvm/Analysis/InlineCost.h

... 135 lines not shown ... (context: struct InlineParams {)
   /// Threshold to use when the caller is optimized for minsize.
   Optional<int> OptMinSizeThreshold;

   /// Threshold to use when the callsite is considered hot.
   Optional<int> HotCallSiteThreshold;

   /// Threshold to use when the callsite is considered cold.
   Optional<int> ColdCallSiteThreshold;

+  /// Threshold to use when a function is considered frequently called.
+  Optional<int> FrequentFuncThreshold;
 };

 /// Generate the parameters to tune the inline cost analysis based only on the
 /// commandline options.
 InlineParams getInlineParams();

 /// Generate the parameters to tune the inline cost analysis based on command
 /// line options. If -inline-threshold option is not explicitly passed,
... 46 lines not shown ...

lib/Analysis/InlineCost.cpp

... 60 lines not shown ... (context: static cl::opt<int> ColdThreshold()
     "inlinecold-threshold", cl::Hidden, cl::init(225),
     cl::desc("Threshold for inlining functions with cold attribute"));

 static cl::opt<int>
     HotCallSiteThreshold("hot-callsite-threshold", cl::Hidden, cl::init(3000),
                          cl::ZeroOrMore,
                          cl::desc("Threshold for hot callsites "));

+/// When minsize (-Oz) is specified a function with a number of calls greater
+/// or equal this threshold is considered as a frequently used function.
+static cl::opt<int>
+    FrequentFuncThreshold("inline-freq-func-threshold", cl::Hidden, cl::init(3),
+                          cl::desc("Threshold for frequently called functions"));
+
 namespace {

 class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {
   typedef InstVisitor<CallAnalyzer, bool> Base;
   friend class InstVisitor<CallAnalyzer, bool>;

   /// The TargetTransformInfo available for this compilation.
   const TargetTransformInfo &TTI;
... 90 lines not shown ... (context: class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {)
   void updateThreshold(CallSite CS, Function &Callee);

   /// Return true if size growth is allowed when inlining the callee at CS.
   bool allowSizeGrowth(CallSite CS);

   // Custom analysis routines.
   bool analyzeBlock(BasicBlock *BB, SmallPtrSetImpl<const Value *> &EphValues);

+  bool isCallOfFrequentLocalLinkageFunc(CallSite CS);
+
   // Disable several entry points to the visitor so we don't accidentally use
   // them by declaring but not defining them here.
   void visit(Module *);
   void visit(Module &);
   void visit(Function *);
   void visit(Function &);
   void visit(BasicBlock *);
   void visit(BasicBlock &);
... 1,004 lines not shown ... (context: do {)
     }
     assert(V->getType()->isPointerTy() && "Unexpected operand type!");
   } while (Visited.insert(V).second);

   Type *IntPtrTy = DL.getIntPtrType(V->getContext());
   return cast<ConstantInt>(ConstantInt::get(IntPtrTy, Offset));
 }

+/// \brief Check if the CallSite is a call of a frequently called function
+/// having local linkage.
+bool CallAnalyzer::isCallOfFrequentLocalLinkageFunc(CallSite CS) {
+  Function *Callee = CS.getCalledFunction();
+  if (!Callee || !Callee->hasLocalLinkage())
+    return false;
+
+  int FrqFuncThreshold = (Params.FrequentFuncThreshold)
+                             ? Params.FrequentFuncThreshold.getValue()
+                             : FrequentFuncThreshold.getDefault().getValue();
+
+  assert(FrqFuncThreshold > 0);
+
+  return (Callee->getNumUses() >= static_cast<unsigned>(FrqFuncThreshold));
+}
+
 /// \brief Analyze a call site for potential inlining.
 ///
 /// Returns true if inlining this call is viable, and false if it is not
 /// viable. It computes the cost and adjusts the threshold based on numerous
 /// factors and heuristics. If this method returns false but the computed cost
 /// is below the computed threshold, then inlining was forcibly disabled by
 /// some artifact of the routine.
 bool CallAnalyzer::analyzeCall(CallSite CS) {
   ++NumCallsAnalyzed;
... 53 lines not shown ... (context: for (unsigned I = 0, E = CS.arg_size(); I != E; ++I) {)
     } else {
       // For non-byval arguments subtract off one instruction per call
       // argument.
       Cost -= InlineConstants::InstrCost;
     }
   }
   // The call instruction also disappears after inlining.
   Cost -= InlineConstants::InstrCost + InlineConstants::CallPenalty;

+  // Don't consider the call penalty as a penalty if minsize (-Oz) is
+  // requested and it is a fast call of a frequently used function.
+  // Fast calls are optimized to reduce the call penalty.
+  assert(CS.getCaller() && "CallSite without Caller");
+  if (CS.getCaller()->hasFnAttribute(Attribute::MinSize) &&
+      isCallOfFrequentLocalLinkageFunc(CS)) {
+    Cost += InlineConstants::CallPenalty;
+  }
+
   // If there is only one call of the function, and it has internal linkage,
   // the cost of inlining it drops dramatically.
   bool OnlyOneCallAndLocalLinkage =
       F.hasLocalLinkage() && F.hasOneUse() && &F == CS.getCalledFunction();
   if (OnlyOneCallAndLocalLinkage)
     Cost -= InlineConstants::LastCallToStaticBonus;

   // If this function uses the coldcc calling convention, prefer not to inline
... 300 lines not shown ... (context: InlineParams llvm::getInlineParams(int Threshold) {)
   Params.HintThreshold = HintThreshold;

   // Set the HotCallSiteThreshold knob from the -hot-callsite-threshold.
   Params.HotCallSiteThreshold = HotCallSiteThreshold;

   // Set the ColdCallSiteThreshold knob from the -inline-cold-callsite-threshold.
   Params.ColdCallSiteThreshold = ColdCallSiteThreshold;

+  // Set the FrequentFuncThreshold knob from the -inline-freq-func-threshold.
+  Params.FrequentFuncThreshold = FrequentFuncThreshold;
+
   // Set the OptMinSizeThreshold and OptSizeThreshold params only if the
   // Set the OptMinSizeThreshold and OptSizeThreshold params only if the
   // -inlinehint-threshold commandline option is not explicitly given. If that
   // option is present, then its value applies even for callees with size and
   // minsize attributes.
   // If the -inline-threshold is not specified, set the ColdThreshold from the
   // -inlinecold-threshold even if it is not explicitly passed. If
   // -inline-threshold is specified, then -inlinecold-threshold needs to be
... 31 lines not shown ...

test/Transforms/Inline/minsize-fastcc-frq-func.ll

This file was added.

; Check that fast calls of frequently used functions are not inlined if minsize
; is requested.
; The first test is to check the default threshold for frequent functions
; which is 4.
; The second test is to provide a value for the threshold and to check
; a function is not inlined.
;
; RUN: opt < %s -inline -S | FileCheck -check-prefix=DEFAULT_THRESHOLD %s
; RUN: opt < %s -inline -inline-freq-func-threshold=2 -S | FileCheck -check-prefix=THRESHOLD_VALUE_2 %s

define internal i32 @X9(i32 %x) nounwind {
  %x2 = add i32 %x, %x
  %x3 = add i32 %x2, %x
  %x4 = add i32 %x3, %x
  %x5 = add i32 %x4, %x
  %x6 = add i32 %x5, %x
  %x7 = add i32 %x6, %x
  %x8 = add i32 %x7, %x
  %x9 = add i32 %x8, %x

  ret i32 %x9
}

define internal i32 @Add_10_X9(i32 %x) nounwind {
  %x2 = add i32 10, %x
  %x3 = add i32 10, %x2
  %x4 = add i32 10, %x3
  %x5 = add i32 10, %x4
  %x6 = add i32 10, %x5
  %x7 = add i32 10, %x6
  %x8 = add i32 10, %x7
  %x9 = add i32 10, %x8

  ret i32 %x9
}

define i32 @f1(i32 %x) nounwind minsize {
  %res = call i32 @X9(i32 %x)
  ret i32 %res
; DEFAULT_THRESHOLD-LABEL: @f1(
; DEFAULT_THRESHOLD-NEXT: %res = call i32 @X9
}

define i32 @f2(i32 %x) nounwind {
  %res = call i32 @X9(i32 %x)
  ret i32 %res
}

define i32 @f3(i32 %x) nounwind {
  %res = call i32 @X9(i32 %x)
  ret i32 %res
}

define i32 @f4(i32 %x) nounwind {
  %res = call i32 @X9(i32 %x)
  ret i32 %res
}

define i32 @f5(i32 %x) nounwind minsize {
  %res1 = call i32 @Add_10_X9(i32 %x)
  %res2 = call i32 @Add_10_X9(i32 %res1)
  ret i32 %res2
; THRESHOLD_VALUE_2-LABEL: @f5(
; THRESHOLD_VALUE_2-NEXT: %res1 = call i32 @Add_10_X9
; THRESHOLD_VALUE_2-NEXT: %res2 = call i32 @Add_10_X9

; DEFAULT_THRESHOLD-LABEL: @f5(
; DEFAULT_THRESHOLD-CHECK-NOT: call
}