Details

Reviewers

Commits

rGc5fa6358ba73: [NewPM/Inliner] Reduce threshold for cold callsites in the non-PGO case
rL306484: [NewPM/Inliner] Reduce threshold for cold callsites in the non-PGO case

Summary

This applies inline-cold-callsite-threshold to callsites that are cold locally (within the function) in the absence of profile information. A callsite's coldness is determined based on its BFI relative to the caller's entry BFI. The default value chosen is the callsite's frequency being <= 1/100th of the caller's entry frequency. In general this is a small size win. For example, the llvm test-suite sees a mean text size reduction of 0.03%, but there are some nice wins in large benchmarks there (sqlite sees a 23% reduction and 7zip sees a 4% reduction). There are some regressions, but those are all on benchmarks with smaller code size (<4K). This also improves an internal compression benchmark by around 8% by preventing a cold callee from being inlined and thereby allowing the caller to be inlined into its caller. The 1% threshold allows a callsite guarded by __builtin_expect(cond, 0) inside a singly-nested loop to be considered cold.
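To make the rule in the summary concrete, here is a minimal standalone sketch of the relative-frequency test. It is not the patch's code: `isColdRelativeToEntry` and its parameters are hypothetical names, and frequencies are modelled as plain 64-bit integers rather than LLVM's `BlockFrequency`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only: a callsite counts as cold when
// its block frequency is at most `percent` percent of the caller's entry
// frequency. Dividing before multiplying keeps the arithmetic within 64 bits,
// at the cost of truncating entryFreq / 100.
bool isColdRelativeToEntry(uint64_t callSiteFreq, uint64_t entryFreq,
                           uint64_t percent = 1) {
  return callSiteFreq <= entryFreq / 100 * percent;
}
```

With an entry frequency of 1000 and the 1% default described above, a callsite frequency of 10 is cold under this sketch and 11 is not.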

I am working on doing similar identification of hot callsites without profile information, but tuning that is proving to be harder and so I want to start with this first.

Diff Detail

Repository: rL LLVM

Event Timeline

eraman created this revision. Jun 16 2017, 6:09 PM

davidxl added inline comments. Jun 16 2017, 10:40 PM

lib/Analysis/InlineCost.cpp
691 ↗	(On Diff #102917)	Can the static/local call site check and the global cold site check be moved into a helper function such as CallAnalyzer::isColdCallsite ?
test/Transforms/Inline/inline-cold-callsite.ll
35 ↗	(On Diff #102917)	change the label name to 'cold:'.

Address David's comments.

eraman marked 2 inline comments as done. Jun 17 2017, 12:54 PM

Ping.

chandlerc added inline comments. Jun 26 2017, 12:00 PM

lib/Analysis/InlineCost.cpp
72–76 ↗	(On Diff #102943)	Did clang-format do this? Yuck. You may get better results by merging the string literal into a single-line string literal and letting clang-format re-break it from there.
656–659 ↗	(On Diff #102943)	Rather than digging the integers out (and risking overflow) I would expect to use `BlockFrequency` and `BranchProbability` directly? For example, in MBP we do something along the lines of...

    const BranchProbability ColdProb(1, 5); // 20%
    auto CallSiteFreq = CallerBFI->getBlockFreq(CallSiteBB);
    auto EntryFreq = CallerBFI->getEntryFreq();
    return CallSiteFreq < EntryFreq * ColdProb;

(Clearly the coldness should be different here than it is in MBP, just seems like if the way we scale `BlockFrequency` is with a `BranchProbability`, we should keep doing that here.) It also might be good to try and cache the scaled entry frequency so that we don't re-compute it for each call site?
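The overflow concern can be illustrated with a small standalone sketch. Both functions below are hypothetical and are not LLVM APIs. The second one assumes a probability is represented as a numerator over a fixed denominator of 2^31, which is a modelling assumption made here for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Naive percentage scaling: the product freq * percent can wrap around
// 64 bits when freq is large, which silently yields a far smaller result.
uint64_t naiveScale(uint64_t freq, uint64_t percent) {
  return freq * percent / 100;
}

// Overflow-free scaling by n / 2^31, assuming n <= 2^31.
// freq * n == (hi * n) * 2^32 + (lo * n), and each partial product fits in
// 64 bits, so dividing by 2^31 reduces to a shift on each part.
uint64_t scaleByFixedPoint(uint64_t freq, uint32_t n) {
  const uint64_t hi = freq >> 32;
  const uint64_t lo = freq & 0xffffffffu;
  return ((hi * n) << 1) + ((lo * n) >> 31);
}
```

Scaling through a dedicated probability type keeps this splitting logic in one place instead of repeating raw integer arithmetic at each use.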

Address Chandler's comments.

Harbormaster completed remote builds in B7656: Diff 104191. Jun 27 2017, 9:21 AM

I used BranchProbability as you suggested, but due to BranchProbability's implementation I need to adjust the test case (essentially, BranchProbability(1,100) is < 1/100). This is not a big deal, but it does not treat a callsite guarded by a builtin_expect inside a loop as cold anymore.
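The remark that BranchProbability(1,100) ends up below 1/100 can be reproduced with a simplified fixed-point model. The sketch assumes the probability is stored as a numerator over a fixed denominator of 2^31, rounded to nearest on construction, and that scaling truncates. The names `kDenom`, `toFixedPoint` and `scaleSmall` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed fixed denominator of the fixed-point probability.
constexpr uint64_t kDenom = 1ull << 31;

// Round num/den to the nearest multiple of 1/2^31 (requires num <= den).
constexpr uint32_t toFixedPoint(uint32_t num, uint32_t den) {
  return static_cast<uint32_t>((num * kDenom + den / 2) / den);
}

// floor(freq * n / 2^31). Only valid while freq * n fits in 64 bits, so this
// is meant for small illustrative frequencies.
constexpr uint64_t scaleSmall(uint64_t freq, uint32_t n) {
  return freq * n / kDenom;
}
```

Under these assumptions 1/100 becomes 21474836 / 2^31, which is slightly below 0.01, so an entry frequency of 1000000 scales to 9999 rather than 10000. A callsite at exactly 1/100th of the entry frequency then fails a strict `<` comparison against the scaled value.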

In D34312#792293, @eraman wrote:

I used BranchProbability as you suggested, but due to BranchProbability's implementation I need to adjust the test case (essentially, BranchProbability(1,100) is < 1/100). This is not a big deal, but it does not treat a callsite guarded by a builtin_expect inside a loop as cold anymore.

I'd be happy to increase this to 2/100 to start with if that makes more sense to you. I agree that a callsite guarded by builtin_expect inside a loop seems like a good baseline for "is this threshold sane". (Naturally, we can tweak it further based on experience using it in the wild, but good to get a sane initial starting point.)

Also LGTM generally. Feel free to submit with updated (or current) threshold based on what makes the most sense to you. Also a minor comment below, lemme know if you want another look at the code if caching that ends up useful but weird/complex.

lib/Analysis/InlineCost.cpp
658 ↗	(On Diff #104191)	As mentioned earlier, is it possible to cache this so that we don't re-compute the scaled entry frequency for each callsite? Maybe there isn't a good place to cache it...

This revision is now accepted and ready to land. Jun 27 2017, 2:30 PM

Increase the threshold to 2, fix a bug and add comments about caching the caller entry threshold.

In D34312#792837, @chandlerc wrote:

Also LGTM generally. Feel free to submit with updated (or current) threshold based on what makes the most sense to you. Also a minor comment below, lemme know if you want another look at the code if caching that ends up useful but weird/complex.

I have increased the threshold to 2. I've also fixed a thinko in constructing the ColdProb object (I wrote ColdProb(1, 100 * ColdCallSiteRelFreq), but it really should be ColdProb(ColdCallSiteRelFreq, 100)). Will submit with these changes.
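The difference between the two constructions is easy to see with plain fractions. The sketch below is illustrative only: `Fraction`, `coldProbMistaken`, `coldProbIntended` and `value` are hypothetical names, and no LLVM types are involved.

```cpp
#include <cassert>

// A probability as a plain numerator/denominator pair, for illustration.
struct Fraction {
  unsigned num;
  unsigned den;
};

// The mistaken form: 1 / (100 * relFreq) shrinks as relFreq grows.
constexpr Fraction coldProbMistaken(unsigned relFreq) {
  return {1, 100 * relFreq};
}

// The intended form: relFreq / 100, i.e. relFreq percent.
constexpr Fraction coldProbIntended(unsigned relFreq) {
  return {relFreq, 100};
}

constexpr double value(Fraction f) {
  return static_cast<double>(f.num) / f.den;
}
```

With a relative frequency of 1 the two forms are both 1/100. With the new value of 2 the mistaken form gives 1/200 (0.5%) instead of the intended 2/100 (2%).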

lib/Analysis/InlineCost.cpp
658 ↗	(On Diff #104191)	Discussed this offline with Chandler and we agreed that the effort to cache this is not worth it unless this starts showing up in the profiles. I have added a comment reflecting this.

Closed by commit rL306484: [NewPM/Inliner] Reduce threshold for cold callsites in the non-PGO case (authored by eraman). Jun 27 2017, 4:11 PM

This revision was automatically updated to reflect the committed changes.

Diff 104305

llvm/trunk/lib/Analysis/InlineCost.cpp

[60 lines not shown; context: static cl::opt<int> ColdThreshold(]
     "inlinecold-threshold", cl::Hidden, cl::init(45),
     cl::desc("Threshold for inlining functions with cold attribute"));

 static cl::opt<int>
     HotCallSiteThreshold("hot-callsite-threshold", cl::Hidden, cl::init(3000),
                          cl::ZeroOrMore,
                          cl::desc("Threshold for hot callsites "));

+static cl::opt<int> ColdCallSiteRelFreq(
+    "cold-callsite-rel-freq", cl::Hidden, cl::init(2), cl::ZeroOrMore,
+    cl::desc("Maxmimum block frequency, expressed as a percentage of caller's "
+             "entry frequency, for a callsite to be cold in the absence of "
+             "profile information."));
+
 namespace {

 class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {
   typedef InstVisitor<CallAnalyzer, bool> Base;
   friend class InstVisitor<CallAnalyzer, bool>;

   /// The TargetTransformInfo available for this compilation.
   const TargetTransformInfo &TTI;
[90 lines not shown; context: class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {]
   /// attributes and callee hotness for PGO builds. The Callee is explicitly
   /// passed to support analyzing indirect calls whose target is inferred by
   /// analysis.
   void updateThreshold(CallSite CS, Function &Callee);

   /// Return true if size growth is allowed when inlining the callee at CS.
   bool allowSizeGrowth(CallSite CS);

+  /// Return true if \p CS is a cold callsite.
+  bool isColdCallSite(CallSite CS, BlockFrequencyInfo *CallerBFI);
+
   // Custom analysis routines.
   bool analyzeBlock(BasicBlock *BB, SmallPtrSetImpl<const Value *> &EphValues);

   // Disable several entry points to the visitor so we don't accidentally use
   // them by declaring but not defining them here.
   void visit(Module *);
   void visit(Module &);
   void visit(Function *);
[443 lines not shown; context: if (InvokeInst *II = dyn_cast<InvokeInst>(Instr)) {]
     if (isa<UnreachableInst>(II->getNormalDest()->getTerminator()))
       return false;
   } else if (isa<UnreachableInst>(Instr->getParent()->getTerminator()))
     return false;

   return true;
 }

+bool CallAnalyzer::isColdCallSite(CallSite CS, BlockFrequencyInfo *CallerBFI) {
+  // If global profile summary is available, then callsite's coldness is
+  // determined based on that.
+  if (PSI->hasProfileSummary())
+    return PSI->isColdCallSite(CS, CallerBFI);
+  if (!CallerBFI)
+    return false;
+
+  // In the absence of global profile summary, determine if the callsite is cold
+  // relative to caller's entry. We could potentially cache the computation of
+  // scaled entry frequency, but the added complexity is not worth it unless
+  // this scaling shows up high in the profiles.
+  const BranchProbability ColdProb(ColdCallSiteRelFreq, 100);
+  auto CallSiteBB = CS.getInstruction()->getParent();
+  auto CallSiteFreq = CallerBFI->getBlockFreq(CallSiteBB);
+  auto CallerEntryFreq =
+      CallerBFI->getBlockFreq(&(CS.getCaller()->getEntryBlock()));
+  return CallSiteFreq < CallerEntryFreq * ColdProb;
+}
+
 void CallAnalyzer::updateThreshold(CallSite CS, Function &Callee) {
   // If no size growth is allowed for this inlining, set Threshold to 0.
   if (!allowSizeGrowth(CS)) {
     Threshold = 0;
     return;
   }

   Function *Caller = CS.getCaller();
[29 lines not shown; context: if (PSI) {]
       // Use callee's hotness information only if we have no way of determining
       // callsite's hotness information. Callsite hotness can be determined if
       // sample profile is used (which adds hotness metadata to calls) or if
       // caller's BlockFrequencyInfo is available.
       if (CallerBFI || PSI->hasSampleProfile()) {
         if (PSI->isHotCallSite(CS, CallerBFI)) {
           DEBUG(dbgs() << "Hot callsite.\n");
           Threshold = Params.HotCallSiteThreshold.getValue();
-        } else if (PSI->isColdCallSite(CS, CallerBFI)) {
+        } else if (isColdCallSite(CS, CallerBFI)) {
           DEBUG(dbgs() << "Cold callsite.\n");
           Threshold = MinIfValid(Threshold, Params.ColdCallSiteThreshold);
         }
       } else {
         if (PSI->isFunctionEntryHot(&Callee)) {
           DEBUG(dbgs() << "Hot callee.\n");
           // If callsite hotness can not be determined, we may still know
           // that the callee is hot and treat it as a weaker hint for threshold
[979 lines not shown]

llvm/trunk/test/Transforms/Inline/inline-cold-callsite-pgo.ll

+; RUN: opt < %s -passes='require<profile-summary>,cgscc(inline)' -inline-threshold=100 -inline-cold-callsite-threshold=0 -S | FileCheck %s
+
+; This tests that a cold callsite gets the inline-cold-callsite-threshold
+; and does not get inlined. Another callsite to an identical callee that
+; is not cold gets inlined because cost is below the inline-threshold.
+
+define i32 @callee1(i32 %x) !prof !21 {
+  %x1 = add i32 %x, 1
+  %x2 = add i32 %x1, 1
+  %x3 = add i32 %x2, 1
+  call void @extern()
+  ret i32 %x3
+}
+
+define i32 @caller(i32 %n) !prof !22 {
+; CHECK-LABEL: @caller(
+  %cond = icmp sle i32 %n, 100
+  br i1 %cond, label %cond_true, label %cond_false, !prof !0
+
+cond_true:
+; CHECK-LABEL: cond_true:
+; CHECK-NOT: call i32 @callee1
+; CHECK: ret i32 %x3.i
+  %i = call i32 @callee1(i32 %n)
+  ret i32 %i
+cond_false:
+; CHECK-LABEL: cond_false:
+; CHECK: call i32 @callee1
+; CHECK: ret i32 %j
+  %j = call i32 @callee1(i32 %n)
+  ret i32 %j
+}
+declare void @extern()
+
+!0 = !{!"branch_weights", i32 200, i32 1}
+
+!llvm.module.flags = !{!1}
+!21 = !{!"function_entry_count", i64 200}
+!22 = !{!"function_entry_count", i64 200}
+
+!1 = !{i32 1, !"ProfileSummary", !2}
+!2 = !{!3, !4, !5, !6, !7, !8, !9, !10}
+!3 = !{!"ProfileFormat", !"InstrProf"}
+!4 = !{!"TotalCount", i64 10000}
+!5 = !{!"MaxCount", i64 1000}
+!6 = !{!"MaxInternalCount", i64 1}
+!7 = !{!"MaxFunctionCount", i64 1000}
+!8 = !{!"NumCounts", i64 3}
+!9 = !{!"NumFunctions", i64 3}
+!10 = !{!"DetailedSummary", !11}
+!11 = !{!12, !13, !14}
+!12 = !{i32 10000, i64 1000, i32 1}
+!13 = !{i32 999000, i64 1000, i32 1}
+!14 = !{i32 999999, i64 1, i32 2}

llvm/trunk/test/Transforms/Inline/inline-cold-callsite.ll


 ; RUN: opt < %s -passes='require<profile-summary>,cgscc(inline)' -inline-threshold=100 -inline-cold-callsite-threshold=0 -S | FileCheck %s

 ; This tests that a cold callsite gets the inline-cold-callsite-threshold
 ; and does not get inlined. Another callsite to an identical callee that
 ; is not cold gets inlined because cost is below the inline-threshold.

-define i32 @callee1(i32 %x) !prof !21 {
-  %x1 = add i32 %x, 1
-  %x2 = add i32 %x1, 1
-  %x3 = add i32 %x2, 1
+define void @callee() {
+  call void @extern()
   call void @extern()
-  ret i32 %x3
+  ret void
 }

-define i32 @caller(i32 %n) !prof !22 {
-; CHECK-LABEL: @caller(
-  %cond = icmp sle i32 %n, 100
-  br i1 %cond, label %cond_true, label %cond_false, !prof !0
-
-cond_true:
-; CHECK-LABEL: cond_true:
-; CHECK-NOT: call i32 @callee1
-; CHECK: ret i32 %x3.i
-  %i = call i32 @callee1(i32 %n)
-  ret i32 %i
-cond_false:
-; CHECK-LABEL: cond_false:
-; CHECK: call i32 @callee1
-; CHECK: ret i32 %j
-  %j = call i32 @callee1(i32 %n)
-  ret i32 %j
-}
 declare void @extern()
+declare i1 @ext(i32)
+
+; CHECK-LABEL: caller
+define i32 @caller(i32 %n) {
+entry:
+  %cmp4 = icmp sgt i32 %n, 0
+  br i1 %cmp4, label %for.body, label %for.cond.cleanup
+
+for.cond.cleanup:
+  ret i32 0
+
+for.body:
+  %i.05 = phi i32 [ %inc, %for.inc ], [ 0, %entry ]
+; CHECK: %call = tail call
+  %call = tail call zeroext i1 @ext(i32 %i.05)
+; CHECK-NOT: call void @callee
+; CHECK-NEXT: call void @extern
+  call void @callee()
+  br i1 %call, label %cold, label %for.inc, !prof !0
+
+cold:
+; CHECK: call void @callee
+  call void @callee()
+  br label %for.inc
+
+for.inc:
+  %inc = add nuw nsw i32 %i.05, 1
+  %exitcond = icmp eq i32 %inc, %n
+  br i1 %exitcond, label %for.cond.cleanup, label %for.body
+}

-!0 = !{!"branch_weights", i32 200, i32 1}
-
-!llvm.module.flags = !{!1}
-!21 = !{!"function_entry_count", i64 200}
-!22 = !{!"function_entry_count", i64 200}
-
-!1 = !{i32 1, !"ProfileSummary", !2}
-!2 = !{!3, !4, !5, !6, !7, !8, !9, !10}
-!3 = !{!"ProfileFormat", !"InstrProf"}
-!4 = !{!"TotalCount", i64 10000}
-!5 = !{!"MaxCount", i64 1000}
-!6 = !{!"MaxInternalCount", i64 1}
-!7 = !{!"MaxFunctionCount", i64 1000}
-!8 = !{!"NumCounts", i64 3}
-!9 = !{!"NumFunctions", i64 3}
-!10 = !{!"DetailedSummary", !11}
-!11 = !{!12, !13, !14}
-!12 = !{i32 10000, i64 1000, i32 1}
-!13 = !{i32 999000, i64 1000, i32 1}
-!14 = !{i32 999999, i64 1, i32 2}
+!0 = !{!"branch_weights", i32 1, i32 2000}

This is an archive of the discontinued LLVM Phabricator instance.

[NewPM/Inliner] Reduce threshold for cold callsites in the non-PGO case
ClosedPublic
