This is an archive of the discontinued LLVM Phabricator instance.

[Inliner] Increase threshold for hot callsites without PGO.
ClosedPublic

Authored by eraman on Aug 1 2017, 4:17 PM.

Download Raw Diff

Details

Reviewers

chandlerc
davidxl

Commits

rG974d4eea93a6: [Inliner] Increase threshold for hot callsites without PGO.
rL309994: [Inliner] Increase threshold for hot callsites without PGO.

Summary

This increases the inlining threshold for hot callsites. Hotness is
defined in terms of block frequency of the callsite relative to the
caller's entry block's frequency. Since this requires BFI in the
inliner, this only affects the new PM pipeline.

This improves the performance of some internal benchmarks. Notably, an
internal benchmark for Gipfeli compression
(https://github.com/google/gipfeli) improves by ~7%. Povray in SPEC2006
improves by ~2.5%. I am running more experiments and will update the
thread if other benchmarks show improvement/regression.

In terms of text size, LLVM test-suite shows an 1.22% text size
increase. Diving into the results, 13 of the benchmarks in the
test-suite increases by > 10%. Most of these are small, but
Adobe-C++/loop_unroll (17.6% increases) and tramp3d(20.7% size increase)
have >250K text size. On a large application, the text size increases by
2%

Diff Detail

Build Status

Buildable 8866
Build 8866: arc lint + arc unit

Event Timeline

eraman created this revision.Aug 1 2017, 4:17 PM

While the total size increase doesn't concern me much, the >10% code size growth in some benchmarks is a bit concerning. Do these benchmarks also improve performance? If so, then this might be fine in general (once the -Os behavior is fixed, see below). However, if the benchmarks that are growing in size by a lot aren't also getting faster, then it seems a hard sell at O2 where we expect size increase to be at least generally associated with performance wins.

Naturally, this seems completely fine at O3 either way.

lib/Analysis/InlineCost.cpp
782–784	I think we need to avoid this for optForSize() functions.

haicheng added a subscriber: haicheng.Aug 1 2017, 7:48 PM

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

Thanks,

Haicheng

mcrosier added a subscriber: mcrosier.Aug 2 2017, 7:26 AM

In D36199#828958, @haicheng wrote:

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

(FWIW, my 2 cents here would be that this should win over inline hint, as it should be quite a bit stronger. Anyways, will let Easwaran actually respond as well as I'm curious what he thinks.)

In D36199#829276, @chandlerc wrote:

In D36199#828958, @haicheng wrote:

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

(FWIW, my 2 cents here would be that this should win over inline hint, as it should be quite a bit stronger. Anyways, will let Easwaran actually respond as well as I'm curious what he thinks.)

I initially started with a multiplier instead of an absolute threshold, but changed it mainly to be consistent with the PGO based hot callsite. It is definitely a stronger hint, but I think an argument can be made that various "signals" should be composed.

In D36199#829525, @eraman wrote:

In D36199#829276, @chandlerc wrote:

In D36199#828958, @haicheng wrote:

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

(FWIW, my 2 cents here would be that this should win over inline hint, as it should be quite a bit stronger. Anyways, will let Easwaran actually respond as well as I'm curious what he thinks.)

I initially started with a multiplier instead of an absolute threshold, but changed it mainly to be consistent with the PGO based hot callsite. It is definitely a stronger hint, but I think an argument can be made that various "signals" should be composed.

Maybe a follow-up with benchmarks motivating (and checking for size issues)?

In D36199#828523, @chandlerc wrote:

While the total size increase doesn't concern me much, the >10% code size growth in some benchmarks is a bit concerning. Do these benchmarks also improve performance? If so, then this might be fine in general (once the -Os behavior is fixed, see below). However, if the benchmarks that are growing in size by a lot aren't also getting faster, then it seems a hard sell at O2 where we expect size increase to be at least generally associated with performance wins.

Naturally, this seems completely fine at O3 either way.

There are 5 tests in the test-suite with text size increase > 10% and > 4K. Three of them improve and two don't. For performance numbers, I used the perf tool to collect uops_retired:any on an Intel Xeon E5-2690 since the running time of these are small and noisy.

Test - Size increase KB (%) - Performance improvement

MultiSource/Benchmarks/VersaBench/beamformer - 5K(41.6%) - None
SingleSource/Benchmarks/Linpack/linpack-pc - 7K(36.4%) - +2%
MultiSource/Benchmarks/FreeBench/pifft - 18K(33%) - None
MultiSource/Benchmarks/tramp3d-v4 - 148K(20.7%) - 18%
SingleSource/Benchmarks/Adobe-C++/loop_unroll - 44K(17.6%) - 9%

The C/C++ subset of SPEC2006 increases by 0.5%. Two benchmarks improve in performance: 453.povray (+1.5%) and 473.astar (+1.81%).

If you think this is iffy for O2, I'll do this only for O3 for now and work on tuning it further with the goal of enabling it at O2.

In D36199#829818, @eraman wrote:

In D36199#828523, @chandlerc wrote:

While the total size increase doesn't concern me much, the >10% code size growth in some benchmarks is a bit concerning. Do these benchmarks also improve performance? If so, then this might be fine in general (once the -Os behavior is fixed, see below). However, if the benchmarks that are growing in size by a lot aren't also getting faster, then it seems a hard sell at O2 where we expect size increase to be at least generally associated with performance wins.

Naturally, this seems completely fine at O3 either way.

There are 5 tests in the test-suite with text size increase > 10% and > 4K. Three of them improve and two don't. For performance numbers, I used the perf tool to collect uops_retired:any on an Intel Xeon E5-2690 since the running time of these are small and noisy.

Test - Size increase KB (%) - Performance improvement

MultiSource/Benchmarks/VersaBench/beamformer - 5K(41.6%) - None
SingleSource/Benchmarks/Linpack/linpack-pc - 7K(36.4%) - +2%
MultiSource/Benchmarks/FreeBench/pifft - 18K(33%) - None
MultiSource/Benchmarks/tramp3d-v4 - 148K(20.7%) - 18%
SingleSource/Benchmarks/Adobe-C++/loop_unroll - 44K(17.6%) - 9%

The C/C++ subset of SPEC2006 increases by 0.5%. Two benchmarks improve in performance: 453.povray (+1.5%) and 473.astar (+1.81%).

First and foremost, thanks for the excellent data and carefully gathering all of it. =]

If you think this is iffy for O2, I'll do this only for O3 for now and work on tuning it further with the goal of enabling it at O2.

Yeah, I would put it in O3 for now, essentially to be conservative. I think this is a borderline case, but it seems easier to revisit it later than deal with churn of regressing people now, and it seems like a clear win for O3.

Apply this heuristic only at O3 and disable for optsize

LGTM, thanks!

This revision is now accepted and ready to land.Aug 3 2017, 2:54 PM

Closed by commit rL309994: [Inliner] Increase threshold for hot callsites without PGO. (authored by eraman). · Explain WhyAug 3 2017, 3:24 PM

This revision was automatically updated to reflect the committed changes.

haicheng added inline comments.Aug 4 2017, 8:18 AM

llvm/trunk/lib/Analysis/InlineCost.cpp
81 ↗	(On Diff #109643)	Is this a typo? Maximum => Minimum

Revision Contents

Path

Size

include/

llvm/

Analysis/

InlineCost.h

4 lines

lib/

Analysis/

InlineCost.cpp

47 lines

Diff 109250

include/llvm/Analysis/InlineCost.h

Show First 20 Lines • Show All 133 Lines • ▼ Show 20 Lines	struct InlineParams {
Optional<int> OptSizeThreshold;		Optional<int> OptSizeThreshold;

/// Threshold to use when the caller is optimized for minsize.		/// Threshold to use when the caller is optimized for minsize.
Optional<int> OptMinSizeThreshold;		Optional<int> OptMinSizeThreshold;

/// Threshold to use when the callsite is considered hot.		/// Threshold to use when the callsite is considered hot.
Optional<int> HotCallSiteThreshold;		Optional<int> HotCallSiteThreshold;

		/// Threshold to use when the callsite is considered hot relative to function
		/// entry.
		Optional<int> LocallyHotCallSiteThreshold;

/// Threshold to use when the callsite is considered cold.		/// Threshold to use when the callsite is considered cold.
Optional<int> ColdCallSiteThreshold;		Optional<int> ColdCallSiteThreshold;
};		};

/// Generate the parameters to tune the inline cost analysis based only on the		/// Generate the parameters to tune the inline cost analysis based only on the
/// commandline options.		/// commandline options.
InlineParams getInlineParams();		InlineParams getInlineParams();

▲ Show 20 Lines • Show All 52 Lines • Show Last 20 Lines

lib/Analysis/InlineCost.cpp

Show First 20 Lines • Show All 60 Lines • ▼ Show 20 Lines	static cl::opt<int> ColdThreshold(
"inlinecold-threshold", cl::Hidden, cl::init(45),		"inlinecold-threshold", cl::Hidden, cl::init(45),
cl::desc("Threshold for inlining functions with cold attribute"));		cl::desc("Threshold for inlining functions with cold attribute"));

static cl::opt<int>		static cl::opt<int>
HotCallSiteThreshold("hot-callsite-threshold", cl::Hidden, cl::init(3000),		HotCallSiteThreshold("hot-callsite-threshold", cl::Hidden, cl::init(3000),
cl::ZeroOrMore,		cl::ZeroOrMore,
cl::desc("Threshold for hot callsites "));		cl::desc("Threshold for hot callsites "));

		static cl::opt<int> LocallyHotCallSiteThreshold(
		"locally-hot-callsite-threshold", cl::Hidden, cl::init(525), cl::ZeroOrMore,
		cl::desc("Threshold for locally hot callsites "));

static cl::opt<int> ColdCallSiteRelFreq(		static cl::opt<int> ColdCallSiteRelFreq(
"cold-callsite-rel-freq", cl::Hidden, cl::init(2), cl::ZeroOrMore,		"cold-callsite-rel-freq", cl::Hidden, cl::init(2), cl::ZeroOrMore,
cl::desc("Maxmimum block frequency, expressed as a percentage of caller's "		cl::desc("Maxmimum block frequency, expressed as a percentage of caller's "
"entry frequency, for a callsite to be cold in the absence of "		"entry frequency, for a callsite to be cold in the absence of "
"profile information."));		"profile information."));

		static cl::opt<int> HotCallSiteRelFreq(
		"hot-callsite-rel-freq", cl::Hidden, cl::init(60), cl::ZeroOrMore,
		cl::desc("Maxmimum block frequency, expressed as a multiple of caller's "
		"entry frequency, for a callsite to be hot in the absence of "
		"profile information."));

namespace {		namespace {

class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {		class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {
typedef InstVisitor<CallAnalyzer, bool> Base;		typedef InstVisitor<CallAnalyzer, bool> Base;
friend class InstVisitor<CallAnalyzer, bool>;		friend class InstVisitor<CallAnalyzer, bool>;

/// The TargetTransformInfo available for this compilation.		/// The TargetTransformInfo available for this compilation.
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
▲ Show 20 Lines • Show All 94 Lines • ▼ Show 20 Lines	class CallAnalyzer : public InstVisitor<CallAnalyzer, bool> {
void updateThreshold(CallSite CS, Function &Callee);		void updateThreshold(CallSite CS, Function &Callee);

/// Return true if size growth is allowed when inlining the callee at CS.		/// Return true if size growth is allowed when inlining the callee at CS.
bool allowSizeGrowth(CallSite CS);		bool allowSizeGrowth(CallSite CS);

/// Return true if \p CS is a cold callsite.		/// Return true if \p CS is a cold callsite.
bool isColdCallSite(CallSite CS, BlockFrequencyInfo *CallerBFI);		bool isColdCallSite(CallSite CS, BlockFrequencyInfo *CallerBFI);

		/// Return a higher threshold if \p CS is a hot callsite.
		Optional<int> getHotCallSiteThreshold(CallSite CS,
		BlockFrequencyInfo *CallerBFI);

// Custom analysis routines.		// Custom analysis routines.
bool analyzeBlock(BasicBlock BB, SmallPtrSetImpl<const Value > &EphValues);		bool analyzeBlock(BasicBlock BB, SmallPtrSetImpl<const Value > &EphValues);

// Disable several entry points to the visitor so we don't accidentally use		// Disable several entry points to the visitor so we don't accidentally use
// them by declaring but not defining them here.		// them by declaring but not defining them here.
void visit(Module *);		void visit(Module *);
void visit(Module &);		void visit(Module &);
void visit(Function *);		void visit(Function *);
▲ Show 20 Lines • Show All 462 Lines • ▼ Show 20 Lines	bool CallAnalyzer::isColdCallSite(CallSite CS, BlockFrequencyInfo *CallerBFI) {
const BranchProbability ColdProb(ColdCallSiteRelFreq, 100);		const BranchProbability ColdProb(ColdCallSiteRelFreq, 100);
auto CallSiteBB = CS.getInstruction()->getParent();		auto CallSiteBB = CS.getInstruction()->getParent();
auto CallSiteFreq = CallerBFI->getBlockFreq(CallSiteBB);		auto CallSiteFreq = CallerBFI->getBlockFreq(CallSiteBB);
auto CallerEntryFreq =		auto CallerEntryFreq =
CallerBFI->getBlockFreq(&(CS.getCaller()->getEntryBlock()));		CallerBFI->getBlockFreq(&(CS.getCaller()->getEntryBlock()));
return CallSiteFreq < CallerEntryFreq * ColdProb;		return CallSiteFreq < CallerEntryFreq * ColdProb;
}		}

		Optional<int>
		CallAnalyzer::getHotCallSiteThreshold(CallSite CS,
		BlockFrequencyInfo *CallerBFI) {
		// If global profile summary is available, then callsite's hotness is
		// determined based on that.
		int HotCallSiteThreshold = Params.HotCallSiteThreshold.getValue();
		if (PSI->hasProfileSummary() && PSI->isHotCallSite(CS, CallerBFI))
		return HotCallSiteThreshold;
		if (!CallerBFI)
		return None;

		HotCallSiteThreshold = Params.LocallyHotCallSiteThreshold.getValue();
		// In the absence of global profile summary, determine if the callsite is hot
		// relative to caller's entry. We could potentially cache the computation of
		// scaled entry frequency, but the added complexity is not worth it unless
		// this scaling shows up high in the profiles.
		auto CallSiteBB = CS.getInstruction()->getParent();
		auto CallSiteFreq = CallerBFI->getBlockFreq(CallSiteBB).getFrequency();
		auto CallerEntryFreq = CallerBFI->getEntryFreq();
		if (CallSiteFreq >= CallerEntryFreq * HotCallSiteRelFreq)
		return HotCallSiteThreshold;
		return None;
		}

void CallAnalyzer::updateThreshold(CallSite CS, Function &Callee) {		void CallAnalyzer::updateThreshold(CallSite CS, Function &Callee) {
// If no size growth is allowed for this inlining, set Threshold to 0.		// If no size growth is allowed for this inlining, set Threshold to 0.
if (!allowSizeGrowth(CS)) {		if (!allowSizeGrowth(CS)) {
Threshold = 0;		Threshold = 0;
return;		return;
}		}

Function *Caller = CS.getCaller();		Function *Caller = CS.getCaller();
▲ Show 20 Lines • Show All 64 Lines • ▼ Show 20 Lines	if (PSI) {
// by checking only the callsite hotness/coldness. The check for CallerBFI		// by checking only the callsite hotness/coldness. The check for CallerBFI
// exists only because we do not have BFI available with the old PM.		// exists only because we do not have BFI available with the old PM.
//		//
// Use callee's hotness information only if we have no way of determining		// Use callee's hotness information only if we have no way of determining
// callsite's hotness information. Callsite hotness can be determined if		// callsite's hotness information. Callsite hotness can be determined if
// sample profile is used (which adds hotness metadata to calls) or if		// sample profile is used (which adds hotness metadata to calls) or if
// caller's BlockFrequencyInfo is available.		// caller's BlockFrequencyInfo is available.
if (CallerBFI \|\| PSI->hasSampleProfile()) {		if (CallerBFI \|\| PSI->hasSampleProfile()) {
if (PSI->isHotCallSite(CS, CallerBFI)) {		auto HotCallSiteThreshold = getHotCallSiteThreshold(CS, CallerBFI);
		if (HotCallSiteThreshold) {
DEBUG(dbgs() << "Hot callsite.\n");		DEBUG(dbgs() << "Hot callsite.\n");
Threshold = Params.HotCallSiteThreshold.getValue();		Threshold = MaxIfValid(Threshold, HotCallSiteThreshold.getValue());
		chandlercUnsubmitted Not Done Reply Inline Actions I think we need to avoid this for optForSize() functions. chandlerc: I think we need to avoid this for optForSize() functions.
} else if (isColdCallSite(CS, CallerBFI)) {		} else if (isColdCallSite(CS, CallerBFI)) {
DEBUG(dbgs() << "Cold callsite.\n");		DEBUG(dbgs() << "Cold callsite.\n");
// Do not apply bonuses for a cold callsite including the		// Do not apply bonuses for a cold callsite including the
// LastCallToStatic bonus. While this bonus might result in code size		// LastCallToStatic bonus. While this bonus might result in code size
// reduction, it can cause the size of a non-cold caller to increase		// reduction, it can cause the size of a non-cold caller to increase
// preventing it from being inlined.		// preventing it from being inlined.
DisallowAllBonuses();		DisallowAllBonuses();
Threshold = MinIfValid(Threshold, Params.ColdCallSiteThreshold);		Threshold = MinIfValid(Threshold, Params.ColdCallSiteThreshold);
▲ Show 20 Lines • Show All 939 Lines • ▼ Show 20 Lines	else
Params.DefaultThreshold = Threshold;		Params.DefaultThreshold = Threshold;

// Set the HintThreshold knob from the -inlinehint-threshold.		// Set the HintThreshold knob from the -inlinehint-threshold.
Params.HintThreshold = HintThreshold;		Params.HintThreshold = HintThreshold;

// Set the HotCallSiteThreshold knob from the -hot-callsite-threshold.		// Set the HotCallSiteThreshold knob from the -hot-callsite-threshold.
Params.HotCallSiteThreshold = HotCallSiteThreshold;		Params.HotCallSiteThreshold = HotCallSiteThreshold;

		// Set the LocallyHotCallSiteThreshold knob from
		// -locally-hot-callsite-threshold.
		Params.LocallyHotCallSiteThreshold = LocallyHotCallSiteThreshold;

// Set the ColdCallSiteThreshold knob from the -inline-cold-callsite-threshold.		// Set the ColdCallSiteThreshold knob from the -inline-cold-callsite-threshold.
Params.ColdCallSiteThreshold = ColdCallSiteThreshold;		Params.ColdCallSiteThreshold = ColdCallSiteThreshold;

// Set the OptMinSizeThreshold and OptSizeThreshold params only if the		// Set the OptMinSizeThreshold and OptSizeThreshold params only if the
// -inlinehint-threshold commandline option is not explicitly given. If that		// -inlinehint-threshold commandline option is not explicitly given. If that
// option is present, then its value applies even for callees with size and		// option is present, then its value applies even for callees with size and
// minsize attributes.		// minsize attributes.
// If the -inline-threshold is not specified, set the ColdThreshold from the		// If the -inline-threshold is not specified, set the ColdThreshold from the
Show All 33 Lines