This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Enable the new switch cost heuristic
ClosedPublic

Authored by junbuml on Apr 28 2017, 12:07 PM.

Details

Reviewers

hans
chandlerc
eraman
haicheng
mcrosier
bmakam
eastig
ddibyend
echristo

Commits

rG2960d41e68f8: [InlineCost] Enable the new switch cost heuristic
rL304594: [InlineCost] Enable the new switch cost heuristic

Summary

This enables the new switch inline cost heuristic (r301649) by removing the
old heuristic as well as the flag itself.
In my experiments with the LLVM test suite and spec2000/2006, a +17.82% performance
improvement and an 8% code size reduction were observed in spec2000/vortex with O3 LTO on AArch64.
No significant code size or performance regression was found at O3/O2/Os. No
significant complaint was reported on the llvm-dev thread.

Diff Detail

Repository: rL LLVM

Event Timeline

junbuml created this revision. Apr 28 2017, 12:07 PM

Herald added subscribers: mehdi_amini, rengolin, aemerson. Apr 28 2017, 12:07 PM

haicheng added inline comments. Apr 28 2017, 12:10 PM

lib/Analysis/InlineCost.cpp
1067–1084 ↗	(On Diff #97137)	Do we still need this part after enabling your flag?

junbuml added inline comments. Apr 28 2017, 12:14 PM

lib/Analysis/InlineCost.cpp
1067–1084 ↗	(On Diff #97137)	I don't think this is needed if we enable it. However, I keep this in this patch to let other people compare with/without this heuristic.

Having a flag is good for people that want to experiment with turning it on. It's only been a day or so with the heuristic available, maybe we should allow a bit more for those who want to experiment before turning it on by default?

Since this patch enables the new heuristic by default, I don't think having the flag and old code-path makes sense anymore.

I will keep the flag for a while in this patch to allow people to turn it on/off, but the flag and old code will be removed when committing it if it's generally acceptable.

Is there anything going on here? Should we do this?

I'm planning on enabling this heuristic by default if no significant regression was reported. Let me ping one more time to llvm.dev. Can any reviewer here please take a chance to test this? Or let other people take a chance to test it?

Through llvm.dev, Dibyendu reported that there were no clear positive or negative impact on his environment (x86).
I also didn't see any obvious regression in my experiment in aarch64 for llvm-test suite, spec2000 and spec2006 with O2, Os, and O3, but significant performance and size improvement in spec2000/vortex with lto; please see the detail in https://reviews.llvm.org/D31085.
We also expect to hear from Evgeny Astigeevich; I guess several different configurations in ARM.

Please let me know if there is any further test/experiment we need to do to enable this heuristic.

junbuml added reviewers: eastig, ddibyend. May 24 2017, 11:53 AM

FWIW I think the flag "inline-generic-switch-cost" is both misleading (it takes true and not a number) and not helpful (the name of the flag doesn't tell me anything). Changing that would be nice. I'll let others determine whether or not we should turn it on now or not, the patch itself should be fine though.

Initially, the flag (-inline-generic-switch-cost) was intended to allow others to turn it on/off temporarily. I will remove both the flag and the old heuristic code when we enable this heuristic by default.

Kindly ping. Please let us know if any of you have any experiment result.

Hans/Chandler,
Is there any experiment result you expect to see to enable this new heuristic ?

On May 9, you wrote:

I'm planning on enabling this heuristic by default if no significant regression was reported. Let me ping one more time to llvm.dev. Can any reviewer here please take a chance to test this? Or let other people take a chance to test it?

As there were no complaints on the llvm-dev thread, from my point of view, I think you should go ahead and enable this. If it causes regressions for anyone, they will let you know and we can potentially revert it like any other change.

Thanks Hans for the comments. Now, I removed the old heuristic and the flag in this change.

Herald added subscribers: kristof.beyls, javed.absar. Jun 2 2017, 10:04 AM

junbuml edited the summary of this revision. Jun 2 2017, 10:05 AM

echristo accepted this revision. Jun 2 2017, 10:13 AM

This revision is now accepted and ready to land. Jun 2 2017, 10:13 AM

Closed by commit rL304594: [InlineCost] Enable the new switch cost heuristic (authored by junbuml). Jun 2 2017, 1:43 PM

This revision was automatically updated to reflect the committed changes.

Hello. Sorry to bring up ancient history.

llvm/trunk/lib/Analysis/InlineCost.cpp
1072–1074	Should these be uint64's? If Cost is negative (perhaps because of LastCallToStaticBonus), it can take the INT_MAX option.

junbuml added inline comments. Jun 15 2017, 8:44 AM

llvm/trunk/lib/Analysis/InlineCost.cpp
1072–1074	Yes, it looks like you are correct. If Cost is a large negative value and Cost + SwitchCost is still negative, Cost could become INT_MAX, which is wrong. We should use int64_t instead of uint64_t:

    int64_t ExpectedNumberOfCompare = 3 * (int64_t)NumCaseCluster / 2 - 1;
    int64_t SwitchCost = ExpectedNumberOfCompare * 2 * InlineConstants::InstrCost;
    Cost = std::min((int64_t)INT_MAX, SwitchCost + Cost);

Do you by chance have any case that incorrectly takes INT_MAX?

The case I was looking at was a bit of an LTO mess I'm afraid. And from what I can tell, it tends not to inline any less, it just does it in a different order. Some code like the following will print out a bad cost, and choose not to inline test1 into test2, but then does inline it into test3 as it inlines test2.

static int test1(int i)
{
    return i;
}

static int test2(int t)
{
    int s = test1(t);
    switch(s) {
        case 1: return 0;
        case 0: return 0;
        case 42: return 2;
        case 43: return 3;
        default: return 1;
    }
}

int test3(int t)
{
    return test2(t);
}

Debug:

Inliner visiting SCC: test2: 1 call sites.
...
    NOT Inlining:   %call = call fastcc i32 @test1(i32 %t) Cost = -15035, outer Cost = -2147483644
...
Inliner visiting SCC: test3: 1 call sites.
...
    Inlining: cost=-2147483644, thres=250, Call:   %call = call fastcc i32 @test2(i32 %t)
    -> Deleting dead function: test2
...
    Inlining: cost=-15035, thres=375, Call:   %call.i = call fastcc i32 @test1(i32 %t) #2
    -> Deleting dead function: test1
...

Revision Contents

Path	Size
llvm/trunk/lib/Analysis/InlineCost.cpp	132 lines
llvm/trunk/test/Transforms/Inline/AArch64/switch.ll	4 lines

Diff 101271

llvm/trunk/lib/Analysis/InlineCost.cpp

[48 lines not shown]
 static cl::opt<int> HintThreshold(
     "inlinehint-threshold", cl::Hidden, cl::init(325),
     cl::desc("Threshold for inlining functions with inline hint"));

 static cl::opt<int>
     ColdCallSiteThreshold("inline-cold-callsite-threshold", cl::Hidden,
                           cl::init(45),
                           cl::desc("Threshold for inlining cold callsites"));

-static cl::opt<bool>
-    EnableGenericSwitchCost("inline-generic-switch-cost", cl::Hidden,
-                            cl::init(false),
-                            cl::desc("Enable generic switch cost model"));
-
 // We introduce this threshold to help performance of instrumentation based
 // PGO before we actually hook up inliner with analysis passes such as BPI and
 // BFI.
 static cl::opt<int> ColdThreshold(
     "inlinecold-threshold", cl::Hidden, cl::init(45),
     cl::desc("Threshold for inlining functions with cold attribute"));

 static cl::opt<int>
[940 lines not shown]
 bool CallAnalyzer::visitSwitchInst(SwitchInst &SI) {
   // We model unconditional switches as free, see the comments on handling
   // branches.
   if (isa<ConstantInt>(SI.getCondition()))
     return true;
   if (Value *V = SimplifiedValues.lookup(SI.getCondition()))
     if (isa<ConstantInt>(V))
       return true;

-  if (EnableGenericSwitchCost) {
   // Assume the most general case where the swith is lowered into
   // either a jump table, bit test, or a balanced binary tree consisting of
   // case clusters without merging adjacent clusters with the same
   // destination. We do not consider the switches that are lowered with a mix
   // of jump table/bit test/binary search tree. The cost of the switch is
   // proportional to the size of the tree or the size of jump table range.
+  //
+  // NB: We convert large switches which are just used to initialize large phi
+  // nodes to lookup tables instead in simplify-cfg, so this shouldn't prevent
+  // inlining those. It will prevent inlining in cases where the optimization
+  // does not (yet) fire.

   // Exit early for a large switch, assuming one case needs at least one
   // instruction.
   // FIXME: This is not true for a bit test, but ignore such case for now to
   // save compile-time.
   int64_t CostLowerBound =
       std::min((int64_t)INT_MAX,
                (int64_t)SI.getNumCases() * InlineConstants::InstrCost + Cost);

   if (CostLowerBound > Threshold) {
     Cost = CostLowerBound;
     return false;
   }

   unsigned JumpTableSize = 0;
   unsigned NumCaseCluster =
       TTI.getEstimatedNumberOfCaseClusters(SI, JumpTableSize);

   // If suitable for a jump table, consider the cost for the table size and
   // branch to destination.
   if (JumpTableSize) {
     int64_t JTCost = (int64_t)JumpTableSize * InlineConstants::InstrCost +
                      4 * InlineConstants::InstrCost;
     Cost = std::min((int64_t)INT_MAX, JTCost + Cost);
     return false;
   }

   // Considering forming a binary search, we should find the number of nodes
   // which is same as the number of comparisons when lowered. For a given
   // number of clusters, n, we can define a recursive function, f(n), to find
   // the number of nodes in the tree. The recursion is :
   // f(n) = 1 + f(n/2) + f (n - n/2), when n > 3,
   // and f(n) = n, when n <= 3.
   // This will lead a binary tree where the leaf should be either f(2) or f(3)
   // when n > 3. So, the number of comparisons from leaves should be n, while
   // the number of non-leaf should be :
   //   2^(log2(n) - 1) - 1
   //   = 2^log2(n) * 2^-1 - 1
   //   = n / 2 - 1.
   // Considering comparisons from leaf and non-leaf nodes, we can estimate the
   // number of comparisons in a simple closed form :
   //   n + n / 2 - 1 = n * 3 / 2 - 1
   if (NumCaseCluster <= 3) {
     // Suppose a comparison includes one compare and one conditional branch.
     Cost += NumCaseCluster * 2 * InlineConstants::InstrCost;
     return false;
   }
   int64_t ExpectedNumberOfCompare = 3 * (uint64_t)NumCaseCluster / 2 - 1;
   uint64_t SwitchCost =
       ExpectedNumberOfCompare * 2 * InlineConstants::InstrCost;
   Cost = std::min((uint64_t)INT_MAX, SwitchCost + Cost);
   return false;
 }

-  // Use a simple switch cost model where we accumulate a cost proportional to
-  // the number of distinct successor blocks. This fan-out in the CFG cannot
-  // be represented for free even if we can represent the core switch as a
-  // jumptable that takes a single instruction.
-  ///
-  // NB: We convert large switches which are just used to initialize large phi
-  // nodes to lookup tables instead in simplify-cfg, so this shouldn't prevent
-  // inlining those. It will prevent inlining in cases where the optimization
-  // does not (yet) fire.
-  SmallPtrSet<BasicBlock *, 8> SuccessorBlocks;
-  SuccessorBlocks.insert(SI.getDefaultDest());
-  for (auto Case : SI.cases())
-    SuccessorBlocks.insert(Case.getCaseSuccessor());
-  // Add cost corresponding to the number of distinct destinations. The first
-  // we model as free because of fallthrough.
-  Cost += (SuccessorBlocks.size() - 1) * InlineConstants::InstrCost;
-  return false;
-}
-
 bool CallAnalyzer::visitIndirectBrInst(IndirectBrInst &IBI) {
   // We never want to inline functions that contain an indirectbr. This is
   // incorrect because all the blockaddress's (in static global initializers
   // for example) would be referring to the original function, and this
   // indirect jump would jump from the inlined copy of the function into the
   // original function which is extremely undefined behavior.
   // FIXME: This logic isn't really right; we can safely inline functions with
   // indirectbr's as long as no other function or global references the
[575 lines not shown]

llvm/trunk/test/Transforms/Inline/AArch64/switch.ll

-; RUN: opt < %s -inline -inline-threshold=20 -S -mtriple=aarch64-none-linux -inline-generic-switch-cost=true | FileCheck %s
-; RUN: opt < %s -passes='cgscc(inline)' -inline-threshold=20 -S -mtriple=aarch64-none-linux -inline-generic-switch-cost=true | FileCheck %s
+; RUN: opt < %s -inline -inline-threshold=20 -S -mtriple=aarch64-none-linux | FileCheck %s
+; RUN: opt < %s -passes='cgscc(inline)' -inline-threshold=20 -S -mtriple=aarch64-none-linux | FileCheck %s

 define i32 @callee_range(i32 %a, i32* %P) {
   switch i32 %a, label %sw.default [
     i32 0, label %sw.bb0
     i32 1000, label %sw.bb1
     i32 2000, label %sw.bb1
     i32 3000, label %sw.bb1
     i32 4000, label %sw.bb1
[113 lines not shown]