This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Increase the cost of Switch
Abandoned, Public

Authored by haicheng on Feb 11 2017, 10:13 PM.

Details

Reviewers

bmakam
chandlerc
eraman
junbuml
mcrosier

Summary

The motivating example below has 13 cases but only 2 distinct targets:

lor.lhs.false2:                                   ; preds = %if.then
  switch i32 %Status, label %if.then27 [
    i32 -7012, label %if.end35
    i32 -10008, label %if.end35
    i32 -10016, label %if.end35
    i32 15000, label %if.end35
    i32 14013, label %if.end35
    i32 10114, label %if.end35
    i32 10107, label %if.end35
    i32 10105, label %if.end35
    i32 10013, label %if.end35
    i32 10011, label %if.end35
    i32 7008, label %if.end35
    i32 7007, label %if.end35
    i32 5002, label %if.end35
  ]

It is compiled into a balanced binary tree like this on AArch64 (X86 is similar):

.LBB853_9:                              // %lor.lhs.false2
        mov     w8, #10012
        cmp             w19, w8
        b.gt    .LBB853_14
// BB#10:                               // %lor.lhs.false2
        mov     w8, #5001
        cmp             w19, w8
        b.gt    .LBB853_18
// BB#11:                               // %lor.lhs.false2
        mov     w8, #-10016
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#12:                               // %lor.lhs.false2
        mov     w8, #-10008
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#13:                               // %lor.lhs.false2
        mov     w8, #-7012
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_14:                             // %lor.lhs.false2
        mov     w8, #14012
        cmp             w19, w8
        b.gt    .LBB853_21
// BB#15:                               // %lor.lhs.false2
        mov     w8, #-10105
        add             w8, w19, w8
        cmp             w8, #9          // =9
        b.hi    .LBB853_17
// BB#16:                               // %lor.lhs.false2
        orr     w9, wzr, #0x1
        lsl     w8, w9, w8
        mov     w9, #517
        and             w8, w8, w9
        cbnz    w8, .LBB853_23
.LBB853_17:                             // %lor.lhs.false2
        mov     w8, #10013
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_18:                             // %lor.lhs.false2
        mov     w8, #-7007
        add             w8, w19, w8
        cmp             w8, #2          // =2
        b.lo    .LBB853_23
// BB#19:                               // %lor.lhs.false2
        mov     w8, #5002
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#20:                               // %lor.lhs.false2
        mov     w8, #10011
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_21:                             // %lor.lhs.false2
        mov     w8, #14013
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#22:                               // %lor.lhs.false2
        mov     w8, #15000
        cmp             w19, w8
        b.ne    .LBB853_3

However, the inline cost model estimates the cost to be linear in the number of distinct targets, so the cost of the above switch is just 2 InstrCosts. As a result, the function containing this switch is inlined about 900 times.

This change modifies the model to be linear in the size of the balanced binary tree.
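
To make the difference concrete, here is a minimal sketch of both estimates as free functions. It is not the LLVM code: `oldSwitchCost`, `treeInstrs` and `newSwitchCost` are hypothetical names, `InstrCost = 5` is assumed to match `InlineConstants::InstrCost`, recursion stands in for the worklist used in the patch, and the early exit against the threshold is left out. The sketch only covers what `visitSwitchInst` itself adds; the "2 InstrCosts" quoted above presumably also includes the generic per-instruction charge, which the patch's initial `Cost -= InlineConstants::InstrCost` appears to offset.

```cpp
// Assumed to match InlineConstants::InstrCost; illustrative only.
constexpr int InstrCost = 5;

// Old model: one InstrCost per distinct successor block, with the first
// successor treated as free (fallthrough).
constexpr int oldSwitchCost(unsigned numDistinctSuccessors) {
  return (static_cast<int>(numDistinctSuccessors) - 1) * InstrCost;
}

// Instructions charged for a balanced binary tree over numCases cases:
// a leaf of up to 3 cases costs a compare and a branch per case, and each
// split adds one more compare and branch.
constexpr int treeInstrs(unsigned numCases) {
  return numCases <= 3
             ? 2 * static_cast<int>(numCases)
             : 2 + treeInstrs(numCases / 2) +
                   treeInstrs(numCases - numCases / 2);
}

// New model: tree size minus one instruction, mirroring the initial
// `Cost -= InlineConstants::InstrCost` in the patch.
constexpr int newSwitchCost(unsigned numCases) {
  return (treeInstrs(numCases) - 1) * InstrCost;
}
```

For the 13-case switch above, `oldSwitchCost(2)` evaluates to 5 (one InstrCost), while `newSwitchCost(13)` evaluates to 165 (33 InstrCosts).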

Diff Detail

Repository: rL LLVM

Event Timeline

haicheng created this revision. Feb 11 2017, 10:13 PM

Herald added a subscriber: aemerson. Feb 11 2017, 10:13 PM

The basic reasoning makes sense to me when we do the binary search lowering. However, while you talk about simplify-cfg handling the lookup table case, this code should also respond well to the case where it will be lowered as a jump table.

Beyond that, this needs to be accompanied by basic code size and runtime benchmark numbers for the LLVM test suite. It would be nice to also check SPEC.

If you can't benchmark enough platforms, it may make sense to ask other LLVM users to benchmark this (either in patch form or by changing the patch to put this behind a flag at first) so that we can collect their data.

Here is the data I have for the current implementation collected from AArch64. Only benchmarks impacted are listed.

SPEC2000

Benchmark	Performance (+ is better)	Code Size (- is better)
vortex	+8.22%	-7.63%
perlbmk	+2.39%	0.00%
twolf	+1.46%	0.00%
crafty	+1.42%	0.00%
gcc	-0.52%	+0.01%
mesa	-0.92%	0.00%

SPEC2006

Benchmark	Performance (+ is better)	Code Size (- is better)
povray	+1.09%	0.00%
soplex	+0.75%	0.00%
xalancbmk	+0.41%	0.00%
hmmer	+0.09%	-1.21%
dealII	+0.06%	0.00%
omnetpp	0.00%	0.00%
sjeng	-0.08%	0.00%
h264ref	-0.10%	0.00%
gcc	-0.20%	-0.70%
perlbench	-0.31%	0.00%

LLVM Test suite

Benchmark	Performance (+ is better)	Code Size (- is better)
kc	+3.00%	0.00%
sqlite3	+2.80%	0.00%
siod	+1.44%	0.00%
consumer-typeset	+0.85%	0.00%
SIBsim4	+0.73%	0.00%
lua	+0.28%	0.00%
automotive-susan	0.00%	0.00%
consumer-jpeg	0.00%	0.00%
cjpeg	0.00%	0.00%
7zip	-0.20%	0.00%
lencod	-0.25%	0.00%

It seems that giving switches a bigger penalty is a good thing.

junbuml added inline comments. Feb 13 2017, 9:32 AM

lib/Analysis/InlineCost.cpp
Line 1031: I wonder if this assumption is reasonable enough. If I remember correctly, a switch could also end up as a jump table or as a mix of jump table and binary tree, depending on the number of cases, the comparison values, etc.
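
The AArch64 listing in the summary already shows lowering that is not a pure compare-and-branch tree: block BB#16 appears to test three case values (10105, 10107, 10114) with one shift-and-mask against the constant 517, and block .LBB853_18 checks 7007 and 7008 with a single range compare. Below is a minimal C++ sketch of what those two sequences compute. The function names are hypothetical, and the code is only a reading of the assembly, not LLVM's lowering logic.

```cpp
#include <cstdint>

// Hypothetical helper mirroring BB#15/BB#16: subtract 10105, reject offsets
// above 9, then test bit `offset` of the mask 517 (bits 0, 2 and 9 set).
constexpr bool hitsBitTestCluster(std::int32_t status) {
  const std::uint32_t offset = static_cast<std::uint32_t>(status) - 10105u;
  return offset <= 9u && ((std::uint32_t{1} << offset) & 517u) != 0;
}

// Hypothetical helper mirroring .LBB853_18: an unsigned compare after
// subtracting 7007 accepts exactly 7007 and 7008.
constexpr bool hitsRangeCluster(std::int32_t status) {
  return static_cast<std::uint32_t>(status) - 7007u < 2u;
}
```

Five of the 13 cases are covered by these two checks, which is the kind of clustering a pure binary-tree estimate does not capture.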

junbuml added a reviewer: bmakam. Feb 13 2017, 9:32 AM

chandlerc added inline comments. Feb 13 2017, 9:39 AM

lib/Analysis/InlineCost.cpp
Line 1031: Yes, see my top-level comment -- I think something more detailed than this will be necessary in order to model the usage of jump tables. I wonder if we should just expose a TTI hook that can query the same logic that lowering actually uses to decide this...

Since Haicheng is on paternity leave, I took over this patch and created a new one, D31085. In the new change (D31085), I use a target hook, based on D31080, to get the number of case clusters.

junbuml mentioned this in D31085: [InlineCost] Increase the cost of Switch. Mar 17 2017, 10:20 AM

junbuml mentioned this in rL301649: [InlineCost] Improve the cost heuristic for Switch. Apr 28 2017, 9:17 AM

Jun took over the patch and submitted it in r304594.

Revision Contents

Path	Size
lib/Analysis/InlineCost.cpp	40 lines
test/Transforms/Inline/switch.ll	34 lines

Diff 88116

lib/Analysis/InlineCost.cpp

 bool CallAnalyzer::visitSwitchInst(SwitchInst &SI) {
   // We model unconditional switches as free, see the comments on handling
   // branches.
   if (isa<ConstantInt>(SI.getCondition()))
     return true;
   if (Value *V = SimplifiedValues.lookup(SI.getCondition()))
     if (isa<ConstantInt>(V))
       return true;

-  // Otherwise, we need to accumulate a cost proportional to the number of
-  // distinct successor blocks. This fan-out in the CFG cannot be represented
-  // for free even if we can represent the core switch as a jumptable that
-  // takes a single instruction.
+  // Otherwise, we assume the most general case where the big switch is lowered
+  // into a balanced binary tree, the probability of entering each case is
+  // equal, and no case can be merged. The cost of the switch is proportional
+  // to the size of the tree. See visitSwitch() of SelectionDAGBuilder.cpp
+  // about the detail of constructing the tree.
   //
   // NB: We convert large switches which are just used to initialize large phi
   // nodes to lookup tables instead in simplify-cfg, so this shouldn't prevent
   // inlining those. It will prevent inlining in cases where the optimization
   // does not (yet) fire.
-  SmallPtrSet<BasicBlock *, 8> SuccessorBlocks;
-  SuccessorBlocks.insert(SI.getDefaultDest());
-  for (auto I = SI.case_begin(), E = SI.case_end(); I != E; ++I)
-    SuccessorBlocks.insert(I.getCaseSuccessor());
-  // Add cost corresponding to the number of distinct destinations. The first
-  // we model as free because of fallthrough.
-  Cost += (SuccessorBlocks.size() - 1) * InlineConstants::InstrCost;
+  SmallVector<unsigned, 4> SwitchWorkList;
+  SwitchWorkList.push_back(SI.getNumCases());
+  Cost -= InlineConstants::InstrCost;
+  while (!SwitchWorkList.empty()) {
+    unsigned NumCases = SwitchWorkList.back();
+    SwitchWorkList.pop_back();
+
+    if (NumCases <= 3)
+      // Do not split the tree if the number of remaining cases is at most 3.
+      // Just compare switch condition with each case value. Suppose each
+      // comparison includes one compare and one conditional branch.
+      Cost += (2 * NumCases * InlineConstants::InstrCost);
+    else {
+      // Split the remaining nodes and add one more comparison.
+      unsigned NumLeft = NumCases / 2;
+      unsigned NumRight = NumCases - NumLeft;
+      SwitchWorkList.push_back(NumLeft);
+      SwitchWorkList.push_back(NumRight);
+      Cost += (2 * InlineConstants::InstrCost);
+    }
+    // Exit early if Cost is already larger than Threshold.
+    if (Cost > Threshold)
+      return false;
+  }
   return false;
 }

 bool CallAnalyzer::visitIndirectBrInst(IndirectBrInst &IBI) {
   // We never want to inline functions that contain an indirectbr. This is
   // incorrect because all the blockaddress's (in static global initializers
   // for example) would be referring to the original function, and this
   // indirect jump would jump from the inlined copy of the function into the

test/Transforms/Inline/switch.ll

 ; RUN: opt < %s -inline -inline-threshold=20 -S | FileCheck %s
 ; RUN: opt < %s -passes='cgscc(inline)' -inline-threshold=20 -S | FileCheck %s

-define i32 @callee(i32 %a) {
+define i32 @callee1(i32 %a) {
   switch i32 %a, label %sw.default [
     i32 0, label %sw.bb0
     i32 1, label %sw.bb1
     i32 2, label %sw.bb2
     i32 3, label %sw.bb3
     i32 4, label %sw.bb4
     i32 5, label %sw.bb5
     i32 6, label %sw.bb6
 [34 lines not shown]

 sw.bb9:
   br label %return

 return:
   ret i32 42
 }

+define i32 @callee2(i32 %a) {
+  switch i32 %a, label %sw.default [
+    i32 0, label %sw.bb0
+    i32 1, label %sw.bb0
+    i32 2, label %sw.bb0
+    i32 3, label %sw.bb0
+    i32 4, label %sw.bb0
+    i32 5, label %sw.bb0
+    i32 6, label %sw.bb0
+    i32 7, label %sw.bb0
+    i32 8, label %sw.bb0
+    i32 9, label %sw.bb0
+  ]
+
+sw.default:
+  br label %return
+
+sw.bb0:
+  br label %return
+
+return:
+  ret i32 42
+}
+
 define i32 @caller(i32 %a) {
 ; CHECK-LABEL: @caller(
-; CHECK: call i32 @callee(
+; CHECK: call i32 @callee1(
+; CHECK: call i32 @callee2(

-  %result = call i32 @callee(i32 %a)
-  ret i32 %result
+  %result1 = call i32 @callee1(i32 %a)
+  %result2 = call i32 @callee2(i32 %a)
+  ret i32 %result1
 }
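
As a rough check of why `-inline-threshold=20` separates the two models for `@callee2` (10 cases, 2 distinct successors), the worklist in the patch can be read as the recurrence below. This is a back-of-the-envelope sketch: it assumes `InlineConstants::InstrCost` is 5, counts only what `visitSwitchInst` adds, and ignores every other cost and bonus the inliner applies.

```latex
\begin{align*}
T(n) &=
  \begin{cases}
    2n, & n \le 3 \\
    2 + T(\lfloor n/2 \rfloor) + T(n - \lfloor n/2 \rfloor), & n > 3
  \end{cases} \\
T(10) &= 2 + 2\,T(5) = 2 + 2\,\bigl(2 + T(2) + T(3)\bigr) = 2 + 2\,(2 + 4 + 6) = 26 \\
\text{new cost} &= \bigl(T(10) - 1\bigr) \cdot 5 = 125 > 20 \\
\text{old cost} &= (2 - 1) \cdot 5 = 5 < 20
\end{align*}
```

Under these assumptions the old model leaves the switch in `@callee2` well under the threshold, while the new model pushes it far over, which is consistent with the added `CHECK` line expecting the call to `@callee2` to remain.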