This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Increase the cost of Switch
AbandonedPublic

Authored by haicheng on Feb 11 2017, 10:13 PM.

Details

Summary

The motivation example is like below which has 13 cases but only 2 distinct targets

lor.lhs.false2:                                   ; preds = %if.then
  switch i32 %Status, label %if.then27 [
    i32 -7012, label %if.end35
    i32 -10008, label %if.end35
    i32 -10016, label %if.end35
    i32 15000, label %if.end35
    i32 14013, label %if.end35
    i32 10114, label %if.end35
    i32 10107, label %if.end35
    i32 10105, label %if.end35
    i32 10013, label %if.end35
    i32 10011, label %if.end35
    i32 7008, label %if.end35
    i32 7007, label %if.end35
    i32 5002, label %if.end35
  ]

which is compiled into a balanced binary tree like this on AArch64 (similar on X86)

.LBB853_9:                              // %lor.lhs.false2
        mov     w8, #10012
        cmp             w19, w8
        b.gt    .LBB853_14
// BB#10:                               // %lor.lhs.false2
        mov     w8, #5001
        cmp             w19, w8
        b.gt    .LBB853_18
// BB#11:                               // %lor.lhs.false2
        mov     w8, #-10016
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#12:                               // %lor.lhs.false2
        mov     w8, #-10008
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#13:                               // %lor.lhs.false2
        mov     w8, #-7012
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_14:                             // %lor.lhs.false2
        mov     w8, #14012
        cmp             w19, w8
        b.gt    .LBB853_21
// BB#15:                               // %lor.lhs.false2
        mov     w8, #-10105
        add             w8, w19, w8
        cmp             w8, #9          // =9
        b.hi    .LBB853_17
// BB#16:                               // %lor.lhs.false2
        orr     w9, wzr, #0x1
        lsl     w8, w9, w8
        mov     w9, #517
        and             w8, w8, w9
        cbnz    w8, .LBB853_23
.LBB853_17:                             // %lor.lhs.false2
        mov     w8, #10013
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_18:                             // %lor.lhs.false2
        mov     w8, #-7007
        add             w8, w19, w8
        cmp             w8, #2          // =2
        b.lo    .LBB853_23
// BB#19:                               // %lor.lhs.false2
        mov     w8, #5002
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#20:                               // %lor.lhs.false2
        mov     w8, #10011
        cmp             w19, w8
        b.eq    .LBB853_23
        b       .LBB853_3
.LBB853_21:                             // %lor.lhs.false2
        mov     w8, #14013
        cmp             w19, w8
        b.eq    .LBB853_23
// BB#22:                               // %lor.lhs.false2
        mov     w8, #15000
        cmp             w19, w8
        b.ne    .LBB853_3

However, the inline cost model estimates the cost to be linear with the number of distinct targets and the cost of the above switch is just 2 InstrCosts. The function containing this switch is then inlined about 900 times.

This change modifies the model to be linear with the size of the balanced binary tree.

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng created this revision.Feb 11 2017, 10:13 PM
chandlerc edited edge metadata.Feb 11 2017, 11:09 PM

The basic reasoning makes sense to me when we do the binary search lowering. However, while you talk about simplify-cfg handling the lookup table case, this code should also respond well to the case where it will be lowered as a jump table.

Beyond that, this needs to be accompanied by basic code size and runtime benchmark numbers for the LLVM test suite. It would be nice to also check SPEC.

If you can't benchmark enough platforms, it may make sense to ask other LLVM users to benchmark this (either in patch form or by changing the patch to put this behind a flag at first) so that we can collect their data.

haicheng added a comment.EditedFeb 12 2017, 12:48 PM

Here is the data I have for the current implementation collected from AArch64. Only benchmarks impacted are listed.

SPEC2000

Performance (+ is better)Code Size (- is better)
vortex+8.22%-7.63%
perlbmk+2.39%0.00%
twolf+1.46%0.00%
crafty+1.42%0.00%
gcc-0.52%+0.01%
mesa-0.92%0.00%

SPEC2006

Performance (+ is better)Code Size (- is better)
povray+1.09%0.00%
soplex+0.75%0.00%
xalancbmk+0.41%0.00%
hmmer+0.09%-1.21%
dealII+0.06%0.00%
omnetpp0.00%0.00%
sjeng-0.08%0.00%
h264ref-0.10%0.00%
gcc-0.20%-0.70%
perlbench-0.31%0.00%

LLVM Test suite

Performance (+ is better)Code Size (- is better)
kc+3.00%0.00%
sqlite3+2.80%0.00%
siod+1.44%0.00%
consumer-typeset+0.85%0.00%
SIBsim4+0.73%0.00%
lua+0.28%0.00%
automotive-susan0.00%0.00%
consumer-jpeg0.00%0.00%
cjpeg0.00%0.00%
7zip-0.20%0.00%
lencod-0.25%0.00%

It seems give switch bigger penalty is a good thing.

junbuml added inline comments.Feb 13 2017, 9:32 AM
lib/Analysis/InlineCost.cpp
1031

I wonder if this assumption is reasonable enough. If I remember correctly, a switch could also end up with a jump table or mix of jump table and BTree depending on the number of case, comparison value, etc.

chandlerc added inline comments.Feb 13 2017, 9:39 AM
lib/Analysis/InlineCost.cpp
1031

Yes, see my top level comment -- I think something more detailed than this will be necessary in order to model the usage of jump tables.

I wonder if we should just expose a TTI hook that can query the same logic that lowering actually uses to decide this....

junbuml edited edge metadata.Mar 17 2017, 10:15 AM

Since Haicheng is in paternity leave. I took over this patch and created a new patch in D31085. In this new change (D31085), I used a target hook to get the number of case cluster based on D31080.

haicheng abandoned this revision.Jun 6 2017, 8:29 AM

Jun took over the patch and submitted it in r304594.