This is an archive of the discontinued LLVM Phabricator instance.

Switch lowering: use profile info to build weight-balanced binary search trees
ClosedPublic

Authored by hans on Apr 27 2015, 7:17 PM.

Details

Summary

This patch uses profile info to balance the binary search trees used for switch lowering by weight instead of node count, causing hot nodes to appear closer to the root.

I've been using this benchmark:

. It's a 512-case switch. A loop exercises the cases in a random sequence, hitting the first case once, the second twice, and so on, making the 512th case the hottest. I ran it like this:

$ bin/clang -O3 switch_with_profile.c 
$ perf stat -r5 ./a.out                                                                                                                                                                                                               

 Performance counter stats for './a.out' (5 runs):

       2456.382229 task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.63% )
               403 context-switches          #    0.164 K/sec                    ( +-  3.93% )
                12 cpu-migrations            #    0.005 K/sec                    ( +- 32.02% )
               131 page-faults               #    0.053 K/sec                  
     7,961,322,884 cycles                    #    3.241 GHz                      ( +-  0.07% ) [66.68%]
     2,511,064,458 stalled-cycles-frontend   #   31.54% frontend cycles idle     ( +-  0.06% ) [50.05%]
     1,874,963,880 stalled-cycles-backend    #   23.55% backend  cycles idle     ( +-  0.05% ) [50.14%]
    10,579,662,099 instructions              #    1.33  insns per cycle        
                                             #    0.24  stalled cycles per insn  ( +-  0.08% ) [66.80%]
     4,910,929,143 branches                  # 1999.253 M/sec                    ( +-  0.05% ) [66.63%]
        14,888,827 branch-misses             #    0.30% of all branches          ( +-  0.07% ) [66.58%]

       2.460022963 seconds time elapsed                                          ( +-  0.64% )

$ bin/clang -O3 switch_with_profile.c -fprofile-instr-generate
$ LLVM_PROFILE_FILE=profile.raw ./a.out
$ bin/llvm-profdata merge -output=profile.profdata profile.raw
$ bin/clang -O3 switch_with_profile.c -fprofile-instr-use=profile.profdata
$ perf stat -r5 ./a.out

 Performance counter stats for './a.out' (5 runs):

       2156.926282 task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.84% )
               348 context-switches          #    0.161 K/sec                    ( +-  3.07% )
                 7 cpu-migrations            #    0.003 K/sec                    ( +- 39.10% )
               131 page-faults               #    0.061 K/sec                    ( +-  0.15% )
     7,269,065,426 cycles                    #    3.370 GHz                      ( +-  0.05% ) [66.56%]
     2,183,070,540 stalled-cycles-frontend   #   30.03% frontend cycles idle     ( +-  0.27% ) [50.00%]
     1,601,430,175 stalled-cycles-backend    #   22.03% backend  cycles idle     ( +-  0.37% ) [50.21%]
    10,328,446,596 instructions              #    1.42  insns per cycle        
                                             #    0.21  stalled cycles per insn  ( +-  0.08% ) [66.91%]
     4,777,439,719 branches                  # 2214.930 M/sec                    ( +-  0.05% ) [66.75%]
        14,804,172 branch-misses             #    0.31% of all branches          ( +-  0.06% ) [66.59%]

       2.160195042 seconds time elapsed                                          ( +-  0.84% )

That's a 12 % speed-up, which I think is good given that the switch's profile is tricky and not completely dominated by one or a few cases.

Note that this also affects non-profile guided builds. When there is no profile info, each case has the same weight. When clustered together in e.g. a jump table, that cluster will be heavier than a single-case cluster, which affects the tree layout.

Diff Detail

Event Timeline

hans updated this revision to Diff 24527.Apr 27 2015, 7:17 PM
hans retitled this revision from to Switch lowering: use profile info to build weight-balanced binary search trees.
hans updated this object.
hans edited the test plan for this revision. (Show Details)
hans added a reviewer: djasper.
hans added subscribers: Unknown Object (MLST), hansw, dnovillo and 2 others.
asl added a subscriber: asl.Apr 29 2015, 7:35 AM

Looks reasonable to me.

djasper added inline comments.Apr 29 2015, 9:34 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7814

No braces.

7819

At this point Left and Right are identical and the element they are pointing it belongs to the larger/heavier weight, right?

test/CodeGen/Generic/MachineBranchProb.ll
61

Maybe add a comment that this division is because the left half is 1+10+1+1 = 13, the right half is 10+10=20 and we are directly checking against the pivot element so that that one weight doesn't contribute to either side. (If I am correct).

test/CodeGen/X86/switch.ll
471

Could you describe somewhere (patch description, comment) what weight=0 means and how it affects the balancing? From this test, I would assume that we just do balanced trees in this case. However, the test below suggests otherwise.

hans updated this revision to Diff 24658.Apr 29 2015, 2:20 PM

Addressing djasper's comments.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7814

Done.

7819

No, Left == Right - 1 at this point, and LeftWeight and RightWeight are as equal as possible.

For example, if we have cases with weights {3,4,3,5}, the first two belong on the left side (LeftWeight=7, Left=1) and the other two on the right (RightWeight=8, Right=2).

We use Right as the pivot element since we're doing a < comparison.

I'll try to comment this better, and use better names for the variables.

test/CodeGen/Generic/MachineBranchProb.ll
61

The first weight in the metadata is for the default case. The weights on the left and right are: 10+1+1+1=13 vs 10+10=20.

I'll add comments making this easier to read.

The pivot element is part of the right side and does contribute to that weight. (This is different from gcc which I think always does both 'je' and 'jg' on pivot elements, re-using the condition code from the 'cmp'.)

test/CodeGen/X86/switch.ll
471

The problem is that the pivot finding code cannot handle 0-weight nodes as they don't affect the weight-balance. E.g. a tree with weights {10,0,0,0,0,0,0} to the left and {10} on the right would be considered balanced because the sub-trees have equal weight.

Luckily, BranchProbability::getEdgeWeight() will never return 0. I just figured this was important enough to have a test for.

I'll add a comment in the code.

djasper added inline comments.Apr 29 2015, 3:47 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7811

"each other"?

test/CodeGen/Generic/MachineBranchProb.ll
58

This is actually an interesting case. As I understand it, LLVM will now do?

this_case = (x < 40
                 ? x < 20 ? x == 0 ? A : x == 10 ? B : DEFAULT
                          : x == 20 ? C : x == 30 ? D : DEFAULT
                 : x == 40 ? E : x == 50 ? F : DEFAULT);

(I hope it is somewhat readable what I mean, case A represents 0, case B 10,... - I just didn't want to re-use the numbers as that would be even more confusing).

Now, I hope I counted correctly, this gives me the following numbers.

No of comparisons:
A:    3 (Weight: 10, Weighted: 30)
B:    4 (Weight:  1, Weighted:  4)
C:    3 (Weight:  1, Weighted:  3)
D:    4 (Weight:  1, Weighted:  4)
E:    2 (Weight: 10, Weighted: 20)
F:    3 (Weight: 10, Weighted: 30)
---
SUM: 19 (            Weighted: 91)

However, if we split this equally, we get to do a linear scan on both sides:

this_case = (x < 30 ? x == 0 ? A : x == 10 ? B : x == 20 ? C : DEFAULT
                    : x == 40 ? E : x == 50 ? F : x == 30 ? D : DEFAULT);

No of comparisons:
A:    2 (Weight: 10, Weighted: 20)
B:    3 (Weight:  1, Weighted:  3)
C:    4 (Weight:  1, Weighted:  4)
D:    4 (Weight:  1, Weighted:  4)
E:    2 (Weight: 10, Weighted: 20)
F:    3 (Weight: 10, Weighted: 30)
---
SUM: 18 (            Weighted: 81)

Which seems beneficial in total. I think the rule might be something like: Do never create a split with less than three elements on one side unless the smallest weight on that side is larger than all the weights on the other side. But that might not be sufficient.

61

Right. Why doesn't LLVM do that? Intuitively, it sounds like a good idea. (But that's for a follow-up patch if that).

70

I think this is Case 0, 10, 20

test/CodeGen/X86/switch.ll
471

But what's the added benefit over the test below?

djasper accepted this revision.Apr 29 2015, 3:58 PM
djasper edited edge metadata.

Just to clarify. I think this is definitely an excellent step and good to go in independent of whether my analysis is correct or not :-).

So, ship it :-).

This revision is now accepted and ready to land.Apr 29 2015, 3:58 PM
hans added a comment.Apr 29 2015, 6:00 PM

Just to clarify. I think this is definitely an excellent step and good to go in independent of whether my analysis is correct or not :-).

So, ship it :-).

Thanks very much for the review!

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7811

Done.

test/CodeGen/Generic/MachineBranchProb.ll
58

It's going to be

this_case = (x < 40
                 ? x < 30 ? x == 0 ? A : default
                          : x == 10 ? C : x == 20 ? C : x == 30 ? D : DEFAULT
                 : x == 40 ? E : x == 50 ? F : DEFAULT);

But the numbers add upp the same way.

This is interesting. It's something about our leaves being different than in a typical binary tree in that we can do up to three comparisons in them.

Like you, I'm also thinking we should probably be careful creating a split with less than three on one side + some conditions. But this will require some thinking.

As discussed on the IRC channel, for switches where some cases are completely dominating, hoisting them out is probably a good idea. What I've been doing when playing with benchmarks is turning:

switch (x) {
  cases ...
}

into

switch (x) {
  hot_case 1:
  hot_case 2:
  ...
}
switch (x) {
  cold_cases here
}

And that's been really beneficial for switches where a few cases dominate.

I like the current approach for handling switches in general though, and I think the hoisting would be a separate thing that takes place before.

61

It's not a given win. It reduces the number of cmps, but introduces more branches. It probably varies by architecture how fast this "three-way branch" is.

For the benchmark attached in this review, Clang's -O3 (without profile info) was faster than gcc's -O3, maybe because of this.

70

Oops. Fixed.

test/CodeGen/X86/switch.ll
471

Oh, the test below shouldn't have 0-weights. I'll remove that.

This revision was automatically updated to reflect the committed changes.
hans added inline comments.Jun 20 2015, 10:23 AM
test/CodeGen/Generic/MachineBranchProb.ll
58

Coming back to this example where weight-balancing the tree actually makes the example worse, since it would have been better to have 3 elements in the leaf node instead of just one. I committed r240224 to address this. It's not perfect, but I think it helps a lot.