Page MenuHomePhabricator

Switch lowering: extract jump tables and bit tests before building binary tree (PR22262)
ClosedPublic

Authored by hans on Mar 26 2015, 6:23 PM.

Details

Summary

This is a large patch, and it's not quite ready yet but I'd like to put it up for review now to get some feedback on the approach.

Instead of building the binary search tree and discovering jump tables and bit tests as we go along, this patch extract such parts (CaseClusters) from the switch first (FindJumpTables and FindBitTestsClusters), and then builds a binary tree around those clusters afterwards (work-list driven algorithm in visitSwitch).

This has several benefits. Most importantly, the binary tree will not be skewed because we're trying to find jump tables - it will always be balanced. We could also change how we'd like to balance it, e.g. using profile info to balance on branch frequency, causing hot cases to be closer to the root.

Secondly, decoupling the cluster extraction mechanism from tree building allows us to use other algorithms. For jump table finding, I'm using a dynamic programming approach which is basically Kannan & Proebsting's algorithm referenced in Anton's original switch lowering paper (http://llvm.org/pubs/2007-05-31-Switch-Lowering.pdf, Ref 6). I believe that's optimal, which is cool, but it's also O(n^2) in all cases, whereas the previous algorithm was only O(n^2) in worst case and usually faster. I haven't measured the impact yet, but if it's bad we could cap the search at a certain table size, or use a different approach.

The code to extract bit test clusters uses the same approach. I haven't really thought about this one much, it might be better to do something simpler.

Also, this patch makes us bypass all the fancy lowering logic in -O0 mode, which should make that faster.

I'd be very happy to hear your thoughts on this.

Diff Detail

Repository
rL LLVM

Event Timeline

hans updated this revision to Diff 22774.Mar 26 2015, 6:23 PM
hans retitled this revision from to Switch lowering: extract jump tables and bit tests before building binary tree (PR22262).
hans updated this object.
hans edited the test plan for this revision. (Show Details)
hans added reviewers: djasper, asl.
hans added subscribers: Unknown Object (MLST), hansw.
hans added subscribers: mcrosier, silvas.

cc'ing some more folks who might be interested based on previous switch threads.

Could you give a pseudocode description of the switch lowering process as you've implemented it? Just a little something so that everyone can be on the same page about the high-level organization.

Also, any initial performance numbers to report?

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7182–7187 ↗(On Diff #22774)

Out of curiosity, does this correctly handle jump tables that cross the "overflow discontinuity" for the switched-on integer type? (not sure if we should bother with handling that case)

silvas added inline comments.Mar 27 2015, 6:01 PM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7182–7187 ↗(On Diff #22774)

I also wonder if the sense in which this algorithm is optimal is actually optimal from a performance perspective. Maybe a faster greedy algorithm might do just as well (or better) for our use case (using a bottom-up "tree reduction" (which can be done in a single "tortoise and hare" linear scan); the tree reduction also correctly handles the "overflow discontinuity" since the leaves of the tree can be thought of as being on a circle, with the root at the center), or are there important cases that would be missed?

hans added a comment.Mar 27 2015, 6:33 PM

Could you give a pseudocode description of the switch lowering process as you've implemented it? Just a little something so that everyone can be on the same page about the high-level organization.

Sure, it goes like this:

  1. Starting in visitSwitch(), extract cases into a vector of CaseClusters. Each case is in a "range" cluster of its own to begin with.
  2. Clusterify() merges adjacent cases with the same destination into larger range clusters
  3. FindJumpTables() runs an O(n^2) search on Clusters to find ranges suitable for jump tables; such ranges are replaced with jump-table clusters
  4. FindBItTests() runs a similar algorithm, but for ranges suitable for bit-test lowering. Bit-test lowering is used for ranges that fine in a machine word (e.g. case 0-63) and have a small number of destinations. A bit-mask is built from the case values for each destination, and the switch condition is checked against that.
  5. The Clusters vector is now lowered as a binary tree: a pivot is picked in the middle of the vector, and so on and so forth.

The main difference from the existing code is that finding the jump tables and bit tests is done before building the binary tree.

Also, any initial performance numbers to report?

No, sorry. I'm still working on cleaning up the patch and squashing the bugs, but stay tuned :-)

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7182–7187 ↗(On Diff #22774)

No, this doesn't search for tables wrapping around the edge. I'm not sure that would be worth the effort.

7182–7187 ↗(On Diff #22774)

It's optimal for finding jump tables with a certain density and minimum size. Those parameters can of course be tuned (in particular, when not optimizing for size I suspect we could allow sparser tables). I don't think it's possible to do that with a simple greedy search.

Do you have any pointers to where I can read about something similar to the tree reduction idea you mention?

hans updated this revision to Diff 22850.Mar 28 2015, 10:12 PM

Uploaded a new version which separates the finding and building of jump tables into separate functions, and the same for bit tests, to make the code easier to follow.

Also fix a bug in visitBitTestHeader that was always there but could never trigger before, and in SelectionDAGISel::FinishBasicBlock (which I'm still a little confused by).

The patch now passes bootstrap and all tests. I'll try to provide numbers next week.

djasper edited edge metadata.Mar 30 2015, 6:39 AM

From a first look, this looks fantastic and is exactly what I had planned. Added a first round of comments, but haven't fully reviewed everything yet.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2104 ↗(On Diff #22774)

This assert could use a comment. I presume it checks whether all of the input Clusters contain exactly one case?

2112 ↗(On Diff #22774)

Why isn't this a for loop?

2134 ↗(On Diff #22774)

Don't add empty line.

7129 ↗(On Diff #22774)

I find a bit odd to use percentages here and accept the resulting rounding error. Even using 4 and 10 would lead to better accuracy. Not at all important, I just don't understand it ;-).

7165 ↗(On Diff #22774)

Wouldn't it be even easier to build jump tables before rangifying the cases? Seems like that would make a lot of the specific TotalCases logic unnecessary. Then, you could rangify everything that is left after building jump tables?

7178 ↗(On Diff #22774)

Is this the same behavior as before? If I understand correctly, you now wouldn't turn:

case 1:
case 2:
case 3: return 1;
case 4: return 2;

into a jump table as these are only two clusters. I think before, it might have been counted as 4 values with a density of 50% so it would have been turned into jump table.

Also: The minimum size should be controlled by TLI.getMinimumJumpTableEntries()!

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
138 ↗(On Diff #22774)

Maybe extend the comment to explain that this is also used for trivial 1-element ranges.

147 ↗(On Diff #22774)

I'd probably make this a proper class and make the fields const to show that they aren't mutable.

211 ↗(On Diff #22774)

The name is a bit confusing now. The input are already (trivial) clusters. SortAndRangify maybe?

303 ↗(On Diff #22774)

Functions should start lower-case. Here and everywhere else.

850–851 ↗(On Diff #22774)

Remove.

hans added a comment.Mar 30 2015, 6:13 PM

Some numbers from a Clang bootstrap at r233105.

The tree metric is (tree height)/(log2(nbr cases) + 1). Lower is better.

Without my patch
Nbr of jump tables: 4743
Bit tests: 5520
Tree metric avg/min/max: 0.467942 0.077705 2.182237

With my patch
Jump tables: 4910
Bit tests: 3689
Tree metric avg/min/max: 0.452684 0.077705 0.860977

It seems my patch is successfully finding more jump tables. It's finding a lot less bit tests though, so there might be something wrong there (unless the jump tables and bit tests it's finding are wider.. maybe I need a different metric).

Most importantly, the "tree metric" average is lower, and the max is much lower, which means the trees are more balanced.

I also compared binary size, and with my patch, the bootstrap is 20 kB smaller, presumably due to finding more jump tables.

For compile time, I used gcc-in-a-file.

Without my patch: (only one run, because it was slow and I was lazy)
$ time bin/clang -w -c -O3 /work/gcc.c -o /dev/null
real 2m15.618s
user 2m14.532s
sys 0m0.998s

With my patch:
$ time bin/clang -w -c -O3 /work/gcc.c -o /dev/null
real 2m18.549s
user 2m16.871s
sys 0m1.567s

It seems my patch is a little slower, but not by much.

For a more synthetic benchmark, I generated a file where f() contains a switch with 10k random cases, and main() then calls f() with each case value a thousand times:

I used "perf stat -r10" for the tests below.

Without my patch:
clang -O3: 4.640746669 seconds ( +- 1.63% )
clang -O0: 1.240886991 ( +- 2.25% )
./a.out: 1.103107940 seconds ( +- 1.46% )
Tree metric: 8.608796 (no jump tables or bit tests)

With my patch:
clang -O3: 6.302967137 seconds ( +- 1.42% )
clang -O0: 0.778322722 seconds ( +- 2.59% )
./a.out: 0.729969454 seconds ( +- 2.18% )
Tree metric: 0.909873 (no jump tables or bit tests)

To summarize the above: on this benchmark, my patch is 36% slower in -O3, 38% faster in -O0, and generates a tree that's balanced, resulting in 35% faster code.

Moving forward, I need to take a proper look at the bit test finding code, and I'll experiment with capping the search range for jump tables to get the compile time down.

hans updated this revision to Diff 22923.Mar 30 2015, 6:50 PM
hans edited edge metadata.

Addressing djasper's comments.

Thank you very much for the review so far!

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
2104 ↗(On Diff #22774)

Right. Done.

2112 ↗(On Diff #22774)

I fixed that in the second patch version.

2134 ↗(On Diff #22774)

Done.

7129 ↗(On Diff #22774)

What rounding error do you mean?

It's no big deal, I just find "40%" easier to think about than "4 tenths".

7165 ↗(On Diff #22774)

The only reason (besides that the old code did it this way) is to reduce the amount of work in FindJumpTables, which if O(n^2) in the number of clusters. That might not make a big difference in practice, though..

Initially, I was worried about getting hit with case ranges like this: "case 0 ... 100000", but it turns out Clang won't lower those to switch cases anyway.

7178 ↗(On Diff #22774)

Excellent question, this is a little sketchy.

With the old code, your example would not lowered with a jump table, because handleSmallSwitchRange() would get to it first.

I think counting clusters is the right metric for deciding to build a table (in terms of size -- for density we count actual cases). A range is as cheap to lower as a single case in terms of branches, and it's the number of branches that matters here. It's possible JT-threshold should be 3 instead of 4, though.

Good point about TLI.getMinimumJumpTableEntries(), I'll fix that.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
138 ↗(On Diff #22774)

Done.

147 ↗(On Diff #22774)

I do mutate them though, e.g. in Clusterify() :-)
I can try to avoid that if you think it's bad.

211 ↗(On Diff #22774)

Done.

303 ↗(On Diff #22774)

Done.

850–851 ↗(On Diff #22774)

Done.

djasper added inline comments.Mar 31 2015, 5:03 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7337 ↗(On Diff #22923)

Everything in this function needs comments. Even if it didn't have any before, write down what you think this does.

Also, I think I'd probably pull out two different function that have names corresponding to what they do.

7374 ↗(On Diff #22923)

This assert basically test part of isSuitableForBitTests. Pull that part out into a separate well-named function instead.

7430 ↗(On Diff #22923)

With the new structure, I'd probably approach this entirely differently. IIUC, bit tests are really good to handle many (potentially relatively spares) case labels leading to the same destination.

Now, to make use of this, I'd first group all case clusters (maybe even before rangeifying them) and group them by destination. Now, start with the biggest group (possibly biggest weight by taking branch probabilities into account) and look at whether they can be lowered as bit tests. Thereby, it is possible to peel away the most bit-test-worthy groups (grouped by common destination) one at a time. I haven't completely thought about how to do this and there are a few additional things to consider:

  • The range of a group might be too large (greater than the bit width). However, it might be possible to split that one group into multiple sub groups.
  • It might be possible to lower multiple groups (Group1 leading to Dest1, Group2 leading to Dest2, ..) as a single bit test to conserve extra shifts (if the entire range fits into the bit width). This is currently (and in the previous implementation) sort of done by saying that there can be up to three unique destinations, but I don't that this is actually a good heuristic.
  • At some point there will be a cut-off, i.e. lowering a group as bit test is not worth it. Trivially, single-destination groups are never worth to be lowered as bit tests (a single cmp is faster than and+cmp). But as we might be able to lower the others as jump tables, the cut-off might be higher.

Lets chat some more about this.

7448 ↗(On Diff #22923)

I think this function is too long and BW should be replaced with something not abbreviated.

7449 ↗(On Diff #22923)

This should be extended and made into a function comment, not hidden somewhere in the code.

7454 ↗(On Diff #22923)

Do you actually observe a difference here? Seems like premature optimization.

And also, there seem to be better ways to optimize this (e.g. use a SmallVector to keep the common case of small N on the stack or to re-use and allocated area for this pass instead of allocating a new one for each switch.

7470 ↗(On Diff #22923)

--i?

7476 ↗(On Diff #22923)

Please write down what 'better' means.

7482 ↗(On Diff #22923)

If you pull apart the isSuitableForBitTests() function as mentioned above, you can likely re-use it here.

7489 ↗(On Diff #22923)

So, this is actually a O(N^3) algorithm?

7837 ↗(On Diff #22923)

IIUC, this seems to be a fundamental difference between the previous and current implementation. The previous implementation called handleBitTestsSwitchCase first and only later handleJTSwitchCase. Why did you reverse the order?

IMO, it could be really important to prefer bit tests if there are many labels leading to the same destination. These inherently cause redundancy in jump tables.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
206 ↗(On Diff #22923)

Use space before ":" or ideally clang-format :-).

djasper added inline comments.Mar 31 2015, 5:06 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7430 ↗(On Diff #22923)

"relatively sparse" was what I meant to say.

hans updated this revision to Diff 23005.Mar 31 2015, 1:27 PM

Addressing djasper's comments.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7337 ↗(On Diff #22923)

Added comments and extracted rangeFitsInWord().

7374 ↗(On Diff #22923)

I don't agree. I think this assert makes sense here: we're about to build the actual bit mask and assert that it fits. I've added a comment in the assert though.

7430 ↗(On Diff #22923)

Yes, bit tests come up a lot, especially in code like "if (x == 5 || x == 17)" that we transform to switches internally.

I like the idea of grouping the cases by destination and trying to greedily find clusters for bit tests that way.

The tricky bit is that in the end, the CaseClusters need to be non-overlapping for the binary tree to work, i.e. if we find that {1,8,10,34} go to to one destination and could be a bit test, it's a problem if there are a bunch of other cases in between.

I'll need to think about this.

7448 ↗(On Diff #22923)

Renamed to BitWidth.

7449 ↗(On Diff #22923)

Moved it to the top of the function.

7454 ↗(On Diff #22923)

You're right, I'll just use SmallVectors. Most switches are very small.

7470 ↗(On Diff #22923)

Done.

7476 ↗(On Diff #22923)

Done.

7482 ↗(On Diff #22923)

Done.

7489 ↗(On Diff #22923)

See my "Slowness" FIXME :)

It's not O(n^3) though, as j-i is capped at BitWidth. I believe this algorithm is O(n) with a large constant.

7837 ↗(On Diff #22923)

Very good point, the order here is reversed, but I think the priorities here are still the same.

While the previous code would only check if a whole range could be lowered as bit tests or jump table, this new code is good at finding bit tests in a sub-range. In my testing, it was common that a bit test would be found which then prevented a jump table from being built.

The new code checks for jump tables first, because I think it is more expensive if we miss an opportunity for a jump table. However, FindJumpTable explicitly checks if that range could be lowered with bit tests instead, and bails if so. That is, we will not use a jump table to lower a range that could be lowered with bit tests. This is tested by the "bt_is_better" and "jt_is_better" tests in switch.ll.

As you pointed out on IRC, it's now possible that a small jump table prevents a larger bit test from being built. However, I think that's a less costly problem than the reverse.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
206 ↗(On Diff #22923)

Done.

Is it possible to get some idea of which bit tests LLVM is not finding with your bit tests? From your comments and IRC discussion, it feels like it should be finding more bit tests.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7275–7277 ↗(On Diff #23005)

Do I understand this correctly in that you would could something like:

case 1:
case 2:
case 3:
case 4:
...
case 10:
  f();
  break;

as 100% dense? I don't think this would be a good idea. Essentially, the density of a jump table should be a measure for the amount of redundancy in it.

Probably not worth addressing in this patch as it is pre-existing, but we should look into that.

hans added a comment.Apr 15 2015, 6:42 AM

Sorry for the lack of updates. I'll try to get this patch moving again.

Is it possible to get some idea of which bit tests LLVM is not finding with your bit tests? From your comments and IRC discussion, it feels like it should be finding more bit tests.

I have a theory. The previous code has a "number of cmps" concepts, where a single case counts as 1, and a range counts as 2, and this metric affects whether we build bit tests.

Since we can compare both single cases and ranges with a single comparison (though the range also requires a subtraction), my patch just uses the number of clusters. I'll try to confirm if that's the reason for the difference.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7275–7277 ↗(On Diff #23005)

Yes, that counts as 100% dense. The current code has the same problem, and you're right that we'd want to bias against building tables in this case.

However, the switch above only has one cluster, and we require 4 for a jump table. So we're not completely messing things up :-)

hans updated this revision to Diff 23769.Apr 15 2015, 6:43 AM

Rebase. No change.

hans added a comment.Apr 15 2015, 8:17 AM
In D8649#156414, @hans wrote:

Sorry for the lack of updates. I'll try to get this patch moving again.

Is it possible to get some idea of which bit tests LLVM is not finding with your bit tests? From your comments and IRC discussion, it feels like it should be finding more bit tests.

I have a theory. The previous code has a "number of cmps" concepts, where a single case counts as 1, and a range counts as 2, and this metric affects whether we build bit tests.

Since we can compare both single cases and ranges with a single comparison (though the range also requires a subtraction), my patch just uses the number of clusters. I'll try to confirm if that's the reason for the difference.

Yeah, that was it. If I use the "NumCmps" metric instead, I get 5785 bit tests with my patch in a bootstrap vs. 5603 without my patch, so that's the explanation.

Here's an example:

case 1: case3: Something.
case 4: case 5: case 32:
Something else.

Without my patch, we use bit tests (NumCmps = 5), but with my patch we don't (NumClusters = 4). I think counting the number of clusters makes more sense, because a range still only requires one branch, but I'll change this back for now to reduce the number of changes introduced by my patch.

djasper accepted this revision.Apr 15 2015, 9:24 AM
djasper edited edge metadata.

I think you should re-run the numbers you gathered under "Some numbers from a Clang bootstrap at r233105.", but other than that this seems fine to go in. Anything else we should try to experiment with in smaller subsequent patches.

This revision is now accepted and ready to land.Apr 15 2015, 9:24 AM
hans updated this revision to Diff 23856.Apr 16 2015, 7:19 AM
hans edited edge metadata.

Switching to the NumCmps metric for deciding when to do bit tests, and some minor tweaks after another read-through.

hans added a comment.Apr 16 2015, 7:49 AM

I think you should re-run the numbers you gathered under "Some numbers from a Clang bootstrap at r233105.", but other than that this seems fine to go in. Anything else we should try to experiment with in smaller subsequent patches.

Sounds good. I re-ran the numbers for the self-host. I'm leaving out the tree balance metrics, as they haven't changed. The binary size difference is also still 20 kB.

Without my patch:
4781 jump tables, 5593 bit tests

With my patch:
4781 jump tables, 5775 bit tests

Again, the main benefit is that with my patch, the trees are always balanced.

This confirms that we fixed the bit test discrepancy by going back to NumCmps. The interesting thing is that we're now not finding any new jump tables. It seems previously we'd generate jump tables for some of the bit tests we didn't use due to the different metric.

This also shows that the new jump table finding algorithm doesn't have any effect on a clang bootstrap. That's a little disappointing, but I guess the cases where the current algorithm fails are not that common.

We talked a bit about compile-time issues at Euro-LLVM. For my 10,000 case switch example in "big.c" above, the slow-down doesn't seem that significant, and switches like that should be rare. If it turns out that the jump table or bit-test algorithms have speed issues in practice, we could try simpler algorithms, or reducing their search space somehow.

This revision was automatically updated to reflect the committed changes.
hans added a comment.Apr 23 2015, 10:57 AM

Anton asked me at Euro-LLVM if it was possible to construct a switch that exhibits the worst-case properties of the old switch lowering. I believe the attached file does that:

. It's 10k-case switch where the distance between cases increases linearly.

(I used perf stat -r5 for the timings except for slow run which I only ran once.)

Without my patch:
clang -O0: 3.26 s ( +- 2.34% )
clang -O3: 7.43 s ( +- 3.49% )
./a.out: ~40 s

With my patch:
clang -O0: 0.76 s ( +- 3.04% )
clang -O3: 6.24 s ( +- 3.33% )
./a.out: 0.32 s ( +- 0.23% )

I think we're hitting an issue with this. Not 100% sure yet. Will try to get you something to test out but it will take some time.

We have a switch with cases 0...16. It gets put as a one-cluster jump table. Then we get an iterator out of bounds error in SelectionDagBuilder @7671 when trying to re-arrange case blocks since there's nothing to rearrange. Seems like this rearrangement code path could be avoided unless there are multiple clusters.

hans added a comment.Apr 23 2015, 12:33 PM

Hi Andy,

What revision are you using?

There was a bug with iterators going out-of-bounds in that code, Aaron
reverted in r235597, and I recommitted with a fix in r235608.

  • Hans

We're at r235571 so it looks like we just need to update. Thanks.

hans added a comment.Apr 23 2015, 1:07 PM

OK, sounds like it's the same issue as Aaron hit, then. Thanks for
letting me know, and sorry for the breakage.

  • Hans