This is an archive of the discontinued LLVM Phabricator instance.

[PGO] Hoist hot case statement from switches
Abandoned · Public

Authored by mcrosier on Oct 14 2014, 2:44 PM.

Details

Summary

This patch identifies hot cases based on profile information and inserts the conditional logic necessary to jump to the hottest case statement before entering the general switch lowering logic.
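
Roughly, at the source level the transformation looks like the following sketch (the function names, case values, and the choice of case 7 as the hot case are all invented for illustration; the pass itself operates on LLVM IR):

    void handle1(); void handle7(); void handle9(); void handle12();

    // Before: profile data says 'case 7' is taken more than 80% of the time.
    void before(int x) {
      switch (x) {
      case 1:  handle1();  break;
      case 7:  handle7();  break;  // hot case
      case 9:  handle9();  break;
      case 12: handle12(); break;
      }
    }

    // After: a single compare-and-branch reaches the hot case immediately;
    // the remaining cases fall through to the usual switch lowering.
    void after(int x) {
      if (x == 7) {
        handle7();
        return;
      }
      switch (x) {
      case 1:  handle1();  break;
      case 9:  handle9();  break;
      case 12: handle12(); break;
      }
    }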

A few comments:

  1. I'm still working on getting performance numbers, so feel free to grab the patch and test yourself.
  2. The 80% "hot" threshold is entirely arbitrary and likely needs tuning. By default, we don't consider switches with fewer than 4 cases, so 80% seems reasonable, IMHO.
  3. I would like to add additional tests, so suggestions are welcome. Also, I'm not sure how to verify profile information is being propagated correctly; the logic looks correct, but I'd like to have a test in place to ensure no future regressions.

The ideal solution would be a balanced binary tree, but this seems to be a reasonable first step.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 14892. Oct 14 2014, 2:44 PM
mcrosier retitled this revision from to [PGO] Hoist hot case statement from switches.
mcrosier updated this object.
mcrosier edited the test plan for this revision.
silvas added a subscriber: silvas. Oct 15 2014, 4:56 PM

The ideal solution to "given I have branch weight metadata giving me the relative probabilities of each case label, in what order should I do two-way comparisons to minimize the expected number of comparisons needed to reach the cases at run-time" is not a balanced binary tree; it is a Huffman tree (where the case probabilities are what is usually called "symbol probabilities" when talking about Huffman trees.)

I don't think this patch is the right long-term approach. I do think this patch is useful for identifying the possible performance win of using branch weight info to guide switch lowering. If the current patch doesn't result in a measurable speedup across a variety of benchmarks, it is unlikely to be worth much effort in improving our lowering by using branch weight metadata. You mentioned you ran spec2000/spec2006; what were the performance benefits?

If you aren't familiar with Huffman trees, they are basically based on a couple simple observations (imagine you have a binary tree, where each branching point represents a test, and the leaves represent case labels):

  • the tree is full (i.e. all non-leaf nodes have two children). This means that if a case label is one of the "deepest" labels (longest path from the root), then its sibling must be as well.
  • if case A has a greater probability than case B, then the leaf containing case A must be closer to the root than (or the same distance as) the leaf containing case B.
  • there is an optimal tree where the two least likely cases are siblings.

That's it. The algorithm is then to start with all case labels in a min-heap by probability, and to repeatedly pop off the two least likely cases, merge them into a fragment of the final tree, which notionally has a probability equal to the sum of their probabilities, and push this merged tree fragment back on the heap. Eventually, the heap contains only one element, which is your final tree.
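
In code, the construction is short. A minimal C++ sketch (the Node layout and all names here are invented for illustration, not taken from any patch; leaves carry case labels, interior nodes represent the two-way tests):

    #include <memory>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Node {
      double Prob = 0;      // case probability, or the sum for a merged fragment
      int CaseLabel = -1;   // meaningful only for leaves
      std::unique_ptr<Node> Left, Right;
    };

    struct HigherProb {
      bool operator()(const Node *A, const Node *B) const {
        return A->Prob > B->Prob;  // makes the priority_queue a min-heap on Prob
      }
    };

    // Cases: (label, probability) pairs; assumed non-empty.
    std::unique_ptr<Node>
    buildHuffmanTree(const std::vector<std::pair<int, double>> &Cases) {
      std::priority_queue<Node *, std::vector<Node *>, HigherProb> Heap;
      for (const auto &C : Cases) {
        Node *Leaf = new Node;
        Leaf->Prob = C.second;
        Leaf->CaseLabel = C.first;
        Heap.push(Leaf);
      }
      while (Heap.size() > 1) {
        Node *A = Heap.top(); Heap.pop();  // the two least likely fragments...
        Node *B = Heap.top(); Heap.pop();
        Node *Merged = new Node;           // ...become siblings under a new node
        Merged->Prob = A->Prob + B->Prob;
        Merged->Left.reset(A);
        Merged->Right.reset(B);
        Heap.push(Merged);
      }
      return std::unique_ptr<Node>(Heap.top());  // the last fragment is the tree
    }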

Honestly, having branch probabilities is really a game-changer for switch lowering and probably merits a different approach (or reconsidering the current approach). You should probably take a look (if you haven't already) at the paper "A Superoptimizer Analysis of Multiway Branch Code Generation" in https://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=get&target=gcc-2008-proceedings.pdf

Hi Sean,
Thanks for your input. Comments below.

In D5786#4, @silvas wrote:

The ideal solution to "given I have branch weight metadata giving me the relative probabilities of each case label, in what order should I do two-way comparisons to minimize the expected number of comparisons needed to reach the cases at run-time" is not a balanced binary tree; it is a Huffman tree (where the case probabilities are what is usually called "symbol probabilities" when talking about Huffman trees.)

To clarify, I didn't mean "balanced" in the traditional sense, but balanced based on probabilities. I do recall our conversation (+Chandler, +Hans) on IRC about Huffman trees.

I don't think this patch is the right long-term approach.

I believe I said that. LLVM prefers incremental development; I believe this is a reasonable first step. I understand the Huffman tree is an entirely different implementation, but the proposed solution is on the order of 50 lines of real code, straightforward, and should require little or no maintenance. It *should* also provide a fairly good ROI, but I understand that remains to be proven.

I do think this patch is useful for identifying the possible performance win of using branch weight info to guide switch lowering. If the current patch doesn't result in a measurable speedup across a variety of benchmarks, it is unlikely to be worth much effort in improving our lowering by using branch weight metadata. You mentioned you ran spec2000/spec2006; what were the performance benefits?

I mentioned that I was working on getting numbers, but I've now been sidetracked by an internal release and the upcoming Dev Meeting. I don't know when I'll have the bandwidth to generate the numbers, so I suggested others give it a try if they wish. Feel free to do so yourself.

If you aren't familiar with Huffman trees, they are basically based on a couple simple observations (imagine you have a binary tree, where each branching point represents a test, and the leaves represent case labels):

  • the tree is full (i.e. all non-leaf nodes have two children). This means that if a case label is one of the "deepest" labels (longest path from the root), then its sibling must be as well.
  • if case A has a greater probability than case B, then the leaf containing case A must be closer to the root than (or the same distance as) the leaf containing case B.
  • there is an optimal tree where the two least likely cases are siblings.

That's it. The algorithm is then to start with all case labels in a min-heap by probability, and to repeatedly pop off the two least likely cases, merge them into a fragment of the final tree, which notionally has a probability equal to the sum of their probabilities, and push this merged tree fragment back on the heap. Eventually, the heap contains only one element, which is your final tree.

Honestly, having branch probabilities is really a game-changer for switch lowering and probably merits a different approach (or reconsidering the current approach). You should probably take a look (if you haven't already) at the paper "A Superoptimizer Analysis of Multiway Branch Code Generation" in https://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=get&target=gcc-2008-proceedings.pdf

I'm aware of the paper. In fact, this approach is based on results from the 2006 Summit paper titled "Switch Statement Case Reordering FDO." That work shows that hoisting the two hottest cases in spec2000/perlbmk improved performance by 17%. I don't expect to see nearly that much gain, but I do expect to see some.

I understand your concern, but I don't agree that this patch is a step in the wrong direction (or that it should be rejected in favor of the ideal). If the patch were more pervasive, then I would proceed with caution. However, it is not, and it should have a fair ROI. If and when someone comes along to implement the Huffman approach, it will be trivial to revert this change.

Chad
chandlerc resigned from this revision. Mar 29 2015, 2:02 PM
chandlerc edited reviewers, added: hansw, djasper; removed: chandlerc.

I'd really like to get this in, but I suspect others could review this better than I could. Adding Hans and Daniel as they've looked at this stuff recently.

djasper edited edge metadata. Mar 30 2015, 10:00 AM

Hans is probably best-suited to review this properly. However, also note that he is undertaking a massive rewrite of the entire switch lowering logic (http://reviews.llvm.org/D8649). You should probably sync.

As for the approach, I am not entirely sure it is good. Obviously, we want to make use of profile numbers to improve switch lowering. However, it is unclear to me whether, e.g., we'd want to handle the hot case separately even if it could be part of a jump table instead. Jump tables are quite fast, and pulling the hot case out is entirely additional code.

hans added a subscriber: hans. Mar 30 2015, 10:37 AM

I too am a little concerned about the approach of peeling off a hot case. As djasper pointed out, it might have been possible to include the case in a jump table (even worse, peeling the case might prevent a jump table from being built at all, because the density of the remaining cases changes). More generally, peeling a hot case penalizes the non-hot cases by putting an extra branch in front of them, which means we'd only want to do this when the hot case completely dominates.
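
To make the density concern concrete, here is a sketch with an invented 40% density cut-off (LLVM's real jump-table heuristic is more involved; the numbers are purely illustrative):

    #include <algorithm>
    #include <vector>

    // Density = number of cases / size of the case value range.
    double density(const std::vector<int> &CaseValues) {  // assumed non-empty
      auto MinMax = std::minmax_element(CaseValues.begin(), CaseValues.end());
      return double(CaseValues.size()) /
             (double(*MinMax.second) - double(*MinMax.first) + 1.0);
    }
    // cases {0,1,2,3,10}: 5/11 ~ 45% -> dense enough for a jump table
    // peel hot case 1:    4/11 ~ 36% -> below the 40% cut-off, no jump table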

If we do want to peel a hot case, my patch would allow doing that without disrupting jump tables or bit tests: we'd extract those first, and then peel the hot CaseCluster, which might be a single case, range, jump table or bit test.

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.
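
A sketch of what choosing such a split might look like (my illustration under assumed names, not code from D8649): pick the split point that best balances the profile weight on the two sides, then recurse on each side, so heavy cases end up near the root.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Weights[i] is the profile weight of the i-th case, with cases sorted
    // by value. Returns the index at which to split [First, Last), chosen so
    // the total weight on each side is as balanced as possible.
    // Requires Last - First >= 2.
    size_t chooseSplit(const std::vector<uint64_t> &Weights,
                       size_t First, size_t Last) {
      uint64_t Total = 0;
      for (size_t I = First; I != Last; ++I)
        Total += Weights[I];
      uint64_t LeftSum = 0, BestImbalance = Total;
      size_t BestIdx = First + 1;
      for (size_t I = First + 1; I != Last; ++I) {
        LeftSum += Weights[I - 1];            // left side is [First, I)
        uint64_t RightSum = Total - LeftSum;  // right side is [I, Last)
        uint64_t Imbalance =
            LeftSum > RightSum ? LeftSum - RightSum : RightSum - LeftSum;
        if (Imbalance < BestImbalance) {
          BestImbalance = Imbalance;
          BestIdx = I;
        }
      }
      return BestIdx;  // the node tests "x < value of case BestIdx"
    }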

In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

hans added a comment. Mar 30 2015, 7:26 PM
In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

Thanks! It sounds like what I'm proposing is this approach: http://en.wikipedia.org/wiki/Optimal_binary_search_tree#Mehlhorn.27s_approximation_algorithm

I guess it was bold of me to claim my idea guarantees the minimum expected value, but this makes me think it's still a good idea.

In D5786#149403, @hans wrote:
In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

Thanks! It sounds like what I'm proposing is this approach: http://en.wikipedia.org/wiki/Optimal_binary_search_tree#Mehlhorn.27s_approximation_algorithm

I guess it was bold of me to claim my idea guarantees the minimum expected value, but this makes me think it's still a good idea.

Ultimately it is based on your model for performance.

As a possibly realistic example, we may not want to optimize expected tree depth of the binary tree, but rather want to optimize predictability of the branches along each path from the root to the pieces of code in the leaves (weighted appropriately etc.). In this case, what we want to infer from the data is not "how likely are we to go to this case", but rather "how likely is a particular choice of branching scheme to result in well-predicted branches". In this case the choice of tree structure could be something related to minimizing entropy of each branch (as a rough approximation to branch prediction cost).

It's all about your model for the resulting performance. It seems like your current approach is based on a performance model where the tree depth is the sole determining factor in the resulting performance once we are doing a binary search tree (i.e. branches take constant time). That sounds reasonable, but it might be worth revisiting in light of modern microarchitectures. Perhaps the entropy of each branch should be taken into account. Or maybe there is something fancier than entropy we can use here (such as a Markov model; this would require improving the profiling tooling, I think, so that it can collect conditional probabilities).
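
As an illustrative sketch of the entropy idea (not anything in LLVM): the binary entropy of a branch is 0 when the branch always goes one way (trivially predictable) and peaks at 1 bit when it is 50/50 (hardest to predict).

    #include <cmath>

    // Entropy of a two-way branch taken with probability P.
    double branchEntropy(double P) {
      if (P <= 0.0 || P >= 1.0)
        return 0.0;  // perfectly biased, perfectly predictable
      return -P * std::log2(P) - (1.0 - P) * std::log2(1.0 - P);
    }
    // A tree builder along these lines would prefer splits whose branches
    // have low entropy, weighted by how often each node is reached.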

The extraction of switch tables and bit tests is an example of where we are modeling the performance of handling particular situations.

Ok... done braindumping now...

hans added a comment. Mar 31 2015, 11:20 AM
In D5786#149403, @hans wrote:
In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

Thanks! It sounds like what I'm proposing is this approach: http://en.wikipedia.org/wiki/Optimal_binary_search_tree#Mehlhorn.27s_approximation_algorithm

I guess it was bold of me to claim my idea guarantees the minimum expected value, but this makes me think it's still a good idea.

Ultimately it is based on your model for performance.

As a possibly realistic example, we may not want to optimize expected tree depth of the binary tree, but rather want to optimize predictability of the branches along each path from the root to the pieces of code in the leaves (weighted appropriately etc.). In this case, what we want to infer from the data is not "how likely are we to go to this case", but rather "how likely is a particular choice of branching scheme to result in well-predicted branches". In this case the choice of tree structure could be something related to minimizing entropy of each branch (as a rough approximation to branch prediction cost).

It's all about your model for the resulting performance. It seems like your current approach is based on a performance model where the tree depth is the sole determining factor in the resulting performance once we are doing a binary search tree (i.e. branches take constant time). That sounds reasonable, but it might be worth revisiting in light of modern microarchitectures. Perhaps the entropy of each branch should be taken into account. Or maybe there is something fancier than entropy we can use here (such as a Markov model; this would require improving the profiling tooling, I think, so that it can collect conditional probabilities).

The extraction of switch tables and bit tests is an example of where we are modeling the performance of handling particular situations.

Ok... done braindumping now...

It's an interesting idea, but I'm not sure it's practical to try to model this. For now, I think minimizing the number of branches is a good heuristic, and then I put my faith in the hardware to predict the branches we do generate. For example, http://dl.acm.org/citation.cfm?id=2738614 suggests that indirect branches are well predicted on modern architectures.

mcrosier abandoned this revision. Aug 5 2015, 11:17 AM