This is an archive of the discontinued LLVM Phabricator instance.

CodeGen: BlockPlacement: Use Branching factor to choose between near equals.
Needs ReviewPublic

Authored by iteratee on May 25 2017, 4:36 PM.

Details

Reviewers
davidxl
Summary

When choosing between blocks of approximately the same probability, we need some
other metric to choose between them. In this patch, we add the simple metric of
"Branch Factor", which is the number of independent blocks that can be reached
from a block without crossing a join point (A block with multiple successors).
This is related to the classical "Branch Factor" similarly to how a 2-3-4 tree
is related to a red-black tree. These nodes could in theory all be rolled up
into a single node like a switch. If we did that, the calculated value would be
the branching factor of that single node.

Another way of viewing this is if you don't cross join-points, you're
essentially finding a sub-tree in the CFG. We're counting the # of leaves in
that subtree.

The motivation here is that when choosing between a list of alternatives like
this:

if (v < (1<<7)) {
...
} else if (v < (1<<14)) {
...
} else if (v < (1<<21)) {
...
} else if (v < (1<<28)) {
...
} else {
...
}

That we should prefer to fall through to the next test, because those tests have
a larger number of branches they have to pass through in order to be executed.
Even if our estimates of 50% probabilities are correct for these branches,
Having the fallthrough go to the next test helps to reduce tail latency because
it reduces the # of taken branches to get to the bottom of the sub-tree of the
CFG. In the above example, each case ends up with 1 taken branch, instead of the
last case requiring 4 taken branches.

In the attached test case, the expected number of dynamic taken branches is
15/16 for both layouts. But with the patch, the maximum # of taken branches is
1 . Without the patch, it is 4.

Branch factor is a local heuristic. Ideally we would use a measure based on a
graph property like dominance, which would have a better theoretical
grounding. A good candidate would be exclusion: A excludes B if
A ∉ DomF_inf(B) && B ∉ DomF_inf(A) && A ∉ Dom(B) && B ∉ Dom(A).
This is the theoretical definition, but a good intuition is that if A executes,
then B has not, and will not, and vice versa. We are trying to use branch factor
to guess which chosen branch will have the smaller # of excluded blocks.

Diff Detail

Repository
rL LLVM

Event Timeline

iteratee created this revision.May 25 2017, 4:36 PM
davidxl edited edge metadata.May 26 2017, 9:26 AM

Looking at the example. Comparing two layout decisions: 1) always pick the next test as layout successor ; and 2) always pick the (else) if-then block as the layout successor.
A) the number of taken branches is the same
B) assuming 50% prob for all branches, the dynamic count of taken branches with 2) is actually smaller.

So I am not convinced this is a right heuristic unless more data is provided.

Here is the suggestion. You can write up a microbenchmark with this CFG shape. Make all the branch's real probability to be 50% (using controlled by random data make branch less predictable). Using __buitlin_expect to force the layout to be either 1) or 2) described above and compare
a) runtime performance
b) taken branches with PMU.

I think you counted wrong. The dynamic taken branch count is the same. But the distribution of the taken branch count is more consistent.

In the example test, the default layout gives:

0 taken: 1/2
1 taken: 1/4
2 taken: 1/8
3 taken: 1/16
4 taken: 1/16

For an average of 15/16.

The layout with this heuristic gives:
0 taken: 1/16
1 taken: 15/16

For the same average, but a much more consistent distribution.

iteratee updated this revision to Diff 100472.May 26 2017, 2:04 PM
iteratee edited the summary of this revision. (Show Details)

Add comments to the change summary about the global heuristic that we're approximating (maybe badly) by calculating the branch factor.

It is basically a choice between a layout (exiting) that has 50% chance of not taking any branches , 25% of taking one branch, and 25% of taking more than one branches vs the new layout that has only 6.25% chance of taking zero branch and 93.75% of taking only one branch.

The existing layout only has 25% chance of taking more than 2 branches -- is it worth sacrificing 43.75% of chances to not take any branches for the improvement for the 25% cases?

It is basically a choice between a layout (exiting) that has 50% chance of not taking any branches , 25% of taking one branch, and 25% of taking more than one branches vs the new layout that has only 6.25% chance of taking zero branch and 93.75% of taking only one branch.

The existing layout only has 25% chance of taking more than 2 branches -- is it worth sacrificing 43.75% of chances to not take any branches for the improvement for the 25% cases?

Yes.
Another way of looking at this is that the new layout is more resilient to our guesses or profiles being wrong.

Note that the existing layout gets chosen somewhat at random. For exactly 50/50 branches, we rely on the order of the successors, so we're creating a guaranteed order where in the existing code, there isn't any. If the branches were reversed in the IR, we won't un-reverse them.

iteratee edited the summary of this revision. (Show Details)May 26 2017, 3:22 PM

Do you have performance numbers (to show it works better on average)?