This is an archive of the discontinued LLVM Phabricator instance.

CodeGen: BlockPlacement: Precompute layout for chains of triangles.
ClosedPublic

Authored by iteratee on Feb 23 2017, 12:47 PM.

Details

Reviewers
davidxl
Summary

For chains of triangles with small join blocks that can be tail duplicated, a
simple calculation of probabilities is insufficient. Tail duplication
can be profitable in 3 different ways for these cases:

  1. The post-dominators marked 50% are actually taken 56% (This shrinks with longer chains)
  2. The chains are statically correlated. Branch probabilities have a very U-shaped distribution. [http://nrs.harvard.edu/urn-3:HUL.InstRepos:24015805] If the branches in a chain are likely to be from the same side of the distribution as their predecessor, but are independent at runtime, this transformation is profitable. (Because the cost of being wrong is a small fixed cost, unlike the standard triangle layout where the cost of being wrong scales with the # of triangles.)
  3. The chains are dynamically correlated. If the probability that a previous branch was taken positively influences whether the next branch will be taken

We believe that 2 and 3 are common enough to justify the small margin in 1.

The code pre-scans a function's CFG to identify this pattern and marks the edges
so that the standard layout algorithm can use the computed results.

Diff Detail

Event Timeline

iteratee created this revision.Feb 23 2017, 12:47 PM
davidxl edited edge metadata.Feb 25 2017, 4:08 PM

if BP is not correct, it is better to improve static branch prediction. We explicitly added a threshold for the cost based analysis result to kick in just to be conservative when the branch probability is not biased enough. Even for the long chain case, tail dup is enabled for 50/50 case, but the real profile is 40/60, taildup will hurt performance. I don't see the reason to by pass the branch prob + cost analysis by just looking at the shape.

if BP is not correct, it is better to improve static branch prediction. We explicitly added a threshold for the cost based analysis result to kick in just to be conservative when the branch probability is not biased enough. Even for the long chain case, tail dup is enabled for 50/50 case, but the real profile is 40/60, taildup will hurt performance. I don't see the reason to by pass the branch prob + cost analysis by just looking at the shape.

Well, long chains amortize the penalty, so looking for the shape is definitely necessary.

I can adjust the static prediction if you'd like, but I have a source for the 60/40 stat:
See page 13 here:
http://digitalassets.lib.berkeley.edu/techreports/ucb/text/CSD-83-121.pdf

The chains also allow us to make a correlation assumption. I can explicitly calculate that as well, however, edge frequencies run into aliasing problems. It seems that BlockFrequency wasn't designed to allow for calculations like these.

I really don't want to change this until I get a more specific feedback about what you'd like to see.
Assuming a small amount of positive correlation (10%), the cutoff is 47% (including the frequency bonus) for a chain of 2 triangles that ends in a non-triangle,
and 56% for a chain of triangles that ends in a triangle.

Would you prefer that I adjust the static probabilities for triangles, and then run the comparisons against the thresholds I calculated above? I could even include the whole table from 2-10. (The threshold goes down as the # of triangles goes up)

if BP is not correct, it is better to improve static branch prediction. We explicitly added a threshold for the cost based analysis result to kick in just to be conservative when the branch probability is not biased enough. Even for the long chain case, tail dup is enabled for 50/50 case, but the real profile is 40/60, taildup will hurt performance. I don't see the reason to by pass the branch prob + cost analysis by just looking at the shape.

Well, long chains amortize the penalty, so looking for the shape is definitely necessary.

I can adjust the static prediction if you'd like, but I have a source for the 60/40 stat:
See page 13 here:
http://digitalassets.lib.berkeley.edu/techreports/ucb/text/CSD-83-121.pdf

FWIW, I remember discussing this a loooooong time ago as we were really starting to set up static prediction. At the time, there was some desire to not try to put this weak of signal into static predictions. They have ways of compounding and ending up producing pretty weird results. So I'm not really sure the probabilities we use in static prediction are wrong. (Or rather, are wrong by enough of a margin or in enough cases to be worth shifting.) Maybe we should revisit this, but I'm always a bit skeptical of static heuristics with this small of a difference...

The chains also allow us to make a correlation assumption. I can explicitly calculate that as well, however, edge frequencies run into aliasing problems. It seems that BlockFrequency wasn't designed to allow for calculations like these.

Based on your explanation to me about how all of this works, my understanding is this:

Branches in these kinds of long chains of triangles empirically correlate, even though they may individually have something like 50/50 probability. And the advantage of *correlation* is pretty specific to the *layout* we're doing here.

Given that, I think it is very reasonable to handle this within the layout code by detecting the pattern of CFG combined with (nearly) 50/50 probabilities, and choosing to prioritize a layout this is profitable in the face of correlation because we believe that such correlation will often occur.

Anyways, just my two cents. I'll leave figuring out the end state to you and David. =D

if BP is not correct, it is better to improve static branch prediction. We explicitly added a threshold for the cost based analysis result to kick in just to be conservative when the branch probability is not biased enough. Even for the long chain case, tail dup is enabled for 50/50 case, but the real profile is 40/60, taildup will hurt performance. I don't see the reason to by pass the branch prob + cost analysis by just looking at the shape.

Well, long chains amortize the penalty, so looking for the shape is definitely necessary.

I can adjust the static prediction if you'd like, but I have a source for the 60/40 stat:
See page 13 here:
http://digitalassets.lib.berkeley.edu/techreports/ucb/text/CSD-83-121.pdf

FWIW, I remember discussing this a loooooong time ago as we were really starting to set up static prediction. At the time, there was some desire to not try to put this weak of signal into static predictions. They have ways of compounding and ending up producing pretty weird results. So I'm not really sure the probabilities we use in static prediction are wrong. (Or rather, are wrong by enough of a margin or in enough cases to be worth shifting.) Maybe we should revisit this, but I'm always a bit skeptical of static heuristics with this small of a difference...

The chains also allow us to make a correlation assumption. I can explicitly calculate that as well, however, edge frequencies run into aliasing problems. It seems that BlockFrequency wasn't designed to allow for calculations like these.

Based on your explanation to me about how all of this works, my understanding is this:

Branches in these kinds of long chains of triangles empirically correlate, even though they may individually have something like 50/50 probability. And the advantage of *correlation* is pretty specific to the *layout* we're doing here.

Given that, I think it is very reasonable to handle this within the layout code by detecting the pattern of CFG combined with (nearly) 50/50 probabilities, and choosing to prioritize a layout this is profitable in the face of correlation because we believe that such correlation will often occur.

Anyways, just my two cents. I'll leave figuring out the end state to you and David. =D

So correlation is an interesting term. There are actually 2 ways that the branches may be correlated:

  1. They may be biased in the same direction. We guess 50% for unknown branches, and over all of them that may be pretty close, but for any individual branch it's unlikely to be 50%. Tail duplication is profitable if the biases go the same direction, even if they are independent at runtime.
  2. They may be dynamically correlated. If the branch is close to 50%, but they are positively correlated, this is also profitable.

It's also profitable if the branches are independent, but each branch is slightly more than 50%: 58% for 2 triangles in a row, and 56% for 3 triangles in a row. (This includes for a 2% penalty for size increases)

Add an internal option to allow precomputing to be disabled.

lib/CodeGen/MachineBlockPlacement.cpp
1057

Add a little more comments here for the justification for doing this: cost based analysis compute layout costs by assuming full branch indepoendence which can lead to conservative result. In certain scenarios, the existence of branch correlation can make tail dup more beneficial ...

1131

use the insert method so that you can modify the chain in place.

iteratee updated this revision to Diff 90398.Mar 2 2017, 2:33 PM
iteratee marked an inline comment as done.
iteratee edited the summary of this revision. (Show Details)

Added flag, improved comments and description.

iteratee added inline comments.Mar 2 2017, 2:38 PM
lib/CodeGen/MachineBlockPlacement.cpp
1131

I actually can't do that, because you find it base on the key being the edge source, and insert it based on the key being the edge destination.

I wrote it that way initially, and then wondered why all my chains were of length 1.

I'll add a comment though, because someone might come and try to clean it up.

iteratee updated this revision to Diff 90401.Mar 2 2017, 2:44 PM

Actually update diff.

davidxl added inline comments.Mar 2 2017, 3:13 PM
lib/CodeGen/MachineBlockPlacement.cpp
152

Should the default be 3?

1061
  1. This is not entirely true -- it depends on the length the chain (e.g, when branch proabiity to Pdom blocks are low). Perhaps remove 2).
test/CodeGen/PowerPC/tail-dup-layout.ll
59

Why changing this test?

iteratee added inline comments.Mar 2 2017, 3:43 PM
lib/CodeGen/MachineBlockPlacement.cpp
152

If I had found anywhere where it was likely to cause problems in my benchmarking, I would bump it to 3, but I haven't.

1061

No, It's accurate. If you know that the frequencies are correlated, but not whether they're high or low, than this is profitable. Even if the individual branch decisions are independent.

If by the length of the chain, you mean that the only correlations that matter are neighbors, then that is correct and I can add a note about that.

iteratee updated this revision to Diff 90418.Mar 2 2017, 4:29 PM
iteratee edited the summary of this revision. (Show Details)

Comments improved, length changed to 3, and tests reverted.

davidxl added inline comments.Mar 2 2017, 4:38 PM
test/CodeGen/PowerPC/tail-dup-layout.ll
59

Missing reply here?

iteratee updated this revision to Diff 90419.Mar 2 2017, 4:54 PM

Remove unrelated change to test.

iteratee marked 2 inline comments as done.Mar 2 2017, 4:55 PM
davidxl accepted this revision.Mar 2 2017, 4:56 PM

lgtm

This revision is now accepted and ready to land.Mar 2 2017, 4:56 PM
iteratee closed this revision.May 2 2017, 4:37 PM

Committed in rL296845