This is an archive of the discontinued LLVM Phabricator instance.

Filter cold blocks off the loop chain when profile data is available.
ClosedPublic

Authored by congh on Jul 30 2015, 2:14 PM.

Details

Summary

In the current BB placement algorithm, a loop chain always contains all loop blocks. This has a drawback that cold blocks in the loop may be inserted on a hot function path, hence increasing branch cost and also reducing icache locality.

Consider a simple example shown below:

A
|
B⇆C
|
D

When B->C is quite cold, the best BB-layout should be A,B,D,C. But the current implementation produces A,C,B,D.

This patch filters those cold blocks off from the loop chain by comparing the ratio:

LoopBBFreq / LoopFreq

to 20%: if it is less than 20%, we don't include this BB to the loop chain. Here LoopFreq is the frequency of the loop when we reduce the loop into a single node. In general we have more cold blocks when the loop has few iterations. And vice versa.

Diff Detail

Event Timeline

congh updated this revision to Diff 31073.Jul 30 2015, 2:14 PM
congh retitled this revision from to Filter cold blocks off the loop chain when profile data is available..
congh updated this object.
congh added reviewers: chandlerc, dexonsmith.
congh added subscribers: llvm-commits, davidxl.

How do you come up with the 5/1 ratio? Why not making it 1/1? Here are the reasons

  1. if the loop has high trip count, a block that executed fewer than 1 time per loop does not have any cache reuse
  2. if the loop has very short trip count, it is likely that the loop won't be rotated. Splitting the cold block out increases the chances to connect loop with the exit BB with a fallthrough ..
lib/CodeGen/MachineBlockPlacement.cpp
831

Overloading BranchProbablity for the purpose of computing ratios is confusing IMO. Why not just

const unsigned LoopToColdBBRatio = 5;

836

if (Freq * LoopToColdBBRatio < LoopFreq.getFrequency()) ..

congh marked 2 inline comments as done.Jul 30 2015, 4:01 PM

How do you come up with the 5/1 ratio? Why not making it 1/1? Here are the reasons

It complies with the ratio that is used in selectBestSuccessor(): when the loop has only one iteration, this will lead to similar result if we treat the loop as a non-loop by removing back edges.

  1. if the loop has high trip count, a block that executed fewer than 1 time per loop does not have any cache reuse

The cold blocks that are not filtered off will be placed to the end of the loop chain (or beginning after loop rotation). So for loop itself the icache locality is still good as all hot blocks are still put together. When connecting those unfiltered cold blocks to blocks outside of the loop, we can use the similar logic that is used in selectBestSuccessor(): we follow CFG constraints when the block is not too cold (>20%).

  1. if the loop has very short trip count, it is likely that the loop won't be rotated. Splitting the cold block out increases the chances to connect loop with the exit BB with a fallthrough ..

In an extreme situation, if the loop has only one iteration, and if we use 1/1 as the threshold, it seems that too many blocks in the loop will be filtered off.

lib/CodeGen/MachineBlockPlacement.cpp
831

At first I think this may be more flexible as we can choose other threshold like 2/5. But now I prefer your suggestion. This can also prevent the potential overflow when converting uint64 to uint32.

congh updated this revision to Diff 31086.Jul 30 2015, 4:09 PM
congh marked an inline comment as done.

Update the patch by using LoopToColdBBRatio instead of a probability according to David's comment.

Ok -- the change is made to be consistent with existing trace formation heuristic to improve locality. It outlines the cold blocks and allow them to be pushed away further so that trace can be formed from the current loop with loop exit blocks..

lib/CodeGen/MachineBlockPlacement.cpp
832

Try to avoid hard-coding parameter like this.

congh added inline comments.Jul 31 2015, 2:41 PM
lib/CodeGen/MachineBlockPlacement.cpp
832

Should I make it a global constant variable or an option? The other 20% probability used in selectBestSuccessor() is also hard-coded and we could use the same variable to replace them.

I suggest making it an option. The other usage is slightly different

  • we can leave it out for now (if needed can be changed later).

David

congh updated this revision to Diff 31177.Jul 31 2015, 4:29 PM

Update the patch by making the loop-to-cold-block-ratio an option.

djasper edited edge metadata.Oct 7 2015, 7:42 AM

Why only do this if profile data is available? Are we scared that the statically derived probabilities are off by too much?

How does this affect nested loops? Is the 'outlined' block still in the outer loop? I am not sure which one would be better, but maybe we should add a comment and a test to document the behavior?

Should there be some minimum size for the block to be outlined? A very short block will still increase branch count, but not really affect cache locality. For those, the trade-off might be different.

congh added a comment.Oct 12 2015, 5:07 PM

Why only do this if profile data is available? Are we scared that the statically derived probabilities are off by too much?

Without precise profile data, it is safer to put all blocks of the same loop together: if hot blocks are scattered into several places at the runtime, we will get very poor cache locality.

How does this affect nested loops? Is the 'outlined' block still in the outer loop? I am not sure which one would be better, but maybe we should add a comment and a test to document the behavior?

You can treat each loop as a function: for an inner loop, if a block is cold, it will probably put together with blocks of its outer loop. But this is also determined by whether that block is cold or not in the outer loop (e.g., its frequency is more than 20% of the frequency of the outer loop). That says, we keep passing this cold block to outer loops until it is not cold in that outer loop anymore. I have updated the comment and explained this. I also updated the test case with a nested loop for which a block is cold for the inner loop but not cold for the outer loop.

Should there be some minimum size for the block to be outlined? A very short block will still increase branch count, but not really affect cache locality. For those, the trade-off might be different.

If we consider not to outline cold but very short blocks, we should do it uniformly in other cases. Take a diamond branch for instance:

A; if (...) B; else C; D;

Suppose B is very cold but very short, should we layout this branch as ACBD..., or ACD...B?

One optimization I could think out is that we can put B as close as possible to D without affecting overall branch cost.

congh updated this revision to Diff 37201.Oct 12 2015, 5:31 PM
congh edited edge metadata.

Update the patch according to Daniel's comments.

Sorry for the delay -- I will try to find some time to review it.

David

Please put the block set computation code into a helper function -- LGTM with that. Also how do you come up with the 5/1 ratio?

congh added a comment.Nov 2 2015, 10:33 AM

Ping?

Please put the block set computation code into a helper function -- LGTM with that. Also how do you come up with the 5/1 ratio?

OK. The 5/1 ratio complies with the ratio that is used in selectBestSuccessor(): when the loop has only one iteration, this will lead to similar result if we treat the loop as a non-loop by removing back edges.

congh updated this revision to Diff 38968.Nov 2 2015, 12:43 PM

Update the patch according to David's comment.

This revision was automatically updated to reflect the committed changes.