This is an archive of the discontinued LLVM Phabricator instance.

[MBP] Avoid tail duplication if it can't bring benefit
ClosedPublic

Authored by Carrot on Jul 8 2019, 4:02 PM.

Details

Summary

Current tail duplication integrated in bb layout is designed to increase the fallthrough from a BB's predecessor to its successor, but we have observed cases that duplication doesn't increase fallthrough, or it brings too much size overhead.

To overcome these two issues in function canTailDuplicateUnplacedPreds I add two checks:

  • make sure there is at least one duplication in current work set.
  • the number of duplication should not exceed the number of successors.

The modification in hasBetterLayoutPredecessor fixes a bug that potential predecessor must be at the bottom of a chain.

Diff Detail

Event Timeline

Carrot created this revision.Jul 8 2019, 4:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2019, 4:02 PM

Did you ran some benchmark? Can you share perf/codesize results?

Can you put an example (text CFG) showing cases where tailDup is bad -- possibly as a source comment?

Also some size saving and performance numbers would be helpful.

Carrot updated this revision to Diff 209359.Jul 11 2019, 4:02 PM

Add a simple CFG example to show the inefficient tail duplication.

For our biggest object file which is actually generated from a protocol buffer, the file size is reduced by 1.2%.

It also increased the performance of our important internal benchmark.

What about CTMark? SPEC?

davidxl added inline comments.Jul 12 2019, 9:58 AM
lib/CodeGen/MachineBlockPlacement.cpp
1075 ↗(On Diff #209359)

Add comments for the two variables.

1123 ↗(On Diff #209359)

Early return is still valid here, right?

1137 ↗(On Diff #209359)

Move the check earlier than the loop.

1143 ↗(On Diff #209359)

This does not seem right. Dup was Pred1's fall through successor and Succ1 was Dup's fall through (without tailDup). Also assume Pred2 and Pred3 has one single successor (otherwise Dup can not be tailDup ed into them).

Here, we should look at reduced taken branches, not total fallthrough --- because once Dup is tail duped into Pred2 and Pred3, it is merged with them thus has an implicit fall through there.

Let the incoming frequency to Dup from Pred1,2,3 are F1, F2, and F3, and assume the Probability from Dup to Succ1 is P(S1). We also assume path insensitivity here in the calculation.

Before the tail duplication, the total taken branches are

T1 = F2 + F3 + (F1+F2+F3)*(1-P(S1))

After taildup, assuming Succ1 is a fall through of Pred1->Dup chain, and Succ2 is a fall through of Pred2, then the total taken branches is

T2 = F1 * (1-(P(S1)) + F2 * P(S1) + F3

so the difference is :

T1 - T2 = 2*(1-P(S1)) * F2 + (1-P(S1)) * F3 which is greater than 0.

Carrot updated this revision to Diff 209891.Jul 15 2019, 9:46 AM
Carrot marked 4 inline comments as done.

Add comments.

lib/CodeGen/MachineBlockPlacement.cpp
1123 ↗(On Diff #209359)

Because of the following (Succ->succ_size() == 0) check, early return is not valid. It is the exact reason of introducing the variable Duplicate here.

1137 ↗(On Diff #209359)

Because of the previous NumDup check, move this return earlier can cause different result.

1143 ↗(On Diff #209359)

Current implementation considers only fallthrough from Dup's predecessor to its successors. It doesn't consider other factors such as taken branches and code size.

If we consider taken branches, in this example:

  • Dup is not duplicated into Pred3, total taken branches through Pred3 is Freq3 + taken branches to Succ1 or Succ2
  • Dup is duplicated into Pred3, at the end of Pred3 it needs two branches to either Succ1 or Succ2, and the total number of taken branches is Freq3

So duplicated into Pred3 really reduces taken branches. Similarly if Dup has more predecessors with single successor(Dup), tail duplicated Dup into them always reduces taken branches. Unfortunately Dup may have many predecessors, we have multiple cases with dozens of predecessors. Although duplicated Dup reduces taken branches, it also increases code size, reduces the efficiency of i-cache, and regress the performance finally.

So we need better method to model it. I can't see a simple trade off between code size and fallthroughs(taken branches) here, maybe we can apply machine learning here.

davidxl added inline comments.Jul 16 2019, 3:36 PM
lib/CodeGen/MachineBlockPlacement.cpp
1143 ↗(On Diff #209359)

Some heuristics can be added. For instance of if the branch from Dup to Succ1,2 is highly biased, do not tail duplicate. If Pred3 (assuming it is the coldest) block is very cold, do not tail duplicate. Tunings like this can even be extended to 2 predecessor case -- with a good cost model + profile information, we can avoid bloating icache size.

1138 ↗(On Diff #209891)

Can you explain this check (succ_size() ==0 ) a little more?

Carrot marked 2 inline comments as done.Jul 18 2019, 10:40 AM

What about CTMark? SPEC?

I ran spec2006int w/o and w/ this patch, the result is 38.7 vs 38.6. So basically no change.

lib/CodeGen/MachineBlockPlacement.cpp
1143 ↗(On Diff #209359)

Unfortunately current integrated tail duplication can only duplicate BB to all its predecessors or none. So we can't consider Pred3 only. But I agree tail duplicate BB into part of its predecessors is valuable, it needs significant changes to current tail duplication framework, I plan to work on it in future.

1138 ↗(On Diff #209891)

(Succ->succ_size() == 0) means Succ doesn't have successors, so it is the function exit BB.
This conditional return is to keep its current behavior, because it can impact too many test cases.
In future when we can duplicate BB into part of its predecessors, the exit BB can also be handled better.

davidxl added inline comments.Sep 11 2019, 8:49 AM
lib/CodeGen/MachineBlockPlacement.cpp
1069 ↗(On Diff #209891)

Can the code be restructured here so that it picks and returns the top two most profittable predecessors and perform tail dup into them?

Carrot marked an inline comment as done.Nov 4 2019, 4:04 PM
Carrot added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1069 ↗(On Diff #209891)

Current tail duplication can do it for 0 or all its predecessors. Duplicate a BB into hot predecessors only is in future plan.

davidxl added inline comments.Dec 3 2019, 11:44 AM
lib/CodeGen/MachineBlockPlacement.cpp
1164 ↗(On Diff #209891)

Can you add a // FIXME here to add support to tail dup to some of the predecessors.

Carrot updated this revision to Diff 232362.Dec 5 2019, 8:49 AM
Carrot marked an inline comment as done.
davidxl accepted this revision.Dec 5 2019, 9:21 AM

lgtm

This revision is now accepted and ready to land.Dec 5 2019, 9:21 AM
This revision was automatically updated to reflect the committed changes.