Page MenuHomePhabricator

[MBP] Avoid tail duplication if it can't bring benefit
Needs ReviewPublic

Authored by Carrot on Mon, Jul 8, 4:02 PM.

Details

Reviewers
davidxl
Summary

Current tail duplication integrated in bb layout is designed to increase the fallthrough from a BB's predecessor to its successor, but we have observed cases that duplication doesn't increase fallthrough, or it brings too much size overhead.

To overcome these two issues in function canTailDuplicateUnplacedPreds I add two checks:

  • make sure there is at least one duplication in current work set.
  • the number of duplication should not exceed the number of successors.

The modification in hasBetterLayoutPredecessor fixes a bug that potential predecessor must be at the bottom of a chain.

Diff Detail

Event Timeline

Carrot created this revision.Mon, Jul 8, 4:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Jul 8, 4:02 PM

Did you ran some benchmark? Can you share perf/codesize results?

Can you put an example (text CFG) showing cases where tailDup is bad -- possibly as a source comment?

Also some size saving and performance numbers would be helpful.

Carrot updated this revision to Diff 209359.Thu, Jul 11, 4:02 PM

Add a simple CFG example to show the inefficient tail duplication.

For our biggest object file which is actually generated from a protocol buffer, the file size is reduced by 1.2%.

It also increased the performance of our important internal benchmark.

What about CTMark? SPEC?

davidxl added inline comments.Fri, Jul 12, 9:58 AM
lib/CodeGen/MachineBlockPlacement.cpp
1075

Add comments for the two variables.

1125–1126

Early return is still valid here, right?

1139

Move the check earlier than the loop.

1145

This does not seem right. Dup was Pred1's fall through successor and Succ1 was Dup's fall through (without tailDup). Also assume Pred2 and Pred3 has one single successor (otherwise Dup can not be tailDup ed into them).

Here, we should look at reduced taken branches, not total fallthrough --- because once Dup is tail duped into Pred2 and Pred3, it is merged with them thus has an implicit fall through there.

Let the incoming frequency to Dup from Pred1,2,3 are F1, F2, and F3, and assume the Probability from Dup to Succ1 is P(S1). We also assume path insensitivity here in the calculation.

Before the tail duplication, the total taken branches are

T1 = F2 + F3 + (F1+F2+F3)*(1-P(S1))

After taildup, assuming Succ1 is a fall through of Pred1->Dup chain, and Succ2 is a fall through of Pred2, then the total taken branches is

T2 = F1 * (1-(P(S1)) + F2 * P(S1) + F3

so the difference is :

T1 - T2 = 2*(1-P(S1)) * F2 + (1-P(S1)) * F3 which is greater than 0.

Carrot updated this revision to Diff 209891.Mon, Jul 15, 9:46 AM
Carrot marked 4 inline comments as done.

Add comments.

lib/CodeGen/MachineBlockPlacement.cpp
1125–1126

Because of the following (Succ->succ_size() == 0) check, early return is not valid. It is the exact reason of introducing the variable Duplicate here.

1139

Because of the previous NumDup check, move this return earlier can cause different result.

1145

Current implementation considers only fallthrough from Dup's predecessor to its successors. It doesn't consider other factors such as taken branches and code size.

If we consider taken branches, in this example:

  • Dup is not duplicated into Pred3, total taken branches through Pred3 is Freq3 + taken branches to Succ1 or Succ2
  • Dup is duplicated into Pred3, at the end of Pred3 it needs two branches to either Succ1 or Succ2, and the total number of taken branches is Freq3

So duplicated into Pred3 really reduces taken branches. Similarly if Dup has more predecessors with single successor(Dup), tail duplicated Dup into them always reduces taken branches. Unfortunately Dup may have many predecessors, we have multiple cases with dozens of predecessors. Although duplicated Dup reduces taken branches, it also increases code size, reduces the efficiency of i-cache, and regress the performance finally.

So we need better method to model it. I can't see a simple trade off between code size and fallthroughs(taken branches) here, maybe we can apply machine learning here.

davidxl added inline comments.Tue, Jul 16, 3:36 PM
lib/CodeGen/MachineBlockPlacement.cpp
1138

Can you explain this check (succ_size() ==0 ) a little more?

1145

Some heuristics can be added. For instance of if the branch from Dup to Succ1,2 is highly biased, do not tail duplicate. If Pred3 (assuming it is the coldest) block is very cold, do not tail duplicate. Tunings like this can even be extended to 2 predecessor case -- with a good cost model + profile information, we can avoid bloating icache size.

Carrot marked 2 inline comments as done.Thu, Jul 18, 10:40 AM

What about CTMark? SPEC?

I ran spec2006int w/o and w/ this patch, the result is 38.7 vs 38.6. So basically no change.

lib/CodeGen/MachineBlockPlacement.cpp
1138

(Succ->succ_size() == 0) means Succ doesn't have successors, so it is the function exit BB.
This conditional return is to keep its current behavior, because it can impact too many test cases.
In future when we can duplicate BB into part of its predecessors, the exit BB can also be handled better.

1145

Unfortunately current integrated tail duplication can only duplicate BB to all its predecessors or none. So we can't consider Pred3 only. But I agree tail duplicate BB into part of its predecessors is valuable, it needs significant changes to current tail duplication framework, I plan to work on it in future.