This is an archive of the discontinued LLVM Phabricator instance.

CodeGen: Allow small copyable blocks to "break" the CFG.
ClosedPublic

Authored by iteratee on Jan 11 2017, 3:07 PM.

Details

Summary

When choosing the best successor for a block, ordinarily we would have preferred
a block that preserves the CFG unless there is a strong probability the other
direction. For small blocks that can be duplicated we now skip that requirement
as well, subject to some simple frequency calculations.

Diff Detail

Event Timeline

iteratee updated this revision to Diff 84034.Jan 11 2017, 3:07 PM
iteratee retitled this revision from to CodeGen: Allow small copyable blocks to "break" the CFG..
iteratee updated this object.
iteratee added a reviewer: davidxl.
iteratee set the repository for this revision to rL LLVM.
iteratee updated this object.Jan 11 2017, 3:08 PM
iteratee edited edge metadata.
iteratee updated this revision to Diff 84036.Jan 11 2017, 3:15 PM
iteratee added a reviewer: arsenm.
iteratee removed rL LLVM as the repository for this revision.
iteratee updated this revision to Diff 84053.Jan 11 2017, 4:34 PM
iteratee edited edge metadata.

Realized that one of the calculations I did was only valid for D28522. Re-worked the calculation for now, and will rebase and update the calculation there.

davidxl edited edge metadata.Jan 12 2017, 9:20 AM

I like the direction (with more precise cost analysis) this is going. Will review the code soon.

davidxl added inline comments.Jan 13 2017, 10:04 AM
lib/CodeGen/MachineBlockPlacement.cpp
676

In this case, what is needed to to invoke 'hasBetterLayoutPredecessor' on PDom block. Dependinng the result, we will know that without tailDup, the layout order is Succ-> PDom or Succ->D->PDom. This will make the cost computation more precise.

davidxl added inline comments.Jan 13 2017, 10:04 AM
lib/CodeGen/MachineBlockPlacement.cpp
431

Suggest new name : isProfitableToTailDup

634

Dom -> PDom

641

Why not just check if there exists a SuccSucc that post dominates Succ directly?

660

PU + PV == P

Also Assuming U > V, the layout order (with tail dup) on should be

BB --> Succ --> D

so the overall cost is:

Q + P V + Q ( which is smaller than Q + QV + PU + PV)

661

We can not assume BB and Succ are in a triangular shape subcfg here -- given where this method is called. Besides, if Succ is not tail-duped, the layout decision may even reject Succ as the layout successor, so the cost is no longer P + V, but 2*Q + V instead (with U > V).

In other words, isProfitable check can not be done inside 'hasBetterLayoutPredecessor', but hoisted to the caller of it when 'hasBetterLayoutPrecessor' returns, at which point we will know the layout decision if taildup does not kick in.

iteratee updated this revision to Diff 84159.Jan 13 2017, 3:42 PM
iteratee edited edge metadata.
iteratee marked an inline comment as done.

Updated the cost calculation to not rely on the lattice layout.
This resulted in fewer duplications in tests, so those tests changes have been rolled into D28522

iteratee updated this revision to Diff 85070.Jan 19 2017, 5:25 PM

I made the calculations in terms of frequency instead of probability.

I adjusted the cost calculation when there is a post dominator based on whether it will be laid out after Succ or not.

Let me know if there are any cost calculations that you think are wrong.

iteratee updated this revision to Diff 85073.Jan 19 2017, 5:29 PM
iteratee marked 3 inline comments as done.

Actually upload the diff with what I said was in the last one:
Use frequency instead of probability

Use slight lookahead for more precise probability calculations.

Let me know what you think. There is a small cleanup that could go in as a separate patch: I switched to a SmallDenseSet because we don't need the orderedness of the SmallVectorSet.

lib/CodeGen/MachineBlockPlacement.cpp
660

I thought that too. But without the lattice patch, after duplication, we won't put D after Succ because it now has an unplaced predecessor. The lattice patch fixes the behavior and the calculation.

661

I think I have the calculation right for when Succ would not be the layout successor, but you're right to point out that we should also do the calculation even when Succ is the chosen successor.

661

We now only call this function to check if we should use Succ despite it having been rejected. So we know that Succ is not the layout successor.

iteratee updated this revision to Diff 85084.Jan 19 2017, 6:28 PM

Tidied comments and spacing.

davidxl added inline comments.Jan 20 2017, 2:04 PM
lib/CodeGen/MachineBlockPlacement.cpp
280

There is a reason SmallSetVector is used here -- to make sure the iteration order is deterministic.

653

I assume this is loop back edge source block. You need a test case to cover it.

664

Why break here?

669

nit: -->SuccBestPred

677

Computing BestSuccPred here is unnecessary. See below for more comments.

686

Qin is not necessarily BestSuccPred.

Profitability check is called only after hasBetterLayoutPredecessor is returned and it returns true. There are two scenarios it returns true

  1. Qin or Qout is larger than P, or
  2. P is larger than Qout, but not the branch is not biased enough such that the layout algorithm still decides to keep the top-order.

Either way, the baseline layout to compare (with taildup) is that BB->Succ is the branch taken edge, and BB->C is the fall through edge. Qin should just be Prob(BB->C)

702

PDom is always a successor of Succ according to the way it is computed.

708

The base cost is as wrote, the DupCost however depends on whether P > Q or not.

If P > Q, the fallthrough path is BB->Succ->D
so the cost (normalized with freq(bb) ==1) is

2*Q+ P*V

If P < Q, the fall through path is BB->C'->D
the cost is

2*P + Q*V
875

Add more description about what blocks to ignore.

I'll be glad to add some more comments to explain, but I think the calculations are correct. I've commented individually.

lib/CodeGen/MachineBlockPlacement.cpp
280

BlockFilterSet is never iterated. I checked.

661

I think I have the calculation right for when Succ would not be the layout successor, but you're right to point out that we should also do the calculation even when Succ is the chosen successor.

664

Because if PDom is not null, that's all that we look at for the probability calculation.

686

When we place Succ, we remove 2 fallthrough edges BB->C and C'->Succ.

Freq(C'->Succ) may be larger than Freq(BB->C).
I am using Qin to represent Freq(C'->Succ) and Qout for Freq(BB->C). I could just use different letters if that were more clear.

Qout is Freq(BB->C). I don't think Qin should be as well.

702

Thanks.

708

This function is called in a loop looking for the highest probability successor. If Q > P, this function will be ignored and we will lay out Q anyway, so we can ignore the second case.

As to the first case:
Until the 2nd patch lands, the duplication will prevent the BB->Succ->D layout.

Instead you will get BB->Succ ; C'->D
So the cost is as calculated.
D28522 will include an update to this calculation along with an update to the behavior.

875

Well, that's really up to the caller. Do you want me to list why you might want to ignore a block?

davidxl added inline comments.Jan 20 2017, 4:19 PM
lib/CodeGen/MachineBlockPlacement.cpp
280

See

for (MachineBasicBlock *LoopBB : LoopBlockSet)

fillWorkLists(LoopBB, UpdatedPreds, &LoopBlockSet);
686

differentiate Qin and Qout is fine, but in the code Qin = BestSuccPred which could be Freq(BB->Succ). What I meant is you should directly compute Qin as its definition Freq(C'->Succ)

708

You are right about Q > P case that that scenario will be dropped. It is very subtle, so please add some comment to clarify.

Ok -- for the first case, also add a comment

875

something like : e.g, when called under xxx, we want to ignore yyy. See caller zzz for details. However, see my comment in the function, this parameter seems unnecessary.

1007

I think it is equivalent to check Pred == BB. In normal calling context, this is covered by BlockToChain[Pred] == &Chain, but for lookahead case, it is needed to filter BB which is not laid out yet.

iteratee updated this revision to Diff 85437.Jan 23 2017, 11:36 AM

Improved comments based on review.

Please mark addressed comments as done. Also let me know if it is ready for another round of review (I saw some issues not addressed such as the deterministic iteration of block filter set).

iteratee updated this revision to Diff 85451.Jan 23 2017, 2:03 PM
iteratee marked 17 inline comments as done.

Missed a comment to rename something.

Please mark addressed comments as done. Also let me know if it is ready for another round of review (I saw some issues not addressed such as the deterministic iteration of block filter set).

Marked.

I think it's ready, and I put back the deterministic set.

lib/CodeGen/MachineBlockPlacement.cpp
686

Did you still want me to fix something here?

davidxl added inline comments.Jan 23 2017, 2:39 PM
lib/CodeGen/MachineBlockPlacement.cpp
653

test case for this?

683

Add a short cut here with comments:

// If P is not larger, the best successor selection loop will eventually select C, not Succ (as it is not profitable to do so).
if (P <= Qout)

return false;
686

just add a comment above Qin decl stating that Qin is the largest frequency of Succ's incoming edges which have not been placed.

1008

--> ... for lookhead by isProfitableToTailDup when BB has not yet been placed.

iteratee updated this revision to Diff 85518.Jan 23 2017, 7:11 PM
iteratee marked 4 inline comments as done.

More comments from review, and a new test case.

This version looks almost fine except for one remaining unaddressed comment.

lib/CodeGen/MachineBlockPlacement.cpp
683

How about this comment? Early return can 1) speed up the computation and 2) make the following code easier to understand.

iteratee added inline comments.Jan 23 2017, 8:33 PM
lib/CodeGen/MachineBlockPlacement.cpp
653

It's not just a back edge. I added a test case.

683

If we weren't estimating Qout, I'd agree. Instead we'll skip calling this altogether if we know that we won't use the result.

Sorry. I'd replied to the comment, but Phabricator didn't submit it along with my diff update for some reason.

iteratee updated this revision to Diff 85634.Jan 24 2017, 2:41 PM

Save the blocks with CFG violations that are duplication candidates. Review them in descending order of probability, so we call isProfitableToTailDup the minimum number of times.

davidxl added inline comments.Jan 24 2017, 4:43 PM
lib/CodeGen/MachineBlockPlacement.cpp
1091

Remove the first 'SuccProb > BestProb' check -- it provides only very tiny compile time win depending on the iteration order, but adds more confusion.

1108

no need to set ShouldTailDup in the loop -- it is already initalized outside.

1116

Why not just stable sort it? The vector should be of size 1 for most of the cases. Also why do you need position ?

1126

Should it break instead?

1128

isProfitableToTailDup assumes the baseline layout does not pick Succ. The assumption may not be true here as there are other two possibilities:

  1. Succ == BestSucc.BB in the base layout
  2. BestSucc.BB == null in the base layout (all BB's successors have conflicts).

In such two cases, isProfitable check should probably be skipped (as it is benefitial)

iteratee updated this revision to Diff 85666.Jan 24 2017, 5:19 PM
iteratee marked 4 inline comments as done.

Changes from comments:
Just sort the vector instead of make_heap.
If there is a tail duplication opportunity and no other successor, take it.

lib/CodeGen/MachineBlockPlacement.cpp
1116

Will just sort the vector.

Position is because we rely on the successor order being stable and the first successor being a subtle hint. Without the position, we lose track of whether the block in the vector came before or after the block we picked without tail duplication.

1128

Succ won't equal BestSucc.BB because of the continue. These blocks were not chosen by the first loop by construction.

Good catch. I'll add that.

iteratee updated this revision to Diff 86081.Jan 27 2017, 11:01 AM

Per offline discussion, I removed the ordering constraint for blocks that are profitable to tail-duplicate.

This resulted in a lot of test churn, but the source change is relatively small.

This looks very clean now.

However the amount of churns remind me of one thing. Since the profit computation is based on static branch prediction (without PGO), it is the right thing to do to be a little more conservative in taildup. In other words, instead of making 'isProtifiable' return true when the taildup cost is smaller than baseline cost, add a predefined margin (controlled by a parameter):

if (baseline_cost - taildup_cost > threshold)

return true;

return false;

The threshold also roughly models the side effect of taildup -- increased icache footprint etc due to code size increase.

iteratee updated this revision to Diff 86340.Jan 30 2017, 1:57 PM

Compare frequencies with a small bias against the tail-duplication side to account for increased icache pressure.

Includes a TODO to handle edge frequencies better in general.

davidxl added inline comments.Jan 30 2017, 4:33 PM
lib/CodeGen/MachineBlockPlacement.cpp
151

perhaps simplify it to tail-dup-penalty ?

620

This basically treats the penality percent parameter as the threshold of normalized improvement: (A-B)/B

if ((A-B)/B > PenaltyPercent/100)

return true;

The problem with this formula is that if B is very hot, it makes (A-B)/B become small, even though the (A-B) is still large.

So I think it is better to compute the normalized improvement as (A-B)/Entry_Freq

basically the improvement relative to the entry frequency. This will help prevent tail dup from happening in very cold paths.

The implementation can makes use of BranchProbablity as well. Suppose we want to implement condition:

if ( (A-B)/Entry_Freq > P/100)
     return true;

do this 3 lines:

BlockFrequency Profit = A - B;
BlockFrequency Threshold = Entry_Freq  * BranchProbability(P, 100);

return Profit > Threshold;
iteratee updated this revision to Diff 86381.Jan 30 2017, 6:44 PM

Use a percentage of the entry frequency as a cutoff.

davidxl added inline comments.Jan 30 2017, 7:23 PM
lib/CodeGen/MachineBlockPlacement.cpp
156

Is this default value too low? Increase it 5 or 10 perhaps?

625

I suppose this logic here is for rounding errors or overflow? Can you explain why the simple scaling with branch prob (in BranchProbablity.cpp) does not work?

return Gain > EntryFreq*ThresholdProb;

iteratee updated this revision to Diff 86468.Jan 31 2017, 11:23 AM

Simplify the biased comparison.

iteratee marked 4 inline comments as done.Jan 31 2017, 11:36 AM
iteratee added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
156

No, I think we should leave it. Now that it's a flag it's easy to change, and especially comparing with the entry frequency 2% is a big enough margin.

625

I did the math, and found a way to do it simply.

davidxl accepted this revision.Jan 31 2017, 1:45 PM

lgtm

(I only sampled some test case changes which look reasonable)

test/CodeGen/X86/bt.ll
27–32

This test has not changed in behavior. Better to revert the change.

This revision is now accepted and ready to land.Jan 31 2017, 1:45 PM
iteratee marked an inline comment as done.Jan 31 2017, 1:48 PM
iteratee added inline comments.
test/CodeGen/X86/bt.ll
27–32

I'll do a complete check for any tests that fall into this category and revert them.

This revision was automatically updated to reflect the committed changes.