This is an archive of the discontinued LLVM Phabricator instance.

[MBP] Partial tail duplication into hot predecessors
ClosedPublic

Authored by Carrot on Jan 24 2020, 3:58 PM.

Details

Summary

Current tail duplication embedded in MBP duplicates a BB into all or none of its predecessors without too much cost analysis. So sometimes it is duplicated into cold predecessors, and in other cases it may miss the duplication into hot predecessors.

This patch improves tail duplication in 3 aspects:

  1. A successor can be duplicated into part of its predecessors.
  2. A more fine-grained benefit analysis, combined with 1, now a successor is duplicated into hot predecessors only.
  3. If a successor can't be duplicated into one predecessor, it doesn't impact the duplication into other predecessors.

It doesn't impact the performance of spec2006int like many other code layout changes, spec doesn't have large instruction work set.
Performance test with two Google's internal search benchmarks shows obvious improvement on large binaries. I got 0.30% and 0.18% on them.

Diff Detail

Event Timeline

Carrot created this revision.Jan 24 2020, 3:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 24 2020, 3:58 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
xbolva00 added a subscriber: xbolva00.EditedJan 24 2020, 4:06 PM

New improvement looks cool, any data for clang itself (large binary too)?

davidxl added inline comments.Jan 28 2020, 9:43 AM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
1158

should the benefit analysis be done here instead of just returning true?

1820

can you explain this new condition?

3107

Add some description (and examples) showing the algorithm/cost model used here.

New improvement looks cool, any data for clang itself (large binary too)?

Without this patch
real 139m31.300s
user 130m0.326s
sys 6m33.313s

With this patch
real 139m1.414s
user 129m46.296s
sys 6m30.924s

Carrot updated this revision to Diff 240980.Jan 28 2020, 1:18 PM
Carrot marked 3 inline comments as done.
Carrot added inline comments.
llvm/lib/CodeGen/MachineBlockPlacement.cpp
1158

It's possible to call findDuplicateCandidates here. But it needs to be called again just before TailDup.tailDuplicateAndUpdate to provide the duplicate candidates.

1820

Previously if function repeatedlyTailDuplicateBlock returns true, BestSucc is duplicated into all predecessors and removed. So there is no BestSucc any more, so we goto the loop header to find another best successor for BB.

Now with partial duplication, BestSucc may or may not be duplicated into BB.

  • If BestSucc is duplicated into BB, thus the condition is true, BestSucc is not BB's successor, we need to find another successor of BB.
  • If BestSucc is not duplicated into BB, it falls through to the code to layout BestSucc after BB.
xur added a comment.Jan 28 2020, 2:50 PM

Thanks for working on this. Current Duplication in block-placement does need improvement. We notice that disable it often improved performance.

llvm/include/llvm/CodeGen/TailDuplicator.h
96

Need to update the comments for this function.
Nit: maybe move CandidatePtr before DuplicatePreds?

llvm/lib/CodeGen/MachineBlockPlacement.cpp
3035

It seems to me that findDuplicateCandidates() will not let CandidatePreds to be full Preds (it will put the first one to fallthrough -- which makes this condition always true).

3072

what does "most possible" mean here?

3084

add "excluding BB".

3104

Add some comments to explain the modeling.
The same value is computed in findDuplicationCandidates() i.e. BBDupThreshold-- should we refactor the code to share the same heuristic?

3107

As David said, need to have cost model description here.

I give my first impression here:
(1) should you check if BB can be placed after a Pred? Pred could be in the middle of the chain.
(2) It seems that you assume the Succ BBs will be chosen as fall-through by different Dupped BBs.
Should you check if SuccIt has already be placed? Also, if SuccIt can be duplicated, the assumption does not to hold.
(3) Pred fall-through can be handled better. this algorithm does not actually try to find the optimal fall through of BB. It passively get the fall-through when we cannot duplicate. Maybe we can refactor code better so the post fixup at line 3171 is not needed.

3160

I don't think this will ever be true.

3192

I think using profile count will get you the same thing here. Profile count is derived from frequency.

https://bugs.llvm.org/show_bug.cgi?id=44596

offtopic: Can you look at this issue? MBP fais here?

Carrot updated this revision to Diff 241470.Jan 30 2020, 8:10 AM
Carrot marked 9 inline comments as done.
Carrot added inline comments.
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3035

Only when Candidates.size() < Preds.size() is true then it changes the first predecessor to fall through.

If Candidates.size() == Preds.size(), the first predecessor will not be changed to fall through.

3072

It should be changed to "best".

3107

If BB can't be duplicated into Pred, isBestSuccessor checks if Pred is in the middle of a chain.
If BB can be duplicated into Pred, Pred has the only successor BB, it is extremely unlikely in the middle of a chain.

The main purpose of the loop is finding out beneficial duplications. If a predecessor has multiple successors we can't duplicate BB into it, even if the edge is hot. But we can make it fall through to BB to get the same benefit as tail duplication.
Change one duplication into fall through is only a plus for code size.

3160

Good catch.

3192

I mean compare the raw profile count with the hot number from profile summary information.

davidxl added inline comments.Feb 3 2020, 10:10 AM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3072

Add a comment explaining the scaling.

3137

it assumes *SuccIt is the layout successor (fall through bb). Is it always the case?

3167

"the benefit of duplicating into a predecessor is .."

3171

laid out

Carrot updated this revision to Diff 242226.Feb 3 2020, 4:01 PM
Carrot marked 4 inline comments as done.
Carrot added inline comments.
llvm/lib/CodeGen/MachineBlockPlacement.cpp
3137

No, *SuccIt may have other predecessors can fall through to it.
After sorting Succs, Succs.begin() has the largest probability among edges come out from BB. So *SuccIt has the largest chance to be laid after BB.

Consider other cases may improve the following cost analysis. But I think the extra benefit is very small with much more complicated implementation.

davidxl accepted this revision.Feb 11 2020, 1:25 PM

lgtm

This revision is now accepted and ready to land.Feb 11 2020, 1:25 PM
This revision was automatically updated to reflect the committed changes.