This is an archive of the discontinued LLVM Phabricator instance.

a better profi rebalancer
ClosedPublic

Authored by spupyrev on Jan 10 2022, 1:35 PM.

Details

Summary

This is an extension of profi post-processing step that rebalances counts
in CFGs that have basic blocks w/o probes (aka "unknown" blocks). Specifically,
the new version finds many more "unknown" subgraphs and marks more "unknown"
basic blocks as hot (which prevents unwanted optimization passes).

I see up to 0.5% perf on some (large) binaries, e.g., clang-10 and gcc-8.

The algorithm is still linear and yields no build time overhead.

Diff Detail

Event Timeline

spupyrev created this revision.Jan 10 2022, 1:35 PM
spupyrev requested review of this revision.Jan 10 2022, 1:35 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 10 2022, 1:35 PM
spupyrev edited the summary of this revision. (Show Details)Jan 10 2022, 1:41 PM
spupyrev added reviewers: wenlei, hoy.
spupyrev added a reviewer: wlei.
spupyrev updated this revision to Diff 398952.Jan 11 2022, 7:45 AM

git-clang-format HEAD^

spupyrev updated this revision to Diff 399041.Jan 11 2022, 12:18 PM

another clang-format

hoy added a comment.Jan 11 2022, 5:25 PM

Thanks for the work. Nice to see 0.5% perf win on clang-10 and gcc-8!

llvm/lib/Transforms/Utils/SampleProfileInference.cpp
148

How did this value come up? Have you tried other values between 20 and 30?

556

Can this be relaxed? Say if the exit block of a subgraph has unknown weight, the block may never be rebalanced. Eg.

       A
    /     \
B(?)     C(?)
  |         |
 D        E
   \      /
      F (?)

Did I understanding correctly?

566

return false here if DstBlock != nullptr ?

573

The check NumIgnoredJumps > 0 isn't necessary since Block->SuccJumps must not be empty here.

583

Please add omment for SrcBlock and DstBlock.

613–614

nit: more compact without using continue:

if (!ignoreJump(SrcBlock, DstBlock, Jump))
   LocalInDegree[Jump->Target]++;
684–685

Can SrcBlock have unknown weight? It should have been filtered out by canRebalanceAtRoot.

I see up to 0.5% perf on some (large) binaries, e.g., clang-10 and gcc-8.

Great results! And the change also makes sense. Was this with AutoFDO or CSSPGO or InstrPGO? How does overall count quality look with this change?

We can also give it a try on server workloads.

I see up to 0.5% perf on some (large) binaries, e.g., clang-10 and gcc-8.

Great results! And the change also makes sense. Was this with AutoFDO or CSSPGO or InstrPGO? How does overall count quality look with this change?

We can also give it a try on server workloads.

The numbers are for CSSPGO. We don't have dangling/unknown blocks in AutoFDO so it is irrelevant there. For InstrPGO, we don't run profi, if I am not mistaken?

As discussed offline, let's wait for the second (related) diff to be ready before real-world testing. This diff alone might not be enough to see a difference.

spupyrev marked 2 inline comments as done.Jan 12 2022, 2:59 PM
spupyrev added inline comments.
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
148

This is supposed to be an "infinity" value. I noticed however that 2^20 isn't large enough for some extreme cases (huge CFGs with thousand of basic blocks and very hot loops). Hence, increasing the value to 2^30.

556

I am not sure I understand this particular example, what needs to be rebalanced here? If A, D, and E have known weights, then a solution is unique, no?

But in general you're right that instances with KnownDstBlocks.size() > 1 won't be rebalanced. We thought about this for a long time but the problem is difficult.

684–685

good point. The reason is UnknownBlocks contains SrcBlock in this function.

Let me try to reimplement this part or apply some renaming to avoid confusion

hoy added inline comments.Jan 13 2022, 12:15 PM
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
556

If A, D, and E have known weights, then a solution is unique, no?

Say if we are rebalancing between B and C, due to this check and the fact KnownDstBlocks will include D and E, it'll bail out here?

But in general you're right that instances with KnownDstBlocks.size() > 1 won't be rebalanced. We thought about this for a long time but the problem is difficult.

Can blocks in KnownDstBlocks be wired to a single artificial block which serves as the final only known dst? This is just my random thought. Hopefully the multi known case isn't too common.

spupyrev updated this revision to Diff 400029.Jan 14 2022, 8:29 AM
spupyrev marked 5 inline comments as done.

addressing hoy's comments

llvm/lib/Transforms/Utils/SampleProfileInference.cpp
556

Discussed this example offline. It seems we can slightly relax the condition here, but that only affects a single instance in clang binary and no instances in gcc binary. However, the adjustment makes the algorithm quadratic. Thus, keeping it as is.

hoy accepted this revision.Jan 14 2022, 12:41 PM

LGTM, thanks.

This revision is now accepted and ready to land.Jan 14 2022, 12:41 PM
spupyrev closed this revision.Jan 18 2022, 2:59 PM