This is an archive of the discontinued LLVM Phabricator instance.

[HotColdSplitting] Identify larger cold regions using domtree queries
ClosedPublic

Authored by vsk on Oct 23 2018, 5:43 PM.

Details

Summary

The current splitting algorithm works in three stages:

  1. Identify cold blocks, then
  2. Use forward/backward propagation to mark hot blocks, then
  3. Grow a SESE region of blocks *outside* of the set of hot blocks and start outlining.

While testing this pass on Apple internal frameworks I noticed that some
kinds of control flow (e.g. loops) are never outlined, even though they
unconditionally lead to / follow cold blocks. I noticed two other issues
related to how cold regions are identified:

  • An inconsistency can arise in the internal state of the hotness propagation stage, as a block may end up in both the ColdBlocks set and the HotBlocks set. Further inconsistencies can arise as these sets do not match what's in ProfileSummaryInfo.
  • It isn't necessary to limit outlining to single-exit regions.

This patch teaches the splitting algorithm to identify maximal cold
regions and outline them. A maximal cold region is defined as the set of
blocks post-dominated by a cold sink block, or dominated by that sink
block. This approach can successfully outline loops in the cold path. As
a side benefit, it maintains less internal state than the current
approach.

Due to a limitation in CodeExtractor, blocks within the maximal cold
region which aren't dominated by a single entry point (a so-called "max
ancestor") are filtered out.

Results:

  • X86 (LNT + -Os + externals): 134KB of TEXT were outlined compared to 47KB pre-patch, or a ~3x improvement. Did not see a performance impact across two runs.
  • AArch64 (LNT + -Os + externals + Apple-internal benchmarks): 149KB of TEXT were outlined. Ditto re: performance impact.
  • Outlining results improve marginally in the internal frameworks I tested.

Follow-ups:

  • Outline more than once per function, outline large single basic blocks, & try to remove unconditional branches in outlined functions.

Diff Detail

Repository
rL LLVM

Event Timeline

vsk created this revision.Oct 23 2018, 5:43 PM
vsk updated this revision to Diff 170800.Oct 23 2018, 6:15 PM
vsk edited the summary of this revision. (Show Details)
  • Before this patch, 47KB of text was outlined in a test-suite run (X86, -Os, LNT + externals). After this patch, 134KB was outlined, so roughly a ~3x improvement.
  • Fix a typo in a comment.
sebpop accepted this revision.Oct 23 2018, 10:09 PM

ok

llvm/lib/Transforms/IPO/HotColdSplitting.cpp
184 ↗(On Diff #170800)

When the MaxAncestor has no predecessors we may be able to outline the second largest cold region that has a predecessor. Maybe add a comment about this.

186 ↗(On Diff #170800)

Maybe add a FIXME note: the blocks not dominated by the max ancestor may be extracted as other smaller cold regions. Not sure how much benefit that would give...

221 ↗(On Diff #170800)

Agreed. I think the threshold can be pretty low given that the code is cold.
You could even try to remove the condition (ColdRegion.size() == 1) and see what happens: code size and perf.

This revision is now accepted and ready to land.Oct 23 2018, 10:09 PM
vsk marked 2 inline comments as done.Oct 24 2018, 10:43 AM

Thanks for the review.

llvm/lib/Transforms/IPO/HotColdSplitting.cpp
184 ↗(On Diff #170800)

Good point, I'll add a comment.

186 ↗(On Diff #170800)

Good point. Although.. marking outlined calls as noreturn when appropriate and outlining more than once per function could achieve most of the win.

221 ↗(On Diff #170800)

I have tried it out internally and saw a substantial increase in outlining, but haven't looked into performance impact yet. I'll send a separate patch when I have more data.

vsk planned changes to this revision.Oct 24 2018, 11:21 AM
vsk marked 2 inline comments as done.

I've been fuzzing+testing this patch further and found a verification failure in CodeExtractor. As we now support outlining regions with multiple exits, CodeExtractor cannot leave around duplicate incoming values from the codeRepl block. I think this should be simple to fix.

Test:

vsk updated this revision to Diff 170944.Oct 24 2018, 12:01 PM

Fixed a bug in CodeExtractor found by building the test suite with an asserts-enabled compiler.

CodeExtractor should assert that the codeReplacer block provides exactly one (unique) incoming value to any phi node with incoming values from the extracted region.

I'd appreciate another look at that change, just to make sure it's ok. I added a test in 'duplicate-phi-preds-crash.ll'. Thanks!

This revision is now accepted and ready to land.Oct 24 2018, 12:01 PM
vsk updated this revision to Diff 170954.Oct 24 2018, 12:32 PM
vsk added a reviewer: kuhar.
  • Disable a problematic CodeExtractor unit test. CodeExtractor miscompiles the code in the test. Any time there are live output values from an outlined region, there's a chance CodeExtractor will leave around a phi node with incoming values from a different function.
vsk updated this revision to Diff 170959.Oct 24 2018, 1:02 PM
sebpop accepted this revision.Oct 24 2018, 3:03 PM

The change looks good to me. Thanks!

This revision was automatically updated to reflect the committed changes.
llvm/trunk/test/Transforms/HotColdSplit/outline-while-loop.ll