This is an archive of the discontinued LLVM Phabricator instance.

[HotColdSplitting] Add SplittingDelta option to enable splitting more small blocks
Needs ReviewPublic

Authored by rjf on Jul 23 2020, 2:37 PM.

Details

Reviewers
hiraditya
vsk
Summary

Add an option "hotcoldsplit-delta" that is set to 5 by default;
the delta value enables blocks with cost benefit-penalty>=-delta
be split, in addition to blocks with positive benefit-penalty
differences. We have found that on common workloads, a splitting
delta of 5 enables hot/cold splitting to split significantly more
cold blocks with no obvious performance regression observed.

Diff Detail

Event Timeline

rjf created this revision.Jul 23 2020, 2:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2020, 2:37 PM
rcorcs added a subscriber: rcorcs.Jul 23 2020, 2:50 PM
vsk requested changes to this revision.Jul 27 2020, 10:54 AM
vsk added a subscriber: vsk.

Changing this threshold can have a significant impact on code size. Artificially boosting the outlining 'benefit' can, in many instances, actually regress the size of function-undergoing-extraction (defeating the point of hot/cold splitting). Could you share some data, specifically:

  • Which projects were tested, at what optimization levels.
  • Code size, pre-patch vs. post-patch.
  • Code size (non-extracted functions only), pre-patch vs. post-patch.
  • % of functions which undergo hot/cold splitting pre-patch vs. post-patch.
This revision now requires changes to proceed.Jul 27 2020, 10:54 AM
rjf added a comment.Jul 28 2020, 8:40 AM

Based on discussion with @hiraditya today we'll change the default to 0 (which stays the same) and add additional test cases for delta=5 (which come from the four failing cases here).

We should keep the default value same as before.

rjf updated this revision to Diff 281263.Jul 28 2020, 9:42 AM

Change default SplittingDelta to zero.

vsk added a comment.Jul 28 2020, 12:16 PM

@rjf IIUC you have a test setup to evaluate setting the splitting boost to 5. Could you please share the statistics I outlined in my previous comment? This can inform the decision about whether to add the new option.

hiraditya added a comment.EditedJul 28 2020, 3:32 PM

@vsk do you think having SplittingDelta will help other projects? Is it likely that some workloads may benefit from more/less outlining compared to default?

@rjf seems like the updated diff needs to be squashed with previous to see all the changes in one patch.

rjf added a comment.Jul 28 2020, 4:13 PM

@vsk I'll post some relevant statistics on the distribution of benefit-penalty scores on our workloads first:

qemu and firefox compiled with -O2 optimization level and hotcoldsplit

For qemu, there are 4331 cold blocks detected in total, with 967 blocks with benefit-penalty difference in [-5, 0], in addition to 2892 blocks with positive benefit-penalty scores. See the attached file for a histogram.


Note that the histogram has been truncated to include blocks with benefit-penalty difference between -10 and 10. Overall, the greatest number of blocks have a score of 4.

For Firefox (mozilla-central), there are 152048 cold blocks detected in total, 79612 blocks with benefit-penalty difference in [-5,0], in addition to 69444 blocks with positive benefit-penalty scores.

vsk added a comment.Jul 29 2020, 5:40 PM

@rjf thanks for sharing those numbers!

In my testing, I found that it's actually necessary to radically increase the outlining penalty in order to avoid pathological code size growth (see https://reviews.llvm.org/D59715 -- which, incidentally, I would still love to upstream -- review would be very much appreciated). This evaluation was done across several thousand projects at Apple (most of which are built at -Os, although some firmware is built at -Oz and the kernel at -O3).

A big potential problem with splitting out a block and replacing it with a call is that the replacement call can actually be more expensive to codegen than the extracted block. This 100% defeats the purpose of splitting. This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.

That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).

hiraditya added a comment.EditedJul 29 2020, 7:35 PM
In D84468#2183505, @vsk wrote:

@rjf thanks for sharing those numbers!

In my testing, I found that it's actually necessary to radically increase the outlining penalty in order to avoid pathological code size growth (see https://reviews.llvm.org/D59715 -- which, incidentally, I would still love to upstream -- review would be very much appreciated). This evaluation was done across several thousand projects at Apple (most of which are built at -Os, although some firmware is built at -Oz and the kernel at -O3).

A big potential problem with splitting out a block and replacing it with a call is that the replacement call can actually be more expensive to codegen than the extracted block. This 100% defeats the purpose of splitting.

Do we have a cost model to know if the call maybe more expensive. D59715 does address one of the major concerns by checking number of parameters. But in some cases it may still be useful to outline when optimizations like merge-function is enabled. outlined functions which are identical can be de-duplicated and that could potentially reduce the code size.

This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.

Sorry never visited this patch. This is super useful! Please update the diff and I'll accept it.

That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).

rjf updated this revision to Diff 282250.Jul 31 2020, 9:38 AM

Update diff to fix merge conflicts.

rjf added a comment.EditedAug 11 2020, 12:17 AM

@vsk Apologies for the late reply. Here is the data on:

Firefox Benchmark Data

-Os:

DeltaSize (including dynamic libraries)
02.188262032 GB
52.206931464 GB

-O3:

DeltaSize (including dynamic libraries)
-22.270277648 GB
02.247788640 GB
22.259242024 GB
52.270277648 GB

Benchmark data for talos-test perf-reftest across 5 runs:

Deltaperf-reftest elapsed timeicache misses/hitpagefaults
-2mean:979.154s,min:976.070s,med:979.677s76252841384/16391472828829931997
0mean:958.988s,min:957.816s,med:959.531s76386933845/16511632410909998150
2mean:971.755s,min:968.438s,med:972.438s75050288347/16596956589209921934
wenlei added a subscriber: wenlei.Aug 12 2020, 11:24 AM