Add an option "hotcoldsplit-delta" that is set to 5 by default;
the delta value enables blocks with cost benefit-penalty>=-delta
be split, in addition to blocks with positive benefit-penalty
differences. We have found that on common workloads, a splitting
delta of 5 enables hot/cold splitting to split significantly more
cold blocks with no obvious performance regression observed.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Changing this threshold can have a significant impact on code size. Artificially boosting the outlining 'benefit' can, in many instances, actually regress the size of function-undergoing-extraction (defeating the point of hot/cold splitting). Could you share some data, specifically:
- Which projects were tested, at what optimization levels.
- Code size, pre-patch vs. post-patch.
- Code size (non-extracted functions only), pre-patch vs. post-patch.
- % of functions which undergo hot/cold splitting pre-patch vs. post-patch.
Based on discussion with @hiraditya today we'll change the default to 0 (which stays the same) and add additional test cases for delta=5 (which come from the four failing cases here).
@rjf IIUC you have a test setup to evaluate setting the splitting boost to 5. Could you please share the statistics I outlined in my previous comment? This can inform the decision about whether to add the new option.
@vsk do you think having SplittingDelta will help other projects? Is it likely that some workloads may benefit from more/less outlining compared to default?
@rjf seems like the updated diff needs to be squashed with previous to see all the changes in one patch.
@vsk I'll post some relevant statistics on the distribution of benefit-penalty scores on our workloads first:
qemu and firefox compiled with -O2 optimization level and hotcoldsplit
For qemu, there are 4331 cold blocks detected in total, with 967 blocks with benefit-penalty difference in [-5, 0], in addition to 2892 blocks with positive benefit-penalty scores. See the attached file for a histogram.
Note that the histogram has been truncated to include blocks with benefit-penalty difference between -10 and 10. Overall, the greatest number of blocks have a score of 4.
For Firefox (mozilla-central), there are 152048 cold blocks detected in total, 79612 blocks with benefit-penalty difference in [-5,0], in addition to 69444 blocks with positive benefit-penalty scores.
@rjf thanks for sharing those numbers!
In my testing, I found that it's actually necessary to radically increase the outlining penalty in order to avoid pathological code size growth (see https://reviews.llvm.org/D59715 -- which, incidentally, I would still love to upstream -- review would be very much appreciated). This evaluation was done across several thousand projects at Apple (most of which are built at -Os, although some firmware is built at -Oz and the kernel at -O3).
A big potential problem with splitting out a block and replacing it with a call is that the replacement call can actually be more expensive to codegen than the extracted block. This 100% defeats the purpose of splitting. This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.
That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).
Do we have a cost model to know if the call maybe more expensive. D59715 does address one of the major concerns by checking number of parameters. But in some cases it may still be useful to outline when optimizations like merge-function is enabled. outlined functions which are identical can be de-duplicated and that could potentially reduce the code size.
This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.
Sorry never visited this patch. This is super useful! Please update the diff and I'll accept it.
That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).
@vsk Apologies for the late reply. Here is the data on:
Firefox Benchmark Data
-Os:
Delta | Size (including dynamic libraries) | |
---|---|---|
0 | 2.188262032 GB | |
5 | 2.206931464 GB | |
-O3:
Delta | Size (including dynamic libraries) | |
---|---|---|
-2 | 2.270277648 GB | |
0 | 2.247788640 GB | |
2 | 2.259242024 GB | |
5 | 2.270277648 GB | |
Benchmark data for talos-test perf-reftest across 5 runs:
Delta | perf-reftest elapsed time | icache misses/hit | pagefaults |
---|---|---|---|
-2 | mean:979.154s,min:976.070s,med:979.677s | 76252841384/1639147282882 | 9931997 |
0 | mean:958.988s,min:957.816s,med:959.531s | 76386933845/1651163241090 | 9998150 |
2 | mean:971.755s,min:968.438s,med:972.438s | 75050288347/1659695658920 | 9921934 |