This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/lib/Transforms/IPO/
-
lib/
-
Transforms/
-
IPO/
-
HotColdSplitting.cpp

Differential D84468

[HotColdSplitting] Add SplittingDelta option to enable splitting more small blocks
Needs ReviewPublic

Authored by rjf on Jul 23 2020, 2:37 PM.

Download Raw Diff

Details

Reviewers

hiraditya
vsk

Summary

Add an option "hotcoldsplit-delta" that is set to 5 by default;
the delta value enables blocks with cost benefit-penalty>=-delta
be split, in addition to blocks with positive benefit-penalty
differences. We have found that on common workloads, a splitting
delta of 5 enables hot/cold splitting to split significantly more
cold blocks with no obvious performance regression observed.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rjf created this revision.Jul 23 2020, 2:37 PM

Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2020, 2:37 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

rcorcs added a subscriber: rcorcs.Jul 23 2020, 2:50 PM

Harbormaster failed remote builds in B65457: Diff 280263!Jul 23 2020, 3:08 PM

Changing this threshold can have a significant impact on code size. Artificially boosting the outlining 'benefit' can, in many instances, actually regress the size of function-undergoing-extraction (defeating the point of hot/cold splitting). Could you share some data, specifically:

Which projects were tested, at what optimization levels.
Code size, pre-patch vs. post-patch.
Code size (non-extracted functions only), pre-patch vs. post-patch.
% of functions which undergo hot/cold splitting pre-patch vs. post-patch.

This revision now requires changes to proceed.Jul 27 2020, 10:54 AM

Based on discussion with @hiraditya today we'll change the default to 0 (which stays the same) and add additional test cases for delta=5 (which come from the four failing cases here).

We should keep the default value same as before.

Change default SplittingDelta to zero.

Harbormaster completed remote builds in B66036: Diff 281263.Jul 28 2020, 9:42 AM

@rjf IIUC you have a test setup to evaluate setting the splitting boost to 5. Could you please share the statistics I outlined in my previous comment? This can inform the decision about whether to add the new option.

@vsk do you think having SplittingDelta will help other projects? Is it likely that some workloads may benefit from more/less outlining compared to default?

@rjf seems like the updated diff needs to be squashed with previous to see all the changes in one patch.

@vsk I'll post some relevant statistics on the distribution of benefit-penalty scores on our workloads first:

qemu and firefox compiled with -O2 optimization level and hotcoldsplit

For qemu, there are 4331 cold blocks detected in total, with 967 blocks with benefit-penalty difference in [-5, 0], in addition to 2892 blocks with positive benefit-penalty scores. See the attached file for a histogram.

Note that the histogram has been truncated to include blocks with benefit-penalty difference between -10 and 10. Overall, the greatest number of blocks have a score of 4.

For Firefox (mozilla-central), there are 152048 cold blocks detected in total, 79612 blocks with benefit-penalty difference in [-5,0], in addition to 69444 blocks with positive benefit-penalty scores.

@rjf thanks for sharing those numbers!

In my testing, I found that it's actually necessary to radically increase the outlining penalty in order to avoid pathological code size growth (see https://reviews.llvm.org/D59715 -- which, incidentally, I would still love to upstream -- review would be very much appreciated). This evaluation was done across several thousand projects at Apple (most of which are built at -Os, although some firmware is built at -Oz and the kernel at -O3).

A big potential problem with splitting out a block and replacing it with a call is that the replacement call can actually be more expensive to codegen than the extracted block. This 100% defeats the purpose of splitting. This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.

That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).

In D84468#2183505, @vsk wrote:

@rjf thanks for sharing those numbers!

In my testing, I found that it's actually necessary to radically increase the outlining penalty in order to avoid pathological code size growth (see https://reviews.llvm.org/D59715 -- which, incidentally, I would still love to upstream -- review would be very much appreciated). This evaluation was done across several thousand projects at Apple (most of which are built at -Os, although some firmware is built at -Oz and the kernel at -O3).

A big potential problem with splitting out a block and replacing it with a call is that the replacement call can actually be more expensive to codegen than the extracted block. This 100% defeats the purpose of splitting.

Do we have a cost model to know if the call maybe more expensive. D59715 does address one of the major concerns by checking number of parameters. But in some cases it may still be useful to outline when optimizations like merge-function is enabled. outlined functions which are identical can be de-duplicated and that could potentially reduce the code size.

This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.

Sorry never visited this patch. This is super useful! Please update the diff and I'll accept it.

That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).

Update diff to fix merge conflicts.

Harbormaster completed remote builds in B66578: Diff 282250.Jul 31 2020, 11:18 AM

@vsk Apologies for the late reply. Here is the data on:

Firefox Benchmark Data

-Os:

Delta	Size (including dynamic libraries)
0	2.188262032 GB
5	2.206931464 GB

-O3:

Delta	Size (including dynamic libraries)
-2	2.270277648 GB
0	2.247788640 GB
2	2.259242024 GB
5	2.270277648 GB

Benchmark data for talos-test perf-reftest across 5 runs:

Delta	perf-reftest elapsed time	icache misses/hit	pagefaults
-2	mean:979.154s,min:976.070s,med:979.677s	76252841384/1639147282882	9931997
0	mean:958.988s,min:957.816s,med:959.531s	76386933845/1651163241090	9998150
2	mean:971.755s,min:968.438s,med:972.438s	75050288347/1659695658920	9921934

wenlei added a subscriber: wenlei.Aug 12 2020, 11:24 AM

rjf mentioned this in D59715: [HotColdSplit] Reflect full cost of parameters in split penalty.Aug 28 2020, 10:26 PM

Revision Contents

Path

Size

llvm/

lib/

Transforms/

IPO/

HotColdSplitting.cpp

2 lines

Diff 281263

llvm/lib/Transforms/IPO/HotColdSplitting.cpp

Show First 20 Lines • Show All 80 Lines • ▼ Show 20 Lines	static cl::opt<bool> EnableStaticAnalyis("hot-cold-static-analysis",
cl::init(true), cl::Hidden);		cl::init(true), cl::Hidden);

static cl::opt<int>		static cl::opt<int>
SplittingThreshold("hotcoldsplit-threshold", cl::init(2), cl::Hidden,		SplittingThreshold("hotcoldsplit-threshold", cl::init(2), cl::Hidden,
cl::desc("Base penalty for splitting cold code (as a "		cl::desc("Base penalty for splitting cold code (as a "
"multiple of TCC_Basic)"));		"multiple of TCC_Basic)"));

static cl::opt<int>		static cl::opt<int>
SplittingDelta("hotcoldsplit-delta", cl::init(5), cl::Hidden,		SplittingDelta("hotcoldsplit-delta", cl::init(0), cl::Hidden,
cl::desc("Allowance threshold for blocks with "		cl::desc("Allowance threshold for blocks with "
"Benefit - Penalty >= -Delta to be split. "));		"Benefit - Penalty >= -Delta to be split. "));

namespace {		namespace {
// Same as blockEndsInUnreachable in CodeGen/BranchFolding.cpp. Do not modify		// Same as blockEndsInUnreachable in CodeGen/BranchFolding.cpp. Do not modify
// this function unless you modify the MBB version as well.		// this function unless you modify the MBB version as well.
//		//
/// A no successor, non-return block probably ends in unreachable and is cold.		/// A no successor, non-return block probably ends in unreachable and is cold.
▲ Show 20 Lines • Show All 660 Lines • Show Last 20 Lines