This is an archive of the discontinued LLVM Phabricator instance.

Differential D84468

[HotColdSplitting] Add SplittingDelta option to enable splitting more small blocks
Needs Review · Public

Authored by rjf on Jul 23 2020, 2:37 PM.

Details

Reviewers

hiraditya
vsk

Summary

Add an option "hotcoldsplit-delta" that is set to 5 by default.
The delta value allows blocks with benefit - penalty >= -delta to
be split, in addition to blocks with a positive benefit - penalty
difference. We have found that, on common workloads, a splitting
delta of 5 lets hot/cold splitting split significantly more cold
blocks, with no obvious performance regression observed.
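
The acceptance rule in this summary can be sketched as a standalone predicate. This is only an illustration: `shouldSplit` is a hypothetical helper name, not a function in the pass. The comparison mirrors the check in the diff, which rejects a region when `OutliningBenefit + SplittingDelta <= OutliningPenalty`, so a region appears to be accepted only when benefit - penalty is strictly greater than -delta (a difference of exactly -delta is rejected).

```cpp
// Hypothetical helper (not part of LLVM): true when a cold region would be
// accepted for splitting under the rule in this patch. The patch returns
// early when Benefit + Delta <= Penalty, so acceptance requires
// Benefit - Penalty > -Delta.
constexpr bool shouldSplit(int Benefit, int Penalty, int Delta) {
  return Benefit + Delta > Penalty;
}

// Delta = 0 matches the pre-patch rule: only positive differences pass.
static_assert(shouldSplit(3, 2, 0), "difference +1 passes with delta 0");
static_assert(!shouldSplit(2, 2, 0), "difference 0 fails with delta 0");

// Delta = 5 additionally admits differences in (-5, 0].
static_assert(shouldSplit(2, 2, 5), "difference 0 passes with delta 5");
static_assert(shouldSplit(1, 5, 5), "difference -4 passes with delta 5");
static_assert(!shouldSplit(0, 5, 5), "difference -5 still fails with delta 5");
```

Writing the rule as `Benefit + Delta > Penalty` keeps delta = 0 identical to the existing behavior, which is why the default can be changed without affecting current users.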

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rjf created this revision. Jul 23 2020, 2:37 PM

Herald added a project: Restricted Project. Jul 23 2020, 2:37 PM

Herald added a subscriber: llvm-commits.

rcorcs added a subscriber: rcorcs. Jul 23 2020, 2:50 PM

Harbormaster failed remote builds in B65457: Diff 280263! Jul 23 2020, 3:08 PM

Changing this threshold can have a significant impact on code size. Artificially boosting the outlining 'benefit' can, in many instances, actually regress the size of function-undergoing-extraction (defeating the point of hot/cold splitting). Could you share some data, specifically:

Which projects were tested, at what optimization levels.
Code size, pre-patch vs. post-patch.
Code size (non-extracted functions only), pre-patch vs. post-patch.
% of functions which undergo hot/cold splitting pre-patch vs. post-patch.

This revision now requires changes to proceed. Jul 27 2020, 10:54 AM

Based on discussion with @hiraditya today, we'll change the default to 0 (which keeps the existing behavior) and add test cases for delta=5 (derived from the four failing cases here).

We should keep the default value the same as before.

Change default SplittingDelta to zero.

Harbormaster completed remote builds in B66036: Diff 281263. Jul 28 2020, 9:42 AM

@rjf IIUC you have a test setup to evaluate setting the splitting boost to 5. Could you please share the statistics I outlined in my previous comment? This can inform the decision about whether to add the new option.

@vsk do you think having SplittingDelta will help other projects? Is it likely that some workloads may benefit from more/less outlining compared to default?

@rjf it seems the updated diff needs to be squashed with the previous one so that all the changes appear in one patch.

@vsk I'll post some relevant statistics on the distribution of benefit-penalty scores on our workloads first:

Both qemu and Firefox were compiled at the -O2 optimization level with hotcoldsplit enabled.

For qemu, 4331 cold blocks are detected in total: 967 have a benefit-penalty difference in [-5, 0], in addition to 2892 with a positive benefit-penalty score. See the attached file for a histogram.

Note that the histogram has been truncated to include blocks with benefit-penalty difference between -10 and 10. Overall, the greatest number of blocks have a score of 4.

For Firefox (mozilla-central), 152048 cold blocks are detected in total: 79612 have a benefit-penalty difference in [-5, 0], in addition to 69444 with a positive benefit-penalty score.
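
To put the block counts above in proportion, the following sketch computes the share of detected cold blocks that pass with delta = 0 and an upper bound on the share that could pass with delta = 5. The struct and function names are illustrative only. The second figure is an upper bound because the reported bucket [-5, 0] includes blocks at exactly -5, which the check in the diff would still reject. With the numbers from this thread, the shares come out to roughly 67% and 89% for qemu, and roughly 46% and 98% for Firefox.

```cpp
// Block counts as reported in this thread; names are illustrative only.
struct BlockCounts {
  long Total;     // cold blocks detected
  long Positive;  // benefit - penalty > 0
  long NearZero;  // benefit - penalty in [-5, 0]
};

constexpr BlockCounts Qemu{4331, 2892, 967};
constexpr BlockCounts Firefox{152048, 69444, 79612};

// Share of detected cold blocks that pass the check with delta = 0.
constexpr double baselineShare(BlockCounts C) {
  return static_cast<double>(C.Positive) / C.Total;
}

// Upper bound on the share that could pass with delta = 5: blocks at exactly
// -5 are counted in NearZero but are rejected by the strict comparison.
constexpr double upperBoundShare(BlockCounts C) {
  return static_cast<double>(C.Positive + C.NearZero) / C.Total;
}

// Blocks that stay unsplit even with delta = 5 (difference below -5).
constexpr long remainingBlocks(BlockCounts C) {
  return C.Total - C.Positive - C.NearZero;
}
```

These are counts of candidate blocks only; they say nothing about the resulting code size, which is the metric the reviewers ask for.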

@rjf thanks for sharing those numbers!

In my testing, I found that it's actually necessary to radically increase the outlining penalty in order to avoid pathological code size growth (see https://reviews.llvm.org/D59715 -- which, incidentally, I would still love to upstream -- review would be very much appreciated). This evaluation was done across several thousand projects at Apple (most of which are built at -Os, although some firmware is built at -Oz and the kernel at -O3).

A big potential problem with splitting out a block and replacing it with a call is that the replacement call can actually be more expensive to codegen than the extracted block. This 100% defeats the purpose of splitting. This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.

That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).
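
The concern above can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the weights are made-up placeholders, and neither `toyPenalty` nor `toyShouldSplit` corresponds to LLVM's actual `getOutliningPenalty` or to the model proposed in D59715. It only shows how a penalty that grows with the number of inputs and outputs interacts with a delta "boost".

```cpp
// Toy cost model. All constants are hypothetical placeholders, not values
// taken from LLVM or from D59715.
struct ToyWeights {
  int BasePenalty;  // fixed cost of replacing the region with a call
  int PerInput;     // cost of materializing one argument at the call site
  int PerOutput;    // cost of passing one value back out of the region
};

constexpr int toyPenalty(ToyWeights W, int NumInputs, int NumOutputs) {
  return W.BasePenalty + W.PerInput * NumInputs + W.PerOutput * NumOutputs;
}

// Same shape as the check in this patch: reject when
// Benefit + Delta <= Penalty.
constexpr bool toyShouldSplit(int Benefit, int Penalty, int Delta) {
  return Benefit + Delta > Penalty;
}

constexpr ToyWeights Weights{2, 1, 3};

// Region worth 6 units, one input, no outputs: penalty 3, split either way.
static_assert(toyShouldSplit(6, toyPenalty(Weights, 1, 0), 0), "");

// Same benefit, three inputs and one output: penalty 8. Rejected with
// delta = 0, but accepted with delta = 5 even though the modeled call costs
// more than the extracted region saves.
static_assert(!toyShouldSplit(6, toyPenalty(Weights, 3, 1), 0), "");
static_assert(toyShouldSplit(6, toyPenalty(Weights, 3, 1), 5), "");
```

In this model a positive delta admits exactly the regions whose modeled call overhead exceeds their benefit, which is why measured `__text` size is a more reliable signal than the number of blocks extracted.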

In D84468#2183505, @vsk wrote:

@rjf thanks for sharing those numbers!

In my testing, I found that it's actually necessary to radically increase the outlining penalty in order to avoid pathological code size growth (see https://reviews.llvm.org/D59715 -- which, incidentally, I would still love to upstream -- review would be very much appreciated). This evaluation was done across several thousand projects at Apple (most of which are built at -Os, although some firmware is built at -Oz and the kernel at -O3).

A big potential problem with splitting out a block and replacing it with a call is that the replacement call can actually be more expensive to codegen than the extracted block. This 100% defeats the purpose of splitting.

Do we have a cost model to know whether the call may be more expensive? D59715 does address one of the major concerns by checking the number of parameters. But in some cases it may still be useful to outline when optimizations like merge-function are enabled. Outlined functions that are identical can be de-duplicated, and that could potentially reduce the code size.

This happens because, depending on what exactly is being extracted, there can be a (large) number of inputs/outputs to the extraction region, and these ratchet up register pressure at the extraction site. This is what https://reviews.llvm.org/D59715 tries to account for.

Sorry, I never visited this patch. This is super useful! Please update the diff and I'll accept it.

That is why I'm pushing for getting more hard __text section code size numbers (not simply # blocks extracted, as this can be misleading). In the experiments I've done in the past, I got the strong impression that we needed to make splitting _less_ aggressive (hence D59715).

Update diff to fix merge conflicts.

Harbormaster completed remote builds in B66578: Diff 282250. Jul 31 2020, 11:18 AM

@vsk Apologies for the late reply. Here is the data on:

Firefox Benchmark Data

-Os:

Delta	Size (including dynamic libraries)
0	2.188262032 GB
5	2.206931464 GB

-O3:

Delta	Size (including dynamic libraries)
-2	2.270277648 GB
0	2.247788640 GB
2	2.259242024 GB
5	2.270277648 GB

Benchmark data for talos-test perf-reftest across 5 runs:

Delta	perf-reftest elapsed time	icache misses/hit	pagefaults
-2	mean:979.154s,min:976.070s,med:979.677s	76252841384/1639147282882	9931997
0	mean:958.988s,min:957.816s,med:959.531s	76386933845/1651163241090	9998150
2	mean:971.755s,min:968.438s,med:972.438s	75050288347/1659695658920	9921934
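
The relative changes implied by the tables above can be worked out with a short sketch. The constants are copied from the tables; the helper and constant names are illustrative only. Relative to delta = 0, the figures come out to roughly +0.85% size at -Os with delta = 5, roughly +0.51% and +1.0% size at -O3 with delta = 2 and delta = 5, and roughly +1.3% and +2.1% mean perf-reftest time with delta = 2 and delta = -2.

```cpp
// Percentage change of Value relative to Baseline.
constexpr double percentChange(double Baseline, double Value) {
  return (Value - Baseline) / Baseline * 100.0;
}

// Binary sizes in GB (including dynamic libraries), from the tables above.
constexpr double OsDelta0 = 2.188262032;
constexpr double OsDelta5 = 2.206931464;
constexpr double O3Delta0 = 2.247788640;
constexpr double O3Delta2 = 2.259242024;
constexpr double O3Delta5 = 2.270277648;

// Mean perf-reftest elapsed time in seconds, from the table above.
constexpr double TimeDeltaNeg2 = 979.154;
constexpr double TimeDelta0 = 958.988;
constexpr double TimeDelta2 = 971.755;

constexpr double OsSizeGrowth = percentChange(OsDelta0, OsDelta5);
constexpr double O3SizeGrowthDelta2 = percentChange(O3Delta0, O3Delta2);
constexpr double O3SizeGrowthDelta5 = percentChange(O3Delta0, O3Delta5);
constexpr double TimeGrowthDelta2 = percentChange(TimeDelta0, TimeDelta2);
constexpr double TimeGrowthDeltaNeg2 = percentChange(TimeDelta0, TimeDeltaNeg2);
```

Note that the -O3 table lists the same size for delta = -2 and delta = 5, so the size change for delta = -2 computed this way equals the one for delta = 5.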

wenlei added a subscriber: wenlei. Aug 12 2020, 11:24 AM

rjf mentioned this in D59715: [HotColdSplit] Reflect full cost of parameters in split penalty. Aug 28 2020, 10:26 PM

Revision Contents

Path	Size
llvm/lib/Transforms/IPO/HotColdSplitting.cpp	13 lines
llvm/test/Transforms/HotColdSplit/X86/split-delta-5.ll	216 lines
llvm/test/Transforms/HotColdSplit/assumption-cache-invalidation-delta-5.ll	50 lines

Diff 282250

llvm/lib/Transforms/IPO/HotColdSplitting.cpp

[79 unchanged lines not shown]
 static cl::opt<bool> EnableStaticAnalyis("hot-cold-static-analysis",
                               cl::init(true), cl::Hidden);

 static cl::opt<int>
 SplittingThreshold("hotcoldsplit-threshold", cl::init(2), cl::Hidden,
                    cl::desc("Base penalty for splitting cold code (as a "
                             "multiple of TCC_Basic)"));

+static cl::opt<int>
+SplittingDelta("hotcoldsplit-delta", cl::init(0), cl::Hidden,
+               cl::desc("Allowance threshold for blocks with "
+                        "Benefit - Penalty >= -Delta to be split. "));
+
 namespace {
 // Same as blockEndsInUnreachable in CodeGen/BranchFolding.cpp. Do not modify
 // this function unless you modify the MBB version as well.
 //
 /// A no successor, non-return block probably ends in unreachable and is cold.
 /// Also consider a block that ends in an indirect branch to be a return block,
 /// since many targets use plain indirect branches to return.
 bool blockEndsInUnreachable(const BasicBlock &BB) {
[216 unchanged lines not shown] Function *HotColdSplitting::extractColdRegion(
   // splitting.
   SetVector<Value *> Inputs, Outputs, Sinks;
   CE.findInputsOutputs(Inputs, Outputs, Sinks);
   int OutliningBenefit = getOutliningBenefit(Region, TTI);
   int OutliningPenalty =
       getOutliningPenalty(Region, Inputs.size(), Outputs.size());
   LLVM_DEBUG(dbgs() << "Split profitability: benefit = " << OutliningBenefit
                     << ", penalty = " << OutliningPenalty << "\n");
-  if (OutliningBenefit <= OutliningPenalty)
+  // SplittingDelta gives OutliningBenefit a "boost" by allowing
+  // certain blocks with small Benefit-Penalty differences to be split.
+  // Empirical evidence indicates that setting SplittingDelta to a small
+  // number in [1, 5] can lead to a sizeable improvement in the overall
+  // number of cold regions extracted.
+  if (OutliningBenefit + SplittingDelta <= OutliningPenalty)
     return nullptr;

   Function *OrigF = Region[0]->getParent();
   if (Function *OutF = CE.extractCodeRegion(CEAC)) {
     User *U = *OutF->user_begin();
     CallInst *CI = cast<CallInst>(U);
     NumColdRegionsOutlined++;
     if (TTI.useColdCCForColdCall(*OutF)) {
[418 unchanged lines not shown]

llvm/test/Transforms/HotColdSplit/X86/split-delta-5.ll

This file was added.

				; RUN: opt -hotcoldsplit -hotcoldsplit-threshold=2 -S < %s | FileCheck %s

				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-apple-macosx10.14.0"

				; Check that these functions are not split. Outlined functions are called from a
				; basic block named codeRepl.

				; The cold region is too small to split.
				; CHECK-LABEL: @foo
				; CHECK-NOT: foo.cold.1
				define void @foo() {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				unreachable

				if.end: ; preds = %entry
				ret void
				}

				; The cold region is still too small to split.
				; CHECK-LABEL: @bar
				; CHECK-NOT: bar.cold.1
				define void @bar() {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				ret void

				if.end: ; preds = %entry
				ret void
				}

				; Make sure we don't try to outline the entire function.
				; CHECK-LABEL: @fun
				; CHECK-NOT: fun.cold.1
				define void @fun() {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				br label %if.end

				if.end: ; preds = %entry
				ret void
				}

				; Make sure we don't try to outline the entire function, especially when the
				; entry block is cold.
				; CHECK: define void @cold_entry_block() [[COLD_ATTR:#[0-9]+]]
				; CHECK-NOT: cold_entry_block.cold.1
				define void @cold_entry_block() {
				entry:
				call void @sink()
				ret void
				}

				; Do not split `noinline` functions.
				; CHECK-LABEL: @noinline_func
				; CHECK-NOT: noinline_func.cold.1
				define void @noinline_func() noinline {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				br label %if.end

				if.end: ; preds = %entry
				ret void
				}

				; Do not split `alwaysinline` functions.
				; CHECK-LABEL: @alwaysinline_func
				; CHECK-NOT: alwaysinline_func.cold.1
				define void @alwaysinline_func() alwaysinline {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				br label %if.end

				if.end: ; preds = %entry
				ret void
				}

				; Don't outline infinite loops.
				; CHECK-LABEL: @infinite_loop
				; CHECK-NOT: infinite_loop.cold.1
				define void @infinite_loop() {
				entry:
				br label %loop

				loop:
				call void @sink()
				br label %loop
				}

				; Don't count debug intrinsics towards the outlining threshold.
				; CHECK-LABEL: @dont_count_debug_intrinsics
				; CHECK-NOT: dont_count_debug_intrinsics.cold.1
				define void @dont_count_debug_intrinsics(i32 %arg1) !dbg !6 {
				entry:
				%var = add i32 0, 0, !dbg !11
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				ret void

				if.end: ; preds = %entry
				call void @llvm.dbg.value(metadata i32 %arg1, metadata !9, metadata !DIExpression()), !dbg !11
				call void @llvm.dbg.value(metadata i32 %arg1, metadata !9, metadata !DIExpression()), !dbg !11
				call void @sink()
				ret void
				}

				; CHECK-LABEL: @sanitize_address
				; CHECK-NOT: sanitize_address.cold.1
				define void @sanitize_address() sanitize_address {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				ret void

				if.end: ; preds = %entry
				ret void
				}

				; CHECK-LABEL: @sanitize_hwaddress
				; CHECK-NOT: sanitize_hwaddress.cold.1
				define void @sanitize_hwaddress() sanitize_hwaddress {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				ret void

				if.end: ; preds = %entry
				ret void
				}

				; CHECK-LABEL: @sanitize_thread
				; CHECK-NOT: sanitize_thread.cold.1
				define void @sanitize_thread() sanitize_thread {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				ret void

				if.end: ; preds = %entry
				ret void
				}

				; CHECK-LABEL: @sanitize_memory
				; CHECK-NOT: sanitize_memory.cold.1
				define void @sanitize_memory() sanitize_memory {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @sink()
				ret void

				if.end: ; preds = %entry
				ret void
				}

				declare void @llvm.trap() cold noreturn

				; CHECK-LABEL: @nosanitize_call
				; CHECK-NOT: nosanitize_call.cold.1
				define void @nosanitize_call() sanitize_memory {
				entry:
				br i1 undef, label %if.then, label %if.end

				if.then: ; preds = %entry
				call void @llvm.trap(), !nosanitize !2
				unreachable

				if.end: ; preds = %entry
				ret void
				}

				; CHECK: attributes [[COLD_ATTR]] = { {{.*}}cold

				declare void @llvm.dbg.value(metadata, metadata, metadata)

				declare void @sink() cold

				!llvm.dbg.cu = !{!0}
				!llvm.debugify = !{!3, !4}
				!llvm.module.flags = !{!5}

				!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "debugify", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug, enums: !2)
				!1 = !DIFile(filename: "<stdin>", directory: "/")
				!2 = !{}
				!3 = !{i32 7}
				!4 = !{i32 1}
				!5 = !{i32 2, !"Debug Info Version", i32 3}
				!6 = distinct !DISubprogram(name: "dont_count_debug_intrinsics", linkageName: "dont_count_debug_intrinsics", scope: null, file: !1, line: 1, type: !7, isLocal: false, isDefinition: true, scopeLine: 1, isOptimized: true, unit: !0, retainedNodes: !8)
				!7 = !DISubroutineType(types: !2)
				!8 = !{!9}
				!9 = !DILocalVariable(name: "1", scope: !6, file: !1, line: 1, type: !10)
				!10 = !DIBasicType(name: "ty32", size: 32, encoding: DW_ATE_unsigned)
				!11 = !DILocation(line: 1, column: 1, scope: !6)

llvm/test/Transforms/HotColdSplit/assumption-cache-invalidation-delta-5.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt -S -instsimplify -hotcoldsplit -debug < %s 2>&1 | FileCheck %s
				; RUN: opt -instcombine -hotcoldsplit -instsimplify %s -o /dev/null

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64"

				%a = type { i64, i64 }
				%b = type { i64 }

				; CHECK: @f
				; CHECK-LABEL: codeRepl:
				; CHECK-NOT: @llvm.assume
				; CHECK: }
				; CHECK: declare {{.*}}@llvm.assume
				; CHECK: define {{.*}}@f.cold.1(i64 %0)
				; CHECK-LABEL: newFuncRoot:
				; CHECK: %1 = icmp eq i64 %0, 0
				; CHECK-NOT: call void @llvm.assume

				define void @f() {
				entry:
				%0 = getelementptr inbounds %a, %a* null, i64 0, i32 1
				br label %label

				label:
				%1 = bitcast i64* %0 to %b**
				%2 = load %b*, %b** %1, align 8
				%3 = getelementptr inbounds %b, %b* %2, i64 undef, i32 0
				%4 = load i64, i64* %3, align 8
				%5 = icmp ugt i64 %4, 1
				br i1 %5, label %if.then, label %if.else

				if.then:
				unreachable

				if.else:
				call void @g(i8* undef)
				%6 = load i64, i64* undef, align 8
				%7 = and i64 %6, -16
				%8 = inttoptr i64 %7 to i8*
				%9 = icmp eq i64 %4, 0
				call void @llvm.assume(i1 %9)
				unreachable
				}

				declare void @g(i8*)

				declare void @llvm.assume(i1)