This is an archive of the discontinued LLVM Phabricator instance.

Differential D101229

[InlineCost] Bump threshold for inlining cold callsites (PR50099)
Needs Review · Public

Authored by lebedev.ri on Apr 24 2021, 4:10 AM.

Details

Reviewers

aeubanks
eraman
Prazek
davidxl

Summary

I'm observing a fairly large runtime performance regression (about 13.8% in mean wall time)
as a result of the NewPM switch on one of RawSpeed's benchmarks:

raw.pixls.us-unique/Panasonic/DC-GH5S$ /repositories/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1
RUNNING: /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpk9pkqe2s
2021-04-24T14:00:17+03:00
Running /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench
Run on (32 X 3599.99 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.65, 0.51, 1.27
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
P1022085.RW2/threads:32/process_time/real_time_mean        0.748 ms         23.9 ms            9  0.0239231          31.9721   10.3933M       434.452M        13.8904G       41.801      1.33647k    748.25u
P1022085.RW2/threads:32/process_time/real_time_median      0.748 ms         23.9 ms            9  0.0239156          31.9716   10.3933M       434.585M        13.8934G      41.8138      1.33676k   748.079u
P1022085.RW2/threads:32/process_time/real_time_stddev      0.003 ms        0.080 ms            9   80.0846u         6.00073m          0       1.45335M        48.9162M     0.139834        4.7065   2.63684u
RUNNING: /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpt6ijfryg
2021-04-24T14:00:31+03:00
Running /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Run on (32 X 3600.05 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 5.54, 1.56, 1.61
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                      Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
P1022085.RW2/threads:32/process_time/real_time_mean        0.851 ms         27.2 ms            9  0.0272077          31.9615   10.3933M       382.027M        12.2102G      36.7569      1.17481k   851.271u
P1022085.RW2/threads:32/process_time/real_time_median      0.848 ms         27.1 ms            9  0.0271017          31.9699   10.3933M       383.494M        12.2598G      36.8981      1.17959k   847.755u
P1022085.RW2/threads:32/process_time/real_time_stddev      0.008 ms        0.243 ms            9   243.106u        0.0215795          0       3.38806M         116.08M     0.325984       11.1687   8.16022u
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
P1022085.RW2/threads:32/process_time/real_time_pvalue                 0.0004          0.0004      U Test, Repetitions: 9 vs 9
P1022085.RW2/threads:32/process_time/real_time_mean                  +0.1377         +0.1373             1             1            24            27
P1022085.RW2/threads:32/process_time/real_time_median                +0.1332         +0.1333             1             1            24            27
P1022085.RW2/threads:32/process_time/real_time_stddev                +2.0947         +2.0384             0             0             0             0
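
As a sanity check on the comparison table, the Time column appears to be the relative change (new - old) / old. The following is a minimal sketch that recomputes it from the WallTime,s values printed by the two runs; the function and constant names are made up for illustration and are not part of googlebenchmark.

```cpp
#include <cassert>
#include <cmath>

// Relative change of newValue with respect to oldValue.
// For timings, a positive result means the new build is slower.
constexpr double relativeChange(double oldValue, double newValue) {
    return (newValue - oldValue) / oldValue;
}

// Mean and median wall times in microseconds, copied from the
// WallTime,s column of the old and new runs.
constexpr double kOldMeanUs = 748.25;
constexpr double kNewMeanUs = 851.271;
constexpr double kOldMedianUs = 748.079;
constexpr double kNewMedianUs = 847.755;
```

Evaluating relativeChange(kOldMeanUs, kNewMeanUs) gives roughly 0.1377 and the median pair gives roughly 0.1332, which matches the +0.1377 and +0.1332 rows reported by compare.py.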

I've posted repro IR at https://bugs.llvm.org/show_bug.cgi?id=50099.
It happens because a certain destructor, which runs at the end of a certain function,
is no longer inlined, which prevents SROA of the class.

I guess this bisects to D28331, which added the lower threshold for cold calls,
with this comment:

In D28331, @eraman wrote:

> I've tuned the thresholds for the hot and cold callsites using a hacked up version of the old inliner that explicitly computes BFI on a set of internal benchmarks and spec. Once the new PM based pipeline stabilizes (IIRC Chandler mentioned there are known issues) I'll benchmark this again and adjust the thresholds if required.

Since the values haven't been changed since then, I guess that hasn't happened yet.

Analyzing the problem on top of D101228: inlining the destructor would cost 50, while the threshold is 45.
Bumping the threshold to 50 isn't sufficient, because the comparison against it is non-inclusive (is that intentional?),
so I rounded it up to 55.
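
The arithmetic can be illustrated with a toy model of the decision. This is a simplified stand-in, not the actual InlineCost.cpp logic; it only assumes, as stated in the summary, that the cost must be strictly below the threshold. The name wouldInline is hypothetical.

```cpp
#include <cassert>

// Toy model of the inlining decision (NOT the real LLVM code).
// Assumption taken from the summary: the comparison is strict,
// so a cost equal to the threshold is rejected.
constexpr bool wouldInline(int cost, int threshold) {
    return cost < threshold;
}

// Cost of 50 against the current cold threshold of 45: rejected.
static_assert(!wouldInline(50, 45), "cost 50 exceeds threshold 45");
// Raising the threshold to exactly 50 still rejects the call.
static_assert(!wouldInline(50, 50), "strict comparison rejects equality");
// The proposed threshold of 55 accepts it.
static_assert(wouldInline(50, 55), "cost 50 fits under threshold 55");
```

Under this model any threshold of 51 or more would accept a cost of 50; 55 is simply the rounded value chosen in this revision.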

This addresses the issue: inlining happens, SROA happens, and the perf is happy.

While this seems like the least problematic approach,
I think we may want to somehow boost inlining of callees
whose arguments are marked as likely to be SROA'ble.
Perhaps we should at least not apply this budget-lowering logic in such cases?

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
40 ms	x64 debian > LLVM.Transforms/Inline/X86::inline-cold-callsite.ll
80 ms	x64 windows > LLVM.Transforms/Inline/X86::inline-cold-callsite.ll

Event Timeline

lebedev.ri created this revision. · Apr 24 2021, 4:10 AM

Herald added subscribers: haicheng, hiraditya. · Apr 24 2021, 4:10 AM

lebedev.ri requested review of this revision. · Apr 24 2021, 4:10 AM

lebedev.ri added a parent revision: D101228: [InlineCost] CallAnalyzer: use TTI info for extractvalue - they are free (PR50099).

Harbormaster completed remote builds in B100747: Diff 340259. · Apr 24 2021, 4:42 AM

lebedev.ri mentioned this in D101231: [RFC][InlineCost] Don't count the cost of truly exceptionally unlikely blocks (PR50099). · Apr 24 2021, 6:53 AM

davidxl added a comment.

Changing the default threshold needs lots of benchmarking.

For this particular case, IMO the better way is to enhance the inline cost
analysis to give a callsite a larger bonus if inlining it enables SROA in the call context.
The analysis needs to be careful: if there is another callsite
that blocks SROA, and that callsite can never be inlined, then the bonus
cannot be applied.

David

In D101229#2714853, @davidxl wrote:

> Changing the default threshold needs lots of benchmarking.
>
> For this particular case, IMO the better way is to enhance inline cost
> analysis to give callsite more bonus if it enables SROA in call context.

I have thought about that too, yes.

> The analysis needs to be careful such that if there is another callsite
> that blocks SROA, and that callsites can never be inlined, then the bonus
> can not be applied.

So when inlining a call to curr_callee(arg) from entry(),
and we've deduced that arg is an alloca within entry(),
we need to run an analysis on entry() and verify that the alloca
is not used by anything that would prevent SROA; that much is obvious to me.

The caveat that is still a little murky to me is *how* specifically
we should deal with the cases where the alloca is passed as an argument
to some other callee. I don't suppose we want to actually recurse into it?
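
One conservative reading of the check discussed above can be sketched as a toy model. Everything here is hypothetical: the AllocaUse categories and sroaBonusApplies are invented for illustration and do not correspond to LLVM APIs, and treating other possibly-inlinable callees optimistically (without recursing into them) is only one possible answer to the open question, not something this thread settles.

```cpp
#include <cassert>
#include <vector>

// Hypothetical classification of how the caller uses one alloca.
enum class AllocaUse {
    LoadOrStore,        // direct access; does not block SROA
    CurrentCallee,      // passed to the callee being considered for inlining
    InlinableCallee,    // passed to another callee that might still be inlined
    NonInlinableCallee, // passed to a callee that can never be inlined
    OtherEscape         // any other use that lets the address escape
};

// Returns true if a hypothetical "enables SROA" bonus could be applied.
// The bonus is refused as soon as one use blocks SROA for good, and it
// is only granted if the alloca really reaches the current callee.
bool sroaBonusApplies(const std::vector<AllocaUse> &uses) {
    bool reachesCurrentCallee = false;
    for (AllocaUse use : uses) {
        switch (use) {
        case AllocaUse::NonInlinableCallee:
        case AllocaUse::OtherEscape:
            return false; // SROA stays blocked regardless of this inlining
        case AllocaUse::CurrentCallee:
            reachesCurrentCallee = true;
            break;
        case AllocaUse::LoadOrStore:
        case AllocaUse::InlinableCallee:
            break; // assumed harmless in this sketch, no recursion
        }
    }
    return reachesCurrentCallee;
}
```

For example, an alloca that is only loaded, stored, and passed to the current callee would qualify, while one that is also passed to a callee that can never be inlined would not.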

David

Revision Contents

Path	Size
llvm/lib/Analysis/InlineCost.cpp	4 lines
llvm/test/Transforms/Inline/X86/inline-cold-callsite.ll	37 lines

Diff 340259

llvm/lib/Analysis/InlineCost.cpp

[62 lines not shown; enclosing declaration: static cl::opt<int> InlineThreshold(]
     cl::desc("Control the amount of inlining to perform (default = 225)"));

 static cl::opt<int> HintThreshold(
     "inlinehint-threshold", cl::Hidden, cl::init(325), cl::ZeroOrMore,
     cl::desc("Threshold for inlining functions with inline hint"));

 static cl::opt<int>
     ColdCallSiteThreshold("inline-cold-callsite-threshold", cl::Hidden,
-                          cl::init(45), cl::ZeroOrMore,
+                          cl::init(55), cl::ZeroOrMore,
                           cl::desc("Threshold for inlining cold callsites"));

 static cl::opt<bool> InlineEnableCostBenefitAnalysis(
     "inline-enable-cost-benefit-analysis", cl::Hidden, cl::init(false),
     cl::desc("Enable the cost-benefit analysis for the inliner"));

 static cl::opt<int> InlineSavingsMultiplier(
     "inline-savings-multiplier", cl::Hidden, cl::init(8), cl::ZeroOrMore,
     cl::desc("Multiplier to multiply cycle savings by during inlining"));

 static cl::opt<int>
     InlineSizeAllowance("inline-size-allowance", cl::Hidden, cl::init(100),
                         cl::ZeroOrMore,
                         cl::desc("The maximum size of a callee that get's "
                                  "inlined without sufficient cycle savings"));

 // We introduce this threshold to help performance of instrumentation based
 // PGO before we actually hook up inliner with analysis passes such as BPI and
 // BFI.
 static cl::opt<int> ColdThreshold(
-    "inlinecold-threshold", cl::Hidden, cl::init(45), cl::ZeroOrMore,
+    "inlinecold-threshold", cl::Hidden, cl::init(55), cl::ZeroOrMore,
     cl::desc("Threshold for inlining functions with cold attribute"));

 static cl::opt<int>
     HotCallSiteThreshold("hot-callsite-threshold", cl::Hidden, cl::init(3000),
                          cl::ZeroOrMore,
                          cl::desc("Threshold for hot callsites "));

 static cl::opt<int> LocallyHotCallSiteThreshold(
[2,727 lines not shown]

llvm/test/Transforms/Inline/X86/inline-cold-callsite.ll

This file was added.

; RUN: opt < %s -inline -debug-only=inline-cost -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck %s
; RUN: opt < %s -passes='cgscc(inline)' -debug-only=inline-cost -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck %s

; REQUIRES: asserts

; Check the threshold for inlining cold callsites.

; CHECK: Analyzing call of cold_callee... (caller:caller)
; CHECK-NEXT: Cold callsite
; CHECK-NEXT: define void @cold_callee() {
; CHECK-NEXT: ; cost before = -30, cost after = -30, threshold before = 51, threshold after = 51, cost delta = 0
; CHECK-NEXT: ret void
; CHECK-NEXT: }

declare void @hot_callee()

define void @cold_callee() {
  ret void
}

define void @caller(i1 %c) {
entry:
  br i1 %c, label %hot, label %cold, !prof !0

hot:
  call void @hot_callee()
  br label %end

cold:
  call void @cold_callee()
  br label %end

end:
  ret void
}

!0 = !{!"branch_weights", i32 1000000, i32 1}