This is an archive of the discontinued LLVM Phabricator instance.

Differential D101231

[RFC][InlineCost] Don't count the cost of truly exceptionally unlikely blocks (PR50099)
Needs Review · Public

Authored by lebedev.ri on Apr 24 2021, 6:53 AM.

Details

Reviewers

aeubanks
eraman
Prazek
davidxl
kazu
mtrofin
chandlerc

Summary

Cost-benefit analysis already does something like this, but differently:
it is off by default and needs a profile (as opposed to being happy with static weights).

In PR50099 i reported that the NewPM switch introduced a rather large regression in one benchmark.
That happens because a certain destructor call is determined to be cold,
so it is given a smaller inlining budget (45), while its inlining cost measures ~55.

We could either raise budgets, or lower costs.
D101228, and potentially D101229, do the former.
Here i propose to investigate one approach for the latter.

A large portion of that destructor is exception handling,
and as per block frequency it is *exceptionally* unlikely to execute.
So i propose to introduce an impossible-code-rel-freq option,
defaulting to 2 parts per million (the smallest value that does the job)
of the function's entry frequency: if a block's frequency
is less than that, the block is treated as impossible to execute,
and the costs of its instructions are not added.
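
The proposed check can be sketched outside of LLVM as a plain integer comparison. This is a minimal model, not the patch itself: `isImpossibleBlock` and `kPartsPerMillion` are hypothetical names, and the frequencies are assumed to fit in 32 bits so that the 64-bit products cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone model of the proposed check; not LLVM's API.
constexpr std::uint64_t kPartsPerMillion = 1'000'000;

// Returns true when blockFreq is strictly below thresholdPpm parts per
// million of entryFreq, i.e. blockFreq / entryFreq < thresholdPpm / 1e6.
// Assumption of this sketch: thresholdPpm <= 1'000'000, so both products
// stay well inside 64 bits.
bool isImpossibleBlock(std::uint32_t blockFreq, std::uint32_t entryFreq,
                       std::uint32_t thresholdPpm = 2) {
  return static_cast<std::uint64_t>(blockFreq) * kPartsPerMillion <
         static_cast<std::uint64_t>(entryFreq) * thresholdPpm;
}
```

Cross-multiplying avoids a division, so a block sitting exactly at the threshold is not treated as impossible.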

Does this sound completely insane? Thoughts?

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lebedev.ri created this revision. · Apr 24 2021, 6:53 AM

Herald added subscribers: haicheng, hiraditya. · Apr 24 2021, 6:53 AM

lebedev.ri requested review of this revision. · Apr 24 2021, 6:53 AM

lebedev.ri retitled this revision from [RFC][InlineCost] Don't count the cost of trully exceptionally unlikely blocks (PR50099) to [RFC][InlineCost] Don't count the cost of truly exceptionally unlikely blocks (PR50099).

lebedev.ri added a parent revision: D101228: [InlineCost] CallAnalyzer: use TTI info for extractvalue - they are free (PR50099).

Possibly I'm misunderstanding what is proposed here, but aren't these kind of very rarely executed blocks exactly what we do not want to inline? This would make sense in the context of something like partial inlining, where we could inline the hot parts without the cold parts (and thus not cost the cold parts), but for regular inlining this means we duplicate cold code, increasing code size and icache pressure.

In D101231#2714557, @nikic wrote:

Possibly I'm misunderstanding what is proposed here, but aren't these kind of very rarely executed blocks exactly what we do not want to inline? This would make sense in the context of something like partial inlining, where we could inline the hot parts without the cold parts (and thus not cost the cold parts), but for regular inlining this means we duplicate cold code, increasing code size and icache pressure.

Is that not what i asked in the description?

Right now i'm simply enumerating possible solutions in hope that at least one will fit.

Harbormaster completed remote builds in B100754: Diff 340272. · Apr 24 2021, 7:33 AM

In D101231#2714564, @lebedev.ri wrote:

In D101231#2714557, @nikic wrote:

Possibly I'm misunderstanding what is proposed here, but aren't these kind of very rarely executed blocks exactly what we do not want to inline? This would make sense in the context of something like partial inlining, where we could inline the hot parts without the cold parts (and thus not cost the cold parts), but for regular inlining this means we duplicate cold code, increasing code size and icache pressure.

Is that not what i asked in the description?

Right now i'm simply enumerating possible solutions in hope that at least one will fit.

FWIW what i'm doing here is actually apparently already suggested in a FIXME:

bool CallAnalyzer::visitUnreachableInst(UnreachableInst &I) {
  // FIXME: It might be reasonably to discount the cost of instructions leading
  // to unreachable as they have the lowest possible impact on both runtime and
  // code size.
  return true; // No actual code is needed for unreachable.
}

from rL197215 @ https://github.com/llvm/llvm-project/blob/269b335bd7332cd0d13451260d408dc9fcbcb5b1/llvm/lib/Analysis/InlineCost.cpp#L2066-L2071

Do we know why the destructor call is determined to be cold in the first place? Is there something that can be done to improve static branch prediction?

I believe this patch itself will likely help improve performance of programs with EH (without PGO) at the cost of increased code size in general.

In D101231#2714842, @davidxl wrote:

Do we know why the destructor call is determined to be cold in the first place? Is there something that can be done to improve static branch prediction?

I've posted full unoptimized repro IR at https://bugs.llvm.org/show_bug.cgi?id=50099
I believe that determination is correct. Said destructor call is only reachable through landingpads,
which are obviously modelled as being quite cold.

I believe this patch itself will likely help improve performance of programs with EH (without PGO) at the cost of increased code size in general.

That's the idea, yes. I see two alternatives: just bump the threshold (D101229),
or give a bonus for each alloca (that is suspected to be SROA'ble) that is passed as an argument into the callee function.
There is a problem with the latter approach: such cold callsites currently disable *all* bonuses.

If there are other alternatives i'd love to hear about them.

lebedev.ri mentioned this in D101468: [Passes] Run sinking/hoisting in SimplifyCFG earlier. · Apr 28 2021, 10:34 AM

lebedev.ri mentioned this in D104870: [SimplifyCFG] Tail-merging all blocks with `unreachable` terminator. · Jul 1 2021, 12:00 PM

Revision Contents

Path	Size
llvm/lib/Analysis/InlineCost.cpp	45 lines
llvm/test/Transforms/Inline/X86/impossible-block.ll	70 lines

Diff 340272

llvm/lib/Analysis/InlineCost.cpp

[74 lines hidden] static cl::opt<bool> InlineEnableCostBenefitAnalysis(
     "inline-enable-cost-benefit-analysis", cl::Hidden, cl::init(false),
     cl::desc("Enable the cost-benefit analysis for the inliner"));

 static cl::opt<int> InlineSavingsMultiplier(
     "inline-savings-multiplier", cl::Hidden, cl::init(8), cl::ZeroOrMore,
     cl::desc("Multiplier to multiply cycle savings by during inlining"));

 static cl::opt<int>
-    InlineSizeAllowance("inline-size-allowance", cl::Hidden, cl::init(100),
+    InlineCostAllowance("inline-size-allowance", cl::Hidden, cl::init(100),
                         cl::ZeroOrMore,
                         cl::desc("The maximum size of a callee that get's "
                                  "inlined without sufficient cycle savings"));

 // We introduce this threshold to help performance of instrumentation based
 // PGO before we actually hook up inliner with analysis passes such as BPI and
 // BFI.
 static cl::opt<int> ColdThreshold(
     "inlinecold-threshold", cl::Hidden, cl::init(45), cl::ZeroOrMore,
     cl::desc("Threshold for inlining functions with cold attribute"));

 static cl::opt<int>
     HotCallSiteThreshold("hot-callsite-threshold", cl::Hidden, cl::init(3000),
                          cl::ZeroOrMore,
                          cl::desc("Threshold for hot callsites "));

 static cl::opt<int> LocallyHotCallSiteThreshold(
     "locally-hot-callsite-threshold", cl::Hidden, cl::init(525), cl::ZeroOrMore,
     cl::desc("Threshold for locally hot callsites "));

+static cl::opt<int> ImpossibleCodeRelFreq(
+    "impossible-code-rel-freq", cl::Hidden, cl::init(2), cl::ZeroOrMore,
+    cl::desc("Maximum block frequency, expressed as a per-million of functions "
+             "entry frequency, for a block to be considered to be impossible "
+             "to execute in the absence of profile information."));
+
 static cl::opt<int> ColdCallSiteRelFreq(
     "cold-callsite-rel-freq", cl::Hidden, cl::init(2), cl::ZeroOrMore,
     cl::desc("Maximum block frequency, expressed as a percentage of caller's "
              "entry frequency, for a callsite to be cold in the absence of "
              "profile information."));

 static cl::opt<int> HotCallSiteRelFreq(
     "hot-callsite-rel-freq", cl::Hidden, cl::init(60), cl::ZeroOrMore,
[368 lines hidden] class InlineCostCallAnalyzer final : public CallAnalyzer {
   /// arguments are not counted here.
   int Cost = 0;

   // The cumulative cost at the beginning of the basic block being analyzed. At
   // the end of analyzing each basic block, "Cost - CostAtBBStart" represents
   // the size of that basic block.
   int CostAtBBStart = 0;

-  // The static size of live but cold basic blocks. This is "static" in the
-  // sense that it's not weighted by profile counts at all.
-  int ColdSize = 0;
+  // As per the block frequency info, does this block basically never execute?
+  bool IsImpossibleBB = false;

   // Whether inlining is decided by cost-benefit analysis.
   bool DecidedByCostBenefit = false;

   bool SingleBB = true;

   unsigned SROACostSavings = 0;
   unsigned SROACostSavingsLost = 0;
[13 lines hidden] class InlineCostCallAnalyzer final : public CallAnalyzer {
   void updateThreshold(CallBase &Call, Function &Callee);
   /// Return a higher threshold if \p Call is a hot callsite.
   Optional<int> getHotCallSiteThreshold(CallBase &Call,
                                         BlockFrequencyInfo *CallerBFI);

   /// Handle a capped 'int' increment for Cost.
   void addCost(int64_t Inc, int64_t UpperBound = INT_MAX) {
     assert(UpperBound > 0 && UpperBound <= INT_MAX && "invalid upper bound");
+    if (!IsImpossibleBB)
       Cost = (int)std::min(UpperBound, Cost + Inc);
   }

   void onDisableSROA(AllocaInst *Arg) override {
     auto CostIt = SROAArgCosts.find(Arg);
     if (CostIt == SROAArgCosts.end())
       return;
     addCost(CostIt->second);
     SROACostSavings -= CostIt->second;
[95 lines hidden] class InlineCostCallAnalyzer final : public CallAnalyzer {
   void onAggregateSROAUse(AllocaInst *SROAArg) override {
     auto CostIt = SROAArgCosts.find(SROAArg);
     assert(CostIt != SROAArgCosts.end() &&
            "expected this argument to have a cost");
     CostIt->second += InlineConstants::InstrCost;
     SROACostSavings += InlineConstants::InstrCost;
   }

-  void onBlockStart(const BasicBlock *BB) override { CostAtBBStart = Cost; }
+  void onBlockStart(const BasicBlock *BB) override {
+    CostAtBBStart = Cost;

-  void onBlockAnalyzed(const BasicBlock *BB) override {
-    if (CostBenefitAnalysisEnabled) {
-      // Keep track of the static size of live but cold basic blocks. For now,
-      // we define a cold basic block to be one that's never executed.
-      assert(GetBFI && "GetBFI must be available");
+    IsImpossibleBB = false;
+    if (GetBFI) {
       BlockFrequencyInfo *BFI = &(GetBFI(F));
       assert(BFI && "BFI must be available");
-      auto ProfileCount = BFI->getBlockProfileCount(BB);
-      assert(ProfileCount.hasValue());
-      if (ProfileCount.getValue() == 0)
-        ColdSize += Cost - CostAtBBStart;
+      const BranchProbability ImpossibleProb(ImpossibleCodeRelFreq, 1'000'000);
+      auto Freq = BFI->getBlockFreq(BB);
+      auto EntryFreq = BFI->getBlockFreq(&BB->getParent()->getEntryBlock());
+      IsImpossibleBB = Freq < EntryFreq * ImpossibleProb;
     }
+  }

+  void onBlockAnalyzed(const BasicBlock *BB) override {
     auto *TI = BB->getTerminator();
     // If we had any successors at this point, than post-inlining is likely to
     // have them as well. Note that we assume any basic blocks which existed
     // due to branches or switches which folded above will also fold after
     // inlining.
     if (SingleBB && TI->getNumSuccessors() > 1) {
       // Take off the bonus we applied to the threshold.
       Threshold -= SingleBBBonus;
[122 lines hidden] Optional<bool> costBenefitAnalysis() {
     CycleSavings = CycleSavings.udiv(EntryCount);

     // Compute the total savings for the call site.
     auto *CallerBB = CandidateCall.getParent();
     BlockFrequencyInfo *CallerBFI = &(GetBFI(*(CallerBB->getParent())));
     CycleSavings += getCallsiteCost(this->CandidateCall, DL);
     CycleSavings *= CallerBFI->getBlockProfileCount(CallerBB).getValue();

-    // Remove the cost of the cold basic blocks.
-    int Size = Cost - ColdSize;
-
     // Allow tiny callees to be inlined regardless of whether they meet the
     // savings threshold.
-    Size = Size > InlineSizeAllowance ? Size - InlineSizeAllowance : 1;
+    Cost = Cost > InlineCostAllowance ? Cost - InlineCostAllowance : 1;

     // Return true if the savings justify the cost of inlining. Specifically,
     // we evaluate the following inequality:
     //
     //   CycleSavings      PSI->getOrCompHotCountThreshold()
     //  -------------- >= -----------------------------------
-    //       Size              InlineSavingsMultiplier
+    //       Cost              InlineSavingsMultiplier
     //
     // Note that the left hand side is specific to a call site. The right hand
     // side is a constant for the entire executable.
     APInt LHS = CycleSavings;
     LHS *= InlineSavingsMultiplier;
     APInt RHS(128, PSI->getOrCompHotCountThreshold());
-    RHS *= Size;
+    RHS *= Cost;
     return LHS.uge(RHS);
   }

   InlineResult finalizeAnalysis() override {
     // Loops generally act a lot like calls in that they act like barriers to
     // movement, require a certain amount of setup, etc. So when optimising for
     // size, we penalise any call sites that perform loops. We do this after all
     // other costs here, so will likely only be dealing with relatively small
[2,015 lines hidden]
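
The effect of the `addCost` and `onBlockStart` changes can be modeled without LLVM. The sketch below is only an illustration: `CostModel` and `analyze` are hypothetical names, the frequency comparison is replaced by a boolean supplied by the caller, and the 25/30 split of the ~55 cost from the summary is made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical stand-in for the analyzer's cost bookkeeping; only the
// members touched by the diff are modeled.
class CostModel {
public:
  // Called at the start of each block. `impossible` stands for the result of
  // the block-frequency comparison, which this sketch does not compute.
  void onBlockStart(bool impossible) { isImpossibleBB = impossible; }

  // Capped increment, skipped entirely inside an "impossible" block.
  void addCost(std::int64_t inc, std::int64_t upperBound = INT_MAX) {
    assert(upperBound > 0 && upperBound <= INT_MAX && "invalid upper bound");
    if (!isImpossibleBB)
      cost = static_cast<int>(std::min<std::int64_t>(upperBound, cost + inc));
  }

  int getCost() const { return cost; }

private:
  int cost = 0;
  bool isImpossibleBB = false;
};

// Made-up split of a ~55-cost callee: 25 on the normal path and 30 in
// exception-handling cleanup that is assumed to be "impossible".
int analyze(bool discountImpossible) {
  CostModel model;
  model.onBlockStart(false);
  model.addCost(25);
  model.onBlockStart(discountImpossible);
  model.addCost(30);
  return model.getCost();
}
```

With these illustrative numbers, `analyze(false)` yields 55, which is over the cold budget of 45, while `analyze(true)` yields 25, which fits.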

llvm/test/Transforms/Inline/X86/impossible-block.ll

This file was added.

; RUN: opt < %s -inline -debug-only=inline-cost -impossible-code-rel-freq=2 -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck --check-prefixes=CHECK,CHECK-GOOD %s
; RUN: opt < %s -passes='cgscc(inline)' -impossible-code-rel-freq=2 -debug-only=inline-cost -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck --check-prefixes=CHECK,CHECK-GOOD %s

; RUN: opt < %s -inline -debug-only=inline-cost -impossible-code-rel-freq=1 -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck --check-prefixes=CHECK,CHECK-BAD %s
; RUN: opt < %s -passes='cgscc(inline)' -impossible-code-rel-freq=1 -debug-only=inline-cost -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck --check-prefixes=CHECK,CHECK-BAD %s

; REQUIRES: asserts

; Check the threshold for inlining cold callsites.

; CHECK: Analyzing call of callee... (caller:caller)
; CHECK-NEXT: define void @callee(i1 %c) {
; CHECK-NEXT: entry:
; CHECK-NEXT: ; cost before = -35, cost after = -30, threshold before = 674, threshold after = 674, cost delta = 5
; CHECK-NEXT: br i1 %c, label %hot, label %cold, !prof !0
; CHECK: hot: ; preds = %entry
; CHECK-NEXT: ; cost before = -30, cost after = 0, threshold before = 562, threshold after = 562, cost delta = 30
; CHECK-NEXT: call void @hot_callee()
; CHECK-NEXT: ; cost before = 0, cost after = 0, threshold before = 562, threshold after = 562, cost delta = 0
; CHECK-NEXT: br label %end
; CHECK: cold: ; preds = %entry
; CHECK-GOOD-NEXT: ; cost before = 0, cost after = 0, threshold before = 562, threshold after = 562, cost delta = 0
; CHECK-BAD-NEXT: ; cost before = 0, cost after = 30, threshold before = 562, threshold after = 562, cost delta = 30
; CHECK-NEXT: call void @cold_callee()
; CHECK-GOOD-NEXT: ; cost before = 0, cost after = 0, threshold before = 562, threshold after = 562, cost delta = 0
; CHECK-BAD-NEXT: ; cost before = 30, cost after = 30, threshold before = 562, threshold after = 562, cost delta = 0
; CHECK-NEXT: br label %end
; CHECK: end: ; preds = %cold, %hot
; CHECK-GOOD-NEXT: ; cost before = 0, cost after = 0, threshold before = 562, threshold after = 562, cost delta = 0
; CHECK-BAD-NEXT: ; cost before = 30, cost after = 30, threshold before = 562, threshold after = 562, cost delta = 0
; CHECK-NEXT: ret void
; CHECK-NEXT: }

declare void @hot_callee()
declare void @cold_callee()
declare void @terminating_callee() noreturn nounwind

define void @callee(i1 %c) {
entry:
  br i1 %c, label %hot, label %cold, !prof !0

hot:
  call void @hot_callee()
  br label %end

cold:
  call void @cold_callee()
  br label %end

end:
  ret void
}

define void @caller(i1 %c) {
entry:
  br i1 %c, label %hot, label %cold

hot:
  call void @callee(i1 %c)
  br label %end

cold:
  call void @cold_callee()
  br label %end

end:
  ret void
}

!0 = !{!"branch_weights", i32 1000000, i32 1}
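
The branch weights above (1000000 : 1) give the cold block an exact relative frequency of 1/1000001, which is slightly under 1 ppm, yet the CHECK-BAD runs expect the block to be costed at -impossible-code-rel-freq=1. One possible explanation is rounding. The sketch below assumes probabilities are stored as a numerator over 2^31 with round-to-nearest, which is how llvm::BranchProbability is understood to work; `toFixedPoint` and `kDenominator` are hypothetical names, and the scaling done by block frequency analysis is not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Assumed fixed-point scheme: numerator over 2^31, rounded to nearest.
constexpr std::uint64_t kDenominator = std::uint64_t{1} << 31;

// Converts num/den into a numerator over kDenominator.
constexpr std::uint32_t toFixedPoint(std::uint32_t num, std::uint32_t den) {
  return static_cast<std::uint32_t>((num * kDenominator + den / 2) / den);
}

// toFixedPoint(1, 1000001) -> 2147   (probability of the cold edge)
// toFixedPoint(1, 1000000) -> 2147   (threshold of 1 ppm)
// toFixedPoint(2, 1000000) -> 4295   (threshold of 2 ppm)
```

Under that assumption the cold edge and the 1 ppm threshold round to the same value, so a strict `<` comparison would not classify the block as impossible at 1 ppm but would at 2 ppm. That would be consistent with the summary's remark that 2 is the smallest value that does the job.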