This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Set minsize inline threshold to 0
AbandonedPublic

Authored by jmolloy on Jul 12 2016, 1:33 AM.

Details

Summary

"minsize", unlike other optimization levels, has a well defined expectation of how the inliner should behave. It shouldn't increase code size.

Therefore, set the threshold to 0. This still allows the inliner to inline trivial functions that are smaller than the callsite cost, modulo a bug, which I intend to fix in a follow-up, where the cost calculation doesn't take into account the removal of the call instruction at the callsite.
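To make the threshold semantics concrete, here is a simplified sketch (not LLVM's actual API; the struct and function names are invented for illustration) of the decision the summary describes. Even at a threshold of 0, a callee whose body is cheaper than the call-site overhead it removes has a negative net cost and is still inlined:

```cpp
// Hypothetical sketch of the inline-cost decision. In the real InlineCost
// analysis the call-site overhead (argument setup plus, once the bug
// mentioned above is fixed, the call instruction itself) is subtracted
// from the estimated body size before comparing against the threshold.
struct Candidate {
  int BodyCost;     // estimated size of the inlined body
  int CallSiteCost; // argument setup + the call instruction it replaces
};

bool shouldInline(const Candidate &C, int Threshold) {
  int NetCost = C.BodyCost - C.CallSiteCost; // negative for trivial callees
  return NetCost < Threshold;
}
```

With a threshold of 0, only callees that shrink the program (negative net cost) pass the check.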

Diff Detail

Event Timeline

jmolloy updated this revision to Diff 63651.Jul 12 2016, 1:33 AM
jmolloy retitled this revision from to [InlineCost] Set minsize inline threshold to 0.
jmolloy updated this object.
jmolloy added reviewers: chandlerc, hfinkel, mehdi_amini, ab.
jmolloy set the repository for this revision to rL LLVM.
jmolloy added a subscriber: llvm-commits.
chandlerc edited edge metadata.Jul 12 2016, 1:36 AM

I've always thought this was the ideal threshold to use for minsize, and something we should strive for in order to tune the inliner.

However, last time I checked, this setting actually *increased* code size quite a bit. What benchmarking have you done to evaluate this change? I also wonder which other users of minsize it would be useful to check, so that we don't significantly regress code size for them.

(All of these would likely be great test cases for the inline cost analysis of course....)

Hi Chandler,

I've run benchmarks on the test-suite, and this change causes no code size change whatsoever.

This is very suspicious, but I've tested again and again and checked three times: on trivial examples, the two compilers I'm testing do produce different results (this is on AArch64).

My suspicion for why there is no difference is that trivial functions (getters/setters) are still inlined, as their inline cost is less than the call argument setup cost. This change only really affects larger-than-trivial functions. I'm still suspicious, though.
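For illustration (this is my reading of the behaviour, not a verified reduction), this is the kind of accessor that stays profitable to inline even at a threshold of 0, because the inlined body is a single load, smaller than the argument setup and call sequence it replaces:

```cpp
// Illustrative only: a trivial getter whose inline cost is below the
// call-site cost, so it inlines even with -inline-threshold=0.
struct Point {
  int x, y;
  int getX() const { return x; } // body: one load, cheaper than a call
};

int useX(const Point &p) {
  return p.getX(); // expected to be inlined even at threshold 0
}
```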

James

Hi,

I re-ran testing, again using the test-suite at -Oz, but this time targeting T32 (32-bit Thumb mode). This time I do get differences:

Trunk -Oz: find . -name '*.o' | xargs size | awk '{sum+=$1} END {print sum}'
10080358

Trunk -Oz -mllvm -inline-threshold=0:  find . -name '*.o' | xargs size | awk '{sum+=$1} END {print sum}'
7722119

For completeness, I also tried this with Clang-3.7:

3.7 -Oz: 6919363
3.7 -Oz -mllvm -inline-threshold=0: 7820357

So 3.7's -Oz was a lot better than trunk's -Oz, and in 3.7 there were indeed regressions when setting the threshold to 0. But trunk has since bloated a lot, and now the same change gives improvements.

James

Hi Chandler,

We discussed this on IRC and you were suspicious of my numbers. I was too, so I re-ran everything. It turns out that CMake was appending -O3 to all of my builds, so the numbers I got were completely worthless.

Having done a *lot* more testing, I can confirm that purely setting the threshold to zero causes significant code size regressions. I've looked at these, and 99% of them are in C++ code. It turns out that although we were giving a bonus for inlining the last call to an internal function, we weren't doing the same for linkonce_odr functions, which is what C++ template instantiations become. This is what was causing our bloat.
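As a concrete illustration of the linkage distinction at play here (the function names are made up, but the linkage behaviour is standard): an instantiated function template is emitted with linkonce_odr linkage in LLVM IR, so an identical copy can appear in every translation unit that uses it, whereas an internal (static) function exists in exactly one TU:

```cpp
// Illustrative only: clampNonNegative<int> gets linkonce_odr linkage in
// IR, with a duplicate definition in every TU that instantiates it. The
// linker keeps one copy, but each un-inlined copy still costs size until
// then, which is why these dominate the -Oz regressions described above.
template <typename T>
T clampNonNegative(T v) {
  return v < T(0) ? T(0) : v;
}

int clampedInt(int x) {
  return clampNonNegative(x); // instantiates the linkonce_odr definition
}
```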

Ideally, we'd teach the inliner to determine much more accurately the overall expansion of a tree of thunks and tiny template expansions. However, simply giving linkonce_odr functions the same bonus as internal functions appears to do the trick quite well.

With this change I see a geomean code size *improvement* of 2.3% on the test-suite, and that excludes the TSVC benchmarks, which improve so massively that they skew the results. Without the linkonce_odr bonus (simply setting the threshold to 0) I see a code size *regression* of 2.9%.

I will of course split this up into two separate patches for committing, but I thought they'd be easier to review as one.

Cheers,

James

jmolloy updated this revision to Diff 65916.Jul 28 2016, 5:15 AM
jmolloy edited edge metadata.
jmolloy removed rL LLVM as the repository for this revision.
mehdi_amini edited edge metadata.Jul 28 2016, 9:44 AM

Hi Chandler,

> We discussed this on IRC and you were suspicious of my numbers. I was too, so I re-ran everything. It turns out that CMake was appending -O3 to all of my builds, so the numbers I got were completely worthless.
>
> Having done a *lot* more testing, I can confirm that purely setting the threshold to zero causes significant code size regressions. I've looked at these, and 99% of them are in C++ code. It turns out that although we were giving a bonus for inlining the last call to an internal function, we weren't doing the same for linkonce_odr functions, which is what C++ template instantiations become. This is what was causing our bloat.

This makes sense: linkonce_odr functions are not internal; they are not much different from regular globals.

> Ideally, we'd teach the inliner to determine much more accurately the overall expansion of a tree of thunks and tiny template expansions. However, simply giving linkonce_odr functions the same bonus as internal functions appears to do the trick quite well.

This is fairly arbitrary.

> Ideally, we'd teach the inliner to determine much more accurately the overall expansion of a tree of thunks and tiny template expansions. However, simply giving linkonce_odr functions the same bonus as internal functions appears to do the trick quite well.

I agree with Mehdi that this seems arbitrary, but I'll go farther -- I really think that focusing on linkonce_odr functions from the cost analysis side is the wrong approach.

I think instead you'll need to look at the nature of the (admittedly linkonce_odr) functions which get inlined at a threshold of 25 but not at a threshold of 0, and try to understand what pattern we're missing that causes us to misestimate the size. There is probably a reasonably small number of missing patterns at the low end of the threshold (because we've not stared at it as much) that it would be generally beneficial to recognize and accurately model.

But what is more concerning is that the "bonus" you're using is the "call once" bonus. That isn't really a bonus: what it does is pretty much completely remove the inlining threshold. As a consequence, with this change, you can cause *massive* code size explosions with just a huge linkonce_odr function that is called once in every translation unit, but by a different function in each one. You'll get O(number of translation units) copies of that function. =[

I really think we need to understand why you see this swing in size between two very close thresholds of 0 and 25. My suspicion is that we have some flaw in how we calculate the cost of inlining a function which just calls another function with (perhaps a permutation of) the arguments it is called with. Fixing that cost computation so that it is (correctly) modeled as 0 cost seems really valuable *outside* of -Oz as well.
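The pattern described above, sketched with invented names for illustration: a thunk that merely forwards a permutation of its arguments. Inlining it replaces one call with another of roughly the same size, so its net size cost arguably ought to be modeled as ~0:

```cpp
// Sketch of an argument-shuffling thunk. Inlining subtractSwapped into
// its callers swaps two registers and emits the same call to subtract,
// so the inliner should estimate its net size cost as approximately zero.
static int subtract(int a, int b) { return a - b; }

int subtractSwapped(int b, int a) {
  return subtract(a, b); // pure argument permutation, no extra code
}
```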

-Chandler

reames added a subscriber: reames.Aug 30 2016, 7:10 PM

> I really think we need to understand why you see this swing in size between two very close thresholds of 0 and 25. My suspicion is that we have some flaw in how we calculate the cost of inlining a function which just calls another function with (perhaps a permutation of) the arguments it is called with. Fixing that cost computation so that it is (correctly) modeled as 0 cost seems really valuable *outside* of -Oz as well.

I want to second what Chandler said here. If we have a testing methodology which can help us fix cases where we got our cost model wrong and are thus more sensitive to the threshold values than we should be, we should exploit that for all it's worth.

Also, I wouldn't be surprised if simply writing some manual test cases for argument shuffling and other idiomatic small functions, with the inline threshold set to zero, were to find a few cases. Some targeted test writing (as opposed to benchmark analysis) might be really useful.

Also, adding a hook to the inliner which causes it to print inline candidates which *did* get inlined with a 25 threshold, but had a cost above 0 would be a quick way to find a bunch of these cases. We've applied that a couple of times in our tree with pretty good success. Actually, a "print-inline-close-to-threshold" option might be a generically useful analysis tool.

jmolloy abandoned this revision.Sep 8 2016, 4:59 AM

Abandoning in favour of D24338.