This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Remove CallPenalty and change MinSizeThreshold to 5
AcceptedPublic

Authored by jmolloy on Sep 8 2016, 4:59 AM.

Details

Summary

Call instructions had a penalty of 25 applied during inline cost analysis. This seems nonsensical - calls surely don't cost any more than any other instruction when inlined. I presume the logic was that a call dominates the execution time, diminishing the benefit of inlining the surrounding instructions. In reality this penalty was applied in some cases and not others, causing skewed behaviour and some glaring errors. For example:

int g() {
  h();
}
int f() {
  g();
}

With an inline threshold set to zero (which *should* mean no code bloat), we won't inline g into f. This is because there's a call penalty of 25 applied for the call to h(), but there *isn't an equivalent bonus applied for removing the call to g()*! This has caused the inlining threshold for minsize to be set to a default of... 25. Which means that for functions that *don't* have a call in them, we'll accept around 5 instructions of code bloat in minimum size mode!
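To put rough numbers on the skew (the per-instruction cost of 5 is inferred from the "around 5 instructions" figure above, so treat the arithmetic as illustrative):

  call to h inside g:       ~5 (instruction) + 25 (call penalty)
  removing the call to g:   at most ~5 of credit, with no penalty credit
  net cost:                 ~25, so any threshold below 25 refuses the inline - hence the minsize default of 25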

This call penalty has logic behind it, but it's skewing decisions and causing us to try to counterbalance it in strange places. Nuke it.

This improves code size by a percentage point or so over the test-suite in minsize mode, but most importantly it removes some glaring inlining-decision clangers and is a saner default. The effect on -Os and -O3 is very minimal, because their thresholds are 125 and 225 respectively, which are quite large. However, for functions containing many calls this may still change the inlining decision.

Zero would be the obvious choice for minsize; however, I've noticed code size regressions when picking 0 - most of these are fixed with a threshold of 5. The reason is that the cost model isn't accurate - being a little more aggressive allows us to take some risks that sometimes (or often) pay off. For example:

int g() {
  h(0);
}
int f() {
  g();
}

With a threshold of zero we will not inline g into f, because the call to h requires one more argument to set up than the call to g it replaces. In reality this extra parameter will cause negligible code size impact, and we have the possibility of removing g entirely if nothing else references it (happens often when using C++ containers).
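In the same rough terms, the call to h(0) needs one more argument set up than the call to g() it replaces, so inlining shows up as a net cost of about one instruction (~5 with the illustrative numbers above). A threshold of 0 rejects that; a threshold of 5 accepts it, and in practice the extra argument is nearly free while the chance of deleting g outright often pays for it.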

I've tested this on the test-suite and it gives decent results. Excluding TSVC (see later):

242 total test executables
68 have a code size improvement of between 0.01% and 4.6% (most of these are tiny, in the 0.01 - 1.00% range)
34 have a code size regression of between 0.01% and 5.3% (most of these are tiny, in the 0.01 - 1.00% range)
0.07% geomean improvement in code size.

The TSVC benchmarks were excluded because they *improve* in code size by around 40% - their size almost halves. There are 36 of these, with code size improvements ranging from 32% to 45%.

7.55% geomean improvement in code size, with TSVC included in the sample.

These tests were done on Armv7, Thumb mode, -Oz. I have confirmed that -O3 looks vaguely similar to how it did before (not surprising - the threshold there is 225, so it is less likely to be affected by this change).

Diff Detail

Repository
rL LLVM

Event Timeline

jmolloy updated this revision to Diff 70678.Sep 8 2016, 4:59 AM
jmolloy retitled this revision from to [InlineCost] Remove CallPenalty and change MinSizeThreshold to 5.
jmolloy updated this object.
jmolloy added reviewers: chandlerc, reames.
jmolloy set the repository for this revision to rL LLVM.
jmolloy added a subscriber: llvm-commits.

Keep in mind that the inliner threshold is (AFAIK) a balance between code size and performance. With this view, it "makes sense" to have a different cost for calls: they can trigger spilling around the call site and they may need to push args on the stack. Of course, the same cost should be applied to the call that is removed in the caller.
I'm not saying that 25 is the best approximation; indeed, I'd expect the number of args to the call to play a role.

Hi,

I agree completely, and this is actually already the case. The argument setup is accounted for; currently *on top of* the call penalty.

I see the inline cost metric for performance as highly correlated with IR code size, since the big gains from inlining come from code simplification and constant propagation, which are reflected in the number of resulting IR instructions.

Calls on Power are actually slower than a normal instruction, so inlining is even more important for performance. This probably needs some benchmarking and perhaps a way to conditionalize cost. A few more inline comments as well.

Thanks!

-eric

lib/Transforms/Scalar/LoopStrengthReduce.cpp
1064 ↗(On Diff #70678)

Whitespace :)

test/Transforms/Inline/ephemeral.ll
22 ↗(On Diff #70678)

Since you're having to change the function here it seems reasonable to document it a bit more with how the structure of the function is affecting the inlining?

test/Transforms/Inline/inline_minisize.ll
25 ↗(On Diff #70678)

What happened here?

200 ↗(On Diff #70678)

not minsize?

test/Transforms/Inline/nonnull.ll
1 ↗(On Diff #70678)

Not sure what to do about the tests that need a threshold change here, it seems slightly off to add a threshold to them. Thoughts on alternatives?

eraman added a comment.Nov 1 2016, 3:25 PM

I agree that having a call penalty in addition to argument set up cost skews inline cost computation. Some high level comments:

  • This patch has two changes: removing the call penalty and reducing MinSizeThreshold. It is preferable to separate them.
  • The call penalty change doesn't affect just the code size, so it is important to measure the performance impact of this change.
  • Could you generate the patch with a large -U value to get full context?
lib/Analysis/InlineCost.cpp
1112

This needs more justification. The instruction gets its cost in the visitor. In addition, you're adding InstrCost. Is this to simulate the effect of a call with one argument, or something else? Comments will be helpful.

eraman added a subscriber: davidxl.Nov 1 2016, 3:26 PM

The change makes sense overall. Easwaran mentioned this exact issue a short while ago.

include/llvm/Analysis/InlineCost.h
44 ↗(On Diff #70678)

We should not remove this parameter. Instead it should be used in the opposite way. Currently CallAnalyzer::analyzeCall only considers the cost of parameter passing (subtracted from the overall cost); it should in fact also subtract CallPenalty (Cost -= CallPenalty;) to model the cost of the function call itself (call/ret, prologue/epilogue). Note that the cost here means runtime cost. The current inline cost is highly tuned toward size cost.

lib/Analysis/InlineCost.cpp
1254

From the point of view of modelling runtime cost, this should really be Cost -= CallPenalty. The penalty should actually be defined by the target.

echristo added inline comments.Nov 1 2016, 4:30 PM
include/llvm/Analysis/InlineCost.h
44 ↗(On Diff #70678)

+1

lib/Analysis/InlineCost.cpp
1254

Agreed.

I agree that having a call penalty in addition to argument set up cost skews inline cost computation.

What about register spills for instance? An IR call can really be more costly when lowered to machine code.

(Adding the answer to David inline, so that the relevant context is here.)

lib/Analysis/InlineCost.cpp
938
In D24338, @davidxl wrote:

Do you mean parameter passing? That should be counted independently by the instruction visitor.

No, I don't mean the parameters; the visitor indeed accounts for one instruction per parameter just before this.
However, a call has extra cost on top of a regular instruction, even without any arguments: it splits live ranges and may cause spills around it to preserve registers that aren't callee-saved.

In this sense, it makes sense that a call can "cost" more than a normal instruction.

Hi all,

Thanks for the rush of comments! This has been on my back burner for a while - you've motivated me to try and push it forward :)

Having stared at this code for a long time now, my overall feeling is that, long term, we should try to move a lot of the inline cost logic around calls into TargetTransformInfo. TTI::getCallCost already does a similar cost calculation, just with fewer special cases. Doing this would allow the broad-brush heuristics "CallPenalty" and "InliningMultiplier" to be removed entirely, as targets that cared about them could simply override getCallCost.

That's a large change however, and is likely to change cost calculations. The updated diff is essentially what you all have suggested, and I think gets us one stage of the way towards the longer term goal above.

Most uses of CallPenalty are moved into TargetTransformInfo, with a default value the same as InlineConstants::CallPenalty. InlineConstants::CallPenalty stays, because it is also used as a default cost for a soft-float library call (this could probably be renamed, thinking about it).

I've removed the skew by subtracting CallPenalty when setting up the initial threshold.

CallPenalty is calculated from TargetTransformInfo, but is overridable from the command line. This is very useful for inlining tests - while the penalty has a use in practice, when devising small testcases it does get in the way. This allows the testcases to stay small, and makes sense in most cases because the tests are already overriding the inlining cost.
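For reference, here is a minimal sketch of what that wiring might look like - this is not the actual diff; the flag name, the default value, and the TTI hook are all hypothetical stand-ins:

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical flag; when passed explicitly it overrides the target's value.
static cl::opt<int> InlineCallPenalty(
    "inline-call-penalty", cl::Hidden, cl::init(25),
    cl::desc("Penalty applied to call sites during inline cost analysis"));

static int getCallPenalty(const TargetTransformInfo &TTI) {
  if (InlineCallPenalty.getNumOccurrences() > 0)
    return InlineCallPenalty;
  return TTI.getCallPenalty(); // hypothetical TTI hook standing in for the real one
}

The cl::opt/getNumOccurrences() pattern is the standard LLVM way to tell "flag left at default" apart from "flag explicitly set", which is what lets the tests keep overriding the penalty.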

This is the first step, and shouldn't have much performance impact at all. The overall effect is actually the same as my previous patchset, and I benchmarked that one very heavily and found no glaring performance issues.

The next step after this would be to sort out -Oz, which is affected by this change the most. It makes sense to reduce the call penalty to zero when optimizing purely for size (-Oz) and to change the -Oz inline threshold to zero to compensate. We'll then have sensibly modelled inlining behaviour when focussing on size.

jmolloy updated this revision to Diff 76702.Nov 2 2016, 7:26 AM

Thanks, that looks good as a first step! See one inline comment.

lib/Analysis/InlineCost.cpp
1272

I think this should be a separate patch; all the rest of this revision would be NFC.

Hi Mehdi,

Thanks! Would you prefer I strip that out here and submit another phab review, or is it OK to split them apart upon committing?

Cheers,

James

davidxl added inline comments.Nov 3 2016, 7:55 PM
include/llvm/Analysis/TargetTransformInfo.h
188 ↗(On Diff #76702)

I think it is a mistake to conflate two different things here. One is the penalty of 'not' inlining a callsite (aka the benefit of inlining it). This penalty models the branch(call)/return, function prologue and epilogue, etc. It can also model the size impact of the call instruction. The other is the overhead of a newly exposed callsite from the inline instance. The latter models the potential caller-save register spills etc. The former is the one that needs to be defined in TTI. For the latter, I think it should stay in the Inliner proper. In the future, it can be replaced by target-dependent register pressure analysis. In other words, this interface should be getCallPenalty().

include/llvm/Analysis/TargetTransformInfoImpl.h
132 ↗(On Diff #76702)

Why does it need to be tied to TCC_Basic? Why not just define an independent default?

lib/Analysis/InlineCost.cpp
110

This is a good parameter to add to the InlineParams struct.

943

This one should use a different parameter from CallPenalty, one that models the overhead of exposed callsites -- which does not need to live in TTI.

1206

A more common way is to check getNumOccurrences().

1207

The computation here should be pushed to getInlineParams

1209

Why not directly use the TTI-returned value?

lib/Transforms/IPO/Inliner.cpp
288 ↗(On Diff #76702)

should -1 be removed too?

chandlerc edited edge metadata.Nov 4 2016, 12:37 AM

Hi Mehdi,

Thanks! Would you prefer I strip that out here and submit another phab review, or is it OK to split them apart upon committing?

I'd go ahead and strip them out so the review can continue without conflating them? (Mostly because the review seems to continue...)

I'm even fine if you want to land that part of this first, and then clean up the layering with TTI, but you should check with David.

chandlerc added inline comments.Nov 4 2016, 1:14 AM
include/llvm/Analysis/TargetTransformInfo.h
188 ↗(On Diff #76702)

Sorry I'm a bit late to the thread, but I'm confused by this and some of the other comments.

There are definitely two concepts here: the cost of having a call in the instruction stream, and the expected "benefit" of inlining. But I don't think CallPenalty is really being used to model the latter concept these days. The *threshold* is what models the latter concept.

So originally, in early 2010, we started using CallPenalty scaled by the number of arguments to track the "caller cost of having a call". Then later in 2010, we replaced this usage of CallPenalty with a new and completely unrelated usage to model the fact that "big" functions are "slow", even though it was applied to a threshold that is primarily size based. Then a few years later we *completely changed how we even think about inline cost*. Since then, the idea of functions being either "big" or "slow" and modeling that with CallPenalty, IMO, no longer makes sense in the inline cost analysis *at all*. So I suspect that the current usage of CallPenalty is just wrong. If it is doing anything, it is augmenting the basic call cost to compensate for its failure to account for the caller-side (size) cost of *having* a call instruction.

My only real concern with adding an API is understanding why we can't just make the existing getCallCost do exactly this. In fact, it already does most of this. Is there some way we can adjust it to allow us to transition from CallPenalty to getCallCost to model the caller-side cost? If the name is too confusing, we could rename it getCallerCallCost or some such...

However, I would *not* merge getInliningThresholdMultiplier into this. That is the target's tool to adjust the *threshold*, which models the other thing you are talking about, David. I agree they should remain separate. If we need more granular control of the threshold in a target, we should just add an API that has that granularity (with an understanding of why it is needed).

And I still think it is reasonable to factor the caller cost *into* the threshold (after it is scaled), as it does model one aspect of the "benefit" of inlining. The threshold itself models the rest, and the multiplier can scale it appropriately.

lib/Analysis/InlineCost.cpp
110

See above, but as a consequence I disagree. I think this is just a cost modeling issue, and not a threshold issue. The InlineParams should be concerned with setting up the threshold to model the "benefit".

1209

Agreed. I think this will become much more straight-forward when it is expressed as a cost and thus can use normal cost values to configure itself.

davidxl added inline comments.Nov 4 2016, 2:16 PM
include/llvm/Analysis/TargetTransformInfo.h
188 ↗(On Diff #76702)

I agree it is confusing. My suggestion of handling runtime cost modeling in this patch actually makes it worse, so I take it back -- the runtime cost modeling part can be handled later in a different patch. So let's go back and focus on the size modeling intended by James's original patch.

The fundamental problem James noticed is that the cost (size) modeling of the original callsite (the inline candidate) and the cost of the newly introduced callsites are not handled consistently. Example:

caller() {
  callee();    // to be inlined
}

callee() {
  new_cs1();
  ...
}

When computing the cost of inlining callee(), the inline cost analysis first subtracts the callsite cost -- but only partially: it subtracts the cost associated with parameter passing, but does not subtract the other size penalty associated with the eliminated callsite (presumably higher than the base instruction cost). However, when considering the cost of new_cs1(), it not only adds the parameter passing cost but also the additional penalty. Theoretically, if a simple wrapper function is inlined, there should be zero size impact, but in the current implementation the wrapper function is blocked from being inlined.
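A toy model of that asymmetry in numbers (this is not LLVM code, just the bookkeeping; InstrCost = 5 and CallPenalty = 25 are taken from the figures quoted earlier in the thread):

#include <cstdio>

int main() {
  const int InstrCost = 5;    // assumed per-instruction cost
  const int CallPenalty = 25; // the penalty under discussion

  // Inlining the wrapper: caller() { callee(); }  callee() { new_cs1(); }
  // Credit for the eliminated call to callee(): parameter passing only (none here).
  int Credit = 0;
  // Charge for the newly exposed call to new_cs1(): instruction cost plus penalty.
  int Charge = InstrCost + CallPenalty;

  std::printf("net size cost of inlining the wrapper: %d\n", Charge - Credit);
  return 0;
}

So the size-neutral wrapper is assessed roughly 30 units of cost, which is why it gets blocked at small thresholds.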

In short, I think the simplest fix to the problem is to make the cost adjustment for the original and new callsites consistent -- it is very similar to the first version of the patch with the difference that CallPenalty is retained.

Regarding 'getCallCost' -- it computes the cost of both parameter passing and the call itself. Ideally it should be used in the inline cost adjustment, but it is unfortunately implemented slightly differently -- e.g., its parameter passing cost computation is less sophisticated than what the inliner uses.

jmolloy updated this revision to Diff 77027.Nov 7 2016, 6:14 AM
jmolloy edited edge metadata.

Hi David and Chandler,

Thanks for all these comments. I'm glad the two main reviewers agree with each other at this point :)

I've stripped this patch down to the basics - removing the skew. This one should be a no-brainer. Following up to this, I'd like to, in separate patches:

  • Change the minsize threshold to 5, as I've determined through testing that setting it to zero leaves code size and performance on the table due to the inaccuracy of the IR cost model.
  • Improve TTI::getCallCost(), teaching it the tricks the inliner knows.
  • Hopefully eventually switch to using getCallCost() in the inliner.

Does this sound acceptable?

Cheers,

James

mehdi_amini accepted this revision.Nov 7 2016, 8:41 AM
mehdi_amini added a reviewer: mehdi_amini.

LGTM
(but you should wait a little to give an opportunity to chandlerc / davidxl to approve as well).

This revision is now accepted and ready to land.Nov 7 2016, 8:41 AM

This one looks good.

Easwaran, can you help measure the performance impact of this change with internal benchmarks?

eraman added a comment.Nov 8 2016, 9:52 AM

This one looks good.

Easwaran, can you help measure the performance impact of this change with internal benchmarks?

This is hurting one of our internal benchmarks by > 5%. I'm looking into the root cause now.

eraman added a comment.Nov 8 2016, 2:37 PM

This one looks good.

Easwaran, can you help measure the performance impact of this change with internal benchmarks?

This is hurting one of our internal benchmarks by > 5%. I'm looking into the root cause now.

It turns out this is due to a bug in a local inliner patch we have that got exposed by this (and that prevented one crucial function from being inlined). This patch itself LGTM.