This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Set minsize inline threshold to 0
AbandonedPublic

Authored by jmolloy on Jul 12 2016, 1:33 AM.

Details

Summary

"minsize", unlike other optimization levels, has a well defined expectation of how the inliner should behave. It shouldn't increase code size.

Therefore, set the threshold to 0. This still allows the inliner to inline trivial functions that are smaller than the callsite cost, modulo a bug, which I intend to fix in a follow-up, where the cost calculation doesn't take into account the removal of the call instruction at the callsite.
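To make the threshold semantics concrete, here is a simplified sketch (not LLVM's actual API; the struct and function names are invented for illustration) of the decision the summary describes. Even at a threshold of 0, a callee whose body is cheaper than the call-site overhead it removes has a negative net cost and is still inlined:

```cpp
// Hypothetical sketch of the inline-cost decision. In the real InlineCost
// analysis the call-site overhead (argument setup plus, once the bug
// mentioned above is fixed, the call instruction itself) is subtracted
// from the estimated body size before comparing against the threshold.
struct Candidate {
  int BodyCost;     // estimated size of the inlined body
  int CallSiteCost; // argument setup + the call instruction it replaces
};

bool shouldInline(const Candidate &C, int Threshold) {
  int NetCost = C.BodyCost - C.CallSiteCost; // negative for trivial callees
  return NetCost < Threshold;
}
```

With a threshold of 0, only callees that shrink the program (negative net cost) pass the check.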

Diff Detail

Event Timeline

jmolloy updated this revision to Diff 63651.Jul 12 2016, 1:33 AM
jmolloy retitled this revision from to [InlineCost] Set minsize inline threshold to 0.
jmolloy updated this object.
jmolloy added reviewers: chandlerc, hfinkel, mehdi_amini, ab.
jmolloy set the repository for this revision to rL LLVM.
jmolloy added a subscriber: llvm-commits.
chandlerc edited edge metadata.Jul 12 2016, 1:36 AM

I've always thought this was the ideal threshold to use for minsize, and something we should strive for in order to tune the inliner.

However, last time I checked, this setting actually *increased* code size quite a bit. What benchmarking have you done to evaluate this change? I also wonder which other users of minsize it would be useful to check, so that we don't significantly regress code size for them.

(All of these would likely be great test cases for the inline cost analysis of course....)

Hi Chandler,

I've run benchmarks on the test-suite, and this change causes no code size change whatsoever.

This is very suspicious, but I've tested again and again and checked three times: on trivial examples, the two compilers I'm testing do produce different results (this is on AArch64).

My suspicion for why there is no difference is that trivial functions (getters/setters) are still inlined, as their inline cost is less than the call argument setup cost. This change only really affects larger-than-trivial functions. I'm still suspicious, though.
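For illustration (this is my reading of the behaviour, not a verified reduction), this is the kind of accessor that stays profitable to inline even at a threshold of 0, because the inlined body is a single load, smaller than the argument setup and call sequence it replaces:

```cpp
// Illustrative only: a trivial getter whose inline cost is below the
// call-site cost, so it inlines even with -inline-threshold=0.
struct Point {
  int x, y;
  int getX() const { return x; } // body: one load, cheaper than a call
};

int useX(const Point &p) {
  return p.getX(); // expected to be inlined even at threshold 0
}
```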

James

Hi,

I re-ran testing, again using the test-suite at -Oz, but this time targeting T32 (32-bit Thumb mode). This time I do get differences:

Trunk -Oz: find . -name '*.o' | xargs size | awk '{sum+=$1} END {print sum}'
10080358

Trunk -Oz -mllvm -inline-threshold=0:  find . -name '*.o' | xargs size | awk '{sum+=$1} END {print sum}'
7722119

For completeness, I also tried this with Clang-3.7:

3.7 -Oz: 6919363
3.7 -Oz -mllvm -inline-threshold=0: 7820357

So 3.7's -Oz was a lot better than trunk's -Oz, and in 3.7 there were indeed regressions when setting the threshold to 0. But trunk has since bloated a lot, and now the same change gives improvements.

James

Hi Chandler,

We discussed this on IRC and you were suspicious of my numbers. I was too, so I re-ran everything. It turns out that CMake was appending -O3 to all of my builds, so the numbers I got were completely worthless.

Having done a *lot* more testing, I can confirm that purely setting the threshold to zero causes significant code size regressions. I've looked at these, and 99% of them are in C++ code. It turns out that although we were giving a bonus for inlining the last call to an internal function, we weren't doing the same for linkonce_odr functions, which is what C++ template instantiations become. This is what was causing our bloat.
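As a concrete illustration of the linkage distinction at play here (the function names are made up, but the linkage behaviour is standard): an instantiated function template is emitted with linkonce_odr linkage in LLVM IR, so an identical copy can appear in every translation unit that uses it, whereas an internal (static) function exists in exactly one TU:

```cpp
// Illustrative only: clampNonNegative<int> gets linkonce_odr linkage in
// IR, with a duplicate definition in every TU that instantiates it. The
// linker keeps one copy, but each un-inlined copy still costs size until
// then, which is why these dominate the -Oz regressions described above.
template <typename T>
T clampNonNegative(T v) {
  return v < T(0) ? T(0) : v;
}

int clampedInt(int x) {
  return clampNonNegative(x); // instantiates the linkonce_odr definition
}
```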

Ideally, we'd teach the inliner to determine much more accurately the overall expansion of a tree of thunks and tiny template expansions. However, simply giving linkonce_odr functions the same bonus as internal functions appears to do the trick quite well.

With this change I see a geomean code size *improvement* of 2.3% on the test-suite, and that excludes the TSVC benchmarks, which improve so massively that they skew the results. Without the linkonce_odr bonus (simply setting the threshold to 0) I see a code size *regression* of 2.9%.

I will of course split this up into two separate patches for committing, but I thought they'd be easier to review as one.

Cheers,

James

jmolloy updated this revision to Diff 65916.Jul 28 2016, 5:15 AM
jmolloy edited edge metadata.
jmolloy removed rL LLVM as the repository for this revision.
mehdi_amini edited edge metadata.Jul 28 2016, 9:44 AM

Hi Chandler,

> We discussed this on IRC and you were suspicious of my numbers. I was too, so I re-ran everything. It turns out that CMake was appending -O3 to all of my builds, so the numbers I got were completely worthless.
>
> Having done a *lot* more testing, I can confirm that purely setting the threshold to zero causes significant code size regressions. I've looked at these, and 99% of them are in C++ code. It turns out that although we were giving a bonus for inlining the last call to an internal function, we weren't doing the same for linkonce_odr functions, which is what C++ template instantiations become. This is what was causing our bloat.

This makes sense: linkonce_odr functions are not internal; they are not much different from regular globals.

> Ideally, we'd teach the inliner to determine much more accurately the overall expansion of a tree of thunks and tiny template expansions. However, simply giving linkonce_odr functions the same bonus as internal functions appears to do the trick quite well.

This is fairly arbitrary.

> Ideally, we'd teach the inliner to determine much more accurately the overall expansion of a tree of thunks and tiny template expansions. However, simply giving linkonce_odr functions the same bonus as internal functions appears to do the trick quite well.

I agree with Mehdi that this seems arbitrary, but I'll go farther -- I really think that focusing on linkonce_odr functions from the cost analysis side is the wrong approach.

I think instead you'll need to look at the nature of the (admittedly linkonce_odr) functions which get inlined at a threshold of 25 but not at a threshold of 0, and try to understand what pattern we're missing that causes us to misestimate the size. There is probably a reasonably small number of missing patterns at the low end of the threshold (because we've not stared at it as much) that it would be generally beneficial to recognize and accurately model.

But what is more concerning is that the "bonus" you're using is the "call once" bonus. That isn't really a bonus: what it does is pretty much completely remove the inlining threshold. As a consequence, with this change, you can cause *massive* code size explosions with just a huge linkonce_odr function that is called once in every translation unit, but by a different function in each one. You'll get O(number of translation units) copies of that function. =[

I really think we need to understand why you see this swing in size between two very close thresholds of 0 and 25. My suspicion is that we have some flaw in how we calculate the cost of inlining a function which just calls another function with (perhaps a permutation of) the arguments it is called with. Fixing that cost computation so that it is (correctly) modeled as 0 cost seems really valuable *outside* of -Oz as well.
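The pattern described above, sketched with invented names for illustration: a thunk that merely forwards a permutation of its arguments. Inlining it replaces one call with another of roughly the same size, so its net size cost arguably ought to be modeled as ~0:

```cpp
// Sketch of an argument-shuffling thunk. Inlining subtractSwapped into
// its callers swaps two registers and emits the same call to subtract,
// so the inliner should estimate its net size cost as approximately zero.
static int subtract(int a, int b) { return a - b; }

int subtractSwapped(int b, int a) {
  return subtract(a, b); // pure argument permutation, no extra code
}
```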

-Chandler

reames added a subscriber: reames.Aug 30 2016, 7:10 PM

> I really think we need to understand why you see this swing in size between two very close thresholds of 0 and 25. My suspicion is that we have some flaw in how we calculate the cost of inlining a function which just calls another function with (perhaps a permutation of) the arguments it is called with. Fixing that cost computation so that it is (correctly) modeled as 0 cost seems really valuable *outside* of -Oz as well.

I want to second what Chandler said here. If we have a testing methodology which can help us fix cases where we got our cost model wrong and are thus more sensitive to the threshold values than we should be, we should exploit that for all it's worth.

Also, I wouldn't be surprised if simply writing some manual test cases for argument shuffling and other idiomatic small functions, with the inline threshold set to zero, were to find a few cases. Some targeted test writing (as opposed to benchmark analysis) might be really useful.

Also, adding a hook to the inliner which causes it to print inline candidates which *did* get inlined with a 25 threshold, but had a cost above 0 would be a quick way to find a bunch of these cases. We've applied that a couple of times in our tree with pretty good success. Actually, a "print-inline-close-to-threshold" option might be a generically useful analysis tool.

jmolloy abandoned this revision.Sep 8 2016, 4:59 AM

Abandoning in favour of D24338.