This is an archive of the discontinued LLVM Phabricator instance.

Estimate speedup due to inlining and use that to adjust threshold.
Abandoned · Public

Authored by eraman on Feb 16 2017, 5:31 PM.

Details

Reviewers
chandlerc
davidxl
Summary

This adds a heuristic that estimates the speedup due to inlining a callsite and increases the threshold for callsites whose estimated speedup exceeds a configurable minimum. The speedup is defined as WeightedSavings / (WeightedCost + WeightedSavings). WeightedCost is obtained by weighting the cost (already computed by the analysis) of each basic block with the block's frequency. To compute WeightedSavings, we keep track of the savings due to inlining, which have three components:

  a. the cost savings from eliminating the call overhead and argument setup overhead,
  b. the cost savings due to SROA, and
  c. the cost savings from instructions eliminated after inlining.

Note that the savings do not include the cost of blocks that are unreachable in the call context: irrespective of inlining, those blocks are never reached dynamically. Like WeightedCost, WeightedSavings is calculated by weighting the savings of each block with its frequency.
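As a rough illustration of the arithmetic (a hand-written sketch, not the patch's code; BlockInfo, estimateSpeedupPercent, and the per-block bookkeeping are hypothetical):

  #include <cstdint>
  #include <vector>

  // Hypothetical per-block bookkeeping: Cost and Savings stand in for the
  // values the inline cost analysis accumulates; Freq comes from BFI.
  struct BlockInfo {
    int Cost;
    int Savings; // call/arg-setup elimination + SROA + simplified instructions
    uint64_t Freq;
  };

  // Speedup = WeightedSavings / (WeightedCost + WeightedSavings), in percent.
  int estimateSpeedupPercent(const std::vector<BlockInfo> &Blocks) {
    uint64_t WeightedCost = 0, WeightedSavings = 0;
    for (const BlockInfo &B : Blocks) {
      // Blocks unreachable in the call context carry no savings: they never
      // execute dynamically, inlined or not.
      WeightedCost += (uint64_t)B.Cost * B.Freq;
      WeightedSavings += (uint64_t)B.Savings * B.Freq;
    }
    uint64_t Total = WeightedCost + WeightedSavings;
    return Total ? (int)(100 * WeightedSavings / Total) : 0;
  }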

This analysis is triggered only when the callsite's frequency relative to the caller's entry exceeds a configurable parameter. In my experiments, I noticed that not having this filter results in increased code size in the cold regions.
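A minimal sketch of that gate, assuming a hypothetical MinCallSiteFreqRatio parameter and raw BFI frequencies:

  #include <cstdint>

  // Hypothetical gate: run the speedup analysis only when the callsite's
  // frequency is at least MinCallSiteFreqRatio times the caller's entry
  // frequency, so cold callsites don't get the bonus (or the code growth).
  bool speedupAnalysisEnabled(uint64_t CallSiteFreq, uint64_t CallerEntryFreq,
                              uint64_t MinCallSiteFreqRatio) {
    // Equivalent to CallSiteFreq / CallerEntryFreq >= MinCallSiteFreqRatio,
    // without integer-division truncation.
    return CallSiteFreq >= MinCallSiteFreqRatio * CallerEntryFreq;
  }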

I have done some parameter tuning using a hacked-up version of this code (made to work with the old PM) on a set of internal benchmarks. I expect more tuning to be done on a broader set of benchmarks. Since this is active only in the new-PM pipeline, I haven't added a flag to disable it, but I'll add one if you prefer.


Event Timeline

efriedma added inline comments.
lib/Analysis/InlineCost.cpp
1429

CalleeBFI is unused?

Thanks for the comment!

lib/Analysis/InlineCost.cpp
1429

Ouch. This is badly broken. In two places where I should have used CalleeBFI, I've used CallerBFI. getBlockFreq returns 0 if the BB is not in the function whose BFI is used. The tests still pass because I've used getEntryFreq of the Caller (instead of the Callee), which still returns a valid value, and the weighted savings associated with the entry block (elimination of the call and argument setup) are enough to produce the speedup. I'll send a revised patch tomorrow.
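To illustrate the failure mode (sketch only): querying the wrong function's BFI does not assert, it just returns a zero frequency.

  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/IR/BasicBlock.h"
  using namespace llvm;

  // BB belongs to the *callee*. getBlockFreq() returns 0 for a block that is
  // not in the function the BFI was computed for, so using CallerBFI here
  // silently zeroes the weighted savings instead of failing loudly.
  uint64_t calleeBlockFreq(BlockFrequencyInfo &CallerBFI,
                           BlockFrequencyInfo &CalleeBFI,
                           const BasicBlock *BB) {
    // Wrong: CallerBFI.getBlockFreq(BB).getFrequency() == 0
    return CalleeBFI.getBlockFreq(BB).getFrequency();
  }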

eraman updated this revision to Diff 88946. Feb 17 2017, 1:12 PM
  • Use CalleeBFI to get the current block's frequency.
  • Move accumulateCost into this patch instead of it being part of the dependent patch.
davide added a subscriber: davide. Feb 17 2017, 6:57 PM
davidxl added inline comments. Feb 21 2017, 12:02 PM
lib/Analysis/InlineCost.cpp
89

Does this mean globally hot callsites may not get the speedup bonus if their frequency relative to the caller entry is smaller than 8, while a less hot callsite may get both the hot-callsite bump and the big-speedup bump?

254

Document the method.

255

hasLargeSpeedup ?

561

Should this be called inside the lambda?

592

Why does this need to be guarded here?

1053

Why is the guard needed here?

1160

Perhaps at least add parameter passing to the savings?

eraman marked 5 inline comments as done. Feb 22 2017, 11:14 AM

Thanks for the comments.

lib/Analysis/InlineCost.cpp
89

That is possible. The easy fix is to check whether the callsite is hot or its relative frequency exceeds this threshold. I have made this change.
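A sketch of the revised condition (hypothetical names; IsGloballyHot stands in for whatever hotness check the patch uses):

  #include <cstdint>

  // Hypothetical revised gate: a globally hot callsite qualifies even when
  // its frequency relative to the caller's entry is below the ratio cutoff.
  bool qualifiesForSpeedupBonus(bool IsGloballyHot, uint64_t CallSiteFreq,
                                uint64_t CallerEntryFreq, uint64_t MinRatio) {
    return IsGloballyHot || CallSiteFreq >= MinRatio * CallerEntryFreq;
  }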

592

We want to give a bonus only if the speedup happens due to inlining. In most cases, if Evaluate returns a constant, we would have looked up the operands in SimplifiedValues. But it is also possible that the IR actually has an instruction with two constant operands that was not cleaned up by an earlier simplification pass, and we don't want to count that as an inlining benefit (since a later simplification pass will simplify it anyway). I don't think this is very likely in a reasonable pass pipeline, but having the guard explicitly makes the intent clearer.
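Roughly, the guard being described (a sketch with illustrative names, not the patch's code):

  #include <set>

  // SimplifiedByInlining stands in for the analysis's SimplifiedValues map:
  // callee values known to fold to constants *because of this call's
  // arguments*. Only folds that depend on such a value count as savings.
  using ValueId = int;
  bool creditFoldAsInliningSavings(
      const std::set<ValueId> &SimplifiedByInlining, ValueId Op0, ValueId Op1,
      bool FoldsToConstant) {
    if (!FoldsToConstant)
      return false;
    // Two literal constant operands fold with or without inlining; a later
    // simplification pass will clean that up, so it earns no inlining bonus.
    return SimplifiedByInlining.count(Op0) || SimplifiedByInlining.count(Op1);
  }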

1053

Same explanation as above.

eraman updated this revision to Diff 89385. Feb 22 2017, 11:15 AM
eraman marked an inline comment as done.

Rebase and address David's comments.

davidxl added inline comments. Feb 24 2017, 11:52 AM
lib/Analysis/InlineCost.cpp
592

Can you add a brief comment about this?

615

Better to filter cold callsites out too.

1053

A brief comment.

test/Transforms/Inline/speedup-analysis.ll
4

What do the savings come from in this case? Just the call overhead?

chandlerc edited edge metadata. Feb 24 2017, 12:34 PM

Some initial (minor) comments. It'd also help, I think, to rebase this as some of the refactorings land.

lib/Analysis/InlineCost.cpp
80

nit: Please use vertical whitespace before the comment for the second flag.

A more substantive comment: what unit is this? The "desc" string only says "speedup", and it isn't clear whether you mean "10% faster after inlining". If so, that seems a surprisingly low threshold...

90–91

The flag seems to indicate this is the value of the frequency itself... the comment seems to indicate it is relative... Makes it hard for me to understand the value used...

186

I think it'd be better to have the argument always passed and match the accumulateCost API?

Prazek added a subscriber: Prazek. Feb 26 2017, 1:29 AM
Prazek added inline comments.
lib/Analysis/InlineCost.cpp
543

Range-based for loop? There should be something like .operands().

chandlerc added inline comments. Feb 26 2017, 3:02 PM
lib/Analysis/InlineCost.cpp
543

I looked and there isn't really. I can add one though.

Prazek added inline comments. Feb 26 2017, 4:13 PM
lib/Analysis/InlineCost.cpp
543

That would be great, thanks

chandlerc added inline comments. Feb 26 2017, 5:10 PM
lib/Analysis/InlineCost.cpp
543

Huh, I must have missed it. Danny added an 'indices' method back in 2015. =D

eraman marked 5 inline comments as done. Feb 27 2017, 3:16 PM
eraman added inline comments.
lib/Analysis/InlineCost.cpp
80

I've tried to expand this a bit. PTAL.

Regarding the threshold being low: when I tuned it originally, it was before r286814, which added -30 to the cost of every inlining. At the least, this needs to be revised upward to account for that. In any case, I'll collect more numbers on the size/performance tradeoff and follow up.

90–91

It is relative. I've updated the variable name, flag name and the description to make this clear.

543

I see indices() in some other classes, but not in GetElementPtrInst.

test/Transforms/Inline/speedup-analysis.ll
4

Yes, the savings come from eliminating the call.

eraman updated this revision to Diff 89947. Feb 27 2017, 3:17 PM
eraman marked an inline comment as done.

Address review comments.

Sorry for the delay in collecting performance numbers. I now have some data to share. First, some details on the methodology. I used ~400 microbenchmarks that we use internally at Google. I built them with the following percentage values of min-speedup-for-bonus: 0%, 5%, 10%, and 15%. I ran each benchmark 10 times in each configuration. Speedup/slowdown for a benchmark is calculated only when the p-value <= 0.05 (and thus the results might include a different subset of benchmarks for different configs). The numbers presented below are the geomean across all benchmarks.

Config | #Benchmarks | Geomean | #Slowdowns | #Speedups | Size increase
0%     | 134         | 2.92%   | 51         | 83        | 2.45%
5%     | 121         | 1.05%   | 41         | 80        | 1.58%
10%    | 115         | 0.8%    | 51         | 64        | 1.32%
15%    | 160         | 1.03%   | 44         | 116       | 1.02%

Some observations:

  • The best geomean performance comes when the min-speedup-for-bonus is set to 0%. I interpret this to mean that it is generally a performance win to increase the threshold for hot callsites, and the speedup estimation is a way to control the size growth.
  • The performance when min-speedup-for-bonus is set to 10% sits between that of 5% and 15%. As I mentioned above, these are not apples-to-apples comparisons because we compute the geomean on a different set of benchmarks. Even for the same benchmark, it is possible (and it does happen) that the performance numbers are not monotonically decreasing as min-speedup-for-bonus is increased.
  • For comparison, I calculated the size growth if we simply apply a 3X multiplier to the threshold irrespective of the callsite frequency. The size increase is 9.7%.

I'm collecting SPEC numbers now. I've also fixed a bug in the code and will update the patch shortly.

eraman updated this revision to Diff 95490. Apr 17 2017, 2:29 PM

Fix a bug and rebase.

davidxl edited edge metadata. Apr 18 2017, 12:18 PM

Given the performance data, I suggest dropping the min-speedup requirement to 0 at O3.

The data here is really interesting, but I'm not sure about using the 0% threshold...

What I mean by that is: if we use a 0% min-speedup-for-bonus, then we essentially aren't using the speedup computation at all, are we? It seems like this would be roughly the same as just applying a similar multiplier to the threshold based on callsite hotness. Maybe I just don't understand what the result of this is (sorry if I'm failing to page back in all of the details)? If my understanding is correct, though, then I would focus on that first and get it in, and then return to the speedup heuristic to see if there are wins to be found by doing a speedup analysis to bonus less hot callsites, or by giving an *even higher* threshold when a callsite is both hot *and* gives a speedup on inlining.
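To make that concrete, here is a sketch of the adjustment under discussion (names and the multiplier are illustrative): with a 0% minimum the speedup test always passes, so the bonus reduces to a pure hotness-based multiplier.

  // Illustrative only: the estimated speedup lies in [0, 100], so with
  // MinSpeedupForBonus == 0 the second condition is always true and the
  // speedup analysis no longer filters anything.
  int adjustedThreshold(int Threshold, bool RelativelyHot, int SpeedupPercent,
                        int MinSpeedupForBonus, int Multiplier = 3) {
    if (RelativelyHot && SpeedupPercent >= MinSpeedupForBonus)
      return Threshold * Multiplier;
    return Threshold;
  }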

Please also collect LLVM test suite numbers with the SPEC numbers.

One thing that would be particularly important, though, is to collect larger application *size* numbers. I don't think the size growth numbers from microbenchmarks are really going to tell us what we need to know to make good threshold decisions where size is a factor (especially O2 vs. O3).

eraman abandoned this revision. Apr 21 2017, 5:36 PM