[Inliner] Increase threshold for hot callsites without PGO.
ClosedPublic

Authored by eraman on Aug 1 2017, 4:17 PM.

Details

Summary

This increases the inlining threshold for hot callsites. Hotness is
defined in terms of block frequency of the callsite relative to the
caller's entry block's frequency. Since this requires BFI in the
inliner, this only affects the new PM pipeline.

This improves the performance of some internal benchmarks. Notably, an
internal benchmark for Gipfeli compression
(https://github.com/google/gipfeli) improves by ~7%. Povray in SPEC2006
improves by ~2.5%. I am running more experiments and will update the
thread if other benchmarks show improvement/regression.

In terms of text size, LLVM test-suite shows an 1.22% text size
increase. Diving into the results, 13 of the benchmarks in the
test-suite increases by > 10%. Most of these are small, but
Adobe-C++/loop_unroll (17.6% increases) and tramp3d(20.7% size increase)
have >250K text size. On a large application, the text size increases by
2%

Diff Detail

Repository
rL LLVM
eraman created this revision.Aug 1 2017, 4:17 PM

While the total size increase doesn't concern me much, the >10% code size growth in some benchmarks is a bit concerning. Do these benchmarks also improve performance? If so, then this might be fine in general (once the -Os behavior is fixed, see below). However, if the benchmarks that are growing in size by a lot aren't also getting faster, then it seems a hard sell at O2 where we expect size increase to be at least generally associated with performance wins.

Naturally, this seems completely fine at O3 either way.

lib/Analysis/InlineCost.cpp
782–784 ↗(On Diff #109250)

I think we need to avoid this for optForSize() functions.

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

Thanks,

Haicheng

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

(FWIW, my 2 cents here would be that this should win over inline hint, as it should be quite a bit stronger. Anyways, will let Easwaran actually respond as well as I'm curious what he thinks.)

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

(FWIW, my 2 cents here would be that this should win over inline hint, as it should be quite a bit stronger. Anyways, will let Easwaran actually respond as well as I'm curious what he thinks.)

I initially started with a multiplier instead of an absolute threshold, but changed it mainly to be consistent with the PGO based hot callsite. It is definitely a stronger hint, but I think an argument can be made that various "signals" should be composed.

Hi Easwaran,

What if the callee of a hot callsite also has a inline hint?

(FWIW, my 2 cents here would be that this should win over inline hint, as it should be quite a bit stronger. Anyways, will let Easwaran actually respond as well as I'm curious what he thinks.)

I initially started with a multiplier instead of an absolute threshold, but changed it mainly to be consistent with the PGO based hot callsite. It is definitely a stronger hint, but I think an argument can be made that various "signals" should be composed.

Maybe a follow-up with benchmarks motivating (and checking for size issues)?

eraman added a comment.Aug 2 2017, 3:39 PM

While the total size increase doesn't concern me much, the >10% code size growth in some benchmarks is a bit concerning. Do these benchmarks also improve performance? If so, then this might be fine in general (once the -Os behavior is fixed, see below). However, if the benchmarks that are growing in size by a lot aren't also getting faster, then it seems a hard sell at O2 where we expect size increase to be at least generally associated with performance wins.

Naturally, this seems completely fine at O3 either way.

There are 5 tests in the test-suite with text size increase > 10% and > 4K. Three of them improve and two don't. For performance numbers, I used the perf tool to collect uops_retired:any on an Intel Xeon E5-2690 since the running time of these are small and noisy.

Test - Size increase KB (%) - Performance improvement

MultiSource/Benchmarks/VersaBench/beamformer - 5K(41.6%) - None
SingleSource/Benchmarks/Linpack/linpack-pc - 7K(36.4%) - +2%
MultiSource/Benchmarks/FreeBench/pifft - 18K(33%) - None
MultiSource/Benchmarks/tramp3d-v4 - 148K(20.7%) - 18%
SingleSource/Benchmarks/Adobe-C++/loop_unroll - 44K(17.6%) - 9%

The C/C++ subset of SPEC2006 increases by 0.5%. Two benchmarks improve in performance: 453.povray (+1.5%) and 473.astar (+1.81%).

If you think this is iffy for O2, I'll do this only for O3 for now and work on tuning it further with the goal of enabling it at O2.

While the total size increase doesn't concern me much, the >10% code size growth in some benchmarks is a bit concerning. Do these benchmarks also improve performance? If so, then this might be fine in general (once the -Os behavior is fixed, see below). However, if the benchmarks that are growing in size by a lot aren't also getting faster, then it seems a hard sell at O2 where we expect size increase to be at least generally associated with performance wins.

Naturally, this seems completely fine at O3 either way.

There are 5 tests in the test-suite with text size increase > 10% and > 4K. Three of them improve and two don't. For performance numbers, I used the perf tool to collect uops_retired:any on an Intel Xeon E5-2690 since the running time of these are small and noisy.

Test - Size increase KB (%) - Performance improvement

MultiSource/Benchmarks/VersaBench/beamformer - 5K(41.6%) - None
SingleSource/Benchmarks/Linpack/linpack-pc - 7K(36.4%) - +2%
MultiSource/Benchmarks/FreeBench/pifft - 18K(33%) - None
MultiSource/Benchmarks/tramp3d-v4 - 148K(20.7%) - 18%
SingleSource/Benchmarks/Adobe-C++/loop_unroll - 44K(17.6%) - 9%

The C/C++ subset of SPEC2006 increases by 0.5%. Two benchmarks improve in performance: 453.povray (+1.5%) and 473.astar (+1.81%).

First and foremost, thanks for the excellent data and carefully gathering all of it. =]

If you think this is iffy for O2, I'll do this only for O3 for now and work on tuning it further with the goal of enabling it at O2.

Yeah, I would put it in O3 for now, essentially to be conservative. I think this is a borderline case, but it seems easier to revisit it later than deal with churn of regressing people now, and it seems like a clear win for O3.

eraman updated this revision to Diff 109630.Aug 3 2017, 2:07 PM
  • Apply this heuristic only at O3 and disable for optsize
chandlerc accepted this revision.Aug 3 2017, 2:54 PM

LGTM, thanks!

This revision is now accepted and ready to land.Aug 3 2017, 2:54 PM
This revision was automatically updated to reflect the committed changes.
haicheng added inline comments.Aug 4 2017, 8:18 AM
llvm/trunk/lib/Analysis/InlineCost.cpp
81

Is this a typo? Maximum => Minimum