This is an archive of the discontinued LLVM Phabricator instance.

[PGO] IRPGO pre-cleanup pass changes
ClosedPublic

Authored by xur on Jun 15 2016, 1:31 PM.

Details

Summary

This patch adds a selected set of cleanup passes including a pre-inline pass before LLVM IR PGO instrumentation. The inline is only intended to apply those obvious/trivial ones before instrumentation so that much less instrumentation is needed to get better profiling information. This will drastically improve the instrumented code performance for large C++ applications. Another benefit was the context sensitive counts that can potentially improve the PGO optimization.

This is previously discussed here:
http://lists.llvm.org/pipermail/llvm-dev/2015-August/089425.html

Here I recollected some the data to show the impact using the followings:
(1) spec2006 C and C++ programs
(2) some Google internal benchmarks (which are of different size of C++ applications).
(3) instrumented LLVM compiler

The metrics collected are:
(1) instrumented code runtime performance.
https://docs.google.com/document/d/1DCP7KPjtitHMotX2KgTO8PADd9zjLzPJgSXZ8egd6Mk/edit?usp=sharing

(2) program text size (i.e. sum of sections (having address and name) sizes minus
writable section sizes.)
https://docs.google.com/document/d/1ItOetOEc1vTpU54EBjQK4O0LfulFIv3gs9XXnmzwqdo/edit?usp=sharing

(3) program binary size
https://docs.google.com/document/d/1nmiR6He0cxMFqMzWE-X7FdhL5jSzdS9h7WLxDrfmSxQ/edit?usp=sharing

Diff Detail

Repository
rL LLVM

Event Timeline

xur updated this revision to Diff 60891.Jun 15 2016, 1:31 PM
xur retitled this revision from to [PGO] IRPGO pre-cleanup pass changes.
xur updated this object.
xur added reviewers: silvas, vsk, davidxl, mehdi_amini, tejohnson.
xur added a subscriber: llvm-commits.
xur added a comment.Jun 15 2016, 1:41 PM

The results in pdf are also attached here.

mehdi_amini edited edge metadata.Jun 15 2016, 2:08 PM

I really don't like the fact that the PGO pipeline will be different from the non-PGO pipeline (other than what is required for instrumentations). I wonder what other people will think of that.

lib/Transforms/IPO/PassManagerBuilder.cpp
128 ↗(On Diff #60891)

Why two options?

228 ↗(On Diff #60891)

These seem pretty arbitrary numbers?

245 ↗(On Diff #60891)

Where is this list coming from? How is it computed?

vsk edited edge metadata.Jun 15 2016, 2:12 PM

This looks great! Thanks for putting together the detailed performance reports.

I noticed that there is one small slowdown with 401.bzip. Do you know what aspect of the pipeline is responsible for the change? What were your criteria for picking the pre-inline pipeline?

lib/Transforms/IPO/PassManagerBuilder.cpp
122 ↗(On Diff #60891)

Is zero or more needed?

xur added a comment.Jun 15 2016, 2:25 PM

comments are inlined.

lib/Transforms/IPO/PassManagerBuilder.cpp
122 ↗(On Diff #60891)

probably not needed.

128 ↗(On Diff #60891)

this is adopted from the earlier implementation where we have a separated inline passes. Since now we are using the standard inliner with different thresholds. I think we can get rid of preinline option and keep disable-preinline option.

228 ↗(On Diff #60891)

No. They are from current Os and Oz thresholds

245 ↗(On Diff #60891)

I came up this list. I'm open to add most passes.

silvas added inline comments.Jun 15 2016, 2:52 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
245 ↗(On Diff #60891)

In testing our games, we did not observe a significant reduction in instrumentation overhead from adding passes after pre-inlining.

For example, we started with the set of passes initially proposed in http://reviews.llvm.org/D15828. Then we removed individual passes until there was a significant change in performance of the instrumented build. We were able to remove all of them except for the inlining pass.

I.e. only the inlining pass contributed significantly to reducing overhead of the instrumented build. I would be interested to see the results of this experiment on a subset of your benchmarks.

I believe the reason for this is that in the codebases we were testing (PS4 games), there are a very large number of calls to trivial single-BB functions (e.g. Vec4::operator+). On one codebase I analyzed, over 80% of all counts in the profile are from single-BB functions.

silvas edited edge metadata.Jun 15 2016, 3:09 PM

I really don't like the fact that the PGO pipeline will be different from the non-PGO pipeline (other than what is required for instrumentations). I wonder what other people will think of that.

This was one of my initial concerns too: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089058.html
Jake and I (actually mostly Jake) did some basic measurements related to this and did not find any significant difference.

Rong also did some measurements related to this in the initial RFC thread: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089425.html

I agree though: generally speaking, we should avoid running too many passes before instrumentation precisely because they cause divergence between the PGO and non-PGO pipelines. Thankfully Jake and I found that on our games only pre-inlining was needed (running other optimizations before instrumentation did not help significantly for our test cases beyond the benefit of just pre-inlining).

(note: due to PR27299, a run of simplifycfg after inlining is needed. But this run of simplifycfg doesn't really affect the performance of the instrumented build at all)

xur added a comment.Jun 15 2016, 3:26 PM

to vsk:

I did some analysis on the slow down on bzip2: with preinstrumentation inliner we actually are more aggressive on the late simple inline. Here is the related call chain.
BZ2_compressBlock() -> sendMTFValues() --> bsW()
BZ2_compressBlock calls sendMTFValues() one time (1 call site), and
sendMTFValues() has 64 call sites to bsW().

In preinline, we inlines sendMTFValues() to BZ2_compressBlock().
In simple inline, we inlines all 64 calls to bsW() to BZ2_compressBlock().

Without preininline, we inline 2 calls to bsW in sendMTFValues() and then decided to defers the inline to the other calls bsW(). But somehow we do not inline sendMTFValues() to BZ2_compressBlock().

I'm yet to investigate why deferred decision changed in simple inliner. I think this is a rare case that we happen to hit.

davidxl added inline comments.Jun 15 2016, 4:03 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
238 ↗(On Diff #60891)

A more effective cleanup should probably include SROA + EarlyCSE

239 ↗(On Diff #60891)

I doubt JumpThreading pass will make any difference here. Can you experiment removing it?

240 ↗(On Diff #60891)

Experimenting removing this pass too?

242 ↗(On Diff #60891)

CFG simplification and inst combine look ok

243 ↗(On Diff #60891)

Remove GVN pass?

eraman added a subscriber: eraman.Jun 17 2016, 5:44 PM
In D21405#459392, @xur wrote:

to vsk:

I did some analysis on the slow down on bzip2: with preinstrumentation inliner we actually are more aggressive on the late simple inline. Here is the related call chain.
BZ2_compressBlock() -> sendMTFValues() --> bsW()
BZ2_compressBlock calls sendMTFValues() one time (1 call site), and
sendMTFValues() has 64 call sites to bsW().

In preinline, we inlines sendMTFValues() to BZ2_compressBlock().
In simple inline, we inlines all 64 calls to bsW() to BZ2_compressBlock().

Without preininline, we inline 2 calls to bsW in sendMTFValues() and then decided to defers the inline to the other calls bsW(). But somehow we do not inline sendMTFValues() to BZ2_compressBlock().

I'm yet to investigate why deferred decision changed in simple inliner. I think this is a rare case that we happen to hit.

Rong asked me to look into this. There is a bug in the deferral logic and if it is fixed the default inliner will also result in a code size increase (and possibly performance regression).
A brief desciption of the deferral logic: When we inline a B->C callsite, and B has local linkage, we look at all callers of B (say A_i). If the cost of B->C inlining exceeds the delta (threshold - cost) of A->B inlining, it checks if the overall cost of B->C inlining plus all A_i->B inlining is less than B->C inlining and if it is true, the inlining is deferred. The idea is that delaying B->C inlining will allow B to be inlined into all its callers and the out-of-line body of B can be removed and subsequently C will be inlined into A, resulting in overall cost (proxy for code size) reduction. To account for the fact that the body of B could be removed, a negative cost (-15000) is applied.

Now, in this case (A: BZ2_compressBlock(), B: sendMTFValues() C:bsW()) , there is only one caller of B ( A). When A->B inline cost is computed, the cost analysis also applies the -15000 cost. In other words, the deferral logic under-estimate the cost of A->B inlining by 15000 and defer (because the cost of A->B + B->C is less than the cost of B->C after this under-estimation) most B->C callsites are deferred. But when we consider A->B inlining, the cost becomes higher than the threshold (since we don't apply the -15000 cost twice) and once that fails, the C nodes do not get inlined.

The fix is simple - apply the negative cost correctly - but that will result in all B->C callsites being inlined (and no inlining of A->B) callsite resulting in code size regression.

xur updated this revision to Diff 62287.Jun 29 2016, 2:51 PM
xur edited edge metadata.

ere is the updated patch. Changes are:
(1) Removed options preinline as suggested by mehdi.
(2) Updated the cleanup passes list.

I used google internal benchmarks to measure the performance. The cleanup pass have ~2% geo-mean on the profile-generate performance. The largest saving is 17%.

David's suggestion on using SROA and EarlyCSE is good. It has positive impacts on a number of programs (largest ~10%)

Jumpthreading actually helps a few programs also (~3%).

In this patch, I propose to use
SROA + EArlyCSE + CFGSimplification + InstructionCombining
and remove the rest.

davidxl added inline comments.Jul 8 2016, 10:43 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
213 ↗(On Diff #62287)

Do you have text size comparison (for profile-use phase) with and without pre-cleanup? If pre-cleanup always reduces final size, there is no need to special case here. If not, we should disable pre-cleanup with Os or Oz is specified.

216 ↗(On Diff #62287)

Needs to have an option to control the value.

xur added a comment.Jul 8 2016, 11:06 AM

There is small text size increment in profile-use compilation. Here is the data using google benchmarks:
"-" means size increase.
`benchmark nocleanup cleanup
C++_bencharmk01 223956862 -2.47%
C++_bencharmk02 83380721 -1.06%
C++_bencharmk03 163282214 -0.56%
C++_bencharmk04 623589887 -0.16%
C++_bencharmk05 510498950 -0.41%
C++_bencharmk06 4523393 -3.07%
C++_bencharmk07 4523028 -3.07%
C++_bencharmk08 4523953 -3.05%
C_bencharmk09 17632563 -0.70%
C_bencharmk10 17633287 -0.69%
C++_bencharmk11 4402197 -3.00%
C++_bencharmk12 147280528 -1.13%
C++_bencharmk14 148784695 -1.01%
C_bencharmk15 2829725 -0.69%
C++_bencharmk16 41095218 -1.01%
C++_bencharmk17 115356091 -0.58%
C++_bencharmk18 115327817 -0.63%
C++_bencharmk19 240183470 -1.39%
C++_bencharmk20 4453343 -2.96%
C_bencharmk21 1729644 -1.67%
C++_bencharmk22 4336754 -2.67%

total size 2479324340 -0.80%
mean of improvements -1.52%
`

I'll disable the cleanup pass in Os and Oz and use a parameterized threshold.

mehdi_amini added inline comments.Jul 8 2016, 11:07 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
228 ↗(On Diff #62287)

Uh? Since we have Os/Oz as function attribute, it seems strange to me to have this hard-coded on a global level.

mehdi_amini added inline comments.Jul 8 2016, 11:09 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
228 ↗(On Diff #62287)

Also is it duplicating llvm::computeThresholdFromOptLevels ?

mehdi_amini added inline comments.Jul 8 2016, 11:11 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
216 ↗(On Diff #62287)

Apparently -inline-threshold will override.

davidxl added inline comments.Jul 8 2016, 11:28 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
216 ↗(On Diff #62287)

I don't think that is the right option to use. The -inline-threshold option is used to control the regular inliner -- the tuning option of the pre-cleanup should not affect it.

228 ↗(On Diff #62287)

It does not. The pre-inline's role is to do pre-profiling cleanup, but to do full scale inlining, thus it needs to be more conservative (no profile guidance is available at this point). It should not controlled in the same way as the regular inline phase.

228 ↗(On Diff #62287)

There is no mechanizm to do threshold tuning on a function granularity. Even there is one, using it may not help the goal of reducing the final overall size.

mehdi_amini added inline comments.Jul 8 2016, 11:41 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
216 ↗(On Diff #62287)

OK, make sense.

228 ↗(On Diff #62287)

OK.

xur updated this revision to Diff 63272.Jul 8 2016, 11:46 AM

Integrated David's review comments:
(1) don't perform the cleanup passes for size optimizations.
(2) add an option to control the preinline threshold.

xur added a comment.Jul 8 2016, 1:45 PM

This cleanup passes also have performance impact on the profile-use compilation.

In google benchmarks, compare to without these cleanup passes, we are seeing ~14% improvement for 1 benchmark, ~2 to ~4% for 3 benchmarks, and ~1% for other 2 benchmarks.

If we run these cleanup passes on O2 (without fdo), we only see small improvement (~1%) on 2 benchmarks.
We are thinking a large part of the improvement in profile-use is getting from better context sensitive profiles.

Note that all the numbers are got from existing profile-use infrastructure which is less well tuned.

davidxl added inline comments.Jul 8 2016, 1:50 PM
test/Transforms/PGOProfile/preinline.ll
8 ↗(On Diff #63272)

Also add a check that the call to bar is actually eliminated

xur updated this revision to Diff 63315.Jul 8 2016, 2:01 PM
xur marked an inline comment as done.

check function bar() is eliminated in the test case.

davidxl edited edge metadata.Jul 11 2016, 10:07 AM

The result is really good. LGTM, but wait to see other reviewers have further comment.

I don't have any other comment, don't wait for me.

test/Transforms/PGOProfile/preinline.ll
8 ↗(On Diff #63315)

Maybe add a one-line comment here saying why you have this test?

davidxl accepted this revision.Jul 13 2016, 10:24 AM
davidxl edited edge metadata.

lgtm

(with test comment requested by medhdi).

This revision is now accepted and ready to land.Jul 13 2016, 10:24 AM
This revision was automatically updated to reflect the committed changes.

I really don't like the fact that the PGO pipeline will be different from the non-PGO pipeline (other than what is required for instrumentations). I wonder what other people will think of that.

This was one of my initial concerns too: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089058.html
Jake and I (actually mostly Jake) did some basic measurements related to this and did not find any significant difference.

Rong also did some measurements related to this in the initial RFC thread: http://lists.llvm.org/pipermail/llvm-dev/2015-August/089425.html

I agree though: generally speaking, we should avoid running too many passes before instrumentation precisely because they cause divergence between the PGO and non-PGO pipelines. Thankfully Jake and I found that on our games only pre-inlining was needed (running other optimizations before instrumentation did not help significantly for our test cases beyond the benefit of just pre-inlining).

I'm glad this is just pre-inlining, I strongly agree that we should not diverge here.

I actually continue to disagree with pre-inlining. I have said this repeatedly in the past, and I'm not sure where we reached consensus that pre-inlining was the right design, and where the discussion of the tradeoffs took place. The closest I could find was the thread you linked to Sean, but I don't actually see a lot of discussion of alternatives there, nor do I see much real discussion of the inliner at all outside of the fact that yes, it does work.

I would be much more in favor of a design which instead of changing the pipeline to optimize differently (and thus having PGO vs. normal divergence) either:

  • Teaches the instrumentation pass to do the necessary (possibly interprocedural) analysis to instrument only in places it will be cheap, or
  • Teach the existing pipeline to actually optimize and transform the instrumentation code to remove the overhead as necessary.

This way we have a single optimization pipeline that we turn profile info on and off of ,rather than twe pipelines.

I would be much more in favor of a design which instead of changing the pipeline to optimize differently (and thus having PGO vs. normal divergence) either:

  • Teaches the instrumentation pass to do the necessary (possibly interprocedural) analysis to instrument only in places it will be cheap, or
  • Teach the existing pipeline to actually optimize and transform the instrumentation code to remove the overhead as necessary.

I totally agree that this would be much better, however there is something that was mentioned that I don't know how to address: the counters can get actually more precise due to the split context after inlining.