Page MenuHomePhabricator

[WIP][LTO] Apply SamplePGO pipeline tunes for ThinLTO pre-link to full LTO
Needs ReviewPublic

Authored by tejohnson on Nov 1 2019, 1:04 PM.

Details

Reviewers
wristow
ormris
Summary

There are several modifications to the optimizations performed by the
ThinLTO pre link compile when building with Sample PGO, in order to get
better matching of the SamplePGO when it is re-applied in the backend.
These same tunes should be done for full LTO pre-links as well, as
presumably the same matching issues could occur there.

There are a few issues with this patch as it stands, relating to the
fact that not all of these optimizations are attempted again in the full
LTO backend, owing to the larger compile time with the monolithic LTO.
Specifically, this includes some loop optimizations:

  • In the old PM, LoopUnrollAndJam is not done in the full LTO backend.
  • In the new PM, none of the loop unrolling is done in the full LTO

backend. The comments indicate that this is in part due to issues with
the new PM loop pass manager (presumably sorted out by now, but I
haven't followed this). Here is the comment:

// FIXME: at this point, we run a bunch of loop passes:
// indVarSimplify, loopDeletion, loopInterchange, loopUnroll,
// loopVectorize. Enable them once the remaining issue with LPM
// are sorted out.

So what needs to happen still is to either:

  1. Continue to diverge the ThinLTO and full LTO pre-link pipelines for

these optimizations (which means this patch needs to be adjusted).
OR

  1. Fix the full LTO post-link pipelines to ensure these optimizations

all occur there.

Event Timeline

tejohnson created this revision.Nov 1 2019, 1:04 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptNov 1 2019, 1:04 PM

This probably needs to be taken over by someone who cares about full LTO performance (@wristow or @ormris ?). This patch was some cleanup of the full LTO sample PGO pipeline, but has a number of issues I enumerate in the summary.

wenlei added a subscriber: wenlei.Nov 1 2019, 2:18 PM

The comments indicate that this is in part due to issues with
the new PM loop pass manager

Wondering how different it is for these loop passes to be enabled for MonoLTO vs ThinLTO? If it's due to problems with the newPM, I guess ThinLTO would have the same problems? Asking because we have almost the same change as internal patch trying to get better LTO time profile precision for MonoLTO, and with that there's small win for oldPM+MonoLTO. But we'd love to converge on new PM for both MonoLTO and ThinLTO.

The comments indicate that this is in part due to issues with
the new PM loop pass manager

Wondering how different it is for these loop passes to be enabled for MonoLTO vs ThinLTO? If it's due to problems with the newPM, I guess ThinLTO would have the same problems?

The ThinLTO backends don't use this code but rather PassBuilder::buildModuleOptimizationPipeline, which includes all of these loop optimizations (and other optimizations, since the ThinLTO backends can absorb the extra compile time cost). That's what makes me think this comment is stale and someone just forgot to add the loop optimization passes to the full LTO post-link pipeline. I haven't looked at the history of all these changes. My guess is that since the full LTO pre-link pipeline does all these loop optimizations (also through buildModuleOptimizationPipeline), no one noticed that they weren't also being done in the post-link full LTO pipeline.

Asking because we have almost the same change as internal patch trying to get better LTO time profile precision for MonoLTO, and with that there's small win for oldPM+MonoLTO.

That's good to know, it's what I would expect but good to have the confirmation.

But we'd love to converge on new PM for both MonoLTO and ThinLTO.

wenlei added inline comments.Nov 4 2019, 1:47 PM
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
614

this also need to be PrepareForThinLTO || PrepareForLTO for oldPM?

This probably needs to be taken over by someone who cares about full LTO performance

We at PlayStation are definitely interested in full LTO performance, so we're looking into this. We certainly agree with the rationale that if suppressing some optimizations is useful to allow better SamplePGO matching, then we'd expect that would apply equally to both ThinLTO and full LTO.

I guess much of this comes down to a balancing act between:

  1. The amount of the runtime benefit with Sample PGO if these loop optimizations are deferred to the full LTO back-end (like they are for ThinLTO).
  2. The cost in compile-time resources in the full LTO back-end to do these loop optimizations at that later stage.

From the discussion here, the Sample PGO runtime win (point 1) seems more or less to be a given. If we find the compile-time cost in the full LTO back-end (point 2) is not significant, then the decision should be easy. So after seeing this patch, we're doing some experiments to at least try to get a handle on this. (I'm a bit concerned we won't be able to draw any hard conclusions from the results of our experiments, but at least we'll be able to make a better informed assessment.)

FTR, for PlayStation, we're using the old PM. But we'll do some experiments for both the old and new PM, to get a sense of the answers to the (old PM) LoopUnrollAndJam point, and the (new PM) FIXME comment.

wristow added inline comments.Nov 4 2019, 6:31 PM
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
614

I agree this is another instance where a balancing act question applies. In this case, assuming the comment about the concern of code bloat is accurate, it's not so much about compile-time resources in the full LTO back-end, but rather about minimizing the ThinLTO bitcode write/read time. So if as this WIP evolves, it ultimately is a win for SamplePGO to suppress some loop optimizations (unrolling/vectorization) here, then that will probably also be a small win in full LTO compile time. That said, in addition to these loop-related optimizations, there are other transformations here that are done in the full LTO pipeline (but not in the ThinLTO pipeline). So I suspect if some change to check for PrepareForThinLTO || PrepareForLTO (rather than only PrepareForThinLTO) makes sense here from a Sample PGO perspective, then the change will be more complicated than simply adding the small set of passes here followed by the early return (that is, I think there are probably things after the return on line 621 that still ought to be enabled for full LTO -- essentially continuing to do them in the pre-link stage for full LTO, to try to avoid needing to do too much work in the full LTO backend stage, since it's more of a problem for the full backend to absorb that compile time cost).

tejohnson marked an inline comment as done.Nov 5 2019, 6:22 AM

This probably needs to be taken over by someone who cares about full LTO performance

We at PlayStation are definitely interested in full LTO performance, so we're looking into this. We certainly agree with the rationale that if suppressing some optimizations is useful to allow better SamplePGO matching, then we'd expect that would apply equally to both ThinLTO and full LTO.

I guess much of this comes down to a balancing act between:

  1. The amount of the runtime benefit with Sample PGO if these loop optimizations are deferred to the full LTO back-end (like they are for ThinLTO).
  2. The cost in compile-time resources in the full LTO back-end to do these loop optimizations at that later stage.

    From the discussion here, the Sample PGO runtime win (point 1) seems more or less to be a given. If we find the compile-time cost in the full LTO back-end (point 2) is not significant, then the decision should be easy. So after seeing this patch, we're doing some experiments to at least try to get a handle on this. (I'm a bit concerned we won't be able to draw any hard conclusions from the results of our experiments, but at least we'll be able to make a better informed assessment.)

    FTR, for PlayStation, we're using the old PM. But we'll do some experiments for both the old and new PM, to get a sense of the answers to the (old PM) LoopUnrollAndJam point, and the (new PM) FIXME comment.

This is a good summary. I look forward to your results.

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
614

This early return was not for Sample PGO btw. It was added much earlier with the thought that a) these types of optimizations might affect function importing heuristics because they could bloat the code; b) we can push more optimizations to the post-link in ThinLTO because it is parallel; and c) there isn't otherwise a benefit to doing these optimizations in the pre vs post link, i.e. they aren't cleanup/simplification passes. The equation is of course different for full LTO which has a monolithic serial post link backend. But I believe this early return is the one @ormris is looking to remove on the ThinLTO pass to "merge" the two pipelines, which needs a good amount of evaluation on the ThinLTO performance side.

ormris added a comment.Thu, Dec 5, 4:39 PM

I've done testing with the following global parameters.

  • The base for the branch is llvmorg-10-init-8655-g94a4a2c97f8
  • Used llvm, clang, lld, and llvm-ar from this branch.
  • The sqlite kvtest program was the test payload.

This test compared an unmodified compiler from the base of the branch with a modified compiler with this patch applied and the loop optimisation passes mentioned above moved to the backend. The results were as follows. All numbers in seconds.

RunModified LTOModified SPGO+LTOUnmodified SPGO+LTO
142.0041.7342.08
242.3039.4942.45
341.2142.4642.49
AVG:41.8441.2342.34

TL;DR the average run using a compiler built with the modified SPGO pipeline is about a second faster. Definitely a positive initial result.

This probably needs to be taken over by someone who cares about full LTO performance (@wristow or @ormris ?). This patch was some cleanup of the full LTO sample PGO pipeline, but has a number of issues I enumerate in the summary.

Given the performance improvements here, I'd like to develop this patch further.

Given the performance improvements here, I'd like to develop this patch further.

In D69732#1784290, @ormris wrote:
Ping @tejohnson

@ormris, I think that since @tejohnson originally suggested that someone with more interest in full LTO performance pick this up (and she specifically suggested you or me), then you can feel free to take this over.

Given the performance improvements here, I'd like to develop this patch further.

In D69732#1784290, @ormris wrote:
Ping @tejohnson

@ormris, I think that since @tejohnson originally suggested that someone with more interest in full LTO performance pick this up (and she specifically suggested you or me), then you can feel free to take this over.

Yep, sorry, I didn't realize you were waiting for me to confirm! That sounds great.