This is an archive of the discontinued LLVM Phabricator instance.

[Pipelines] Don't run module optimization in full LTO pre-link
Accepted · Public

Authored by nikic on Apr 11 2023, 5:54 AM.

Details

Summary

When LTO is used, the pre-link pipeline is supposed to only perform module simplification, but not optimization. The post-link pipeline is responsible for optimization. In particular, we should not perform size increasing optimizations like loop vectorization and runtime unrolling before the second inlining run in the post-link pipeline.

This is already handled correctly by the thin LTO pipeline, but not the full LTO pipeline. This patch effectively makes the pre-link pipelines for thin/full LTO the same. (There is a minor difference in a single GlobalDCE run that possibly shouldn't be there, but I'm leaving it for now.)
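
As a concrete way to see what is being aligned, here is a minimal sketch (not part of the patch; the pipeline-builder calls are real PassBuilder entry points, the driver around them is illustrative) that runs the full LTO pre-link pipeline over a module:

    // Sketch: run the full LTO pre-link pipeline on a module. With this
    // patch, the resulting pipeline performs module simplification only,
    // matching buildThinLTOPreLinkDefaultPipeline (modulo the GlobalDCE
    // run mentioned above); size-increasing transforms such as loop
    // vectorization and runtime unrolling are left to the post-link step.
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IRReader/IRReader.h"
    #include "llvm/Passes/PassBuilder.h"
    #include "llvm/Support/SourceMgr.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    int main(int argc, char **argv) {
      if (argc < 2)
        return 1;
      LLVMContext Ctx;
      SMDiagnostic Err;
      std::unique_ptr<Module> M = parseIRFile(argv[1], Err, Ctx);
      if (!M) {
        Err.print(argv[0], errs());
        return 1;
      }

      // Standard new-pass-manager analysis setup.
      LoopAnalysisManager LAM;
      FunctionAnalysisManager FAM;
      CGSCCAnalysisManager CGAM;
      ModuleAnalysisManager MAM;
      PassBuilder PB;
      PB.registerModuleAnalyses(MAM);
      PB.registerCGSCCAnalyses(CGAM);
      PB.registerFunctionAnalyses(FAM);
      PB.registerLoopAnalyses(LAM);
      PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

      ModulePassManager MPM =
          PB.buildLTOPreLinkDefaultPipeline(OptimizationLevel::O2);
      MPM.run(*M, MAM);
      M->print(outs(), nullptr);
      return 0;
    }

The same pipelines can also be exercised textually through opt, e.g. -passes='lto-pre-link<O2>' versus -passes='thinlto-pre-link<O2>'.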

Diff Detail

Event Timeline

nikic created this revision.Apr 11 2023, 5:54 AM
Herald added a project: Restricted Project.Apr 11 2023, 5:54 AM
nikic requested review of this revision.Apr 11 2023, 5:54 AM
fhahn added a comment.Apr 11 2023, 5:58 AM

In general I think this is the right direction to go, but it would be good to have some performance data on the impact before going ahead, to catch any regressions early.

I would be very surprised if this didn't cause some large regressions. I've always thought that one of the benefits of LTO is not just that it combines all functions into a single module, but that it runs the pass pipeline twice.

I would be very surprised if this didn't cause some large regressions.

In theory this should be an improvement in both compile time and performance. In practice, theory and practice probably don't match, and this will be a mixed bag in terms of performance.

I've always thought that one of the benefits of LTO is not just that it combines all functions into a single module, but that it runs the pass pipeline twice.

This is not an intended benefit of LTO, but it's indeed pretty plausible that LTO can paper over phase ordering issues.

I'd appreciate any help in evaluating performance for this change.

Makes sense in theory, since the module optimization pipeline really only needs to be run once per function, but it looks like there are some passes in the module optimization pipeline that aren't in the LTO post-link pipeline, like RelLookupTableConverterPass, DivRemPairsPass, and LoopSink.

nikic added a comment.Apr 14 2023, 7:47 AM

Makes sense in theory, since the module optimization pipeline really only needs to be run once per function, but it looks like there are some passes in the module optimization pipeline that aren't in the LTO post-link pipeline, like RelLookupTableConverterPass, DivRemPairsPass, and LoopSink.

Good point. I've added LoopSink and DivRemPairs to the LTO pipeline in D148343. It's worth noting that these are close to useless in the pre-link pipeline, because they are de-canonicalizing transforms, which will get undone by the post-link passes. There is no point in running these pre-link.
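
For illustration, a hedged sketch of the direction D148343 takes (the actual patch may differ in placement and details): appending the two de-canonicalizing function passes at the very end of the post-link module pipeline.

    // Illustrative only; see D148343 for the real change.
    #include "llvm/IR/PassManager.h"
    #include "llvm/Transforms/Scalar/DivRemPairs.h"
    #include "llvm/Transforms/Scalar/LoopSink.h"

    using namespace llvm;

    static void addLateDecanonicalizingPasses(ModulePassManager &MPM) {
      FunctionPassManager LateFPM;
      // LoopSink moves instructions back into loops when profile data says
      // the preheader is colder than the loop body -- the opposite of
      // LICM's canonical hoisting, hence "de-canonicalizing".
      LateFPM.addPass(LoopSinkPass());
      // DivRemPairs rewrites matching div/rem pairs so codegen can form a
      // combined divmod; earlier passes prefer the split, canonical form.
      LateFPM.addPass(DivRemPairsPass());
      MPM.addPass(createModuleToFunctionPassAdaptor(std::move(LateFPM)));
    }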

I also happened across D148120 today, which is one of those cases where doing module optimization pre-link hurts optimization quality, because code gets vectorized pre-link before the trip count becomes known post-link.
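
To make that failure mode concrete, a hypothetical two-file example (not taken from D148120 itself):

    // a.cpp -- seen in isolation during pre-link, so the trip count n is
    // unknown and the vectorizer emits a vector body, a scalar remainder,
    // and runtime trip-count checks.
    void scale(float *x, int n) {
      for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;
    }

    // b.cpp -- only at LTO link time does inlining reveal n == 3, where a
    // short scalar (or fully unrolled) loop is clearly better; the
    // pre-link vector code is now dead weight the post-link pipeline has
    // to clean up.
    void scale(float *x, int n);
    void caller(float *x) { scale(x, 3); }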

aeubanks accepted this revision.Apr 14 2023, 10:06 AM
This revision is now accepted and ready to land.Apr 14 2023, 10:06 AM
nikic updated this revision to Diff 514172.Apr 17 2023, 5:28 AM

Rebase and remove bits of dead code.

Hello. Did you manage to get anywhere with performance results?

nikic added a comment.Apr 19 2023, 2:55 PM

Hello. Did you manage to get anywhere with performance results?

Not yet. I had hoped that somebody with production non-thin LTO workloads would provide a performance evaluation. As that hasn't happened, I'll have to sanity-check this on llvm-test-suite instead.

I have been trying to look through our downstream performance differences, but they might be quite different to upstream. I think I can reduce or remove most of the regressions with pass pipeline changes in the tests we have that run under LTO.

I have been looking at cases similar to D148120 recently. It's a problem we already knew about (https://reviews.llvm.org/D125301#3536693). But more than just predication, scalable vectorization in general will lead to loops without precise trip counts. I'm still not 100% sure this is the best approach, but we need to fix it in some way, and for scalable vectors at least it makes sense to limit vectorization during the pre-LTO step.

Unfortunately I think there might just be a lot of changes. The first thing to fix should probably be the pass pipeline differences that come through addVectorPasses; quite a few differences will come up from that.

@dmgreen how much work do you think it is to fix the regressions? I hear the SLPVectorizer has code to avoid vectorizing in some cases so that the LTO post-link can vectorize better; this change would simplify that.

Can you contrast this to the previous attempts like mine here: https://reviews.llvm.org/D29376 ?

nikic added a comment.May 19 2023, 2:17 PM

Can you contrast this to the previous attempts like mine here: https://reviews.llvm.org/D29376 ?

As I commented on the other thread, the difference is that you replaced both the pre-link and the post-link pipelines, while this is only changing the pre-link pipeline. Fully aligning the post-link pipelines is a much more complex problem, in part because the full LTO post-link pipeline is compile-time sensitive, and in part because it is pretty much completely separate from all the other pipelines we have (which use minor variations of the general module simplification / module optimization framework). I think there are some things we can do to make the full LTO post-link pipeline closer to our other pipelines, but just replacing it with the thin LTO pipeline as you proposed is likely not going to be viable. (In particular, I believe it should be possible to mostly reuse the module optimization phase, but not the module simplification phase as-is.)
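
For concreteness, all of the pipelines being contrasted here can be materialized by name, which is a convenient way to diff them (a minimal sketch; error handling kept crude):

    // The textual aliases "default<O2>", "thinlto-pre-link<O2>",
    // "lto-pre-link<O2>", "thinlto<O2>" and "lto<O2>" map to the
    // corresponding PassBuilder pipelines. This patch aligns
    // "lto-pre-link" with "thinlto-pre-link"; aligning "lto" with
    // "thinlto" is the harder follow-up discussed above.
    #include "llvm/Passes/PassBuilder.h"
    #include "llvm/Support/Error.h"

    using namespace llvm;

    static ModulePassManager buildByName(PassBuilder &PB, StringRef Text) {
      ModulePassManager MPM;
      if (Error E = PB.parsePassPipeline(MPM, Text))
        report_fatal_error(std::move(E));
      return MPM;
    }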

As far as I remember, there were a bunch of passes that are part of the compile-time phase that are not in the FullLTO link-time pipeline: if you just prune all these from the compile-time phase, aren't we going to miss some optimization opportunities?
That is: passes that "shouldn't be run" during compile time are actually executed there because we don't have the time budget to run all of this during link time (which in turn focuses on catching up on cross-module optimization plus a lightweight pipeline).

Is there a thread on Discourse about this? I'm interested to see benchmark numbers for this!

nikic added a comment.May 19 2023, 2:46 PM

As far as I remember, there were a bunch of passes that are part of the compile-time phase that are not in the FullLTO link-time pipeline: if you just prune all these from the compile-time phase, aren't we going to miss some optimization opportunities?
That is: passes that "shouldn't be run" during compile time are actually executed there because we don't have the time budget to run all of this during link time (which in turn focuses on catching up on cross-module optimization plus a lightweight pipeline).

For the most part, passes that are part of the module optimization phase do also get run post-link in full LTO (the big differences are before that -- the simplification phase is much reduced in full LTO and also does not use CGSCC interleaving). There is a small number of differences (e.g. missing loop load elimination) that I still plan to address. Probably the more substantial problem is that while the right passes get run, they get run in the wrong order: in particular, runtime unrolling is performed before vectorization, rather than after. I plan to address these issues before moving forward with this patch, but I'm not sure when I'll get around to it.
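
As a hedged sketch of the intended ordering (the pass names are real; the snippet is illustrative rather than the actual pipeline-building code):

    // Vectorize first, then runtime-unroll whatever is left. Running the
    // unroller first hides the original loop structure from the
    // vectorizer and bloats its input.
    #include "llvm/IR/PassManager.h"
    #include "llvm/Transforms/Scalar/LoopUnrollPass.h"
    #include "llvm/Transforms/Vectorize/LoopVectorize.h"

    using namespace llvm;

    static FunctionPassManager buildVectorThenUnroll() {
      FunctionPassManager FPM;
      FPM.addPass(LoopVectorizePass());
      FPM.addPass(LoopUnrollPass(LoopUnrollOptions(/*OptLevel=*/2)));
      return FPM;
    }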

We build Meta's Android apps with full LTO (paired with -Oz) and function merging to minimize size. I evaluated this patch as-is and it caused a considerable size regression (526 KiB for Facebook for Android arm64). However, my colleague @lanza discovered that adding additional instances of the function merging pass (here, here, and here), in addition to being a considerable size improvement by itself, also turns this change into a small size improvement (20 KiB). We haven't upstreamed the additional function merger runs yet (we want to make sure we understand the size improvements properly and have a principled reason for the change), but I figured I'd note it here early in case someone else is also evaluating this for size.

lanza added a comment (edited).May 26 2023, 3:17 PM

I evaluated this patch as-is and it caused a considerable size regression (526 KiB for Facebook for Android arm64). However, my colleague @lanza discovered that adding additional instances of the function merging pass (here, here, and here), in addition to being a considerable size improvement by itself, also turns this change into a small size improvement (20 KiB). We haven't upstreamed the additional function merger runs yet (we want to make sure we understand the size improvements properly and have a principled reason for the change), but I figured I'd note it here early in case someone else is also evaluating this for size.

To add a bit more detail here: we believe that the reason this patch causes a regression (for us) is that running the function merging pass before LTO inlining (e.g. in the module optimization pipeline at the end of LTO pre-link) causes a set of internal functions to satisfy hasOneUse(). The inliner then boosts the inline threshold for these functions, inlines them, simplifies them, and generates size savings.
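
A minimal sketch of that setup (illustrative only; the precise insertion points were linked in the comment and are not reproduced here):

    // Running MergeFunctions ahead of LTO inlining. Merging identical
    // internal functions can leave survivors with a single caller, i.e.
    // hasOneUse() becomes true, which raises the inline threshold and
    // lets the inliner fold and simplify them away, per the comment above.
    #include "llvm/IR/PassManager.h"
    #include "llvm/Transforms/IPO/MergeFunctions.h"

    using namespace llvm;

    static void addPreInlineMerging(ModulePassManager &MPM) {
      MPM.addPass(MergeFunctionsPass());
    }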