
LTO: align the Monolithic LTO optimization pipeline on the ThinLTO and O2/O3 one
Needs Revision · Public

Authored by mehdi_amini on Jan 31 2017, 11:19 PM.

Details

Summary

18 months ago, we tried to do the same thing, but the increased
link time was judged unacceptable because of LTO's lack of
scaling. Now that ThinLTO is there for any project that cares
about scaling and build-time performance in general, we can be
more aggressive with Monolithic LTO for projects that are either
small enough or willing to pay the extra compile time.

Aligning the pipeline with O2/O3 (and ThinLTO) makes pipeline
maintenance easier: making the pipeline evolve for ThinLTO will
benefit Monolithic LTO and vice versa. Also, any misoptimization
in one use case is likely to affect the other, and fixing it will
benefit both. Studying the difference in performance between LTO
and ThinLTO while doing this helped us last year to improve some
analyses, which in turn helped non-LTO builds.

Note: I haven't had time to benchmark this recently, so I am putting
it up for review so that everyone can run their own private benchmarks
and we can evaluate any identified deficiencies.
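For anyone reproducing this, the two pipelines under discussion are selected from the compiler driver. A minimal sketch of the invocations (file names are illustrative, and the choice of lld is an assumption; any LTO-capable linker works):

```shell
# Monolithic (regular) LTO: all bitcode is merged and optimized as one module.
clang -O2 -flto a.c b.c -fuse-ld=lld -o app

# ThinLTO: per-module summaries let the link-time optimization scale
# and run in parallel.
clang -O2 -flto=thin a.c b.c -fuse-ld=lld -o app
```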

Event Timeline

mehdi_amini created this revision. Jan 31 2017, 11:19 PM
eastig added a subscriber: eastig. Feb 1 2017, 1:56 AM
eastig added a comment. Feb 1 2017, 2:18 AM

Hi Mehdi,

I'll check if the patch has any impact on our benchmarks.

Thanks,
Evgeny Astigeevich
Senior Compiler Engineer
Compilation Tools
ARM

pcc added a comment. Feb 1 2017, 11:33 AM

For a start, this pipeline doesn't look right as it needs to include at least the CFI passes with the correct summary action.

But a higher level point is that it seems premature to do this until there is a proven migration path from regular to thin LTO for users of the CFI and whole-program devirtualization features, such as Chromium. This at least means that the features must be supported. For the migration path to be proven, at least one major user of the feature (i.e. Chromium) must have moved its official builds from regular to thin LTO.

We hope to have completed the migration by the end of the quarter, but that is, as always, optimistic.

In D29376#663665, @pcc wrote:

For a start, this pipeline doesn't look right as it needs to include at least the CFI passes with the correct summary action.

Yes, I noticed that while working on this, but I promised the others that I'd get a patch out so that we can test the performance and build-time impact.

But a higher level point is that it seems premature to do this until there is a proven migration path from regular to thin LTO for users of the CFI and whole-program devirtualization features, such as Chromium.

I disagree with holding back the whole LTO optimization strategy because of a single large project that uses the CFI features. If the extra build time is a real problem for Chromium and CFI, then a better strategy is to lower the optimization level during LTO to O1.

(O1 may have to be tweaked with an extra globalopt run when "PerformLTO" is set, though.)

pcc added a comment. Feb 1 2017, 12:05 PM

We already use regular LTO at opt level 1 -- this basically provides "dead stripping with support for CFI and devirtualization". That is essentially all we need from the LTO pipeline at the moment. Getting more out of LTO (cross-module inlining, etc.) would be nice to have, and is part of the motivation for switching to thin LTO.

Based on my reading of the code, the thin LTO opt level 1 pipeline does a lot more than the current regular LTO opt level 1 pipeline. Switching the pipeline like this would likely make link times unacceptably slow until we switch to thin LTO.

In D29376#663757, @pcc wrote:

We already use regular LTO at opt level 1 -- this basically provides "dead stripping with support for CFI and devirtualization". That is essentially all we need from the LTO pipeline at the moment.

So that's more like an O0 :)

Based on my reading of the code, the thin LTO opt level 1 pipeline does a lot more than the current regular LTO opt level 1 pipeline. Switching the pipeline like this would likely make link times unacceptably slow until we switch to thin LTO.

Unacceptably slow for Chromium is not the majority of use cases.
If we end up with Chromium being the only blocker to any progress for the other 99.9% of applications, you're likely going to have to supply your own flag "-mllvm -do-lto-optimizations-but-not-too-much".

pcc added a comment. Feb 1 2017, 12:54 PM
In D29376#663757, @pcc wrote:

We already use regular LTO at opt level 1 -- this basically provides "dead stripping with support for CFI and devirtualization". That is essentially all we need from the LTO pipeline at the moment.

So that's more like an O0 :)

No, O0 does not include globaldce and functionattrs.

Based on my reading of the code, the thin LTO opt level 1 pipeline does a lot more than the current regular LTO opt level 1 pipeline. Switching the pipeline like this would likely make link times unacceptably slow until we switch to thin LTO.

Unacceptably slow for Chromium is not the majority of use cases.
If we end up with Chromium being the only blocker to any progress for the other 99.9% of applications, you're likely going to have to supply your own flag "-mllvm -do-lto-optimizations-but-not-too-much".

It is not necessarily just Chromium, it is any user who just wants CFI/devirt without it being tied to the rest of the LTO pipeline. It is already possible to pick and choose those features at compile time, and I reckon that property should be carried through to the linker. LTO opt level 1 already allows users to do that; it roughly means "act like a regular linker with --gc-sections". I think we should keep that meaning at least until those users can move to thin LTO. I don't have a problem with changing the other opt levels, though (and in fact, I'd encourage it, as it would allow us to remove the existing regular LTO pipeline).
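The per-link opt level described here can already be requested from the linker command line. A hedged sketch of the flag spellings (these vary by linker; lld spells it --lto-O<N>, the gold plugin -plugin-opt=O<N>; file names are illustrative):

```shell
# Regular LTO at opt level 1 via lld: roughly "a regular linker with
# --gc-sections", plus CFI/devirtualization support.
clang -O2 -flto a.c b.c -fuse-ld=lld -Wl,--lto-O1 -o app

# The same request through the gold plugin.
clang -O2 -flto a.c b.c -fuse-ld=gold -Wl,-plugin-opt=O1 -o app
```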

In D29376#663788, @pcc wrote:
In D29376#663757, @pcc wrote:

We already use regular LTO at opt level 1 -- this basically provides "dead stripping with support for CFI and devirtualization". That is essentially all we need from the LTO pipeline at the moment.

So that's more like an O0 :)

No, O0 does not include globaldce and functionattrs.

I know, that's why I didn't write "that is exactly O0".

Based on my reading of the code, the thin LTO opt level 1 pipeline does a lot more than the current regular LTO opt level 1 pipeline. Switching the pipeline like this would likely make link times unacceptably slow until we switch to thin LTO.

Unacceptably slow for Chromium is not the majority of use cases.
If we end up with Chromium being the only blocker to any progress for the other 99.9% of applications, you're likely going to have to supply your own flag "-mllvm -do-lto-optimizations-but-not-too-much".

It is not necessarily just Chromium, it is any user who just wants CFI/devirt without it being tied to the rest of the LTO pipeline.

That's very legitimate, and "not being tied to the rest of the LTO pipeline" is exactly why I oppose blocking the LTO pipeline on such a use case. If such a use case needs to be "untied", a separate flag/flow can be provided for it.

LTO opt level 1 already allows users to do that; it roughly means "act like a regular linker with --gc-sections".

I'd be fine with having LTO O1 hardcoded to a special path that would be CFI+DCE, but then what is O0?

pcc added a comment. Feb 1 2017, 1:16 PM

It is not necessarily just Chromium, it is any user who just wants CFI/devirt without it being tied to the rest of the LTO pipeline.

That's very legitimate, and "not being tied to the rest of the LTO pipeline" is exactly why I oppose blocking the LTO pipeline on such a use case. If such a use case needs to be "untied", a separate flag/flow can be provided for it.

I agree that in the long term we should think about this more carefully. In fact right now in ThinLTO we have no way of untying these features from the rest of the pipeline. I think the long term plan should be:

  • figure out the right way to expose the "act like a regular linker with --gc-sections" feature
  • expose it to regular and thin LTO
  • remove the hardcoded regular LTO path you mention below

LTO opt level 1 already allows users to do that; it roughly means "act like a regular linker with --gc-sections".

I'd be fine with having LTO O1 hardcoded to a special path that would be CFI+DCE,

Works for me, thanks.

but then what is O0?

I think O0 should have basically the same meaning as in the compiler: the minimal pass pipeline required for correctness.

mehdi_amini updated this revision to Diff 86756. Feb 1 2017, 5:47 PM

Fixed a few discrepancies, added back the CFI/WPDevirt passes, and tried to hook this up in a way that makes sense.

At least the validation is passing now.

eastig added a comment. Feb 8 2017, 5:32 AM

Hi Mehdi,

I've got the benchmark results for Cortex-M4:

  • One benchmark failed because something was wrong with the generated executable.
  • The number of performance gains is 21. The maximum score boost is ~4x.
  • The number of performance losses is 15. The maximum score degradation is ~47x.

As you can see, the patch affects performance significantly and needs a detailed performance analysis.

We have other M-profile boards; I'll run the benchmarks on them as well.

Awesome! Thanks for running it, this is not surprising at all :)

Hi Mehdi,

I've got the benchmark results for Cortex-M7; they are better than for Cortex-M4:

  • The same benchmark failed because something was wrong with the generated executable.
  • The number of performance gains is 22. Five of them have a score boost > 4x; 17 show no change or a slight change. The maximum score boost is ~20x.
  • The number of performance losses is 23. Five of them have a score degradation between 1.2x and 2.5x; 18 show no change or a slight change. The maximum score degradation is ~2.5x.
davide edited edge metadata. Feb 10 2017, 9:21 AM

Hi Mehdi,

I've got the benchmark results for Cortex-M7; they are better than for Cortex-M4:

  • The same benchmark failed because something was wrong with the generated executable.

Can you please open a bug at llvm.org with a repro?

  • The number of performance gains is 22. Five of them have a score boost > 4x; 17 show no change or a slight change. The maximum score boost is ~20x.
  • The number of performance losses is 23. Five of them have a score degradation between 1.2x and 2.5x; 18 show no change or a slight change. The maximum score degradation is ~2.5x.

Hi Mehdi,

I've got the benchmark results for Cortex-M7; they are better than for Cortex-M4:

  • The same benchmark failed because something was wrong with the generated executable.
  • The number of performance gains is 22. Five of them have a score boost > 4x; 17 show no change or a slight change. The maximum score boost is ~20x.
  • The number of performance losses is 23. Five of them have a score degradation between 1.2x and 2.5x; 18 show no change or a slight change. The maximum score degradation is ~2.5x.

Thanks for the update. Good to know.

This all shows how much performance headroom we have ahead :)

I'm not sure when I'll have time to dig into the perf regressions I noticed on the llvm-testsuite. It'll take some time to get there!

(hopefully we'll improve the non-LTO performance at the same time...)

I took some time to run this on some games internally. The change is mostly performance-neutral: some titles are 1% faster, some 1% slower, but nothing glamorous.
(Please note these are "real-world" programs, not synthetic benchmarks.)
What worries me about this change is the increase in compile time. I noticed a 30%-50% increase in compile time (for some of these, the LTO time was already 10 minutes, so 50% more is not acceptable).
The time is spent in the usual suspects: GVN, InstCombine, inlining, etc. I'd love to hear others' thoughts, but I'm inclined to hold off on this until we have a better story for our compile time.
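To see where that link time goes, one option is LLVM's per-pass timers. A sketch for the lld path (the flag plumbing is an assumption and differs per linker; file names are illustrative):

```shell
# Report time spent in each LLVM pass during the LTO link step.
clang -O2 -flto a.c b.c -fuse-ld=lld -Wl,-mllvm,-time-passes -o app

# Per-TU compile-phase breakdown from the frontend.
clang -O2 -ftime-report -c a.c
```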

Can you please open a bug at llvm.org with a repro?

I am on holiday next week. I'll investigate the issue when I am back.

davide requested changes to this revision. Apr 25 2017, 1:28 PM

Can you please open a bug at llvm.org with a repro?

I am on holiday next week. I'll investigate the issue when I am back.

@eastig Did you get a chance to take a look?

This revision now requires changes to proceed. Apr 25 2017, 1:28 PM

Can you please open a bug at llvm.org with a repro?

I am on holiday next week. I'll investigate the issue when I am back.

@eastig Did you get a chance to take a look?

Hi Davide,

Thank you for the reminder.
Shame on me; I lost track of it.
I'll check with trunk whether the issue still exists.

Can you please open a bug at llvm.org with a repro?

I am on holiday next week. I'll investigate the issue when I am back.

@eastig Did you get a chance to take a look?

Hi Davide,

I have good news: there is no need for a bug.

I have looked at the benchmark failure. In fact, it is not a crash: the benchmark completed its run. The benchmark performs some checks of its own results, and those checks found the results suspicious, because the execution time was almost zero. It seems the benchmark is very sensitive to LTO optimizations.
I compared the binary files: the test code, a whole test loop, was completely removed. As the benchmark was run on a bare-metal board, the driver running the benchmark reported that the board returned an unexpected error.

Thanks,
Evgeny