
Rework the LTO Pipeline, aligning closer to the O2/O3 pipeline.
Needs Review, Public

Authored by mehdi_amini on Oct 5 2015, 11:33 AM.



It seems that the LTO pipeline hasn't received as much attention as the
regular O2/O3 pipeline over the last few years. This patch attempts to
refactor the pass pipeline initialization to align LTO with the
regular pipeline.

The proposed change exits the compile phase before the inliner would
run. The link phase then runs the usual pipeline,
extended with some specific passes that leverage the additional
knowledge we have from full program visibility.

The LTO addition over the regular pipeline consists mainly of re-running
the peephole passes a second time after the inliner is done, preceded by a
run of GlobalDCE/GlobalOpt.
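
The ordering described above can be sketched as a toy model. This is only an illustration of the intended phase split, not the patch itself: the pass-group labels (`early-cleanup`, `peephole`, `late-opts`) and both function names are hypothetical placeholders, and the exact insertion point of the LTO-specific passes is an assumption based on the description.

```python
# Toy model of the pass ordering described in the summary. The pass-group
# names are illustrative labels, not LLVM's real pass manager API.

# Hypothetical placeholders for groups of passes in the regular O2/O3 pipeline.
EARLY_CLEANUP = ["early-cleanup"]  # whatever runs before the inliner
INLINER = ["inliner"]
PEEPHOLE = ["peephole"]            # the function-level simplification passes
LATE_OPTS = ["late-opts"]          # whatever runs after them at O2/O3


def compile_phase_pipeline():
    """Per-translation-unit phase: stop before the inliner would run."""
    return list(EARLY_CLEANUP)


def link_phase_pipeline():
    """Link phase: the usual pipeline plus the LTO-specific additions."""
    regular = EARLY_CLEANUP + INLINER + PEEPHOLE + LATE_OPTS
    # Assumed placement: once the inliner and the first peephole round are
    # done, run GlobalDCE/GlobalOpt and then repeat the peephole passes.
    lto_extra = ["globaldce", "globalopt"] + PEEPHOLE
    cut = regular.index("inliner") + 1 + len(PEEPHOLE)
    return regular[:cut] + lto_extra + regular[cut:]


if __name__ == "__main__":
    print("compile:", compile_phase_pipeline())
    print("link:   ", link_phase_pipeline())
```

In this model the compile phase never reaches the inliner, so every inlining decision is made at link time with the whole program visible.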

The measurements on the public test suite as well as on our internal
suite show an overall net improvement, with some regressions that I
am still tracking. Consider this patch a work in progress that I
am submitting now for feedback.

Event Timeline

mehdi_amini retitled this revision to "Rework the LTO Pipeline, aligning closer to the O2/O3 pipeline."
mehdi_amini updated this object.
mehdi_amini added a reviewer: chandlerc.
mehdi_amini added subscribers: llvm-commits, dexonsmith.

Minor update: moving EliminateAvailableExternallyPass before GlobalOpt.

tejohnson added inline comments.

Why not perform the first round of inlining and some of the other optimizations (like the peepholes) called below when preparing for LTO? I would think this would result in smaller intermediate .o bitcode sizes and more efficient LTO.

I'm not sure why inlining should be able to end up with smaller files, except in the case where an "internal linkage" function can end up with no call site. Do you see any other cases?
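
To make the "internal linkage" case concrete, here is a minimal sketch of the reasoning, assuming a toy module representation. The dictionary layout and the function name `removable_after_inlining` are hypothetical and do not correspond to any LLVM API.

```python
# Toy model: a function with internal linkage can be dropped from the module
# once it has no remaining call sites; an externally visible one cannot,
# because other modules may still call it. All names here are hypothetical.

def removable_after_inlining(functions, inlined_callees):
    """Return the names of functions that can be deleted after inlining.

    functions: dict mapping name -> {"linkage": "internal" | "external",
                                     "callers": set of caller names}
    inlined_callees: names whose every call site was inlined away.
    """
    dead = set()
    for name, info in functions.items():
        no_call_sites = name in inlined_callees or not info["callers"]
        if info["linkage"] == "internal" and no_call_sites:
            dead.add(name)
    return dead


module = {
    "helper": {"linkage": "internal", "callers": {"api"}},
    "api": {"linkage": "external", "callers": set()},
}
```

Here `helper` becomes removable only once all of its call sites have been inlined, while `api` must stay in the file regardless because it is externally visible.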

The inliner may make different decisions with more of the call graph available. Assuming the inliner heuristics are well tuned, running on a partial call graph shouldn't produce better results; on the contrary, it only limits the inliner's freedom during LTO.

If your only concern is about compile time, it is fairly easy to test and I can launch a run of our compile-time test-suite.

Teresa (and others), can you run this on your internal benchmark suite to see the performance impact?

Unfortunately, we haven't had a lot of success running LTO on our
internal benchmarks in the past due to the scaling issue.

It's great that it is giving you better performance - I was just
surprised that it was beneficial to skip all of those downstream
optimizations in the initial compile step. Is the better performance
coming from skipping the initial inline? Is there no benefit to doing
the peepholes, etc., in the initial compile?

Right now my view is that if I get a performance improvement by running the inliner and the "peephole" passes twice, then it is a bug. If it is not a bug, it means that the O3 pipeline is affected as well and we might want to run them twice there too. Does that make sense?

I ran the LLVM benchmark suite plus some internal benchmarks with an early return placed either before or after the inliner+peephole phase. Stopping before the inliner during the compile phase ends up with 13 regressions and 20 improvements, compared to running the inliner during the compile phase. I sent you some more details by email.

Hi Mehdi,

Thanks for sharing the results. As you note there are swings in both
directions, but the improvements outweigh the regressions.

tejohnson added inline comments.Oct 6 2015, 7:09 AM

Can this be removed? I think it is dead code now.


This can be removed - it only needs to be called once and now the LTO pipeline calls it via addLTOOptimizationPasses.

mehdi_amini added inline comments.Oct 6 2015, 9:21 AM

Yes, the patch is not ready for review. I posted it as a work in progress to get feedback on the approach, hoping that you (Google) and others may have other benchmarks to run.

davide added a subscriber: davide.Jan 16 2017, 10:53 AM