Page MenuHomePhabricator

[PassManager] First Pass implementation at -O1 pass pipeline
ClosedPublic

Authored by echristo on Jul 29 2019, 11:11 AM.

Details

Summary

As a follow-up to my initial mail to llvm-dev here's a first pass at the O1 described there.

Some rough internal testing using a bootstrap and test of clang has shown a combined build and test time for clang with nearly equivalent performance to O3 and quite a speedup over O0 - it's currently a little slower than the existing O1, likely due to the clang+llvm testsuite use of the same binaries many times rather than a few for individual tests. Build time is a bit better. For a larger build and smaller test time (think a couple of unittests), this is a bit better than either O3, O0, or O1. Overall binary size drops significantly compared to O0.

This change doesn't include any change to move from selection dag to fast isel and that will come with other numbers that should help inform that decision. I also haven't done any real debuggability studies with this pipeline yet, I wanted to get the initial start done so that people could see it and we could start tweaking after.

Test updates: Outside of the newpm tests most of the updates are coming from either optimization passes not run anymore (and without a compelling argument at the moment) that were largely used for canonicalization in clang.

Original post:

http://lists.llvm.org/pipermail/llvm-dev/2019-April/131494.html

Diff Detail

Event Timeline

echristo created this revision.Jul 29 2019, 11:11 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJul 29 2019, 11:11 AM
phosek added a subscriber: phosek.Aug 2 2019, 5:25 PM

Thanks for starting on this. Can you go ahead and replace the sroa calls with mem2reg calls for O1 and then see what that does to the performance? That strikes me as a major change, but certainly one that potentially makes sense, so I'd rather we go ahead and test it now before we make decisions about other adjustments.

FWIW, I thought that we might run InstCombine less often (or maybe replace it with InstSimplify, in some places). Did you try that?

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
356

By definition, this loses information from the call stack, no?

427

Yes, I'd fall back to using regular DCE.

One high level point that is at least worth clarifying, and maybe others will want to suggest a different approach:

The overall approach here is to have as small of a difference between the O1 and O2 pipelines as possible.

An alternative approach that we could take would be to design a focused O1 pipeline without regard to how much it diverges from the O2 pipeline.

Which approach is used somewhat depends on the goals. I feel like the goal here is to get as close to the level of optimization at O2 as possible without losing compile time or coherent backtraces for test / assertion failures. For that goal, the approach taken makes sense. But it seems important to clarify that goal as otherwise I think we'd want to go in very different directions.

Thanks for starting on this. Can you go ahead and replace the sroa calls with mem2reg calls for O1 and then see what that does to the performance? That strikes me as a major change, but certainly one that potentially makes sense, so I'd rather we go ahead and test it now before we make decisions about other adjustments.

I really think we need mem2reg at least at -O1... In fact, I really think we need SROA at O1. If it is actually a compile time problem, I'd like to fix that in SROA. I don't really expect it to be though.

FWIW, I thought that we might run InstCombine less often (or maybe replace it with InstSimplify, in some places). Did you try that?

I think the biggest thing to do would be to avoid repeated runs of instcombine over the same code. I suspect we want at least one run after inliner and inside the CGSCC walk for canonicalization. But it'd be great to limit it to exactly one or maaaaybe one before and one after the loop pipeline.

llvm/lib/Passes/PassBuilder.cpp
412–421

I think you can merge all of these?

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
356

Yeah, I'd have really expected this to be skipped.

427

+1

The goal from the original email was:

The design goal is to rewrite the O1
optimization and code generation pipeline to include the set of
optimizations that minimizes build and test time while retaining our
ability to debug.

"Retaining our ability to debug" is a more constraining goal than Chandler's "coherent backtraces for test / assertion failures." A good debugging experience comes from good variable-location information, and minimal reordering of instructions. This is different from "don't mess with my stack frames" which lets you do pretty much anything that doesn't involve a call.

Without presuming to speak for Eric, I don't think there was an explicit goal to make O1 look like a stripped-down O2? But certain pass orderings make sense, and so where the same passes occur, the same ordering would be pretty likely. So, building an O1 pipeline by erasing some stuff from the O2 sequence makes some sense, because it's the practical thing and not because it's a goal to do it that way.

As for specific pass choices: Based on the data from Greg Bedwell's lightning talk last year (https://llvm.org/devmtg/2018-04/talks.html#Lightning_11) the three worst offenders for debuggability were LICM, InstCombine, and SROA. It looks like this patch does exclude SROA but not the others? Or at least not all cases of LICM.

I'd suspect that SROA's main problem is that we don't do a good job of tracking broken-up fragments (the original SROA rewrite didn't even try, IIRC), and if the pass has a good compile-time versus run-time benefit, it's worth investing in making that work better.
LICM just inherently interferes with smooth debugging; it's hard to do anything useful with loops that remains nicely debuggable.
I don't know what InstCombine's problem is.

The goal from the original email was:

The design goal is to rewrite the O1
optimization and code generation pipeline to include the set of
optimizations that minimizes build and test time while retaining our
ability to debug.

"Retaining our ability to debug" is a more constraining goal than Chandler's "coherent backtraces for test / assertion failures." A good debugging experience comes from good variable-location information, and minimal reordering of instructions. This is different from "don't mess with my stack frames" which lets you do pretty much anything that doesn't involve a call.

Without presuming to speak for Eric, I don't think there was an explicit goal to make O1 look like a stripped-down O2? But certain pass orderings make sense, and so where the same passes occur, the same ordering would be pretty likely. So, building an O1 pipeline by erasing some stuff from the O2 sequence makes some sense, because it's the practical thing and not because it's a goal to do it that way.

Not an explicit goal of a stripped down O2, but honestly probably not too far off.

(further below)...

As for specific pass choices: Based on the data from Greg Bedwell's lightning talk last year (https://llvm.org/devmtg/2018-04/talks.html#Lightning_11) the three worst offenders for debuggability were LICM, InstCombine, and SROA. It looks like this patch does exclude SROA but not the others? Or at least not all cases of LICM.

I'd suspect that SROA's main problem is that we don't do a good job of tracking broken-up fragments (the original SROA rewrite didn't even try, IIRC), and if the pass has a good compile-time versus run-time benefit, it's worth investing in making that work better.
LICM just inherently interferes with smooth debugging; it's hard to do anything useful with loops that remains nicely debuggable.
I don't know what InstCombine's problem is.

Smooth is going to be somewhat subjective and there's some work in instcombine (rnk took a stab at it and is currently reviewing a pretty good patch I think in this area). SROA just needs some tracking work if there are still problems.

Now LICM - this is one of those I've gone back and forth on and I'd like to see what debugging and performance look like with and without. I like the general perspective of "remove the abstraction penalties" as an O1/Og sort of thing. That said, ultimately something that people are able to use to debug their code is the ultimate goal - it won't be 100%, but as good as we can make it in the general case.

At any rate, this is just a first pass through both the individual passes are going to need work for debugging as is the particular pipeline.

-eric

Thanks for starting on this. Can you go ahead and replace the sroa calls with mem2reg calls for O1 and then see what that does to the performance? That strikes me as a major change, but certainly one that potentially makes sense, so I'd rather we go ahead and test it now before we make decisions about other adjustments.

I'll give it a shot. I think we'll want it in the long run, but happy to run it through the performance blender just to get an idea of what we're getting for our complexity.

FWIW, I thought that we might run InstCombine less often (or maybe replace it with InstSimplify, in some places). Did you try that?

I haven't. It's one part of pass ordering that we'll want to look at, but I figured some optimizing here could happen later. Happy to try a few things though.

echristo updated this revision to Diff 228338.Nov 7 2019, 4:46 PM
echristo marked an inline comment as done.

I've gone ahead and enabled SROA here. In the testing I've done so far it's helped execute time quite a bit and compile time/object size as well. It'll be really good for use with the trivial auto var initialization option also. SROA is a bit of a worry right now for debugging, but it's an area that's been improved upon significantly and I'm not worried about it getting a lot better.

Future plans here are likely going to be moving it to a more separate pass pipeline so I can pull out things like superfluous instcombines etc while continuing to do incremental performance improvements. In addition, I'm going to do experiments with enabling fast-isel at O1 so we can look at the compile time and performance impact of individual changes there - and hopefully some debugging analysis in the near term.

echristo updated this revision to Diff 228339.Nov 7 2019, 4:48 PM

Update to remove comments around SROA addition.

MaskRay added a subscriber: MaskRay.Nov 7 2019, 4:55 PM
jmorse added a subscriber: jmorse.Nov 8 2019, 3:31 AM
ormris added a subscriber: ormris.Nov 8 2019, 10:36 AM

LGTM to land this and iterate, but you likely want someone else to confirm :)

llvm/test/Other/new-pm-defaults.ll
230

Just a drive-by idea: you could have simplified the changes here with a new prefix: "CHECK-O23sz" that you could add to the non-O1 invocations.

chandlerc accepted this revision.Sun, Nov 24, 12:44 AM

Minor nits around redundant predicates for SROA. With thouse fixed, LGTM.

I'd really love to find a way to make TCO debuggable so that we don't lose that. I'm particularly worried about code that relies on it to not run out of stack. Not sure what the best thing to do here is though. Anyways, not relevant for this iteration. I mostly feel bad for a potential future re-churn of all the tests. ;]

llvm/lib/Passes/PassBuilder.cpp
397

We know O0 isn't used here, so this should be a no-op.

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
263

We early exit at O0 above, so this is a no-op.

293

We only reach here if OptLevel > 0 so this should be redundant?

324

This doesn't have the assert, but I believe this is only used above O0 as well. Maybe just add the assert?

llvm/test/Other/new-pm-defaults.ll
230

Good idea!

This revision is now accepted and ready to land.Sun, Nov 24, 12:44 AM
This revision was automatically updated to reflect the committed changes.
omjavaid reopened this revision.Mon, Nov 25, 8:42 PM
omjavaid added a subscriber: omjavaid.

Re-opening this because I have reverted the commit due to failures seen on LLDB AArch64 buildbot with this commit.

This revision is now accepted and ready to land.Mon, Nov 25, 8:42 PM
omjavaid requested changes to this revision.Mon, Nov 25, 8:42 PM
omjavaid added a reviewer: omjavaid.
This revision now requires changes to proceed.Mon, Nov 25, 8:42 PM
This revision was not accepted when it landed; it landed in state Needs Revision.Tue, Nov 26, 8:32 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptTue, Nov 26, 8:32 PM