This is an archive of the discontinued LLVM Phabricator instance.

[PassManager] unify vector passes between regular and LTO pipelines
Changes Planned · Public

Authored by spatel on May 6 2021, 9:53 AM.

Details

Summary

The regular and LTO pipelines diverge starting from the loop vectorizer and ending with instcombine cleanup, but I don't know why that should happen. This patch unifies the passes at that stage of the pipelines. The old pass manager is updated to keep it synchronized.

To prevent unintended divergence in the future, I created a helper function so changes to this part of the pipeline will remain identical between regular and LTO. We can add an isLTO flag if we really want them to be different.

The difference for the regular pipeline is that we unroll directly after the loop vectorizer instead of waiting until after SLP/VectorCombine. That eliminates the need for one stage of instcombine. This reduced compile-time by about 2.7%:
https://llvm-compile-time-tracker.com/?config=LegacyPM-O3&stat=instructions&remote=rotateright

The difference for the LTO pipeline is that we inherit a run of LoopLoadElimination and drop SCCP, InstCombine, BDCE, and AlignmentFromAssumptions. This reduced compile-time by about 1.8%:
https://llvm-compile-time-tracker.com/?config=LegacyPM-ReleaseLTO-g&stat=instructions&remote=rotateright

(You can see that the results for NewPM are similar, but I wasn't sure how to squash the intermediate experiments to show that directly in the tables.)

We may see perf regressions from these changes, but then we can add/move passes to recover and add tests to verify/document that the changes are intentional. Right now, we only have tests that show less unrolling for x86, but that doesn't actually seem like a bad thing to me.

These pipeline changes were suggested by the discussion in D100802 (but this patch doesn't change the LICM difference).

Diff Detail

Event Timeline

spatel created this revision.May 6 2021, 9:53 AM
spatel requested review of this revision.May 6 2021, 9:53 AM
Herald added a project: Restricted Project.May 6 2021, 9:53 AM

Bikeshedding: I would prefer this to be two patches:

  1. a fully NFC patch to introduce addVectorPasses() with a bool IsLTO param (sketched below)
  2. the reshuffling
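
For reference, a minimal sketch of what step 1 could look like (an illustrative subset of passes, not the exact lists from PassBuilder.cpp / PassManagerBuilder.cpp; signatures simplified):

  #include "llvm/IR/PassManager.h"
  #include "llvm/Transforms/InstCombine/InstCombine.h"
  #include "llvm/Transforms/Scalar/LoopUnrollPass.h"
  #include "llvm/Transforms/Vectorize/LoopVectorize.h"
  #include "llvm/Transforms/Vectorize/SLPVectorizer.h"
  #include "llvm/Transforms/Vectorize/VectorCombine.h"
  using namespace llvm;

  // Step 1 (NFC): one helper; the existing divergence hides behind the flag.
  static void addVectorPasses(FunctionPassManager &FPM, bool IsLTO) {
    FPM.addPass(LoopVectorizePass());
    if (IsLTO) {
      // Existing LTO order: runtime-unroll right after LV, before SLP.
      FPM.addPass(LoopUnrollPass());
    }
    FPM.addPass(SLPVectorizerPass());
    FPM.addPass(VectorCombinePass());
    if (!IsLTO) {
      // Existing regular order: unroll after SLP/VectorCombine, then clean up.
      FPM.addPass(LoopUnrollPass());
      FPM.addPass(InstCombinePass());
    }
  }

Step 2 would then shrink the flag's role by moving the regular pipeline to the LTO-style early unroll.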

A not-very-helpful rant:

I think it is good that we Unroll after LV,
but I'm uncertain about the relative ordering of SLP and Unroll.
One would guess that Unroll might introduce extra opportunities for SLP,
but then proper LoopVectorize should have caught them already.

On the other hand, as the comment mentions:

// Unroll small loops to hide loop backedge latency and saturate any parallel
// execution resources of an out-of-order processor. We also then need to
// clean up redundancies and loop invariant code.

so if SLP vectorized the code, it may no longer be saturated, and unrolling may be beneficial.
(So I would think we should keep Unroll at the end.)

On the other hand, I'm kinda surprised that we don't do unrolling before LV,
specifically so that it would peel some conditions,
and thus simplify in-loop code.

I'm also unsure about LoopLoadElimination - how exactly does it help?
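
(What I believe it does nominally, as a hypothetical example: it forwards a value stored on the previous iteration to a load in the current one, i.e. it removes a loop-carried store-to-load dependence at distance one. What I'd like to understand is how often that actually fires.)

  // LoopLoadElimination can replace the load of A[i - 1] with the value
  // stored to A[i] on the previous iteration, carried in a PHI (guarded by
  // a runtime alias check between A and B if aliasing is unknown).
  void f(int *A, const int *B, int N) {
    for (int i = 1; i < N; ++i)
      A[i] = A[i - 1] + B[i];
  }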

RKSimon added subscribers: fhahn, RKSimon.

Adding @fhahn who might be able to give some insight on LV + unrolling

spatel updated this revision to Diff 343481.May 6 2021, 12:57 PM

Patch updated:
Split the NFC refactoring part off and committed it (fefcb1f878c2).
This is the same set of changes that were proposed in the previous revision (but much easier to see now); the test diffs are identical.

spatel added a comment.May 6 2021, 1:08 PM

One would guess that Unroll might introduce extra opportunities for SLP,
but then proper LoopVectorize should have caught them already.

Yes, I think we have gone back-and-forth on that, and that might explain why LTO is different from regular.

I'm also unsure about LoopLoadElimination - how exactly does it help?

I don't know. I'm leaning toward reduction + rebuild-with-justification as you can see, but I missed that diff initially! I can remove it too and see if that causes any tests to wiggle.
OTOH, this could end up being a multi-month slog as we get regression complaints. So it's probably better to surgically remove one pass at a time until we reach something close to what this patch is currently showing.

Right now, we only have tests that show less unrolling for x86, but that doesn't actually seem like a bad thing to me.

The phase ordering tests in LLVM have never been very extensive. LLVM doesn't have a lot of end-to-end tests in-tree, always leaving that to C tests from the llvm-test-suite or other benchmark suites.

Without a lot of evidence that this is better across a range of architectures, I would recommend trying to be conservative, taking things a step at a time.
We have one test where this patch would make it 88% worse. That might be a pathological case, but no one is going to put up with one tenth of the speed :-)

llvm/lib/Passes/PassBuilder.cpp
1199

I'm not sure what this comment is trying to say exactly. I think it's coming from somewhere that is very old and out of date now.

My very high-level understanding of the pass pipeline, at least for non-LTO in terms of loops, is that we:

  • Clean up clang code: code simplification, inlining, DSE, GVN, all that goodness.
  • Run loop optimizations like licm, idiom recognition, and loop deletion, including _full_ unrolling.
  • Other optimizations.
  • Run Vectorizer
  • Run SLP
  • _Runtime_ Unroll loops.

There are some extra simplifications that need to happen in between too. The last unrolling, especially on smaller in-order cores, has nothing really to do with vectorization. It's done near the end of the pipeline because runtime unrolling isn't expected to help any other pass, and up to that point we have not done any runtime unrolling.
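
To illustrate the full-vs-runtime distinction with a hypothetical example:

  int fullUnrollCandidate(const int *A) {
    int S = 0;
    for (int i = 0; i < 4; ++i) // constant trip count: _full_ unrolling can
      S += A[i];                // delete the loop early in the pipeline
    return S;
  }

  int runtimeUnrollCandidate(const int *A, int N) {
    int S = 0;
    for (int i = 0; i < N; ++i) // unknown trip count: only _runtime_ unrolling
      S += A[i];                // applies, and that waits until the end
    return S;
  }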

Without a lot of evidence that this is better across a range of architectures

Or... I should have said: good evidence on one architecture and no evidence of it making others worse. I wouldn't expect you to run on architectures you don't have access to, of course.

nikic added a comment.May 6 2021, 2:39 PM

The difference for the regular pipeline is that we unroll directly after the loop vectorizer instead of waiting until after SLP/VectorCombine. That eliminates the need for one stage of instcombine. This reduced compile-time by about 2.7%:

This seems to be mostly due to less code size, which is most likely caused by less unrolling. It's not really obvious to me why this ordering change results in such a large difference. While I would love LLVM to be less gratuitous when it comes to runtime unrolling (especially of vectorized code), I strongly suspect that unrolling is being prevented here for the wrong reasons. Maybe the IR coming out of LoopVectorize is in some form that SCEV/LoopUnroll doesn't understand, and other passes are needed to clean it up first, or something along those lines.

I'm seeing at least one +20% run-time regression with this. (NewPM, no LTO)


Looks like SLP no longer vectorizes the code there:

statistic name             baseline  proposed  Δ    %        |%|
SLP.NumVectorInstructions  31        3         -28  -90.32%  90.32%

spatel added a comment.May 7 2021, 4:12 AM

Without a lot of evidence that this is better across a range of architectures, I would recommend trying to be conservative, taking things a step at a time.
We have one test where this patch would make it 88% worse. That might be a pathological case, but no one is going to put up with one tenth of the speed :-)

Agreed - I should try to move the LTO variations to be more like the regular pipeline in small steps.
I'm curious if we can already see the big slowdown when compiling with "-flto" then?
Similarly for the rawspeed benchmark shown by @lebedev.ri - does the reduction in SLP vectorization show up when compiling with -flto independently of this patch?

llvm/lib/Passes/PassBuilder.cpp
1199

That sounds right. I see the call to full unrolling at line 606, inside buildO1FunctionSimplificationPipeline().

Matt added a subscriber: Matt.May 7 2021, 8:33 AM

Without a lot of evidence that this is better across a range of architectures, I would recommend trying to be conservative, taking things a step at a time.
We have one test where this patch would make it 88% worse. That might be a pathological case, but no one is going to put up with one tenth of the speed :-)

Agreed - I should try to move the LTO variations to be more like the regular pipeline in small steps.
I'm curious if we can already see the big slowdown when compiling with "-flto" then?

Similarly for the rawspeed benchmark shown by @lebedev.ri - does the reduction in SLP vectorization show up when compiling with -flto independently of this patch?

Eh, I would love to tell :) As I've just discovered, apparently some weird miscompile was introduced post-clang-12,
so I can't run the benchmark. I guess I'll have to bisect it first.
I cannot tell if it is in the X86 backend or in a middle-end pass.

vzakhari added inline comments.May 7 2021, 4:42 PM
llvm/lib/Passes/PassBuilder.cpp
1193–1275

The code did not look like this at the place where https://reviews.llvm.org/rGfefcb1f878c2dad435af604955661ca02a5302de inserted a call to addVectorPasses(... , /* IsLTO */ true); it was only .hoistCommonInsts(true). Was this an expected change in behavior?
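
If I reconstruct it correctly (a sketch, not the exact old diff), the pre-refactor code at this spot was a single SimplifyCFG run with common-instruction hoisting enabled:

  // FPM stands in for the enclosing FunctionPassManager in the LTO pipeline.
  FPM.addPass(SimplifyCFGPass(SimplifyCFGOptions().hoistCommonInsts(true)));

so swapping in the whole addVectorPasses() sequence is more than a refactor.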

vzakhari added inline comments.May 7 2021, 6:05 PM
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
1092

There is another call to addExtensionsToPM(EP_Peephole) in addVectorPasses - is this expected?
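
For background, this is the legacy extension mechanism being duplicated (createMyPeepholePass is a hypothetical plugin pass):

  // Each addExtensionsToPM(EP_Peephole, PM) call replays every callback
  // registered for that extension point, so running it twice means plugin
  // passes like this one get added twice.
  PassManagerBuilder PMB;
  PMB.addExtension(PassManagerBuilder::EP_Peephole,
                   [](const PassManagerBuilder &Builder,
                      legacy::PassManagerBase &PM) {
                     PM.add(createMyPeepholePass());
                   });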

spatel added inline comments.May 8 2021, 5:43 AM
llvm/lib/Passes/PassBuilder.cpp
1193–1275

No, that was a mistake. Thank you for spotting it!
AFAIK, there's no way to show that difference in a pass manager unit test, so that's why it slipped through.
Did you notice that by inspection or by a benchmark regression? If the latter, it would be great if we could reduce a test for it.
Let me know if I should revert or update the earlier commit.

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
1092

No - that's another mistake that slipped by the unit tests. I can revert and/or fix. Please let me know which is preferred.

Without a lot of evidence that this is better across a range of architectures, I would recommend trying to be conservative, taking things a step at a time.
We have one test where this patch would make it 88% worse. That might be a pathological case, but no one is going to put up with one tenth of the speed :-)

Agreed - I should try to move the LTO variations to be more like the regular pipeline in small steps.
I'm curious if we can already see the big slowdown when compiling with "-flto" then?

Similarly for the rawspeed benchmark shown by @lebedev.ri - does the reduction in SLP vectorization show up when compiling with -flto independently of this patch?

Eh, I would love to tell :) As I've just discovered, apparently some weird miscompile was introduced post-clang-12,
so I can't run the benchmark. I guess I'll have to bisect it first.
I cannot tell if it is in the X86 backend or in a middle-end pass.

Ok, dealt with that in rG1acd9a1a29ac30044ecefb6613485d5d168f66ca.

As far as i can tell, on that codepath, current ThinLTO build does not exhibit the same slowdown as does the plain build with this patch.

As far as i can tell, on that codepath, current ThinLTO build does not exhibit the same slowdown as does the plain build with this patch.

AFAICT, ThinLTO is running the same set of passes as the default pipeline during this part of the compile. I.e., it's only full LTO that has diffs like LoopUnroll before SLP, etc.

llvm/lib/Passes/PassBuilder.cpp
1193–1275

I went ahead and reverted the earlier change since it was clearly not NFC as advertised.

vzakhari added inline comments.May 10 2021, 8:49 AM
llvm/lib/Passes/PassBuilder.cpp
1193–1275

I spotted it just during an inspection of a downstream conflict. Sorry for the late reply, I do not think you had to revert it :)

spatel added inline comments.May 10 2021, 9:00 AM
llvm/lib/Passes/PassBuilder.cpp
1193–1275

No problem - this is going to need the smallest changes per step to ensure I'm not breaking things. :)
Thank you for reviewing the diffs!

spatel planned changes to this revision.Jun 8 2021, 5:42 AM

Clearly, we're going to need to split this up, but it's not my highest priority, so marking for revision.

mnadeem added a subscriber: mnadeem.Oct 3 2021, 3:34 PM

I was looking at the LICM fdiv hoisting issue when I came across this patch, but it seems to be stuck.

So I was wondering whether, since we now have a combined addVectorPasses() function, a smaller incremental change like https://reviews.llvm.org/D111032 would make sense to fix the hoisting issue.
It includes the change from D100802 and has one more change relevant to LICM/instcombine ordering before the vector/unrolling passes are run.
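
For reference, my understanding of the fdiv case (a hypothetical example; the exact motivating code is in D100802/D111032): with a loop-invariant divisor and fast-math, the per-iteration division can become a multiply by a reciprocal hoisted out of the loop, but whether that happens depends on the LICM/instcombine ordering.

  void scale(float *A, int N, float D) {
    // With fast-math, this can be rewritten as:
    //   float R = 1.0f / D;    // hoisted: one divide outside the loop
    //   ... A[i] *= R;         // cheap multiply per iteration
    for (int i = 0; i < N; ++i)
      A[i] /= D;
  }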

lebedev.ri resigned from this revision.Jan 12 2023, 4:49 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.