Page MenuHomePhabricator

[VectorCombine] position pass after SLP in the optimization pipeline rather than before
ClosedPublic

Authored by spatel on May 19 2020, 1:26 PM.

Details

Summary

There are 2 known problem patterns shown in the test diffs here: vector horizontal ops (an x86 specialization) and vector reductions.
SLP has greater ability to match and fold those than vector-combine, so let SLP have first chance at that.
This is a quick fix while we continue to improve vector-combine and possibly canonicalize to reduction intrinsics.
In the longer term, we should improve matching of these patterns because if they were created in the "bad" forms shown here, then we would miss optimizing them.

I'm not sure what is happening with alias analysis on the addsub test. The old pass manager now shows an extra line for that, and we see an improvement that comes from SLP vectorizing a store. I don't know what's missing with the new pass manager to make that happen. Strangely, I can't reproduce the behavior if I compile from C++ with clang and invoke the new PM with "-fexperimental-new-pass-manager".

Diff Detail

Event Timeline

spatel created this revision.May 19 2020, 1:26 PM
This revision is now accepted and ready to land.May 21 2020, 6:53 PM
RKSimon added inline comments.May 22 2020, 4:59 AM
llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll
7–8

update comment?

spatel marked 2 inline comments as done.May 22 2020, 9:13 AM
spatel added inline comments.
llvm/test/Transforms/PhaseOrdering/X86/vector-reductions.ll
7–8

Right - will update on push.

spatel marked an inline comment as done.May 22 2020, 9:21 AM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic.May 22 2020, 12:15 PM

This changes caused a 0.25% compile-time regression. Looking at the pipeline test changes, this is probably because you do not preserve AAResultsWrapperPass inside VectorCombine.

This changes caused a 0.25% compile-time regression. Looking at the pipeline test changes, this is probably because you do not preserve AAResultsWrapperPass inside VectorCombine.

Aha...thanks for pointing out the fix:
rG024098ae5349

There's still something wrong with alias analysis in the pipeline because that addsub test is getting folded in the old PM, but not the new PM.

nikic added a comment.May 22 2020, 3:23 PM

@spatel Thanks for the quick fix! Unfortunately, this seems to be only part of the story. Preserving AA improved things, but only a bit.

As I don't see anything else here, this probably affects the amount of generated IR, though the final code-size changes aren't particularly large. So maybe nothing to be done.

nikic added a comment.May 23 2020, 1:30 AM

It turns out that the main impact of this change (both in terms of compile-time, and in terms of text size changes) is not the move of the VectorCombine pass, but the move of the EarlyCSE pass. If I leave EarlyCSE where it is and only move VectorCombine, the results is essentially noise. (Which doesn't tell us anything about which EarlyCSE placement produces better code...)

spatel added a subscriber: dmgreen.May 23 2020, 5:41 AM

It turns out that the main impact of this change (both in terms of compile-time, and in terms of text size changes) is not the move of the VectorCombine pass, but the move of the EarlyCSE pass. If I leave EarlyCSE where it is and only move VectorCombine, the results is essentially noise. (Which doesn't tell us anything about which EarlyCSE placement produces better code...)

I added the EarlyCSE cleanup as part of D75145. And it was noted there as causing perf regressions for ARM code (cc @dmgreen), so it was not ideal from the start.

I just commented out EarlyCSE locally from the pipeline, and we no longer need that on the motivating test from D75145. (Not exactly sure why - several things changed since then.)

But there is a regression with the new pass manager on the "add_aggregate_store" test that is changing here. Do you know why alias analysis is not working the same on that test for new and old PMs?

If we can fix that, then we can remove EarlyCSE from the pipeline with no immediate regressions (getting full test-suite/other results that agree with that would be ideal, but we could push the change and wait for fallout).

nikic added a comment.May 23 2020, 9:26 AM

@spatel Apparently you need to explicitly specify -aa-pipeline=default to get a default AA pipeline.

@spatel Apparently you need to explicitly specify -aa-pipeline=default to get a default AA pipeline.

Thanks for the hint! I added that to the regression test, so that lets us remove the extra -early-cse from the pipeline:
rG57bb4787d72f

Hopefully, that improves compile-time as expected.

I added the EarlyCSE cleanup as part of D75145. And it was noted there as causing perf regressions for ARM code (cc @dmgreen), so it was not ideal from the start.

Yeah, We did see some performance changes from this patch, and will do from 57bb4787d72f again I think. They were much larger for when we were running under lto than without. We are currently in the process of moving those benchmarks over to not run under lto, so I happened to have both. Without LTO the changes were all just +-2%, so not anything to worry about. It suggests something funny might be going on with LTO where we run the entire pass pipeline twice.

It feels quite common in the pass pipeline to want to write:

MPM.add(createLowerMatrixIntrinsicsPass());
MPM.add(createEarlyCSEPass(false)); // cleanup

(That's just an example I noticed. I've wanted similar things for loop unrolling in the past.) It's a shame to run the cleanup passes on all the code you compile, just because one fairly rare thing needed it. Perhaps it would make sense to make some of these things utilities as opposed to running them as passes, in order to target them at the code we know has changed and would benefit from it.

Perhaps it would make sense to make some of these things utilities as opposed to running them as passes, in order to target them at the code we know has changed and would benefit from it.

+1

bjope added a subscriber: bjope.May 25 2020, 8:08 AM

Just for info, downstream this caused some benchmark regressions for out OOT target.

Here is my analysis:

  1. We've got some additions in the LoopVectorizer that inserts some more guards for taking the vector/scalar path.
  2. Those branch conditions might be loop invariant, and might look the same for several loops in a nest.
  3. When running EarlyCSE after LoopVectorizer (and before CFGSimplification) those branch conditions are CSE:d, so we use the same condition in branches, not the same subexpressions.
  4. I got big diffs after CFGSimplification depending on if I remove EarlyCSE or not. And I guess the CFG simplification benefits from seeing that branches are using the same condition (it depends on code being CSE:d rather than comparing subexpressions).

I can mention that we use ExtraVectorizerPasses in that test, which runs LICM/CFGSimplification etc an extra time before SLP. And I haven't checked if we got the same problem with the regular CFGSimplification.

I'll probably just add an extra run of EarlyCSE among the ExtraVectorizerPasses to solve this downstream.

Given that our downstream additions in LoopVectorizer seem to be involved somehow, this could be more common scenario for us compared to the upstream code base. We could probably handcraft some lit-test to show this, but I guess that wouldn't justify adding back EarlyCSE in the pipe, if it isn't seen in benchmarks being executed on the upstream code base.

Just wanted to let you know about a potential scenario where EarlyCSE makes a difference here.

Thanks for the feedback. I filed https://bugs.llvm.org/show_bug.cgi?id=46065 for making a CSE utility.

nikic added a comment.Jun 13 2020, 1:29 AM

@bjope Tracing back the changes here, what happened is that originally there was an EarlyCSE run in ExtraVectorizerPasses, then it got moved into the main pipeline as part of the VectorCombine introduction, then it got moved after SLP in this patch and then it got dropped entirely afterwards. So the EarlyCSE run that was present in ExtraVectorizerPasses is now gone as a side-effect of this shuffling around. So I'd say, feel free to just add it back (it was the first pass in the https://github.com/llvm/llvm-project/blob/2dc664d578f0e9c8ea5975eed745e322fa77bffe/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp#L734 block) and restore the status quo of LLVM 10.

@bjope Tracing back the changes here, what happened is that originally there was an EarlyCSE run in ExtraVectorizerPasses, then it got moved into the main pipeline as part of the VectorCombine introduction, then it got moved after SLP in this patch and then it got dropped entirely afterwards. So the EarlyCSE run that was present in ExtraVectorizerPasses is now gone as a side-effect of this shuffling around. So I'd say, feel free to just add it back (it was the first pass in the https://github.com/llvm/llvm-project/blob/2dc664d578f0e9c8ea5975eed745e322fa77bffe/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp#L734 block) and restore the status quo of LLVM 10.

Thanks for tracking that down! Hopefully back to LLVM10 state with:
rG098e48a6a15

bjope added a comment.Jun 14 2020, 1:57 PM

@bjope Tracing back the changes here, what happened is that originally there was an EarlyCSE run in ExtraVectorizerPasses, then it got moved into the main pipeline as part of the VectorCombine introduction, then it got moved after SLP in this patch and then it got dropped entirely afterwards. So the EarlyCSE run that was present in ExtraVectorizerPasses is now gone as a side-effect of this shuffling around. So I'd say, feel free to just add it back (it was the first pass in the https://github.com/llvm/llvm-project/blob/2dc664d578f0e9c8ea5975eed745e322fa77bffe/llvm/lib/Transforms/IPO/PassManagerBuilder.cpp#L734 block) and restore the status quo of LLVM 10.

Thanks for tracking that down! Hopefully back to LLVM10 state with:
rG098e48a6a15

Thanks! Both for the detective work and the patch.