This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] new IR transform pass for partial vector ops
ClosedPublic

Authored by spatel on Jan 27 2020, 7:49 AM.

Details

Summary

We have several bug reports that could be characterized as "reducing scalarization", and this topic was also raised on llvm-dev recently:
http://lists.llvm.org/pipermail/llvm-dev/2020-January/138157.html
...so I'm proposing that we deal with these patterns in a new, lightweight IR vector pass that runs before/after other vectorization passes.

There are 4 alternate options that I can think of to deal with this kind of problem (and we've seen various attempts at all of these), but they all have flaws:

  1. InstCombine - can't happen without TTI, but we don't want target-specific folds there.
  2. SDAG - too late to assist other vectorization passes; TLI is not equipped for these kind of cost queries; limited to a single basic block.
  3. CGP - too late to assist other vectorization passes; would need to re-implement basic cleanups like CSE/instcombine.
  4. SLP - doesn't fit with existing transforms; limited to a single basic block.

This initial patch/transform is based on existing code in AggressiveInstCombine: we walk backwards through the function looking for a pattern match. But we diverge from that cost-independent IR canonicalization pass by using TTI to decide if the vector alternative is profitable.

We probably have at least 10 similar bug reports/patterns (binops, constants, inserts, cheap shuffles, etc) that would fit in this pass as follow-up enhancements. It's possible that we could iterate on a worklist to fix-point like InstCombine does, but it's safer to start with a most basic case and evolve from there, so I didn't try to do anything fancy here.

Diff Detail

Event Timeline

spatel created this revision.Jan 27 2020, 7:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2020, 7:50 AM

Thanks for working on this @spatel - I'm interested to see how far we can take cost driven combines (without conflicting with canonicalizations in other passes).

You mention that you want this to begin as a relatively safe set of combines, but it's probably necessary to get some idea of where you see this going and the range of optimizations it could handle (for instance - are memory ops being considered......)?

Thanks for working on this @spatel - I'm interested to see how far we can take cost driven combines (without conflicting with canonicalizations in other passes).

You mention that you want this to begin as a relatively safe set of combines, but it's probably necessary to get some idea of where you see this going and the range of optimizations it could handle (for instance - are memory ops being considered......)?

Yes, I think we could handle some load/store patterns. Note that there is already a LoadStoreVectorizer pass for IR, but it's currently only used for GPU targets as a pre-codegen cleanup. I haven't looked in there to see if it can be extended to solve the problems we've seen.

Here are some examples of potential vector/scalar enhancements that we've been collectively chasing (for many years now...):

  1. Vector load combining: http://llvm.org/PR16739
  2. Vector load combining: http://llvm.org/PR21780
  3. Vector load combining: http://llvm.org/PR39473
  4. Vector store combining: http://llvm.org/PR41892
  5. Insert/extract -> shuffle combining: http://llvm.org/PR34724
  6. Compare insert/extract: http://llvm.org/PR39665
  7. Compare insert/extract: http://llvm.org/PR43745 (Failed to get SLP to do this so far.)
  8. Binop insert/extract: http://llvm.org/PR42633 (I suggested a pass like this proposal in this bug report.)
  9. This is going in the opposite direction, but here's a questionable (because it might interfere with GVN) scalarization proposal for InstCombine: D71828
  10. Similarly, I proposed a scalarization for binops in InstCombine that's hard to justify without TTI: D50992
  11. Shuffle fold that was too dangerous for InstCombine, but should be fine with TTI: D31509

Anybody else have any comments?

I think the general idea of a pass that uses vector cost models to do peephole optimizations on vectors makes sense. A couple general concerns:

  1. If we're running it in the middle of the pipeline, we need to be careful the transforms don't conflict with instcombine.
  2. The vectorizer's cost model is optimized for throughput; I'm a little worried we'll run into issues with latency if we depend too much on the cost model for scalar code.

I think the general idea of a pass that uses vector cost models to do peephole optimizations on vectors makes sense. A couple general concerns:

  1. If we're running it in the middle of the pipeline, we need to be careful the transforms don't conflict with instcombine.

AFAIK, the closest we've come to overlapping/conflicting with vector<->scalar transforms is D50992. I'd be happy to abandon that in favor of a cost-aware transform here. Similarly, we could move some questionable shuffle transforms and/or vector demanded elements analysis out of InstCombine to live here. That would likely have a side benefit of making the optimizer slightly faster overall since we don't need to run this pass as often as InstCombine.

  1. The vectorizer's cost model is optimized for throughput; I'm a little worried we'll run into issues with latency if we depend too much on the cost model for scalar code.

We do have: TargetTransformInfo::getInstructionLatency(const Instruction *I)
...but I don't see it used/overridden anywhere in tree currently.

This looks good to me overall.
I'm not sure about pass ordering, but the pass itself seems sound to me.

lebedev.ri added inline comments.Jan 31 2020, 2:35 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
57

Word "alternative" usually means "something other than what currently is".
Using it here seems misleading to me - currently we do have scalar comparison :)

spatel marked an inline comment as done.Feb 4 2020, 12:39 PM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
57

Good point - the vector code is the alternative that we will compare to.

spatel updated this revision to Diff 242402.Feb 4 2020, 12:40 PM

Patch updated:
Improved code comment.

I think this is ready to land - does anyone have anymore comments?

RKSimon accepted this revision.Feb 7 2020, 10:08 AM

LGTM - thanks for working on this!

This revision is now accepted and ready to land.Feb 7 2020, 10:08 AM
This revision was automatically updated to reflect the committed changes.
craig.topper added inline comments.Feb 10 2020, 1:10 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
34

Is this counter used?

spatel marked an inline comment as done.Feb 10 2020, 1:53 PM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
34

Oops - no. Copy/pasted from existing code. Looks like other passes do something like:

if (!DebugCounter::shouldExecute(VecCombineCounter))
  continue;

I've never used that myself, so I could either add that code or remove the counter. Let me know if there's a preference. Not sure how we test it?

craig.topper added inline comments.Feb 10 2020, 2:12 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
34

Probably fine to drop it. I added the one to instcombine and I've only used it a couple times.

spatel marked an inline comment as done.Feb 11 2020, 6:58 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
34
liuz added a subscriber: liuz.Feb 19 2020, 6:20 PM
xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
96

Skip debug insn?

DbgInfoIntrinsic?

xbolva00 added inline comments.Mar 29 2020, 8:19 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
96

And skip “cold” blocks?

spatel marked 3 inline comments as done.Mar 29 2020, 11:26 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
96

I don't think there's enough going on in this pass yet to make this measurable, but:
rGfc3cc8a4b074

96

Is there a pass that we can view as a template for this?

xbolva00 added inline comments.Mar 29 2020, 5:38 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
96

https://llvm.org/doxygen/classllvm_1_1ProfileSummaryInfo.html

IsColdBlock/IsFunctionEntryCold are probably the helpers we need.

spatel marked an inline comment as done.Mar 30 2020, 9:25 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
96

Thinking about this 1 a bit more, it's not clear to me that we want to predicate on hot/cold. This pass can reduce code size, so the transforms are still potentially beneficial.

Also, I don't see that kind of restriction on any other combiner passes (or other IR passes in general?), so raise this on llvm-dev, so we have a consistent implementation across different passes?