This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] transform bitcasted shuffle to narrower elements
ClosedPublic

Authored by spatel on Mar 24 2020, 12:51 PM.

Details

Summary

bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'

We do not attempt this in InstCombine because we do not want to change types and create new shuffle ops that are potentially not lowered as well as the original code. Here, we can check the cost model to see if it is worthwhile.

I've aggressively enabled this transform even if the types are the same size and/or equal cost because moving the bitcast allows InstCombine to make further simplifications.

In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the redundant shuffles, but we probably want to extend VectorCombine to handle the inverse pattern (shuffle-of-bitcast) to get that simplification directly in IR.

Diff Detail

Event Timeline

spatel created this revision.Mar 24 2020, 12:51 PM

Do we need to count the cost of bitcasts too?

Do we need to count the cost of bitcasts too?

Hmm...is there a scenario where the old bitcast is different in cost than the new bitcast? That op is effectively getting hoisted in the test cases shown here, so it would cancel out if we put it on both sides of the cost comparison.
The TTI API that I think we'd use for this is: TTI.getCastInstrCost(Instruction::BitCast, DestTy, SrcTy).

From reading through the getCastInstrCost()'s i don't think any backend
currently models it, but there's this comment in AArch64ISelLowering.cpp

namespace llvm {

namespace AArch64ISD {

enum NodeType : unsigned {
<...>
  /// Natural vector cast. ISD::BITCAST is not natural in the big-endian
  /// world w.r.t vectors; which causes additional REV instructions to be
  /// generated to compensate for the byte-swapping. But sometimes we do
  /// need to re-interpret the data in SIMD vector registers in big-endian
  /// mode without emitting such REV instructions.
  NVCAST,

which is consistent with https://reviews.llvm.org/D40633#inline-355090 by @efriedma:

On some targets, vector bitcasts aren't free (IIRC big-endian ARM is like this).

From reading through the getCastInstrCost()'s i don't think any backend
currently models it, but there's this comment in AArch64ISelLowering.cpp

namespace llvm {

namespace AArch64ISD {

enum NodeType : unsigned {
<...>
  /// Natural vector cast. ISD::BITCAST is not natural in the big-endian
  /// world w.r.t vectors; which causes additional REV instructions to be
  /// generated to compensate for the byte-swapping. But sometimes we do
  /// need to re-interpret the data in SIMD vector registers in big-endian
  /// mode without emitting such REV instructions.
  NVCAST,

which is consistent with https://reviews.llvm.org/D40633#inline-355090 by @efriedma:

On some targets, vector bitcasts aren't free (IIRC big-endian ARM is like this).

I agree that bitcasts may not be free, but I don't see how that affects the cost calc for this transform.

I'm open to ideas on how to improve this, but I'm not sure how to proceed without some concrete examples:

  1. This transform is too narrow to effectively cost model in isolation? Ie, we need to pattern match something bigger than just cast+shuf.
  2. Implement a generic DAGCombine version of x86's canWidenShuffleElements() to allow targets to reverse this?
  3. Limit this transform to targets where the bitcast is free (and potentially improve the base cost model to account for big-endian)?

Eeeek.
I'm not sure what i was thinking, there is indeed no point in modelling the cost
of bitcast here because we have the exact same bitcast instruction in either case.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
257

I'm not sure how the second part of the check can fail?

spatel marked 3 inline comments as done.Mar 25 2020, 7:21 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
257

That's one that I've tripped on a few times in the past (but failed to add a test for here): a shufflevector can change the vector length (the mask value can have more/less elements than the source values).

I'll add negative tests for both of those clauses.

267–268

The types are reversed here. This is covered by the SSE2 run of the 1st test, but I still missed the bug.

spatel updated this revision to Diff 252582.Mar 25 2020, 8:19 AM
spatel marked an inline comment as done.

Patch updated:

  1. Fix bug in cost calc - types were inverted (the new shuffle is executed in the destination type).
  2. Added code comment to explain the cost calc.
  3. Added negative tests and test comments.
lebedev.ri accepted this revision.Mar 26 2020, 12:11 PM

LG

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
248

It may be useful to prefix the function[s] with a
short blurb explaining what happens here,
what pattern we strive to replace with what pattern.

287–288
NewMaskC.reserve(NewMask.size());
for (int NewMaskElt : NewMask)
  NewMaskC.push_back(Builder.getInt32(NewMaskElt));
This revision is now accepted and ready to land.Mar 26 2020, 12:11 PM

D72467 is not actually a parent, but adding that here because this would conflict (force another rebase of that patch).

spatel marked 3 inline comments as done.Apr 2 2020, 10:17 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
287–288

Obsoleted by D72467 .

spatel updated this revision to Diff 254559.Apr 2 2020, 10:20 AM
spatel marked an inline comment as done.

Patch updated - no logic diffs from before but:

  1. Rebased/simplified with new mask-operand-less version of shuffle (D72467, D77183).
  2. Added function comment to explain the transform/motivation.
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 2 2020, 10:50 AM