This is an archive of the discontinued LLVM Phabricator instance.

[RISCV][SelectionDAG] Lower shuffles as bitrotates with vror.vi when possible
ClosedPublic

Authored by luke on Aug 8 2023, 10:09 AM.

Details

Summary

Given a shuffle mask like <3, 0, 1, 2, 7, 4, 5, 6> for v8i8, we can
reinterpret it as a shuffle of v2i32 where the two i32s are bit rotated, and
lower it as a vror.vi (if legal with zvbb enabled).
We also need to make sure that the larger element type is a valid SEW, hence
the tests for zve32x.

X86 already did this, so I've extracted the logic for it and put it inside
ShuffleVectorSDNode so it could be reused by RISC-V. I originally tried to add
this as a generic combine in DAGCombiner.cpp, but it ended up causing worse
codegen on X86 and PPC.
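The matching idea can be sketched as follows (a hypothetical Python model of the check, not the actual LLVM helper): group the mask into chunks of 2, 4, … lanes, and require every chunk to be the same rotation of its own identity sub-mask.

```python
def match_shuffle_as_bit_rotate(mask, elt_bits, max_elt_bits):
    """Return (wider_elt_bits, rotate_amount_bits) if `mask` can be
    performed as a bit rotate of wider elements, else None.
    Hypothetical Python model of the matcher; max_elt_bits stands in
    for the largest legal SEW (e.g. 32 for zve32x)."""
    n = len(mask)
    num_elts = 2
    while num_elts <= n and elt_bits * num_elts <= max_elt_bits:
        rot = None          # rotate amount in lanes, shared by all chunks
        ok = True
        for base in range(0, n, num_elts):
            for i in range(num_elts):
                m = mask[base + i]
                # Each chunk may only pull lanes from within itself.
                if not base <= m < base + num_elts:
                    ok = False
                    break
                r = (m - base - i) % num_elts
                if rot is None:
                    rot = r
                elif r != rot:
                    ok = False
                    break
            if not ok:
                break
        if ok and rot:      # rot == 0 would just be the identity shuffle
            return (elt_bits * num_elts, rot * elt_bits)
        num_elts *= 2
    return None

# The mask from the summary: v8i8 <3,0,1,2,7,4,5,6> is v2i32 rotated by 24.
assert match_shuffle_as_bit_rotate([3, 0, 1, 2, 7, 4, 5, 6], 8, 64) == (32, 24)
```

The pairwise-swap mask `<1,0,3,2,5,4,7,6>` on v8i16 similarly matches as a v4i32 rotate by 16, which is the `vror.vi v8, v8, 16` seen in the tests below.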

Diff Detail

Event Timeline

luke created this revision.Aug 8 2023, 10:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 8 2023, 10:09 AM
luke requested review of this revision.Aug 8 2023, 10:09 AM
luke added inline comments.Aug 8 2023, 10:17 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
4141–4147

As an example, if the rotateAmt is 24, then on RV32 the constant comes out as:

t3:v1i64 = bitcast t4
t4:v2i32 = build_vector <i32 24, i32 0>

I tried handling this case in lowerBuildVectorOfConstants to lower it as a v1i64 vmv_v_x_vl, with the constant reinterpreted across the elements, but it doesn't seem to catch any other cases since this pattern doesn't seem to be generated anywhere else.
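As a quick sanity check of that dump (assuming a little-endian lane layout), reinterpreting the v2i32 build_vector <i32 24, i32 0> as a single i64 lane does yield the scalar 24:

```python
import struct

# Pack the two little-endian i32 lanes <24, 0>, then reread them as one i64.
lanes = struct.pack("<2I", 24, 0)
(as_i64,) = struct.unpack("<Q", lanes)
assert as_i64 == 24
```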

craig.topper added inline comments.Aug 8 2023, 11:35 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
4147

So directly lower it to a vmv.v.x gets picked up and matched to a vror.vi.

I think there's some words missing from that sentence.

luke updated this revision to Diff 548548.Aug 9 2023, 3:26 AM

Fix typo in comments
Also move logic from ShuffleVectorSDNode to ShuffleVectorInst in
Instruction.h. The old signature returned an EVT since nothing else in
SelectionDAGNodes.h seemed to return an MVT, but we still had to pass
through an MVT which seemed hairy. So I've shuffled it about to just
operate on the mask array and placed it amongst the other mask helpers,
even though it isn't necessarily used by the middle-end. But I think that
should be OK, since the helpers in ShuffleVectorInst are used by both the
middle-end and the backend.

luke updated this revision to Diff 548624.Aug 9 2023, 7:52 AM

Fix X86 build

This might be an over-generalization, but I can't help but notice that we don't need the same rotate amount in each chunk. Given the small number of possible rotate amounts, I wonder if it would be worth generating a rotate by a build_vector of amounts.

When looking at your tests, I noticed a bunch which should probably be vrev8.v instead. Might be worth implementing that, then rebasing this over it. I'd be fine with the other order too, so don't take that as a must.

Broader point is that maybe we should be doing this even without zvbb.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-rotate.ll
192

This case is interesting a couple different ways.

First, the vror here is also a vrev8.v. This points out that we probably should be matching vrev8.v. Minor, but we should probably also be preferring it here when both are possible.

Second, the bswap lowering for 2 element pairs is going to be cheaper than the vrgather even without zvbb. We probably should be matching high/low swapping on all element sizes up to XLEN.

240

Continuing from the prior comment..

Since this isn't an arbitrary byte shuffle within each element, but instead two contiguous regions, I think using the rotate expansion even without zvbb is going to be profitable.

luke added a comment.Aug 17 2023, 3:53 AM

Broader point is that maybe we should be doing this even without zvbb.

For reference, this is the new sequence we would be generating without zvbb:

 define <8 x i16> @shuffle_v8i16_as_i32(<8 x i16> %v) {
 ; CHECK-LABEL: shuffle_v8i16_as_i32:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    lui a0, %hi(.LCPI18_0)
-; CHECK-NEXT:    addi a0, a0, %lo(.LCPI18_0)
-; CHECK-NEXT:    vsetivli zero, 8, e16, m1, ta, ma
-; CHECK-NEXT:    vle16.v v10, (a0)
-; CHECK-NEXT:    vrgather.vv v9, v8, v10
-; CHECK-NEXT:    vmv.v.v v8, v9
+; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
+; CHECK-NEXT:    vmv.v.i v9, 0
+; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    vwsubu.vx v10, v9, a0
+; CHECK-NEXT:    li a1, 31
+; CHECK-NEXT:    vsetvli zero, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vand.vx v9, v10, a1
+; CHECK-NEXT:    vsrl.vv v9, v8, v9
+; CHECK-NEXT:    vmv.v.x v10, a0
+; CHECK-NEXT:    vand.vx v10, v10, a1
+; CHECK-NEXT:    vsll.vv v8, v8, v10
+; CHECK-NEXT:    vor.vv v8, v8, v9
 ; CHECK-NEXT:    ret
 ;
 ; ZVBB_V-LABEL: shuffle_v8i16_as_i32:
 ; ZVBB_V:       # %bb.0:
 ; ZVBB_V-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
 ; ZVBB_V-NEXT:    vror.vi v8, v8, 16
 ; ZVBB_V-NEXT:    ret
 ;
 ; ZVBB_ZVE32X-LABEL: shuffle_v8i16_as_i32:
 ; ZVBB_ZVE32X:       # %bb.0:
 ; ZVBB_ZVE32X-NEXT:    vsetivli zero, 4, e32, m4, ta, ma
 ; ZVBB_ZVE32X-NEXT:    vror.vi v8, v8, 16
 ; ZVBB_ZVE32X-NEXT:    ret
   %shuffle = shufflevector <8 x i16> %v, <8 x i16> poison, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6>
   ret <8 x i16> %shuffle
 }
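The vsll.vv/vsrl.vv/vor.vv sequence above is the standard shift-based rotate expansion. A small Python check of the identity it relies on (32-bit elements, shift amounts masked to the lower 5 bits as the vand.vx does):

```python
def rotl32(x, amt):
    """Rotate a 32-bit value left, mirroring the vsll/vsrl/vor expansion.
    The shift amounts are masked to 5 bits, matching the vand.vx."""
    amt &= 31
    if amt == 0:
        return x
    return ((x << amt) | (x >> (32 - amt))) & 0xFFFFFFFF

# Swapping the two i16 halves of each i32 lane is a rotate by 16.
assert rotl32(0xAABBCCDD, 16) == 0xCCDDAABB
```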


Why isn't this constant folded?

+; CHECK-NEXT:    vmv.v.i v9, 0
+; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    vwsubu.vx v10, v9, a0

This vand feels unnecessary. The shift only uses the lower 5 bits.

+; CHECK-NEXT:    li a1, 31
+; CHECK-NEXT:    vsetvli zero, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vand.vx v9, v10, a1
+; CHECK-NEXT:    vsrl.vv v9, v8, v9
luke updated this revision to Diff 551493.Aug 18 2023, 6:55 AM

Fix extra multiplication on x86, and fix test diffs that didn't get included

luke added a comment.Aug 21 2023, 2:55 AM

Why isn't this constant folded?

+; CHECK-NEXT:    vmv.v.i v9, 0
+; CHECK-NEXT:    li a0, 16
+; CHECK-NEXT:    vwsubu.vx v10, v9, a0

This vand feels unnecessary. The shift only uses the lower 5 bits.

+; CHECK-NEXT:    li a1, 31
+; CHECK-NEXT:    vsetvli zero, zero, e32, m1, ta, ma
+; CHECK-NEXT:    vand.vx v9, v10, a1
+; CHECK-NEXT:    vsrl.vv v9, v8, v9

I put both sequences through llvm-mca, and on the sifive-x280 model it seems like the rotate sequence has better throughput.


But if there are still some codegen issues to be addressed, should we defer handling non-zvbb targets to a later patch?

reames added a comment.EditedAug 21 2023, 10:50 AM

But if there are still some codegen issues to be addressed, should we defer handling non-zvbb targets to a later patch?

I'm fine with staging in this way. I mostly brought it up because it might result in a different code structure.

LGTM

Follow ups worth exploring:

  • Can we use slide lowering when zvbb isn't available?
  • Can we handle vector.reverse via a tree of rotates? Either only with zvbb, or always? (Edit: This only covers very short fixed vectors and probably isn't that interesting.)
  • Is matching a rotate with a non-splat rotate amount interesting? This feels like maybe the start of a more general "handle nearby parts" shuffle lowering strategy, and is maybe worth some offline discussion.
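The last bullet (non-splat rotate amounts) could be prototyped by relaxing the matcher so that each chunk carries its own amount. A hypothetical sketch, not part of the patch:

```python
def match_per_chunk_rotate(mask, num_elts):
    """Return the per-chunk rotate amounts (in lanes) if every chunk of
    num_elts lanes is some rotation of its own identity sub-mask, else
    None. Sketch of the non-splat idea floated in review."""
    if len(mask) % num_elts:
        return None
    amounts = []
    for base in range(0, len(mask), num_elts):
        rot = None
        for i in range(num_elts):
            m = mask[base + i]
            if not base <= m < base + num_elts:
                return None
            r = (m - base - i) % num_elts
            if rot is None:
                rot = r
            elif r != rot:
                return None
        amounts.append(rot)
    return amounts

# Different chunks may rotate by different amounts.
assert match_per_chunk_rotate([1, 0, 2, 3], 2) == [1, 0]
```

The resulting amounts list would feed a rotate-by-build_vector lowering rather than vror.vi's single immediate.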
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
192 ↗(On Diff #551493)

As a follow up, it's worth noting this can be done as two rotates: one at SEW 32 by 16, and one at SEW 16 by 8. The latter is a vrev8.

Jim added inline comments.Aug 21 2023, 6:14 PM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-rotate.ll
5

In RISC-V test cases, I've always seen check-prefixes use '-' rather than '_'.

luke added inline comments.Aug 22 2023, 2:53 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-rotate.ll
5

Good catch, will fix!

Adding a few more X86 people, hope you're ok with me stealing your code :)

luke updated this revision to Diff 552287.Aug 22 2023, 3:02 AM

Rename filecheck prefixes

reames accepted this revision.Aug 29 2023, 11:49 AM

Just to be explicit - my LGTM still stands here.

This revision is now accepted and ready to land.Aug 29 2023, 11:49 AM
luke added a comment.Aug 29 2023, 4:39 PM

Just to be explicit - my LGTM still stands here.

Thanks. Is it worthwhile waiting to see if any of the x86 people are ok with the refactoring?

pengfei accepted this revision.Aug 29 2023, 5:45 PM

LGTM.