This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Canonicalize vrot{l,r} to vrev8 when lowering shuffle as rotate
ClosedPublic

Authored by luke on Aug 17 2023, 9:05 AM.

Details

Summary

A rotate of 8 bits of an e16 vector in either direction is equivalent to a
byteswap, i.e. vrev8. There is a generic combine on ISD::ROT{L,R} to
canonicalize these rotations to byteswaps, but on fixed vectors they are
legalized before they have the chance to be combined. This patch teaches the
rotate vector_shuffle lowering to emit these rotations as byteswaps to match
the scalable vector behaviour.

Diff Detail

Event Timeline

luke created this revision.Aug 17 2023, 9:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 17 2023, 9:05 AM
luke requested review of this revision.Aug 17 2023, 9:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 17 2023, 9:05 AM

Do we have reason to believe vrev8 is better than vror?

luke added a comment.Aug 17 2023, 9:26 AM

Do we have reason to believe vrev8 is better than vror?

@reames alluded to it here: https://reviews.llvm.org/D157417#inline-1529308
Tangentially, DAGCombiner canonicalises rotr/rotl to bswap anyway so this brings fixed-length vector behaviour more in line with scalable

If the rotate came in as a fshl/fshr intrinsic or as shl+shr+or would we already get vrev8 for fixed vectors? Is only the shuffle case that is being optimized?

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
278

How does this patch create new rotates?

luke added a comment.Aug 18 2023, 6:24 AM

If the rotate came in as a fshl/fshr intrinsic or as shl+shr+or would we already get vrev8 for fixed vectors? Is only the shuffle case that is being optimized?

Yeah we already get vrev8 for these, DAGCombiner canonicalises them before they would be legalised to vl nodes:

define <4 x i16> @rot_via_fshr(<4 x i16> %a) {
  %res = call <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %a, <4 x i16> <i16 8, i16 8, i16 8, i16 8>)
  ret <4 x i16> %res
}

declare <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c)

define <4 x i16> @rot_via_shift(<4 x i16> %a, <4 x i16> %amt) {
  %1 = shl <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
  %2 = lshr <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
  %3 = or <4 x i16> %1, %2
  ret <4 x i16> %3
}
=== rot_via_fshr
Initial selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
        t6: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
      t7: v4i16 = rotr t4, t6
    t9: nxv2i16 = insert_subvector undef:nxv2i16, t7, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
  t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1


Optimized lowered selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
      t13: v4i16 = bswap t4
    t9: nxv2i16 = insert_subvector undef:nxv2i16, t13, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
  t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1
=== rot_via_shift
Initial selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
    t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
  t4: v4i16 = extract_subvector t2, Constant:i64<0>
    t6: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %1
  t7: v4i16 = extract_subvector t6, Constant:i64<0>
  t9: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
        t10: v4i16 = shl t4, t9
        t11: v4i16 = srl t4, t9
      t12: v4i16 = or t10, t11
    t14: nxv2i16 = insert_subvector undef:nxv2i16, t12, Constant:i64<0>
  t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
  t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1


Optimized lowered selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
      t19: v4i16 = bswap t4
    t14: nxv2i16 = insert_subvector undef:nxv2i16, t19, Constant:i64<0>
  t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
  t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
278

Not sure how I didn't notice these. Looks like it always emitted rotates on zvbb, there's just an issue with the filecheck prefixes.

luke added inline comments.Aug 18 2023, 6:31 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll
278

*It emits rotates on zvbb after D157417

luke updated this revision to Diff 551495.Aug 18 2023, 6:56 AM

Fix extraneous test diffs caused by dodgy rebase (filecheck prefixes are actually fine:
sorry for the noise)

If the rotate came in as a fshl/fshr intrinsic or as shl+shr+or would we already get vrev8 for fixed vectors? Is only the shuffle case that is being optimized?

Yeah we already get vrev8 for these, DAGCombiner canonicalises them before they would be legalised to vl nodes:

define <4 x i16> @rot_via_fshr(<4 x i16> %a) {
  %res = call <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %a, <4 x i16> <i16 8, i16 8, i16 8, i16 8>)
  ret <4 x i16> %res
}

declare <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c)

define <4 x i16> @rot_via_shift(<4 x i16> %a, <4 x i16> %amt) {
  %1 = shl <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
  %2 = lshr <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
  %3 = or <4 x i16> %1, %2
  ret <4 x i16> %3
}
=== rot_via_fshr
Initial selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
        t6: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
      t7: v4i16 = rotr t4, t6
    t9: nxv2i16 = insert_subvector undef:nxv2i16, t7, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
  t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1


Optimized lowered selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
      t13: v4i16 = bswap t4
    t9: nxv2i16 = insert_subvector undef:nxv2i16, t13, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
  t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1
=== rot_via_shift
Initial selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
    t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
  t4: v4i16 = extract_subvector t2, Constant:i64<0>
    t6: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %1
  t7: v4i16 = extract_subvector t6, Constant:i64<0>
  t9: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
        t10: v4i16 = shl t4, t9
        t11: v4i16 = srl t4, t9
      t12: v4i16 = or t10, t11
    t14: nxv2i16 = insert_subvector undef:nxv2i16, t12, Constant:i64<0>
  t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
  t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1


Optimized lowered selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
      t19: v4i16 = bswap t4
    t14: nxv2i16 = insert_subvector undef:nxv2i16, t19, Constant:i64<0>
  t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
  t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1

How ugly would it be to do it as a special case during the shuffle lowering instead?

Do we have reason to believe vrev8 is better than vror?

@reames alluded to it here: https://reviews.llvm.org/D157417#inline-1529308
Tangentially, DAGCombiner canonicalises rotr/rotl to bswap anyway so this brings fixed-length vector behaviour more in line with scalable

For context, I do *not* think there's a performance difference. This was mostly a canonicalization thing.

If the rotate came in as a fshl/fshr intrinsic or as shl+shr+or would we already get vrev8 for fixed vectors? Is only the shuffle case that is being optimized?

Yeah we already get vrev8 for these, DAGCombiner canonicalises them before they would be legalised to vl nodes:

define <4 x i16> @rot_via_fshr(<4 x i16> %a) {
  %res = call <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %a, <4 x i16> <i16 8, i16 8, i16 8, i16 8>)
  ret <4 x i16> %res
}

declare <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c)

define <4 x i16> @rot_via_shift(<4 x i16> %a, <4 x i16> %amt) {
  %1 = shl <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
  %2 = lshr <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
  %3 = or <4 x i16> %1, %2
  ret <4 x i16> %3
}
=== rot_via_fshr
Initial selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
        t6: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
      t7: v4i16 = rotr t4, t6
    t9: nxv2i16 = insert_subvector undef:nxv2i16, t7, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
  t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1


Optimized lowered selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
      t13: v4i16 = bswap t4
    t9: nxv2i16 = insert_subvector undef:nxv2i16, t13, Constant:i64<0>
  t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
  t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1
=== rot_via_shift
Initial selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
    t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
  t4: v4i16 = extract_subvector t2, Constant:i64<0>
    t6: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %1
  t7: v4i16 = extract_subvector t6, Constant:i64<0>
  t9: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
        t10: v4i16 = shl t4, t9
        t11: v4i16 = srl t4, t9
      t12: v4i16 = or t10, t11
    t14: nxv2i16 = insert_subvector undef:nxv2i16, t12, Constant:i64<0>
  t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
  t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1


Optimized lowered selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
          t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
        t4: v4i16 = extract_subvector t2, Constant:i64<0>
      t19: v4i16 = bswap t4
    t14: nxv2i16 = insert_subvector undef:nxv2i16, t19, Constant:i64<0>
  t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
  t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1

How ugly would it be to do it as a special case during the shuffle lowering instead?

Another possibility would be a RISCV shuffle to bswap combine before lowering, but having this be a special case in the lowering doesn't seem bad to me.

luke updated this revision to Diff 551964.Aug 21 2023, 3:44 AM
luke marked an inline comment as done.

Since these rotates are only emitted in one place during lowering, remove the combine and just
handle it there.

luke retitled this revision from [RISCV] Combine (vrot{l,r} vxi16, 8) -> vrev8 to [RISCV] Canonicalize vrot{l,r} to vrev8 when lowering shuffle as rotate.Aug 21 2023, 4:22 AM
luke edited the summary of this revision. (Show Details)
reames accepted this revision.Aug 21 2023, 10:31 AM

LGTM

This revision is now accepted and ready to land.Aug 21 2023, 10:31 AM
craig.topper added inline comments.Aug 21 2023, 10:33 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
4301

Do we handle FP shuffles here? If so should this be f16 too?

luke added inline comments.Aug 22 2023, 3:10 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
4301

I don't think we can ever get an f16 element type in RotateVT, since it's always going to be larger than the original element type. E.g. i16 has to come from a shuffle of i8s