A rotate of 8 bits of an e16 vector in either direction is equivalent to a
byteswap, i.e. vrev8. There is a generic combine on ISD::ROT{L,R} to
canonicalize these rotations to byteswaps, but on fixed vectors they are
legalized before they have the chance to be combined. This patch teaches the
rotate vector_shuffle lowering to emit these rotations as byteswaps to match
the scalable vector behaviour.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
@reames alluded to it here: https://reviews.llvm.org/D157417#inline-1529308
Tangentially, DAGCombiner canonicalises rotr/rotl to bswap anyway so this brings fixed-length vector behaviour more in line with scalable
If the rotate came in as a fshl/fshr intrinsic or as shl+shr+or would we already get vrev8 for fixed vectors? Is only the shuffle case that is being optimized?
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll | ||
---|---|---|
278 | How does this patch create new rotates? |
Yeah we already get vrev8 for these, DAGCombiner canonicalises them before they would be legalised to vl nodes:
define <4 x i16> @rot_via_fshr(<4 x i16> %a) { %res = call <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %a, <4 x i16> <i16 8, i16 8, i16 8, i16 8>) ret <4 x i16> %res } declare <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c) define <4 x i16> @rot_via_shift(<4 x i16> %a, <4 x i16> %amt) { %1 = shl <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8> %2 = lshr <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8> %3 = or <4 x i16> %1, %2 ret <4 x i16> %3 }
=== rot_via_fshr Initial selection DAG: %bb.0 'rot_via_fshr:' SelectionDAG has 13 nodes: t0: ch,glue = EntryToken t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0 t4: v4i16 = extract_subvector t2, Constant:i64<0> t6: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8> t7: v4i16 = rotr t4, t6 t9: nxv2i16 = insert_subvector undef:nxv2i16, t7, Constant:i64<0> t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9 t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1 Optimized lowered selection DAG: %bb.0 'rot_via_fshr:' SelectionDAG has 11 nodes: t0: ch,glue = EntryToken t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0 t4: v4i16 = extract_subvector t2, Constant:i64<0> t13: v4i16 = bswap t4 t9: nxv2i16 = insert_subvector undef:nxv2i16, t13, Constant:i64<0> t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9 t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1
=== rot_via_shift Initial selection DAG: %bb.0 'rot_via_shift:' SelectionDAG has 18 nodes: t0: ch,glue = EntryToken t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0 t4: v4i16 = extract_subvector t2, Constant:i64<0> t6: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %1 t7: v4i16 = extract_subvector t6, Constant:i64<0> t9: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8> t10: v4i16 = shl t4, t9 t11: v4i16 = srl t4, t9 t12: v4i16 = or t10, t11 t14: nxv2i16 = insert_subvector undef:nxv2i16, t12, Constant:i64<0> t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14 t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1 Optimized lowered selection DAG: %bb.0 'rot_via_shift:' SelectionDAG has 11 nodes: t0: ch,glue = EntryToken t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0 t4: v4i16 = extract_subvector t2, Constant:i64<0> t19: v4i16 = bswap t4 t14: nxv2i16 = insert_subvector undef:nxv2i16, t19, Constant:i64<0> t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14 t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll | ||
---|---|---|
278 | Not sure how I didn't notice these. Looks like it always emitted rotates on zvbb, there's just an issue with the filecheck prefixes. |
Fix extraneous test diffs caused by dodgy rebase (filecheck prefixes are actually fine:
sorry for the noise)
For context, I do *not* think there's a performance difference. This was mostly a canonicalization thing.
Another possibility would be a RISCV shuffle to bswap combine before lowering, but having this be a special case in the lowering doesn't seem bad to me.
Since these rotates are only emitted in one place during lowering, remove the combine and just
handle it there.
llvm/lib/Target/RISCV/RISCVISelLowering.cpp | ||
---|---|---|
4213 | Do we handle FP shuffles here? If so should this be f16 too? |
llvm/lib/Target/RISCV/RISCVISelLowering.cpp | ||
---|---|---|
4213 | I don't think we can ever get an f16 element type in RotateVT, since it's always going to be larger than the original element type. E.g. i16 has to come from a shuffle of i8s |
Do we handle FP shuffles here? If so should this be f16 too?