A rotate of 8 bits of an e16 vector in either direction is equivalent to a
byteswap, i.e. vrev8. There is a generic combine on ISD::ROT{L,R} to
canonicalize these rotations to byteswaps, but on fixed vectors they are
legalized before they have the chance to be combined. This patch teaches the
rotate vector_shuffle lowering to emit these rotations as byteswaps to match
the scalable vector behaviour.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
@reames alluded to it here: https://reviews.llvm.org/D157417#inline-1529308
Tangentially, DAGCombiner canonicalises rotr/rotl to bswap anyway so this brings fixed-length vector behaviour more in line with scalable
If the rotate came in as a fshl/fshr intrinsic or as shl+shr+or would we already get vrev8 for fixed vectors? Is only the shuffle case that is being optimized?
| llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll | ||
|---|---|---|
| 278 | How does this patch create new rotates? | |
Yeah we already get vrev8 for these, DAGCombiner canonicalises them before they would be legalised to vl nodes:
define <4 x i16> @rot_via_fshr(<4 x i16> %a) {
%res = call <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %a, <4 x i16> <i16 8, i16 8, i16 8, i16 8>)
ret <4 x i16> %res
}
declare <4 x i16> @llvm.fshr.v4i16(<4 x i16> %a, <4 x i16> %b, <4 x i16> %c)
define <4 x i16> @rot_via_shift(<4 x i16> %a, <4 x i16> %amt) {
%1 = shl <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
%2 = lshr <4 x i16> %a, <i16 8, i16 8, i16 8, i16 8>
%3 = or <4 x i16> %1, %2
ret <4 x i16> %3
}=== rot_via_fshr
Initial selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
t4: v4i16 = extract_subvector t2, Constant:i64<0>
t6: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
t7: v4i16 = rotr t4, t6
t9: nxv2i16 = insert_subvector undef:nxv2i16, t7, Constant:i64<0>
t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1
Optimized lowered selection DAG: %bb.0 'rot_via_fshr:'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
t4: v4i16 = extract_subvector t2, Constant:i64<0>
t13: v4i16 = bswap t4
t9: nxv2i16 = insert_subvector undef:nxv2i16, t13, Constant:i64<0>
t11: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t9
t12: ch = RISCVISD::RET_GLUE t11, Register:nxv2i16 $v8, t11:1=== rot_via_shift
Initial selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 18 nodes:
t0: ch,glue = EntryToken
t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
t4: v4i16 = extract_subvector t2, Constant:i64<0>
t6: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %1
t7: v4i16 = extract_subvector t6, Constant:i64<0>
t9: v4i16 = BUILD_VECTOR Constant:i16<8>, Constant:i16<8>, Constant:i16<8>, Constant:i16<8>
t10: v4i16 = shl t4, t9
t11: v4i16 = srl t4, t9
t12: v4i16 = or t10, t11
t14: nxv2i16 = insert_subvector undef:nxv2i16, t12, Constant:i64<0>
t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1
Optimized lowered selection DAG: %bb.0 'rot_via_shift:'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: nxv2i16,ch = CopyFromReg t0, Register:nxv2i16 %0
t4: v4i16 = extract_subvector t2, Constant:i64<0>
t19: v4i16 = bswap t4
t14: nxv2i16 = insert_subvector undef:nxv2i16, t19, Constant:i64<0>
t16: ch,glue = CopyToReg t0, Register:nxv2i16 $v8, t14
t17: ch = RISCVISD::RET_GLUE t16, Register:nxv2i16 $v8, t16:1| llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-reverse.ll | ||
|---|---|---|
| 278 | Not sure how I didn't notice these. Looks like it always emitted rotates on zvbb, there's just an issue with the filecheck prefixes. | |
Fix extraneous test diffs caused by dodgy rebase (filecheck prefixes are actually fine:
sorry for the noise)
How ugly would it be to do it as a special case during the shuffle lowering instead?
For context, I do *not* think there's a performance difference. This was mostly a canonicalization thing.
Another possibility would be a RISCV shuffle to bswap combine before lowering, but having this be a special case in the lowering doesn't seem bad to me.
Since these rotates are only emitted in one place during lowering, remove the combine and just
handle it there.
| llvm/lib/Target/RISCV/RISCVISelLowering.cpp | ||
|---|---|---|
| 4301 | Do we handle FP shuffles here? If so should this be f16 too? | |
| llvm/lib/Target/RISCV/RISCVISelLowering.cpp | ||
|---|---|---|
| 4301 | I don't think we can ever get an f16 element type in RotateVT, since it's always going to be larger than the original element type. E.g. i16 has to come from a shuffle of i8s | |
Do we handle FP shuffles here? If so should this be f16 too?