Differential D101167
[AArch64][SVE] Convert svdup(vec, SV_VL1, elm) to insertelement(vec, elm, 0)
Authored by bsmith on Apr 23 2021, 7:37 AM.

Details

By converting the SVE intrinsic to a normal LLVM insertelement we give
the code generator a better chance to remove transitions between GPRs
and VPRs.

Depends on D101302

Co-authored-by: Paul Walker <paul.walker@arm.com>
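For illustration (not part of the original review), a minimal sketch of the conversion the patch title describes, assuming the `vl1`-predicated form of `svdup` where only lane 0 is active:

```llvm
; Before: a dup whose governing predicate is ptrue(SV_VL1) writes only lane 0.
; (1 is the AArch64SVEPredPattern::vl1 pattern value.)
%pg = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 1)
%r  = call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> %v, <vscale x 4 x i1> %pg, i32 %s)

; After: a plain insertelement at index 0, which the code generator can
; pattern-match without keeping the value in a general-purpose register.
%r = insertelement <vscale x 4 x i32> %v, i32 %s, i64 0
```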
Event Timeline

Hello! This looks good to me modulo a few nits.

Not necessarily for this commit, but I ended up implementing a similar optimisation myself as part of some other work. Something else we could do is recognise when a series of DUP calls is the same as a series of insertelement calls. For example:

```llvm
define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %pg1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 3)
  %insert1 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg1, i8 %s)
  %pg2 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)
  %insert2 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert1, <vscale x 16 x i1> %pg2, i8 %s)
  %pg3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)
  %insert3 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert2, <vscale x 16 x i1> %pg3, i8 %s)
  ret <vscale x 16 x i8> %insert3
}
```

is the same as:

```llvm
define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %1 = insertelement <vscale x 16 x i8> %v, i8 %s, i64 2
  %2 = insertelement <vscale x 16 x i8> %1, i8 %s, i64 1
  %3 = insertelement <vscale x 16 x i8> %2, i8 %s, i64 0
  ret <vscale x 16 x i8> %3
}
```

Doing this might look like:

```cpp
// NOTE: not tested very extensively at all
auto *Cursor = I;
unsigned ExpectedPTruePattern = AArch64SVEPredPattern::vl1;
while (Cursor && Cursor->getIntrinsicID() == Intrinsic::aarch64_sve_dup &&
       ExpectedPTruePattern <= AArch64SVEPredPattern::vl8) {
  Value *Dst = Cursor->getArgOperand(0);
  Value *Pg = Cursor->getArgOperand(1);
  Value *Splat = Cursor->getArgOperand(2);

  auto *PTrue = dyn_cast<IntrinsicInst>(Pg);
  if (!PTrue || PTrue->getIntrinsicID() != Intrinsic::aarch64_sve_ptrue)
    break;

  const auto PTruePattern =
      cast<ConstantInt>(PTrue->getOperand(0))->getZExtValue();
  if (PTruePattern != ExpectedPTruePattern)
    break;

  LLVMContext &Ctx = Cursor->getContext();
  IRBuilder<> Builder(Ctx);
  Builder.SetInsertPoint(Cursor);
  auto *Insert =
      Builder.CreateInsertElement(Dst, Splat, ExpectedPTruePattern - 1);
  Cursor->replaceAllUsesWith(Insert);
  Cursor->eraseFromParent();

  if (PTrue->use_empty())
    PTrue->eraseFromParent();

  Cursor = dyn_cast<IntrinsicInst>(Dst);
  ExpectedPTruePattern++;
  Changed = true;
}
```

(Alternatively, we could keep the existing optimisation as-is, but add another function that looks for insertions into DUPs.)

We see these chained dups when we're moving data between NEON and SVE ACLE types[0]. I think doing this further optimisation might make sense in the context of this patch, but it's not a blocker from me.

[0]: https://developer.arm.com/documentation/ka004612/latest
@dmgreen, as per the comment in the summary, we need this to be done after the convert.{from,to}.svbool intrinsics have been optimised, which is done in SVEIntrinsicOpts, hence it needs to live there rather than in instcombine.

@joechrisellis I think that should probably be a separate patch, but yes, good idea :)

@dmgreen On second thought, I shall move the svbool convert optimisation into instcombine as well, and then move this there too.