By converting the SVE intrinsic to a normal LLVM insertelement we give
the code generator a better chance to remove transitions between GPRs
and VPRs.
Depends on D101302
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential D101167
[AArch64][SVE] Convert svdup(vec, SV_VL1, elm) to insertelement(vec, elm, 0)
Authored by bsmith on Apr 23 2021, 7:37 AM.
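As a side note, the transformation in the summary has a simple scalar model: a predicated svdup writes the splat value to every active lane, and under a VL1 ptrue predicate only lane 0 is active, which is exactly insertelement at index 0. The sketch below is a hypothetical standalone model of that equivalence (the helper names and fixed vector length are illustrative assumptions, not LLVM or ACLE APIs):

```cpp
#include <array>
#include <cstdint>

// Illustrative fixed vector length; real SVE vectors are scalable.
constexpr std::size_t VL = 16;

// Model of svdup(vec, pg, elm): write elm to every lane where the
// predicate pg is true.
std::array<int8_t, VL> svdup_model(std::array<int8_t, VL> vec,
                                   const std::array<bool, VL> &pg,
                                   int8_t elm) {
  for (std::size_t i = 0; i < VL; ++i)
    if (pg[i])
      vec[i] = elm;
  return vec;
}

// Model of insertelement(vec, elm, idx): overwrite a single lane.
std::array<int8_t, VL> insertelement_model(std::array<int8_t, VL> vec,
                                           int8_t elm, std::size_t idx) {
  vec[idx] = elm;
  return vec;
}
```

With a predicate modelling ptrue.vl1 (only lane 0 true), `svdup_model(v, vl1, s)` and `insertelement_model(v, s, 0)` produce identical vectors lane for lane.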
Hello! This looks good to me modulo a few nits.

Not necessarily for this commit, but I ended up implementing a similar optimisation myself as part of some other work. Something else we could do is recognise when a series of DUP calls is the same as a series of insertelement calls. For example:

define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
%pg1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 3)
%insert1 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg1, i8 %s)
%pg2 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)
%insert2 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert1, <vscale x 16 x i1> %pg2, i8 %s)
%pg3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)
%insert3 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert2, <vscale x 16 x i1> %pg3, i8 %s)
  ret <vscale x 16 x i8> %insert3
}

is the same as:

define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
%1 = insertelement <vscale x 16 x i8> %v, i8 %s, i64 2
%2 = insertelement <vscale x 16 x i8> %1, i8 %s, i64 1
%3 = insertelement <vscale x 16 x i8> %2, i8 %s, i64 0
  ret <vscale x 16 x i8> %3
}

Doing this might look like:

// NOTE: not tested very extensively at all
auto *Cursor = I;
unsigned ExpectedPTruePattern = AArch64SVEPredPattern::vl1;
while (Cursor && Cursor->getIntrinsicID() == Intrinsic::aarch64_sve_dup &&
       ExpectedPTruePattern <= AArch64SVEPredPattern::vl8) {
  Value *Dst = Cursor->getArgOperand(0);
  Value *Pg = Cursor->getArgOperand(1);
  Value *Splat = Cursor->getArgOperand(2);
  auto *PTrue = dyn_cast<IntrinsicInst>(Pg);
  if (!PTrue || PTrue->getIntrinsicID() != Intrinsic::aarch64_sve_ptrue)
    break;
  const auto PTruePattern =
      cast<ConstantInt>(PTrue->getOperand(0))->getZExtValue();
  if (PTruePattern != ExpectedPTruePattern)
    break;
  LLVMContext &Ctx = Cursor->getContext();
  IRBuilder<> Builder(Ctx);
  Builder.SetInsertPoint(Cursor);
  auto *Insert =
      Builder.CreateInsertElement(Dst, Splat, ExpectedPTruePattern - 1);
  Cursor->replaceAllUsesWith(Insert);
  Cursor->eraseFromParent();
  if (PTrue->use_empty())
    PTrue->eraseFromParent();
  Cursor = dyn_cast<IntrinsicInst>(Dst);
  ExpectedPTruePattern++;
  Changed = true;
}
(Alternatively, we could keep the existing optimisation as-is, but add another function that looks for insertions into DUPs.)

We see these chained DUPs when we're moving data between NEON and SVE ACLE types [0]. I think doing this further optimisation might make sense in the context of this patch, but it's not a blocker from me.

[0]: https://developer.arm.com/documentation/ka004612/latest
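The equivalence that chained-DUP example relies on can also be checked with a plain scalar sketch: a DUP under a ptrue.vlK predicate splats into the first K lanes, so the chain vl3, vl2, vl1 with the same splat value ends up equal to three insertelements at indices 2, 1, 0. The helper names below are hypothetical, and the fixed length of 16 stands in for one concrete scalable-vector width:

```cpp
#include <array>
#include <cstdint>

constexpr std::size_t N = 16; // illustrative fixed width

// Model of a predicated DUP under ptrue.vlK: splat elm into the
// first vl lanes, leave the rest of the vector untouched.
std::array<int8_t, N> dup_first_lanes(std::array<int8_t, N> vec,
                                      std::size_t vl, int8_t elm) {
  for (std::size_t i = 0; i < vl; ++i)
    vec[i] = elm;
  return vec;
}

// The dup(vl3) -> dup(vl2) -> dup(vl1) chain from the example above.
std::array<int8_t, N> chained_dups(std::array<int8_t, N> v, int8_t s) {
  v = dup_first_lanes(v, 3, s);
  v = dup_first_lanes(v, 2, s);
  return dup_first_lanes(v, 1, s);
}

// The equivalent series of single-lane insertions at indices 2, 1, 0.
std::array<int8_t, N> chained_inserts(std::array<int8_t, N> v, int8_t s) {
  v[2] = s;
  v[1] = s;
  v[0] = s;
  return v;
}
```

Both routines leave lanes 0..2 holding the splat value and every other lane unchanged, which is why the loop in the sketch can walk the DUP chain and emit one insertelement per step.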
@dmgreen, as per the comment in the summary, we need this to be done after the convert.{from,to}.svbool intrinsics have been optimised, which is done in SVEIntrinsicOpts, hence it needs to live in there rather than in instcombine.

@joechrisellis I think that should probably be a separate patch, but yes, good idea :)

@dmgreen On second thought, I shall move the svbool convert optimisation into instcombine as well and then move this there too.