This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Convert svdup(vec, SV_VL1, elm) to insertelement(vec, elm, 0)
ClosedPublic

Authored by bsmith on Apr 23 2021, 7:37 AM.

Details

Summary

By converting the SVE intrinsic to a normal LLVM insertelement we give
the code generator a better chance to remove transitions between GPRs
and VPRs

Depends on D101302

Co-authored-by: Paul Walker <paul.walker@arm.com>

Diff Detail

Event Timeline

bsmith created this revision.Apr 23 2021, 7:37 AM
bsmith requested review of this revision.Apr 23 2021, 7:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 23 2021, 7:37 AM

Hello Similarly to D100476, can this just be done in instcombine?

Hello! This looks good to me modulo a few nits.

Not necessarily for this commit, but I ended up implementing a similar optimisation myself as part of some other work. Something else we could do is recognise when a series of DUP calls is the same as a series of insertelement calls. For example:

define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %pg1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 3)
  %insert1 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg1, i8 %s)
  %pg2 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)
  %insert2 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert1, <vscale x 16 x i1> %pg2, i8 %s)
  %pg3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)
  %insert3 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert2, <vscale x 16 x i1> %pg3, i8 %s)
  ret <vscale x 16 x i8> %insert1
}

is the same as:

define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %1 = insertelement <vscale x 16 x i8> %v, i8 %s, i64 2
  %2 = insertelement <vscale x 16 x i8> %1, i8 %s, i64 1
  %3 = insertelement <vscale x 16 x i8> %2, i8 %s, i64 0
  ret <vscale x 16 x i8> %1
}

Doing this might look like:

// NOTE: not tested very extensively at all
auto *Cursor = I;
unsigned ExpectedPTruePattern = AArch64SVEPredPattern::vl1;
while (Cursor && Cursor->getIntrinsicID() == Intrinsic::aarch64_sve_dup &&
       ExpectedPTruePattern <= AArch64SVEPredPattern::vl8) {
  Value *Dst = Cursor->getArgOperand(0);
  Value *Pg = Cursor->getArgOperand(1);
  Value *Splat = Cursor->getArgOperand(2);

  auto *PTrue = dyn_cast<IntrinsicInst>(Pg);
  if (!PTrue || PTrue->getIntrinsicID() != Intrinsic::aarch64_sve_ptrue)
    break;

  const auto PTruePattern =
      cast<ConstantInt>(PTrue->getOperand(0))->getZExtValue();

  if (PTruePattern != ExpectedPTruePattern)
    break;

  LLVMContext &Ctx = Cursor->getContext();
  IRBuilder<> Builder(Ctx);
  Builder.SetInsertPoint(Cursor);

  auto *Insert = Builder.CreateInsertElement(Dst, Splat, ExpectedPTruePattern - 1);
  Cursor->replaceAllUsesWith(Insert);
  Cursor->eraseFromParent();

  if (PTrue->use_empty())
    PTrue->eraseFromParent();

  Cursor = dyn_cast<IntrinsicInst>(Dst);
  ExpectedPTruePattern++;
  Changed = true;
}

(alternatively, we could keep the existing optimisation as-is, but add another function that looks for insertions into DUPs)

We see these chained-dups when we're moving data between NEON and SVE ACLE types[0]. I think doing this further optimisation might make sense in the context of this patch, but not a blocker from me.

[0]: https://developer.arm.com/documentation/ka004612/latest

llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp
547

nit: I think this should say insert instead.

llvm/test/CodeGen/AArch64/sve-intrinsic-opts-dup.ll
2

nit: can we declare the triple in the IR instead of in the run line?

Hello Similarly to D100476, can this just be done in instcombine?

@dmgreen, as per the comment in the summary, we need this to be done after the convert.{from,to}.svbool intrinsics have been optimised, which is done in SVEIntrinsicOpts, hence it needs to live in there rather than instcombine.

Not necessarily for this commit, but I ended up implementing a similar optimisation myself as part of some other work. Something else we could do is recognise when a series of DUP calls is the same as a series of insertelement calls. For example:

@joechrisellis I think that should probably be a separate patch, but yes, good idea :)

bsmith updated this revision to Diff 340487.Apr 26 2021, 5:14 AM
bsmith marked 2 inline comments as done.
  • Move triple in test to IR statement rather than command line
  • Fix comment typo and lint issues

@dmgreen On second thought, I shall move the svbool convert optimisation into instcombine as well and then move this there too.

Matt added a subscriber: Matt.Apr 26 2021, 6:51 AM
bsmith updated this revision to Diff 340554.Apr 26 2021, 9:16 AM
bsmith edited the summary of this revision. (Show Details)
  • Move transform into instcombine
dmgreen accepted this revision.Apr 27 2021, 9:20 AM

Move transform into instcombine

Brilliant! Thanks. LGTM, if @joechrisellis agrees.

This revision is now accepted and ready to land.Apr 27 2021, 9:20 AM