This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Add intrinsics for gather loads with 64-bit offsets
Closed, Public

Authored by andwar on Nov 21 2019, 7:59 AM.

Details

Summary

This patch adds the following intrinsics for gather loads with 64-bit offsets:

  • @llvm.aarch64.sve.ld1.gather (unscaled offset)
  • @llvm.aarch64.sve.ld1.gather.index (scaled offset)

These intrinsics map 1:1 to the following AArch64 instructions, respectively (examples use half-words; an illustrative IR call is sketched after the list):

  • ld1h { z0.d }, p0/z, [x0, z0.d]
  • ld1h { z0.d }, p0/z, [x0, z0.d, lsl #1]
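
For illustration, a hedged sketch of how a call to the scaled-offset variant might look in IR. The type-mangling suffix and exact parameter list are assumptions inferred from the discussion below, not confirmed signatures:

  ; Gathers half-words from [x0 + index*2]; expected to select:
  ;   ld1h { z0.d }, p0/z, [x0, z0.d, lsl #1]
  declare <vscale x 2 x i16> @llvm.aarch64.sve.ld1.gather.index.nxv2i16(<vscale x 2 x i1>, i16*, <vscale x 2 x i64>)

  define <vscale x 2 x i16> @gather_h(<vscale x 2 x i1> %pg, i16* %base, <vscale x 2 x i64> %indices) {
    %ld = call <vscale x 2 x i16> @llvm.aarch64.sve.ld1.gather.index.nxv2i16(<vscale x 2 x i1> %pg, i16* %base, <vscale x 2 x i64> %indices)
    ret <vscale x 2 x i16> %ld
  }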

Diff Detail

Event Timeline

andwar created this revision. Nov 21 2019, 7:59 AM
Herald added a reviewer: efriedma.
Herald added a project: Restricted Project.
andwar marked 2 inline comments as done. Nov 21 2019, 8:09 AM
andwar added inline comments.
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
1147

I missed this, sorry! I will remove it in the next patch.

llvm/lib/Target/AArch64/SVEInstrFormats.td
5608

This Pseudo is not needed. I will remove it in the next patch.

sdesmalen added a comment.

Thanks for this patch @andwar! I think the patch could do with some cleanup, but most of it looks quite straightforward.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
617

Is this change supposed to be here?

775

Is this change supposed to be here?

836

nit: unrelated change.

2993

nit: unrelated change.

11793

Can we move this functionality into a separate function? e.g. something like MVT getPackedIntegerSVEType(MVT EltTy).
That should simplify this function quite a bit.
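
For illustration, a minimal sketch of the kind of helper being suggested, assuming the mapping is from an unpacked SVE vector type to its packed integer container type (the patch later lands a variant of this under the name getSVEContainerType):

  // Map an unpacked SVE vector type onto the packed integer vector type
  // whose lanes act as the containers for the narrow elements.
  static MVT getPackedIntegerSVEType(MVT VT) {
    switch (VT.SimpleTy) {
    default:
      llvm_unreachable("unexpected SVE vector type");
    // Types with two lanes live in 64-bit containers.
    case MVT::nxv2i8:
    case MVT::nxv2i16:
    case MVT::nxv2i32:
    case MVT::nxv2i64:
      return MVT::nxv2i64;
    // Types with four lanes live in 32-bit containers.
    case MVT::nxv4i8:
    case MVT::nxv4i16:
    case MVT::nxv4i32:
      return MVT::nxv4i32;
    }
  }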

11831

Can this code not simply do a BITCAST to OutVT?

llvm/lib/Target/AArch64/AArch64ISelLowering.h
230

nit: unrelated change.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
26

SVEAddrModeRegReg8 is not used anywhere, please remove.

1147

please remove.

llvm/lib/Target/AArch64/SVEInstrFormats.td
5294

nit: unrelated change.

llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll
11 (On Diff #230460)

The check lines for the sign/zero extend are not strictly necessary, but they will make it more obvious when a later patch, which folds the sign/zero extension into the load itself, removes them.

andwar marked 15 inline comments as done. Nov 22 2019, 2:03 AM
andwar added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
617

Removed, thanks!

775

Good catch, thanks, removed!

11831

Good catch, thanks! Updated.

llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll
11 (On Diff #230460)

Ack

andwar updated this revision to Diff 230614. Nov 22 2019, 2:07 AM
andwar marked 4 inline comments as done.

@sdesmalen Cheers for the quick review and your comments. I've updated the patch accordingly.

fpetrogalli added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11930–11933

Nit: the prevailing style in this file seems to be a single helper shared by both intrinsics, with the ISD node selection done inside that helper.

static SDValue performLD1GatherCombine(SDNode *N, SelectionDAG &DAG) {
  unsigned Opcode;
  // For an INTRINSIC_W_CHAIN node the intrinsic ID is operand 1.
  switch (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue()) {
  default:
    // Guarantees that the function is not invoked on something that it
    // cannot handle yet.
    llvm_unreachable("unexpected intrinsic in performLD1GatherCombine");
  case Intrinsic::aarch64_sve_ld1_gather:
    Opcode = AArch64ISD::GLD1;
    break;
  case Intrinsic::aarch64_sve_ld1_gather_index:
    Opcode = AArch64ISD::GLD1_SCALED;
    break;
  }
  EVT RetVT = N->getValueType(0);
  assert(RetVT.isScalableVector() &&
         "Gather loads are only possible for SVE vectors");
  // ... build and return the AArch64ISD node using Opcode ...
}


//...
    case Intrinsic::aarch64_sve_ld1_gather:
    case Intrinsic::aarch64_sve_ld1_gather_index:
      return performLD1GatherCombine(N, DAG);
// ...
llvm/lib/Target/AArch64/Utils/AArch64BaseInfo.h
654

static constexpr unsigned should make sure that we don't run into duplicate variable definitions if the header gets included somewhere else (admittedly, an unlikely situation in this specific case).
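
As a minimal illustration of the point (the constant name and value here are hypothetical):

  // In a header: constexpr implies const, which gives the variable internal
  // linkage, so each translation unit that includes this header gets its own
  // copy and no duplicate-definition errors arise at link time.
  namespace AArch64 {
  static constexpr unsigned SVEBitsPerBlock = 128; // hypothetical constant
  } // namespace AArch64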

sdesmalen added inline comments. Nov 22 2019, 10:57 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11765

The packed vector type is not defined by the number of elements, but rather by the element type, so this should take MVT EltTy instead.

efriedma added inline comments. Nov 22 2019, 1:33 PM
llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll
15 (On Diff #230614)

This doesn't match the way the corresponding C intrinsics are defined in the ACLE spec. Are you intentionally diverging here?

andwar updated this revision to Diff 230902. Edited Nov 25 2019, 7:41 AM
andwar marked 5 inline comments as done.

I've uploaded a new patch.

Thank you all for having a look! @sdesmalen, I refactored and renamed getPackedIntegerSVEType to getSVEContainerType. I'm hoping that the intent for that method is now clearer. What are your thoughts?

andwar added inline comments. Nov 25 2019, 7:42 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11930–11933

Good point! However, that would lead to two separate switch statements with similar cases, i.e. code duplication. Either way it won't be ideal, so I would like to keep the current implementation for now.

llvm/lib/Target/AArch64/Utils/AArch64BaseInfo.h
654

Good point, updated!

llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll
15 (On Diff #230614)

Thank you for taking a look @efriedma! Yes, this is intentional. Well spotted!

If we used <nxv2i64> as the return type here then we wouldn't need the zext. However, we'd then need some other way to differentiate between ld1b and ld1sb later, which would basically double the number of intrinsics. We felt that leaving the sign/zero extension here (to be folded away by a simple DAGCombine) is a good compromise. I will be upstreaming that code shortly.
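
For illustration, a hedged sketch of that design (the intrinsic's type-mangling suffix is an assumption): the extension node, not the intrinsic, carries the signedness, so one intrinsic can later be folded to either ld1b or ld1sb:

  declare <vscale x 2 x i8> @llvm.aarch64.sve.ld1.gather.nxv2i8(<vscale x 2 x i1>, i8*, <vscale x 2 x i64>)

  ; zext on the narrow result marks an unsigned load: foldable to ld1b.
  define <vscale x 2 x i64> @gather_ub(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets) {
    %ld = call <vscale x 2 x i8> @llvm.aarch64.sve.ld1.gather.nxv2i8(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets)
    %res = zext <vscale x 2 x i8> %ld to <vscale x 2 x i64>
    ret <vscale x 2 x i64> %res
  }

  ; sext instead marks a signed load: foldable to ld1sb.
  define <vscale x 2 x i64> @gather_sb(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets) {
    %ld = call <vscale x 2 x i8> @llvm.aarch64.sve.ld1.gather.nxv2i8(<vscale x 2 x i1> %pg, i8* %base, <vscale x 2 x i64> %offsets)
    %res = sext <vscale x 2 x i8> %ld to <vscale x 2 x i64>
    ret <vscale x 2 x i64> %res
  }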

I should also point out that we have more intrinsics like this to upstream - this patch covers only 2 addressing modes.

efriedma added inline comments. Nov 25 2019, 12:10 PM
llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll
15 (On Diff #230614)

My only concern with this is that it means we're committing to a specific layout for these types: specifically, that converting from <vscale x 2 x i8> to <vscale x 2 x i64> using an ANY_EXTEND is a no-op. (Otherwise, we would be forced to generate some very nasty code to lower a load that isn't immediately extended.)

For other targets, we've found that doing legalization by promoting the integer type isn't the best approach. For example, on x86, we used to legalize <2 x i16> by promoting it to <2 x i64>. But that was changed to widen it to <8 x i16>, where the padding is all at the end, because the operations were generally cheaper.

Maybe we could implement some hybrid approach to legalization, though, that widens <vscale x 2 x i8> to <vscale x 16 x i8>, but interleaves the padding so the conversion to <vscale x 2 x i64> is just a bitcast.

sdesmalen added inline comments. Nov 26 2019, 7:07 AM
llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll
15 (On Diff #230614)

Using the unpacked layout for scalable vectors is a conscious decision for SVE. The most important reason is the ability to generate a predicate: there is no ptrue instruction that generates an all-true predicate for a packed <vscale x 8 x i16> vector by setting only the bottom vscale x 8 lanes to true. That means we'd need to evaluate a whole compare expression to create the mask. Because a predicate must have a consistent meaning across all of its uses, we'd also need to add instructions to unpack the predicate in order for it to have the same meaning across operations that work on different element sizes. Other operations, like extracting the bottom/high half of a vector, would also become more expensive, since we couldn't use the zip1/zip2 instructions for that any more.

Unlike other targets, SVE's full predication makes it much less likely that operating on unpacked vectors is more expensive, so this seemed like the right design choice.

It's worth pointing out that this requirement is specific to scalable vectors and does not necessarily translate to fixed-width vectors like <2 x i32> mapped to SVE, because ptrue has patterns to generate a mask for an explicit vector length (although that would still require multiple predicates if the loop uses different element sizes).

efriedma accepted this revision. Nov 26 2019, 12:49 PM

LGTM

llvm/test/CodeGen/AArch64/sve-intrinsics-gather-loads-64bit-offset.ll
15 (On Diff #230614)

Okay, that makes sense.

This revision is now accepted and ready to land. Nov 26 2019, 12:49 PM
andwar updated this revision to Diff 231497. Nov 29 2019, 2:11 AM

Added some NFC changes before merging:

  • Missing test case for a vector of doubles
  • Renamed the tests to better reflect the content
  • Added a TableGen class for the intrinsics (to reduce code duplication)
This revision was automatically updated to reflect the committed changes.