This is an archive of the discontinued LLVM Phabricator instance.

[Aarch64][SVE] Add intrinsics for scatter stores
ClosedPublic

Authored by andwar on Dec 5 2019, 10:16 AM.

Details

Summary

This patch adds the following SVE intrinsics for scatter stores:

  • 64-bit offsets:
    • @llvm.aarch64.sve.st1.scatter (unscaled)
    • @llvm.aarch64.sve.st1.scatter.index (scaled)
  • 32-bit unscaled offsets:
    • @llvm.aarch64.sve.st1.scatter.uxtw (zero-extended offset)
    • @llvm.aarch64.sve.st1.scatter.sxtw (sign-extended-offset)
  • 32-bit scaled offsets:
    • @llvm.aarch64.sve.st1.scatter.uxtw.index (zero-extended offset)
    • @llvm.aarch64.sve.st1.scatter.sxtw.index (sign-extended offset)
  • vector base + immediate:
    • @llvm.aarch64.sve.st1.scatter.imm

NFCs: Move the TableGen classes for gather load intrnisics so that they
are all in one places. Fixed typo in names.

Event Timeline

andwar created this revision.Dec 5 2019, 10:16 AM
Herald added a reviewer: efriedma. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
efriedma added inline comments.Dec 6 2019, 1:27 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12039

Is the getKnownMinSize comparison trying to reject illegal types? Or something else?

12067

Does this work correctly for nxv2f32? It looks like we're missing test coverage, also.

12079

INTRINSIC_VOID has one result: the chain. You don't need to MergeValues here.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
605

"unpacked"

llvm/test/CodeGen/AArch64/sve-intrinsics-scatter-stores-32bit-scaled-offsets.ll
158

Why are the offsets here using <vscale x 2 x i64>? <vscale x 2 x i32> seems more natural.

andwar marked 5 inline comments as done.Dec 9 2019, 5:34 AM
andwar added a subscriber: eli.friedman.

Thank you for taking a look @eli.friedman!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12039

It's meant to reject illegal types that won't fit into an SVE register (i.e. nxv8i32). I've added a comment to clarify this.

12067

That's a good catch, thanks! For scatter stores ACLE only supports _packed_ vectors of single or double precision floats. I will add a check for this.

I also missed this when implementing performLD1GatherLoad (which is very similar to this method, but it felt worthwhile to keep them seperate). When this patch is approved I'll prepare an update for performLD1GatherLoad.

llvm/test/CodeGen/AArch64/sve-intrinsics-scatter-stores-32bit-scaled-offsets.ll
158
  • Although these _are_ 32-bit wide offsets, they are stored in 64-bit registers (and are implicitly sign or zero-extended)
  • By using <vscale x 2 x i64> we can rely on LLVMScalarOrSameVectorWidth to capture error (i.e. incorrect argument type)
  • This is consistent with the implementation for gather loads

Do you see any disadvantages of using <vscale x 2 x i64> instead of <vscale x 2 x i32>?

andwar updated this revision to Diff 232830.Dec 9 2019, 6:10 AM
  • make sure performST1ScatterCombine bails out for unpacked floats
  • remove MergeValues
  • rebase on top of master
efriedma added inline comments.Dec 9 2019, 2:17 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12067

We could support this in the backend even if clang can't generate it... but I guess it's fine to leave it out.

llvm/test/CodeGen/AArch64/sve-intrinsics-scatter-stores-32bit-scaled-offsets.ll
158

The biggest advantage of using <vscale x 2 x i32> is that it's obvious the high bits are unused, so you don't end up with extra mask/64-bit multiply/etc. That said, we can recover this in other ways.

I'm mostly concerned that we should try to be consistent here. We use <vscale x 2 x i32> for the stored value in llvm.aarch64.sve.st1.scatter.index.nxv2i32.

sdesmalen added inline comments.Dec 11 2019, 7:28 AM
llvm/include/llvm/IR/IntrinsicsAArch64.td
981

nit: The formatting for these intrinsics is odd.

1456

nit: unrelated whitespace change.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12028

It is unclear what N->getOperand(2) is:

EVT SrcVT = N->getOperand(2)->getValueType(0);
const SDValue Dst = N->getOperand(2);
12038

Should we simply compare:

SrcVT.getSizeInBits().getScalableSize() > AArch64::SVEBitsPerBlock

?

12038

Sorry, I meant:

SrcVT.getSizeInBits().getKnownMinSize() > AArch64::SVEBitsPerBlock
llvm/test/CodeGen/AArch64/sve-intrinsics-scatter-stores-32bit-scaled-offsets.ll
158

We envisioned the type for the indices generated by Clang to be <vscale x 2 x i64>, since the ACLE does not have any knowledge of unpacked types (like <vscale x 2 x i32>). The uxtw in the intrinsic would make it clear that the indices will need to be zero-extended from from word to double-word.

However, I agree this isn't very clear when reading/writing the intrinsics in IR and it should be simple to change the patch to take <vscale x 2 x i32> indices instead and generating a truncate in Clang. If the scatterSt1Combine then ANY_EXTENDs the data, the truncate will likely fall away.

If we change this, we should be consistent and update the loads as well.

andwar marked 6 inline comments as done.Dec 12 2019, 6:41 AM
andwar added inline comments.
llvm/include/llvm/IR/IntrinsicsAArch64.td
981

Fixed - I try to use clang-format (which IMHO does a good job here), but sadly the result is inconsistent with the rest of the file.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12028

That's a copy & paste leftover, sorry! I've renamed Dst as well as few other variables for consistency.

llvm/test/CodeGen/AArch64/sve-intrinsics-scatter-stores-32bit-scaled-offsets.ll
158

Thank you both, I'll update accordingly. I suggest that gather loads are updated in a separate patch.

andwar updated this revision to Diff 233610.Dec 12 2019, 6:52 AM
  • Make sure that 32 bit unpacked offsets are passed as nxv2i32 (instead of nxv2i64)
  • Removed NFCs (landed in a seperate patch)
  • Refactored performST1ScatterCombine (better variable names)
efriedma accepted this revision.Dec 12 2019, 9:21 AM

LGTM with one minor suggestion

llvm/include/llvm/IR/IntrinsicsAArch64.td
1072

ImmArg?

This revision is now accepted and ready to land.Dec 12 2019, 9:21 AM
andwar updated this revision to Diff 233816.Dec 13 2019, 8:49 AM
andwar marked an inline comment as done.
  • add ImmArg in the defintion of AdvSIMD_ScatterStore_VectorBase_Intrinsic
  • duplicated imm0_31, uimm5s2, uimm5s4, uimm5s8 with TImmLeaf-based equivalents (required after adding ImmArg above)
  • rebased on top of trun
  • removed a whitespace
andwar added inline comments.
llvm/include/llvm/IR/IntrinsicsAArch64.td
1072

Sorry, you mentioned that earlier and I missed that. I've updated this patch accordingly.

I also had to make sure that the affected instruction multiclasses in _AArch64SVEInstrInfo.td_ use TImmLeaf instead of ImmLeaf (this came up in other patches similar to this one). To this end, I duplicated the following definitions that use ImmLeaf:

  • imm0_31
  • uimm5s2
  • uimm5s4
  • uimm5s8

and defined equivalents using TImmLeaf:

  • timm0_31
  • tuimm5s2
  • tuimm5s4
  • tuimm5s8

We may implement this later with a ComplexPattern, instead of duplicating these. I think that @kmclaughlin might already be looking into it.

This revision was automatically updated to reflect the committed changes.