
[AArch64][SVE] Add intrinsics for scatter stores

Authored by andwar on Dec 5 2019, 10:16 AM.



This patch adds the following SVE intrinsics for scatter stores:

  • 64-bit offsets:
    • @llvm.aarch64.sve.st1.scatter (unscaled)
    • @llvm.aarch64.sve.st1.scatter.index (scaled)
  • 32-bit unscaled offsets:
    • @llvm.aarch64.sve.st1.scatter.uxtw (zero-extended offset)
    • @llvm.aarch64.sve.st1.scatter.sxtw (sign-extended offset)
  • 32-bit scaled offsets:
    • @llvm.aarch64.sve.st1.scatter.uxtw.index (zero-extended offset)
    • @llvm.aarch64.sve.st1.scatter.sxtw.index (sign-extended offset)
  • vector base + immediate:
    • @llvm.aarch64.sve.st1.scatter.imm
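As a rough illustration of what these variants do (this model is not part of the patch; `scatter_store`, its parameters, and the dictionary-as-memory representation are all invented for this sketch):

```python
# Toy model of SVE scatter-store addressing for the intrinsic variants
# listed above: each active lane i stores data[i] to
# base + extend(offsets[i]) * scale.

def scatter_store(memory, data, pred, base, offsets, *,
                  scale=1, extend=None):
    """extend: None for 64-bit offsets, 'uxtw' to zero-extend a
    32-bit offset, 'sxtw' to sign-extend a 32-bit offset."""
    for i, active in enumerate(pred):
        if not active:
            continue  # inactive lanes store nothing
        off = offsets[i]
        if extend == 'uxtw':
            off &= 0xFFFFFFFF                 # zero-extend word
        elif extend == 'sxtw':
            off &= 0xFFFFFFFF
            if off & 0x80000000:              # sign-extend word
                off -= 1 << 32
        memory[base + off * scale] = data[i]

mem = {}
scatter_store(mem, data=[10, 20], pred=[True, True],
              base=100, offsets=[0, -1], scale=4, extend='sxtw')
# lane 1 stores to 100 + (-1) * 4 = 96
```

The scaled (`.index`) variants correspond to `scale` being the element size; the unscaled variants correspond to `scale=1`.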

NFCs: Moved the TableGen classes for gather load intrinsics so that they
are all in one place. Fixed a typo in the names.


Event Timeline

andwar created this revision.Dec 5 2019, 10:16 AM
Herald added a reviewer: efriedma.
Herald added a project: Restricted Project.
efriedma added inline comments.Dec 6 2019, 1:27 PM

Is the getKnownMinSize comparison trying to reject illegal types? Or something else?


Does this work correctly for nxv2f32? It looks like we're missing test coverage, also.


INTRINSIC_VOID has one result: the chain. You don't need to MergeValues here.




Why are the offsets here using <vscale x 2 x i64>? <vscale x 2 x i32> seems more natural.

andwar marked 5 inline comments as done.Dec 9 2019, 5:34 AM
andwar added a subscriber: eli.friedman.

Thank you for taking a look @eli.friedman!


It's meant to reject illegal types that won't fit into an SVE register (e.g. nxv8i32). I've added a comment to clarify this.


That's a good catch, thanks! For scatter stores, the ACLE only supports _packed_ vectors of single- or double-precision floats. I will add a check for this.

I also missed this when implementing performLD1GatherLoad (which is very similar to this method, though it felt worthwhile to keep them separate). Once this patch is approved I'll prepare an update for performLD1GatherLoad.

  • Although these _are_ 32-bit wide offsets, they are stored in 64-bit registers (and are implicitly sign or zero-extended)
  • By using <vscale x 2 x i64> we can rely on LLVMScalarOrSameVectorWidth to catch errors (e.g. an incorrect argument type)
  • This is consistent with the implementation for gather loads

Do you see any disadvantages of using <vscale x 2 x i64> instead of <vscale x 2 x i32>?

andwar updated this revision to Diff 232830.Dec 9 2019, 6:10 AM
  • make sure performST1ScatterCombine bails out for unpacked floats
  • remove MergeValues
  • rebase on top of master
efriedma added inline comments.Dec 9 2019, 2:17 PM

We could support this in the backend even if clang can't generate it... but I guess it's fine to leave it out.


The biggest advantage of using <vscale x 2 x i32> is that it's obvious the high bits are unused, so you don't end up with extra mask/64-bit multiply/etc. That said, we can recover this in other ways.

I'm mostly concerned that we should try to be consistent here. We use <vscale x 2 x i32> for the stored value in llvm.aarch64.sve.st1.scatter.index.nxv2i32.

sdesmalen added inline comments.Dec 11 2019, 7:28 AM

nit: The formatting for these intrinsics is odd.


nit: unrelated whitespace change.


It is unclear what N->getOperand(2) is:

EVT SrcVT = N->getOperand(2)->getValueType(0);
const SDValue Dst = N->getOperand(2);

Should we simply compare:

SrcVT.getSizeInBits().getScalableSize() > AArch64::SVEBitsPerBlock



Sorry, I meant:

SrcVT.getSizeInBits().getKnownMinSize() > AArch64::SVEBitsPerBlock
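For reference, the check being discussed can be modelled in a few lines of Python (a sketch, not the actual C++): the known minimum size of `<vscale x N x iM>` is N * M bits, and `AArch64::SVEBitsPerBlock` is 128, so a type like nxv8i32 (256-bit minimum) is rejected.

```python
# Toy model of the legality check: reject stored-value types whose
# known-minimum size exceeds one SVE register (128 bits).
SVE_BITS_PER_BLOCK = 128  # AArch64::SVEBitsPerBlock

def fits_in_sve_register(num_elements, element_bits):
    # getKnownMinSize() of <vscale x N x iM> is N * M bits
    return num_elements * element_bits <= SVE_BITS_PER_BLOCK

fits_in_sve_register(2, 64)   # nxv2i64: fits
fits_in_sve_register(8, 32)   # nxv8i32: needs splitting, rejected
```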

We envisioned the type for the indices generated by Clang to be <vscale x 2 x i64>, since the ACLE does not have any knowledge of unpacked types (like <vscale x 2 x i32>). The uxtw in the intrinsic would make it clear that the indices will need to be zero-extended from word to double-word.

However, I agree this isn't very clear when reading/writing the intrinsics in IR, and it should be simple to change the patch to take <vscale x 2 x i32> indices instead and generate a truncate in Clang. If the scatterSt1Combine then ANY_EXTENDs the data, the truncate will likely fall away.

If we change this, we should be consistent and update the loads as well.
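The point about the truncate folding away can be sketched with a toy model (helper names invented here; the real transformation operates on SelectionDAG nodes, not Python integers):

```python
# Toy model of why trunc + any_extend can fold: ANY_EXTEND leaves the
# high bits undefined, so only the low 32 bits of the index ever
# matter, and the uxtw/sxtw addressing mode performs the extension
# in hardware anyway.

def truncate32(x):
    # i64 -> i32: keep the low word
    return x & 0xFFFFFFFF

def any_extend32(x):
    # i32 -> i64 with undefined high bits (modelled as zero here)
    return x & 0xFFFFFFFF

idx = 0x1_0000_0005           # a 64-bit index as Clang would emit it
assert any_extend32(truncate32(idx)) == truncate32(idx)
```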

andwar marked 6 inline comments as done.Dec 12 2019, 6:41 AM
andwar added inline comments.

Fixed. I try to use clang-format (which IMHO does a good job here), but sadly the result is inconsistent with the rest of the file.


That's a copy & paste leftover, sorry! I've renamed Dst as well as a few other variables for consistency.


Thank you both, I'll update accordingly. I suggest that gather loads are updated in a separate patch.

andwar updated this revision to Diff 233610.Dec 12 2019, 6:52 AM
  • Make sure that 32-bit unpacked offsets are passed as nxv2i32 (instead of nxv2i64)
  • Removed NFCs (landed in a separate patch)
  • Refactored performST1ScatterCombine (better variable names)
efriedma accepted this revision.Dec 12 2019, 9:21 AM

LGTM with one minor suggestion



This revision is now accepted and ready to land.Dec 12 2019, 9:21 AM
andwar updated this revision to Diff 233816.Dec 13 2019, 8:49 AM
andwar marked an inline comment as done.
  • add ImmArg in the definition of AdvSIMD_ScatterStore_VectorBase_Intrinsic
  • duplicated imm0_31, uimm5s2, uimm5s4, uimm5s8 with TImmLeaf-based equivalents (required after adding ImmArg above)
  • rebased on top of trunk
  • removed stray whitespace
andwar added inline comments.

Sorry, you mentioned that earlier and I missed it. I've updated this patch accordingly.

I also had to make sure that the affected instruction multiclasses in _AArch64SVEInstrInfo.td_ use TImmLeaf instead of ImmLeaf (this came up in other patches similar to this one). To this end, I duplicated the following definitions that use ImmLeaf:

  • imm0_31
  • uimm5s2
  • uimm5s4
  • uimm5s8

and defined equivalents using TImmLeaf:

  • timm0_31
  • tuimm5s2
  • tuimm5s4
  • tuimm5s8

We may implement this later with a ComplexPattern, instead of duplicating these. I think that @kmclaughlin might already be looking into it.

This revision was automatically updated to reflect the committed changes.