This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Add intrinsics for non-temporal gather-loads/scatter-stores
ClosedPublic

Authored by andwar on Feb 19 2020, 11:45 AM.

Details

Summary

[AArch64][SVE] Add intrinsics for non-temporal gather-loads/scatter-stores

This patch adds the following LLVM IR intrinsics:

  1. SVE non-temporal gather loads
    • @llvm.aarch64.sve.ldnt1.gather
    • @llvm.aarch64.sve.ldnt1.gather.uxtw
    • @llvm.aarch64.sve.ldnt1.gather.scalar.offset
  2. SVE non-temporal scatter stores
    • @llvm.aarch64.sve.stnt1.scatter
    • @llvm.aarch64.sve.stnt1.scatter.uxtw
    • @llvm.aarch64.sve.stnt1.scatter.scalar.offset

These intrinsics are mapped to the corresponding SVE instructions
(example for half-words, zero-extending):

  • ldnt1h { z0.s }, p0/z, [z0.s, x0]
  • stnt1h { z0.s }, p0, [z0.s, x0]
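
For illustration, the half-word gather above corresponds to IR along these lines. This is only a sketch: the mangled intrinsic name and signature are assumed to mirror the regular gather intrinsics rather than quoted from this diff.

  define <vscale x 4 x i32> @gldnt1h_zext(<vscale x 4 x i1> %pg, <vscale x 4 x i32> %bases, i64 %offset) {
    ; Non-temporal gather of half-words via 32-bit vector bases plus a
    ; scalar offset; expected to select ldnt1h { z0.s }, p0/z, [z0.s, x0].
    %load = call <vscale x 4 x i16> @llvm.aarch64.sve.ldnt1.gather.scalar.offset.nxv4i16.nxv4i32(<vscale x 4 x i1> %pg, <vscale x 4 x i32> %bases, i64 %offset)
    ; The zero-extension is a separate IR instruction that is folded into
    ; the extending load during selection.
    %res = zext <vscale x 4 x i16> %load to <vscale x 4 x i32>
    ret <vscale x 4 x i32> %res
  }

  declare <vscale x 4 x i16> @llvm.aarch64.sve.ldnt1.gather.scalar.offset.nxv4i16.nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>, i64)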

Note that for non-temporal gathers/scatters, the SVE spec defines only
one instruction type: "vector + scalar". For this reason, we swap the
arguments when processing the following intrinsics (which implement the
"scalar + vector" addressing mode):

  • @llvm.aarch64.sve.ldnt1.gather
  • @llvm.aarch64.sve.ldnt1.gather.uxtw
  • @llvm.aarch64.sve.stnt1.scatter
  • @llvm.aarch64.sve.stnt1.scatter.uxtw

In other words, all intrinsics for gather-loads and scatter-stores
implemented in this patch are mapped to the same load and store
instruction, respectively.
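
To make the swap concrete, here is a sketch of the "scalar + vector" uxtw form (name and signature assumed by analogy with @llvm.aarch64.sve.ld1.gather.uxtw, not quoted from the diff). During lowering the vector of offsets takes the place of the base, so this selects the same vector-plus-scalar instruction as the example above:

  define <vscale x 4 x i32> @gldnt1h_uxtw(<vscale x 4 x i1> %pg, i16* %base, <vscale x 4 x i32> %offsets) {
    ; Scalar base + vector of 32-bit offsets at the IR level; the operands
    ; are swapped during lowering, so this also selects
    ; ldnt1h { z0.s }, p0/z, [z0.s, x0].
    %load = call <vscale x 4 x i16> @llvm.aarch64.sve.ldnt1.gather.uxtw.nxv4i16(<vscale x 4 x i1> %pg, i16* %base, <vscale x 4 x i32> %offsets)
    %res = zext <vscale x 4 x i16> %load to <vscale x 4 x i32>
    ret <vscale x 4 x i32> %res
  }

  declare <vscale x 4 x i16> @llvm.aarch64.sve.ldnt1.gather.uxtw.nxv4i16(<vscale x 4 x i1>, i16*, <vscale x 4 x i32>)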

The sve2_mem_gldnt_vs multiclass (and its counterpart for scatter
stores) from SVEInstrFormats.td was split into:

  • sve2_mem_gldnt_vec_vs_32_ptrs (32-bit wide base addresses)
  • sve2_mem_gldnt_vec_vs_64_ptrs (64-bit wide base addresses)

This is consistent with what we did for
@llvm.aarch64.sve.ld1.scalar.offset and highlights the actual split in
the spec and the implementation.
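
At the intrinsic level, the same split shows up in the width of the vector base. A sketch of the two declarations (assumed, following the mangling convention of the regular gather intrinsics):

  ; 32-bit wide vector base addresses (z.s form):
  declare <vscale x 4 x i32> @llvm.aarch64.sve.ldnt1.gather.scalar.offset.nxv4i32.nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>, i64)
  ; 64-bit wide vector base addresses (z.d form):
  declare <vscale x 2 x i64> @llvm.aarch64.sve.ldnt1.gather.scalar.offset.nxv2i64.nxv2i64(<vscale x 2 x i1>, <vscale x 2 x i64>, i64)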

Diff Detail

Event Timeline

andwar created this revision. Feb 19 2020, 11:45 AM
Herald added a project: Restricted Project.

Note that for the non-temporal gather-loads and scatter-stores "vector base, scalar offset" is the only available addressing mode.

Not sure what you mean by this. The ACLE provides both "vector base, scalar offset" and "scalar base, vector offset"; we should provide corresponding intrinsics, I think.

Hi @efriedma , thank you for taking a look!

Note that for the non-temporal gather-loads and scatter-stores "vector base, scalar offset" is the only available addressing mode.

Not sure what you mean by this. The ACLE provides both "vector base, scalar offset" and "scalar base, vector offset"; we should provide corresponding intrinsics, I think.

I'm not being very clear there, sorry. I'm only referring to the LLVM IR intrinsics, which we want to map 1:1 to the actual instructions (and, AFAIK, "vector base, scalar offset" is the only option there).

The ACLE indeed defines more addressing modes for non-temporal gather loads, and supporting those will require some extra logic. Clang seems like a more natural place to handle that; see the hypothetical sketch below.
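
For example (purely hypothetical, and not part of this patch), an ACLE "index" addressing mode could be lowered by scaling the indices into byte offsets in IR and reusing the plain offset-based intrinsic:

  define <vscale x 2 x i64> @lower_index_form(<vscale x 2 x i1> %pg, i64* %base, <vscale x 2 x i64> %indices) {
    ; Splat the shift amount: log2(8) = 3 for doubleword elements.
    %shift.head = insertelement <vscale x 2 x i64> undef, i64 3, i32 0
    %shift = shufflevector <vscale x 2 x i64> %shift.head, <vscale x 2 x i64> undef, <vscale x 2 x i32> zeroinitializer
    ; Convert element indices to byte offsets.
    %offsets = shl <vscale x 2 x i64> %indices, %shift
    %load = call <vscale x 2 x i64> @llvm.aarch64.sve.ldnt1.gather.nxv2i64(<vscale x 2 x i1> %pg, i64* %base, <vscale x 2 x i64> %offsets)
    ret <vscale x 2 x i64> %load
  }

  declare <vscale x 2 x i64> @llvm.aarch64.sve.ldnt1.gather.nxv2i64(<vscale x 2 x i1>, i64*, <vscale x 2 x i64>)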

andwar updated this revision to Diff 245633. Feb 20 2020, 6:06 AM

I've made some NFC changes, so that we re-use more code.

  • Rebased this patch to incorporate the changes implemented here: https://reviews.llvm.org/rG0e417b034ad2
  • Renamed Selection DAG types for gathers and scatters and used those definitions for non-temporal gathers/scatters (as opposed to defining new ones)

The instruction in the SVE spec is just "scalar + vector". Whether the vector is the base or the offset is a matter of interpretation. (Really, the difference is that we want to write the base with a pointer type, so we aren't forced to insert ptrtoint operations into the IR.)

andwar edited the summary of this revision. Feb 21 2020, 2:48 AM

That's a good point, thank you! I've updated the summary to reflect that.

andwar marked an inline comment as done. Feb 21 2020, 5:25 AM
andwar added inline comments.
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
30

This should be SDT instead of STD. I will fix this.

andwar updated this revision to Diff 245841. Feb 21 2020, 6:40 AM

Fixing typo: STD --> SDT.

This patch is doing a lot of refactoring/renaming, can you separate that out into a separate NFC patch?
On the point of having separate intrinsics for (vector, scalar) and (scalar, vector): I think this makes sense, as we'll then use similar intrinsics for ldnt1 and ld1, and we can target the right instruction in a similar way to how we do this in performLD1GatherCombine.

llvm/include/llvm/IR/IntrinsicsAArch64.td
1279

Can you derive from AdvSIMD_GatherLoad_VectorBase_Intrinsic instead?
(and something similar for the scatter store)

This also makes it more clear that these have the exact same form as the normal gathers.

llvm/lib/Target/AArch64/SVEInstrFormats.td
5074–5077

nit: vec_32b_ptrs ?

5077

If you change the name of this class, you may want to update the parent class as well.

andwar edited the summary of this revision. Feb 25 2020, 7:42 AM
andwar marked 3 inline comments as done.

This patch is doing a lot of refactoring/renaming, can you separate that out into a separate NFC patch?

Extracted and submitted here: https://reviews.llvm.org/rGcff90c938b7be43de482ffb7a8a7fdbdf57c32a3

On the point of having separate intrinsics for (vector, scalar) and (scalar, vector): I think this makes sense, as we'll then use similar intrinsics for ldnt1 and ld1, and we can target the right instruction in a similar way to how we do this in performLD1GatherCombine.

I've updated this patch accordingly. I've kept the names of the new intrinsics consistent with the regular gathers/scatters. The intrinsics added here should be sufficient to support what's defined in the ACLE, but there are still fewer LLVM IR intrinsics for non-temporal gathers/scatters than there are for regular gathers/scatters. Note that for the (scalar, vector) case I added 2 intrinsics:

  • @llvm.aarch64.sve.ldnt1.gather - 64-bit wide offsets
  • @llvm.aarch64.sve.ldnt1.gather.uxtw - 32-bit wide offsets

Doing it this way makes it more natural to re-use the existing code for regular gathers/scatters and keeps things consistent (a sketch of the two declarations follows below). I hope that this makes sense.
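
A sketch of the two (scalar, vector) declarations (assumed, by analogy with the regular gather intrinsics):

  ; 64-bit unscaled offsets:
  declare <vscale x 2 x i64> @llvm.aarch64.sve.ldnt1.gather.nxv2i64(<vscale x 2 x i1>, i64*, <vscale x 2 x i64>)
  ; 32-bit unscaled offsets, zero-extended before the add (uxtw):
  declare <vscale x 4 x i32> @llvm.aarch64.sve.ldnt1.gather.uxtw.nxv4i32(<vscale x 4 x i1>, i32*, <vscale x 4 x i32>)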

llvm/lib/Target/AArch64/SVEInstrFormats.td
5074–5077
5077

What about setting the name of this multiclass to sve2_mem_gldnt_vec_vs_32_ptrs instead? This way, updating the name of the base class is no longer needed.

andwar updated this revision to Diff 246463. Feb 25 2020, 7:49 AM
andwar marked an inline comment as done.

I've already updated the summary and replied to comments inline. Here's a summary of the latest changes:

  • Extracted the NFC changes (this required rebasing)
  • Added 2 more intrinsics for the (scalar, vector) case
  • Addressed comments from @sdesmalen
sdesmalen added inline comments. Feb 25 2020, 9:14 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12660

Have you experimented with moving this code out of this function (same for the SST1_IMM case below) and passing in the Base and Offset (and possibly Chain/PG) as operands to performScatterStoreCombine? In the case statement for aarch64_sve_st1_scatter_scalar_offset and aarch64_sve_stnt1_scatter_scalar_offset you can then do the swap. That seems a bit better than special-casing these intrinsics in the combine itself.

andwar marked an inline comment as done. Feb 26 2020, 5:41 AM
andwar added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12660

Yes, but currently:

  • it is very explicit that AArch64ISD::SSTNT1 requires some special treatment, which IMO is a bit counterintuitive (hopefully the comments make it clear!)
  • I'm only passing one argument (*N), instead of 3 (*N, SDValue, SDValue), to performScatterStoreCombine (not counting the other arguments), so the call site is cleaner if we keep things as they are.

Also, SST1_IMM requires 2 conditions to be checked and the opcode to be updated (and there are 2 possibilities here, either SST1_UXTW or SST1). Swapping Base and Offset when calling performScatterStoreCombine wouldn't be enough to replicate this.

Having said that, I've been looking at performScatterStoreCombine/performGatherLoadCombine for a while now and I wouldn't be surprised if I'm over-engineering this :)

sdesmalen accepted this revision. Feb 27 2020, 3:07 AM

LGTM!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12660

Fair enough. If this function ever needs to be extended for more intrinsics, we may want to reconsider generalising this, but this is fine for now then.

This revision is now accepted and ready to land. Feb 27 2020, 3:07 AM
This revision was automatically updated to reflect the committed changes.