This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Convert gather/scatter with a stride of 2 to contiguous loads/stores
Changes Planned · Public

Authored by kmclaughlin on Mar 3 2022, 8:11 AM.

Details

Summary

This patch extends performMaskedGatherScatterCombine to find gathers
& scatters with a stride of two in their indices, which can be converted
to a pair of contiguous loads or stores with zips & uzps and the
appropriate predicates.

There were no performance improvements found when using this combine for scatter
stores of 64-bit data, so we simply return SDValue() in that case.
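
For illustration only, a minimal sketch of how a stride-2 index might be recognised in such a combine (the helper name and structure are assumptions, not the patch itself):

// Hypothetical helper: returns true if Index is a step_vector with a step of
// 2, i.e. it produces the offsets { 0, 2, 4, 6, ... } of a stride-2 access.
static bool isStrideTwoIndex(SDValue Index) {
  if (Index.getOpcode() != ISD::STEP_VECTOR)
    return false;
  auto *Step = dyn_cast<ConstantSDNode>(Index.getOperand(0));
  return Step && Step->getSExtValue() == 2;
}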

Diff Detail

Event Timeline

kmclaughlin created this revision. · Mar 3 2022, 8:11 AM
kmclaughlin requested review of this revision. · Mar 3 2022, 8:11 AM
Herald added a project: Restricted Project. · Mar 3 2022, 8:11 AM

Hi @kmclaughlin, this looks like a nice improvement! I've not reviewed all of it so far, but I'll leave a few comments in the bits I have reviewed.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16934

Perhaps worth adding a comment here explaining that the following code will attempt to transform stride=2 gathers/scatters into contiguous loads/stores?

16935

It might look nicer if you can avoid unnecessary indentation here by rewriting this to return early, i.e.

if (Index->getOpcode() != ISD::STEP_VECTOR ||
    Mask->getOpcode() == ISD::EXTRACT_SUBVECTOR ||
    IndexType != ISD::SIGNED_SCALED)
  return SDValue();

... rest of code with less indentation ...
16940

I think you can just write if (Step != 2) here.

16944

Are you assuming this is a legal/simple type here? For example, the type could be <vscale x 12 x i32>, which will be an extended type and getSimpleVT will assert. Could you add a test case for this? I think you can avoid this either by only doing this after legalisation or by rewriting as a sequence of if-else statements, i.e.

if (MemVT == MVT::nxv2i64 || MemVT == MVT::nxv2f64
...
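
For illustration, one possible early-return form of that check (the exact set of supported types here is an assumption):

// Hypothetical guard: bail out unless MemVT is one of the legal SVE types the
// combine handles, which avoids calling getSimpleVT() on extended types such
// as <vscale x 12 x i32>.
if (MemVT != MVT::nxv2i64 && MemVT != MVT::nxv2f64 &&
    MemVT != MVT::nxv4i32 && MemVT != MVT::nxv4f32)
  return SDValue();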

Did you try using ld2/st2? I guess the problem is that it accesses too many bytes?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16936

Why the restriction on the mask opcode?

16954

I guess you're not handling i16/i8 because you'd have to worry about extension?

If possible can you move the performMaskedGatherScatterCombine changes into their own function, something like tryCombineToMaskedLoadStore?

kmclaughlin marked 5 inline comments as done.
  • Moved the changes to performMaskedGatherScatterCombine into a new function, tryCombineToMaskedLoadStore.
  • Removed the switch statement and use of getSimpleVT(), using if-else statements which check MemVT for supported types instead.
  • Removed the restriction that the mask opcode of the gather/scatter is not an extract_subvector.

Did you try using ld2/st2? I guess the problem is that it accesses too many bytes?

Hi @efriedma, I did investigate whether I could use ld2/st2 instead and found that the performance was very similar to contiguous loads & stores. The problem with ld2/st2 is that, as you suggested, we could end up accessing too many bytes. For that reason we instead use contiguous loads/stores and always interleave the predicate with pfalse, to ensure that we don't attempt to access beyond the last element which the gather/scatter would have accessed.
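
For illustration only, a minimal sketch of that predicate interleaving in DAG terms (variable names such as Mask, MaskVT, DL and DAG are assumptions, not the patch's actual code):

// Hypothetical sketch: interleave the original mask with an all-false
// predicate so the two contiguous accesses never touch lanes the
// gather/scatter would not have accessed.
SDValue PFalse = DAG.getConstant(0, DL, MaskVT);              // all-false predicate
SDValue LoPred = DAG.getNode(AArch64ISD::ZIP1, DL, MaskVT, Mask, PFalse);
SDValue HiPred = DAG.getNode(AArch64ISD::ZIP2, DL, MaskVT, Mask, PFalse);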

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16936

This restriction was added downstream as a workaround for an issue which I have not been able to reproduce. I'm removing it as I believe it is no longer required.

16944

Hi @david-arm, I have changed this to be a sequence of if-else statements checking MemVT as suggested. I tried adding a test using types such as nxv12i32, but found that this crashes with "Do not know how to widen the result of this operator!" when the stepvector has an illegal type.

16954

This patch was only for handling legal types, though I think i8/i16 types could be supported in future if this would be beneficial.

Hi @kmclaughlin, this patch looks a lot better now thanks! Most of my remaining comments are minor, but I do think we should add some explicit checks for truncating scatters and extending gathers.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16785

nit: Is it worth adding a comment here that Index should always be a STEP_VECTOR, which is the reason you are assuming the first operand is a ConstantSDNode?

16796

I think you may need some extra checks here for extending gather loads and truncating scatter stores, bailing out in both cases; otherwise we may attempt to do the DAG combine for extloads of nxv4i32 -> nxv4i64.
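
For example, the guard might look something like the following (MGT/MSC are assumed to be the gather/scatter nodes, as in the sketch further below):

// Hypothetical checks: bail out for extending gathers and truncating scatters.
if (MGT && MGT->getExtensionType() != ISD::NON_EXTLOAD)
  return SDValue();
if (MSC && MSC->isTruncatingStore())
  return SDValue();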

16805

Given that you're creating new nodes above and then potentially dropping them, it might be better to only create nodes once you're guaranteed to do the transformation, i.e. something like

auto *MGT = dyn_cast<MaskedGatherSDNode>(MGS);
if (MGT && !MGT->getPassThru().isUndef())
  return SDValue();

// Using contiguous stores is not beneficial over 64-bit scatters
auto *MSC = dyn_cast<MaskedScatterSDNode>(MGS);
if (MSC && MSC->getValue().getValueType().getScalarSizeInBits() == 64)
  return SDValue();

// Split the mask into two parts and interleave with false
...

if (MGT) {
  .. gather case ..
}
16809

nit: I think you also calculate this for the scatter case. Maybe it's worth moving this to above the if statement and then re-using the value for both gather and scatter cases?

16824

I'm not sure if you can simply reuse the same memory operand here, i.e. if you look at DAGTypeLegalizer::SplitVecRes_MLOAD you'll see an example of what we do for scalable vectors. I think it's because we're loading from a different offset now.
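
A rough sketch of what building a fresh memory operand might look like, loosely following SplitVecRes_MLOAD (the flags, size and variable names are assumptions):

// Hypothetical: create a new MachineMemOperand for the second, offset access
// instead of reusing the gather/scatter's operand.
MachineFunction &MF = DAG.getMachineFunction();
MachineMemOperand *HiMMO = MF.getMachineMemOperand(
    MachinePointerInfo(MGS->getPointerInfo().getAddrSpace()),
    MachineMemOperand::MOLoad, MemoryLocation::UnknownSize,
    MGS->getOriginalAlign(), MGS->getAAInfo());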

16866

Again, I think you need to create a new mem operand here.

llvm/test/CodeGen/AArch64/sve-gather-scatter-to-contiguous.ll
7

I think it's worth having a FP gather test and FP scatter test too, since we support them

I find the performance claims interesting. I've asked SVE hardware engineers before which is faster -- gather, versus load and shuffle, versus load2 -- and the answer was essentially "it depends on your loop". If you're using up ports to shuffle that could otherwise be used for computation, it seems like this would be a loser. If that analysis is correct then IMO this decision should be made in LV and the backend should honor the gather. However, if it stays in the backend, I still have some comments:

  • 64b scatters being as fast/faster than either contiguous stores or st2 is amazing if true, but is that always going to be true for all SVE targets? I'm guessing some sort of "preferScatter()" or "preferGather()" on a per target basis will (eventually?) be needed for this, but I don't have access to different SVE capable chips.
  • I understand why the contiguous sequence is needed over ld2/st2, but what about when you know it is safe to use ld2/st2? If the contiguous sequence you have here is indeed faster, would existing combines that match to ld2/st2 be combined to this instead?
  • Should this combine also apply to the "odd" elements, which are also stride 2 { 1, 3, 5, 7... }?
  • Does this change still allow a match to ld2/st2 when all the elements are accessed by a pair of gathers/scatters?
kmclaughlin marked 7 inline comments as done.
  • Added the preferGatherScatter function to AArch64Subtarget, which checks if a gather/scatter with a given type & stride is preferable to contiguous loads/stores on the current target.
  • Created new machine memory operands for the contiguous loads/stores instead of reusing the mem operand from the gather/scatter.
  • Added tests for floating-point gathers & scatters, and extending & truncating gathers/scatters.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16796

I think in the case of extending gather loads and truncating scatter stores, the index opcode would be a sign/zero-extend or trunc instead of a stepvector? I've added some tests to check that we don't perform the combine in these cases.

Hi @rscottmanley, thank you for taking a look at this patch.

  • 64b scatters being as fast/faster than either contiguous stores or st2 is amazing if true, but is that always going to be true for all SVE targets? I'm guessing some sort of "preferScatter()" or "preferGather()" on a per target basis will (eventually?) be needed for this, but I don't have access to different SVE capable chips.

We expect it to be faster on Neoverse-v1 for 32b/64b gathers & 64b scatters with a stride of two, so those are the cases we're currently optimising for. You're right that it makes sense to tune this per subtarget and I've added this now.

  • I understand why the contiguous sequence is needed over ld2/st2, but what about when you know it is safe to use ld2/st2? If the contiguous sequence you have here is indeed faster, would existing combines that match to ld2/st2 be combined to this instead?
  • Should this combine also apply to the "odd" elements, which are also stride 2 { 1, 3, 5, 7... }?
  • Does this change still allow a match to ld2/st2 when all the elements are accessed by a pair of gathers/scatters?

This approach is largely a stop-gap until we have proper (de)interleaving in the loop vectoriser, or until the InterleavedAccess pass supports scalable vectors. We would expect those transformations to use an appropriate cost model to decide whether to use gathers/scatters, ld2/st2 or explicit (de)interleaving intrinsics. All of those transformations would happen before legalisation, so if it ends up here as a gather, that is either because it wasn't safe to use ld2/st2 or because the cost model argued against it. That makes this a bit of a last-resort DAGCombine fold to improve code quality. Until we have proper (de)interleaving support in the passes mentioned above, this mechanism will at least already improve performance for the subtargets where it is enabled.

Thanks for making all the changes @kmclaughlin! I think it looks pretty good now. I just had a few more minor comments ...

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16818

nit: Looks like this is indented too much here.

16830

nit: I think you can also move this above the if statement and then reuse for the store case too, and use MGS->getPointerInfo() instead.

llvm/lib/Target/AArch64/AArch64Subtarget.cpp
364

nit: Can you fix the formatting issue? Thanks!

llvm/lib/Target/AArch64/AArch64Subtarget.h
672

I think it's worth being more explicit here about what VecTy actually is, because it could be different to the loaded or stored data type. Do you think it's worth renaming this to MemTy to be clear? Or you can keep the same variable name, but add a comment saying that it is the type of the data in memory. The choice is yours!

@kmclaughlin Thanks for the reply and adding the subtarget check. I figured this might have something to do with the scalable vectorization and this is a reasonable stopgap.

kmclaughlin marked 4 inline comments as done.
  • Renamed VecVT to MemVT in the preferGatherScatter function.
  • Fixed formatting issues.

LGTM (with formatting issues fixed)! This looks great now thanks @kmclaughlin. :)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16773

nit: Can you fix all the formatting issues in your new code before merging please?

david-arm accepted this revision. · Mar 14 2022, 2:20 AM
This revision is now accepted and ready to land. · Mar 14 2022, 2:20 AM

Perhaps I've misunderstood something, but I'd like to take a step back to make sure we're not overcomplicating things here and, at the same time, making it harder to handle more cases in the future.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
16801–16806

Is this a temporary restriction? I ask because when I asked about doing this in IR instead of CodeGen, one of the answers was to support illegal types.

To be clear, I'm not against solving this in CodeGen, but my assumption is that one of the biggest wins from this transform is when trying to gather/scatter bytes or halfwords, which are not supported and thus require splitting. However, they can be handled more efficiently with masked loads and stores.

16816–16818

Can these use ISD::EXTRACT_SUBVECTOR instead? I ask because those will emit better code (removing the need for pfalse) and also allow DAGCombine to perhaps optimise further (e.g. when Mask is a constant).

With that said, ...

16827–16845

...sorry if this is a silly question, as I've not scrutinised the code, but is this manual splitting and recombining necessary? It looks a little like manual type legalisation, but this combine is something that is generally best done before type legalisation, so you should be able to create a single big load and let common code split it if necessary. This might also resolve my "bytes" and "halfwords" comment above.

16857–16858

As above, using ISD::EXTRACT_SUBVECTOR here seems preferable and also means the scatter case doesn't require any target specific nodes.
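
As a generic illustration only (the mask type and variable names are assumptions, e.g. an nxv4i1 predicate split into two nxv2i1 halves), extracting the low and high halves of a scalable predicate with ISD::EXTRACT_SUBVECTOR needs no target-specific nodes:

// Hypothetical: take the low and high halves of the mask.
EVT HalfMaskVT = MaskVT.getHalfNumVectorElementsVT(*DAG.getContext());
SDValue LoMask = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfMaskVT, Mask,
                             DAG.getVectorIdxConstant(0, DL));
SDValue HiMask = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HalfMaskVT, Mask,
                             DAG.getVectorIdxConstant(
                                 HalfMaskVT.getVectorMinNumElements(), DL));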

16914–16915

I think these tests really belong inside tryCombineToMaskedLoadStore because it is that function that has the restrictions, and thus the same function that might later remove them.

kmclaughlin planned changes to this revision. · Mar 16 2022, 6:34 AM
Matt added a subscriber: Matt. · Mar 17 2022, 5:49 PM