This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses
ClosedPublic

Authored by david-arm on Jun 22 2022, 7:36 AM.

Details

Summary

If we have interleave groups in the loop we want to vectorise, then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases, when tail-folding is enabled we'll almost certainly go on to
create VPlans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.

Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.

Added an extra test here:

Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

Diff Detail

Unit Tests: Failed

Event Timeline

david-arm created this revision.Jun 22 2022, 7:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2022, 7:36 AM
david-arm requested review of this revision.Jun 22 2022, 7:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2022, 7:36 AM
sdesmalen added inline comments.Jun 28 2022, 1:52 AM
llvm/include/llvm/Analysis/VectorUtils.h
814

nit: s/haveGroups/hasGroups/ ?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5074

I'm not sure I understand where the UserVF comes into play with this choice, given that this a choice between "tail folding" vs "scalar epilogue". Can you elaborate?

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-interleave.ll
2

Why is this loop still vectorized with a VF of 8?

fhahn added inline comments.Jun 28 2022, 2:44 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-interleave.ll
2

This needs REQUIRES: asserts I think, as it uses -debug; same for the other test. Could you precommit the tests, so only the impact of the patch is shown in this diff?

10

nit: everything except the noalias attribute appears to be unnecessary

david-arm added inline comments.Jun 28 2022, 5:02 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5074

So there is an MVE test that tries to force the use of tail-folding for a loop with masked interleaved accesses, and the test also forces a particular VF. So I had two choices here:

  1. Either fix the test to reflect the new behaviour, or
  2. Assume that when the user has explicitly requested tail-folding and forced a VF, they really do want tail-folding enabled for that VF, and keep the current behaviour in that case.

Alternatively, I could add a brand new flag controlling this behaviour, something like -use-tail-folding-when-interleaving, etc.

david-arm added inline comments.Jun 28 2022, 5:14 AM
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-interleave.ll
2

Hi @sdesmalen, this is actually what we want to happen, because this is a loop with interleaved memory accesses that NEON can handle very efficiently. It's actually a VF of 4, but we have the <8 x float> and <12 x float> types in the loop due to deinterleaving loads and interleaving stores, respectively. What this patch does is instruct the vectoriser to disable tail-folding, because performance is likely to be terrible, and instead fall back on normal unpredicated vector loops that don't require masked interleaved accesses.
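
For illustration, a loop of roughly this shape (hypothetical, not the actual .ll test case) shows where those widths come from at VF=4: a stride-2 load group becomes a single <8 x float> wide load that is deinterleaved with shufflevectors, and a stride-3 store group becomes a <12 x float> wide store built by interleaving shufflevectors.

```cpp
// Hypothetical example only: a stride-2 load group and a stride-3 store
// group. With a fixed-width VF of 4, the vectoriser emits one <8 x float>
// load per iteration (2 * VF lanes, deinterleaved) and one <12 x float>
// store (3 * VF lanes, interleaved), matching NEON's ld2/st3.
void interleave_example(float *__restrict dst, const float *__restrict src,
                        int n) {
  for (int i = 0; i < n; ++i) {
    float a = src[2 * i];     // lane 0 of the stride-2 load group
    float b = src[2 * i + 1]; // lane 1 of the stride-2 load group
    dst[3 * i] = a + b;       // the three lanes of the stride-3 store group
    dst[3 * i + 1] = a - b;
    dst[3 * i + 2] = a * b;
  }
}
```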

david-arm updated this revision to Diff 440598.Jun 28 2022, 6:46 AM
  • Cleaned up tests.
  • Renamed haveGroups -> hasGroups.
david-arm marked 3 inline comments as done.Jun 28 2022, 6:47 AM

dmgreen added a comment.

In terms of MVE - a VLD2/VLD4 cannot be predicated, so in that regard we do not support "MaskedInterleavedAccesses". There is code in canTailPredicateLoop that attempts to get that right. Any other interleave group width will be emulated with a gather/scatter though, which can happily be masked.
https://godbolt.org/z/KzvEqz439

For SVE my understanding is that LD2/LD3/LD4 can be predicated, and other widths (and the current codegen, since interleaving is not yet supported) will use gather/scatter, which can be masked. In the long run SVE may end up with MaskedInterleavedAccesses returning true.

Is the problem that we are trying to use one variable for both NEON and SVE vectorization, where SVE prefers folding the tail and NEON needs not to?

david-arm added a comment.

Hi @dmgreen, when I first discovered this issue it gave me a headache! The fundamental problem here is that we don't create a matrix of VPlan costs with vectorisation style (tail-folding, vector loop + scalar epilogue, etc.) as one axis and VF as the other. This means we cannot look at all possible permutations and choose the lowest-cost combination, so our only option is to decide up-front whether or not to use tail-folding before we've calculated any costs.

We currently don't have support for interleaving with SVE using ld2/st2/etc because we cannot rely upon shufflevector in the same way that fixed-width vectorisation does. This means that if we choose to tail-fold a loop with masked interleaved accesses the costs for both NEON and SVE will be so high that we will not vectorise at all. Whereas if we didn't choose to tail-fold we'd actually end up vectorising the interleaved memory accesses using NEON's ld2/st3, which is still faster than a scalar loop. So what I'm trying to do here is avoid causing regressions when enabling tail-folding for any target that does not support masked interleaved memory accesses. If at some point in the future we can support them for SVE then useMaskedInterleavedAccesses will return true.

dmgreen added a comment.

Yeah, I remember that causing problems for Sjoerd on MVE in the past, with it deciding too much up-front.

Would it be possible to treat this the same way as MVE does - return false from preferPredicateOverEpilogue if the stride of a memory access is 2, 3 or 4? That would prevent the MVE side from getting worse. (It's not a lot worse, but predicating the loop is better than not doing it in the second example up above.) And in the long run we can try to keep pushing the vectorizer towards being able to cost different VPlans against one another.
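
A minimal sketch of that idea, assuming the hook (or a helper it calls) can see the loop's interleave groups. The function name shouldAvoidTailFolding and the exact plumbing are made up for illustration; InterleavedAccessInfo, getInterleaveGroups() and getFactor() are the existing API in llvm/Analysis/VectorUtils.h.

```cpp
#include "llvm/Analysis/VectorUtils.h" // InterleavedAccessInfo, InterleaveGroup

// Sketch only: report whether tail-folding should be avoided because the
// loop contains an interleave group with a factor the target would lower
// as VLD2/VLD3/VLD4 (or LD2/LD3/LD4), which cannot be predicated without
// masked interleaved access support.
static bool shouldAvoidTailFolding(llvm::InterleavedAccessInfo &IAI) {
  for (const auto *Group : IAI.getInterleaveGroups()) {
    unsigned Factor = Group->getFactor();
    if (Factor >= 2 && Factor <= 4)
      return true; // prefer an unpredicated vector loop + scalar epilogue
  }
  return false;
}
```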

Matt added a subscriber: Matt.Jun 28 2022, 2:23 PM

david-arm added a comment.

Hi @dmgreen, I can certainly look into this! I'd just assumed that trying to have predicated loops for targets that don't have masked interleaved accesses would be bad in general, that's all. It means falling back on gather/scatters or scalarising, either of which is usually very expensive.

It looks like in LoopVectorizePass::processLoop we set up the InterleavedAccessInfo object after we've called getScalarEpilogueLowering, which is a shame, but perhaps I can shuffle things around.
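
Roughly, the reshuffle amounts to constructing the InterleavedAccessInfo earlier and handing it to the epilogue-lowering decision. A simplified sketch follows, with the argument lists of both calls heavily trimmed (they take many more analyses than shown here):

```cpp
// Inside LoopVectorizePass::processLoop (simplified): build IAI before
// deciding on the epilogue strategy so that the tail-folding decision can
// see the loop's interleave groups.
InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL.getLAI());
ScalarEpilogueLowering SEL =
    getScalarEpilogueLowering(F, L, Hints, /* ...other analyses..., */ &IAI);
```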

david-arm planned changes to this revision.Jun 30 2022, 6:39 AM
david-arm updated this revision to Diff 445758.Jul 19 2022, 3:08 AM
david-arm retitled this revision from [LoopVectorize] Disable tail-folding when masked interleaved accesses are unavailable to [AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses.
david-arm edited the summary of this revision. (Show Details)
  • Changed to use the TTI hook 'preferPredicateOverEpilogue' in order to make the decision about whether or not to use tail-folding. This requires refactoring the LoopVectorize.cpp code a little in order to pass in the InterleavedAccessInfo object.
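
For reference, a rough sketch of the shape of the AArch64 hook after this change; the parameter list is abbreviated and the other SVE tail-folding checks are elided, so only the new interleave-group bail-out (via the renamed InterleavedAccessInfo::hasGroups()) is shown.

```cpp
// Sketch, not the verbatim patch: when SVE tail-folding would otherwise be
// chosen, fall back to a normal vector loop + scalar epilogue as soon as
// the loop contains any interleave group, since masked interleaved accesses
// are not yet supported for SVE.
bool AArch64TTIImpl::preferPredicateOverEpilogue(
    Loop *L, /* ...other analyses elided..., */ InterleavedAccessInfo *IAI) {
  // ...existing checks (SVE availability, the tail-folding option, required
  // instruction types, etc.) elided...
  if (IAI && IAI->hasGroups())
    return false; // an unpredicated vector loop is likely to be faster
  return true;
}
```
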
kmclaughlin accepted this revision.Jul 29 2022, 7:09 AM

Hi @david-arm, this change looks reasonable to me. Can you please wait a day or so before landing, just to make sure the other reviewers have a chance to look at it again?

This revision is now accepted and ready to land.Jul 29 2022, 7:09 AM
This revision was landed with ongoing or failed builds.Aug 2 2022, 1:52 AM
This revision was automatically updated to reflect the committed changes.