This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Don't tail-fold for scalable VFs when there is no scalar tail
ClosedPublic

Authored by david-arm on Mar 16 2023, 1:50 AM.

Details

Summary

Currently in LoopVectorize we avoid tail-folding if we can
prove the trip count is always a multiple of the maximum
fixed-width VF. This works because we know the vectoriser
only ever chooses a VF that is a power of 2. However, if
we are also considering scalable VFs then we conservatively
bail out of the optimisation because we don't know the value
of vscale, which could be an odd or prime number, etc.

This patch tries to enable the same optimisation for scalable
VFs by asking if vscale is known to be a power of 2. If so,
we can then query the maximum value of vscale and use the same
logic as we do for fixed-width VFs. I've also added a new TTI
hook called isVScaleKnownToBeAPowerOfTwo that does the same
thing as the existing TargetLowering hook.

Diff Detail

Event Timeline

david-arm created this revision.Mar 16 2023, 1:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 16 2023, 1:50 AM
david-arm requested review of this revision.Mar 16 2023, 1:50 AM
fhahn added a subscriber: fhahn.Mar 16 2023, 1:57 AM

The changes to make the tests more robust should probably be submitted separately so the diff only shows the functional test changes. Also, would be good to pre-commit new tests like vector_add_trip1024

I had written a very similar patch recently, but it would only use the fixed length if the scalable was unknown. The performance of it was pretty bad though, so I ended up dropping it. I had noticed that there is an xfail in llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll at the moment. Can it now be replaced with a check for store <vscale x 4 x i32>?

TargetTransformInfo::isVScaleKnownToBeAPowerOfTwo isn't going to be useable from all the places that need it like instcombine. It might be best to add it to somewhere like vscale_range in the long run?

david-arm updated this revision to Diff 505773.Mar 16 2023, 4:57 AM
david-arm edited the summary of this revision. (Show Details)
  • Rebased on top of NFC test patch.
reames accepted this revision.Mar 16 2023, 8:30 AM

LGTM

I do wonder if we should consider redefining vscale in LangRef to be a power of two. I don't think we have a motivating example for a non power of two. I believe this was originally to support SVE whose original version didn't mandate power of two vector sizes, but I believe a later version of that specification mandated them didn't it?

llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll
2

Please use update-test-checks.py

This revision is now accepted and ready to land.Mar 16 2023, 8:30 AM
sdesmalen added inline comments.Mar 16 2023, 10:16 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5161

Is MaxRuntimeVF a better name?

Also, is it worth making this a std::optional<unsigned>? I struggled to wrap my head around what MaxPossibleVF = 1 meant, because I assumed this meant "don't vectorize, use scalar instructions", given other uses of VF=1 in the LV.

5164

I initially wanted to ask "should this be else if (MaxFactors.ScalableVF) {?" but then I realised we need to be conservative here because the decision (not) to do tail folding is made for _both_ fixed and scalable VFs. So should this code be considering the max(MaxFactors.FixedVF, MaxVScale * MaxFactors.ScalableVF) instead?

5171–5172

nit: move assignment into the condition?

if (std::optional<unsigned> MaxVScale = getMaxVScale(*TheFunction, TTI)) {
  ..
}
5179

nit: This default is already set on line 5155, so there's no need to set it again here?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5179

I don't think we can because this could have been set to something greater than 1 for the fixed-width case, since for scalable vectorisation we usually have both a max fixed-width VF and a max scalable VF. As you say above, we're kind of taking the max of the fixed-width and scalable VFs, but only if safe to do so for scalable VFs. I agree the logic is kind of awkward, but it's not immediately obvious to me how to make it better?

Matt added a subscriber: Matt.Mar 17 2023, 11:11 AM
david-arm updated this revision to Diff 507024.Mar 21 2023, 9:49 AM
  • Addressed review comments about using std::optional
david-arm marked 5 inline comments as done.Mar 21 2023, 9:49 AM

I had written a very similar patch recently, but it would only use the fixed length if the scalable was unknown. The performance of it was pretty bad though, so I ended up dropping it. I had noticed that there is an xfail in llvm/test/Transforms/LoopVectorize/AArch64/eliminate-tail-predication.ll at the moment. Can it now be replaced with a check for store <vscale x 4 x i32>?

TargetTransformInfo::isVScaleKnownToBeAPowerOfTwo isn't going to be useable from all the places that need it like instcombine. It might be best to add it to somewhere like vscale_range in the long run?

Hi @dmgreen, so I don't have a strong objection to doing this as a new vscale_power_of_2 attribute, but I am trying to avoid changing the LangRef again if we don't have a compelling case to do so yet. This is what we did originally with the vscale max, i.e. we first added a TTI interface, then as time went on we saw more and more convincing arguments for moving this to be a vscale_range attribute instead. There is nothing to stop us doing something similar in future I think, right? Of course, there is already a TLI hook of the same name that would need removing too.

Hi @dmgreen, so I don't have a strong objection to doing this as a new vscale_power_of_2 attribute, but I am trying to avoid changing the LangRef again if we don't have a compelling case to do so yet. This is what we did originally with the vscale max, i.e. we first added a TTI interface, then as time went on we saw more and more convincing arguments for moving this to be a vscale_range attribute instead. There is nothing to stop us doing something similar in future I think, right? Of course, there is already a TLI hook of the same name that would need removing too.

Yeah sounds good. I was thinking as a followup, something to keep in mind. It could even go like @reames suggested where we only support power-of-2 vscales - I heard that GCC always took that route for SVE and has only ever supported power-of-2 vscales.

sdesmalen added inline comments.Mar 23 2023, 9:27 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5161

Sorry for changing my mind on this, but I wonder if MaxPowerOf2RuntimeVF is more appropriate name, since the value is std::nullopt if the VF is not a power of 2.

5161–5178

You can write this a bit more compactly using std::max:

std::optional<unsigned> MaxPowerOf2RuntimeVF =
    MaxFactors.FixedVF.getFixedValue();
if (MaxFactors.ScalableVF) {
  std::optional<unsigned> MaxVScale = getMaxVScale(*TheFunction, TTI);
  if (MaxVScale && TTI.isVScaleKnownToBeAPowerOfTwo()) {
    MaxPowerOf2RuntimeVF = std::max<unsigned>(
        *MaxPowerOf2RuntimeVF,
        *MaxVScale * MaxFactors.ScalableVF.getKnownMinValue());
  } else
    MaxPowerOf2RuntimeVF = std::nullopt; // Stick with tail-folding for now.
}
5165–5169

I find this comment more confusing than enlightening. I also think that the issue is more that we decide early in the process whether or not to use tail folding, before we get to decide on a VF. We could make a more fine-grained choice if we would make those choices together. But when I read this comment, I'm not really thinking about that.

In any case, I'm happy for the comment to be removed since the code itself is clear enough.

david-arm updated this revision to Diff 508065.Mar 24 2023, 5:57 AM
  • Addressed review comments.
david-arm marked 3 inline comments as done.Mar 24 2023, 5:57 AM
sdesmalen accepted this revision.Mar 24 2023, 6:28 AM

Thanks @david-arm, LGTM.

ABataev added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5161

Looks like this std::optional does not work here, because MaxFactors.FixedVF.getFixedValue() returns non_optional value. So it may crash in the assert in line 5174, if MaxFactors.FixedVF.getFixedValue() returns 0 and MaxFactors.ScalableVF is not set.

sdesmalen added inline comments.Mar 28 2023, 2:19 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5161

My understanding is that normally (when not forcing it through an option) MaxFactors.FixedVF will be >= 1, are you saying that is not the case? I agree it's worth adding an extra check.

ABataev added inline comments.Mar 28 2023, 2:52 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5161

It might be 0 too

david-arm added inline comments.Mar 29 2023, 1:18 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5173

@ABataev @sdesmalen I am happy to add an extra check here such as

if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {

but I there may be no way to add a test to defend the non-zero behaviour. You'd need to have one of these scenarios:

  1. computeFeasibleMaxVF returns 0 and 0 for both fixed-width and scalable VFs. Not sure how this could ever happen?
  2. computeFeasibleMaxVF returns 0 for fixed-width and non-zero for scalable VFs, but there is no vscale max or vscale is not a power of 2. This may be possible in theory, but difficult to find an actual test that exhibits this behaviour.

Regardless, even if I can't write a test for this I am happy to add a check!

ABataev added inline comments.Mar 29 2023, 5:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5173

I think it can be reproduced if:

  1. Legal->getMaxSafeVectorWidthInBits() / WidestType == 0 (i.e. there is a dependency in the loop, but WidestType is larger than the dependency).
  2. UserVF is specified with the fixed vector length, which is greater than 0.