Page MenuHomePhabricator

[LV] Fallback strategies if tail-folding fails
Needs ReviewPublic

Authored by SjoerdMeijer on May 12 2020, 6:53 AM.

Details

Summary

This implements 2 different vectorization fallback strategies if tail-folding fails: 1) don't vectorize at all, or 2) vectorize using a scalar epilogue. This can be controlled with option -prefer-predicate-over-epilog, that has been changed to take a numeric value corresponding to the tail-folding preference and preferred fallback.

Diff Detail

Event Timeline

Pierre-vh created this revision.May 12 2020, 6:53 AM

The direction looks good to me. We definitely still want to vectorise if possible, if we don't do that when tail-folding is requested, that sounds like a "performance bug".

Just a general remark: tail-predication is a bit of a MVE term. In the LV, tail-folding is the term that is used. Please use that instead in the summary/description etc.

I had a first look, see some nits/questions inline.

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1234

nit: how about SoftFailure -> ReportFailure?
(and that would mean flipping the default value)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5055

nit: tail-predication -> tail-folding

5091

Do we need this if there is no tail? I haven't reminded myself and checked the flow, but can this condition be true?

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
3

Sorry for being a bit lazy, but this is a big example, but why is this rejected for tail-predication? Would be good to indicate the reason somewhere, e.g. function name, in the IR.
And could this example be reduced?

A follow up question inlined.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5047

nit: tail-predication -> tail-folding

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
3

This test is enabling MVE tail-predication, meaning that here we "enable" masked loads/stores that enable tail-folding. Thus, since no other options are used, this relies on TTI hook preferPredicateOverEpilogue to set CM_ScalarEpilogueNotNeededUsePredicate. And so, this hook determines for this test that tail-folding was possible, which we then overrule later, is that correct?

Pierre-vh marked 2 inline comments as done.May 12 2020, 9:09 AM
Pierre-vh added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5091

Yes, it can happen if tail-folding is enabled, and the loop's TC is a multiple of the VF. (e.g. TC=64, VF=16)
In those cases, we do the preparation for tail-folding earlier (line 4968), but then realize here that the loop has no tail, so we must revert (abandon tail folding + clear the flag).

We need this because else it would generate masked load/stores for those kinds of loops, which isn't optimal (normal loads are better). Additionally, it will cause an assertion failure in the MVETailPredication pass (TC cannot be a multiple of the VF in a tail-predicated loop).

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
3

Sorry for being a bit lazy, but this is a big example, but why is this rejected for tail-predication? Would be good to indicate the reason somewhere, e.g. function name, in the IR.
And could this example be reduced?

If I remember correctly, this test is rejected because of this: store i8* %incdec.ptr13, i8** %pos, align 4, it's an outside user of %incdec.ptr13 which is defined in the loop.

I'll try to reduce the test a bit more, and I'll add a comment explaining why it should be rejected.

This test is enabling MVE tail-predication, meaning that here we "enable" masked loads/stores that enable tail-folding. Thus, since no other options are used, this relies on TTI hook preferPredicateOverEpilogue to set CM_ScalarEpilogueNotNeededUsePredicate. And so, this hook determines for this test that tail-folding was possible, which we then overrule later, is that correct?

That is correct, this test is the same as the other one, except it relies on preferPredicateOverEpilogue to set the flag.
Should I add something to check that preferPredicateOverEpilogue accepts the loop? (So the test doesn't silently pass if the TTI hook doesn't accept it anymore)

Pierre-vh updated this revision to Diff 263695.May 13 2020, 6:35 AM
Pierre-vh marked 7 inline comments as done.

Motivated by the discussion on the added test case, just a general remark about the description of this change:

This patch teaches the vectorizer to fallback to a scalar epilogue when a tail-predication hint is found

Using "hint" here can be a bit ambigious. I think the complete story is that we always fall back to a scalar epilogue, regardless whether tail-folding was requested on the command-line, a loop hint, or the TTI hook. Thinking about this, it's probably worth mentioning this on the documentation or description of the -prefer-predicate-over-epilog option, and/or the general documentation of the vectorize_predicate pragma, for which there probably also needs to be a test case added, i.e. for all 3 cases how tail-folding can be requested, there need to be a test. Looks like 2 are covered already: the option, and the TTI hook.

Ayal added a comment.May 13 2020, 1:17 PM

There are indeed loops which can be vectorized only with a scalar remainder loop, i.e., w/o foldTail. But the tests attached could easily be made an exception: LV handles induction variables that are live-out by pre-computing their values in the pre-header. So foldTail should work fine in their presence, provided the live-out value is computed using the original trip-count rather than the one foldTail rounds up.

Other live-outs could also be made to work under foldTail, such as selecting the last live element from the last live vector instead of assuming it's the last element of the last part.

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
238

But they need to be reset to reflect original conditional blocks, as set by canVectorizeWithIfConvert() when it initially called blockCanBePredicated().

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1234

Rather than if/how to report failure, probably better to clone this method into a canFoldTailByMasking() (as it was originally called iirc) const method, w/o having any side-effects that may later need to be abandoned, and a prepareToFoldTailByMasking() which need not return any value.

1241

Try also adding getInductionVars() Phi's, and the values that feed them across the backedge, to the set of live outs allowed by fold tail?

1258

Should an "analysis" message be reported instead of a "failure" one?

1263

Can dump this note regardless of ReportFailure.

1269–1270

In order (for canFoldTailByMasking()) to only check if tail can be folded, this call to blockCanBePredicated() should be replaced with one that does not have side-effects.

1277

ditto.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5051

Fold this under the following switch, possibly falling through to the CM_ScalarEpilogueAllowed case?

5091

They (masked load/stores) are needed, but only in blocks that were originally conditional, rather than all blocks of the loop.

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
34

IND_END pre-computes the value of %incdec.ptr.

llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll
33

ditto.

There are indeed loops which can be vectorized only with a scalar remainder loop, i.e., w/o foldTail. But the tests attached could easily be made an exception: LV handles induction variables that are live-out by pre-computing their values in the pre-header. So foldTail should work fine in their presence, provided the live-out value is computed using the original trip-count rather than the one foldTail rounds up.

Other live-outs could also be made to work under foldTail, such as selecting the last live element from the last live vector instead of assuming it's the last element of the last part.

Thanks for elaborating and explaining, and yes, got it about the live-out values and precomputing them. Other than the live-out values, I think we have an additional challenge because target hook TTI->preferPredicateOverEpilogue that can reject tail-folding for loops that can be vectorised with a scalar remainder because of target dependent restrictions. I think that's why we also need a fallback. And judging from your other comments, please correct me if I'm wrong, it looks like you're happy with the general direction of this, and so we will address your other comments.

Pierre-vh added inline comments.May 15 2020, 4:38 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1234

I've been thinking about this, but I'm not exactly sure it's the right thing to do. There will probably be a small amount of code duplication and we'd be iterating over the loop's instruction twice.

I'm not a fan of iterating over the loop's instructions more than needed, especially in this case as I think it can be avoided.

Here is another idea: What if prepareToFoldTailByMasking becomes const, and instead of directly inserting elements into MaskedOp and ConditionalAssumes, it works on a temporary set.

The simplest way to implement this would be to take sets by reference, like this:
bool prepareToFoldTailByMasking(SmallPtrSetImpl<Instruction*> &MaskedOps, SmallPtrSetImpl<Instruction*> ConditionalAssumes, bool ReportFailure = true). Then the caller controls the sets, and can either give them to LoopVectorizationLegal, or discard them.

Another idea would be to move this logic in another object, for instance, something like TailFoldingLegal, which would:

  • Check the loop, like prepareToFoldTailByMasking currently does, and store the info into its own sets
  • Have methods to either abandon tail-folding by masking, or apply it (= transfer the contents of its sets into LoopVectorizationLegal's sets)

Then it could be used like this:

TailFoldingLegal TFL;
if tail folding is needed:
  TFL.checkLoop(TheLoop)

// etc.

if TFL.canFoldTailByMasking() && we need a tail:
  TFL.applyTailFoldingByMasking(Legal);

What do you think ? Do you prefer a simpler approach, even if it means iterating over the loop's instruction many times, or something a bit more complete like what I suggested?

Note that I don't know the cost of iterating over the loop. Maybe it's really cheap and I shouldn't worry about it (then just ignore everything I suggested above). In any case, I'll just implement your favorite solution, as you know much more about this topic than I do, so please tell me what you prefer.

Thank you very much for your help.

Ayal added inline comments.May 17 2020, 11:24 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1234

Agreed, would be better to avoid repeated instruction scans.

In order for prepareToFoldTail() to be an optional preference, it should cause no side-effects when returning false; i.e., it should add to MaksedOps and ConditionalAssumes iff returning true. This mostly affects blockCanBePredicated() - it would need to have MaskedOps and ConditionalAssumes passed-in as arguments. Sounds reasonable?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5052

May be simpler to exit early after calling prepareToFoldTail(), instead of checking FoldTailByMasking later.

Pierre-vh marked an inline comment as done.May 18 2020, 1:20 AM
Pierre-vh added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1234

In order for prepareToFoldTail() to be an optional preference, it should cause no side-effects when returning false; i.e., it should add to MaksedOps and ConditionalAssumes iff returning true. This mostly affects blockCanBePredicated() - it would need to have MaskedOps and ConditionalAssumes passed-in as arguments. Sounds reasonable?

This is a good idea, but there is one issue left: when ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate and the loop has no tail (TC > 0 && TC % MaxVF == 0), we have to abandon tail-folding (Line 5029 in LoopVectorize.cpp), but if we do things that way, there's no way to abandon tail-folding at that stage. There are 2 solutions that I can think of:

  • Refactor computeMaxVF so it calculates the MaxVF earlier. That way, we can tell if the loops has a tail or not before calling prepareToFoldTailByMasking (= we only call it if there's a tail). Unfortunately, I tried moving the call to computeFeasibleMaxVF earlier in the function, but it triggered the assertion at line 5000 (in if (!useMaskedInterleavedAccesses(TTI))), so it doesn't look like a trivial fix.
  • Make prepareToFoldTailByMasking return some kind of result object (that'd contain both MaskedOp/ConditionalAssumes sets), which can either be discarded or applied to LoopVectorizationLegal. This is similar to what I suggested in my previous comment, just a bit simplified.
    • As I also suggested earlier, we can do without the result object - just pass-in the sets to prepareToFoldTailByMasking and add new methods to LoopVectorizationLegal to add elements to the MaskedOp/ConditionalAssumes sets. This is the simplest fix, but maybe not the cleanest.

Either solution is fine for me, but I'll need some guidance on how to refactor computeMaxVF properly if you prefer the first solution.

Ayal added inline comments.May 18 2020, 1:05 PM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1234

Right. Instead of introducing an early call to prepareToFoldTail(), before knowing MaxVF and if there's a tail to fold or else the preparation should be undone, how about waiting until the original call to prepareToFoldTail() fails. Then, before bailing-out, check if the status is CM_ScalarEpilogueNotNeededUsePredicate and if so return MaxVF instead?

Note BTW that D80085 should hopefully land soon.

Pierre-vh updated this revision to Diff 264822.May 19 2020, 1:59 AM
Pierre-vh marked 3 inline comments as done.
  • Changing implementation of the patch following discussion
  • Removed the ReportFailure argument of prepareToFoldTailByMasking. I don't think it's useful anymore, but feedback is welcome. (The only thing that annoys me is that we now print "loop not vectorized" even when we'll fallback to a scalar epilogue)
  • Added a test that makes use of the attribute that enables tail-folding
  • Simplified tests
Ayal added a comment.May 19 2020, 3:23 AM
  • Changing implementation of the patch following discussion
  • Removed the ReportFailure argument of prepareToFoldTailByMasking. I don't think it's useful anymore, but feedback is welcome. (The only thing that annoys me is that we now print "loop not vectorized" even when we'll fallback to a scalar epilogue)

Right, as commented earlier, an "analysis" message should be reported when fold-tail doesn't work, instead of a "failure" one.

  • Added a test that makes use of the attribute that enables tail-folding
  • Simplified tests

Would be good to also preserve the existing "force-vector-tail-folding" behavior of "fold or do not vectorize", in addition to introducing the new "prefer-vector-tail-folding" behavior.

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
379

(Would have been good to set defaults, but using nullptr's seems cumbersome.)

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1265

Rename e.g. with Tmp or FoldTail prefix to distinguish from the non fold-tail sets?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5107

Why change ScalarEpilogueStatus?

llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll
6

"TP is forced" >> "tail folding is preferred"
"tail-predicated" >> "tail folding".

If "tail-predication" is preferred, it should replace "tail folding" throughout. But a tail can be predicated w/o folding it, so FoldTail or in full FoldTailByMasking seems more accurate?

11

"forces" >> "specifies fold-tail preference via metadata"

Pierre-vh updated this revision to Diff 264887.May 19 2020, 6:58 AM
Pierre-vh marked 5 inline comments as done.
  • Addressed comments (see items marked as "done")
  • Changed "reportVectorizationFailure" calls in "prepareTailFoldingByMasking" into simple debug prints.

Would be good to also preserve the existing "force-vector-tail-folding" behavior of "fold or do not vectorize", in addition to introducing the new "prefer-vector-tail-folding" behavior.

What is your preferred solution for this? Here are some ideas:

  • Add a new command-line switch that acts like the old -prefer-predicate-over-epilog, something like -force-predicate-over-epilog
  • Add an additional switch to use with -prefer-predicate-over-epilog to disable the fallback behaviour, such as -disable-scalar-epilog-fallback
  • Keep the old -prefer-predicate-over-epilog behaviour and only fallback when TTI requested tail-folding
Ayal added a comment.May 20 2020, 11:04 AM
  • Addressed comments (see items marked as "done")
  • Changed "reportVectorizationFailure" calls in "prepareTailFoldingByMasking" into simple debug prints.

Would be good to also preserve the existing "force-vector-tail-folding" behavior of "fold or do not vectorize", in addition to introducing the new "prefer-vector-tail-folding" behavior.

What is your preferred solution for this? Here are some ideas:

  • Add a new command-line switch that acts like the old -prefer-predicate-over-epilog, something like -force-predicate-over-epilog
  • Add an additional switch to use with -prefer-predicate-over-epilog to disable the fallback behaviour, such as -disable-scalar-epilog-fallback
  • Keep the old -prefer-predicate-over-epilog behaviour and only fallback when TTI requested tail-folding

Perhaps either (the first option of) a new command-line switch e.g. -force-vector-tail-folding, or a value added to -prefer-predicate-over-epilog=x indicating the "strength" of the preference.
Adding a matching pragma/metadata would also be nice.
Better avoid adding a new TTI hook, which will not be exercised by any in-tree target, right?

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1258

Would be good to also have an ORE report similar to the original one, except as an analysis rather than failure.

1265

Can add a comment what these sets are for.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5107

Why change ScalarEpilogueStatus?

Pierre-vh added inline comments.May 20 2020, 11:39 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5107

(Sorry, forgot to post this comment)
It seems to be needed, my tests don't pass if unless I change it (they have conflicting ASM for both run lines, and the content is different)

Ayal added inline comments.May 20 2020, 1:21 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5107

Add -check-prefix to the FileCheck of one of the runs?
https://llvm.org/docs/CommandGuide/FileCheck.html#cmdoption-filecheck-check-prefix

Pierre-vh marked an inline comment as done.May 21 2020, 12:37 AM
Pierre-vh added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5107

This is what's added when I don't change ScalarEpilogueStatus, and it's set to ScalarEpilogueNotNeededUsePredicate:

; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[OFFSET_IDX]], i32 0
; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
; CHECK-NEXT:    [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 -1, i32 -2, i32 -3>

It's because of this in InnerLoopVectorizer::widenIntOrFpInduction in LoopVectorize.cpp:1920

// All IV users are scalar instructions, so only emit a scalar IV, not a
// vectorised IV. Except when we tail-fold, then the splat IV feeds the
// predicate used by the masked loads/stores.
if (!Cost->isScalarEpilogueAllowed()) // ScalarEpilogueStatus == CM_ScalarEpilogueAllowed
  CreateSplatIV(ScalarIV, Step);

I am guessing that this should be using Cost->foldTailByMasking() instead, which returns the FoldTailByMasking boolean instead of looking at ScalarEpilogueStatus.

Pierre-vh marked an inline comment as done.May 21 2020, 1:18 AM
Pierre-vh added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5107

After looking a bit more into it:

  • Changing !Cost->isScalarEpilogueAllowed() to Cost->foldTailByMasking() does not solve the issue entirely, it causes multiple test failures.
  • The INDUCTION variable is unused, so this bit of code definitely shouldn't be there.

Solution is either to change the ScalarEpilogueStatus, or find out why Cost->foldTailByMasking() doesn't work as expected.
The tests that fail are:

Failing Tests (3):
  LLVM :: Transforms/LoopVectorize/X86/constant-fold.ll
  LLVM :: Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll
  LLVM :: Transforms/LoopVectorize/pr44488-predication.ll

Which all seem to expect this "INDUCTION" variable, one way or another. However, interestingly, the tests don't use the INDUCTION instruction at all, so maybe it's a bug? (Everytime there's a failure in one of those test, it's in a place where those 3 lines are expected, but not used at all)

Ayal added inline comments.May 24 2020, 3:53 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5107

So now we know why ScalarEpilogueStatus needed to change...

@SjoerdMeijer tried to remove that presumably redundant call to CreateSplatIV() in D78911, recommitting it later under ScalarEpilogueStatus in rG9529597cf4562c64034943dacc29a4ff4fe93d86. We're still trying to figure out how to remove it: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200518/784362.html

I am guessing that this should be using Cost->foldTailByMasking() instead, which returns the FoldTailByMasking boolean instead of looking at ScalarEpilogueStatus.

guess so too, along with needed test fixings. Sjoerd, what do you think?

SjoerdMeijer commandeered this revision.May 24 2020, 11:44 PM
SjoerdMeijer edited reviewers, added: Pierre-vh; removed: SjoerdMeijer.

Hi Ayal, thanks again for looking at this. I am taking over this patch as Pierre has finished his rotation in our team. This patch is part of our tail-predication story, and is the highest on my list. I will pick this up tomorrow as today is a bank holiday here.

Hi @Ayal , with a bit of delay, I am almost ready to pick this up. Regarding your comments on rG9529597cf4562c64034943dacc29a4ff4fe93d86, I've copied them here below as I thought that would be more convenient, but let me know if you would like to move it to elsewhere. Anyway, I am now going to look into your comments, these ones:

This omission of Primary Induction Phi's from Scalars in collectLoopScalars(), under foldTail:

// An induction variable will remain scalar if all users of the induction
// variable and induction variable update remain scalar.
...
+    // If tail-folding is applied, the primary induction variable will be used
+    // to feed a vector compare.
+    if (Ind == Legal->getPrimaryInduction() && foldTailByMasking())
+      continue;

should have led widenIntOrFpInduction() to take the following early-exit for Phi's that are not in Scalars:

// Try to create a new independent vector induction variable. If we can't
// create the phi node, we will splat the scalar induction variable in each
// loop iteration.
if (!shouldScalarizeInstruction(EntryVal)) {

rather than reaching CreateSplatIV() at the end.

Perhaps the regressions occur when EntryVal is the Trunc rather than the IV Phi?
If so, one remedy may be to also omit Trunc from Scalars; another may be to check if (!shouldScalarizeInstruction(IV) instead-of or in addition to if (!shouldScalarizeInstruction(EntryVal)).

Is there a short reproducer?

SjoerdMeijer added a comment.EditedJun 23 2020, 9:21 AM

Hi @Ayal , with a bit of delay, I am almost ready to pick this up. Regarding your comments on rG9529597cf4562c64034943dacc29a4ff4fe93d86, I've copied them here below as I thought that would be more convenient, but let me know if you would like to move it to elsewhere. Anyway, I am now going to look into your comments, these ones:

This omission of Primary Induction Phi's from Scalars in collectLoopScalars(), under foldTail:

// An induction variable will remain scalar if all users of the induction
// variable and induction variable update remain scalar.
...
+    // If tail-folding is applied, the primary induction variable will be used
+    // to feed a vector compare.
+    if (Ind == Legal->getPrimaryInduction() && foldTailByMasking())
+      continue;

should have led widenIntOrFpInduction() to take the following early-exit for Phi's that are not in Scalars:

// Try to create a new independent vector induction variable. If we can't
// create the phi node, we will splat the scalar induction variable in each
// loop iteration.
if (!shouldScalarizeInstruction(EntryVal)) {

rather than reaching CreateSplatIV() at the end.

Perhaps the regressions occur when EntryVal is the Trunc rather than the IV Phi?
If so, one remedy may be to also omit Trunc from Scalars; another may be to check if (!shouldScalarizeInstruction(IV) instead-of or in addition to if (!shouldScalarizeInstruction(EntryVal)).

Is there a short reproducer?

Yep, function @foo_optsize() in test/Transforms/LoopVectorize/X86/optsize.ll is a minimal example. It will be miscompiled with these 2 lines commented out that I added:

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index ec756671ea6..d39795627ec 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1917,8 +1917,8 @@ void InnerLoopVectorizer::widenIntOrFpInduction(PHINode *IV, TruncInst *Trunc) {
   // vectorised IV. Except when we tail-fold, then the splat IV feeds the
   // predicate used by the masked loads/stores.
   Value *ScalarIV = CreateScalarIV(Step);
-  if (!Cost->isScalarEpilogueAllowed())
-    CreateSplatIV(ScalarIV, Step);
+  //if (!Cost->isScalarEpilogueAllowed())
+  //  CreateSplatIV(ScalarIV, Step);
   buildScalarSteps(ScalarIV, Step, EntryVal, ID);
 }

The miscompilation is that we don't emit:

%induction = add <64 x i32> %broadcast.splat, <i32 0, i32 1, ...

and the icmp is incorrectly performed against:

%2 = icmp ule <64 x i32> %broadcast.splat, <i32 202, i32 202, ...

which should have been:

%2 = icmp ule <64 x i32> %induction, <i32 202, i32 202, ...

it indeed doesn't take the early exit:

// Try to create a new independent vector induction variable. If we can't
// create the phi node, we will splat the scalar induction variable in each              
// loop iteration.
if (!shouldScalarizeInstruction(EntryVal)) {

I need to look again if this makes sense or not.

Caught up on all previous messages, and I believe missing in this patch is control over some of the decision making of this old and new behaviour. Rather than introducing yet another option, I liked @Ayal 's suggestion:

value added to -prefer-predicate-over-epilog=x indicating the "strength" of the preference.

for which I will introduce these values/options:

  • x = 1: vectorize if we can't fail-fold,
  • x = 2 : don't vectorize if we can't tail-fold

I think this models the "strength" of the vector tail-folding preference.

SjoerdMeijer retitled this revision from [LoopVectorize] Fallback to a scalar epilogue when TP fails to [LV] Fallback strategies if tail-folding fails.
SjoerdMeijer edited the summary of this revision. (Show Details)

Changed -prefer-predicate-over-epilog to take a numeric value, corresponding to the preferred predication and fallback strategy.

Changed option -prefer-predicate-over-epilog to accept a more descriptive value rather than a cryptic numeric value.

Also a friendly ping. :-)

Is the idea to turn this option on by default for MVE? Maybe by changing the preferPredicateOverEpilogue call?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
182

-> there are different ...

191

Epilog -> Epilogue?

201

Maybe call this predicate-else-scalar-epilog.

202

-> "prefer tail-folding, create scalar epilogue if tail folding fails" ?

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll
2–3

What happens if we change this to predicate-vectorize? Would that be a better option for MVE? (or is that not what this test is really aiming to test?)

Thanks for looking at this Dave. This addresses:

  • the minor comments, the rewording of comments/option,
  • restores 2 new regression tests that I forgot to upload in the previous diff.

Is the idea to turn this option on by default for MVE? Maybe by changing the preferPredicateOverEpilogue call?

So that was definitely the idea, that was the motivation of this prep work here.
I was thinking of addressing that in a follow up patch, because that is a different functional change, and then we can also discuss how to enable this. I.e., preferPredicateOverEpilogue would be one way to do it, another way is to flip the switch and default to falling back to vectorisation, which seems to me what most targets would want actually, but perhaps I am wrong and should get some performance numbers for this.