This is an archive of the discontinued LLVM Phabricator instance.

Differential D79783

[LV] Fallback strategies if tail-folding fails
ClosedPublic

Authored by SjoerdMeijer on May 12 2020, 6:53 AM.

Details

Reviewers

samparker
dmgreen
Ayal
gilr
rengolin
Pierre-vh
fhahn

Commits

rGbda8fbe2d2af: [LV] Fallback strategies if tail-folding fails

Summary

This implements 2 different vectorization fallback strategies if tail-folding fails: 1) don't vectorize at all, or 2) vectorize using a scalar epilogue. This can be controlled with the option -prefer-predicate-over-epilog, which has been changed to take a numeric value corresponding to the tail-folding preference and preferred fallback.

Event Timeline

Pierre-vh created this revision. May 12 2020, 6:53 AM

Herald added subscribers: llvm-commits, hiraditya, kristof.beyls. May 12 2020, 6:53 AM

Herald added a project: Restricted Project. May 12 2020, 6:53 AM

SjoerdMeijer added a reviewer: Ayal. May 12 2020, 7:08 AM

Pierre-vh added reviewers: gilr, rengolin. May 12 2020, 7:09 AM

The direction looks good to me. We definitely still want to vectorise if possible; not doing so when tail-folding is requested sounds like a "performance bug".

Just a general remark: tail-predication is a bit of an MVE term. In the LV, tail-folding is the term that is used. Please use that instead in the summary/description etc.

I had a first look, see some nits/questions inline.

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1233	nit: how about SoftFailure -> ReportFailure? (and that would mean flipping the default value)
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4976	nit: tail-predication -> tail-folding
5011	Do we need this if there is no tail? I haven't reminded myself and checked the flow, but can this condition be true?
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
3	Sorry for being a bit lazy, but this is a big example, but why is this rejected for tail-predication? Would be good to indicate the reason somewhere, e.g. function name, in the IR. And could this example be reduced?

Harbormaster completed remote builds in B56427: Diff 263421. May 12 2020, 8:02 AM

A follow up question inlined.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4968	nit: tail-predication -> tail-folding
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
3	This test is enabling MVE tail-predication, meaning that here we "enable" masked loads/stores that enable tail-folding. Thus, since no other options are used, this relies on TTI hook preferPredicateOverEpilogue to set CM_ScalarEpilogueNotNeededUsePredicate. And so, this hook determines for this test that tail-folding was possible, which we then overrule later, is that correct?

Pierre-vh marked 2 inline comments as done. May 12 2020, 9:09 AM

Pierre-vh added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5011	Yes, it can happen if tail-folding is enabled, and the loop's TC is a multiple of the VF. (e.g. TC=64, VF=16) In those cases, we do the preparation for tail-folding earlier (line 4968), but then realize here that the loop has no tail, so we must revert (abandon tail folding + clear the flag). We need this because else it would generate masked load/stores for those kinds of loops, which isn't optimal (normal loads are better). Additionally, it will cause an assertion failure in the MVETailPredication pass (TC cannot be a multiple of the VF in a tail-predicated loop).
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
3	> Sorry for being a bit lazy, but this is a big example, but why is this rejected for tail-predication? Would be good to indicate the reason somewhere, e.g. function name, in the IR. And could this example be reduced?

If I remember correctly, this test is rejected because of this: `store i8* %incdec.ptr13, i8** %pos, align 4`. It is a user outside the loop of `%incdec.ptr13`, which is defined in the loop. I'll try to reduce the test a bit more, and I'll add a comment explaining why it should be rejected.

> This test is enabling MVE tail-predication, meaning that here we "enable" masked loads/stores that enable tail-folding. Thus, since no other options are used, this relies on TTI hook preferPredicateOverEpilogue to set CM_ScalarEpilogueNotNeededUsePredicate. And so, this hook determines for this test that tail-folding was possible, which we then overrule later, is that correct?

That is correct, this test is the same as the other one, except it relies on `preferPredicateOverEpilogue` to set the flag. Should I add something to check that `preferPredicateOverEpilogue` accepts the loop? (So the test doesn't silently pass if the TTI hook doesn't accept it anymore.)
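The no-tail case from the reply on line 5011 above is plain modular arithmetic. A minimal sketch, assuming a known constant trip count; `hasTail` is a hypothetical helper used only for illustration, not a function in LoopVectorize.cpp:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of LoopVectorize.cpp: a loop with a known
// trip count TC leaves remainder iterations (a "tail") only when TC is not
// a multiple of the vectorization factor VF.
inline bool hasTail(std::uint64_t TC, std::uint64_t VF) {
  assert(VF != 0 && "vectorization factor must be non-zero");
  return TC % VF != 0;
}
```

With TC=64 and VF=16 there is no tail, which is the situation in which the reply says tail folding has to be abandoned again.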

Pierre-vh updated this revision to Diff 263695. May 13 2020, 6:35 AM

Pierre-vh marked 7 inline comments as done.

Motivated by the discussion on the added test case, just a general remark about the description of this change:

This patch teaches the vectorizer to fallback to a scalar epilogue when a tail-predication hint is found

Using "hint" here can be a bit ambiguous. I think the complete story is that we always fall back to a scalar epilogue, regardless of whether tail-folding was requested on the command line, by a loop hint, or by the TTI hook. Thinking about this, it's probably worth mentioning this in the documentation or description of the -prefer-predicate-over-epilog option, and/or the general documentation of the vectorize_predicate pragma, for which a test case probably also needs to be added; i.e., for all 3 ways in which tail-folding can be requested, there needs to be a test. It looks like 2 are covered already: the option and the TTI hook.

There are indeed loops which can be vectorized only with a scalar remainder loop, i.e., w/o foldTail. But the tests attached could easily be made an exception: LV handles induction variables that are live-out by pre-computing their values in the pre-header. So foldTail should work fine in their presence, provided the live-out value is computed using the original trip-count rather than the one foldTail rounds up.

Other live-outs could also be made to work under foldTail, such as selecting the last live element from the last live vector instead of assuming it's the last element of the last part.
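The point about live-out induction values can be illustrated with a small worked example. This is only a sketch under stated assumptions: `foldedTripCount` and `inductionLiveOut` are hypothetical helpers, and the induction is assumed to be a simple `Start + Step * i` sequence.

```cpp
#include <cassert>
#include <cstdint>

// Trip count seen by a tail-folded vector loop: the original trip count
// rounded up to the next multiple of VF.
inline std::uint64_t foldedTripCount(std::uint64_t TC, std::uint64_t VF) {
  assert(VF != 0 && "vectorization factor must be non-zero");
  return (TC + VF - 1) / VF * VF;
}

// Value of a linear induction variable after the given number of iterations.
inline std::int64_t inductionLiveOut(std::int64_t Start, std::int64_t Step,
                                     std::uint64_t Iterations) {
  return Start + Step * static_cast<std::int64_t>(Iterations);
}
```

For TC=10 and VF=4 the folded loop covers 12 iterations. Computing the live-out from the original trip count gives `inductionLiveOut(0, 1, 10) == 10`, while using the rounded-up count would give 12, which is the wrong value for users after the loop.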

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
238	But they need to be reset to reflect original conditional blocks, as set by `canVectorizeWithIfConvert()` when it initially called `blockCanBePredicated()`.
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1233	Rather than if/how to report failure, probably better to clone this method into a `canFoldTailByMasking()` (as it was originally called iirc) const method, w/o having any side-effects that may later need to be abandoned, and a `prepareToFoldTailByMasking()` which need not return any value.
1240	Try also adding getInductionVars() Phi's, and the values that feed them across the backedge, to the set of live outs allowed by fold tail?
1260	Should an "analysis" message be reported instead of a "failure" one?
1265	Can dump this note regardless of ReportFailure.
1269–1274	In order (for `canFoldTailByMasking()`) to only check if tail can be folded, this call to `blockCanBePredicated()` should be replaced with one that does not have side-effects.
1282	ditto.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4972	Fold this under the following switch, possibly falling through to the CM_ScalarEpilogueAllowed case?
5011	They (masked load/stores) are needed, but only in blocks that were originally conditional, rather than all blocks of the loop.
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll
34	IND_END pre-computes the value of %incdec.ptr.
llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll
33	ditto.

Thanks for elaborating and explaining, and yes, got it about the live-out values and precomputing them. Other than the live-out values, I think we have an additional challenge: the target hook TTI->preferPredicateOverEpilogue can reject tail-folding, because of target-dependent restrictions, for loops that can be vectorised with a scalar remainder. I think that's why we also need a fallback. And judging from your other comments (please correct me if I'm wrong), it looks like you're happy with the general direction of this, so we will address your other comments.

Pierre-vh added inline comments. May 15 2020, 4:38 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1233	I've been thinking about this, but I'm not exactly sure it's the right thing to do. There will probably be a small amount of code duplication, and we'd be iterating over the loop's instructions twice. I'm not a fan of iterating over the loop's instructions more than needed, especially in this case, as I think it can be avoided.

Here is another idea: what if `prepareToFoldTailByMasking` becomes const and, instead of directly inserting elements into `MaskedOp` and `ConditionalAssumes`, works on temporary sets? The simplest way to implement this would be to take the sets by reference, like this: `bool prepareToFoldTailByMasking(SmallPtrSetImpl<Instruction *> &MaskedOps, SmallPtrSetImpl<Instruction *> &ConditionalAssumes, bool ReportFailure = true)`. Then the caller controls the sets, and can either give them to `LoopVectorizationLegality` or discard them.

Another idea would be to move this logic into another object, for instance something like `TailFoldingLegal`, which would:

Check the loop, like `prepareToFoldTailByMasking` currently does, and store the info into its own sets.
Have methods to either abandon tail-folding by masking, or apply it (= transfer the contents of its sets into `LoopVectorizationLegality`'s sets).

Then it could be used like this:

TailFoldingLegal TFL;
if tail folding is needed:
  TFL.checkLoop(TheLoop)
// etc.
if TFL.canFoldTailByMasking() && we need a tail:
  TFL.applyTailFoldingByMasking(Legal);

What do you think? Do you prefer a simpler approach, even if it means iterating over the loop's instructions many times, or something a bit more complete like what I suggested? Note that I don't know the cost of iterating over the loop. Maybe it's really cheap and I shouldn't worry about it (then just ignore everything I suggested above). In any case, I'll just implement your favorite solution, as you know much more about this topic than I do, so please tell me what you prefer. Thank you very much for your help.

Ayal added inline comments. May 17 2020, 11:24 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1233	Agreed, would be better to avoid repeated instruction scans. In order for prepareToFoldTail() to be an optional preference, it should cause no side-effects when returning false; i.e., it should add to MaskedOps and ConditionalAssumes iff returning true. This mostly affects blockCanBePredicated() - it would need to have MaskedOps and ConditionalAssumes passed-in as arguments. Sounds reasonable?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4973	May be simpler to exit early after calling prepareToFoldTail(), instead of checking FoldTailByMasking later.
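The "no side-effects when returning false" requirement can be sketched as a single scan that collects into a temporary set and commits only on success. Everything below is a simplified stand-in: `Inst`, `LegalityState` and this `prepareToFoldTail` are hypothetical and use standard containers rather than LLVM's `SmallPtrSet`; they are not the code from the patch.

```cpp
#include <set>
#include <string>
#include <vector>

// Stand-in for an IR instruction; only the fields the sketch needs.
struct Inst {
  std::string Name;
  bool NeedsMask;       // e.g. a load/store that must be masked under fold-tail
  bool CanBePredicated; // false models an instruction that blocks tail folding
};

// Hypothetical holder for the persistent legality state.
struct LegalityState {
  std::set<std::string> MaskedOp;
};

// Scans the loop once. On failure the persistent state is left untouched;
// on success the temporary set is merged into it.
inline bool prepareToFoldTail(const std::vector<Inst> &Loop,
                              LegalityState &State) {
  std::set<std::string> TmpMaskedOp;
  for (const Inst &I : Loop) {
    if (!I.CanBePredicated)
      return false; // TmpMaskedOp is discarded, State is unchanged.
    if (I.NeedsMask)
      TmpMaskedOp.insert(I.Name);
  }
  State.MaskedOp.insert(TmpMaskedOp.begin(), TmpMaskedOp.end());
  return true;
}
```

The design choice is the one discussed above: the caller never has to undo anything, because nothing is recorded unless the whole loop passes the check.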

Pierre-vh marked an inline comment as done. May 18 2020, 1:20 AM

Pierre-vh added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1233	> In order for prepareToFoldTail() to be an optional preference, it should cause no side-effects when returning false; i.e., it should add to MaskedOps and ConditionalAssumes iff returning true. This mostly affects blockCanBePredicated() - it would need to have MaskedOps and ConditionalAssumes passed-in as arguments. Sounds reasonable?

This is a good idea, but there is one issue left: when `ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate` and the loop has no tail (`TC > 0 && TC % MaxVF == 0`), we have to abandon tail-folding (line 5029 in LoopVectorize.cpp), but if we do things that way, there's no way to abandon tail-folding at that stage.

There are 2 solutions that I can think of:

1) Refactor `computeMaxVF` so it calculates the MaxVF earlier. That way, we can tell whether the loop has a tail before calling `prepareToFoldTailByMasking` (= we only call it if there's a tail). Unfortunately, I tried moving the call to `computeFeasibleMaxVF` earlier in the function, but it triggered the assertion at line 5000 (in `if (!useMaskedInterleavedAccesses(TTI))`), so it doesn't look like a trivial fix.
2) Make `prepareToFoldTailByMasking` return some kind of result object (containing both the `MaskedOp` and `ConditionalAssumes` sets), which can either be discarded or applied to `LoopVectorizationLegality`. This is similar to what I suggested in my previous comment, just a bit simplified. As I also suggested earlier, we can do without the result object: just pass in the sets to `prepareToFoldTailByMasking` and add new methods to `LoopVectorizationLegality` to add elements to the `MaskedOp`/`ConditionalAssumes` sets. This is the simplest fix, but maybe not the cleanest.

Either solution is fine for me, but I'll need some guidance on how to refactor `computeMaxVF` properly if you prefer the first solution.

Ayal added inline comments. May 18 2020, 1:05 PM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1233	Right. Instead of introducing an early call to prepareToFoldTail(), before knowing MaxVF and if there's a tail to fold or else the preparation should be undone, how about waiting until the original call to prepareToFoldTail() fails. Then, before bailing-out, check if the status is CM_ScalarEpilogueNotNeededUsePredicate and if so return MaxVF instead? Note BTW that D80085 should hopefully land soon.
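The control flow suggested here (check for a tail first, try to fold it, and only then decide between falling back and bailing out) might look roughly like the sketch below. The enum is a simplified stand-in whose first two values mirror the status names used in this thread; `NotAllowed`, `VFChoice` and `chooseMaxVF` are hypothetical, and a trip count of 0 is assumed to mean "unknown".

```cpp
#include <cassert>
#include <cstdint>

enum class EpilogueStatus {
  Allowed,               // stands in for CM_ScalarEpilogueAllowed
  NotNeededUsePredicate, // stands in for CM_ScalarEpilogueNotNeededUsePredicate
  NotAllowed             // hypothetical: a scalar epilogue must not be emitted
};

struct VFChoice {
  std::uint64_t VF; // 0 means "do not vectorize"
  bool FoldTail;    // true when the tail is folded by masking
};

inline VFChoice chooseMaxVF(EpilogueStatus Status, std::uint64_t TC,
                            std::uint64_t MaxVF, bool CanFoldTail) {
  assert(MaxVF != 0 && "MaxVF must be non-zero");
  if (Status == EpilogueStatus::Allowed)
    return {MaxVF, false}; // ordinary vector loop plus scalar epilogue
  if (TC > 0 && TC % MaxVF == 0)
    return {MaxVF, false}; // no tail: nothing to fold, no epilogue needed
  if (CanFoldTail)
    return {MaxVF, true}; // tail folded by masking
  if (Status == EpilogueStatus::NotNeededUsePredicate)
    return {MaxVF, false}; // folding was only a preference: fall back
  return {0, false}; // folding was required and failed: bail out
}
```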

Changing implementation of the patch following discussion
Removed the ReportFailure argument of prepareToFoldTailByMasking. I don't think it's useful anymore, but feedback is welcome. (The only thing that annoys me is that we now print "loop not vectorized" even when we'll fall back to a scalar epilogue)
Added a test that makes use of the attribute that enables tail-folding
Simplified tests

In D79783#2043479, @Pierre-vh wrote:

> Changing implementation of the patch following discussion
>
> Removed the ReportFailure argument of prepareToFoldTailByMasking. I don't think it's useful anymore, but feedback is welcome. (The only thing that annoys me is that we now print "loop not vectorized" even when we'll fall back to a scalar epilogue)

Right, as commented earlier, an "analysis" message should be reported when fold-tail doesn't work, instead of a "failure" one.

> Added a test that makes use of the attribute that enables tail-folding
>
> Simplified tests

Would be good to also preserve the existing "force-vector-tail-folding" behavior of "fold or do not vectorize", in addition to introducing the new "prefer-vector-tail-folding" behavior.

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
379	(Would have been good to set defaults, but using nullptr's seems cumbersome.)
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1264	Rename e.g. with Tmp or FoldTail prefix to distinguish from the non fold-tail sets?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5027	Why change ScalarEpilogueStatus?
llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll
5	"TP is forced" >> "tail folding is preferred" "tail-predicated" >> "tail folding". If "tail-predication" is preferred, it should replace "tail folding" throughout. But a tail can be predicated w/o folding it, so FoldTail or in full FoldTailByMasking seems more accurate?
10	"forces" >> "specifies fold-tail preference via metadata"

Addressed comments (see items marked as "done")
Changed "reportVectorizationFailure" calls in "prepareTailFoldingByMasking" into simple debug prints.

> Would be good to also preserve the existing "force-vector-tail-folding" behavior of "fold or do not vectorize", in addition to introducing the new "prefer-vector-tail-folding" behavior.

What is your preferred solution for this? Here are some ideas:

Add a new command-line switch that acts like the old -prefer-predicate-over-epilog, something like -force-predicate-over-epilog
Add an additional switch to use with -prefer-predicate-over-epilog to disable the fallback behaviour, such as -disable-scalar-epilog-fallback
Keep the old -prefer-predicate-over-epilog behaviour and only fallback when TTI requested tail-folding

Perhaps either (the first option of) a new command-line switch e.g. -force-vector-tail-folding, or a value added to -prefer-predicate-over-epilog=x indicating the "strength" of the preference.
Adding a matching pragma/metadata would also be nice.
Better avoid adding a new TTI hook, which will not be exercised by any in-tree target, right?

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1260	Would be good to also have an ORE report similar to the original one, except as an analysis rather than failure.
1264	Can add a comment what these sets are for.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5027	Why change ScalarEpilogueStatus?

Pierre-vh added inline comments. May 20 2020, 11:39 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5027	(Sorry, forgot to post this comment) It seems to be needed: my tests don't pass unless I change it (they have conflicting ASM for both run lines, and the content is different).

Ayal added inline comments. May 20 2020, 1:21 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5027	Add `-check-prefix` to the FileCheck of one of the runs? https://llvm.org/docs/CommandGuide/FileCheck.html#cmdoption-filecheck-check-prefix

Pierre-vh marked an inline comment as done. May 21 2020, 12:37 AM

Pierre-vh added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

5027

This is what's added when I don't change ScalarEpilogueStatus, and it's set to ScalarEpilogueNotNeededUsePredicate:

; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[OFFSET_IDX]], i32 0
; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
; CHECK-NEXT:    [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 -1, i32 -2, i32 -3>

It's because of this in InnerLoopVectorizer::widenIntOrFpInduction in LoopVectorize.cpp:1920

// All IV users are scalar instructions, so only emit a scalar IV, not a
// vectorised IV. Except when we tail-fold, then the splat IV feeds the
// predicate used by the masked loads/stores.
if (!Cost->isScalarEpilogueAllowed()) // ScalarEpilogueStatus == CM_ScalarEpilogueAllowed
  CreateSplatIV(ScalarIV, Step);

I am guessing that this should be using Cost->foldTailByMasking() instead, which returns the FoldTailByMasking boolean instead of looking at ScalarEpilogueStatus.

Pierre-vh marked an inline comment as done. May 21 2020, 1:18 AM

Pierre-vh added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5027	After looking a bit more into it: changing `!Cost->isScalarEpilogueAllowed()` to `Cost->foldTailByMasking()` does not solve the issue entirely; it causes multiple test failures. The `INDUCTION` variable is unused, so this bit of code definitely shouldn't be there. The solution is either to change the `ScalarEpilogueStatus`, or to find out why `Cost->foldTailByMasking()` doesn't work as expected. The tests that fail are:

Failing Tests (3):
LLVM :: Transforms/LoopVectorize/X86/constant-fold.ll
LLVM :: Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll
LLVM :: Transforms/LoopVectorize/pr44488-predication.ll

All of them seem to expect this `INDUCTION` variable, one way or another. Interestingly, however, the tests don't use the `INDUCTION` instruction at all, so maybe it's a bug? (Every time there's a failure in one of those tests, it's in a place where those 3 lines are expected but not used at all.)

SjoerdMeijer mentioned this in D80316: [HardwareLoops] Intrinsic LangRef descriptions. May 22 2020, 11:59 AM

Ayal added inline comments. May 24 2020, 3:53 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5027	So now we know why ScalarEpilogueStatus needed to change... @SjoerdMeijer tried to remove that presumably redundant call to CreateSplatIV() in D78911, recommitting it later under ScalarEpilogueStatus in rG9529597cf4562c64034943dacc29a4ff4fe93d86. We're still trying to figure out how to remove it: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200518/784362.html

> I am guessing that this should be using Cost->foldTailByMasking() instead, which returns the FoldTailByMasking boolean instead of looking at ScalarEpilogueStatus.

Guess so too, along with the needed test fixes. Sjoerd, what do you think?

Hi Ayal, thanks again for looking at this. I am taking over this patch as Pierre has finished his rotation in our team. This patch is part of our tail-predication story, and is the highest on my list. I will pick this up tomorrow as today is a bank holiday here.

Hi @Ayal , with a bit of delay, I am almost ready to pick this up. Regarding your comments on rG9529597cf4562c64034943dacc29a4ff4fe93d86, I've copied them here below as I thought that would be more convenient, but let me know if you would like to move it to elsewhere. Anyway, I am now going to look into your comments, these ones:

This omission of Primary Induction Phi's from Scalars in collectLoopScalars(), under foldTail:
// An induction variable will remain scalar if all users of the induction
// variable and induction variable update remain scalar.
...
+    // If tail-folding is applied, the primary induction variable will be used
+    // to feed a vector compare.
+    if (Ind == Legal->getPrimaryInduction() && foldTailByMasking())
+      continue;
should have led widenIntOrFpInduction() to take the following early-exit for Phi's that are not in Scalars:
// Try to create a new independent vector induction variable. If we can't
// create the phi node, we will splat the scalar induction variable in each
// loop iteration.
if (!shouldScalarizeInstruction(EntryVal)) {
rather than reaching CreateSplatIV() at the end.

Perhaps the regressions occur when EntryVal is the Trunc rather than the IV Phi?
If so, one remedy may be to also omit Trunc from Scalars; another may be to check if (!shouldScalarizeInstruction(IV)) instead of, or in addition to, if (!shouldScalarizeInstruction(EntryVal)).

Is there a short reproducer?

Yep, function @foo_optsize() in test/Transforms/LoopVectorize/X86/optsize.ll is a minimal example. It will be miscompiled with these 2 lines commented out that I added:

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index ec756671ea6..d39795627ec 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1917,8 +1917,8 @@ void InnerLoopVectorizer::widenIntOrFpInduction(PHINode *IV, TruncInst *Trunc) {
   // vectorised IV. Except when we tail-fold, then the splat IV feeds the
   // predicate used by the masked loads/stores.
   Value *ScalarIV = CreateScalarIV(Step);
-  if (!Cost->isScalarEpilogueAllowed())
-    CreateSplatIV(ScalarIV, Step);
+  //if (!Cost->isScalarEpilogueAllowed())
+  //  CreateSplatIV(ScalarIV, Step);
   buildScalarSteps(ScalarIV, Step, EntryVal, ID);
 }

The miscompilation is that we don't emit:

%induction = add <64 x i32> %broadcast.splat, <i32 0, i32 1, ...

and the icmp is incorrectly performed against:

%2 = icmp ule <64 x i32> %broadcast.splat, <i32 202, i32 202, ...

which should have been:

%2 = icmp ule <64 x i32> %induction, <i32 202, i32 202, ...
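To see why comparing the splat instead of the per-lane induction is a miscompile, the mask computation can be modelled on plain integers. This is a sketch only: `laneMask`, `splatOnlyMask` and `activeLanes` are hypothetical helpers, and reading the constant 202 from the IR above as the last valid index is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Correct predicate: lane L of the vector iteration starting at scalar index
// Base is active iff Base + L <= Bound (the "%induction" compare).
inline std::vector<bool> laneMask(std::uint32_t Base, std::uint32_t VF,
                                  std::uint32_t Bound) {
  std::vector<bool> Mask(VF);
  for (std::uint32_t Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = Base + Lane <= Bound;
  return Mask;
}

// Models the miscompile: every lane compares the same splatted Base value,
// so the per-lane offset <0, 1, 2, ...> is lost.
inline std::vector<bool> splatOnlyMask(std::uint32_t Base, std::uint32_t VF,
                                       std::uint32_t Bound) {
  return std::vector<bool>(VF, Base <= Bound);
}

// Counts the lanes that are enabled in a mask.
inline std::uint32_t activeLanes(const std::vector<bool> &Mask) {
  std::uint32_t Count = 0;
  for (bool Active : Mask)
    Count += Active ? 1 : 0;
  return Count;
}
```

In the last vector iteration of a VF=64 loop (Base=192, Bound=202), the correct mask enables 11 lanes (indices 192 to 202), whereas the splat-only compare enables all 64, so the masked loads/stores would touch elements past the end.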

it indeed doesn't take the early exit:

// Try to create a new independent vector induction variable. If we can't
// create the phi node, we will splat the scalar induction variable in each              
// loop iteration.
if (!shouldScalarizeInstruction(EntryVal)) {

I need to look again if this makes sense or not.

Caught up on all previous messages, and I believe what is missing from this patch is control over the choice between the old and the new behaviour. Rather than introducing yet another option, I liked @Ayal's suggestion:

value added to -prefer-predicate-over-epilog=x indicating the "strength" of the preference.

for which I will introduce these values/options:

x = 1: vectorize with a scalar epilogue if we can't tail-fold,
x = 2: don't vectorize if we can't tail-fold.

I think this models the "strength" of the vector tail-folding preference.
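The two proposed values can be read as a small decision table. A minimal sketch, assuming the numeric encoding from this comment; `FoldStrength`, `Outcome` and `resolve` are hypothetical names, not identifiers from the patch:

```cpp
// Hypothetical encoding of the proposed -prefer-predicate-over-epilog=x values.
enum class FoldStrength {
  FoldElseScalarEpilogue = 1, // x = 1: prefer tail-folding, fall back
  FoldOrDontVectorize = 2     // x = 2: tail-fold or do not vectorize at all
};

enum class Outcome { TailFolded, ScalarEpilogue, NotVectorized };

// Maps the preference "strength" and the legality result to what happens
// to the loop.
inline Outcome resolve(FoldStrength Strength, bool CanFoldTail) {
  if (CanFoldTail)
    return Outcome::TailFolded;
  return Strength == FoldStrength::FoldElseScalarEpilogue
             ? Outcome::ScalarEpilogue
             : Outcome::NotVectorized;
}
```

The two values only differ when tail-folding fails; when it succeeds, both produce a tail-folded loop.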

Changed -prefer-predicate-over-epilog to take a numeric value, corresponding to the preferred predication and fallback strategy.

Herald added a subscriber: rkruppe. Jun 29 2020, 3:31 AM

Changed option -prefer-predicate-over-epilog to accept a more descriptive value rather than a cryptic numeric value.

Also a friendly ping. :-)

SjoerdMeijer added a reviewer: fhahn. Jul 15 2020, 3:21 AM

Is the idea to turn this option on by default for MVE? Maybe by changing the preferPredicateOverEpilogue call?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
189	-> there are different ...
198	Epilog -> Epilogue?
208	Maybe call this predicate-else-scalar-epilog.
209	-> "prefer tail-folding, create scalar epilogue if tail folding fails" ?
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll
2 ↗	(On Diff #278125)	What happens if we change this to predicate-vectorize? Would that be a better option for MVE? (or is that not what this test is really aiming to test?)

Thanks for looking at this Dave. This addresses:

the minor comments, the rewording of comments/option,
restores 2 new regression tests that I forgot to upload in the previous diff.

Is the idea to turn this option on by default for MVE? Maybe by changing the preferPredicateOverEpilogue call?

So that was definitely the idea, that was the motivation of this prep work here.
I was thinking of addressing that in a follow up patch, because that is a different functional change, and then we can also discuss how to enable this. I.e., preferPredicateOverEpilogue would be one way to do it, another way is to flip the switch and default to falling back to vectorisation, which seems to me what most targets would want actually, but perhaps I am wrong and should get some performance numbers for this.

@SjoerdMeijer I ran into a potential use for this patch in the reduction code I was working on. I think this looks good in general; I just had a question about the fallback strategy.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
201–202	We should probably pick a single spelling of epilog and stick to it. At least in the same option! Epilog is nice because it's shorter, but Epilogue seems to be the more common choice? But feel free to ignore too if you like.
5025	So if I understand this logic properly, the loop hint or the targethook now both mean "try tailpredicate but fall back to vectorize if you can't". The PreferPredicateOverEpilogue option can mean "only tail predicate" or "try tailpredicate, fallback if needed" depending on the choice of the option? (I had not originally realized that was already how this worked for the target hook.) Do we want the pragma and targethook to always work like that? Or is it worth giving the target hook a choice between the two options. I suspect for MVE we would always want to choose the "with fallback" option, so maybe it's not worth adding? If we did, we could possibly add another ScalarEpilogueStatus value, to not have to recheck PreferPredicateOverEpilogue or the target hook here.

Thanks for looking Dave.
I am addressing your remarks, also find some replies inline.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
201–202	No, good point! I honestly don't know what it should be and have no opinion on it, but I will go for the common case epilogue.
5025	> So if I understand this logic properly, the loop hint or the targethook now both mean "try tailpredicate but fall back to vectorize if you can't". The PreferPredicateOverEpilogue option can mean "only tail predicate" or "try tailpredicate, fallback if needed" depending on the choice of the option? (I had not originally realized that was already how this worked for the target hook.)

Yep, that is exactly right I think.

> Do we want the pragma and targethook to always work like that? Or is it worth giving the target hook a choice between the two options. I suspect for MVE we would always want to choose the "with fallback" option, so maybe it's not worth adding?

My first and perhaps naive thoughts were that we always want to vectorise, so set that as the fallback, because I thought this was guaranteed to be a win. But then some benchmarks proved me wrong. From a first look, however, this is just exposing some inefficiencies elsewhere. So, I still think it is the right choice to make in the end. The way I look at this patch is that this is an enabler, and if we want to flip the switch by default somehow, then that's probably best done as a follow up of this groundwork.

If you do something about the spelling of epilog, then this looks good to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5025	I think this probably does what we want for MVE, and seems like a sensible default.

This revision is now accepted and ready to land. · Aug 25 2020, 2:19 AM

Thanks Dave, I will fix the spelling before committing.

Closed by commit rGbda8fbe2d2af: [LV] Fallback strategies if tail-folding fails (authored by SjoerdMeijer). · Aug 26 2020, 8:58 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rGbda8fbe2d2af: [LV] Fallback strategies if tail-folding fails.

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h	9 lines
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	19 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	14 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll	78 lines
llvm/test/Transforms/LoopVectorize/ARM/use-scalar-epilogue-if-tp-fails.ll	154 lines

Diff 264822

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

[... 223 lines not shown ...]	public:
/// Temporarily taking UseVPlanNativePath parameter. If true, take		/// Temporarily taking UseVPlanNativePath parameter. If true, take
/// the new code path being implemented for outer loop vectorization		/// the new code path being implemented for outer loop vectorization
/// (should be functional for inner loop vectorization) based on VPlan.		/// (should be functional for inner loop vectorization) based on VPlan.
/// If false, good old LV code.		/// If false, good old LV code.
bool canVectorize(bool UseVPlanNativePath);		bool canVectorize(bool UseVPlanNativePath);

/// Return true if we can vectorize this loop while folding its tail by		/// Return true if we can vectorize this loop while folding its tail by
/// masking, and mark all respective loads/stores for masking.		/// masking, and mark all respective loads/stores for masking.
		/// This object's state is only modified iff this function returns true.
bool prepareToFoldTailByMasking();		bool prepareToFoldTailByMasking();

/// Returns the primary induction variable.		/// Returns the primary induction variable.
PHINode *getPrimaryInduction() { return PrimaryInduction; }		PHINode *getPrimaryInduction() { return PrimaryInduction; }

/// Returns the reduction variables found in the loop.		/// Returns the reduction variables found in the loop.
		Ayal (Not Done): But they need to be reset to reflect original conditional blocks, as set by `canVectorizeWithIfConvert()` when it initially called `blockCanBePredicated()`.
ReductionList &getReductionVars() { return Reductions; }		ReductionList &getReductionVars() { return Reductions; }

/// Returns the induction variables found in the loop.		/// Returns the induction variables found in the loop.
InductionList &getInductionVars() { return Inductions; }		InductionList &getInductionVars() { return Inductions; }

/// Return the first-order recurrences found in the loop.		/// Return the first-order recurrences found in the loop.
RecurrenceSet &getFirstOrderRecurrences() { return FirstOrderRecurrences; }		RecurrenceSet &getFirstOrderRecurrences() { return FirstOrderRecurrences; }

[... 118 lines not shown ...]	private:
/// executed, and record the loads/stores that require masking. If's that		/// executed, and record the loads/stores that require masking. If's that
/// guard loads can be ignored under "assume safety" unless \p PreserveGuards		/// guard loads can be ignored under "assume safety" unless \p PreserveGuards
/// is true. This can happen when we introduces guards for which the original		/// is true. This can happen when we introduces guards for which the original
/// "unguarded-loads are safe" assumption does not hold. For example, the		/// "unguarded-loads are safe" assumption does not hold. For example, the
/// vectorizer's fold-tail transformation changes the loop to execute beyond		/// vectorizer's fold-tail transformation changes the loop to execute beyond
/// its original trip-count, under a proper guard, which should be preserved.		/// its original trip-count, under a proper guard, which should be preserved.
/// \p SafePtrs is a list of addresses that are known to be legal and we know		/// \p SafePtrs is a list of addresses that are known to be legal and we know
/// that we can read from them without segfault.		/// that we can read from them without segfault.
		/// \p MaskedOp is a list of instructions that have to be transformed into
		/// calls to the appropriate masked intrinsic when the loop is vectorized.
		/// \p ConditionalAssumes is a list of assume instructions in predicated
		/// blocks that must be dropped if the CFG gets flattened.
bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs,		bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs,
bool PreserveGuards = false);		SmallPtrSetImpl<const Instruction *> &MaskedOp,
		SmallPtrSetImpl<Instruction *> &ConditionalAssumes,
		Ayal (Not Done): (Would have been good to set defaults, but using nullptr's seems cumbersome.)
		bool PreserveGuards = false) const;

/// Updates the vectorization state by adding \p Phi to the inductions list.		/// Updates the vectorization state by adding \p Phi to the inductions list.
/// This can set \p Phi as the main induction of the loop if \p Phi is a		/// This can set \p Phi as the main induction of the loop if \p Phi is a
/// better choice for the main induction than the existing one.		/// better choice for the main induction than the existing one.
void addInductionPhi(PHINode *Phi, const InductionDescriptor &ID,		void addInductionPhi(PHINode *Phi, const InductionDescriptor &ID,
SmallPtrSetImpl<Value *> &AllowedExit);		SmallPtrSetImpl<Value *> &AllowedExit);

/// If an access has a symbolic strides, this maps the pointer value to		/// If an access has a symbolic strides, this maps the pointer value to
[... 104 lines not shown ...]

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

[... 905 lines not shown ...]	bool LoopVectorizationLegality::isFirstOrderRecurrence(const PHINode *Phi) {
return FirstOrderRecurrences.count(Phi);		return FirstOrderRecurrences.count(Phi);
}		}

bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) {		bool LoopVectorizationLegality::blockNeedsPredication(BasicBlock *BB) {
return LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT);		return LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT);
}		}

bool LoopVectorizationLegality::blockCanBePredicated(		bool LoopVectorizationLegality::blockCanBePredicated(
BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs, bool PreserveGuards) {		BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs,
		SmallPtrSetImpl<const Instruction *> &MaskedOp,
		SmallPtrSetImpl<Instruction *> &ConditionalAssumes,
		bool PreserveGuards) const {
const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel();		const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel();

for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
// Check that we don't have a constant expression that can trap as operand.		// Check that we don't have a constant expression that can trap as operand.
for (Value *Operand : I.operands()) {		for (Value *Operand : I.operands()) {
if (auto *C = dyn_cast<Constant>(Operand))		if (auto *C = dyn_cast<Constant>(Operand))
if (C->canTrap())		if (C->canTrap())
return false;		return false;
[... 90 lines not shown ...]	if (!isa<BranchInst>(BB->getTerminator())) {
"loop contains a switch statement",		"loop contains a switch statement",
"LoopContainsSwitch", ORE, TheLoop,		"LoopContainsSwitch", ORE, TheLoop,
BB->getTerminator());		BB->getTerminator());
return false;		return false;
}		}

// We must be able to predicate all blocks that need to be predicated.		// We must be able to predicate all blocks that need to be predicated.
if (blockNeedsPredication(BB)) {		if (blockNeedsPredication(BB)) {
if (!blockCanBePredicated(BB, SafePointers)) {		if (!blockCanBePredicated(BB, SafePointers, MaskedOp,
		ConditionalAssumes)) {
reportVectorizationFailure(		reportVectorizationFailure(
"Control flow cannot be substituted for a select",		"Control flow cannot be substituted for a select",
"control flow cannot be substituted for a select",		"control flow cannot be substituted for a select",
"NoCFGForSelect", ORE, TheLoop,		"NoCFGForSelect", ORE, TheLoop,
BB->getTerminator());		BB->getTerminator());
return false;		return false;
}		}
} else if (BB != Header && !canIfConvertPHINodes(BB)) {		} else if (BB != Header && !canIfConvertPHINodes(BB)) {
[... 191 lines not shown ...]	bool LoopVectorizationLegality::canVectorize(bool UseVPlanNativePath) {

// Okay! We've done all the tests. If any have failed, return false. Otherwise		// Okay! We've done all the tests. If any have failed, return false. Otherwise
// we can vectorize, and at this point we don't have any other mem analysis		// we can vectorize, and at this point we don't have any other mem analysis
// which may limit our maximum vectorization factor, so just return true with		// which may limit our maximum vectorization factor, so just return true with
// no restrictions.		// no restrictions.
return Result;		return Result;
}		}

bool LoopVectorizationLegality::prepareToFoldTailByMasking() {		bool LoopVectorizationLegality::prepareToFoldTailByMasking() {
		SjoerdMeijer (Author, Done): nit: how about SoftFailure -> ReportFailure? (and that would mean flipping the default value)

		Ayal (Not Done): Rather than if/how to report failure, probably better to clone this method into a `canFoldTailByMasking()` (as it was originally called iirc) const method, w/o having any side-effects that may later need to be abandoned, and a `prepareToFoldTailByMasking()` which need not return any value.

		Pierre-vh (Not Done): I've been thinking about this, but I'm not exactly sure it's the right thing to do. There will probably be a small amount of code duplication and we'd be iterating over the loop's instructions twice. I'm not a fan of iterating over the loop's instructions more than needed, especially in this case as I think it can be avoided.

		Here is another idea: What if `prepareToFoldTailByMasking` becomes const, and instead of directly inserting elements into `MaskedOp` and `ConditionalAssumes`, it works on a temporary set. The simplest way to implement this would be to take sets by reference, like this: `bool prepareToFoldTailByMasking(SmallPtrSetImpl<Instruction> &MaskedOps, SmallPtrSetImpl<Instruction> ConditionalAssumes, bool ReportFailure = true)`. Then the caller controls the sets, and can either give them to `LoopVectorizationLegal`, or discard them.

		Another idea would be to move this logic in another object, for instance, something like `TailFoldingLegal`, which would:
		- Check the loop, like `prepareToFoldTailByMasking` currently does, and store the info into its own sets
		- Have methods to either abandon tail-folding by masking, or apply it (= transfer the contents of its sets into `LoopVectorizationLegal`'s sets)

		Then it could be used like this:
		    TailFoldingLegal TFL;
		    if tail folding is needed:
		        TFL.checkLoop(TheLoop)
		    // etc.
		    if TFL.canFoldTailByMasking() && we need a tail:
		        TFL.applyTailFoldingByMasking(Legal);

		What do you think? Do you prefer a simpler approach, even if it means iterating over the loop's instructions many times, or something a bit more complete like what I suggested?

		Note that I don't know the cost of iterating over the loop. Maybe it's really cheap and I shouldn't worry about it (then just ignore everything I suggested above).

		In any case, I'll just implement your favorite solution, as you know much more about this topic than I do, so please tell me what you prefer. Thank you very much for your help.

		Ayal (Not Done): Agreed, would be better to avoid repeated instruction scans. In order for prepareToFoldTail() to be an optional preference, it should cause no side-effects when returning false; i.e., it should add to MaskedOps and ConditionalAssumes iff returning true. This mostly affects blockCanBePredicated() - it would need to have MaskedOps and ConditionalAssumes passed-in as arguments. Sounds reasonable?

		Pierre-vh (Done):
		> In order for prepareToFoldTail() to be an optional preference, it should cause no side-effects when returning false; i.e., it should add to MaskedOps and ConditionalAssumes iff returning true. This mostly affects blockCanBePredicated() - it would need to have MaskedOps and ConditionalAssumes passed-in as arguments. Sounds reasonable?

		This is a good idea, but there is one issue left: when `ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate` and the loop has no tail (`TC > 0 && TC % MaxVF == 0`), we have to abandon tail-folding (Line 5029 in LoopVectorize.cpp), but if we do things that way, there's no way to abandon tail-folding at that stage.

		There are 2 solutions that I can think of:
		- Refactor `computeMaxVF` so it calculates the MaxVF earlier. That way, we can tell if the loop has a tail or not before calling `prepareToFoldTailByMasking` (= we only call it if there's a tail). Unfortunately, I tried moving the call to `computeFeasibleMaxVF` earlier in the function, but it triggered the assertion at line 5000 (in `if (!useMaskedInterleavedAccesses(TTI))`), so it doesn't look like a trivial fix.
		- Make `prepareToFoldTailByMasking` return some kind of result object (that'd contain both `MaskedOp`/`ConditionalAssumes` sets), which can either be discarded or applied to `LoopVectorizationLegal`. This is similar to what I suggested in my previous comment, just a bit simplified.

		As I also suggested earlier, we can do without the result object - just pass-in the sets to `prepareToFoldTailByMasking` and add new methods to `LoopVectorizationLegal` to add elements to the `MaskedOp`/`ConditionalAssumes` sets. This is the simplest fix, but maybe not the cleanest.

		Either solution is fine for me, but I'll need some guidance on how to refactor `computeMaxVF` properly if you prefer the first solution.

		Ayal (Done): Right. Instead of introducing an early call to prepareToFoldTail(), before knowing MaxVF and if there's a tail to fold or else the preparation should be undone, how about waiting until the original call to prepareToFoldTail() fails. Then, before bailing-out, check if the status is CM_ScalarEpilogueNotNeededUsePredicate and if so return MaxVF instead? Note BTW that D80085 should hopefully land soon.

LLVM_DEBUG(dbgs() << "LV: checking if tail can be folded by masking.\n");		LLVM_DEBUG(dbgs() << "LV: checking if tail can be folded by masking.\n");

SmallPtrSet<const Value *, 8> ReductionLiveOuts;		SmallPtrSet<const Value *, 8> ReductionLiveOuts;

for (auto &Reduction : getReductionVars())		for (auto &Reduction : getReductionVars())
ReductionLiveOuts.insert(Reduction.second.getLoopExitInstr());		ReductionLiveOuts.insert(Reduction.second.getLoopExitInstr());
		Ayal (Not Done): Try also adding getInductionVars() Phi's, and the values that feed them across the backedge, to the set of live outs allowed by fold tail?

// TODO: handle non-reduction outside users when tail is folded by masking.		// TODO: handle non-reduction outside users when tail is folded by masking.
for (auto *AE : AllowedExit) {		for (auto *AE : AllowedExit) {
// Check that all users of allowed exit values are inside the loop or		// Check that all users of allowed exit values are inside the loop or
// are the live-out of a reduction.		// are the live-out of a reduction.
if (ReductionLiveOuts.count(AE))		if (ReductionLiveOuts.count(AE))
continue;		continue;
for (User *U : AE->users()) {		for (User *U : AE->users()) {
Instruction *UI = cast<Instruction>(U);		Instruction *UI = cast<Instruction>(U);
if (TheLoop->contains(UI))		if (TheLoop->contains(UI))
continue;		continue;
reportVectorizationFailure(		reportVectorizationFailure(
"Cannot fold tail by masking, loop has an outside user for",		"Cannot fold tail by masking, loop has an outside user for",
"Cannot fold tail by masking in the presence of live outs.",		"Cannot fold tail by masking in the presence of live outs.",
"LiveOutFoldingTailByMasking", ORE, TheLoop, UI);		"LiveOutFoldingTailByMasking", ORE, TheLoop, UI);
return false;		return false;
}		}
}		}

// The list of pointers that we can safely read and write to remains empty.		// The list of pointers that we can safely read and write to remains empty.
		Ayal (Not Done): Should an "analysis" message be reported instead of a "failure" one?
		Ayal (Not Done): Would be good to also have an ORE report similar to the original one, except as an analysis rather than failure.
SmallPtrSet<Value *, 8> SafePointers;		SmallPtrSet<Value *, 8> SafePointers;

		SmallPtrSet<const Instruction *, 8> MaskedOp;
		SmallPtrSet<Instruction *, 8> ConditionalAssumes;
		Ayal (Done): Rename e.g. with Tmp or FoldTail prefix to distinguish from the non fold-tail sets?
		Ayal (Not Done): Can add a comment what these sets are for.

		Ayal (Not Done): Can dump this note regardless of ReportFailure.
// Check and mark all blocks for predication, including those that ordinarily		// Check and mark all blocks for predication, including those that ordinarily
// do not need predication such as the header block.		// do not need predication such as the header block.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
if (!blockCanBePredicated(BB, SafePointers, /* MaskAllLoads= */ true)) {		if (!blockCanBePredicated(BB, SafePointers, MaskedOp, ConditionalAssumes,
		/* MaskAllLoads= */ true)) {
reportVectorizationFailure(		reportVectorizationFailure(
"Cannot fold tail by masking as required",		"Cannot fold tail by masking as required",
"control flow cannot be substituted for a select",		"control flow cannot be substituted for a select",
"NoCFGForSelect", ORE, TheLoop,		"NoCFGForSelect", ORE, TheLoop,
		Ayal (Not Done): In order (for `canFoldTailByMasking()`) to only check if tail can be folded, this call to `blockCanBePredicated()` should be replaced with one that does not have side-effects.
BB->getTerminator());		BB->getTerminator());
return false;		return false;
}		}
}		}

LLVM_DEBUG(dbgs() << "LV: can fold tail by masking.\n");		LLVM_DEBUG(dbgs() << "LV: can fold tail by masking.\n");

		this->MaskedOp.insert(MaskedOp.begin(), MaskedOp.end());
		Ayal (Not Done): ditto.
		this->ConditionalAssumes.insert(ConditionalAssumes.begin(),
		ConditionalAssumes.end());

return true;		return true;
}		}

} // namespace llvm		} // namespace llvm
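
The change to `prepareToFoldTailByMasking()` above collects its results in local sets and copies them into the member sets only after every block has passed the check, so the object is left untouched on failure. A minimal sketch of that pattern follows. It assumes hypothetical stand-in types (`Block`, integer instruction ids, a `Legality` class) instead of the LLVM ones, and it is not the patch's implementation.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for a basic block: the ids of the instructions that
// would need masking, and whether the block can be predicated at all.
struct Block {
  std::vector<int> MaskedCandidates;
  bool Predicable;
};

// Hypothetical stand-in for the legality object.
class Legality {
public:
  // Adds to MaskedOp only if every block can be predicated. On failure the
  // temporary set is discarded and the object's state is unchanged.
  bool prepareToFoldTail(const std::vector<Block> &Blocks) {
    std::set<int> TmpMaskedOp;
    for (const Block &B : Blocks) {
      if (!B.Predicable)
        return false;
      TmpMaskedOp.insert(B.MaskedCandidates.begin(), B.MaskedCandidates.end());
    }
    // All blocks passed: commit the temporary results.
    MaskedOp.insert(TmpMaskedOp.begin(), TmpMaskedOp.end());
    return true;
  }

  const std::set<int> &maskedOps() const { return MaskedOp; }

private:
  std::set<int> MaskedOp;
};
```

Because nothing is committed on failure, a caller can treat tail-folding as an optional preference and move on to another strategy without first undoing partial state.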

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[... 174 lines not shown ...]
static cl::opt<unsigned> TinyTripCountVectorThreshold(		static cl::opt<unsigned> TinyTripCountVectorThreshold(
"vectorizer-min-trip-count", cl::init(16), cl::Hidden,		"vectorizer-min-trip-count", cl::init(16), cl::Hidden,
cl::desc("Loops with a constant trip count that is smaller than this "		cl::desc("Loops with a constant trip count that is smaller than this "
"value are vectorized only if no scalar iteration overheads "		"value are vectorized only if no scalar iteration overheads "
"are incurred."));		"are incurred."));

// Indicates that an epilogue is undesired, predication is preferred.		// Indicates that an epilogue is undesired, predication is preferred.
// This means that the vectorizer will try to fold the loop-tail (epilogue)		// This means that the vectorizer will try to fold the loop-tail (epilogue)
// into the loop and predicate the loop body accordingly.		// into the loop and predicate the loop body accordingly. If that fails, the
		// vectorizer will fallback to a scalar epilogue.
static cl::opt<bool> PreferPredicateOverEpilog(		static cl::opt<bool> PreferPredicateOverEpilog(
"prefer-predicate-over-epilog", cl::init(false), cl::Hidden,		"prefer-predicate-over-epilog", cl::init(false), cl::Hidden,
cl::desc("Indicate that an epilogue is undesired, predication should be "		cl::desc("Indicate that an epilogue is undesired, predication should be "
"used instead."));		"used instead when it is possible"));

		dmgreen (Not Done): -> there are different ...
static cl::opt<bool> MaximizeBandwidth(		static cl::opt<bool> MaximizeBandwidth(
"vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,		"vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
cl::desc("Maximize bandwidth when selecting vectorization factor which "		cl::desc("Maximize bandwidth when selecting vectorization factor which "
"will be determined by the smallest type in loop."));		"will be determined by the smallest type in loop."));

static cl::opt<bool> EnableInterleavedMemAccesses(		static cl::opt<bool> EnableInterleavedMemAccesses(
"enable-interleaved-mem-accesses", cl::init(false), cl::Hidden,		"enable-interleaved-mem-accesses", cl::init(false), cl::Hidden,
cl::desc("Enable vectorization on interleaved memory accesses in a loop"));		cl::desc("Enable vectorization on interleaved memory accesses in a loop"));

		dmgreen (Not Done): Epilog -> Epilogue?
/// An interleave-group may need masking if it resides in a block that needs		/// An interleave-group may need masking if it resides in a block that needs
/// predication, or in order to mask away gaps.		/// predication, or in order to mask away gaps.
static cl::opt<bool> EnableMaskedInterleavedMemAccesses(		static cl::opt<bool> EnableMaskedInterleavedMemAccesses(
"enable-masked-interleaved-mem-accesses", cl::init(false), cl::Hidden,		"enable-masked-interleaved-mem-accesses", cl::init(false), cl::Hidden,
		dmgreen (Not Done): We should probably pick a single spelling of epilog and stick to it. At least in the same option! Epilog is nice because it's shorter, but Epilogue seems to be the more common choice? But feel free to ignore too if you like.
		SjoerdMeijer (Author, Done): No, good point! I honestly don't know what it should be and have no opinion on it, but I will go for the common case epilogue.
cl::desc("Enable vectorization on masked interleaved memory accesses in a loop"));		cl::desc("Enable vectorization on masked interleaved memory accesses in a loop"));

static cl::opt<unsigned> TinyTripCountInterleaveThreshold(		static cl::opt<unsigned> TinyTripCountInterleaveThreshold(
"tiny-trip-count-interleave-threshold", cl::init(128), cl::Hidden,		"tiny-trip-count-interleave-threshold", cl::init(128), cl::Hidden,
cl::desc("We don't interleave loops with a estimated constant trip count "		cl::desc("We don't interleave loops with a estimated constant trip count "
"below this number"));		"below this number"));
		dmgreen (Not Done): Maybe call this predicate-else-scalar-epilog.

		dmgreen (Not Done): -> "prefer tail-folding, create scalar epilogue if tail folding fails" ?
static cl::opt<unsigned> ForceTargetNumScalarRegs(		static cl::opt<unsigned> ForceTargetNumScalarRegs(
"force-target-num-scalar-regs", cl::init(0), cl::Hidden,		"force-target-num-scalar-regs", cl::init(0), cl::Hidden,
cl::desc("A flag that overrides the target's number of scalar registers."));		cl::desc("A flag that overrides the target's number of scalar registers."));

static cl::opt<unsigned> ForceTargetNumVectorRegs(		static cl::opt<unsigned> ForceTargetNumVectorRegs(
"force-target-num-vector-regs", cl::init(0), cl::Hidden,		"force-target-num-vector-regs", cl::init(0), cl::Hidden,
cl::desc("A flag that overrides the target's number of vector registers."));		cl::desc("A flag that overrides the target's number of vector registers."));

[... 4,742 lines not shown ...]	Optional<unsigned> LoopVectorizationCostModel::computeMaxVF(unsigned UserVF,
if (TC == 1) {		if (TC == 1) {
reportVectorizationFailure("Single iteration (non) loop",		reportVectorizationFailure("Single iteration (non) loop",
"loop trip count is one, irrelevant for vectorization",		"loop trip count is one, irrelevant for vectorization",
"SingleIterationLoop", ORE, TheLoop);		"SingleIterationLoop", ORE, TheLoop);
return None;		return None;
}		}

switch (ScalarEpilogueStatus) {		switch (ScalarEpilogueStatus) {
case CM_ScalarEpilogueAllowed:		case CM_ScalarEpilogueAllowed:
		SjoerdMeijer (Author, Done): nit: tail-predication -> tail-folding
return UserVF ? UserVF : computeFeasibleMaxVF(TC);		return UserVF ? UserVF : computeFeasibleMaxVF(TC);
case CM_ScalarEpilogueNotNeededUsePredicate:		case CM_ScalarEpilogueNotNeededUsePredicate:
LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "LV: vector predicate hint/switch found.\n"		dbgs() << "LV: vector predicate hint/switch found.\n"
		Ayal (Not Done): Fold this under the following switch, possibly falling through to the CM_ScalarEpilogueAllowed case?
<< "LV: Not allowing scalar epilogue, creating predicated "		<< "LV: Not allowing scalar epilogue, creating predicated "
		Ayal (Not Done): May be simpler to exit early after calling prepareToFoldTail(), instead of checking FoldTailByMasking later.
<< "vector loop.\n");		<< "vector loop.\n");
break;		break;
case CM_ScalarEpilogueNotAllowedLowTripLoop:		case CM_ScalarEpilogueNotAllowedLowTripLoop:
		SjoerdMeijer (Author, Done): nit: tail-predication -> tail-folding
// fallthrough as a special case of OptForSize		// fallthrough as a special case of OptForSize
case CM_ScalarEpilogueNotAllowedOptSize:		case CM_ScalarEpilogueNotAllowedOptSize:
if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedOptSize)		if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedOptSize)
LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");		dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");
else		else
LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to low trip "		LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to low trip "
<< "count.\n");		<< "count.\n");
[... 18 lines not shown ...]	Optional<unsigned> LoopVectorizationCostModel::computeMaxVF(unsigned UserVF,
}		}

unsigned MaxVF = UserVF ? UserVF : computeFeasibleMaxVF(TC);		unsigned MaxVF = UserVF ? UserVF : computeFeasibleMaxVF(TC);
unsigned MaxVFtimesIC = UserIC ? MaxVF * UserIC : MaxVF;		unsigned MaxVFtimesIC = UserIC ? MaxVF * UserIC : MaxVF;
if (TC > 0 && TC % MaxVFtimesIC == 0) {		if (TC > 0 && TC % MaxVFtimesIC == 0) {
// Accept MaxVF if we do not have a tail.		// Accept MaxVF if we do not have a tail.
LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");		LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
return MaxVF;		return MaxVF;
}		}
		SjoerdMeijer (Author, Done): Do we need this if there is no tail? I haven't reminded myself and checked the flow, but can this condition be true?
		Pierre-vh (Done): Yes, it can happen if tail-folding is enabled, and the loop's TC is a multiple of the VF (e.g. TC=64, VF=16). In those cases, we do the preparation for tail-folding earlier (line 4968), but then realize here that the loop has no tail, so we must revert (abandon tail folding + clear the flag). We need this because else it would generate masked load/stores for those kinds of loops, which isn't optimal (normal loads are better). Additionally, it will cause an assertion failure in the MVETailPredication pass (TC cannot be a multiple of the VF in a tail-predicated loop).
		Ayal (Not Done): They (masked load/stores) are needed, but only in blocks that were originally conditional, rather than all blocks of the loop.
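
The "no tail" condition discussed in this thread can be checked in isolation. The helper below is a hypothetical sketch that mirrors the `TC > 0 && TC % MaxVFtimesIC == 0` test from the diff; the function name `hasNoTail` is an assumption, and treating a trip count of 0 as "unknown" is inferred from the comment that follows in the diff.

```cpp
#include <cassert>

// Hypothetical helper mirroring the condition in computeMaxVF().
// TC == 0 is treated as "trip count unknown", so a tail cannot be ruled out.
bool hasNoTail(unsigned TC, unsigned MaxVF, unsigned UserIC) {
  // With a user-specified interleave count, each vector iteration covers
  // MaxVF * UserIC scalar iterations.
  unsigned MaxVFtimesIC = UserIC ? MaxVF * UserIC : MaxVF;
  return TC > 0 && TC % MaxVFtimesIC == 0;
}
```

For the example given in the thread (TC=64, VF=16, no user interleave count), `hasNoTail(64, 16, 0)` is true, so there is no tail to fold and masked loads/stores are not needed for that purpose.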

// If we don't know the precise trip count, or if the trip count that we		// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail		// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.		// by masking.
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.		// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
if (Legal->prepareToFoldTailByMasking()) {		if (Legal->prepareToFoldTailByMasking()) {
FoldTailByMasking = true;		FoldTailByMasking = true;
return MaxVF;		return MaxVF;
}		}

		// If there was a tail-folding hint/switch, but we can't fold the tail by
		// masking, fallback to a scalar epilogue.
		if (ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate) {
		LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking: Ignoring "
		dmgreen (Not Done): So if I understand this logic properly, the loop hint or the targethook now both mean "try tailpredicate but fall back to vectorize if you can't". The PreferPredicateOverEpilogue option can mean "only tail predicate" or "try tailpredicate, fallback if needed" depending on the choice of the option? (I had not originally realized that was already how this worked for the target hook.) Do we want the pragma and targethook to always work like that? Or is it worth giving the target hook a choice between the two options. I suspect for MVE we would always want to choose the "with fallback" option, so maybe it's not worth adding? If we did, we could possibly add another ScalarEpilogueStatus value, to not have to recheck PreferPredicateOverEpilogue or the target hook here.

		SjoerdMeijer (Author, Done):
		> So if I understand this logic properly, the loop hint or the targethook now both mean "try tailpredicate but fall back to vectorize if you can't". The PreferPredicateOverEpilogue option can mean "only tail predicate" or "try tailpredicate, fallback if needed" depending on the choice of the option? (I had not originally realized that was already how this worked for the target hook.)

		Yep, that is exactly right I think.

		> Do we want the pragma and targethook to always work like that? Or is it worth giving the target hook a choice between the two options. I suspect for MVE we would always want to choose the "with fallback" option, so maybe it's not worth adding?

		My first and perhaps naive thoughts were that we always want to vectorise, so set that as the fallback, because I thought this was guaranteed to be a win. But then some benchmarks proved me wrong. From a first look, however, this is just exposing some inefficiencies elsewhere. So, I still think it is the right choice to make in the end. The way I look at this patch is that this is an enabler, and if we want to flip the switch by default somehow, then that's probably best done as a follow up of this groundwork.

		dmgreen (Not Done): I think this probably does what we want for MVE, and seems like a sensible default.
		"hint/switch and using a scalar epilogue instead.\n");
		ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
		Ayal (Done): Why change ScalarEpilogueStatus?

		Pierre-vh (Done): (Sorry, forgot to post this comment.) It seems to be needed: my tests don't pass unless I change it (they have conflicting ASM for both run lines, and the content is different).

		Ayal (Not Done): Add `-check-prefix` to the FileCheck of one of the runs? https://llvm.org/docs/CommandGuide/FileCheck.html#cmdoption-filecheck-check-prefix

		Pierre-vh (Done): This is what's added when I don't change `ScalarEpilogueStatus`, and it's set to `ScalarEpilogueNotNeededUsePredicate`:
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[OFFSET_IDX]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 -1, i32 -2, i32 -3>
		It's because of this in `InnerLoopVectorizer::widenIntOrFpInduction` in LoopVectorize.cpp:1920:
		// All IV users are scalar instructions, so only emit a scalar IV, not a
		// vectorised IV. Except when we tail-fold, then the splat IV feeds the
		// predicate used by the masked loads/stores.
		if (!Cost->isScalarEpilogueAllowed()) // ScalarEpilogueStatus == CM_ScalarEpilogueAllowed
		CreateSplatIV(ScalarIV, Step);
		I am guessing that this should be using `Cost->foldTailByMasking()` instead, which returns the `FoldTailByMasking` boolean instead of looking at `ScalarEpilogueStatus`.

		Pierre-vh (Done): After looking a bit more into it:
		- Changing `!Cost->isScalarEpilogueAllowed()` to `Cost->foldTailByMasking()` does not solve the issue entirely; it causes multiple test failures.
		- The `INDUCTION` variable is unused, so this bit of code definitely shouldn't be there.
		- The solution is either to change the `ScalarEpilogueStatus`, or to find out why `Cost->foldTailByMasking()` doesn't work as expected.
		The tests that fail are:
		Failing Tests (3):
		LLVM :: Transforms/LoopVectorize/X86/constant-fold.ll
		LLVM :: Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll
		LLVM :: Transforms/LoopVectorize/pr44488-predication.ll
		These all seem to expect this "INDUCTION" variable, one way or another. Interestingly, however, the tests don't use the `INDUCTION` instruction at all, so maybe it's a bug? (Every time there's a failure in one of those tests, it's in a place where those 3 lines are expected, but not used at all.)

		Ayal (Not Done): So now we know why ScalarEpilogueStatus needed to change... @SjoerdMeijer tried to remove that presumably redundant call to CreateSplatIV() in D78911, recommitting it later under ScalarEpilogueStatus in rG9529597cf4562c64034943dacc29a4ff4fe93d86. We're still trying to figure out how to remove it: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200518/784362.html
		> I am guessing that this should be using Cost->foldTailByMasking() instead, which returns the FoldTailByMasking boolean instead of looking at ScalarEpilogueStatus.
		I guess so too, along with the needed test fixes. Sjoerd, what do you think?

		Ayal (Not Done): Why change ScalarEpilogueStatus?
		return MaxVF;
		}

		if (TC == 0) {
		reportVectorizationFailure(
		"Unable to calculate the loop count due to complex control flow",
		"unable to calculate the loop count due to complex control flow",
		"UnknownLoopCountComplexCFG", ORE, TheLoop);
		return None;
		}
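
The fallback discussed in the inline comments above can be sketched as a small standalone model. This is only an illustration under stated assumptions, not the code from LoopVectorize.cpp: `CostModelSketch`, `chooseVF`, `splatIVByStatus`, `splatIVByFolding` and the `CM_SketchEpilogueNotAllowed` enumerator are hypothetical names, and only the hunk shown above is modelled. It shows why the two candidate guards for `CreateSplatIV()` disagree when tail-folding fails and `ScalarEpilogueStatus` is left unchanged.

```cpp
#include <cassert>

// Stand-ins for the vectorizer's state. Only the two CM_ScalarEpilogue*
// names quoted in the review come from the patch; the rest is hypothetical.
enum ScalarEpilogueLowering {
  CM_ScalarEpilogueAllowed,
  CM_ScalarEpilogueNotNeededUsePredicate,
  CM_SketchEpilogueNotAllowed // hypothetical: any status forbidding an epilogue
};

struct CostModelSketch {
  ScalarEpilogueLowering ScalarEpilogueStatus;
  bool FoldTailByMasking;

  explicit CostModelSketch(ScalarEpilogueLowering Status)
      : ScalarEpilogueStatus(Status), FoldTailByMasking(false) {}

  bool isScalarEpilogueAllowed() const {
    return ScalarEpilogueStatus == CM_ScalarEpilogueAllowed;
  }
  bool foldTailByMasking() const { return FoldTailByMasking; }

  // Returns the VF to use, or 0 for "do not vectorize".
  // CanFoldTail models the result of Legal->prepareToFoldTailByMasking().
  // ResetStatus models the assignment questioned in the review:
  //   ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
  unsigned chooseVF(unsigned MaxVF, bool CanFoldTail, bool ResetStatus) {
    if (CanFoldTail) {
      FoldTailByMasking = true;
      return MaxVF;
    }
    // A tail-folding hint/switch was given, but the tail cannot be folded:
    // fall back to vectorizing with a scalar epilogue.
    if (ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate) {
      if (ResetStatus)
        ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
      return MaxVF;
    }
    return 0;
  }

  // The two candidate guards for emitting the splat IV.
  bool splatIVByStatus() const { return !isScalarEpilogueAllowed(); }
  bool splatIVByFolding() const { return foldTailByMasking(); }
};
```

In this model, a failed fold without the status reset leaves `splatIVByStatus()` true even though no tail was folded, which corresponds to the unused `INDUCTION` lines reported above; `splatIVByFolding()` stays false in that case. The model does not explain the three additional test failures reported for the `foldTailByMasking()` variant.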


llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -loop-vectorize -mattr=+armv8.1-m.main,+mve.fp -disable-mve-tail-predication=false < %s | FileCheck %s
				; RUN: opt -S -loop-vectorize -mattr=+armv8.1-m.main,+mve.fp -disable-mve-tail-predication=true < %s | FileCheck %s
				SjoerdMeijer (Author, Done): Sorry for being a bit lazy, but this is a big example; why is this rejected for tail-predication? It would be good to indicate the reason somewhere in the IR, e.g. in the function name. And could this example be reduced?

				SjoerdMeijer (Author, Done): This test is enabling MVE tail-predication, meaning that here we "enable" masked loads/stores that enable tail-folding. Thus, since no other options are used, this relies on the TTI hook preferPredicateOverEpilogue to set CM_ScalarEpilogueNotNeededUsePredicate. And so, this hook determines for this test that tail-folding was possible, which we then overrule later, is that correct?

				Pierre-vh (Done):
				> Sorry for being a bit lazy, but this is a big example; why is this rejected for tail-predication? It would be good to indicate the reason somewhere in the IR, e.g. in the function name. And could this example be reduced?
				If I remember correctly, this test is rejected because of this: `store i8* %incdec.ptr13, i8** %pos, align 4`. It's an outside user of `%incdec.ptr13`, which is defined in the loop. I'll try to reduce the test a bit more, and I'll add a comment explaining why it should be rejected.
				> This test is enabling MVE tail-predication, meaning that here we "enable" masked loads/stores that enable tail-folding. Thus, since no other options are used, this relies on the TTI hook preferPredicateOverEpilogue to set CM_ScalarEpilogueNotNeededUsePredicate. And so, this hook determines for this test that tail-folding was possible, which we then overrule later, is that correct?
				That is correct; this test is the same as the other one, except it relies on `preferPredicateOverEpilogue` to set the flag. Should I add something to check that `preferPredicateOverEpilogue` accepts the loop? (So the test doesn't silently pass if the TTI hook doesn't accept it anymore.)

				; This test should produce the same result (vectorized loop + scalar epilogue) with
				; default options and when MVE Tail Predication is enabled, as this loop's tail cannot be folded
				; by masking due to an outside user of %incdec.ptr in %end.

				target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv8.1m.main-arm-unknown-eabihf"

				define void @outside_user_blocks_tail_folding(i8* nocapture readonly %ptr, i32 %size, i8** %pos) {
				; CHECK-LABEL: @outside_user_blocks_tail_folding(
				; CHECK-NEXT: header:
				; CHECK-NEXT: [[PTR0:%.*]] = load i8*, i8** [[POS:%.*]], align 4
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[SIZE:%.*]], 16
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[SIZE]], 16
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[SIZE]], [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = sub i32 [[SIZE]], [[N_VEC]]
				; CHECK-NEXT: [[IND_END2:%.*]] = getelementptr i8, i8* [[PTR:%.*]], i32 [[N_VEC]]
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = sub i32 [[SIZE]], [[INDEX]]
				; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[OFFSET_IDX]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 0
				; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, i8* [[PTR]], i32 [[TMP1]]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[NEXT_GEP]], i32 1
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[TMP2]], i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <16 x i8>*
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, <16 x i8>* [[TMP4]], align 1
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, i8* [[NEXT_GEP]], i32 0
				Ayal (Done): IND_END pre-computes the value of %incdec.ptr.
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i8* [[TMP5]] to <16 x i8>*
				; CHECK-NEXT: store <16 x i8> [[WIDE_LOAD]], <16 x i8>* [[TMP6]], align 1
				; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
				; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[SIZE]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[CMP_N]], label [[END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ [[SIZE]], [[HEADER:%.*]] ]
				; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i8* [ [[IND_END2]], [[MIDDLE_BLOCK]] ], [ [[PTR]], [[HEADER]] ]
				; CHECK-NEXT: br label [[BODY:%.*]]
				; CHECK: body:
				; CHECK-NEXT: [[DEC66:%.*]] = phi i32 [ [[DEC:%.*]], [[BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[BUFF:%.*]] = phi i8* [ [[INCDEC_PTR:%.*]], [[BODY]] ], [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[INCDEC_PTR]] = getelementptr inbounds i8, i8* [[BUFF]], i32 1
				; CHECK-NEXT: [[DEC]] = add nsw i32 [[DEC66]], -1
				; CHECK-NEXT: [[TMP8:%.*]] = load i8, i8* [[INCDEC_PTR]], align 1
				; CHECK-NEXT: store i8 [[TMP8]], i8* [[BUFF]], align 1
				; CHECK-NEXT: [[TOBOOL11:%.*]] = icmp eq i32 [[DEC]], 0
				; CHECK-NEXT: br i1 [[TOBOOL11]], label [[END]], label [[BODY]], !llvm.loop !2
				; CHECK: end:
				; CHECK-NEXT: [[INCDEC_PTR_LCSSA:%.*]] = phi i8* [ [[INCDEC_PTR]], [[BODY]] ], [ [[IND_END2]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: store i8* [[INCDEC_PTR_LCSSA]], i8** [[POS]], align 4
				; CHECK-NEXT: ret void
				;
				header:
				%ptr0 = load i8*, i8** %pos, align 4
				br label %body

				body:
				%dec66 = phi i32 [ %dec, %body ], [ %size, %header ]
				%buff = phi i8* [ %incdec.ptr, %body ], [ %ptr, %header ]
				%incdec.ptr = getelementptr inbounds i8, i8* %buff, i32 1
				%dec = add nsw i32 %dec66, -1
				%0 = load i8, i8* %incdec.ptr, align 1
				store i8 %0, i8* %buff, align 1
				%tobool11 = icmp eq i32 %dec, 0
				br i1 %tobool11, label %end, label %body

				end:
				store i8* %incdec.ptr, i8** %pos, align 4
				ret void
				}
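
As a reading aid for the CHECK lines above, the following C++ sketch mirrors the test loop and the shape of the expected output (a vector body over `N_VEC` iterations followed by a scalar epilogue). It is an illustration only: the function names are hypothetical, the fixed `VF = 16` is taken from the CHECK lines, the wide load and store are simulated with `memcpy`, and `Size > 0` is assumed because the IR loop is a do-while. The returned pointer plays the role of `%incdec.ptr`, the value used outside the loop in `%end`.

```cpp
#include <cassert>
#include <cstring>

// Scalar reference with the same shape as the IR loop (requires Size > 0).
// Reads Ptr[1..Size], writes Ptr[0..Size-1], returns the live-out pointer.
unsigned char *shiftLeftScalar(unsigned char *Ptr, unsigned Size) {
  unsigned char *Buff = Ptr;
  unsigned char *Next;
  do {
    Next = Buff + 1;     // %incdec.ptr
    *Buff = *Next;       // load from %incdec.ptr, store to %buff
    Buff = Next;
  } while (--Size != 0); // %dec, %tobool11
  return Next;           // the value stored to %pos in %end
}

// Vector body plus scalar epilogue, following the CHECK lines.
unsigned char *shiftLeftVectorThenEpilogue(unsigned char *Ptr, unsigned Size) {
  const unsigned VF = 16;
  const unsigned NVec = Size - Size % VF; // N_VEC
  for (unsigned Index = 0; Index < NVec; Index += VF) {
    unsigned char Wide[VF];
    std::memcpy(Wide, Ptr + Index + 1, VF); // WIDE_LOAD
    std::memcpy(Ptr + Index, Wide, VF);     // wide store
  }
  // IND_END2: the pointer value after the vector loop, computed up front
  // from N_VEC rather than taken from the last vector iteration.
  unsigned char *Live = Ptr + NVec;
  // Scalar epilogue handles the remaining Size % VF iterations.
  for (unsigned Remaining = Size - NVec; Remaining != 0; --Remaining) {
    Live[0] = Live[1];
    ++Live;
  }
  return Live;
}
```

Both functions leave the buffer in the same state and return `Ptr + Size`; the buffer passed in must hold at least `Size + 1` bytes, since the loop reads one element past the last one it writes.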

llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -S -loop-vectorize -prefer-predicate-over-epilog < %s | FileCheck %s
				; RUN: opt -S -loop-vectorize < %s | FileCheck %s

				; This test should produce the same result when TP is forced/disabled, because it
				Ayal (Done): "TP is forced" >> "tail folding is preferred"; "tail-predicated" >> "tail folding". If "tail-predication" is preferred, it should replace "tail folding" throughout. But a tail can be predicated without folding it, so FoldTail, or in full FoldTailByMasking, seems more accurate?
				; can't be tail-predicated (due to an outside user of %incdec.ptr in %end),
				; so the vectorizer should fall back to a scalar epilogue.
				;
				; The first test (@basic_loop) simply relies on the command-line switches.
				; The second test (@metadata) forces tail-folding using metadata.
				Ayal (Done): "forces" >> "specifies fold-tail preference via metadata"
				; Both tests should always generate a scalar epilogue.

				target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"

				define void @basic_loop(i8* nocapture readonly %ptr, i32 %size, i8** %pos) {
				; CHECK-LABEL: @basic_loop(
				; CHECK-NEXT: header:
				; CHECK-NEXT: [[PTR0:%.*]] = load i8*, i8** [[POS:%.*]], align 4
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[SIZE:%.*]], 4
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[SIZE]], 4
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[SIZE]], [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = sub i32 [[SIZE]], [[N_VEC]]
				; CHECK-NEXT: [[IND_END2:%.*]] = getelementptr i8, i8* [[PTR:%.*]], i32 [[N_VEC]]
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = sub i32 [[SIZE]], [[INDEX]]
				; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[OFFSET_IDX]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 0
				; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, i8* [[PTR]], i32 [[TMP1]]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[NEXT_GEP]], i32 1
				Ayal (Done): ditto.
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[TMP2]], i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <4 x i8>*
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, <4 x i8>* [[TMP4]], align 1
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, i8* [[NEXT_GEP]], i32 0
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i8* [[TMP5]] to <4 x i8>*
				; CHECK-NEXT: store <4 x i8> [[WIDE_LOAD]], <4 x i8>* [[TMP6]], align 1
				; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
				; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[SIZE]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[CMP_N]], label [[END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ [[SIZE]], [[HEADER:%.*]] ]
				; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i8* [ [[IND_END2]], [[MIDDLE_BLOCK]] ], [ [[PTR]], [[HEADER]] ]
				; CHECK-NEXT: br label [[BODY:%.*]]
				; CHECK: body:
				; CHECK-NEXT: [[DEC66:%.*]] = phi i32 [ [[DEC:%.*]], [[BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[BUFF:%.*]] = phi i8* [ [[INCDEC_PTR:%.*]], [[BODY]] ], [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[INCDEC_PTR]] = getelementptr inbounds i8, i8* [[BUFF]], i32 1
				; CHECK-NEXT: [[DEC]] = add nsw i32 [[DEC66]], -1
				; CHECK-NEXT: [[TMP8:%.*]] = load i8, i8* [[INCDEC_PTR]], align 1
				; CHECK-NEXT: store i8 [[TMP8]], i8* [[BUFF]], align 1
				; CHECK-NEXT: [[TOBOOL11:%.*]] = icmp eq i32 [[DEC]], 0
				; CHECK-NEXT: br i1 [[TOBOOL11]], label [[END]], label [[BODY]], !llvm.loop !2
				; CHECK: end:
				; CHECK-NEXT: [[INCDEC_PTR_LCSSA:%.*]] = phi i8* [ [[INCDEC_PTR]], [[BODY]] ], [ [[IND_END2]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: store i8* [[INCDEC_PTR_LCSSA]], i8** [[POS]], align 4
				; CHECK-NEXT: ret void
				;
				header:
				%ptr0 = load i8*, i8** %pos, align 4
				br label %body

				body:
				%dec66 = phi i32 [ %dec, %body ], [ %size, %header ]
				%buff = phi i8* [ %incdec.ptr, %body ], [ %ptr, %header ]
				%incdec.ptr = getelementptr inbounds i8, i8* %buff, i32 1
				%dec = add nsw i32 %dec66, -1
				%0 = load i8, i8* %incdec.ptr, align 1
				store i8 %0, i8* %buff, align 1
				%tobool11 = icmp eq i32 %dec, 0
				br i1 %tobool11, label %end, label %body

				end:
				store i8* %incdec.ptr, i8** %pos, align 4
				ret void
				}

				define void @metadata(i8* nocapture readonly %ptr, i32 %size, i8** %pos) {
				; CHECK-LABEL: @metadata(
				; CHECK-NEXT: header:
				; CHECK-NEXT: [[PTR0:%.*]] = load i8*, i8** [[POS:%.*]], align 4
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[SIZE:%.*]], 4
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[SIZE]], 4
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[SIZE]], [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = sub i32 [[SIZE]], [[N_VEC]]
				; CHECK-NEXT: [[IND_END2:%.*]] = getelementptr i8, i8* [[PTR:%.*]], i32 [[N_VEC]]
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = sub i32 [[SIZE]], [[INDEX]]
				; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[OFFSET_IDX]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 0
				; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, i8* [[PTR]], i32 [[TMP1]]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[NEXT_GEP]], i32 1
				; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[TMP2]], i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <4 x i8>*
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, <4 x i8>* [[TMP4]], align 1
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, i8* [[NEXT_GEP]], i32 0
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i8* [[TMP5]] to <4 x i8>*
				; CHECK-NEXT: store <4 x i8> [[WIDE_LOAD]], <4 x i8>* [[TMP6]], align 1
				; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
				; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[SIZE]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[CMP_N]], label [[END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ [[SIZE]], [[HEADER:%.*]] ]
				; CHECK-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i8* [ [[IND_END2]], [[MIDDLE_BLOCK]] ], [ [[PTR]], [[HEADER]] ]
				; CHECK-NEXT: br label [[BODY:%.*]]
				; CHECK: body:
				; CHECK-NEXT: [[DEC66:%.*]] = phi i32 [ [[DEC:%.*]], [[BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[BUFF:%.*]] = phi i8* [ [[INCDEC_PTR:%.*]], [[BODY]] ], [ [[BC_RESUME_VAL1]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[INCDEC_PTR]] = getelementptr inbounds i8, i8* [[BUFF]], i32 1
				; CHECK-NEXT: [[DEC]] = add nsw i32 [[DEC66]], -1
				; CHECK-NEXT: [[TMP8:%.*]] = load i8, i8* [[INCDEC_PTR]], align 1
				; CHECK-NEXT: store i8 [[TMP8]], i8* [[BUFF]], align 1
				; CHECK-NEXT: [[TOBOOL11:%.*]] = icmp eq i32 [[DEC]], 0
				; CHECK-NEXT: br i1 [[TOBOOL11]], label [[END]], label [[BODY]], !llvm.loop !5
				; CHECK: end:
				; CHECK-NEXT: [[INCDEC_PTR_LCSSA:%.*]] = phi i8* [ [[INCDEC_PTR]], [[BODY]] ], [ [[IND_END2]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: store i8* [[INCDEC_PTR_LCSSA]], i8** [[POS]], align 4
				; CHECK-NEXT: ret void
				;
				header:
				%ptr0 = load i8*, i8** %pos, align 4
				br label %body

				body:
				%dec66 = phi i32 [ %dec, %body ], [ %size, %header ]
				%buff = phi i8* [ %incdec.ptr, %body ], [ %ptr, %header ]
				%incdec.ptr = getelementptr inbounds i8, i8* %buff, i32 1
				%dec = add nsw i32 %dec66, -1
				%0 = load i8, i8* %incdec.ptr, align 1
				store i8 %0, i8* %buff, align 1
				%tobool11 = icmp eq i32 %dec, 0
				br i1 %tobool11, label %end, label %body, !llvm.loop !1

				end:
				store i8* %incdec.ptr, i8** %pos, align 4
				ret void
				}

				!1 = distinct !{!1, !2, !3}
				!2 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
				!3 = !{!"llvm.loop.vectorize.enable", i1 true}
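
For context on where the `llvm.loop.vectorize.predicate.enable` metadata used by `@metadata` typically comes from: in C or C++ compiled with Clang, the corresponding loop hint is the `vectorize_predicate(enable)` clause of `#pragma clang loop`. The sketch below is a hypothetical source-level counterpart of the test loop, not the source the test was reduced from; the function name is made up, and whether the vectorizer folds the tail or falls back to a scalar epilogue for it depends on the compiler version and target. Other compilers are expected to ignore the pragma, possibly with a warning.

```cpp
#include <cassert>
#include <cstdint>

// Shifts Size bytes one position to the left and returns the advanced
// pointer, which is used after the loop (as %incdec.ptr is in %end).
std::uint8_t *shiftLeftWithHint(std::uint8_t *Ptr, std::uint32_t Size) {
  std::uint8_t *Buff = Ptr;
// Request vectorization and express a preference for a predicated
// (tail-folded) vector loop over a scalar epilogue.
#pragma clang loop vectorize(enable) vectorize_predicate(enable)
  for (std::uint32_t I = 0; I < Size; ++I) {
    Buff[0] = Buff[1];
    ++Buff;
  }
  return Buff;
}
```

The hint expresses a preference only; the result of the function is the same whether or not the loop is vectorized.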

Revision Contents

Diff 264822

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll

llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll
