This implements 2 different vectorization fallback strategies if tail-folding fails: 1) don't vectorize at all, or 2) vectorize using a scalar epilogue. This can be controlled with option -prefer-predicate-over-epilog, that has been changed to take a numeric value corresponding to the tail-folding preference and preferred fallback.
Details
Diff Detail
Event Timeline
The direction looks good to me. We definitely still want to vectorise if possible, if we don't do that when tail-folding is requested, that sounds like a "performance bug".
Just a general remark: tail-predication is a bit of a MVE term. In the LV, tail-folding is the term that is used. Please use that instead in the summary/description etc.
I had a first look, see some nits/questions inline.
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1233 | nit: how about SoftFailure -> ReportFailure? | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
4976 | nit: tail-predication -> tail-folding | |
5011 | Do we need this if there is no tail? I haven't reminded myself and checked the flow, but can this condition be true? | |
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll | ||
3 | Sorry for being a bit lazy, but this is a big example, but why is this rejected for tail-predication? Would be good to indicate the reason somewhere, e.g. function name, in the IR. |
A follow up question inlined.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4968 | nit: tail-predication -> tail-folding | |
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll | ||
3 | This test is enabling MVE tail-predication, meaning that here we "enable" masked loads/stores that enable tail-folding. Thus, since no other options are used, this relies on TTI hook preferPredicateOverEpilogue to set CM_ScalarEpilogueNotNeededUsePredicate. And so, this hook determines for this test that tail-folding was possible, which we then overrule later, is that correct? |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5011 | Yes, it can happen if tail-folding is enabled, and the loop's TC is a multiple of the VF. (e.g. TC=64, VF=16) We need this because else it would generate masked load/stores for those kinds of loops, which isn't optimal (normal loads are better). Additionally, it will cause an assertion failure in the MVETailPredication pass (TC cannot be a multiple of the VF in a tail-predicated loop). | |
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll | ||
3 |
If I remember correctly, this test is rejected because of this: store i8* %incdec.ptr13, i8** %pos, align 4, it's an outside user of %incdec.ptr13 which is defined in the loop. I'll try to reduce the test a bit more, and I'll add a comment explaining why it should be rejected.
That is correct, this test is the same as the other one, except it relies on preferPredicateOverEpilogue to set the flag. |
Motivated by the discussion on the added test case, just a general remark about the description of this change:
This patch teaches the vectorizer to fallback to a scalar epilogue when a tail-predication hint is found
Using "hint" here can be a bit ambigious. I think the complete story is that we always fall back to a scalar epilogue, regardless whether tail-folding was requested on the command-line, a loop hint, or the TTI hook. Thinking about this, it's probably worth mentioning this on the documentation or description of the -prefer-predicate-over-epilog option, and/or the general documentation of the vectorize_predicate pragma, for which there probably also needs to be a test case added, i.e. for all 3 cases how tail-folding can be requested, there need to be a test. Looks like 2 are covered already: the option, and the TTI hook.
There are indeed loops which can be vectorized only with a scalar remainder loop, i.e., w/o foldTail. But the tests attached could easily be made an exception: LV handles induction variables that are live-out by pre-computing their values in the pre-header. So foldTail should work fine in their presence, provided the live-out value is computed using the original trip-count rather than the one foldTail rounds up.
Other live-outs could also be made to work under foldTail, such as selecting the last live element from the last live vector instead of assuming it's the last element of the last part.
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h | ||
---|---|---|
238 | But they need to be reset to reflect original conditional blocks, as set by canVectorizeWithIfConvert() when it initially called blockCanBePredicated(). | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
1233 | Rather than if/how to report failure, probably better to clone this method into a canFoldTailByMasking() (as it was originally called iirc) const method, w/o having any side-effects that may later need to be abandoned, and a prepareToFoldTailByMasking() which need not return any value. | |
1240 | Try also adding getInductionVars() Phi's, and the values that feed them across the backedge, to the set of live outs allowed by fold tail? | |
1260 | Should an "analysis" message be reported instead of a "failure" one? | |
1265 | Can dump this note regardless of ReportFailure. | |
1269–1274 | In order (for canFoldTailByMasking()) to only check if tail can be folded, this call to blockCanBePredicated() should be replaced with one that does not have side-effects. | |
1282 | ditto. | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
4972 | Fold this under the following switch, possibly falling through to the CM_ScalarEpilogueAllowed case? | |
5011 | They (masked load/stores) are needed, but only in blocks that were originally conditional, rather than all blocks of the loop. | |
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-scalar-epilogue-fallback.ll | ||
34 | IND_END pre-computes the value of %incdec.ptr. | |
llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll | ||
33 | ditto. |
There are indeed loops which can be vectorized only with a scalar remainder loop, i.e., w/o foldTail. But the tests attached could easily be made an exception: LV handles induction variables that are live-out by pre-computing their values in the pre-header. So foldTail should work fine in their presence, provided the live-out value is computed using the original trip-count rather than the one foldTail rounds up.
Other live-outs could also be made to work under foldTail, such as selecting the last live element from the last live vector instead of assuming it's the last element of the last part.
Thanks for elaborating and explaining, and yes, got it about the live-out values and precomputing them. Other than the live-out values, I think we have an additional challenge because target hook TTI->preferPredicateOverEpilogue that can reject tail-folding for loops that can be vectorised with a scalar remainder because of target dependent restrictions. I think that's why we also need a fallback. And judging from your other comments, please correct me if I'm wrong, it looks like you're happy with the general direction of this, and so we will address your other comments.
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1233 | I've been thinking about this, but I'm not exactly sure it's the right thing to do. There will probably be a small amount of code duplication and we'd be iterating over the loop's instruction twice. I'm not a fan of iterating over the loop's instructions more than needed, especially in this case as I think it can be avoided. Here is another idea: What if prepareToFoldTailByMasking becomes const, and instead of directly inserting elements into MaskedOp and ConditionalAssumes, it works on a temporary set. The simplest way to implement this would be to take sets by reference, like this: Another idea would be to move this logic in another object, for instance, something like TailFoldingLegal, which would:
Then it could be used like this: TailFoldingLegal TFL; if tail folding is needed: TFL.checkLoop(TheLoop) // etc. if TFL.canFoldTailByMasking() && we need a tail: TFL.applyTailFoldingByMasking(Legal); What do you think ? Do you prefer a simpler approach, even if it means iterating over the loop's instruction many times, or something a bit more complete like what I suggested? Note that I don't know the cost of iterating over the loop. Maybe it's really cheap and I shouldn't worry about it (then just ignore everything I suggested above). In any case, I'll just implement your favorite solution, as you know much more about this topic than I do, so please tell me what you prefer. Thank you very much for your help. |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1233 | Agreed, would be better to avoid repeated instruction scans. In order for prepareToFoldTail() to be an optional preference, it should cause no side-effects when returning false; i.e., it should add to MaksedOps and ConditionalAssumes iff returning true. This mostly affects blockCanBePredicated() - it would need to have MaskedOps and ConditionalAssumes passed-in as arguments. Sounds reasonable? | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
4973 | May be simpler to exit early after calling prepareToFoldTail(), instead of checking FoldTailByMasking later. |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1233 |
This is a good idea, but there is one issue left: when ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate and the loop has no tail (TC > 0 && TC % MaxVF == 0), we have to abandon tail-folding (Line 5029 in LoopVectorize.cpp), but if we do things that way, there's no way to abandon tail-folding at that stage. There are 2 solutions that I can think of:
Either solution is fine for me, but I'll need some guidance on how to refactor computeMaxVF properly if you prefer the first solution. |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1233 | Right. Instead of introducing an early call to prepareToFoldTail(), before knowing MaxVF and if there's a tail to fold or else the preparation should be undone, how about waiting until the original call to prepareToFoldTail() fails. Then, before bailing-out, check if the status is CM_ScalarEpilogueNotNeededUsePredicate and if so return MaxVF instead? Note BTW that D80085 should hopefully land soon. |
- Changing implementation of the patch following discussion
- Removed the ReportFailure argument of prepareToFoldTailByMasking. I don't think it's useful anymore, but feedback is welcome. (The only thing that annoys me is that we now print "loop not vectorized" even when we'll fallback to a scalar epilogue)
- Added a test that makes use of the attribute that enables tail-folding
- Simplified tests
Right, as commented earlier, an "analysis" message should be reported when fold-tail doesn't work, instead of a "failure" one.
- Added a test that makes use of the attribute that enables tail-folding
- Simplified tests
Would be good to also preserve the existing "force-vector-tail-folding" behavior of "fold or do not vectorize", in addition to introducing the new "prefer-vector-tail-folding" behavior.
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h | ||
---|---|---|
379 | (Would have been good to set defaults, but using nullptr's seems cumbersome.) | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
1264 | Rename e.g. with Tmp or FoldTail prefix to distinguish from the non fold-tail sets? | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
5027 | Why change ScalarEpilogueStatus? | |
llvm/test/Transforms/LoopVectorize/use-scalar-epilogue-if-tp-fails.ll | ||
5 | "TP is forced" >> "tail folding is preferred" If "tail-predication" is preferred, it should replace "tail folding" throughout. But a tail can be predicated w/o folding it, so FoldTail or in full FoldTailByMasking seems more accurate? | |
10 | "forces" >> "specifies fold-tail preference via metadata" |
- Addressed comments (see items marked as "done")
- Changed "reportVectorizationFailure" calls in "prepareTailFoldingByMasking" into simple debug prints.
Would be good to also preserve the existing "force-vector-tail-folding" behavior of "fold or do not vectorize", in addition to introducing the new "prefer-vector-tail-folding" behavior.
What is your preferred solution for this? Here are some ideas:
- Add a new command-line switch that acts like the old -prefer-predicate-over-epilog, something like -force-predicate-over-epilog
- Add an additional switch to use with -prefer-predicate-over-epilog to disable the fallback behaviour, such as -disable-scalar-epilog-fallback
- Keep the old -prefer-predicate-over-epilog behaviour and only fallback when TTI requested tail-folding
Perhaps either (the first option of) a new command-line switch e.g. -force-vector-tail-folding, or a value added to -prefer-predicate-over-epilog=x indicating the "strength" of the preference.
Adding a matching pragma/metadata would also be nice.
Better avoid adding a new TTI hook, which will not be exercised by any in-tree target, right?
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1260 | Would be good to also have an ORE report similar to the original one, except as an analysis rather than failure. | |
1264 | Can add a comment what these sets are for. | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
5027 | Why change ScalarEpilogueStatus? |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5027 | (Sorry, forgot to post this comment) |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5027 | Add -check-prefix to the FileCheck of one of the runs? |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5027 | This is what's added when I don't change ScalarEpilogueStatus, and it's set to ScalarEpilogueNotNeededUsePredicate: ; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[OFFSET_IDX]], i32 0 ; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer ; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 -1, i32 -2, i32 -3> It's because of this in InnerLoopVectorizer::widenIntOrFpInduction in LoopVectorize.cpp:1920 // All IV users are scalar instructions, so only emit a scalar IV, not a // vectorised IV. Except when we tail-fold, then the splat IV feeds the // predicate used by the masked loads/stores. if (!Cost->isScalarEpilogueAllowed()) // ScalarEpilogueStatus == CM_ScalarEpilogueAllowed CreateSplatIV(ScalarIV, Step); I am guessing that this should be using Cost->foldTailByMasking() instead, which returns the FoldTailByMasking boolean instead of looking at ScalarEpilogueStatus. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5027 | After looking a bit more into it:
Solution is either to change the ScalarEpilogueStatus, or find out why Cost->foldTailByMasking() doesn't work as expected. Failing Tests (3): LLVM :: Transforms/LoopVectorize/X86/constant-fold.ll LLVM :: Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll LLVM :: Transforms/LoopVectorize/pr44488-predication.ll Which all seem to expect this "INDUCTION" variable, one way or another. However, interestingly, the tests don't use the INDUCTION instruction at all, so maybe it's a bug? (Everytime there's a failure in one of those test, it's in a place where those 3 lines are expected, but not used at all) |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5027 | So now we know why ScalarEpilogueStatus needed to change... @SjoerdMeijer tried to remove that presumably redundant call to CreateSplatIV() in D78911, recommitting it later under ScalarEpilogueStatus in rG9529597cf4562c64034943dacc29a4ff4fe93d86. We're still trying to figure out how to remove it: http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200518/784362.html
guess so too, along with needed test fixings. Sjoerd, what do you think? |
Hi Ayal, thanks again for looking at this. I am taking over this patch as Pierre has finished his rotation in our team. This patch is part of our tail-predication story, and is the highest on my list. I will pick this up tomorrow as today is a bank holiday here.
Hi @Ayal , with a bit of delay, I am almost ready to pick this up. Regarding your comments on rG9529597cf4562c64034943dacc29a4ff4fe93d86, I've copied them here below as I thought that would be more convenient, but let me know if you would like to move it to elsewhere. Anyway, I am now going to look into your comments, these ones:
This omission of Primary Induction Phi's from Scalars in collectLoopScalars(), under foldTail:
// An induction variable will remain scalar if all users of the induction // variable and induction variable update remain scalar. ... + // If tail-folding is applied, the primary induction variable will be used + // to feed a vector compare. + if (Ind == Legal->getPrimaryInduction() && foldTailByMasking()) + continue;should have led widenIntOrFpInduction() to take the following early-exit for Phi's that are not in Scalars:
// Try to create a new independent vector induction variable. If we can't // create the phi node, we will splat the scalar induction variable in each // loop iteration. if (!shouldScalarizeInstruction(EntryVal)) {rather than reaching CreateSplatIV() at the end.
Perhaps the regressions occur when EntryVal is the Trunc rather than the IV Phi?
If so, one remedy may be to also omit Trunc from Scalars; another may be to check if (!shouldScalarizeInstruction(IV) instead-of or in addition to if (!shouldScalarizeInstruction(EntryVal)).Is there a short reproducer?
Yep, function @foo_optsize() in test/Transforms/LoopVectorize/X86/optsize.ll is a minimal example. It will be miscompiled with these 2 lines commented out that I added:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index ec756671ea6..d39795627ec 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -1917,8 +1917,8 @@ void InnerLoopVectorizer::widenIntOrFpInduction(PHINode *IV, TruncInst *Trunc) { // vectorised IV. Except when we tail-fold, then the splat IV feeds the // predicate used by the masked loads/stores. Value *ScalarIV = CreateScalarIV(Step); - if (!Cost->isScalarEpilogueAllowed()) - CreateSplatIV(ScalarIV, Step); + //if (!Cost->isScalarEpilogueAllowed()) + // CreateSplatIV(ScalarIV, Step); buildScalarSteps(ScalarIV, Step, EntryVal, ID); }
The miscompilation is that we don't emit:
%induction = add <64 x i32> %broadcast.splat, <i32 0, i32 1, ...
and the icmp is incorrectly performed against:
%2 = icmp ule <64 x i32> %broadcast.splat, <i32 202, i32 202, ...
which should have been:
%2 = icmp ule <64 x i32> %induction, <i32 202, i32 202, ...
it indeed doesn't take the early exit:
// Try to create a new independent vector induction variable. If we can't // create the phi node, we will splat the scalar induction variable in each // loop iteration. if (!shouldScalarizeInstruction(EntryVal)) {
I need to look again if this makes sense or not.
Caught up on all previous messages, and I believe missing in this patch is control over some of the decision making of this old and new behaviour. Rather than introducing yet another option, I liked @Ayal 's suggestion:
value added to -prefer-predicate-over-epilog=x indicating the "strength" of the preference.
for which I will introduce these values/options:
- x = 1: vectorize if we can't fail-fold,
- x = 2 : don't vectorize if we can't tail-fold
I think this models the "strength" of the vector tail-folding preference.
Changed -prefer-predicate-over-epilog to take a numeric value, corresponding to the preferred predication and fallback strategy.
Changed option -prefer-predicate-over-epilog to accept a more descriptive value rather than a cryptic numeric value.
Also a friendly ping. :-)
Is the idea to turn this option on by default for MVE? Maybe by changing the preferPredicateOverEpilogue call?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
189 | -> there are different ... | |
198 | Epilog -> Epilogue? | |
208 | Maybe call this predicate-else-scalar-epilog. | |
209 | -> "prefer tail-folding, create scalar epilogue if tail folding fails" ? | |
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll | ||
2 ↗ | (On Diff #278125) | What happens if we change this to predicate-vectorize? Would that be a better option for MVE? (or is that not what this test is really aiming to test?) |
Thanks for looking at this Dave. This addresses:
- the minor comments, the rewording of comments/option,
- restores 2 new regression tests that I forgot to upload in the previous diff.
Is the idea to turn this option on by default for MVE? Maybe by changing the preferPredicateOverEpilogue call?
So that was definitely the idea, that was the motivation of this prep work here.
I was thinking of addressing that in a follow up patch, because that is a different functional change, and then we can also discuss how to enable this. I.e., preferPredicateOverEpilogue would be one way to do it, another way is to flip the switch and default to falling back to vectorisation, which seems to me what most targets would want actually, but perhaps I am wrong and should get some performance numbers for this.
@SjoerdMeijer I ran into a potential use for this patch in the reduction code I was working on. I think this looks good in general I just had a question about the fallback stratergy.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
201–202 | We should probably pick a single spelling of epilog and stick to it. At least in the same option! But feel free to ignore too if you like. | |
5025 | So if I understand this logic properly, the loop hint or the targethook now both mean "try tailpredicate but fall back to vectorize if you can't". The PreferPredicateOverEpilogue option can mean "only tail predicate" or "try tailpredicate, fallback if needed" depending on the choice of the option? (I had not originally realized that was already how this worked for the target hook.) Do we want the pragma and targethook to always work like that? Or is it worth giving the target hook a choice between the two options. I suspect for MVE we would always want to choose the "with fallback" option, so maybe it's not worth adding? If we did, we could possibly add another ScalarEpilogueStatus value, to not have to recheck PreferPredicateOverEpilogue or the target hook here. |
Thanks for looking Dave.
I am addressing your remarks, also find some replies inline.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
201–202 | No, good point! I honestly don't know what it should be and have no opinion on it, but I will go for the common case epilogue. | |
5025 |
Yep, that is exactly right I think.
My first and perhaps naive thoughts were that we always want to vectorise, so set that as the fallback, because I thought this was guaranteed to be a win. But then some benchmarks proved me wrong. From a first look, however, this is just exposing some inefficiencies elsewhere. So, I still think it is the right choice to make in the end. The way I look at this patch is that this is an enabler, and if we want to flip the switch by default somehow, then that's probably best done as a follow up of this groundwork. |
If you do something about the spelling of epilog, then this looks good to me.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5025 | I think this probably does what we want for MVE, and seems like a sensible default. |
But they need to be reset to reflect original conditional blocks, as set by canVectorizeWithIfConvert() when it initially called blockCanBePredicated().