This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable
AcceptedPublic

Authored by Mel-Chen on May 18 2023, 2:43 AM.

Details

Summary

Consider the following loop:

int red = ii;
for (int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;

We can vectorize this loop if i is an increasing induction variable.
The final reduced value will be tha maximum of i that the condition
a[i] > b[i] is satisfied, or the start value ii.

This patch added new RecurKind enums - IFindLastIV and FFindLastIV.

TODOs:

  • Casting increasing induction variable, like truncate, SExt.

Diff Detail

Event Timeline

Mel-Chen created this revision.May 18 2023, 2:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 18 2023, 2:43 AM
Mel-Chen requested review of this revision.May 18 2023, 2:43 AM
Mel-Chen updated this revision to Diff 526982.May 31 2023, 3:14 AM

Changes:

  • Add test cases
  • Check the bound of increasing induction variable
Mel-Chen edited the summary of this revision. (Show Details)May 31 2023, 3:14 AM
Mel-Chen retitled this revision from [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable (WIP) to [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable .Jun 5 2023, 1:45 AM
Mel-Chen edited the summary of this revision. (Show Details)
fhahn added inline comments.Jun 12 2023, 3:53 AM
llvm/include/llvm/Analysis/IVDescriptors.h
57

To keep the diff more compact, could you split the FP handling off? It also looks like codegen is at least not tested for the FP case?

llvm/lib/Analysis/IVDescriptors.cpp
674

Do we need to distinguish here between no signed/no unsigned wrap and then chose smax/umax during codegen?

llvm/lib/Transforms/Utils/LoopUtils.cpp
1089

What does this mean? The current patch only handles increasing inductions, so there's no mis-compile that needs fixing here( which the comment kind-of implies)?

llvm/test/Transforms/LoopVectorize/iv-select-cmp-no-wrap.ll
1 ↗(On Diff #526982)

Could you submit the tests separately?

Mel-Chen added inline comments.Jun 12 2023, 6:23 AM
llvm/include/llvm/Analysis/IVDescriptors.h
57

I am afraid there is misunderstanding.
The test functions starting with "@select_fcmp" are testing the SelectIVFCmp reduction pattern. SelectIVFCmp is similar in semantics to SelectFCmp, where the operand types of the select instruction are integer, and the cmp instruction is fcmp.

llvm/lib/Analysis/IVDescriptors.cpp
674

That's a good point. Implementing the patch for both signed and unsigned induction variables can be challenging in practice.
The current patch only focuses on the signed IV because in most applications, we often encounter induction variables in the form of IV {0, +, step}. When the IV is signed, we can use -1 or the minimum value of the signed data type as a sentinel value. However, when the IV is unsigned, we don't have a value smaller than 0 to use. This doesn't mean that unsigned IV cannot be vectorized, but rather they require additional handling and a more refined approach.

Of course, if an unsigned IV is {1, +, step}, we can directly use the method implemented in this patch. However, such cases are less common, so we decided to focus on handling signed IV first.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1089

My bad. This maybe the note I leave when I implemented this patch. Will Remove it.

llvm/test/Transforms/LoopVectorize/iv-select-cmp-no-wrap.ll
1 ↗(On Diff #526982)

Sure. Will do.

artagnon added inline comments.Jun 14 2023, 9:05 AM
llvm/lib/Analysis/IVDescriptors.cpp
613–633

Please update these comments.

661

Why not split this out into its own function?

663–671

We should forbid these cases in the first place instead of checking them.

674

Doesn't the InstDesc tell you if it's signed or unsigned?

677–679

Why is this necessary? Even if it isn't the min signed value, we should codegen just fine, as the codegen just involves applying a mask and applying max/min-reduce.

681–682

getStepRecurrence() from the SCEV AddRec, instead of relying on isInductionPhi()?

llvm/lib/Transforms/Utils/LoopUtils.cpp
1090

Personally, I prefer straight-line codegen as I've done, but I'm probably biased.

artagnon added inline comments.Jun 14 2023, 9:11 AM
llvm/include/llvm/Analysis/IVDescriptors.h
55–57

Maybe have a max in the name to make it clear that we only do max-reductions?

shiva0217 added inline comments.Jun 21 2023, 1:18 AM
llvm/include/llvm/Transforms/Utils/LoopUtils.h
361–364

or an increasing loop induction variable?

llvm/lib/Transforms/Utils/LoopUtils.cpp
1164

It might be worth to have a comment to describe that the SelectIVICmp and SelectIVFCmp code generation will use Identity value to determine the if condition in the following case has ever been true.

int r = 331;
for (int i = 0; i < n; i++)
  if (src[i] > 111)
    r = i;

When the reduction value(Rdx) equal to the Identity value(Iden), it reveals the condition never been true. So it will select the InitVal.

It might be the reason that in IsIncreasingLoopInduction, the function will check the IV start value to avoid the IV overlapping the Identity value.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4003

Could the function rename to createSelectIVCmpTargetReduction and be called from createSelectCmpTargetReduction?
Perhaps CreateIntMaxReduce can be moved to the function?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1265

Perhaps a comment to describe that SelectIVICmp and SelectIVFCmp will initial the reduction PHI with Iden and createSentinelValueHandling will use Iden to determine the if condition in the loop has ever been true?

Ayal added a comment.Jun 22 2023, 4:00 AM

An important step forward - adding some general thoughts and terminology, going (admittedly nearly two years) back to D108136.

This FindLast compound pattern combines two header phi's: an induction and a reduction. The reduction is a "selecting reduction" which choses one element from the reduced set, or none. The induction is recorded to return the index of the (last) element found. The induction must be monotone and have some out-of-bounds value, which this patch ensures by restricting to signed inductions that start at non-min-signed-value and increase w/o wrapping - restrictions which can be lifted.

Reducing a collection C of values into one value v can be classified as either a "selecting reduction" or a "combining reduction", depending on whether v is always a member of C or not, respectively. Min and max reductions are selecting reductions whereas add (including fmuladd), multiply, or, and, xor are in general combining reductions.
When C is a collection of boolean values, the selecting reductions "max" and "min" practically compute Any and All, respectively, as in std::any_of() and std::all_of(). Support for boolean selecting reductions was introduced in D108136, which should arguably be called [I|F]Any rather than Select[I|F]Cmp. The boolean values produced by Integer/Float Compares such as "(src[i] > 3)" or "(a[i] > b[i])" are essentially being "max" reduced; any desired pair of invariant return values can be set/selected after determining the outcome of the boolean reduction.
Note that an Any reduction can terminate once "true" is encountered, similar to a general max/min reduction encountering max/min-value.

A selecting reduction could report the index of the value reduced in addition to the value itself. If the reduced value appears multiple times, the index of the first or last appearance can be reported. Tests for such MinLast cases, aka argmin, were introduced in 4f04be564907f, and are yet to be vectorized by LV - hope this patch helps us get there! These are compound patterns combining three header phi's: an induction and two reductions.

A boolean selecting max/Any reduction reporting the index is typically interested in the index only if the reduced value is 1/"true", otherwise the index is obvious.
This patch deals with boolean selecting Any reductions that report the index of the (last) value reduced, provided it is "true", and may be called FindLast, as in std::find_if() being FindFirst.

When vectorizing and/or unroll-and-interleaving a selecting reduction with index, the indices of multiple candidates need to be compared to determine which is first (or last), during the reduction epilog (for in-loop reductions this is trivial). This comparison requires the indices to be monotone, i.e., to avoid wrapping.
When dealing with Any reductions with index of "true" values, the indicator that a "true" value was encountered can be folded together with the index found so far (of a true value) by using an "invalid" out-of-bounds index - preferably smaller than first iteration for FindLast or larger than last iteration for FindFirst. Such values are overwritten naturally by the (valid) index of any "true" value, when selecting the first or last index. This reduces the pattern to consider a single reduction (of the combined index+indicator value) rather than the two reductions of the general MaxLast case (of index and value).

This FindLast patch currently meets these requirements by restricting to indices that are increasing, signed, and start from a non-min-signed value. It seems unnatural for such indices to wrap, or if PSCEV guards against AddRec wrapping in general(?), but even if an index may wrap and/or does not provide desired out-of-bounds values, a designated IV counting vector iterations could be used from which the original indices can later be reconstructed in the epilog and reduced. Such an IV is immune to wrapping and provides out-of-bound values. This is one of several possible ways to lift these restrictions.

Note that Any reductions reporting the first index can terminate once "true" is encountered, but seem more cumbersome to write (w/o a break), e.g.,:

// FindFirst w/o break.
int red = ii;
int red_set = false;
for (int i = 0; i < n; ++i)
  if (a[i] > b[i]) {
    red = red_set ? red : i;
    red_set = true;
  }

instead of

// FindLast.
int red = ii;
for (int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;

A FindLast loop could be optimized into a FindFirst one by reversing the loop.

llvm/test/Transforms/LoopVectorize/select-min-index.ll
89

This test now gets vectorized, being a FindLast loop that reports the last index where a[i] < a[i-1]+1, or zero if none are found. (I.e., proving that a sequence is not strictly increasing, rather than computing MinLast.)
But the vector loop is never reached?

Mel-Chen updated this revision to Diff 535225.Jun 27 2023, 8:42 PM
Mel-Chen marked 2 inline comments as not done.
Mel-Chen retitled this revision from [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable to [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable.

Changes:

  • Remove the redundant comment. (fhahn's comment)
  • Separate testing and implementation. (fhahn's comment)
Mel-Chen marked 2 inline comments as done.Jun 28 2023, 2:52 AM
Mel-Chen added inline comments.
llvm/include/llvm/Analysis/IVDescriptors.h
55–57

It has some functional overlap with max reduction, but it is not exactly the same as max reduction.
If we were to rename it, the change I suggest is:
SelectICmp --> SelectInvICmp
SelectFCmp --> SelectInvFCmp
SelectIVICmp --> SelectIncIVICmp
SelectIVFCmp --> SelectIncIVFCmp
This will also make it easier for you to expand it in the future.

llvm/lib/Analysis/IVDescriptors.cpp
613–633

Nice catch! Will update it.

661

I'm also think about this because I personally feel that this function is a bit long.
Do you think it's better to put it here as a static function, or put it in another file?

663–671

I do not understand. I believe this series of checks already forbids these cases.

674

No, it does not provide signed or unsigned information.

677–679

This is complicated, let me explain why this pattern requires this check.
It is related to our goals: 1) We want to achieve the reduction with one reduce intrinsic. 2) We want to use a static sentinel value.

About 1), , I understand that even without checking the boundaries or ensuring the select operand is increasing or decreasing, we could still vectorize it by using two reductions. However, the most common format we encounter for induction variables is {0, +, 1}. If this IV is signed, we can accomplish the reduction in one reduction. Therefore, this is based on performance considerations.

As for 2), it follows the decision made in 1), which requires us to have a sentinel value. In the case of an increasing IV, the only restriction on the sentinel value is that it must be less than the start value of the IV. In the case of {0, +, 1}, the sentinel value can be -1 or a smaller value, i.e. dynamic sentinel value. However, for easier implementation, we finally decided to use a static sentinel value, which is the minimum value of the data type.

Additional description: If we were to use a dynamic sentinel value, it would involve checking whether the data type can represent IV start value - step (in the case of {0, +, 1}, it would be -1). However, within the LLVM framework, I currently don't have a way to implement a version with a dynamic sentinel value, which is a room for potential improvement.

681–682

Why? What are the shortcomings that make you think isInductionPHI should not be used?
In my mind, better to rely on the existing function rather than trying to reimplement it.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1090

I don't get it. Only one I need is creating a signed max reduce.

Mel-Chen added inline comments.Jun 28 2023, 8:49 PM
llvm/include/llvm/Transforms/Utils/LoopUtils.h
361–364

Good find! Will correct it.

llvm/lib/Transforms/Utils/LoopUtils.cpp
1164

Sure, will do.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4003

I'm afraid I won't be able to meet this requirement. Placing createSentinelValueHandling in this position is for handling the case when the vector width is 1. You could refer to CHECK-VF1IC4 in the test cases and focus on the middle.block. In implementation, VF1IC4 doesn't call createTargetReduction, but ReducedPartRdx still need to be did the sentinel value fixing.

However, perhaps we can create a new bool function for RK == RecurKind::SelectIVICmp || RK == RecurKind::SelectIVFCmp. This will most likely expand further and cause the if-condition to become too long.

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1265

Sure, will do.

A selecting reduction could report the index of the value reduced in addition to the value itself. If the reduced value appears multiple times, the index of the first or last appearance can be reported. Tests for such MinLast cases, aka argmin, were introduced in 4f04be564907f, and are yet to be vectorized by LV - hope this patch helps us get there! These are compound patterns combining three header phi's: an induction and two reductions.

Yes, this patch was separated from the D143465 based on @fhahn's suggestion. D143465 still need to be refined, so it hasn't been invited to more reviewers yet.

This FindLast patch currently meets these requirements by restricting to indices that are increasing, signed, and start from a non-min-signed value. It seems unnatural for such indices to wrap, or if PSCEV guards against AddRec wrapping in general(?), but even if an index may wrap and/or does not provide desired out-of-bounds values, a designated IV counting vector iterations could be used from which the original indices can later be reconstructed in the epilog and reduced. Such an IV is immune to wrapping and provides out-of-bound values. This is one of several possible ways to lift these restrictions.

This comment inspired me deeply. Let me share my thoughts and plans regarding the select-cmp reduction pattern (referred to as [I|F]Any mentioned in your comment) .

I believe that the select-cmp reduction pattern can be classified into several types based on the selecting variable. Currently, I have categorized them as follows:

  1. Select operand is a loop invariant, i.e., Select[I|F]Cmp. This has already been implemented in the D108136 by @david-arm.
  1. Select operand is a monotonic increasing/decreasing induction variable, and the start value of the induction variable is not equal to the minimum/maximum value of the data type. This patch handles the case of signed increasing induction variables, while the case of decreasing induction variables is yet to be implemented. The decision to only handle signed variables depends on LLVM's design, the issue including the choice of sentinel values, and the selection of umax|smax reduction intrinsics. If the compiler architecture allows distinguishing between signed and unsigned, the unsigned induction variable case should be easily achievable.
  1. Select operand is a monotonic increasing/decreasing induction variable, and there are no restrictions on the start value of the induction variable.
unsigned int red = start_value;
for (unsigned int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;
  1. Select operand is an any variable.
int red = start_value;
for (int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? c[i] : red;

Both 1) and 2) can be handled with a single reduction. On the other hand, 3) and 4) are more complex, and require two reductions to be completed.

Although all select-cmp reduction patterns can be vectorized using the vectorization approach in 4), for performance, I believe that the cases in 1) and 2) should be handled with a single reduction first. Therefore, when identifying and classifying the RecurKind for select-cmp reduction patterns, it is preferable to first consider whether they can be handled with cases 1) or 2), and then consider whether cases 3) or 4) need to be applied.

Next, let's discuss cases 3) and 4), which have not been implemented yet.

For case 3), I currently have two approaches to solve it. The first approach is to perform reduction not only on the select part, but also on the boolean value of the cmp operation.

unsigned int red = start_value;
vec_bool cmp_red_part = splat(false);
vec_unsigned_int select_red_part = splat(DTypeMin);
vec_unsigned_int step_vec = {0, 1, 2, ...};
for (unsigned int i = 0; i < n; i+=vl) {
  cmp_red_part = cmp_red_part | (vec_a[i] > vec_b[i]);
  select_red_part = (vec_a[i] > vec_b[i]) ? step_vec: select_red_part;
  step_vec += {vl, vl, vl, ...};
}
bool cmp_red = reduce.or(cmp_red_part);
red = cmp_red ? reduce.smax|umax(select_red_part) : start_value;

The second approach is to directly use the vectorization approach in 4) to vectorize case 3).

int red = start_value;
vec_unsigned_int iter_red_part = splat(0);
vec_unsigned_int red_part = splat(start_value);
vec_unsigned_int step_vec = {0, 1, 2, ...};
for (int i = 0; i < n; i+=vl) {
  iter_red_part = (vec_a[i] > vec_b[i]) ? step_vec : iter_red_part;
  red_part = (vec_a[i] > vec_b[i]) ? vec_c[i] : red_part;
  step_vec += {vl, vl, vl, ...};
}
unsigned int iter_red = reduce.umax(iter_red_part);
mask_bool red_mask = (iter_red_part == splat(iter_red));
red = reduce.or(red_part, red_mask);  // unsure about which reduction operation would be best for the extracting the result at the position red_mask indicated so far

Both approaches require two reductions, and one of the reductions will be a reduction phi that does not appear in the original user code. In other words, the vectorizer needs to have the capability to create a new reduction phi.

These are my thoughts on the select-cmp reduction pattern so far.

Note that Any reductions reporting the first index can terminate once "true" is encountered, but seem more cumbersome to write (w/o a break), e.g.,:

// FindFirst w/o break.
int red = ii;
int red_set = false;
for (int i = 0; i < n; ++i)
  if (a[i] > b[i]) {
    red = red_set ? red : i;
    red_set = true;
  }

instead of

// FindLast.
int red = ii;
for (int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;

A FindLast loop could be optimized into a FindFirst one by reversing the loop.

Interesting, I haven't thought about FindFirst yet. If it includes a break statement, it will be another long story - uncountable loop vectorization. Although I haven't deeply considered the FindFirst case, I still have some rough ideas to share.

Perhaps we can simplify the FindFirst w/o break example to:

// FindFirst w/o break.
int red = ii;
int red_set = false;
for (int i = 0; i < n; ++i) {
  if ((a[i] > b[i]) && !red_set)   // reduction 1
    red = i;
  if (a[i] > b[i]) // reduction 2
    red_set = true;
}

In this way, we can clearly see that there are two reductions involved, and the result of one reduction will be masked by the result of the other reduction. This is very interesting, and may similar with the pattern in D143465.
If we can transform the code into:

// FindFirst w/o break.
int red = ii;
int red_set = false;
for (int i = 0; i < n; ++i){
  if ((a[i] > b[i]) && !red_set)   // reduction 1
    red = i;
  red_set = red_set | (a[i] > b[i]);  // reduction 2
}

, perhaps it will lead to better optimization results.

llvm/test/Transforms/LoopVectorize/select-min-index.ll
89

Impressive catch!
We have been focusing only on the vector.body and ignoring the others. I will prioritize clarifying this bug and fixing it as soon as reasonable.

Ayal added a comment.Jul 2 2023, 3:07 PM

[snip]

I believe that the select-cmp reduction pattern can be classified into several types based on the selecting variable. Currently, I have categorized them as follows:

Would be good to try and find more accurate names than using Select*Cmp combinations. A Compare-Select pattern is also used in the existing Min/Max reduction, but has a much better name.

  1. Select operand is a loop invariant, i.e., Select[I|F]Cmp. This has already been implemented in the D108136 by @david-arm.

and should be renamed [I|F]Any or something else more accurate. The two invariant operands should be sunk and selected after the loop, according to the outcome if "any" were found or not.

  1. Select operand is a monotonic increasing/decreasing induction variable, and the start value of the induction variable is not equal to the minimum/maximum value of the data type. This patch handles the case of signed increasing induction variables, while the case of decreasing induction variables is yet to be implemented. The decision to only handle signed variables depends on LLVM's design, the issue including the choice of sentinel values, and the selection of umax|smax reduction intrinsics. If the compiler architecture allows distinguishing between signed and unsigned, the unsigned induction variable case should be easily achievable.
  1. Select operand is a monotonic increasing/decreasing induction variable, and there are no restrictions on the start value of the induction variable.
unsigned int red = start_value;
for (unsigned int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;
  1. Select operand is an any variable.
int red = start_value;
for (int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? c[i] : red;

Both 1) and 2) can be handled with a single reduction. On the other hand, 3) and 4) are more complex, and require two reductions to be completed.

Arguably, (2), (3) and (4) are essentially all FindLast reductions, interested in the last loop iteration i for which some predicate p(i) such as a[i]-b[i]>0 holds, plus an indicator if no such iteration was found, followed by some post-processing of these results: in (2), (3) and (4), if no such iteration was found then some invariant "start_value" is returned, regardless if it was originally out-of-bounds or not. In (4), in addition, if a loop iteration i was found satisfying p(i), some f(i) computation of the last such i should be returned, as in c[i]. This is analogous to sinking the invariants of case (1), and may indeed be more elaborate, but also covers simpler cases such as any (other) AddRec/IV that can be evaluated given i, e.g.,: red = (a[i] > b[i]) ? 3*i+8 : red; - so chose any IV you prefer, e.g., one which has the desired no-wrap plus out-of-bound value, even if the original one does not.

In any case, it should be helpful to distill what actually needs to be maintained throughout the reduction loop, even if more appear there originally; be it boolean indicators in Any reductions (cmp_red_part in your example), indices in Find reductions (select_red_part in your example initialized with "unfound" out-of-bound indicators), or compound value+index in e.g. min/max-with-index reductions. The compiler can surely and does introduce new phi's as needed, hopefully having minimal width, but could also try to eliminate existing phi's and reduce the number of values that are live-out of the loop, possibly at the cost of replicating code, e.g., if c[i] is also used inside the loop.

Here's a sketch minimizing the size of the indices maintained throughout the loop, so they would avoid wrapping, provide out-of-bound values, and possibly use narrower types depending on trip-count and vl:

return_type FindLast(return_type unfound_value, vec_predicate_func, found_func) {
  vec_unsigned_int select_red_part = splat(0); // Zero indicates unfound.
  vec_unsigned_int step_vec = splat(1); // Count vector iterations starting at 1.

  for (unsigned int i = 0; i < n; i+=vl, step_vec+=splat(1))
    select_red_part = (vec_predicate_func(i) ? step_vec : select_red_part;

  unsigned vec_indices_ored = reduce.or(select_red_part);
  if (vec_indices_ored == 0)
    return unfound_value;
  unsigned inflated_red_part = (select_red_part - splat(1)) * vl + <0,1,...,vl-1>;
  unsigned last_index = reduce.umax(inflated_red_part);
  return found_func(last_index);
}

Regarding FindFirst, indeed the natural way of writing it with a break would provide the compiler with an uncountable loop that is harder to vectorize due to speculative execution, and so it is natural to start with FindLast. But if written as a countable loop the compiler might be able to vectorize and optimize it by introducing a break. As in an [I|F]Any countable loop that if free of any other side-effects, or a FindLast loop that can be reversed into a FindFirst loop moving backwards and breaking on first finding.

Mel-Chen updated this revision to Diff 536986.Jul 4 2023, 1:34 AM

Changes:

  • Fix the test cases, D154415. (Ayal's comment)
Mel-Chen marked an inline comment as done.Jul 4 2023, 1:41 AM
Mel-Chen added inline comments.
llvm/test/Transforms/LoopVectorize/select-min-index.ll
89

Clarified, it has been confirmed that this is not a bug.
The reason is that the loop trip count in the test case is 0, causing the simplification of min.iters.check to be true. The test case has been fixed in D154415.

Mel-Chen updated this revision to Diff 537231.Jul 4 2023, 11:43 PM

Changes:

  • Update comments. (Artagnon and Shiva's comments)

Would be good to try and find more accurate names than using Select*Cmp combinations. A Compare-Select pattern is also used in the existing Min/Max reduction, but has a much better name.

Sure, perhaps @artagnon can join the discussion as well. Let me share my experience first: In GCC, I have seen a classification called ExtractLast, which has a similar semantics to what you mentioned as FindLast. Welcome further input and opinions.

  1. Select operand is a loop invariant, i.e., Select[I|F]Cmp. This has already been implemented in the D108136 by @david-arm.

and should be renamed [I|F]Any or something else more accurate. The two invariant operands should be sunk and selected after the loop, according to the outcome if "any" were found or not.

How about following the C++ STL, renaming it to [I|F]AnyOf? What do you think, @david-arm?

Here's a sketch minimizing the size of the indices maintained throughout the loop, so they would avoid wrapping, provide out-of-bound values, and possibly use narrower types depending on trip-count and vl:

return_type FindLast(return_type unfound_value, vec_predicate_func, found_func) {
  vec_unsigned_int select_red_part = splat(0); // Zero indicates unfound.
  vec_unsigned_int step_vec = splat(1); // Count vector iterations starting at 1.

  for (unsigned int i = 0; i < n; i+=vl, step_vec+=splat(1))
    select_red_part = (vec_predicate_func(i) ? step_vec : select_red_part;

  unsigned vec_indices_ored = reduce.or(select_red_part);
  if (vec_indices_ored == 0)
    return unfound_value;
  unsigned inflated_red_part = (select_red_part - splat(1)) * vl + <0,1,...,vl-1>;
  unsigned last_index = reduce.umax(inflated_red_part);
  return found_func(last_index);
}

If we focus on removing the wrapping and bound restrictions, I think we can consider the approach proposed by @artagnon in D152693. This method cleverly extends the technique used by @david-arm in SelectICmp. The approach can be summarized as follows:
Consider the loop:

unsigned int red = start_value;
for (unsigned int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;

vectorize to:

unsigned int red = start_value;
vec_unsigned_int red_part = splat(start_value);
vec_unsigned_int step_vec = {0, 1, 2, ...};
for (unsigned int i = 0; i < n; i+=vl) {
  red_part = (vec_a[i] > vec_b[i]) ? step_vec : red_part;
  step_vec += {vl, vl, vl, ...};
}
vec_bool ne_start_value = red_part != splat(start_value);
bool may_update = reduce.or(ne_start_value);
vec_unsigned_int masked_red_part = ne_start_value ? red_part : splat(DataTypeMin);
red = may_update ? reduce.smax|umax(masked_red_part) : start_value;

While the conditions checked in this patch are more strict, I believe both approaches should coexist. In general, the IR generated by this patch should have better performance in the same case. Therefore, it should be prioritized when possible. However, when the cases that cannot be handled by this patch, we can apply the approach in D152693.
In addition, there is still room for optimization in this patch. We usually face source code like this:

j = -1;
for (int i = 0; i < n; i++) {
    if (a[i] < b[i]) {
        j = i;
    }
}

When the start value of the reduction is a known constant and is known to be smaller than the start value of the increasing induction variable, we may not even need to use a sentinel value. Simply using the reduce max operation would suffice.

Update: I could found an example where the approach in D152693 lead to incorrect result:
Assuming start_value is 3, and red_part is {0, 1, 2, 3} in the end. If the 3 is updated from the loop, not from the start_value , red should be 3 instead of 2.

@artagnon, could you please help to verify it?

If we focus on removing the wrapping and bound restrictions, I think we can consider the approach proposed by @artagnon in D152693. This method cleverly extends the technique used by @david-arm in SelectICmp. The approach can be summarized as follows:
Consider the loop:

unsigned int red = start_value;
for (unsigned int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;

vectorize to:

unsigned int red = start_value;
vec_unsigned_int red_part = splat(start_value);
vec_unsigned_int step_vec = {0, 1, 2, ...};
for (unsigned int i = 0; i < n; i+=vl) {
  red_part = (vec_a[i] > vec_b[i]) ? step_vec : red_part;
  step_vec += {vl, vl, vl, ...};
}
vec_bool ne_start_value = red_part != splat(start_value);
bool may_update = reduce.or(ne_start_value);
vec_unsigned_int masked_red_part = ne_start_value ? red_part : splat(DataTypeMin);
red = may_update ? reduce.smax|umax(masked_red_part) : start_value;

Would be good to try and find more accurate names than using Select*Cmp combinations. A Compare-Select pattern is also used in the existing Min/Max reduction, but has a much better name.

Sure, perhaps @artagnon can join the discussion as well. Let me share my experience first: In GCC, I have seen a classification called ExtractLast, which has a similar semantics to what you mentioned as FindLast. Welcome further input and opinions.

I agree with @Ayal: I also got confused by the SelectCmp naming convention initially. Perhaps [I|F]AnyOfInv, [I|F]AnyOfInc, and [I|F]AnyOfDec, and rename the corresponding function isAnyOfReduction? I don't have a strong preference for the rename.

If we focus on removing the wrapping and bound restrictions, I think we can consider the approach proposed by @artagnon in D152693. This method cleverly extends the technique used by @david-arm in SelectICmp. The approach can be summarized as follows:
Consider the loop:

unsigned int red = start_value;
for (unsigned int i = 0; i < n; ++i)
  red = (a[i] > b[i]) ? i : red;

vectorize to:

unsigned int red = start_value;
vec_unsigned_int red_part = splat(start_value);
vec_unsigned_int step_vec = {0, 1, 2, ...};
for (unsigned int i = 0; i < n; i+=vl) {
  red_part = (vec_a[i] > vec_b[i]) ? step_vec : red_part;
  step_vec += {vl, vl, vl, ...};
}
vec_bool ne_start_value = red_part != splat(start_value);
bool may_update = reduce.or(ne_start_value);
vec_unsigned_int masked_red_part = ne_start_value ? red_part : splat(DataTypeMin);
red = may_update ? reduce.smax|umax(masked_red_part) : start_value;

While the conditions checked in this patch are more strict, I believe both approaches should coexist. In general, the IR generated by this patch should have better performance in the same case. Therefore, it should be prioritized when possible. However, when the cases that cannot be handled by this patch, we can apply the approach in D152693.

Yes, the IR generated by this patch is quite optimized, and I suppose we can fallback to code-gen like in D152693 as a follow-up.

In addition, there is still room for optimization in this patch. We usually face source code like this:

j = -1;
for (int i = 0; i < n; i++) {
    if (a[i] < b[i]) {
        j = i;
    }
}

When the start value of the reduction is a known constant and is known to be smaller than the start value of the increasing induction variable, we may not even need to use a sentinel value. Simply using the reduce max operation would suffice.

I wouldn't worry about this for now. The patch already looks pretty good in its current state, and I think with the renaming that @Ayal proposed, it should be ready to land.

llvm/lib/Analysis/IVDescriptors.cpp
661

A static function would be nice.

Update: I could found an example where the approach in D152693 lead to incorrect result:
Assuming start_value is 3, and red_part is {0, 1, 2, 3} in the end. If the 3 is updated from the loop, not from the start_value , red should be 3 instead of 2.

@artagnon, could you please help to verify it?

Yes, I can confirm that there is indeed a bug. Thanks for catching it! I'm thinking about a fix now.

Would be good to try and find more accurate names than using Select*Cmp combinations. A Compare-Select pattern is also used in the existing Min/Max reduction, but has a much better name.

I agree with @Ayal: I also got confused by the SelectCmp naming convention initially. Perhaps [I|F]AnyOfInv, [I|F]AnyOfInc, and [I|F]AnyOfDec, and rename the corresponding function isAnyOfReduction? I don't have a strong preference for the rename.

My opinion is:
Select[I|F]Cmp --> [I|F]AnyOf
SelectIV[I|F]Cmp --> [I|F]FindLastIV or [I|F]FindLastIncIV

About the function name, I lean towards not renaming it, or renaming it to more easily understandable names like conditional select, conditional assignment, and so on.

Yes, I can confirm that there is indeed a bug. Thanks for catching it! I'm thinking about a fix now.

Perhaps Ayal's approach can be helpful. The final found_func allows the approach to be more widely applicable. If we only focus on increasing/decreasing induction variables, we may be able to further streamline the process.

Here's a sketch minimizing the size of the indices maintained throughout the loop, so they would avoid wrapping, provide out-of-bound values, and possibly use narrower types depending on trip-count and vl:

return_type FindLast(return_type unfound_value, vec_predicate_func, found_func) {
  vec_unsigned_int select_red_part = splat(0); // Zero indicates unfound.
  vec_unsigned_int step_vec = splat(1); // Count vector iterations starting at 1.

  for (unsigned int i = 0; i < n; i+=vl, step_vec+=splat(1))
    select_red_part = (vec_predicate_func(i) ? step_vec : select_red_part;

  unsigned vec_indices_ored = reduce.or(select_red_part);
  if (vec_indices_ored == 0)
    return unfound_value;
  unsigned inflated_red_part = (select_red_part - splat(1)) * vl + <0,1,...,vl-1>;
  unsigned last_index = reduce.umax(inflated_red_part);
  return found_func(last_index);
}
shiva0217 added inline comments.Jul 11 2023, 8:23 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4003

Thanks for the explanation!

Could we use "} else if ((!VF.isVector() && !PhiR->isInLoop()))" to guard the generation?

It could be easier to understand the codegen is needed when VF is not a vector and createTargetReduction won't be invoked.

Should we rename createSentinelValueHandling as createSelectInitValOrReduction?

I feel it could reflect the codegen but in a less strong opinion.

Mel-Chen updated this revision to Diff 542324.Jul 19 2023, 11:48 PM

Changes:

  • Rename Select[I|F]Cmp to [I|F]AnyOf
  • Rename SelectIV[I|F]Cmp to [I|F]FindLastIV

Hi Mel,

I had the chance to try out your patch today, and it seems that it's really quite restricted in its scope. I tried it out on this simple C program:

#include <stdio.h>

int main() {
      int src[20000] = {4, 5, 2};
      int r = 331;
      for (int i = 0; i < 20000; i++) {
        if (src[i] > 3)
          r = i;
      }
      printf("%d\n", r);
      return 0;
}

Unfortunately, your patch failed to vectorize this simple case, because when you match NonPhi against a PHINode, it is blocked on a trunc:

for.body:                                         ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %r.05 = phi i32 [ 331, %entry ], [ %spec.select, %for.body ]
  %arrayidx = getelementptr inbounds [20000 x i32], ptr %src, i64 0, i64 %indvars.iv
  %3 = load i32, ptr %arrayidx, align 4, !tbaa !6
  %cmp1 = icmp sgt i32 %3, 3
  %4 = trunc i64 %indvars.iv to i32 // blocked here
  %spec.select = select i1 %cmp1, i32 %4, i32 %r.05
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 20000
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !10

May I suggest using the logic outlined in isInductionMinMaxPattern in D152693?

Thanks.

May I suggest using the logic outlined in isInductionMinMaxPattern in D152693?

It's not that simple, as you'd lose the information about hasNoSignedWrap from the AR of the trunc. I've also been playing with attempting to extend this patch to the decreasing IV case, and it seems we've painted ourselves into a corner by being more concerned about the codegen performance than the generality of the patch: the constant IV start is a serious limitation when looking at the decreasing IV case. I think other reviewers like @fhahn and @Ayal have also expressed interest in greater generality (perhaps at the cost of codegen performance).

Mel-Chen added a comment.EditedJul 25 2023, 1:07 AM

May I suggest using the logic outlined in isInductionMinMaxPattern in D152693?

It's not that simple, as you'd lose the information about hasNoSignedWrap from the AR of the trunc. I've also been playing with attempting to extend this patch to the decreasing IV case, and it seems we've painted ourselves into a corner by being more concerned about the codegen performance than the generality of the patch: the constant IV start is a serious limitation when looking at the decreasing IV case. I think other reviewers like @fhahn and @Ayal have also expressed interest in greater generality (perhaps at the cost of codegen performance).

Yes, this issue is known, which is why I left a TODO in the summary.
I made the following modifications before:

   auto IsIncreasingLoopInduction = [&SE, &Loop](Value *V) {
+    // FIXME: Should focus on SExt and Trunc only?
+    if (auto *Cast = dyn_cast<CastInst>(V))
+      V = Cast->getOperand(0);
+
     auto *Phi = dyn_cast<PHINode>(V);

However, this approach is not rigorous.
Regardless of whether the nsw flag can still be used to determine whether to use signed max or unsigned max, proving whether the truncated induction variable is monotonically increasing is already a challenge.
Consider the following scenario:

previous step: i64:0000000000000000000000000000000001111111111111111111111111111111 -> i32:01111111111111111111111111111111
add one
current step: i64:0000000000000000000000000000000010000000000000000000000000000000 -> i32:10000000000000000000000000000000

I have discussed this issue with my colleagues several times, and here are some directions we came up with:

  1. Use SCEV overflow check. There is an overflow check generator in PSE that we can use, but it seems to be designed for checking the type of the original induction variable, not the truncated type. Using this approach would require implementing a new check generator and inserting it in the preheader.
  1. Prevent the generation of truncated induction variables. In the IndVarSimplifyPass, there is an option called -indvars-widen-indvars https://llvm.org/doxygen/IndVarSimplify_8cpp.html#aea2c111bf1f82fd672acaad1931e7e2d. This transformation widens the type of the induction variable to reduce the number of sign/zero-extends. However, if there are users in the loop that require the narrower original type, truncation will occur. Perhaps we can make some adjustments in this area.
Before IndVarSimplifyPass:
; Preheader:
  for.body.preheader:                               ; preds = %entry
    br label %for.body

  ; Loop:
  for.body:                                         ; preds = %for.body.preheader, %for.body
    %i.011 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
    %idx.010 = phi i32 [ %cond, %for.body ], [ %ii, %for.body.preheader ]
    %idxprom = zext i32 %i.011 to i64
    %arrayidx = getelementptr inbounds i64, ptr %a, i64 %idxprom
    %0 = load i64, ptr %arrayidx, align 8, !tbaa !4
    %arrayidx2 = getelementptr inbounds i64, ptr %b, i64 %idxprom
    %1 = load i64, ptr %arrayidx2, align 8, !tbaa !4
    %cmp3 = icmp sgt i64 %0, %1
    %cond = select i1 %cmp3, i32 %i.011, i32 %idx.010
    %inc = add nuw nsw i32 %i.011, 1 
    %cmp = icmp slt i32 %inc, %n
    br i1 %cmp, label %for.body, label %for.cond.cleanup.loopexit, !llvm.loop !8

  ; Exit blocks
  for.cond.cleanup.loopexit:                        ; preds = %for.body
    %cond.lcssa = phi i32 [ %cond, %for.body ]
    br label %for.cond.cleanup

After IndVarSimplifyPass:
; Preheader:
  for.body.preheader:                               ; preds = %entry
    %wide.trip.count = zext i32 %n to i64
    br label %for.body

  ; Loop:
  for.body:                                         ; preds = %for.body.preheader, %for.body
    %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
    %idx.010 = phi i32 [ %cond, %for.body ], [ %ii, %for.body.preheader ]
    %arrayidx = getelementptr inbounds i64, ptr %a, i64 %indvars.iv
    %0 = load i64, ptr %arrayidx, align 8, !tbaa !4
    %arrayidx2 = getelementptr inbounds i64, ptr %b, i64 %indvars.iv
    %1 = load i64, ptr %arrayidx2, align 8, !tbaa !4
    %cmp3 = icmp sgt i64 %0, %1
    %2 = trunc i64 %indvars.iv to i32
    %cond = select i1 %cmp3, i32 %2, i32 %idx.010
    %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
    %exitcond = icmp ne i64 %indvars.iv.next, %wide.trip.count
    br i1 %exitcond, label %for.body, label %for.cond.cleanup.loopexit, !llvm.loop !8

  ; Exit blocks
  for.cond.cleanup.loopexit:                        ; preds = %for.body
    %cond.lcssa = phi i32 [ %cond, %for.body ]
    br label %for.cond.cleanup
  1. Only target induction variables that determine the branch condition in the loop latch. With this approach, we can indirectly determine whether the induction variable is signed using loop guards, and directly perform vectorization according to the signed overflow is undefined behavior.
Signed IV loop guard:
entry:
  %cmp9 = icmp sgt i32 %n, 0
  br i1 %cmp9, label %for.body.preheader, label %for.cond.cleanup

Unsigned IV loop guard:
entry:
  %cmp9.not = icmp eq i32 %n, 0
  br i1 %cmp9.not, label %for.cond.cleanup, label %for.body.preheader

Currently, I personally prefer option 2, but since the performance impact of this modification is uncertain, our final decision is to proceed with option 3.

And of course, the worst-case would be to use a general conditional selecting reduction method to solve the problem.

Thanks for the detailed thoughts, Mel! My initial reaction is that (2) is probably unworkable due to the fallout, and that (3) isn't general enough. I'm leaning towards trying out (1) -- I'll see how that works out over the next few days.

Joe added a subscriber: Joe.Jul 28 2023, 3:48 AM
Joe added a comment.Jul 28 2023, 8:18 AM
  1. Only target induction variables that determine the branch condition in the loop latch. With this approach, we can indirectly determine whether the induction variable is signed using loop guards, and directly perform vectorization according to the signed overflow is undefined behavior.

@Mel-Chen I like this option. But I don't think I understand fully - what's the purposes of determining whether the induction variable is signed?

To me, the problem stems from the loss of information that the IV had 32-bit NSW/NUW when IndVarSimplify widens it. Just thinking about alternative options, what if we could retain that information in loop metadata?

Thanks for the detailed thoughts, Mel! My initial reaction is that (2) is probably unworkable due to the fallout, and that (3) isn't general enough. I'm leaning towards trying out (1) -- I'll see how that works out over the next few days.

Sounds good! Before you start implementing, I suggest observing the benchmarks to gather more cases. We choose option 3 because, based on our available benchmarks, all the opportunities for the FindLastIV pattern arise from induction variables that determine the branch condition at the loop latch. Therefore, we decided to go with option 3, as it is an implementation that is not too challenging and provides effective results

  1. Only target induction variables that determine the branch condition in the loop latch. With this approach, we can indirectly determine whether the induction variable is signed using loop guards, and directly perform vectorization according to the signed overflow is undefined behavior.

@Mel-Chen I like this option. But I don't think I understand fully - what's the purposes of determining whether the induction variable is signed?

We need to determine whether we should use umax or smax in the middle.block. Currently, we have only implemented smax because the application of {1,+,step} unsigned induction variables is relatively uncommon.

To me, the problem stems from the loss of information that the IV had 32-bit NSW/NUW when IndVarSimplify widens it. Just thinking about alternative options, what if we could retain that information in loop metadata?

Yes, this is due to the loss of information. If you would like to use metadata, would it be attached to the step instruction of the induction variable?

Joe added a comment.Jul 31 2023, 3:12 AM
  1. Only target induction variables that determine the branch condition in the loop latch. With this approach, we can indirectly determine whether the induction variable is signed using loop guards, and directly perform vectorization according to the signed overflow is undefined behavior.

@Mel-Chen I like this option. But I don't think I understand fully - what's the purposes of determining whether the induction variable is signed?

We need to determine whether we should use umax or smax in the middle.block. Currently, we have only implemented smax because the application of {1,+,step} unsigned induction variables is relatively uncommon.

That makes sense, thanks for clarifying :)

To me, the problem stems from the loss of information that the IV had 32-bit NSW/NUW when IndVarSimplify widens it. Just thinking about alternative options, what if we could retain that information in loop metadata?

Yes, this is due to the loss of information. If you would like to use metadata, would it be attached to the step instruction of the induction variable?

I (naively) thought this would be attached to the loop metadata, but I haven't looked at implementation details. To be honest, I don't think this is the best option - I prefer option 3.

shiva0217 added inline comments.Aug 7 2023, 11:53 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4003

Oops, I think I mix the patch with some local changes. Please ignore the comment.

Mel-Chen updated this revision to Diff 548110.Aug 8 2023, 1:57 AM

Rebase, and here is a summary of the changes:

  1. Avoid the use of the SelectCmp.* series to indicate to AnyOf and FindLastIV, as @Ayal expressed concerns about potential confusion with min/max reduction.
  2. I attempted to use FindLast.* as a collective term for AnyOf and FindLastIV, but it was deemed less readable upon completion, so the current version uses separate functions for AnyOf and FindLastIV.
  3. Discovered an issue with AnyOf reduction while inserting AddReductionVar. A pre-commit revision D157375 has been opened to discuss this.
Mel-Chen added inline comments.Aug 8 2023, 2:00 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4003

No problem. And I've just rebased this patch. Please continue with the review. Thank you.

Mel-Chen updated this revision to Diff 548452.Aug 8 2023, 8:57 PM

Minor changes. Remove the changes that should not be in this patch.

Mel-Chen updated this revision to Diff 551932.Aug 21 2023, 1:52 AM

Changes:

  • Rebase
  • Format some code
  • New test cases @not_vectorized_select_icmp_const_cmp_in_recurrence and @not_vectorized_select_icmp_cmp_in_recurrence in Transforms/LoopVectorize/iv-select-cmp.ll.
  • Not allow the inclusion of cmp instructions in FindLast recurrence, as I mentioned in the pre-commit revision D157375. The examples are @not_vectorized_select_icmp_const_cmp_in_recurrence and @not_vectorized_select_icmp_cmp_in_recurrence in Transforms/LoopVectorize/iv-select-cmp.ll.
Mel-Chen updated this revision to Diff 552652.Aug 23 2023, 3:59 AM

Remove the unused parameter Prev in RecurrenceDescriptor::isFindLastIVPattern.

artagnon accepted this revision.Sep 20 2023, 7:10 AM

I've looked at this patch several times, and applied it locally and played with it. I might not be the most qualified reviewer, but this patch is ready to land in my opinion. If anyone has any objections, please raise them now.

This revision is now accepted and ready to land.Sep 20 2023, 7:10 AM

I've looked at this patch several times, and applied it locally and played with it. I might not be the most qualified reviewer, but this patch is ready to land in my opinion. If anyone has any objections, please raise them now.

Hi, artagnon, Thank you. We have some new fixes, but for some unknown reason, I can't use arc diff --update to update the changes. I'm considering moving the patch directly to a GitHub pull request.

It has been moved to github, please continue to review, thank you.
https://github.com/llvm/llvm-project/pull/67812