This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Vectorize the reduction pattern of integer min/max with index.
Needs ReviewPublic

Authored by Mel-Chen on Feb 6 2023, 11:09 PM.

Details

Reviewers
ABataev
fhahn
Summary

The Concept and Approach

Here is an example of min max with index pattern:

int idx = ii;
int max = mm;
for (int i = 0; i < n; ++i) {
  int x = a[i];
  if (max < x) {
    max = x;
    idx = i;
  }
}

After transfering to LLVM IR, it will look like this:

define i64 @smax_idx(ptr nocapture readonly %a, i64 %mm, i64 %ii, ptr nocapture writeonly %res_max, i64 %n) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %max.09 = phi i64 [ %mm, %entry ], [ %1, %for.body ]
  %idx.011 = phi i64 [ %ii, %entry ], [ %spec.select7, %for.body ]
  %arrayidx = getelementptr inbounds i64, ptr %a, i64 %indvars.iv
  %0 = load i64, ptr %arrayidx
  %1 = tail call i64 @llvm.smax.i64(i64 %max.09, i64 %0)
  %cmp1 = icmp slt i64 %max.09, %0
  %spec.select7 = select i1 %cmp1, i64 %indvars.iv, i64 %idx.011
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %n
  br i1 %exitcond.not, label %exit, label %for.body

exit:
  store i64 %1, ptr %res_max
  ret i64 %spec.select7
}

Then we'll make a def-use graph for illustration (focus on for.body):

                   ┌──────────────────┐
                   ▼                  │
          %indvars.iv                 │
         /         │ \                │
        ▼          │  ▼               │
%arrayidx          │%indvars.iv.next  │
     │             │      │        └──┘
     ▼             │      ▼
    %0             │   %exitcond.not
┌─┐ /└───────────┐ │      │
│ ▼▼             │ │      ▼
│ %1             │ │      br
│  │             │ │
│  ▼             │ │
│phi_max:%max.09 │ │
└──┘  \    ┌─────┘ │
       ▼   ▼       │ 
       %cmp1       │
      ┌─┐ \        │
      │ ▼  ▼       ▼
      │ %spec.select7
      │       │
      │       ▼
      │    phi_idx:%idx.011 
      └───────┘

Generally, we will do traveling that starts from the phi of the loop header block when recognizing a reduction pattern, that is, phi_max and phi_idx in the graph. Taking simple max reduction as an example, we will start with phi_max and perform depth-first traveling on the def-use graph to check whether we can go back to phi_max, which means forming a cycle. Besides that, two things must be confirmed: ​​first, at least one reduction operation, the operations in the cycle, is used outside the loop. If there is no external user, there is no need for the vectorizer to vectorize for it. Second, the reduction operation cannot have users inside the loop, unless the internal users are also reduction operations, and this is one of the issues that this patch needs to handle.

Let’s go back to the def-use graph. If we want to find two cycles for one traveling, it is obviously difficult. Besides making the algorithm more complicated, we also face the issue of ordering phi_max and phi_idx, because we cannot control the order of input IR. The better way is to find the cycles of phi_max and phi_idx respectively according to the original algorithm, and perform the second stage - combining phi_max and phi_idx. In this way, the phi order issue can be solved.

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %max.09 = phi i64 [ %mm, %entry ], [ %1, %for.body ]
  %idx.011 = phi i64 [ %ii, %entry ], [ %spec.select7, %for.body ]
…

And 

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %idx.011 = phi i64 [ %ii, %entry ], [ %spec.select7, %for.body ]
  %max.09 = phi i64 [ %mm, %entry ], [ %1, %for.body ]
…

Next, let's see what type of cycle phi_max and phi_idx are. Ignoring the edge (phi_max, %cmp1), phi_max is the general max reduction, and phi_idx is the select-cmp reduction. In other words, we need to recognize max reduction and select-cmp reduction in the first stage of reduction recognition. Then use the relationship between the two reductions found in the first stage to perform the combination in the second stage.

After the recognition is completed, the next step is code generation. First, let's look at how to code generation when there is no dependency between max reduction and select-cmp reduction (that is, when it is not a pattern of min max with index):

/* Normal two independent reductions*/
int idx = ii;
int max = mm;
for (int i = 0; i < n; ++i) {
  int x = a[i];
  if (max < x) 
    max = x; 
  if (b[i] < a[i])
    idx = i;
}

vec_max = broadcast(mm)
vec_step = iota
vec_idx = broadcast(MIN_VALUE(DType))
for (int i = 0; i < n; i += vf) {
  vec_a = load(a, i, vf)
  vec_b = load(b, i, vf)
  vec_cmp = vec_b < vec_a
  vec_max = max(vec_max, vec_a)
  vec_idx = select(vec_cmp, vec_step, vec_idx)
  vec_step += vf;
}
red_max = reduce_max(vec_max)
red_idx_candidate = reduce_max(vec_idx)
red_idx = red_idx_candidate == MIN_VALUE(DType) ? ii : red_idx_candidate

And when there is a dependency between max reduction and select-cmp reduction (that is, it is a pattern of min max with index), what will happen to code generation?

/* Two dependent reductions, the max with first index pattern */
int idx = ii;
int max = mm;
for (int i = 0; i < n; ++i) {
  int x = a[i];
  if (max < x) {  // strict
    max = x; 
    idx = i;
  }
}

vec_max = broadcast(mm)
vec_step = iota
vec_idx = broadcast(MIN_VALUE(DType))
for (int i = 0; i < n; i += vf) {
  vec_a = load(a, i, vf)
  vec_cmp = vec_max < vec_a
  vec_max = max(vec_max, vec_a)
  vec_idx = select(vec_cmp, vec_step, vec_idx)
  vec_step += vf;
}
red_max = reduce_max(vec_max)
vec_all_max = broadcast(red_max)
mask = (vec_max == vec_all_max)
red_idx_candidate = reduce_min(vec_idx, mask)  // since this case is strict max reduction
red_idx = red_idx_candidate == MIN_VALUE(DType) ? ii : red_idx_candidate

The biggest difference is whether a mask needs to be created between max reduction and select-cmp reduction in the exit block (or middle block in LLVM vectorizer). Secondly, according to whether min max is strict or non-strict, decide whether to use the maximum index or the minimum index. Therefore, as long as the correct reduction dependency can be established in the recognition stage, and do code generation according to the dependency, the pattern of min max with index can be vectorized.

The Implementation in LLVM

According to the description in the previous chapter, first of all, the vectorizer must be able to recognize select-cmp reduction. Next, we need to solve the issue of internal reduction users, so that min max reduction can accept loop internal users in the first recognition stage. The third is to combine min max reduction and select-cmp reduction. Finally, according to the relationship between select-cmp reduction and min max reduction, a mask is generated in the middle block and reduction fix is performed.

Select-Cmp Reduction

At present, there is already a select-cmp implementation in LLVM, developed by the author david-arm. However, the current implementation restricts the value of non-reduction phi to be loop invariant, which does not meet our demands. Therefore, we need to expand this feature, namely SelectIVICmp/ SelectIVFCmp.
Taking SelectIVICmp as an example, the result of vectorization is as follows:

/* A SelectIVICmp example */
#include <stdint.h>

int64_t idx_scalar(int64_t *a, int64_t *b, int64_t ii, int64_t n) {
  int64_t idx = ii;
  for (int64_t i = 0; i < n; ++i)
    idx = (a[i] > b[i]) ? i : idx;

  return idx;
}

/* LLVM IR for vectorized SelectIVICmp reduction */
define dso_local i64 @idx_scalar(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i64 noundef %ii, i64 noundef %n) local_unnamed_addr #0 {
…

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %vec.phi = phi <4 x i64> [ <i64 -9223372036854775808, i64 -9223372036854775808, i64 -9223372036854775808, i64 -9223372036854775808>, %vector.ph ], [ %6, %vector.body ]
  %0 = add i64 %index, 0
  %1 = getelementptr inbounds i64, ptr %a, i64 %0
  %2 = getelementptr inbounds i64, ptr %1, i32 0
  %wide.load = load <4 x i64>, ptr %2, align 8, !tbaa !4
  %3 = getelementptr inbounds i64, ptr %b, i64 %0
  %4 = getelementptr inbounds i64, ptr %3, i32 0
  %wide.load1 = load <4 x i64>, ptr %4, align 8, !tbaa !4
  %5 = icmp sgt <4 x i64> %wide.load, %wide.load1
  %6 = select <4 x i1> %5, <4 x i64> %vec.ind, <4 x i64> %vec.phi
  %index.next = add nuw i64 %index, 4
  %vec.ind.next = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
  %7 = icmp eq i64 %index.next, %n.vec
  br i1 %7, label %middle.block, label %vector.body, !llvm.loop !8

middle.block:                                     ; preds = %vector.body
  %8 = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> %6)
  %rdx.select.cmp = icmp ne i64 %8, -9223372036854775808
  %rdx.select = select i1 %rdx.select.cmp, i64 %8, i64 %ii
  %cmp.n = icmp eq i64 %n, %n.vec
  br i1 %cmp.n, label %for.exit.loopexit, label %scalar.ph
…
}

Assuming that the format of the induction variable i is {start, +, step}, SelectIVICmp should use start - step as its identity. However, for easy implementation, I directly use MIN_VALUE(DType) as identity (-9223372036854775808 in above example). The role of identity here is the sentinel value, which represents the start value of SelectIVICmp (%ii in the above example). In the end, if the result of reduce_max is identity, the result of reduction will be fixed as the start value.

Note that this is a temporary implementation. Unexpected errors may occur when the start of the induction variable is MIN_TYPE(DType), or when the maximum value of the induction variable exceeds SignedMax(DType). Generally, it should be implemented by two reductions.

Internal User Issue: UserRecurPhi and UserRecurKind

I modified the function isMinMaxPattern so that while identifying min max reduction, it is also possible to identify whether there is an index pattern that may depend on min max reduction.

isMinMaxIdxPattern will set UserRecurPhi and UserRecurKind according to the current select and cmp IR traveling, and the currently recognizing kind of min max reduction. Please refer to the form below:

Take the first example:
  %1 = tail call i64 @llvm.smax.i64(i64 %max.09, i64 %0)
  %cmp1 = icmp slt i64 %max.09, %0
  %spec.select7 = select i1 %cmp1, i64 %indvars.iv, i64 %idx.011
cmp format%max.09 < %0%max.09 <= %0%max.09 > %0%max.09 >= %0%0 < %max.09%0 <= %max.09%0 > %max.09%0 >= %max.09
UMax SMaxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdxUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdxUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdxUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdxUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdx
UMin SMinUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdxUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdxUserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdxUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdxUserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdx

According to this table, UserRecurPhi will be set to %idx.011 (FalseValue in the select IR) and UserRecurKind to MinMaxFirstIdx.

Once UserRecurPhi is set, it means that the select should belong to another unknown reduction, UserRecurPhi should be the phi of the unknown reduction, and UserRecurKind is the expected reduction kind. Both UserRecurPhi and UserRecurKind will be used in the second phase recognition.

The Second Phase of Recognition

At the end of function canVectorizeInstrs, there will be a second confirmation against the reduction that has UserRecurPhi.
At this stage, all reductions should have been found. At this point, the vectorizer only needs to check whether UserRecurPhi is a reduction phi. At the same time, by using the function fixUserRecurrence, the SelectIVICmp will be converted into MinMaxFirstIdx or MinMaxLastIdx according to UserRecurKind.

Code Generation and Reduction Fix

Min max with index vectorization does not need to be adjusted for the contents of vector.body, but requires changes in function fixReduction.

We must ensure that min max reduction is fixed earlier than index reduction, because index reduction needs to use the mask generated by min max reduction. We can achieve this by modifying the function fixCrossIterationPHIs. The map DependRecurrenceMasks will keep the mask generated by min max reduction. If the mask required by index reduction is not ready, it will postpone the fix of the index reduction until the required mask is ready.


Consider the following loops:

int idx = ii;
int max = mm;
for (int i = 0; i < n; ++i) {
  int x = a[i];
  if (max < x) {
    max = x;
    idx = i;
  }
}

and

int idx = ii;
int max = mm;
for (int i = 0; i < n; ++i) {
  int x = a[i];
  if (max <= x) {
    max = x;
    idx = i;
  }
}

Changes:

  • New recurrence Kinds: MinMaxFirstIdx and MinMaxLastIdx. This kind is not directly generated by function AddReductionVar, but converted from SelectIVICmp/SelectIVFCmp.

TODOs:

  • Now have not support that the min/max recurrence without exit instruction. Refer to test case smax_idx_max_no_exit_user.
  • Support the min/max recurrence in select(cmp()). Refer to test case smax_idx_select_cmp.
  • Support FP min/max recurrence.

Diff Detail

Event Timeline

Mel-Chen created this revision.Feb 6 2023, 11:09 PM
Mel-Chen requested review of this revision.Feb 6 2023, 11:09 PM
Mel-Chen retitled this revision from [LoopVectorize] Vectorize the reduction pattern of integer min/max with index. to [LoopVectorize] Vectorize the reduction pattern of integer min/max with index. (WIP).Feb 7 2023, 12:16 AM
Mel-Chen edited the summary of this revision. (Show Details)Feb 7 2023, 12:21 AM
Mel-Chen updated this revision to Diff 496034.Feb 8 2023, 11:40 PM

Rebase and update the command in test case.

Mel-Chen updated this revision to Diff 496879.Feb 13 2023, 2:07 AM

Changes:

  • Fix interleave code generation for SelectIVICmp and SelectIVFCmp.
  • Fix the internal compiler error for MinMaxFirstIdx and MinMaxLastIdx when -force-vector-width=1.
  • Update test cases. Add more run command lines.
Mel-Chen updated this revision to Diff 496947.Feb 13 2023, 5:51 AM

Changes:

  • Rebase
  • Split the patch of test case and implementation
  • Update FIXME
Mel-Chen edited the summary of this revision. (Show Details)Feb 13 2023, 5:57 AM
Mel-Chen updated this revision to Diff 497320.Feb 14 2023, 7:20 AM

Changes:

  • Remove function createInductionSelectCmpTargetReduction
  • Add function createSentinelValueHandling
  • Fix the start value of SelectCmp and MinMaxIdx

And then I don't know why all FIXME has been fixed, will confirm.

Mel-Chen updated this revision to Diff 497548.Feb 14 2023, 10:18 PM

Rebase and fix the check prefix in test cases.

Mel-Chen updated this revision to Diff 497615.EditedFeb 15 2023, 3:24 AM
Mel-Chen added a subscriber: test.

Changes:

  • Confirm the operands of intrinsic and cmp. Fixed test case @smax_idx_not_vec_1.
  • Format code.
Mel-Chen retitled this revision from [LoopVectorize] Vectorize the reduction pattern of integer min/max with index. (WIP) to [LoopVectorize] Vectorize the reduction pattern of integer min/max with index..Feb 15 2023, 3:24 AM
Mel-Chen edited the summary of this revision. (Show Details)

@fhahn Ping. What do you think of my approach? I am looking forward to your reply.

Mel-Chen updated this revision to Diff 499753.Feb 23 2023, 12:22 AM

Changes:

  • Rebase
  • Fix the bug of predicate normalization
Mel-Chen updated this revision to Diff 504121.Mar 10 2023, 6:05 AM

Rebase and update test case result.

fhahn added inline comments.Mar 19 2023, 2:15 PM
llvm/include/llvm/Analysis/IVDescriptors.h
379

It would be helpful to document how the new system of recurrences depending on other recurrences would work I think, possibly also with an explanation of the whole approach in the patch description.

llvm/lib/Analysis/IVDescriptors.cpp
703

nit: Variables should start with upper case

also, move definition to use?

728

The naming here is a bit confusing now, NonPhi can be an increasing loop induction? In that case it would be a phi, right?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
754

We are in the process of removing those kinds of global maps that are used to carry information used during codegen and later. Ideally the combination of values would be modeled explicitly in the exit block of the plan, but we are not there yet. This is the main reason for D132063 doing things the way it does.

3768

this would need documenting.

llvm/test/Transforms/LoopVectorize/smax-idx.ll
2

Could you add new tests as a separate patch?

Mel-Chen added inline comments.Mar 22 2023, 7:31 AM
llvm/include/llvm/Analysis/IVDescriptors.h
379

Sure. I will document the whole approach. and update in the summary tomorrow.
Quickly explain the function of UserRecurPhi . The purpose of UserRecurPhi is to allow the recurrence to be used in the loop (loop internal use), and to ensure that the user is also a recurrence. UserRecurPhi will record the candidate user recurrence phi, and UserRecurKind will recored the excepted user recurrence kind. Currently I'm limiting candidates to one, but it should be possible to have more than one.

llvm/lib/Analysis/IVDescriptors.cpp
728

Yes, it's a little confusing here. It could be better to replace NonPhi with NonRecurPhi. By the way, are you interested in supporting full functional SelectCmp pattern? I think min max with index pattern really needs to depend on the SelectCmp to be safe.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
754

I see., but DependRecurrenceMasks exists for a reason. Consider the following case:

int idx = ii;
int foo = jj;
int max = mm;
for (int i = 0; i < n; ++i) {
  int x = a[i];
  if (max < x) {
    max = x;
    idx = i;
    foo = b[i];
  }
}

That mask has the chance to be reused, and I try to keep that flexibility. Of course, we can recalculate the mask for each reduction that needs a mask, but currently using the global maps to preserve the mask is a relatively simple method that I think of.
I have heard that VPlan is going to be extended to other blocks, could you share the relevant discussion links?

3768

Sure. Quick explanation: min/max recurrence should be done earlier than min max idx recurrence, because idx recurrence depends on the mask produced by min max recurrence. Here is to ensure that the recurrence dependencies are correct.

llvm/test/Transforms/LoopVectorize/smax-idx.ll
2

Of course. I will split an NFC patch tomorrow.

Mel-Chen updated this revision to Diff 507716.Mar 23 2023, 6:08 AM

Split test case into parent revision.

Mel-Chen added inline comments.Mar 23 2023, 6:13 AM
llvm/include/llvm/Analysis/IVDescriptors.h
379

Too busy today. The document will be available in next week.

Mel-Chen edited the summary of this revision. (Show Details)Mar 30 2023, 11:14 PM

@fhahn Updated my approach introduction in the summary.
If you have any questions, please contact me. Looking forward to discussing with you again. Thank you.

@fhahn Ping. I'd be glad to discuss this patch with you.

fhahn added inline comments.Apr 23 2023, 2:44 PM
llvm/lib/Analysis/IVDescriptors.cpp
711

Using this API seems unnecessarily strict; we don't need to bounds (and getBounds may fail if It cannot identify the bounds), we just need to check the direction of the IV, which can be done by checking if it is an induction PHI and use SE.getMonotonicPredicateTyp.

llvm/test/Transforms/LoopVectorize/select-min-index.ll
135–136

Is this incorrectly vectorized or does the test name need fixing? It looks like %min.val isn't an actual minimum value phi?

Mel-Chen updated this revision to Diff 519006.May 3 2023, 1:14 AM

Rebase this patch, and created revision D149731 to preserving min max operation in select-cmp form.

Mel-Chen updated this revision to Diff 523302.May 18 2023, 2:46 AM

Changes:

  • Split SelectIVICmp and SelectIVFCmp out
  • Add Comment
  • Minor refine the code
RKSimon edited the summary of this revision. (Show Details)Jun 5 2023, 2:31 AM
Mel-Chen updated this revision to Diff 528430.Jun 5 2023, 7:16 AM

Changes:

  • Format and minor fix
Matt added a subscriber: Matt.Jun 14 2023, 3:35 PM
artagnon added inline comments.Jun 15 2023, 3:52 AM
llvm/lib/Analysis/IVDescriptors.cpp
415

Why?

424–426

If you separate out the MinMaxIdx pattern into its own function, we can check NumCmpSelectPatternInst for it separately.

831–860

This is a bit cryptic: would you consider adding more RecurKinds to make this less cryptic?

877–878

Can we avoid the expensive call to isInductionPHI() by checking that the SCEVAddRec is a SCEVConstant?

1306–1313

Why not merge this with the RecurKind::SMax case?

1341–1342

Rename these to IMinMaxFirstIdx and IMinMaxLastIdx?

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
881

Typo: comfirm.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4051–4054

RdxDesc.isOrdered() can help you pick between FCMP_OEQ and FCMP_UEQ.

shiva0217 added inline comments.Aug 17 2023, 1:59 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
892

Instead of fixUserRecurrence to setDependMinMaxRecurDes and change the user RecurKind, is it possible to setDependMinMaxRecurDes when isReductionPHI return true?

If we able to propagate parent(dependent) RecurDes to isReductionPHI, perhaps we can create reduction as following.

RecurKind ParentKind = RedDes.getRecurrenceKind();
if (ParentKind == RecurKind::SMax) {
  if (AddReductionVar(Phi, RecurKind::MinMaxFirstIdx, TheLoop, FMF, RedDes, DB, AC, DT,
                      SE)) {
    LLVM_DEBUG(dbgs() << "Found an MinMaxFirstIdx reduction PHI." << *Phi << "\n");
    return true;
  }
}

The dependency for the RecurKind could be explicitly and avoid the user RecurKind fixup.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3768

Perhaps we could do the sorting according to the reduction dependency before calling fixReduction which may be similar to https://reviews.llvm.org/D157631.

4023

Should we check isMinMaxRecurrenceKind(Kind) and isMinMaxIdxRecurrenceKind for user kind?

Although it would be the only dependency currently, it might be explicit for the reader and avoid unexpected codegen in the future.

4025

Could we encapsulate the mask generation to createMinMaxIdxMaskOp or other name you prefer?

4047

Should we check isMinMaxRecurrenceKind(Kind) and isMinMaxIdxRecurrenceKind for user kind?

4048

Could we encapsulate the mask generation to createMinMaxIdxMask or similar?