The Concept and Approach
Here is an example of min max with index pattern:
int idx = ii; int max = mm; for (int i = 0; i < n; ++i) { int x = a[i]; if (max < x) { max = x; idx = i; } }
After transfering to LLVM IR, it will look like this:
define i64 @smax_idx(ptr nocapture readonly %a, i64 %mm, i64 %ii, ptr nocapture writeonly %res_max, i64 %n) { entry: br label %for.body for.body: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ] %max.09 = phi i64 [ %mm, %entry ], [ %1, %for.body ] %idx.011 = phi i64 [ %ii, %entry ], [ %spec.select7, %for.body ] %arrayidx = getelementptr inbounds i64, ptr %a, i64 %indvars.iv %0 = load i64, ptr %arrayidx %1 = tail call i64 @llvm.smax.i64(i64 %max.09, i64 %0) %cmp1 = icmp slt i64 %max.09, %0 %spec.select7 = select i1 %cmp1, i64 %indvars.iv, i64 %idx.011 %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 %exitcond.not = icmp eq i64 %indvars.iv.next, %n br i1 %exitcond.not, label %exit, label %for.body exit: store i64 %1, ptr %res_max ret i64 %spec.select7 }
Then we'll make a def-use graph for illustration (focus on for.body):
┌──────────────────┐ ▼ │ %indvars.iv │ / │ \ │ ▼ │ ▼ │ %arrayidx │%indvars.iv.next │ │ │ │ └──┘ ▼ │ ▼ %0 │ %exitcond.not ┌─┐ /└───────────┐ │ │ │ ▼▼ │ │ ▼ │ %1 │ │ br │ │ │ │ │ ▼ │ │ │phi_max:%max.09 │ │ └──┘ \ ┌─────┘ │ ▼ ▼ │ %cmp1 │ ┌─┐ \ │ │ ▼ ▼ ▼ │ %spec.select7 │ │ │ ▼ │ phi_idx:%idx.011 └───────┘
Generally, we will do traveling that starts from the phi of the loop header block when recognizing a reduction pattern, that is, phi_max and phi_idx in the graph. Taking simple max reduction as an example, we will start with phi_max and perform depth-first traveling on the def-use graph to check whether we can go back to phi_max, which means forming a cycle. Besides that, two things must be confirmed: first, at least one reduction operation, the operations in the cycle, is used outside the loop. If there is no external user, there is no need for the vectorizer to vectorize for it. Second, the reduction operation cannot have users inside the loop, unless the internal users are also reduction operations, and this is one of the issues that this patch needs to handle.
Let’s go back to the def-use graph. If we want to find two cycles for one traveling, it is obviously difficult. Besides making the algorithm more complicated, we also face the issue of ordering phi_max and phi_idx, because we cannot control the order of input IR. The better way is to find the cycles of phi_max and phi_idx respectively according to the original algorithm, and perform the second stage - combining phi_max and phi_idx. In this way, the phi order issue can be solved.
for.body: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ] %max.09 = phi i64 [ %mm, %entry ], [ %1, %for.body ] %idx.011 = phi i64 [ %ii, %entry ], [ %spec.select7, %for.body ] … And for.body: %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ] %idx.011 = phi i64 [ %ii, %entry ], [ %spec.select7, %for.body ] %max.09 = phi i64 [ %mm, %entry ], [ %1, %for.body ] …
Next, let's see what type of cycle phi_max and phi_idx are. Ignoring the edge (phi_max, %cmp1), phi_max is the general max reduction, and phi_idx is the select-cmp reduction. In other words, we need to recognize max reduction and select-cmp reduction in the first stage of reduction recognition. Then use the relationship between the two reductions found in the first stage to perform the combination in the second stage.
After the recognition is completed, the next step is code generation. First, let's look at how to code generation when there is no dependency between max reduction and select-cmp reduction (that is, when it is not a pattern of min max with index):
/* Normal two independent reductions*/ int idx = ii; int max = mm; for (int i = 0; i < n; ++i) { int x = a[i]; if (max < x) max = x; if (b[i] < a[i]) idx = i; } vec_max = broadcast(mm) vec_step = iota vec_idx = broadcast(MIN_VALUE(DType)) for (int i = 0; i < n; i += vf) { vec_a = load(a, i, vf) vec_b = load(b, i, vf) vec_cmp = vec_b < vec_a vec_max = max(vec_max, vec_a) vec_idx = select(vec_cmp, vec_step, vec_idx) vec_step += vf; } red_max = reduce_max(vec_max) red_idx_candidate = reduce_max(vec_idx) red_idx = red_idx_candidate == MIN_VALUE(DType) ? ii : red_idx_candidate
And when there is a dependency between max reduction and select-cmp reduction (that is, it is a pattern of min max with index), what will happen to code generation?
/* Two dependent reductions, the max with first index pattern */ int idx = ii; int max = mm; for (int i = 0; i < n; ++i) { int x = a[i]; if (max < x) { // strict max = x; idx = i; } } vec_max = broadcast(mm) vec_step = iota vec_idx = broadcast(MIN_VALUE(DType)) for (int i = 0; i < n; i += vf) { vec_a = load(a, i, vf) vec_cmp = vec_max < vec_a vec_max = max(vec_max, vec_a) vec_idx = select(vec_cmp, vec_step, vec_idx) vec_step += vf; } red_max = reduce_max(vec_max) vec_all_max = broadcast(red_max) mask = (vec_max == vec_all_max) red_idx_candidate = reduce_min(vec_idx, mask) // since this case is strict max reduction red_idx = red_idx_candidate == MIN_VALUE(DType) ? ii : red_idx_candidate
The biggest difference is whether a mask needs to be created between max reduction and select-cmp reduction in the exit block (or middle block in LLVM vectorizer). Secondly, according to whether min max is strict or non-strict, decide whether to use the maximum index or the minimum index. Therefore, as long as the correct reduction dependency can be established in the recognition stage, and do code generation according to the dependency, the pattern of min max with index can be vectorized.
The Implementation in LLVM
According to the description in the previous chapter, first of all, the vectorizer must be able to recognize select-cmp reduction. Next, we need to solve the issue of internal reduction users, so that min max reduction can accept loop internal users in the first recognition stage. The third is to combine min max reduction and select-cmp reduction. Finally, according to the relationship between select-cmp reduction and min max reduction, a mask is generated in the middle block and reduction fix is performed.
Select-Cmp Reduction
At present, there is already a select-cmp implementation in LLVM, developed by the author david-arm. However, the current implementation restricts the value of non-reduction phi to be loop invariant, which does not meet our demands. Therefore, we need to expand this feature, namely SelectIVICmp/ SelectIVFCmp.
Taking SelectIVICmp as an example, the result of vectorization is as follows:
/* A SelectIVICmp example */ #include <stdint.h> int64_t idx_scalar(int64_t *a, int64_t *b, int64_t ii, int64_t n) { int64_t idx = ii; for (int64_t i = 0; i < n; ++i) idx = (a[i] > b[i]) ? i : idx; return idx; } /* LLVM IR for vectorized SelectIVICmp reduction */ define dso_local i64 @idx_scalar(ptr nocapture noundef readonly %a, ptr nocapture noundef readonly %b, i64 noundef %ii, i64 noundef %n) local_unnamed_addr #0 { … vector.body: ; preds = %vector.body, %vector.ph %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ] %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %vector.ph ], [ %vec.ind.next, %vector.body ] %vec.phi = phi <4 x i64> [ <i64 -9223372036854775808, i64 -9223372036854775808, i64 -9223372036854775808, i64 -9223372036854775808>, %vector.ph ], [ %6, %vector.body ] %0 = add i64 %index, 0 %1 = getelementptr inbounds i64, ptr %a, i64 %0 %2 = getelementptr inbounds i64, ptr %1, i32 0 %wide.load = load <4 x i64>, ptr %2, align 8, !tbaa !4 %3 = getelementptr inbounds i64, ptr %b, i64 %0 %4 = getelementptr inbounds i64, ptr %3, i32 0 %wide.load1 = load <4 x i64>, ptr %4, align 8, !tbaa !4 %5 = icmp sgt <4 x i64> %wide.load, %wide.load1 %6 = select <4 x i1> %5, <4 x i64> %vec.ind, <4 x i64> %vec.phi %index.next = add nuw i64 %index, 4 %vec.ind.next = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4> %7 = icmp eq i64 %index.next, %n.vec br i1 %7, label %middle.block, label %vector.body, !llvm.loop !8 middle.block: ; preds = %vector.body %8 = call i64 @llvm.vector.reduce.smax.v4i64(<4 x i64> %6) %rdx.select.cmp = icmp ne i64 %8, -9223372036854775808 %rdx.select = select i1 %rdx.select.cmp, i64 %8, i64 %ii %cmp.n = icmp eq i64 %n, %n.vec br i1 %cmp.n, label %for.exit.loopexit, label %scalar.ph … }
Assuming that the format of the induction variable i is {start, +, step}, SelectIVICmp should use start - step as its identity. However, for easy implementation, I directly use MIN_VALUE(DType) as identity (-9223372036854775808 in above example). The role of identity here is the sentinel value, which represents the start value of SelectIVICmp (%ii in the above example). In the end, if the result of reduce_max is identity, the result of reduction will be fixed as the start value.
Note that this is a temporary implementation. Unexpected errors may occur when the start of the induction variable is MIN_TYPE(DType), or when the maximum value of the induction variable exceeds SignedMax(DType). Generally, it should be implemented by two reductions.
Internal User Issue: UserRecurPhi and UserRecurKind
I modified the function isMinMaxPattern so that while identifying min max reduction, it is also possible to identify whether there is an index pattern that may depend on min max reduction.
isMinMaxIdxPattern will set UserRecurPhi and UserRecurKind according to the current select and cmp IR traveling, and the currently recognizing kind of min max reduction. Please refer to the form below:
Take the first example: %1 = tail call i64 @llvm.smax.i64(i64 %max.09, i64 %0) %cmp1 = icmp slt i64 %max.09, %0 %spec.select7 = select i1 %cmp1, i64 %indvars.iv, i64 %idx.011
cmp format | %max.09 < %0 | %max.09 <= %0 | %max.09 > %0 | %max.09 >= %0 | %0 < %max.09 | %0 <= %max.09 | %0 > %max.09 | %0 >= %max.09 |
---|---|---|---|---|---|---|---|---|
UMax SMax | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdx | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdx | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdx | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdx | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdx | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdx | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdx | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdx |
UMin SMin | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdx | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdx | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdx | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdx | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxFirstIdx | UserRecurPhi=select.getFalseValue() UserRecurKind=MinMaxLastIdx | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxLastIdx | UserRecurPhi=select.getTrueValue() UserRecurKind=MinMaxFirstIdx |
According to this table, UserRecurPhi will be set to %idx.011 (FalseValue in the select IR) and UserRecurKind to MinMaxFirstIdx.
Once UserRecurPhi is set, it means that the select should belong to another unknown reduction, UserRecurPhi should be the phi of the unknown reduction, and UserRecurKind is the expected reduction kind. Both UserRecurPhi and UserRecurKind will be used in the second phase recognition.
The Second Phase of Recognition
At the end of function canVectorizeInstrs, there will be a second confirmation against the reduction that has UserRecurPhi.
At this stage, all reductions should have been found. At this point, the vectorizer only needs to check whether UserRecurPhi is a reduction phi. At the same time, by using the function fixUserRecurrence, the SelectIVICmp will be converted into MinMaxFirstIdx or MinMaxLastIdx according to UserRecurKind.
Code Generation and Reduction Fix
Min max with index vectorization does not need to be adjusted for the contents of vector.body, but requires changes in function fixReduction.
We must ensure that min max reduction is fixed earlier than index reduction, because index reduction needs to use the mask generated by min max reduction. We can achieve this by modifying the function fixCrossIterationPHIs. The map DependRecurrenceMasks will keep the mask generated by min max reduction. If the mask required by index reduction is not ready, it will postpone the fix of the index reduction until the required mask is ready.
Consider the following loops:
int idx = ii; int max = mm; for (int i = 0; i < n; ++i) { int x = a[i]; if (max < x) { max = x; idx = i; } }
and
int idx = ii; int max = mm; for (int i = 0; i < n; ++i) { int x = a[i]; if (max <= x) { max = x; idx = i; } }
Changes:
- New recurrence Kinds: MinMaxFirstIdx and MinMaxLastIdx. This kind is not directly generated by function AddReductionVar, but converted from SelectIVICmp/SelectIVFCmp.
TODOs:
- Now have not support that the min/max recurrence without exit instruction. Refer to test case smax_idx_max_no_exit_user.
- Support the min/max recurrence in select(cmp()). Refer to test case smax_idx_select_cmp.
- Support FP min/max recurrence.
It would be helpful to document how the new system of recurrences depending on other recurrences would work I think, possibly also with an explanation of the whole approach in the patch description.