When using tail-folding and using the predicate for both data and control-flow
(the next vector iteration's predicate is generated with the llvm.active.lane.mask
intrinsic and then tested for the backedge), the LoopVectorizer still inserts a
runtime check to see if the 'i + VF' may at any point overflow for the given
trip-count. When it does, it falls back to a scalar epilogue loop.
We can get rid of that runtime check in the pre-header and therefore also
remove the scalar epilogue loop. This reduces code-size and avoids a runtime
check.
Consider the following loop:
void foo(char * __restrict__ dst, char *src, unsigned long N) { for (unsigned long i=0; i<N; ++i) dst[i] = src[i] + 42; }
If 'N' is e.g. ULONG_MAX, and the VF > 1, then the loop iteration counter
will overflow when calculating the predicate for the next vector iteration
at some point, because LLVM does:
vector.ph: %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N) vector.body: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ] %active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ] ... %index.next = add i64 %index, 16 ; The add above may overflow, which would affect the lane mask and control flow. Hence a runtime check is needed. %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index.next, i64 %N) %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0 br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
The solution:
What we can do instead is calculate the predicate before incrementing
the loop iteration counter, such that the llvm.active.lane.mask is
calculated from 'i' to 'tripcount > VF ? tripcount - VF : 0', i.e.
vector.ph: %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N) %N_minus_VF = select %N > 16 ? %N - 16 : 0 vector.body: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ] %active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ] ... %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index, i64 %N_minus_VF) %index.next = add i64 %index, %4 ; The add above may still overflow, but this time the active.lane.mask is not affected %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0 br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
For N = 20, we'd then get:
vector.ph: %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N) ; %active.lane.mask.entry = <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1> %N_minus_VF = select 20 > 16 ? 20 - 16 : 0 ; %N_minus_VF = 4 vector.body: (1st iteration) ... ; using <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1> as predicate in the loop ... %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 4) ; %active.lane.mask.next = <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> %index.next = add i64 0, 16 ; %index.next = 16 %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0 ; %8 = 1 br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7 ; branch to %vector.body vector.body: (2nd iteration) ... ; using <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> as predicate in the loop ... %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 16, i64 4) ; %active.lane.mask.next = <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> %index.next = add i64 16, 16 ; %index.next = 32 %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0 ; %8 = 0 br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7 ; branch to %for.cond.cleanup
Having a helper to check if the style means using ActiveLaneMask would probably help to make some of the code below a bit more compact and clearer.