This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Remove runtime check and scalar tail loop when tail-folding.
ClosedPublic

Authored by sdesmalen on Jan 19 2023, 6:20 AM.

Details

Summary

When using tail-folding and using the predicate for both data and control-flow
(the next vector iteration's predicate is generated with the llvm.active.lane.mask
intrinsic and then tested for the backedge), the LoopVectorizer still inserts a
runtime check to see if the 'i + VF' may at any point overflow for the given
trip-count. When it does, it falls back to a scalar epilogue loop.

We can get rid of that runtime check in the pre-header and therefore also
remove the scalar epilogue loop. This reduces code-size and avoids a runtime
check.

Consider the following loop:

void foo(char * __restrict__ dst, char *src, unsigned long N) {
    for (unsigned long  i=0; i<N; ++i)
        dst[i] = src[i] + 42;
}

If 'N' is e.g. ULONG_MAX, and the VF > 1, then the loop iteration counter
will overflow when calculating the predicate for the next vector iteration
at some point, because LLVM does:

vector.ph:
  %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
  ...

  %index.next = add i64 %index, 16
    ; The add above may overflow, which would affect the lane mask and control flow. Hence a runtime check is needed.
  %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index.next, i64 %N)
  %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
  br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7

The solution:

What we can do instead is calculate the predicate before incrementing
the loop iteration counter, such that the llvm.active.lane.mask is
calculated from 'i' to 'tripcount > VF ? tripcount - VF : 0', i.e.

vector.ph:
  %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
  %N_minus_VF = select %N > 16 ? %N - 16 : 0

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %active.lane.mask = phi <vscale x 16 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
  ...

  %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index, i64 %N_minus_VF)
  %index.next = add i64 %index, %4
    ; The add above may still overflow, but this time the active.lane.mask is not affected
  %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
  br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7

For N = 20, we'd then get:

vector.ph:
  %active.lane.mask.entry = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 %N)
    ; %active.lane.mask.entry = <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1>
  %N_minus_VF = select 20 > 16 ? 20 - 16 : 0
    ; %N_minus_VF = 4

vector.body: (1st iteration)
  ... ; using <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1> as predicate in the loop
  ...
  %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 0, i64 4)
    ; %active.lane.mask.next = <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
  %index.next = add i64 0, 16
    ; %index.next = 16
  %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
    ; %8 = 1
  br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
    ; branch to %vector.body

vector.body: (2nd iteration)
  ... ; using <1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0> as predicate in the loop
  ...
  %active.lane.mask.next = tail call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 16, i64 4)
    ; %active.lane.mask.next = <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
  %index.next = add i64 16, 16
    ; %index.next = 32
  %8 = extractelement <vscale x 16 x i1> %active.lane.mask.next, i64 0
    ; %8 = 0
  br i1 %8, label %vector.body, label %for.cond.cleanup, !llvm.loop !7
    ; branch to %for.cond.cleanup

Diff Detail

Event Timeline

sdesmalen created this revision.Jan 19 2023, 6:20 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 19 2023, 6:20 AM
sdesmalen requested review of this revision.Jan 19 2023, 6:20 AM

This is a nice patch @sdesmalen! It's great to remove the overflow checks and scalar loop - a double win for performance and code size. :) I just had a comment about ways we can avoid overflow checks completely for low trip counts.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8766

I wonder if it's possible to be avoid this extra work in the preheader if we know that we can never overflow? For example, I'm thinking of a loop like this:

void foo(int* __restrict__ dst, int* __restrict__ src) {
  for (long i = 0; i < 7; i++) {
    dst[i] = 1000 / src[i];
  }
}

where we will vectorise with tail-folding (assuming the unroller hasn't already destroyed the vectorisation opportunity!). For vscale=2 this is actually just a single iteration, so the additional work in the preheader becomes more significant.

sdesmalen updated this revision to Diff 493286.Jan 30 2023, 5:22 AM

Rebased the patch on top of D142887, and added a new enum 'DataAndControlFlowWithoutRuntimeCheck' to distinguish it from 'DataAndControlFlow'.

sdesmalen added inline comments.Jan 30 2023, 5:30 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8766

That's a good point! For instances where we know no overflow can occur, it's probably better to avoid this style of tail-folding. It seems that GCC follows a similar approach.
I'll post a separate patch for that so that this patch remains relatively simple to review.

I've already tried to simplify things a bit by distinguishing DataAndControlFlow and DataAndControlFlowWithoutRuntimeCheck, so that we can easily fall back on the former if we know the runtime check can be folded away.

david-arm accepted this revision.Feb 6 2023, 6:48 AM

LGTM!

llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
7

I'm happy with this regression for now, since it will be fixed by D142894.

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
9

This is now actually doing what -Os should be doing!

This revision is now accepted and ready to land.Feb 6 2023, 6:48 AM
fhahn added a comment.Feb 6 2023, 7:35 AM

Would it be possible to have at least one target-independent patch by passing the style as option>?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1576

Having a helper to check if the style means using ActiveLaneMask would probably help to make some of the code below a bit more compact and clearer.

8750

The code for handling ActiveLaneMask now as grown quite a bit. Do you think it would be possible to separate the logic in different functions? Not in this patch ,but as general cleanup?

sdesmalen updated this revision to Diff 496150.Feb 9 2023, 9:04 AM
sdesmalen marked an inline comment as done.

Added helper functions to query if the active lane mask is used.
Added option to force tail-folding-style to aid testing, and added a new test.
Rebased patch.

Would it be possible to have at least one target-independent patch by passing the style as option>?

I tried to create such a test but it needs to run with scalable vectors, otherwise the code under the check on line 3010 is not exercised. It doesn't seem possible at the moment to test it without having a target (e.g. the cost-model prevents the use of scalable vectors due to Invalid costs, even when using -force-target-instruction-cost=1).
That said, I've created a force-vectorization-style option and created a new AArch64 test (tail-folding-styles.ll) that exercises all the different tail folding styles for the same loop.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1576

Yes, makes sense. I've created a helper function for it.

8750

Fair question, I'll try to create a follow-up patch to clean this up a bit!

fhahn accepted this revision.Feb 13 2023, 12:26 PM

LGTM, thanks!

I tried to create such a test but it needs to run with scalable vectors, otherwise the code under the check on line 3010 is not exercised. It doesn't seem possible at the moment to test it without having a target (e.g. the cost-model prevents the use of scalable vectors due to Invalid costs, even when using -force-target-instruction-cost=1).

Ah right, thanks for checking!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8749

Is this just useActiveLaneMaskForControlFlow?

llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
10

Might be worth to include the % in the pattern, like the other variables in this file do?

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll
26

It would be great if you could update the test here to use named variables for the the value numbers here, to avoid having to update every check line if the numbering changes slightly.

This revision was landed with ongoing or failed builds.Mar 1 2023, 1:02 AM
This revision was automatically updated to reflect the committed changes.
sdesmalen marked 7 inline comments as done.

Thanks for reviewing! I've addressed final comments before committing

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8749

You're right, not sure how I missed that :)

sdesmalen added inline comments.Mar 1 2023, 7:25 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8750

I have created D145068 for this.

Allen added a subscriber: Allen.Mar 24 2023, 5:37 AM
ChillingHsu removed a subscriber: ChillingHsu.