This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Simplify branch in the remainder loop for trivial cases
Needs ReviewPublic

Authored by danilaml on Mar 6 2020, 7:35 AM.

Details

Summary

When vectorizing by factor of 2 the remainder loop always executes
only one iteration so there is no actual need to keep the branch.

Fixes PR44547

Diff Detail

Event Timeline

danilaml created this revision.Mar 6 2020, 7:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 6 2020, 7:35 AM
danilaml marked an inline comment as done.Mar 6 2020, 7:40 AM
danilaml added inline comments.
llvm/test/Transforms/LoopVectorize/if-pred-stores.ll
278

this transform seems correct, but not sure if the original purpose of this test is still fulfilled

danilaml updated this revision to Diff 248733.Mar 6 2020, 7:57 AM
danilaml updated this revision to Diff 248737.Mar 6 2020, 8:03 AM
danilaml marked an inline comment as done.
danilaml added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

I think this check is enough unless there are other cases in which "remainder loop has N % (VF*UF) iterations doesn't hold.

Harbormaster completed remote builds in B48341: Diff 248733.
Ayal added a comment.Mar 14 2020, 7:10 AM

Eliminating a redundant back-edge is clearly good. It would be better if such a special-case cleanup could be handled by some subsequent optimization, and possibly generalized. Note however that the remainder loop may or may not be considered subject for further optimization: LV currently disables unrolling the remainder loop following r231631, OTOH it may be worth vectorizing according to D30247. If desired, optimizations other than vectorization should preferably be taken care of by subsequent passes such as indvars and loop-unroll. LV knows that the trip count of the remainder loop is at most VF*UF, in the absence of overflows and runtime guards, and can make an effort to convey this bound, if GVN and IPSCCP fail to pick it up (referring to PR44547). Unfortunately, introducing an llvm.assume() seems insufficient - perhaps it could be made to work?

LV originally tries to keep the control of the remainder loop intact, adjusting only the starting values of its phi's, including that of its iv. If this control is going to be modified, by hacking its latch branch, another alternative is to replace it altogether with a new canonical {0, +1, %TripCount} iv (as done for the vector loop), possibly following a %TripCount = urem %ComputedTripCount, VF*UF which conveys this upper bound clearly. Somewhat akin to how truncateToMinimalBitwidths() conveys minimal bitwidths.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

Note that the trip count of the remainder loop may be equal to VF*UF, when loop requiresScalarEpilogue(); so in the above special case of VF*UF==2 remainder loop may iterate once or twice.

Note that emitMinimumIterationCountCheck() also takes care of the case where adding 1 to the backedge-taken count overflows, leading to an incorrect trip count of zero; here too "remainder" loop iterates (much) more than once.

3110

BI is aka ScalarLatchBr

danilaml updated this revision to Diff 251702.Mar 20 2020, 10:44 AM
danilaml marked 2 inline comments as done.
danilaml marked an inline comment as done.Mar 20 2020, 10:49 AM

Updated revision with additional checks and rebased.

I'm not sure that llvm.assume can be reliably made to work (and be simpler, than just eliminating back edge).

Going with {0, +1, %TripCount} might be more beneficial in the end (it should also be trivial to call vectorize/unroll recursively in the remainder loop in such case, if I understood things correctly).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

Thanks. I knew I might've missed something. This makes me more skeptical about potential llvm.assume solution.
Am I understanding your note correctly, that adding requiresScalarEpilogue check is enough?

Ayal added inline comments.Mar 20 2020, 5:28 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

Adding requiresScalarEpilogue() check is enough to handle the first case, but not the second case.
One way to try and handle the second case is to change getOrCreateVectorTripCount() so that it relies on PSE::getBackedgeTakenCount() w/o adding 1 to it, as this addition (done by getOrCreateTripCount()) may overflow to zero.
See r209854, and the max_i32_backedgetaken() test it added to test/Transforms/LoopVectorize/induction.ll.

Another way may be to check if/when adding 1 is known not to overflow.

danilaml updated this revision to Diff 254766.Apr 3 2020, 6:04 AM
danilaml marked an inline comment as done.
danilaml updated this revision to Diff 254767.Apr 3 2020, 6:07 AM
danilaml marked 2 inline comments as done.Apr 3 2020, 6:46 AM
danilaml added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

I haven't figured out how to make VectorTripCount reliant solely on getBackedgeTakenCount without introducing extra checks/instructions (which seems to be the rationale behind the current code, instead of the one in r209854, at the expense of not vectorizing the rare UINT_MAX loops).

Instead, I've opted in checking whether the overflow might've occurred by using getConstantMaxBackedgeTakenCount. If I understood things correctly, it would return -1 (all ones) in the overflow case.

danilaml updated this revision to Diff 254791.Apr 3 2020, 7:03 AM
danilaml marked an inline comment as done.
Ayal added inline comments.Apr 3 2020, 7:16 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

A sketch of making VectorTripCount work correctly for loops with BTC= UINTMAX as well:

Step = VF*UF
BTC = PSE::BackedgeTakenCount()
N = BTC + 1                     // could overflow to 0 so do not compute N % Step
if (foldTail) N = N + (Step-1)  // for rounding up
R = BTC % Step                  // Fits foldTail: (N+Step-1)%Step == (BTC+1+Step-1)%Step == (BTC+Step)%Step == BTC%Step
if !(foldTail) { R = R + 1      // Fits requiresScalarEpilog: produces 0 < R <= Step 
  if !(requiresScalarEpilog) R = (R == Step ? 0 : R) == R % Step}
VectorTripCount = N - R

which could be optimized into

Step = VF*UF
BTC = PSE::BackedgeTakenCount()
R = BTC % Step
VTC = BTC - R
if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
if (foldTail) VTC = VTC + Step
VectorTripCount = VTC

This also allows foldTail to work with Steps (i.e., UF's) that are not a power of 2.

danilaml updated this revision to Diff 255335.Apr 6 2020, 7:35 AM
danilaml marked an inline comment as done.Apr 6 2020, 8:09 AM
danilaml added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

Hm, but doesn't it introduce additional instructions/compares that might lead to worse CodeGen?
I.e. in the common case, BTC will need to be computed, when BTC - R, when select from VTC and VTC+Step.
whereas currently it is just VectorTripCount = TC - (TC % Step), where TC is already computed.

Ayal added inline comments.Apr 6 2020, 12:42 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

Hopefully the code-size increase and the slowdown will be insignificant, being outside the loop, but indeed they need to be checked.

Another option is to use the fact that, although BTC+1 might overflow and wrap to zero, BTC-(Step-1) may not underflow if !foldTail thanks to the min.iters.check of TripCount = N = BTC+1 >= Step, and therefore (BTC+1)(w/o overflow) === (BTC-(Step-1)) modulo Step.
So for a common case of !foldTail && !requiresScalarEpilog, VectorTripCount can be computed w/o risk of overflow or underflow using N-((N-Step)%Step) instead of the current N - (N % Step).

For completeness, current VectorTripCount is computed by:

Step = VF*UF
BTC = PSE::BackedgeTakenCount()
N = BTC + 1
if (foldTail) N = N + (Step-1)
R = N % Step
if (requiresScalarEpilog) R = (R == 0 ? Step : R)
VectorTripCount = N - R
danilaml updated this revision to Diff 296446.Oct 6 2020, 6:47 AM

Rebased.
Had some time to come back to this.
I've tried implementing the following proposed way to get VTC locally:

Step = VF*UF
BTC = PSE::BackedgeTakenCount()
R = BTC % Step
VTC = BTC - R
if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC)
if (foldTail) VTC = VTC + Step
VectorTripCount = VTC

But it had generally a negative performance impact on the benchmarks I've run mainly due to 1) more instructions required to compute VTC 2) VTC = (R == Step-1 ? VTC + Step : VTC) select for common case making vectorized loop's Exit Count uncomputable thus impeding some later optimizations.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

I'm not sure what using N-((N-Step)%Step) instead of N - (N % Step) is supposed to solve.

danilaml updated this revision to Diff 314186.Dec 31 2020, 5:08 AM

Is there no transform that can be taught to do this cleanup instead?

Is there no transform that can be taught to do this cleanup instead?

Possibly, but I have a hard time imagining how it could be done. Perhaps someone much more knowledgeable in LLVM loop optimizations than I can answer that.
I'm not even sure if it's currently possible to reliably reconstruct the unroll factor and find the remainder loop in some pass after unrolling has been complete.

Ayal added a comment.Jun 9 2021, 2:30 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3108

I'm not sure what using N-((N-Step)%Step) instead of N - (N % Step) is supposed to solve.

Currently if BTC = UINTMAX the vector loop is bypassed and the scalar "remainder" loop executes all iterations, under !foldTail.
In order for loops requiring no runtime checks besides min.iters.check to execute the vector loop (as much as possible) also for BTC = UINTMAX, thereby ensuring that the scalar loop always executes at most Step-1 iterations -- the original motivation for this patch -- the following may be done:

(1) Change min.iters.check to check if BTC >= Step-1 instead of checking if N = BTC+1 >= Step. The latter overflows for BTC = UINTMAX thereby bypassing the vector loop, whereas the former does not wrap.
(2) Compute the VectorTripCount using N-((N-Step)%Step) instead of N - (N % Step). The latter produces zero when BTC = UINTMAX for any Step, which is incorrect for Steps that do not divide BTC+1, i.e., for non-power-of-2 UFs.

With requiresScalarEpilog, apply the R = (R == 0 ? Step : R) to R = ((N-Step)%Step) before subtracting it from N.

Note that VectorTripCount computed in (2) may overflow to zero, i.e., for BTC = UINTMAX and Step(UF) that is a power-of-2. This works correctly, as currently done with foldTail, where min.iters.check is eliminated, and UF is required to be a power-of-2. With foldTail, use R = ((N-1)%Step) = BTC%Step as suggested earlier above, which never wraps. In any case, this patch focuses on the tail.

To summarize:

Step = VF*UF
BTC = PSE::BackedgeTakenCount()
; min.iters.check:
if (!foldTail): if (BTC < Step-1) goto scalar loop
N = BTC+1   ; may overflow to zero
if (foldTail): R = BTC % Step
if (!foldTail): R = (N-Step) % Step
if (requiresScalarEpilog): R = (R == 0 ? Step : R)
VectorTripCount = N - R

Hopefully this way of ensuring that the tail scalar loop always executes less than Step iterations, also for non-power-of-2 Steps, has no significant negative performance impact.

danilaml updated this revision to Diff 356531.Jul 5 2021, 10:31 AM
danilaml edited the summary of this revision. (Show Details)

Rebased

dcaballe resigned from this revision.Oct 8 2021, 2:33 PM