When vectorizing by factor of 2 the remainder loop always executes
only one iteration so there is no actual need to keep the branch.
Fixes PR44547
llvm/test/Transforms/LoopVectorize/if-pred-stores.ll | ||
---|---|---|
278 | this transform seems correct, but not sure if the original purpose of this test is still fulfilled |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | I think this check is enough unless there are other cases in which "remainder loop has N % (VF*UF) iterations doesn't hold. |
Eliminating a redundant back-edge is clearly good. It would be better if such a special-case cleanup could be handled by some subsequent optimization, and possibly generalized. Note however that the remainder loop may or may not be considered subject for further optimization: LV currently disables unrolling the remainder loop following r231631, OTOH it may be worth vectorizing according to D30247. If desired, optimizations other than vectorization should preferably be taken care of by subsequent passes such as indvars and loop-unroll. LV knows that the trip count of the remainder loop is at most VF*UF, in the absence of overflows and runtime guards, and can make an effort to convey this bound, if GVN and IPSCCP fail to pick it up (referring to PR44547). Unfortunately, introducing an llvm.assume() seems insufficient - perhaps it could be made to work?
LV originally tries to keep the control of the remainder loop intact, adjusting only the starting values of its phi's, including that of its iv. If this control is going to be modified, by hacking its latch branch, another alternative is to replace it altogether with a new canonical {0, +1, %TripCount} iv (as done for the vector loop), possibly following a %TripCount = urem %ComputedTripCount, VF*UF which conveys this upper bound clearly. Somewhat akin to how truncateToMinimalBitwidths() conveys minimal bitwidths.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | Note that the trip count of the remainder loop may be equal to VF*UF, when loop requiresScalarEpilogue(); so in the above special case of VF*UF==2 remainder loop may iterate once or twice. Note that emitMinimumIterationCountCheck() also takes care of the case where adding 1 to the backedge-taken count overflows, leading to an incorrect trip count of zero; here too "remainder" loop iterates (much) more than once. | |
3110 | BI is aka ScalarLatchBr |
Updated revision with additional checks and rebased.
I'm not sure that llvm.assume can be reliably made to work (and be simpler, than just eliminating back edge).
Going with {0, +1, %TripCount} might be more beneficial in the end (it should also be trivial to call vectorize/unroll recursively in the remainder loop in such case, if I understood things correctly).
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | Thanks. I knew I might've missed something. This makes me more skeptical about potential llvm.assume solution. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | Adding requiresScalarEpilogue() check is enough to handle the first case, but not the second case. Another way may be to check if/when adding 1 is known not to overflow. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | I haven't figured out how to make VectorTripCount reliant solely on getBackedgeTakenCount without introducing extra checks/instructions (which seems to be the rationale behind the current code, instead of the one in r209854, at the expense of not vectorizing the rare UINT_MAX loops). Instead, I've opted in checking whether the overflow might've occurred by using getConstantMaxBackedgeTakenCount. If I understood things correctly, it would return -1 (all ones) in the overflow case. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | A sketch of making VectorTripCount work correctly for loops with BTC= UINTMAX as well: Step = VF*UF BTC = PSE::BackedgeTakenCount() N = BTC + 1 // could overflow to 0 so do not compute N % Step if (foldTail) N = N + (Step-1) // for rounding up R = BTC % Step // Fits foldTail: (N+Step-1)%Step == (BTC+1+Step-1)%Step == (BTC+Step)%Step == BTC%Step if !(foldTail) { R = R + 1 // Fits requiresScalarEpilog: produces 0 < R <= Step if !(requiresScalarEpilog) R = (R == Step ? 0 : R) == R % Step} VectorTripCount = N - R which could be optimized into Step = VF*UF BTC = PSE::BackedgeTakenCount() R = BTC % Step VTC = BTC - R if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC) if (foldTail) VTC = VTC + Step VectorTripCount = VTC This also allows foldTail to work with Steps (i.e., UF's) that are not a power of 2. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | Hm, but doesn't it introduce additional instructions/compares that might lead to worse CodeGen? |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
3108 | Hopefully the code-size increase and the slowdown will be insignificant, being outside the loop, but indeed they need to be checked. Another option is to use the fact that, although BTC+1 might overflow and wrap to zero, BTC-(Step-1) may not underflow if !foldTail thanks to the min.iters.check of TripCount = N = BTC+1 >= Step, and therefore (BTC+1)(w/o overflow) === (BTC-(Step-1)) modulo Step. For completeness, current VectorTripCount is computed by: Step = VF*UF BTC = PSE::BackedgeTakenCount() N = BTC + 1 if (foldTail) N = N + (Step-1) R = N % Step if (requiresScalarEpilog) R = (R == 0 ? Step : R) VectorTripCount = N - R |