When vectorizing by factor of 2 the remainder loop always executes
only one iteration so there is no actual need to keep the branch.
Fixes PR44547
Differential D75746
[LoopVectorizer] Simplify branch in the remainder loop for trivial cases danilaml on Mar 6 2020, 7:35 AM. Authored by
Details
Diff Detail
Event Timeline
Comment Actions Eliminating a redundant back-edge is clearly good. It would be better if such a special-case cleanup could be handled by some subsequent optimization, and possibly generalized. Note however that the remainder loop may or may not be considered subject for further optimization: LV currently disables unrolling the remainder loop following r231631, OTOH it may be worth vectorizing according to D30247. If desired, optimizations other than vectorization should preferably be taken care of by subsequent passes such as indvars and loop-unroll. LV knows that the trip count of the remainder loop is at most VF*UF, in the absence of overflows and runtime guards, and can make an effort to convey this bound, if GVN and IPSCCP fail to pick it up (referring to PR44547). Unfortunately, introducing an llvm.assume() seems insufficient - perhaps it could be made to work? LV originally tries to keep the control of the remainder loop intact, adjusting only the starting values of its phi's, including that of its iv. If this control is going to be modified, by hacking its latch branch, another alternative is to replace it altogether with a new canonical {0, +1, %TripCount} iv (as done for the vector loop), possibly following a %TripCount = urem %ComputedTripCount, VF*UF which conveys this upper bound clearly. Somewhat akin to how truncateToMinimalBitwidths() conveys minimal bitwidths.
Comment Actions Updated revision with additional checks and rebased. I'm not sure that llvm.assume can be reliably made to work (and be simpler, than just eliminating back edge). Going with {0, +1, %TripCount} might be more beneficial in the end (it should also be trivial to call vectorize/unroll recursively in the remainder loop in such case, if I understood things correctly).
Comment Actions Rebased. Step = VF*UF BTC = PSE::BackedgeTakenCount() R = BTC % Step VTC = BTC - R if !(requiresScalarEpilog) VTC = (R == Step-1 ? VTC + Step : VTC) if (foldTail) VTC = VTC + Step VectorTripCount = VTC But it had generally a negative performance impact on the benchmarks I've run mainly due to 1) more instructions required to compute VTC 2) VTC = (R == Step-1 ? VTC + Step : VTC) select for common case making vectorized loop's Exit Count uncomputable thus impeding some later optimizations.
Comment Actions Possibly, but I have a hard time imagining how it could be done. Perhaps someone much more knowledgeable in LLVM loop optimizations than I can answer that. Comment Actions
|
I think this check is enough unless there are other cases in which "remainder loop has N % (VF*UF) iterations doesn't hold.