If a scalar remainder loop is required on a divergent target and the
loop bounds are divergent, both the vector and scalar loops could end up executing.
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5176 | The comment here needs a bit more detail. Even if the trip count isn't uniform, as long as it isn't tiny, vectorizing still reduces the total number of instructions executed. For example, if you're vectorizing a simple f32 loop to <4 x f32>, the main loop executes "maxTripCount/4" times, and the remainder loop executes at most 3 times. Assuming maxTripCount is large, maxTripCount/4 + 3 is much smaller than maxTripCount. |
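The arithmetic in the inline comment above can be checked with a small standalone sketch. The helper names `vectorizedIterations` and `worstCaseIterations` are hypothetical and do not come from LoopVectorize.cpp; the sketch assumes each trip through either loop body costs the same, which is a simplification.

```cpp
#include <cstdint>

// Hypothetical cost model (not LLVM code): count trips through the vector
// main loop plus trips through the scalar remainder loop.
constexpr std::uint64_t vectorizedIterations(std::uint64_t TripCount,
                                             std::uint64_t VF) {
  return TripCount / VF + TripCount % VF;
}

// Upper bound used in the comment: MaxTripCount/VF vector iterations plus
// at most VF-1 remainder iterations.
constexpr std::uint64_t worstCaseIterations(std::uint64_t MaxTripCount,
                                            std::uint64_t VF) {
  return MaxTripCount / VF + (VF - 1);
}

// Large trip count: 250 vector iterations + 3 scalar iterations.
static_assert(vectorizedIterations(1003, 4) == 253, "250 + 3");
// The bound for maxTripCount = 1000 and VF = 4 is 1000/4 + 3.
static_assert(worstCaseIterations(1000, 4) == 253, "250 + 3");
// Tiny trip count: the vector loop never runs, so nothing is saved.
static_assert(vectorizedIterations(3, 4) == 3, "0 + 3");
```

Under this model a trip count of 1003 drops to 253 iterations, while a trip count of 3 stays at 3, which is why the "as long as it isn't tiny" qualifier matters.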
I don't understand this patch. Unless there is both a vector and a scalar loop, I don't see how vectorizing the loop affects divergence between threads one way or the other. Do you really mean to prohibit vectorization when you have a dynamic trip count *and also* don't require a scalar loop? If so, you might look at D34373, which is related.
Ah, okay. Adding that check on top of D34373 seems likely to be easy. I recommend doing that.
Rebased on the new OptSize handling. Allow forced vectorization with metadata in case the user knows the trip count is dynamically uniform, etc.
lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
6322 | Here we should indeed continue to use (OptForSize) only, rather than (OptForSize || OptForDivergent), right? | |
6323 | Should this also be if ((OptForSize || OptForDivergent) && TC % MaxVF != 0) ? It may be better to have one AvoidTailLoop flag, which is set if we're really optimizing for size, dealing with a tiny loop, or optimizing for a divergent target. Check here if a tail is needed for whatever reason, with a debug dump stating simply that a tail is required when it must be avoided. The reason why a tail must be avoided can be dumped separately when setting AvoidTailLoop. Not sure how best to handle the different ORE reports though; perhaps by refactoring out an isTailLoopNeeded()? |
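The single-flag idea from the inline comment at 6323 might look roughly like the standalone sketch below. It is only an illustration of the control flow, not the actual LoopVectorize.cpp change: `TailAvoidReason`, `tailAvoidReason`, and `mustBailOut` are hypothetical names, `isTailLoopNeeded` is given an assumed signature, and `MaxVF` is assumed to be at least 1.

```cpp
#include <cstdint>

// Hypothetical enum (not from LLVM): why a tail loop must be avoided.
// Recording the reason separately lets it be reported when the flag is set.
enum class TailAvoidReason { None, OptForSize, TinyTripCount, DivergentTarget };

// Fold the three conditions named in the review into one reason.
inline TailAvoidReason tailAvoidReason(bool OptForSize, bool TinyLoop,
                                       bool OptForDivergent) {
  if (OptForSize)
    return TailAvoidReason::OptForSize;
  if (TinyLoop)
    return TailAvoidReason::TinyTripCount;
  if (OptForDivergent)
    return TailAvoidReason::DivergentTarget;
  return TailAvoidReason::None;
}

// Assumed signature: a tail is needed when the trip count is not a
// multiple of the maximum vectorization factor. Assumes MaxVF >= 1.
inline bool isTailLoopNeeded(std::uint64_t TC, std::uint64_t MaxVF) {
  return TC % MaxVF != 0;
}

// Give up on vectorizing only when a tail is needed and must be avoided.
inline bool mustBailOut(TailAvoidReason Reason, std::uint64_t TC,
                        std::uint64_t MaxVF) {
  const bool AvoidTailLoop = Reason != TailAvoidReason::None;
  return AvoidTailLoop && isTailLoopNeeded(TC, MaxVF);
}
```

With this shape the check at the bail-out point is the same for every reason, and the reason-specific debug output can live where the flag is computed.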