This is an archive of the discontinued LLVM Phabricator instance.

LV: Don't vectorize with unknown loop counts on divergent targets
Needs ReviewPublic

Authored by arsenm on May 1 2017, 6:07 PM.

Details

Summary

If a scalar loop is required on a divergent target, if the loop
bounds is divergent both loops could end up executing.

Diff Detail

Event Timeline

arsenm created this revision.May 1 2017, 6:07 PM
efriedma added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
5176

The comment here needs a bit more detail. Even if the trip count isn't uniform, as long as it isn't tiny, vectorizing still reduces the total number of instructions executed. For example, if you're vectorizing a simple f32 loop to <4 x f32>, the main loop executes "maxTripCount/4" times, and the remainder loop executes at most 3 times. Assuming maxTripCount is large, maxTripCount/4 + 3 is much smaller than maxTripCount.

arsenm updated this revision to Diff 97484.May 2 2017, 12:25 PM

Add more to comment

ping

I don't understand this patch. Unless there is both a vector and scalar loop, and I don't see how vectorizing the loop affects divergence between threads one way or the other. Do you really mean to prohibit vectorization when you have a dynamic trip count *and also* don't require a scalar loop? If so, you might look at D34373 which is related.

ping

I don't understand this patch. Unless there is both a vector and scalar loop, and I don't see how vectorizing the loop affects divergence between threads one way or the other. Do you really mean to prohibit vectorization when you have a dynamic trip count *and also* don't require a scalar loop? If so, you might look at D34373 which is related.

Yes, the point is to avoid the additional condition where both loops could execute

ping

I don't understand this patch. Unless there is both a vector and scalar loop, and I don't see how vectorizing the loop affects divergence between threads one way or the other. Do you really mean to prohibit vectorization when you have a dynamic trip count *and also* don't require a scalar loop? If so, you might look at D34373 which is related.

Yes, the point is to avoid the additional condition where both loops could execute

Ah, okay. Adding that check on top of D34373 seems likely to be easy. I recommend doing that.

arsenm updated this revision to Diff 109459.Aug 2 2017, 4:43 PM
arsenm edited the summary of this revision. (Show Details)
arsenm added a reviewer: hfinkel.

Base on new OptSize handling. Allow forced vectorization with metadata in case user knows it is dynamically uniform etc.

fhahn added a subscriber: fhahn.Aug 2 2017, 11:26 PM
Ayal added inline comments.Sep 13 2017, 1:46 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
6322

Here we should indeed continue to use (OptForSize) only, rather than (OptForSize || OptForDivergent), right?

6323

Should this also be if ((OptForSize || OptForDivergent) && TC % MaxVF != 0) ?

It may be better to have one AvoidTailLoop flag, which is set if we're either really optimizing for size, or deal with a tiny loop, or optimize for a divergent target. Check here if a tail is needed for whatever reason, with a debug dump stating simply that a tail is required when it must be avoided. The reason why a tail must be avoided can be dumped separately when setting AvoidTailLoop. Not sure how to best handle the different ORE reports though; perhaps by refactoring out an isTailLoopNeeded()?