Page MenuHomePhabricator

[LV/LAA] Avoid specializing a loop for stride=1 when this predicate implies a single-iteration loop
ClosedPublic

Authored by dorit on Oct 10 2017, 11:25 PM.

Details

Summary

This fixes PR34681, where vectorization results in specializing the loop for the case where the unknown Stride N == 1 (where Stride happens to be equal to the loop iteration count N).
However, the vectorized loop will only be executed if the iteration count N>=VF, and since VF>1 in the vectorized loop, this implies that N > 1.
The two conditions cannot co-exist, so the vectorized loop body becomes dead code. (Eventually this dead code is identified and gets removed, but until then it can have some side-effects on the optimization passes on the way).

Instead, this patch avoids specialization of an unknown stride if we know that the "stride==1" predicate is going to contradict the loop-minimum-iteration-count guard. This gives the vectorizer a chance to try to vectorize the loop with the unknown stride, instead of falling directly to the scalar version.

While the motivation for this patch came from this vectorization scenario, it is relevant for any loop-optimization that wants to take advantage of the Stride==1 specialization when Stride is equal to the loop count, because optimizing a loop that executes at most one iteration probably doesn't make much sense, so we might as sell avoid this runtime guard.

Diff Detail

Repository
rL LLVM

Event Timeline

dorit created this revision.Oct 10 2017, 11:25 PM

It sounds like this is a profitability issue and therefore the LAA users should handle this? Maybe the users would also want to be more general (I think a loop with an iteration count of 2 is probably not worth vectorizing).

Another issue is that if LV would somehow add a stride == 1 predicate, LAA wouldn't see it so the problem wouldn't go away.

Do you happen to know what transformation we're blocking here?

dorit added a comment.Oct 11 2017, 5:10 AM

Hi Silviu,

It sounds like this is a profitability issue and therefore the LAA users should handle this? Maybe the users would also want to be more general (I think a loop with an iteration count of 2 is probably not worth vectorizing).

 
First of all, yes, that's true; This is why I went only for the obvious no-brainer case: a loop that iterates only a single iteration is not a loop and shouldn't go through any loop level analysis and transformation. Anything beyond that is indeed a profitability decision that each user needs to make.

(In fact, regardless of this corner case, the user should be given a chance to consider the specialization to stride=1 as just one alternative. At least the vectorizer is making its first steps towards a model (VPlan) that could allow it to consider different vectorization alternatives. )

Another issue is that if LV would somehow add a stride == 1 predicate, LAA wouldn't see it so the problem wouldn't go away.

 
Yes, of course, the vectorizer has a much bigger problem of not being aware of the predicates it operates under. There's the memcheck predicates, scev predicates, LAA Stride==1 predicate, and then the loop-iteration-count predicates that the vectorizer itself adds (I think it's the only ones that no other analysis pass is adding under the covers?). These predicates are never reasoned about together, no attempt to check if they contradict one another or can be optimized. Indeed this patch is not attempting to address this larger problem…

Do you happen to know what transformation we're blocking here?

The main transformation we are blocking is vectorization with unknown stride (using gathers/scalarized loads). We specialize for an impossible case, and lose the opportunity to consider an actual viable alternative.

Beyond that the side effect that I've seen, is that the removal of the dead vectorized loop body created a situation where some of the runtime guards that the vectorizer inserted before the vectorized loop got to be in the pre-header of the scalar loop that followed the (now removed) vectorized loop, thereby affecting the decisions of the loop-unroller, which in turn affected the behavior of the SLP vectorizer… (all because of IR we inserted that shouldn't have been added in the first place…).

But the stronger motivation to me is to avoid inserting an impossible predicate, waste time on analysis and transformation that this predicate enables (while also leaving behind a bunch of garbage). (And all this where we could have done something better…).

In short, Silviu, I absolutely agree with your observations, but I think that this patch has (some) value nonetheless…

Thanks,
dorit

dorit retitled this revision from [LV/LAA] Avoid secializing a loop for stride=1 when this predicate implies a single-iteration loop to [LV/LAA] Avoid specializing a loop for stride=1 when this predicate implies a single-iteration loop.Oct 18 2017, 1:37 AM
Ayal edited edge metadata.Nov 1 2017, 5:47 PM

Indeed, a loop with an iteration count smaller than VF is definitely not worth vectorizing. An interesting profitability issue is to decide how many iterations past VF suffice to amortize vectorization overheads. In any case, this single/no iteration case looks like a no-brainer and realistic case - traversing a column of an NxN matrix.

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2153 ↗(On Diff #118544)

Simplify the above explanation. Suffice to say something like the following:

Avoid adding the "Stride == 1" predicate when we know that Stride >= Trip-Count. Such a predicate will effectively optimize a single or no iteration loop, as Trip-Count <= Stride == 1.

2177 ↗(On Diff #118544)

Minus >> StrideMinusBETaken?

2184 ↗(On Diff #118544)

Can report success here, as in the original message above:

DEBUG(dbgs() << "LAA: Found a strided access that we can version");
llvm/test/Transforms/LoopVectorize/pr34681.ll
13 ↗(On Diff #118544)

Would

tmp += B[k*N];

suffice? I.e., the cast to int and offset of j seem redundant, albeit do no harm.

16 ↗(On Diff #118544)

"runtine" >> "runtime"

Suggest to emphasize: "is *not* generated"

llvm/test/Transforms/LoopVectorize/version-mem-access.ll
63 ↗(On Diff #118544)

Just noting for completeness that this test-case originally used the symbolic stride also as the trip count. Separating them below in order to continue to vectorize the loop.

I agree that this a strict improvement.

However it seems like stride versioning might be the bigger problem. Here it doesn't help us with the dependence analysis and it adds a run-time check with a very high probability of not being validated. Probably a cost analysis that takes into account the probability of execution of the versioned loop wouldn't find stride versioning profitable, even if it did pass the added condition. In addition it causes issues such as this one. Maybe we should only use it if it somehow helps with the dependence analysis, otherwise using scatter/gather might always better.

dorit added a comment.Nov 2 2017, 4:50 AM

Hi Silviu,

However it seems like stride versioning might be the bigger problem.

Yes, we have a problem with our decision making on when to do stride versioning. We need to evaluate the cost of the runtime test, and to evaluate the benefit of various possible stride specializations, and then make an informed decision about which versioning we want to perform (which stride(s) to specialize for), if at all. That's a big change that needs to be designed and carefully tuned; For example:

Maybe we should only use it if it somehow helps with the dependence analysis, otherwise using scatter/gather might always better.

If the unknown stride happens to be 1, then using gathers would have worse performance than the versioned loop with regular wide loads. So maybe we'd want more than one specialized version, or maybe some other solution, but in any case something that needs to be designed and tuned.

In short, I think we are all in agreement that:

  1. This patch is a (small) improvement, * regardless of the users and their specific cost considerations *.
  2. The cost-model aspect of deciding when to specialize for a certain stride needs to be improved. The users (LV especially) are currently not making informed decisions.

…Right?

thanks,
Dorit

Ayal added a comment.Nov 2 2017, 2:47 PM

In short, I think we are all in agreement that:

  1. This patch is a (small) improvement, * regardless of the users and their specific cost considerations *.
  2. The cost-model aspect of deciding when to specialize for a certain stride needs to be improved. The users (LV especially) are currently not making informed decisions.

    …Right?

I agree,

  1. This patch looks good to me; my comments below were very minor.
  2. Add a TODO capturing these concerns/opportunities to further restrict unknown stride specialization. Not sure what probability may be assigned to an unknown stride being one, but another option is to optimize each gather/scatter with its unknown stride guard inside the loop, rather than loop-unswitching it into a preheader predicate.

Is this ok with you too @silviu.baranga ?

Yes, I agree as well.

My point was that a more general solution might be to turn off stride versioning, but that requires some more discussions and benchmarking.

LGTM

Closed by commit rL317438 (authored by dorit). · Explain WhyNov 5 2017, 8:53 AM
This revision was automatically updated to reflect the committed changes.