Page MenuHomePhabricator

[LV] Avoid vectorizing loops under opt for size that involve SCEV checks
ClosedPublic

Authored by Ayal on Oct 23 2018, 2:18 PM.

Details

Summary

The loop vectorizer may generate runtime SCEV checks for overflow and stride==1 cases, leading to execution of original scalar loop. The latter should be eliminated when optimizing for size. This patch fixes this behavior by preventing vectorization in such cases.

Reported by @uabelho in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20181022/596443.html
Issue also occurs w/o "fold tail" patch, as demonstrated by additional tests.

Diff Detail

Repository
rL LLVM

Event Timeline

Ayal created this revision.Oct 23 2018, 2:18 PM

A PR for the problem reported by @uabelho can be found here: https://bugs.llvm.org/show_bug.cgi?id=39417

dorit accepted this revision.Oct 24 2018, 1:56 PM

Just minor comments on the tests.
LGTM.

test/Transforms/LoopVectorize/pr30654-phiscev-sext-trunc.ll
245 ↗(On Diff #170750)

"with tiny trip count / under opt for size" --> "with tiny trip count (which implies opt for size).".

250 ↗(On Diff #170750)

(can mention now that this is pr39417)

254 ↗(On Diff #170750)

I was once asked to remove the "; preds = ..." comments if they are not needed. (I see a lot of tests including these comments, but this specific file doesn't).

267 ↗(On Diff #170750)

Now that there's a PR opened for this scenario, you may want to consider moving these two loops to a separate file (pr39417-optsize-and-scevchecks.ll ?); I know that there's a desire to not open separate files for each testcase, but this specific file deals with sext-trunc overflow tests (as the name of the test implies), so may be less appropriate to include the stride==1 testcase here?

This revision is now accepted and ready to land.Oct 24 2018, 1:56 PM
bjope added a comment.EditedOct 25 2018, 1:23 AM

The patch will avoid the assert seen in PR39417 so that is great.
(hard to tell if it still is possible to hit the assert, PR39417 was triggered during fuzzy testing using csmith to generate the input)

One thing I noticed is that if I use the test case from PR39417 and add -vectorizer-min-trip-count=3, to avoid the detection of a "very small trip count", the loop will be vectorized with VF=16. That is also what happened when we triggered the assert (without this patch). Shouldn't the VF be clamped to the trip count?
It seems like the vectorizer detects that the trip count is tiny (trip count is 3), but it vectorize using VF=16 but then the vectorized loop is skipped since we emit br i1 true, label %scalar.ph, label %vector.scevcheck. So all the hard work with vectorizing the loop is just a waste of time, or could it be beneficial to have VF > tripcount in some cases?

If the actual problem is that VF should be clamped to the trip count, then maybe this patch just hides that problem in certain cases (when having OptForSize).

The patch will avoid the assert seen in PR39417 so that is great.
(hard to tell if it still is possible to hit the assert, PR39417 was triggered during fuzzy testing using csmith to generate the input)

One thing I noticed is that if I use the test case from PR39417 and add -vectorizer-min-trip-count=3, to avoid the detection of a "very small trip count", the loop will be vectorized with VF=16. That is also what happened when we triggered the assert (without this patch). Shouldn't the VF be clamped to the trip count?
It seems like the vectorizer detects that the trip count is tiny (trip count is 3), but it vectorize using VF=16 but then the vectorized loop is skipped since we emit br i1 true, label %scalar.ph, label %vector.scevcheck. So all the hard work with vectorizing the loop is just a waste of time, or could it be beneficial to have VF > tripcount in some cases?

If the actual problem is that VF should be clamped to the trip count, then maybe this patch just hides that problem in certain cases (when having OptForSize).

VF doesn't have to be clamped to tripcount, but it should be clamped to reasonable multiple of natural vector size (e.g., executing 3-iter loop using 4-way vector could be better than 2-way vector if 2-way vector is less than full vector)--- and we need to ensure we'll hit vector code if vectorizer intentionally used VF that is greater than constant tripcount. Please file a separate ticket.

Ayal updated this revision to Diff 172276.Nov 1 2018, 5:11 PM

Addressed comments.
Added another test case derived from PR39497.
Updated to trunk before committing.

Ayal marked 4 inline comments as done.Nov 1 2018, 5:13 PM
This revision was automatically updated to reflect the committed changes.
Ayal added a comment.Nov 3 2018, 11:00 PM

...
One thing I noticed is that if I use the test case from PR39417 and add -vectorizer-min-trip-count=3, to avoid the detection of a "very small trip count", the loop will be vectorized with VF=16. That is also what happened when we triggered the assert (without this patch). Shouldn't the VF be clamped to the trip count?
It seems like the vectorizer detects that the trip count is tiny (trip count is 3), but it vectorize using VF=16 but then the vectorized loop is skipped since we emit br i1 true, label %scalar.ph, label %vector.scevcheck. So all the hard work with vectorizing the loop is just a waste of time, or could it be beneficial to have VF > tripcount in some cases?

If the actual problem is that VF should be clamped to the trip count, then maybe this patch just hides that problem in certain cases (when having OptForSize).

Interesting :-)

If tail is folded by masking, then VF's greater than the trip count are potentially relevant.

If tail is not folded by masking, it's indeed futile to use any VF(*UF) greater than the trip count. LV doesn't really notice this; IRBuilder does, when emitMinimumIterationCountCheck() asks it to

CheckMinIters = Builder.CreateICmp(
    P, Count, ConstantInt::get(Count->getType(), VF * UF),
    "min.iters.check");

it simply sets CheckMinIters to 'true'.

Another aspect related to known trip counts (smaller than VF): they should be used when comparing VectorCost to ScalarCost, instead of taking

float VectorCost = C.first / (float)i;

In any case, this patch deals with trip counts only indirectly, as they trigger OptForSize when small. Indeed worthy of a separate ticket/patch.