Page MenuHomePhabricator

[LV] Avoid considering scalar-with-predication instructions as also uniform-after-vectorization, fix PR40816
ClosedPublic

Authored by Ayal on Fri, Nov 15, 2:19 AM.

Details

Summary

Instructions identified as "scalar with predication" will be "vectorized" using a replicating region. If such instructions are also optimized as "uniform after vectorization", namely when only the first of VF lanes is used, such a replicating region becomes erroneous - only the first instance of the region can and should be formed. Fix such cases by not considering such instructions as "uniform after vectorization".

A TODO is left as such cases could be optimized by implementing single instance regions, but noting that such cases are rare. The specific case of PR40816 should be optimized by not vectorizing such instructions at all but instead recognizing them as DeadInstructions, or employing indvars to rid them before LV as discussed in https://reviews.llvm.org/D68577#1742745.

The added test case is a simplification of the original one reported in the PR.

Diff Detail

Event Timeline

Ayal created this revision.Fri, Nov 15, 2:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptFri, Nov 15, 2:19 AM

Nice! This indeed seems to solve the problem we saw in PR40816.

I don't know the code but I can at least provide some additional testing by including this patch in my weekend testing to see nothing unexpected pops up there.

Nice! This indeed seems to solve the problem we saw in PR40816.

I don't know the code but I can at least provide some additional testing by including this patch in my weekend testing to see nothing unexpected pops up there.

Testing went well!

Ayal marked an inline comment as done.Mon, Nov 18, 1:42 AM

Nice! This indeed seems to solve the problem we saw in PR40816.

I don't know the code but I can at least provide some additional testing by including this patch in my weekend testing to see nothing unexpected pops up there.

Testing went well!

Thanks Mikael, good to know!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4682

LLVM_DEBUG is redundant under #ifndef NDEBUG, will remove.

bjope added a subscriber: bjope.Tue, Nov 19, 1:05 AM
Ayal updated this revision to Diff 231306.Wed, Nov 27, 12:03 PM

Ping.

Cleaned-up the #ifndef/LLVM_DEBUG and renamed the lambda.

Ayal added a comment.Wed, Nov 27, 12:03 PM

and rebased.

fhahn accepted this revision.Thu, Nov 28, 9:52 AM

In the test case we predicate due to low trip count right? It might be worth calling that out.

Also, it would be great if the test could be simplified a bit more (e.g. make the phi and related values i16, get rid of the truncation and the multiply). It seems like the test is not the most robust, as the load feeding the compare is not used at all after vectorisation, but I’m not sure if there’s another easy way to mark it as uniform at the moment.

LGTM thanks, any test improvements would be a nice bonus :) I agree that in most cases it’s unlikely that replicating a few uniform instructions will have a big impact.

This revision is now accepted and ready to land.Thu, Nov 28, 9:52 AM
fhahn added a comment.Fri, Nov 29, 7:39 AM

It looks like this also fixes https://bugs.llvm.org/show_bug.cgi?id=43951. The reproducer there might be a good source for a test case as well.

Ayal updated this revision to Diff 231626.Sun, Dec 1, 2:56 PM
Ayal edited the summary of this revision. (Show Details)

Simplified the test as suggested.

Ayal added a comment.Sun, Dec 1, 3:06 PM

In the test case we predicate due to low trip count right? It might be worth calling that out.

Right, sure, called out.

Also, it would be great if the test could be simplified a bit more (e.g. make the phi and related values i16, get rid of the truncation and the multiply). It seems like the test is not the most robust, as the load feeding the compare is not used at all after vectorisation, but I’m not sure if there’s another easy way to mark it as uniform at the moment.

Test has been simplified.
Long version: if the phi is turned into an i16, LV can no longer vectorize the loop even if forced, because LoopVectorizationLegality's convertPointerToIntegerType() prevents it (and thus any induction) from being a Primary Induction. So to simplify and get rid of the truncation while still vectorizing with fold-tail, all i16's are turned into i32's instead. To make sure the load is scalarized with predication, instead of forming an interleave group (once we get rid of the multiply) or a masked gather, the target is changed from knl to core-2. If a load feeding the latch-compare (thereby becoming Uniform-After-Vectorization) is live, the loop would probably not have a countable trip count, i.e., vectorizable. The other option for a load to become UAV is if it feeds a GEP known to be consecutive, or strided; but for that the load must be known to produce an arithmetic progression...

LGTM thanks, any test improvements would be a nice bonus :) I agree that in most cases it’s unlikely that replicating a few uniform instructions will have a big impact.

Ayal added a comment.Mon, Dec 2, 3:01 PM

It looks like this also fixes https://bugs.llvm.org/show_bug.cgi?id=43951. The reproducer there might be a good source for a test case as well.

Agreed, thanks for pointing out! PR43951 exhibits a similar pattern of a load-from-constant-array used for computing the (tiny) trip count, eventually recognized as both scalar-with-predication and uniform-after-vectorization, and hence fixed by this patch. PR43951 has a (redundant) OR reduction instead of PR40816's store, but this distinction seems irrelevant in terms of test coverage.

This revision was automatically updated to reflect the committed changes.