This is an archive of the discontinued LLVM Phabricator instance.

[LV] Allow vectorization of hot short trip count loops with epilog
AbandonedPublic

Authored by ebrevnov on Sep 20 2019, 1:09 AM.

Details

Summary

Currently we reject to vectorize loops with epilog if trip count less than 16 (as defined by TinyTripCountVectorThreshold) and loop is cold according to profile. Thus in absence of profile summary information we will never vectorize such loops even if cost model can prove profitability. With this change we try to improve the situation by using block execution count information (if available) to judge if loop is hot relatively to function entry.

Please note that I moved related code to getScalarEpilogueLowering. That not only improves maintainability but fixes potential issue of redefining decision made by getScalarEpilogueLowering by short trip count heuristic.

Ideally we should be using loop size and expected perf gain from cost model in addition to loop hotness to make the decision but that would be much bigger change and can be a follow up step.

Note that this change depends on https://reviews.llvm.org/D67690

Event Timeline

ebrevnov created this revision.Sep 20 2019, 1:09 AM
ebrevnov updated this revision to Diff 220962.Sep 20 2019, 1:20 AM

Minor update

ebrevnov updated this revision to Diff 220978.Sep 20 2019, 1:38 AM

Minor changes in tests.

ebrevnov retitled this revision from [LV] Allow vectorization of hot short trip count loops with epilog. to [LV] Allow vectorization of hot short trip count loops with epilog.Sep 20 2019, 1:42 AM
ebrevnov edited the summary of this revision. (Show Details)
ebrevnov edited the summary of this revision. (Show Details)Sep 20 2019, 1:47 AM
ebrevnov marked 2 inline comments as done.
ebrevnov added inline comments.
llvm/include/llvm/Transforms/Utils/SizeOpts.h
25 ↗(On Diff #220978)

This change is motivated by use case in getScalarEpilogueLowering.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
942

just to be consistent with the next line :-)

ebrevnov updated this revision to Diff 222118.Sep 27 2019, 3:18 AM

Test update

ebrevnov updated this revision to Diff 222120.Sep 27 2019, 3:26 AM

Minor test update

Heuristics looks pretty reasonable to me, but I'd really like to have someone more familiar with how all the existing hueristics interact give the actual LGTM.

llvm/include/llvm/Transforms/Utils/SizeOpts.h
23 ↗(On Diff #222120)

Please update comments to indicate what None value means. Your verbal explanation was clear, just make sure the code reflects that.

After that, please separate and land the refactoring change involving these two methods. Then rebase.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7440

Explicit type this please for readability.

ebrevnov marked an inline comment as done.Nov 20 2019, 12:23 AM
ebrevnov added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7464

Some rational for the chosen heuristic. In general if nobody actually asked to optimize for size it seems reasonable to relay on cost model to decide if vectorization is profitable or not even for short trip count loops. If we still want to have some balance between code bloat and performance we should decide based on potential gain and loop size for all loops. Even though the described approach looks simple and reasonable in theory it most likely will have big implications on existing apps. That's why I decided take more conservative approach and give a chance for hot loops to be vectorized.

ebrevnov updated this revision to Diff 230227.Nov 20 2019, 3:42 AM

Rebase. Attempt2.

Ayal added a comment.Nov 20 2019, 2:26 PM

Ideally we should be using loop size and expected perf gain from cost model in addition to loop hotness to make the decision but that would be much bigger change and can be a follow up step.

The original motivation to vectorize loops of tiny trip counts as if under opt-for-size, is not to save code size, nor is it related to how hot the loop is. It stems from the fact that LV's cost model currently focuses on the body of the vector loop only, disregarding overheads associated with runtime guards and code outside this body, under the assumption that the execution of the vector loop will dominate. If the trip count is tiny, this assumption no longer holds, and the overheads may outweigh the benefits, w/o LV's cost-model noticing, leading to slowdowns (which become more severe as the loop gets hotter). Under opt-for-size LV generates only the vector loop, with no runtime guards nor scalar epilog, so this assumption holds.

In order to vectorize loops of tiny trip count with a scalar epilogue, the cost model should be enhanced to account for both plus associated overheads.

Regarding hotness, that could help the cost model in general to compare the expected performance with that of the original loop, but not sure how in this case where the trip count is a known constant.

ebrevnov abandoned this revision.Dec 5 2019, 1:51 AM

I'll start a separate review since the approach should be completely different.