[LV] Optimize for size when vectorizing loops with tiny trip count
ClosedPublic

Authored by Ayal on Jun 19 2017, 4:49 PM.

Details

Summary

Try to vectorize loops whose trip-count is smaller than TinyTripCountVectorThreshold under OptForSize constraint rather than not trying to vectorize them at all. The OptForSize constraint implies little if any overheads outside of the vectorized loop body, so the current cost estimate of the vectorized-vs-scalar loop body should hopefully be more/sufficiently accurate.

Also holds when the small value of the trip-count is based on profile data rather than static analysis, for potential cases where the trip-count is statically known to be divisible by the VF.

Patch inspired by D32451.

Diff Detail

Repository
rL LLVM
Ayal created this revision.Jun 19 2017, 4:49 PM
hfinkel added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
7829 ↗(On Diff #103123)

Please add some comments here explaining this behavior (it seems to make sense after reading the patch description, but without that context, I'd likely find this confusing).

twoh added a comment.Jun 19 2017, 10:13 PM

I think this is a right approach, but concerned that the experimental results I shared on D32451 show that it is generally better to not to vectorize the low trip count loops. @Ayal, I wonder if you have any results that this patch actually improves the performance. Thanks!

In D34373#784975, @twoh wrote:

I think this is a right approach, but concerned that the experimental results I shared on D32451 show that it is generally better to not to vectorize the low trip count loops. @Ayal, I wonder if you have any results that this patch actually improves the performance. Thanks!

I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).

twoh added a comment.Jun 20 2017, 9:17 AM
In D34373#784975, @twoh wrote:

I think this is a right approach, but concerned that the experimental results I shared on D32451 show that it is generally better to not to vectorize the low trip count loops. @Ayal, I wonder if you have any results that this patch actually improves the performance. Thanks!

I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).

Got it. My concern was for small single-level loops with low trip counts, as I observe them pretty frequently. I have no objection accepting this patch and improve the cost estimator separately.

In D34373#785485, @twoh wrote:
In D34373#784975, @twoh wrote:

I think this is a right approach, but concerned that the experimental results I shared on D32451 show that it is generally better to not to vectorize the low trip count loops. @Ayal, I wonder if you have any results that this patch actually improves the performance. Thanks!

I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).

Got it. My concern was for small single-level loops with low trip counts, as I observe them pretty frequently. I have no objection accepting this patch and improve the cost estimator separately.

Do you mean that you see such loops frequently with dynamically-small trip counts, or with static trip counts? I assume the small loops with (small) static trip counts will generally be unrolled.

twoh added a comment.Jun 20 2017, 11:11 AM
In D34373#785485, @twoh wrote:
In D34373#784975, @twoh wrote:

I think this is a right approach, but concerned that the experimental results I shared on D32451 show that it is generally better to not to vectorize the low trip count loops. @Ayal, I wonder if you have any results that this patch actually improves the performance. Thanks!

I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).

Got it. My concern was for small single-level loops with low trip counts, as I observe them pretty frequently. I have no objection accepting this patch and improve the cost estimator separately.

Do you mean that you see such loops frequently with dynamically-small trip counts, or with static trip counts? I assume the small loops with (small) static trip counts will generally be unrolled.

Actually you're right. The case I observed was a small static trip count loop completely unrolled and SLP vectorized which actually harms the performance, but not with LV. I think this patch should work if it effectively targets loops that are large enough to not to be unrolled.

In D34373#785678, @twoh wrote:
In D34373#785485, @twoh wrote:
In D34373#784975, @twoh wrote:

I think this is a right approach, but concerned that the experimental results I shared on D32451 show that it is generally better to not to vectorize the low trip count loops. @Ayal, I wonder if you have any results that this patch actually improves the performance. Thanks!

I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).

Got it. My concern was for small single-level loops with low trip counts, as I observe them pretty frequently. I have no objection accepting this patch and improve the cost estimator separately.

Do you mean that you see such loops frequently with dynamically-small trip counts, or with static trip counts? I assume the small loops with (small) static trip counts will generally be unrolled.

Actually you're right. The case I observed was a small static trip count loop completely unrolled and SLP vectorized which actually harms the performance, but not with LV. I think this patch should work if it effectively targets loops that are large enough to not to be unrolled.

I agree. Given that this will only vectorize loops that don't need a remainder loop, even if it would otherwise be unrolled, that should be fine. As you might be pointing out with you observation about SLP vectorization sometimes hurting performance, there certainly are cases where vectorization of small numbers of instructions can harm performance on OOO cores (for example, because they introduce additional data dependencies that might be more harmful than the corresponding increase in parallelism). It seems possible that the code for a small loop that comes out of the LV might have the same issue (if, for example, we generate unaligned vector loads, or access strided data and then shuffle it together). However, it is not clear to me that this will be the case. The SLP vectorizer has a minimum tree height of three, and for the LV to produce a loop that unrolls to less than three instructions, I assume it would need to essentially be a memcpy. I suspect that we'll need to try it and see if we find regressions.

Ayal added a comment.Jun 21 2017, 1:45 PM

Yes, we saw a couple of ~7% improvements running eembc benchmarks on x86.

This patch applies mostly to short static trip counts. For it to apply to short profile-based trip counts, they would need to be divisible by VF statically to avoid a remainder loop.

The current cost-model aims to estimate the relative performance of the loop body (only), when vectorized vs. original scalar version. The overheads of runtime guards and remainder loop may certainly outweigh the gains of the vectorized body, especially if the trip count is small; unless we know the former are not needed at all. If the body is expected to run faster when vectorized with a large trip count, it seems reasonable to expect it would do so with a small trip count, when all that's running is the body. Right?

Regarding unrolling such small trip-count loops, note that the loop-vectorizer itself may decide to do so, with interleaving.

Sure, will add comments explaining the logic behind turning on OptForSize in this case.

dorit added a subscriber: dorit.Jun 25 2017, 5:50 AM
twoh added a comment.Jun 25 2017, 2:58 PM

Hello @Ayal, can you please update the comments per @hfinkel's request so that I can accept the patch? Thanks!

Ayal updated this revision to Diff 104453.Jun 28 2017, 10:33 AM

Updated version includes the comment requested by @hfinkel.

@mkuper, this is somewhat related to D26873; it conceptually allows vectorizing loops with low dynamic trip-counts that are also known to be divisible by VF statically; but current computeMaxVF() allows vectorizing loops under OptForSize only if their trip count is known statically, and known to be divisible by VF.

This revision is now accepted and ready to land.Jun 28 2017, 10:56 AM
This revision was automatically updated to reflect the committed changes.