This is an archive of the discontinued LLVM Phabricator instance.

[LV] Don't vectorize when we have a static bound on trip count
ClosedPublic

Authored by mkuper on Dec 12 2016, 3:39 PM.

Details

Summary

We currently check whether the exact trip count is known and is smaller than the "tiny loop" bound. We should be checking the maximum bound on the trip count instead.

Note that right now this probably won't do a lot of good, since getSmallConstantMaxTripCount() seems to choke on the interesting cases; I hope to fix that in follow-ups.
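
To make the distinction concrete, here is a hypothetical loop (my illustration, not from the patch) whose exact trip count is unknown while its maximum trip count is provably tiny:

    // Illustrative only. The exact trip count depends on n, so
    // getSmallConstantTripCount() returns 0 (unknown); but "n & 7" is
    // at most 7, so getSmallConstantMaxTripCount() can in principle
    // bound the trip count by 7 - well under any "tiny loop" threshold.
    void addTail(float *a, const float *b, unsigned n) {
      for (unsigned i = 0; i < (n & 7); ++i)
        a[i] += b[i];
    }

Whether SCEV actually derives the bound for cases like this is exactly the getSmallConstantMaxTripCount() limitation mentioned above.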

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 81155. Dec 12 2016, 3:39 PM
mkuper retitled this revision from to [LV] Don't vectorize when we have a static bound on trip count.
mkuper updated this object.
mkuper added reviewers: mssimpso, gilr.
mkuper added a subscriber: llvm-commits.
mssimpso accepted this revision. Dec 12 2016, 4:35 PM
mssimpso edited edge metadata.

LGTM.

This revision is now accepted and ready to land. Dec 12 2016, 4:35 PM

Can you please remind me why we're doing this? A loop with a small trip count should be unrolled if its body is small, so the only loops we should really be seeing here are small-trip-count loops with large bodies. That being the case, if vectorization looks profitable, why isn't it?

> Can you please remind me why we're doing this? A loop with a small trip count should be unrolled if its body is small, so the only loops we should really be seeing here are small-trip-count loops with large bodies. That being the case, if vectorization looks profitable, why isn't it?

There are two separate reasons it may not be profitable:

  1. The one-time setup overhead for the vectorized loop - broadcasts, bounds computations for runtime checks, etc. - is too high relative to the gain from vectorization.
  2. We should never use a VF that is higher than the trip count: if we do, we pay the full overhead while, in practice, only the scalar remainder loop executes. This is, of course, completely independent of the size of the loop body. (That may change if we start using masked vector remainder loops, but I think Intel experimented with that and didn't get very good results, so I don't see this happening in the near future.)

For (1), what we should do, if we want to be precise, is compute the overhead cost and then take both that cost and the expected number of iterations into account in the cost model. As for (2), the right way to deal with it is to upper-bound the VF choice by the trip count.
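
To make (2) concrete, consider a trip count of 3 vectorized at VF = 4; the sketch below shows the shape of the output, not actual vectorizer codegen:

    // With trip count 3 and VF = 4, the vector loop runs
    // floor(3 / 4) = 0 times: its guard is false on entry. We pay the
    // setup overhead (runtime checks, broadcasts, ...) and then execute
    // every iteration in the scalar remainder anyway.
    void add3(float *a, const float *b) {
      int i = 0;
      for (; i + 4 <= 3; i += 4) {
        // 4-wide vector body - never reached.
      }
      for (; i < 3; ++i) // scalar remainder does all the work
        a[i] += b[i];
    }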

However, since we don't really model the overhead correctly, what we have (and have had since forever) is the big hammer of simply bailing out on loops with a small trip count. What I'm doing is extending that hammer to cover all kinds of loops uniformly - not only loops with a known exact static trip count, but also loops with dynamic trip counts (for PGO builds) and loops with static trip-count bounds.
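
In sketch form, the uniform check reads something like this (my sketch of the intent, not the committed code; the two ScalarEvolution calls are real LLVM APIs, while getProfileTripCount() is a hypothetical stand-in for however the PGO-based estimate gets obtained):

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/Analysis/ScalarEvolution.h"
    using namespace llvm;

    // Hypothetical placeholder for the profile-based trip count;
    // assume it returns 0 when no profile data is available.
    unsigned getProfileTripCount(Loop *L);

    static bool hasKnownSmallTripCount(ScalarEvolution &SE, Loop *L,
                                       unsigned Threshold) {
      // Known exact static trip count (0 means unknown).
      if (unsigned TC = SE.getSmallConstantTripCount(L))
        return TC < Threshold;
      // Static upper bound on the trip count (0 means no bound known).
      if (unsigned MaxTC = SE.getSmallConstantMaxTripCount(L))
        return MaxTC < Threshold;
      // Dynamic trip count estimated from profile data (PGO builds).
      if (unsigned ProfTC = getProfileTripCount(L))
        return ProfTC < Threshold;
      return false;
    }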

My main motivations for that are:

a) Loops that have an unknown static trip count but, according to the profile, very low iteration counts (0 or 1). These should not be unrolled, and should never be vectorized either.

b) Remainder loops for hand-vectorized code, a shape like the sketch below. These will also not be unrolled - the trip count is unknown and has no known multiple. (We may end up with runtime unrolling and yet another "remainder loop", which doesn't really improve things.) And, of course, it's almost always a bad idea to vectorize these. (The exception may be something like hand-vectorization by 16 with a scalar remainder loop; we may want to vectorize that remainder by 4 and leave a smaller scalar remainder, but that sounds like a very small win.)
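
For reference, the shape in (b) looks roughly like this (an illustrative hand-vectorized loop of my own, not taken from any particular codebase):

    // Hand-vectorized by 16 with a scalar remainder. The remainder's
    // trip count is unknown at compile time, bounded by 15, and has no
    // known multiple - so unrolling it or vectorizing it again buys
    // little.
    void scale(float *a, int n, float k) {
      int i = 0;
      for (; i + 16 <= n; i += 16) {
        // Stands in for a 16-wide intrinsics body.
        for (int j = 0; j < 16; ++j)
          a[i + j] *= k;
      }
      for (; i < n; ++i) // scalar remainder: 0..15 iterations
        a[i] *= k;
    }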

I agree that we probably want to fix (1) eventually, but that will still need a good trip count estimate, so it's in some sense orthogonal to improving the estimate in cases where it's not statically known exactly.

>> Can you please remind me why we're doing this? A loop with a small trip count should be unrolled if its body is small, so the only loops we should really be seeing here are small-trip-count loops with large bodies. That being the case, if vectorization looks profitable, why isn't it?

> There are two separate reasons it may not be profitable:

>   1. The one-time setup overhead for the vectorized loop - broadcasts, bounds computations for runtime checks, etc. - is too high relative to the gain from vectorization.
>   2. We should never use a VF that is higher than the trip count: if we do, we pay the full overhead while, in practice, only the scalar remainder loop executes. This is, of course, completely independent of the size of the loop body. (That may change if we start using masked vector remainder loops, but I think Intel experimented with that and didn't get very good results, so I don't see this happening in the near future.)

> For (1), what we should do, if we want to be precise, is compute the overhead cost and then take both that cost and the expected number of iterations into account in the cost model. As for (2), the right way to deal with it is to upper-bound the VF choice by the trip count.

> However, since we don't really model the overhead correctly, what we have (and have had since forever) is the big hammer of simply bailing out on loops with a small trip count. What I'm doing is extending that hammer to cover all kinds of loops uniformly - not only loops with a known exact static trip count, but also loops with dynamic trip counts (for PGO builds) and loops with static trip-count bounds.

> My main motivations for that are:

> a) Loops that have an unknown static trip count but, according to the profile, very low iteration counts (0 or 1). These should not be unrolled, and should never be vectorized either.

For 0 or 1, agreed.

> b) Remainder loops for hand-vectorized code. These will also not be unrolled - the trip count is unknown and has no known multiple. (We may end up with runtime unrolling and yet another "remainder loop", which doesn't really improve things.) And, of course, it's almost always a bad idea to vectorize these. (The exception may be something like hand-vectorization by 16 with a scalar remainder loop; we may want to vectorize that remainder by 4 and leave a smaller scalar remainder, but that sounds like a very small win.)

I agree, but I think we're going about this the wrong way. The cost of the branching and runtime checks needs to be factored into the cost model (which will be relevant for low-trip-count loops), and that should naturally prevent this kind of messiness. Just not vectorizing low-trip-count loops is suboptimal, because it will miss cases where vectorization is quite profitable.

> I agree that we probably want to fix (1) eventually, but that will still need a good trip count estimate, so it's in some sense orthogonal to improving the estimate in cases where it's not statically known exactly.

This revision was automatically updated to reflect the committed changes.