This is an archive of the discontinued LLVM Phabricator instance.

Epilog loop vectorization
Needs Review · Public

Authored by ashutosh.nema on Feb 22 2017, 1:46 AM.

Details

Summary

This is a proposal about epilog loop vectorization.

Currently, the Loop Vectorizer inserts an epilogue loop to handle loops whose iteration counts are not known at compile time.

The Loop Vectorizer supports loops with an unknown trip count. Such a trip count may not be a multiple of the vector width, so the vectorizer has to execute the last few iterations as scalar code; it keeps a scalar copy of the loop for these remaining iterations.

A loop vectorized with a large vector width is likely to execute many scalar iterations.
For example, an i8 data type on a target with 256-bit registers can be vectorized with a vector width of 32; the scalar (epilog) loop can then execute up to 31 iterations, which is significant and worth vectorizing.
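
As a rough illustration, here is a C++ sketch of that shape (the function and the plain inner loops are stand-ins, not the vectorizer's actual output):

  #include <cstdint>

  // Conceptual shape of the vectorizer's output for a loop with unknown trip
  // count n, i8 elements, and VF = 32 (256-bit registers).
  void add_bytes(uint8_t *a, const uint8_t *b, int n) {
    int i = 0;
    // Main vector loop: handles floor(n / 32) * 32 elements.
    for (; i + 32 <= n; i += 32)
      for (int j = 0; j < 32; ++j)   // stands in for one 32-wide vector operation
        a[i + j] += b[i + j];
    // Scalar epilog loop: up to 31 leftover iterations when n % 32 != 0.
    for (; i < n; ++i)
      a[i] += b[i];
  }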

A large vector factor poses the following challenges:

  1. The number of remainder iterations can be substantial.
  2. The actual trip count at runtime can be substantial, yet still below the minimum trip count required to execute the vector loop.

These challenges can be addressed with masked instructions, but such instructions are limited and not available on all targets.

With epilog vectorization, our aim is to vectorize the epilog loop in cases where the original loop is vectorized with a large vector factor and is likely to execute many scalar iterations.

This requires the following changes:

  1. Costing: preserve all profitable vector factors.
  2. Transform: create an additional vector loop with the next profitable vector factor.

Please refer to the attached file (BlockLayout.png) for details about the transformed block layout.
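
For readers without the attachment, a hedged C++ sketch of the intended control flow follows (the VFs of 32 and 8 and the explicit guards are illustrative, not the patch's literal block layout):

  #include <cstdint>

  // Transformed layout: main vector loop (VF = 32), then an epilog vector loop
  // at the next profitable VF (here 8), then a scalar remainder of at most 7
  // iterations.
  void add_bytes_epilog_vectorized(uint8_t *a, const uint8_t *b, int n) {
    int i = 0;
    if (n >= 32)                       // minimum-iteration check, main vector loop
      for (; i + 32 <= n; i += 32)
        for (int j = 0; j < 32; ++j)   // one 32-wide vector operation
          a[i + j] += b[i + j];
    if (n - i >= 8)                    // minimum-iteration check, epilog vector loop
      for (; i + 8 <= n; i += 8)
        for (int j = 0; j < 8; ++j)    // one 8-wide vector operation
          a[i + j] += b[i + j];
    for (; i < n; ++i)                 // scalar remainder
      a[i] += b[i];
  }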

Diff Detail

Repository
rL LLVM

Event Timeline

ashutosh.nema created this revision. Feb 22 2017, 1:46 AM
dorit added a subscriber: dorit. Feb 22 2017, 1:51 AM

Test cases are missing; I will add them.

rengolin edited edge metadata. Feb 22 2017, 3:29 AM

I think this is an interesting idea, though with limited applicability. Furthermore, the way you implemented it makes it completely orthogonal to the vectoriser, and at odds with the VPlan strategy being discussed.

Before VPlans get in, I'd assume you would sort the strategies by cost and, if they were in a beneficial order (ex. 4 > 2 > 1), you'd then split the loop in N parts, one for each size. The problem here is that the trade-offs are not clear, and this is probably only beneficial for *very* large VF (32+), because now you're adding more run time checks, shuffles and moves between scalar and vector register banks, which are not always free.

A few more ideas...

  1. If the second loop needs to be >16, then just unroll however many instructions to match 16 lanes, no loops necessary.
  2. Link this optimisation to code size restrictions. We don't want to run this at -Os.
  3. Once VPlans go in, this would be a separate VPlan that could be applied on top of others, so we may need to change the VPlan implementation to allow that.

Finally, tests. Even though this is just a proposal, tests let you show it "in action" and allow us to discuss specific details of the pass that could be too opaque or intricate to realise just by looking at the code.

cheers,
--renato

lib/Transforms/Vectorize/LoopVectorize.cpp
3429

So, you're assuming that just having more than 16 iterations is "beneficial", but you haven't done any cost analysis. It may very well be that the cost is just not worth it, especially for smaller vector sizes.

delena added a subscriber: delena. Feb 22 2017, 3:43 AM
mkuper edited edge metadata. Feb 22 2017, 10:35 AM

I haven't looked at the patch yet - but I think it's a nice idea.
I know there was some experimentation with using masked epilogues, which, IIRC, did not provide a tangible benefit. Do you have any performance numbers showing that doing it this way does?

ashutosh.nema added inline comments. Mar 6 2017, 12:51 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
3364

We can remove the call to the loop vectorizer's legality check before widening the epilog loop: legality was already proven for the scalar loop while generating the first vector version, and it does not need to be proven again since the scalar body has not changed. Epilog vectorization will rely on the already computed legality. I have tried this change and found no impact in my regular tests & benchmarks.

Please let me know your thoughts on this.

3429

Costing is already done: during the cost calculation for the first vector version we preserved the profitable VFs.
At this point we look for the next profitable vector factor and widen the epilog loop with it.
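
As a hypothetical illustration of that bookkeeping (not the patch's actual code; the struct and member names below are made up):

  #include <algorithm>
  #include <vector>

  // Candidate VFs that the cost model found profitable, recorded while costing
  // the first (main) vector loop.
  struct ProfitableVFInfo {
    std::vector<unsigned> ProfitableVFs;   // e.g. {4, 8, 16, 32}

    // Largest profitable VF strictly smaller than the main loop's VF,
    // or 0 if there is none; used to widen the epilog loop.
    unsigned nextProfitableVF(unsigned MainVF) const {
      unsigned Best = 0;
      for (unsigned VF : ProfitableVFs)
        if (VF < MainVF)
          Best = std::max(Best, VF);
      return Best;
    }
  };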

This change includes:

a) Code refactoring.

b) Removed the call to the loop vectorizer's legality check before widening the epilog loop, as legality was already proven for the scalar loop while generating the first vector version; it does not need to be proven again since the scalar body has not changed. Epilog vectorization will rely on the already computed legality.

c) Added tests.

Block layout description

Hi Ashutosh,

Sorry for the delay. I think the patch is looking good from my end.

I'd add a TODO to maybe unroll+SLP if the next profitable VF and the size of the tail are close to each other. This could also simplify the call graph.

Example: VF=8, so tail is at most 7 and next VF is 4. This would be better without a latch, just if (end - i) > 4 -> 4-way SIMD -> for (i .. end) -> scalar.
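
A C++ sketch of that shape, purely illustrative: with VF = 8 the tail is at most 7, so a single guarded 4-wide step replaces a second vector loop and at most 3 scalar iterations remain:

  #include <cstdint>

  void add_bytes_short_tail(uint8_t *a, const uint8_t *b, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8)        // main vector loop, VF = 8
      for (int j = 0; j < 8; ++j)
        a[i + j] += b[i + j];
    if (n - i >= 4) {                 // single guarded 4-wide step, no loop latch
      for (int j = 0; j < 4; ++j)     // stands in for one 4-wide vector operation
        a[i + j] += b[i + j];
      i += 4;
    }
    for (; i < n; ++i)                // scalar remainder, at most 3 iterations
      a[i] += b[i];
  }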

But this is not a job for this patch, I think.

So, now, it would be good to know a few numbers. Can you share some performance improvements?

Also, it would be good to get @mkuper's take on this.

cheers,
--renato

lib/Transforms/Vectorize/LoopVectorize.cpp
3364

Legality is not the problem here, but the cost is. If the transformation is legal for VF, then it's also legal for all VFs < VF.

6420

This makes sense to me, and it's exactly how I would implement it.

6430

You don't need the NOTE in comments...

After this patch was posted, there was an RFC discussion regarding approach. We discussed, for example, just adding metadata (or keeping similar state) to restrict VF and rerunning the vectorizer to vectorize the epilogue loop. Can you please summarize that discussion and how it relates to what's here? Did the design of this patch change as a result of that discussion? If not, why not?
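
For context on that alternative: a per-loop width restriction already exists as llvm.loop.vectorize.width metadata, which Clang exposes at source level through a pragma. The C++ snippet below only illustrates that existing mechanism; it is not what this patch implements:

  // The pragma is lowered to "llvm.loop.vectorize.width" loop metadata, which
  // caps the VF the LoopVectorizer may choose for this loop. The alternative
  // discussed in the RFC was to attach a similar restriction to the generated
  // epilog loop and rerun the vectorizer on it.
  void add_epilog(float *a, const float *b, int n) {
  #pragma clang loop vectorize_width(4)
    for (int i = 0; i < n; ++i)
      a[i] += b[i];
  }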

include/llvm/Transforms/Utils/LoopVersioning.h
151

I don't see why this variable is needed. It only seems to be used to adjust the assert. I see no problem with relaxing the required loop conditions, but I think that you should just relax them and document what they are.

@ashutosh.nema, can you clarify where this patch stands and whether it's going to be updated soon?

This patch is a little old and not updated with the latest vectorization changes.

In the RFC, a few concerns were raised around the new block layout:
a) If the original loop's alias check fails, jump directly to the scalar loop.
b) There should be no extra checks on the critical path, i.e. when the minimum iteration check fails for the original vector loop, jump directly to the scalar loop. In my opinion this is a possible loss of an optimization opportunity: when the minimum iteration check for the original loop fails, we can still try executing the epilog loop, and that requires the epilog loop's minimum iteration check. We can keep this under an option.

I guess both cases can be handled; I'll re-initiate the discussion & update this patch soon.

Is there any update on this patch? I find it very useful, for both cycles and size, for targets that have masked operations not just for loads and stores but also for other operations, including intra operations.

Ayal added a comment. Jan 29 2019, 8:15 AM

BTW, targets that have efficient masked operations may also find it useful for cycles and size to vectorize (including epilogs) under -Os, and following r345705 possibly also with enable-masked-interleaved-mem-accesses turned on.

I agree, but I think this approach is a tad too heavy for masked operations.

In SVE, the masking will happen naturally and the last loop will just have the remaining elements as a consequence of the ISA design.

In non-scalable vector extensions, that's not true, so it would need some emulation. But I wouldn't make it such a special case, with so many checks.

I imagine that, if vectorisation at the full vector length is legal, then so is vectorisation at a smaller length. It may not be profitable on its own, but the fact that it continues the existing patterns (before moving results into scalar registers) will probably make it cheaper than a scalar tail.

This could either be some arithmetic on the mask computation (which would occur on every iteration of the loop + 1) or a tail loop with the same mask computation duplicated from the one just vectorised. I imagine -O3 would have different trade-offs than -Os, so we could potentially have both solutions.
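
To make the masked-tail idea concrete, here is a rough C++ emulation of one predicated tail step (plain scalar code stands in for masked vector operations; no target intrinsics are assumed):

  #include <cstdint>

  void add_bytes_masked_tail(uint8_t *a, const uint8_t *b, int n) {
    const int VF = 8;
    int i = 0;
    for (; i + VF <= n; i += VF)       // unmasked main vector loop
      for (int j = 0; j < VF; ++j)
        a[i + j] += b[i + j];
    if (i < n) {                       // one predicated "vector" step for the tail
      bool mask[VF];
      for (int j = 0; j < VF; ++j)     // mask computation: lane active iff i + j < n
        mask[j] = (i + j) < n;
      for (int j = 0; j < VF; ++j)     // stands in for one masked 8-wide operation
        if (mask[j])
          a[i + j] += b[i + j];
    }
  }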

Herald added a project: Restricted Project. Jul 30 2019, 11:11 AM