This is an archive of the discontinued LLVM Phabricator instance.

[LV] Epilogue Vectorization with Optimal Control Flow
ClosedPublic

Authored by bmahjour on Oct 16 2020, 10:30 AM.

Details

Summary

This is yet another attempt at providing support for epilogue vectorization following discussions raised in RFC http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none and reviews D30247 and D88819.

Similar to D88819, this patch achieves epilogue vectorization by executing a single VPlan twice: once on the main loop and a second time on the epilogue loop (using a different VF). This implementation differs from D88819 at least in the following ways:

  1. It's able to generate the optimal control flow discussed in the above-mentioned RFC by shortening the path length for small trip counts (those that result in all or most of the vector code getting skipped). It also avoids redundant generation of the runtime memory and SCEV checks needed to guard against pointer aliasing. Please refer to the attached image illustrating the generated CFG.
  2. It uses a more modular approach by using the strategy design pattern and extending the InnerLoopVectorizer class.
  3. It can handle loops with multiple induction variables.
  4. It adds more debug traces.

The heuristic for determining when to perform the transform is overly simplistic and needs to be improved in the future. That work is not in the scope of this patch.

Diff Detail

Event Timeline

bmahjour created this revision.Oct 16 2020, 10:30 AM
bmahjour requested review of this revision.Oct 16 2020, 10:30 AM
mivnay added inline comments.Oct 26 2020, 9:00 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

Why not enable it by default?

926

A high UF might negate the benefit of epilogue vectorization. I think keeping UF = 1 is a good idea unless there are multiple levels of epilogue loop vectorization.

7454

This function and other core functions like createEpilogueVectorizedLoopSkeleton, createInductionResumeValues, etc. contain a lot of redundant code. Can this be improved?

xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

+1

bmahjour added inline comments.Oct 27 2020, 1:29 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

The reason I'm reluctant to enable it by default is that the heuristic for choosing when to vectorize the epilogue (and by what VF) is quite simplistic and tuned to a specific benchmark, without consideration for the cost of extra branches, code size increase, extra register spills, etc.
My thinking is we can move towards the goal of enabling it by default in smaller steps, by first implementing the transformation (this patch), then improving the cost-model along with performance tuning (future work). What do you think?

926

I think so too, but I also thought it would be a good thing to make the UF configurable in case the need arises in the future (e.g. with increasing vector widths and scalable vector types). In this patch, the only EpilogueLoopVectorizationInfo object is created with an epilogue UF of 1.
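For illustration, here is a minimal standalone sketch of the shape of such an info object; the member and constructor names below are assumptions for exposition, not the exact declaration in the patch. It also folds in the assert suggested later in this thread:

  #include <cassert>

  // Hypothetical sketch: the epilogue UF is a constructor parameter, but the
  // only instantiation in this patch passes 1, as discussed above.
  struct EpilogueLoopVectorizationInfo {
    unsigned MainLoopVF, MainLoopUF;
    unsigned EpilogueVF, EpilogueUF;
    EpilogueLoopVectorizationInfo(unsigned MVF, unsigned MUF, unsigned EVF,
                                  unsigned EUF)
        : MainLoopVF(MVF), MainLoopUF(MUF), EpilogueVF(EVF), EpilogueUF(EUF) {
      assert(EUF == 1 &&
             "A high epilogue UF is not expected to be profitable for now");
    }
  };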

7454

Right, there is a bit of code duplication in exchange for decoupling and separation of concerns. There is very little code reuse opportunity from createEpilogueVectorizedLoopSkeleton(), as it already makes calls to common code like emitMinimumIterationCountCheck, createInductionVariable, completeLoopSkeleton, etc. and the rest of the code is inherently different between the IMLAEMainLoop and IMLAEEpilogueLoop.

I agree this function can be avoided and I can think of a way to reuse code in createInductionResumeValues. I'll post an update soon.

bmahjour updated this revision to Diff 301618.EditedOct 29 2020, 7:53 AM

Removed executePlanForEpilogueVectorization, improved code reuse for createInductionResumeValues, and added more test coverage, including a case with a double IV.

bmahjour marked 2 inline comments as done.Oct 29 2020, 7:55 AM
mivnay added inline comments.Oct 29 2020, 9:01 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

Unless you see some regressions, I think we can enable it by default.

926

It would be good to assert UF == 1 I guess.

fhahn added a comment.Nov 2 2020, 5:50 AM

Thanks for sharing and sorry for the delay. I probably won’t be able to take a closer look this week due to vacation & needing to wrap up some internal work, but plan to take a closer look early next week :)

As for whether this should be on/off by default in general, I think it would be good to gather some numbers on some architectures to establish a baseline and make sure there are no surprises.

I have not looked at any of the details here, but a very high level comment is that this isn't very VPlany. If we do want to push things in that direction, then it is at least worth thinking about how this and vplan will co-exist, even if vplan isn't ready for it yet. It would be great to get to a point where all this information is in the vplan and we can compare epilog remainders vs scalar remainders vs whatever else, and come up with a good total cost based on the estimated trip count. Just something to think about.

Also, on a completely different note, I presume this could be expanded to handle predicated remainders too? So that an unpredicated loop would be given a single predicated remainder iteration, as might be useful for SVE.

Also, +1 to "it would be good to gather some numbers"

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9215

I worry this would not work if we have removed the vplan for the different VF's.

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization.ll
7

PowerPC tests go into LoopVectorize/PowerPC (I think, if it exists).

I have not looked at any of the details here, but a very high level comment is that this isn't very VPlany. If we do want to push things in that direction, then it is at least worth thinking about how this and vplan will co-exist, even if vplan isn't ready for it yet. It would be great to get to a point where all this information is in the vplan and we can compare epilog remainders vs scalar remainders vs whatever else, and come up with a good total cost based on the estimated trip count. Just something to think about.

As it stands right now, VPlan can only model control flow inside a loop. Since epilogue vectorization is concerned with control flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of VPlan is to also model the context around subject loops (e.g. the entire loop nest or other code surrounding loops). I wondered whether it's worth delaying this work until that becomes available, but most of the feedback I received was in the direction of: let's get it done now and then do it in VPlan when it's capable of representing the surrounding context.

Also, on a completely different note, I presume this could be expanded to handle predicated remainders too? So that an unpredicated loop would be given a single predicated remainder iteration, as might be useful for SVE.

Absolutely, that can be a nice follow-on to this work. I think any target that supports predicated vector instructions could benefit, especially if the predicated vector instructions perform better than scalar instructions but not as well as non-predicated vector instructions. If predicated and non-predicated vector instructions have similar throughput and latency, then perhaps tail-folding the main loop would be a better fit.

As it stands right now, VPlan can only model control flow inside a loop. Since epilogue vectorization is concerned with control flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of VPlan is to also model the context around subject loops (e.g. the entire loop nest or other code surrounding loops). I wondered whether it's worth delaying this work until that becomes available, but most of the feedback I received was in the direction of: let's get it done now and then do it in VPlan when it's capable of representing the surrounding context.

Yeah. I don't think VPlan should slow this down. The problem is that if no-one pushes on vplan to have those extra features, they will never appear :)

Absolutely, that can be a nice follow-on to this work. I think any target that supports predicated vector instructions could benefit, especially if the predicated vector instructions perform better than scalar instructions but not as well as non-predicated vector instructions. If predicated and non-predicated vector instructions have similar throughput and latency, then perhaps tail-folding the main loop would be a better fit.

I think this can depend upon.. a lot of things. Good to hear it should be possible.

As it stands right now, VPlan can only model control flow inside a loop. Since epilogue vectorization is concerned with control flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of VPlan is to also model the context around subject loops (e.g. the entire loop nest or other code surrounding loops). I wondered whether it's worth delaying this work until that becomes available, but most of the feedback I received was in the direction of: let's get it done now and then do it in VPlan when it's capable of representing the surrounding context.

Yeah. I don't think VPlan should slow this down. The problem is that if no-one pushes on vplan to have those extra features, they will never appear :)

FWIW, I agree with both. The more features we add to the vectoriser, the more difficult it will be for VPlan to catch up. At the same time, I don't think there's an initiative to do this, so waiting for VPlan would be unreasonable. Also, epilogue vectorisation has been an outstanding issue for so long, so it is time it gets addressed.

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

bmahjour updated this revision to Diff 303480.Nov 6 2020, 9:23 AM
bmahjour marked an inline comment as done.

Addressed code review comments + check for optsize and minsize.

bmahjour added inline comments.Nov 6 2020, 9:24 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
175

I'm not aware of any functional or performance regressions. Performance results for SPEC2017 x264 should match those from D88819. @mivnay would you mind verifying that on X86 and ARM? I can do a perf run on POWER.

9215

Good catch. We need to make sure that the vplan that's chosen for the main loop supports the requested VF for the epilogue loop. I've updated selectEpilogueVectorizationFactor() to check for this.
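As a rough sketch of the kind of guard described here (hasPlanWithVF is a hypothetical helper name used only for illustration; the actual interface in the patch may differ):

  // Sketch only: reject an epilogue VF unless a VPlan built for the main
  // loop can also be executed at that VF, since the same plan is run twice
  // (main loop, then epilogue loop).
  if (!LVP.hasPlanWithVF(EpilogueVF)) { // hypothetical helper
    LLVM_DEBUG(dbgs() << "LEV: no VPlan supports the requested epilogue VF, "
                         "not vectorizing the epilogue.\n");
    return VectorizationFactor::Disabled();
  }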

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization.ll
7

Done.

bmahjour marked 3 inline comments as done.Nov 6 2020, 9:31 AM

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Have you been following the conversations in D88819 and this revision? Blocking work has certainly not been my intention. I just resigned from D88819, in case people would like to move ahead with it. In terms of testing, I've verified that the x264 opportunity can be caught on POWER; however, I do not have access to other architectures, which is why I would appreciate help with testing on other platforms.

I also think "blocking" is not the right terminology here. But I was asking about this because first we had D88819, then came this one, so we have 2 options. In the end it's pretty simple I guess, because we go for the one with the best code-gen, and this patch looks like the one with the most potential. But would be nice to get that confirmed.

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Have you been following the conversations in D88819 and this revision? Blocking work has certainly not been my intention. I just resigned from D88819, in case people would like to move ahead with it. In terms of testing, I've verified that the x264 opportunity can be caught on POWER; however, I do not have access to other architectures, which is why I would appreciate help with testing on other platforms.

I have been following both the reviews....
What are x264 numbers for both the patches on POWER?

I also think "blocking" is not the right terminology here. But I was asking about this because first we had D88819, then came this one, so we have 2 options. In the end it's pretty simple I guess, because we go for the one with the best code-gen, and this patch looks like the one with the most potential. But would be nice to get that confirmed.

On ThunderX2, x264, 1c, rate:
Base: 3.31
With this patch: 3.50
With D88819: 3.60

Just checking/summarising: is there consensus that this is the patch that is going to do epilogue vectorisation, and not D88819? Or does that depend on perf numbers as Florian suggested?

I think we should go with this, unless @fhahn finds that D88819 can outperform it (unlikely). There are more loops that can be transformed with this patch and I've listed other advantages in the description. @mivnay do you agree?

Are you blocking D88819 without running benchmarks? i.e. Without a comparison on any one of the architectures. For x264, D88819 might give better numbers....

Have you been following the conversations in D88819 and this revision? Blocking work has certainly not been my intention. I just resigned from D88819, in case people would like to move ahead with it. In terms of testing, I've verified that the x264 opportunity can be caught on POWER; however, I do not have access to other architectures, which is why I would appreciate help with testing on other platforms.

It would be great if you could be a reviewer of the other patch. Let us run the benchmarks on Graviton/ThunderX2/X86/POWER and see what happens.

mivnay added a comment.Nov 9 2020, 2:27 AM

I also think "blocking" is not the right terminology here. But I was asking about this because first we had D88819, then came this one, so we have 2 options. In the end it's pretty simple I guess, because we go for the one with the best code-gen, and this patch looks like the one with the most potential. But would be nice to get that confirmed.

On ThunderX2, x264, 1c, rate:
Base: 3.31
With this patch: 3.50
With D88819: 3.60

Here are the numbers with -O3 -flto -fuse-ld=lld :

Here are the numbers for SPEC2017 intrate suite on POWER9:

and here are the number of loops that can be transformed by each patch:

SjoerdMeijer added a comment.EditedNov 9 2020, 10:51 AM

My observations here are:

  • Performance of both patches for SPEC is the same. The advantage of D88819 on ThunderX2 might be a coincidence (a knock-on effect, e.g. (loop) alignment) or have some micro-architectural reason.
  • This patch D89566 can handle more loops, but that is not reflected in the SPEC numbers (cold/not executed code?).
  • The beauty of D88819 is that the changes are very minimal, but it looks like it's worth spending the extra lines of code here in D89566 to help vectorise more.

To me, that shows there's more potential with this patch.
I don't think I can be the arbiter in this, so it's best if other reviewers comment too.

No regressions so enable by default?

bmahjour updated this revision to Diff 304012.Nov 9 2020, 4:48 PM

Update test cases for default enablement.

No regressions so enable by default?

I enabled it in my previous update, but forgot to update the test cases. Now the test cases are updated too.

My observations here are:

  • Performance of both patches for SPEC is the same. The advantage of D88819 on ThunderX2 might be a coincidence (a knock-on effect, e.g. (loop) alignment) or have some micro-architectural reason.
  • This patch D89566 can handle more loops, but that is not reflected in the SPEC numbers (cold/not executed code?).
  • The beauty of D88819 is that the changes are very minimal, but it looks like it's worth spending the extra lines of code here in D89566 to help vectorise more.

To me, that shows there's more potential with this patch.
I don't think I can be the arbiter in this, so it's best if other reviewers comment too.

As you pointed out, the reason D89566 causes no additional gain despite transforming more loops could be that the loops are cold or that the main vectorized loop dominates the execution profile. These issues can probably be addressed with an improved cost model or with PGO data. It's important to have the infrastructure in place to transform these loops for when the cost model gets an upgrade.

This is a big change, and here are some notes from my first pass reading through it. Some high-level questions here, and some more questions inline:

  • Think we need a doc update with the new vectorization skeleton? The picture in the description of this change? Don't know how feasible that is, some ascii art too as comments?
  • Difficult to see for me, but are there tests with Minsize?
  • Thanks for running the SPEC numbers! Would it not be too difficult to run the LLVM test suite too? Hopefully that serves two purposes: throw some more code at this to test it, and it should probably trigger in a few cases.
  • Given that there are no test changes in this area, it doesn't look like this is changing the tail-folding decision making; there is no interaction with that? I.e., I haven't checked, but I guess that is performed first as part of the first step, the "normal" inner loop vectorisation. I guess targets that support this have some interesting decision making to do: tail folding or epilogue vectorisation. But that doesn't seem to be a problem of this patch.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
177

nit: for consistency residual -> epilogue?

950

Bikeshedding names:

perhaps InnerMainLoopAndEpilogueVectorizer -> InnerLoopAndEpilogueVectorizer if Main doesn't add much here?

SjoerdMeijer added inline comments.Nov 10 2020, 2:30 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
986

And some more bikeshedding:

Can we make IMLAEMainLoop a bit more readable? Same for IMLAEEpilogueLoop below.

5791

Why is this 16? Do we need to put this "magic constant" behind a target hook? Or calculate this?

llvm/test/Transforms/LoopVectorize/PowerPC/optimal-epilog-vectorization.ll
7

Just checking: what is PPC specific about this test?

fhahn added a comment.Nov 10 2020, 3:07 AM

As it stands right now, VPlan can only model control-flow inside a loop. Since epilogue vectorization is concerned with control-flow around (and outside) the loop, there isn't much that can be done today to make the transformation "VPlany". I understand the ultimate goal of vplan is to also model the context around subject loops (eg the entire loop nest or other code surrounding loops). I wondered about whether it's worth to delay this work until that becomes available, but most of the feedback I received was in the direction of let's get it done now and then do it in vplan when it's capable of representing surrounding context.

Yeah. I don't think VPlan should slow this down. The problem is that if no-one pushes on vplan to have those extra features, they will never appear :)

FWIW, I agree with both. The more features we add to the vectoriser, the more difficult it will be for VPlan to catch up. At the same time, I don't think there's an initiative to do this, so waiting for VPlan would be unreasonable. Also, epilogue vectorisation has been an outstanding issue for so long, so it is time it gets addressed.

I anticipate that we will be able to eventually model RT check blocks and the CFG around the vector body somehow in VPlan, which should eventually replace the code added here. But that's not going to happen in the near future, so I don't think this should hinder progress here. I think however we should try to structure the code in a way that will not make our life harder than it needs to be going forward.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1030

IIUC we specialize LV so we can override just createEpilogueVectorizedLoopSkeleton and emitMinimumVectorEpilogueIterCountCheck (which is only used by createEpilogueVectorizedLoopSkeleton).

Could we do a more targeted specialization, e.g. by extracting the skeleton creation code into something like LoopSkeletonCreator and have a regular and an epilogue version of that? Or combine it with EpilogueLoopVectorizationInfo?

5717

I think there are some additional test cases needed to cover all code paths in here?

5721

nit: could use any_of, e.g. any_of(L.getHeader()->phis(), [this](PHINode &Phi){ return Legal->isFirstOrderRecurrence(&Phi) || Legal->isReductionVariable(&Phi); }).

5741

nit: use any_of?

bmahjour marked 10 inline comments as done.Nov 11 2020, 7:11 AM

This is a big change, and here are some notes from my first pass reading through it. Some high-level questions here, and some more questions inline:

  • Think we need a doc update with the new vectorization skeleton? The picture in the description of this change? Don't know how feasible that is, some ascii art too as comments?

Sure, I had a similar comment in D88819. I find the textual graph horrendous. I'll create a section under https://llvm.org/docs/Vectorizers.html and put the diagram there.

  • Difficult to see for me, but are there tests with Minsize?

Added optsize and minsize tests.

  • Thanks for running the SPEC numbers! Would it not be too difficult to run the LLVM test suite too? Hopefully that serves two purposes: throw some more code at this to test it, and it should probably trigger in a few cases.

I've verified that the test-suite is functionally clean. The compile-time and performance numbers fluctuate a lot. Is this normal? I reran what appeared to be a few large (20%+) degradations and they were not reproducible. There were improvements of upwards of 14% in some tests, but I think those are fluctuations as well.

  • Given that there are no test changes in this area, it doesn't look like this is changing the tail-folding decision making; there is no interaction with that? I.e., I haven't checked, but I guess that is performed first as part of the first step, the "normal" inner loop vectorisation. I guess targets that support this have some interesting decision making to do: tail folding or epilogue vectorisation. But that doesn't seem to be a problem of this patch.

This does not affect the tail-folding decision. When tail-folding is requested, no scalar loop is generated, so there is no epilogue to vectorize. However, it is possible to tail-fold the epilogue loop. This patch makes it easier to do that in the future.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
950

sure.

986

Are you referring to their names or their implementation? If the former, would EpilogueVectorizerMainLoop and EpilogueVectorizerEpilogueLoop be preferred?

1030

Currently only the skeleton code is specialized, but future enhancements will likely necessitate more specialization (eg to support widened induction vars and live-out phis). We could create a LoopSkeletonCreator, and modify the LoopVectorizationPlanner interfaces to work with it, but I think extending from the InnerLoopVectorizer fits better with the current design. Note that InnerLoopUnroller also takes a similar approach and extends InnerLoopVectorizer while only overriding a couple of functions.

5717

more tests added.

5791

This part is copied from D88819. The value 16 is chosen because it catches the x264 opportunity. As I've been saying all along, we need to replace this with a better cost-model. The cost-modeling work is not in the scope of this patch.

llvm/test/Transforms/LoopVectorize/PowerPC/optimal-epilog-vectorization.ll
7

The triple is used to make sure ILV finds the loop profitable and chooses a default VF, triggering vectorization of the main loop and its epilogue. Without this we'd have to force a vectorization factor (through hints) for the main loop; however, this patch does not vectorize epilogue loops that have user hints on them. It's debatable whether we should vectorize epilogues for hinted loops or not, or whether we need a new pragma, etc. It should be fairly easy to implement whatever we agree on, so I think we should leave those discussions for later and focus on the codegen in this patch.

bmahjour updated this revision to Diff 304511.Nov 11 2020, 7:13 AM
bmahjour marked 6 inline comments as done.

Addressed more comments.

bmahjour updated this revision to Diff 304523.Nov 11 2020, 8:14 AM

Update llvm docs.

One more round of high-level remarks before I look at some more details.

  • Thanks for running the SPEC numbers! Would it not be too difficult to run the LLVM test suite too? Hopefully that serves two purposes: throw some more code at this to test it, and it should probably trigger in a few cases.

I've verified that the test-suite is functionally clean. The compile-time and performance numbers fluctuate a lot. Is this normal? I reran what appeared to be a few large (20%+) degradations and they were not reproducible. There were improvements of upwards of 14% in some tests, but I think those are fluctuations as well.

Yeah, that could be the case. There are some (micro)benchmarks with a very small execution time, so more susceptible to fluctuations. Have you tried the CTMark subset and TEST_SUITE_BENCHMARKING_ONLY variable? Anyway, if it is too noisy, it was at least a good testing exercise.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
986

Cheers, much more readable IMHO.

5791

I think that's fine, but we should at least hide this 16 behind a TTI hook that e.g. defaults to 16, which should be a separate patch.

7561

Do we want to refer in a comment here to the doc and skeleton that you've added?

bmahjour updated this revision to Diff 304862.Nov 12 2020, 9:01 AM
bmahjour marked 3 inline comments as done.

Add link to the doc in comments.

Yeah, that could be the case. There are some (micro)benchmarks with a very small execution time, so more susceptible to fluctuations. Have you tried the CTMark subset and TEST_SUITE_BENCHMARKING_ONLY variable? Anyway, if it is too noisy, it was at least a good testing exercise.

I've tried both CTMark and TEST_SUITE_BENCHMARKING_ONLY.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5791

A reasonable cost function can probably be developed with the existing TTI hooks, so a new one may actually not be necessary. I agree to deferring this to a separate patch.

7561

Done.

SjoerdMeijer added inline comments.Nov 16 2020, 2:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5791

Okay, but can we stub it out for now? If we don't need a new TTI function, can we create a new helper function and at least sketch a way this could be calculated? It can still just default to returning 16, but I just don't think we should keep the 16 hard-coded here.

bmahjour updated this revision to Diff 305547.Nov 16 2020, 10:09 AM
bmahjour marked an inline comment as done.
bmahjour added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5791

Ok, sure. I've added an option to make it configurable too.
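For reference, a hedged sketch of what such knobs can look like in LoopVectorize.cpp; the option strings, defaults, and descriptions below are assumptions for illustration (the forced-VF option name EpilogueVectorizationForceVF is the one referenced in the later comments):

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Illustrative only: a knob replacing the hard-coded 16, plus a force
  // option for testing; exact names/defaults may differ from the patch.
  static cl::opt<unsigned> EpilogueVectorizationMinVF(
      "epilogue-vectorization-minimum-VF", cl::init(16), cl::Hidden,
      cl::desc("Only consider epilogue vectorization when the main loop VF "
               "is at least this large."));

  static cl::opt<unsigned> EpilogueVectorizationForceVF(
      "epilogue-vectorization-force-VF", cl::init(1), cl::Hidden,
      cl::desc("A value greater than 1 forces this VF for the epilogue "
               "loop (intended for testing only)."));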

SjoerdMeijer added inline comments.Nov 17 2020, 7:01 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
185

Think we need a test-case for it too, perhaps one that is not the default value 16.

5727

I was hoping, and seem to remember, that there was a LoopUtils helper for this. I had a quick look; perhaps findDefsUsedOutsideOfLoop could be useful, or there was another function here in the vectoriser that was doing the same/similar things for reductions. Perhaps have a look if there's something we can reuse.

5798

I am wondering which option should win: e.g. -Os or EpilogueVectorizationForceVF? I am guessing that is the forced epilogue vectorisation? Should the opt size check be moved?

5800

nit: don't think we need the curly brackets here in LLVM_DEBUG.

5805

same here

5821

nit: != ?

5822

nit: curly bracket

9210

false is for PreserveLCSSA, but do we want that to be true, and does that make formLCSSARecursively redundant? Not sure if I am surprised that this is necessary at all....

nit: I think it should be at least false /* PreserveLCSSA */

bmahjour updated this revision to Diff 307441.Nov 24 2020, 1:44 PM
bmahjour marked 6 inline comments as done.

Rebased, addressed comments, and added more tests (including one for the limited case of live-out supported).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
185

test added

5727

I was trying to match the conditions in fixupIVUsers() and added early exits to make sure we don't waste compile-time collecting information that will be thrown away. Maybe we can add a flag to findDefsUsedOutsideOfLoop to stop collecting instructions as soon as one is found, but then we'll have a new problem in that the checks in this code would be different from what's in fixupIVUsers().

5798

I can see arguments for it either way, but after thinking more about it, I believe it makes it easier to reason about the behaviour of the loop vectorizer when the forced VF wins over minsize. I'll change it.
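A small standalone sketch of the agreed ordering (all names and the fallback heuristic below are illustrative placeholders, not the patch's actual interfaces):

  #include <optional>

  std::optional<unsigned> pickEpilogueVF(unsigned MainLoopVF, bool HasOptSize,
                                         unsigned ForcedEpilogueVF) {
    // An explicitly forced epilogue VF wins over the size-optimization check.
    if (ForcedEpilogueVF > 1)
      return ForcedEpilogueVF;
    // Otherwise optsize/minsize disables epilogue vectorization.
    if (HasOptSize)
      return std::nullopt;
    // Finally the interim heuristic applies (placeholder: require a main
    // loop VF of at least 16 and use half of it for the epilogue).
    if (MainLoopVF >= 16)
      return MainLoopVF / 2;
    return std::nullopt;
  }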

5800

I have developed a habit of adding them because I find it makes the enclosed code more friendly to clang-format. I'll remove the brackets here since they don't make much of a difference to the formatting in this case.

5821

Currently VectorizationFactor only overrides operator==, not operator!=. I'll provide an override and change this to !=.
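For clarity, a minimal standalone sketch of that addition (member names are illustrative):

  struct VectorizationFactor {
    unsigned Width; // vectorization factor
    unsigned Cost;  // estimated cost
    bool operator==(const VectorizationFactor &RHS) const {
      return Width == RHS.Width && Cost == RHS.Cost;
    }
    // The override discussed above: define != in terms of the existing ==.
    bool operator!=(const VectorizationFactor &RHS) const {
      return !(*this == RHS);
    }
  };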

9210

I think we could get away without calling formLCSSARecursively for now and set PreserveLCSSA to true for the limited cases of live-out values supported (see newly added Transforms/LoopVectorize/PowerPC/optimal-epilog-vectorization.liveout.ll). However in the future if we add support for more live-outs (eg. reductions and first-order recs), I worry that we may be in a state at this point in the code where the original loop is temporarily non-LCSSA which would break simplifyLoop's assumptions if we set PreserveLCSSA to true. Note that when PreserveLCSSA is true, simplifyLoop assumes that the loop is already in LCSSA form.
Maybe I worry too much about it, but I think the current way is a bit more future-proof.
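In code, the comment suggestion and the reasoning above amount to roughly the following (a sketch; the surrounding argument lists are assumptions and may differ between LLVM versions):

  // Keep PreserveLCSSA=false here and restore LCSSA explicitly afterwards,
  // since future live-out support could leave the original loop temporarily
  // non-LCSSA at this point.
  simplifyLoop(OrigLoop, DT, LI, SE, AC, /*MSSAU=*/nullptr,
               /*PreserveLCSSA=*/false);
  formLCSSARecursively(*OrigLoop, *DT, LI, SE);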

fhahn added inline comments.Nov 25 2020, 1:34 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1030

Currently only the skeleton code is specialized, but future enhancements will likely necessitate more specialization (eg to support widened induction vars and live-out phis).

I thought a bit on how to support those cases. IIUC the problem here is mostly setting up the right incoming values for the PHIs in the epilogue loop. Given that this is related to codegen, I *think* we should be able to represent all the required info in the VPlan, i.e. we should be able to adjust the VPlan of the epilogue vector loop to use the correct start values naturally.

We could create a LoopSkeletonCreator, and modify the LoopVectorizationPlanner interfaces to work with it, but I think extending from the InnerLoopVectorizer fits better with the current design. Note that InnerLoopUnroller also takes a similar approach and extends InnerLoopVectorizer while only overriding a couple of functions.

My concern here is that this approach invites more modifications in the sub-classes, instead of modeling the required information directly in VPlan. I think this will lead to more work down the road, as we move more pieces into VPlan. I don't think InnerLoopUnroller is an ideal example here, because IIRC it was created before VPlan. I think working on modeling the incoming values in VPlan would be better aligned with the long term goals and lead to a more modular solution overall.

To illustrate this approach, I went ahead and tried to sketch support for widened induction using VPlan: D92132. It is a bit rough around the edges, but should give an idea of the required pieces. I'm working on a similar patch for reductions as well and we should be able to handle first-order recurrences in a similar fashion.

SjoerdMeijer accepted this revision.Nov 26 2020, 12:54 AM

Thanks for working on this, I am happy with this patch.

The way I understand it is that we want to make this VPlan-future-proof, not make any VPlan work more difficult than necessary. Looking at D92132, that doesn't seem to be the case to me. I.e., even though that patch is work-in-progress, if it is representative of what is needed to achieve that, we don't have anything to worry about. So, I would be happy if this patch lands, provided we also progress D92132 (and friends). But @fhahn can correct me if I am wrong here.

This revision is now accepted and ready to land.Nov 26 2020, 12:54 AM
fhahn added a comment.Nov 26 2020, 1:37 AM

Thanks for working on this, I am happy with this patch.

The way I understand it is that we want to make this VPlan-future-proof, not make any VPlan work more difficult than necessary. Looking at D92132, that doesn't seem to be the case to me. I.e., even though that patch is work-in-progress, if it is representative of what is needed to achieve that, we don't have anything to worry about. So, I would be happy if this patch lands, provided we also progress D92132 (and friends). But @fhahn can correct me if I am wrong here.

The point I tried to highlight with the patch is that I think we can solve the limitations without subclassing/customising code other than the skeleton creation. In that case, I think it would be preferable to go with the more targeted approach discussed (only providing a custom ‘skeleton’ creator). I outlined the potential drawbacks I see inline, but the main one is that it encourages adding more codegen specialisations that we have to untangle again later.

As mentioned earlier, I don’t think it is worth holding things up due to VPlan, but I think if we have a clear and short path towards addressing the limitations using VPlan, we should do that and avoid broad subclassing.

I might be missing some other cases where full subclassing might be needed. I think that’s something worth further discussing before committing.

On an unrelated note, I think it would be good to have some target-independent tests for this or for some additional targets, so this gets wide coverage on the public bots (some of which might not enable the PPC target)

Thank you for your reviews @SjoerdMeijer @fhahn .

I totally see the benefit of D92129 as you illustrated in D92132 and I think it can be used to replace things like fixupIVUsers() which would also make it easier to extend epilogue vectorization. As for broader subclassing, it shouldn't prevent us from extending it in a way that uses VPlan. If specialization at the subclass level is not necessary (ie there is code reuse opportunity) then we should still implement it in the superclasses and inherit it in the subclasses. If specialization is not applicable to any other part of the transform then it will happen at the subclass level where it belongs, whether using VPlan or not.

Having said that there is a timing aspect. I see your point that if we tried to extend epilogue vectorization now via specialization of parts of the codegen, that don't use VPlan currently, then we'd have to change more places in the future. From that perspective a more targeted specialization would help by making it harder to extend epilogue vectorization without having improved VPlan. On the other hand, some improvements to the epilogue vectorization would become contingent on improving VPlan (which is not an issue for me, but others may disagree).

Having a skeleton creator class would require some refactoring of InnerLoopVectorizer and LoopVectorizationPlanner, so it would warrant having a separate patch for it. I can do it after this one lands.

BTW, I found a trick to make some of the tests target independent and expand the coverage. Please see llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization*.

bmahjour updated this revision to Diff 307915.Nov 26 2020, 12:46 PM
bmahjour updated this revision to Diff 307918.Nov 26 2020, 12:58 PM
bmahjour marked 3 inline comments as done.

Forgot to remove target triple and attributes from target independent tests. They're fixed now.

dmgreen added inline comments.Nov 27 2020, 4:27 AM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

I was surprised to see an MVE test like this choose to try to epilogue vectorize. I had presumed that would not happen on MVE - we only have a single vector width with no interleaving - and the benefit of doing a single <8 x i8> iteration after a <16 x i8> main loop is not going to be worth the additional branching/setup we have to do, unfortunately. I ran some extra tests and added a mve-qabs.ll test, where again the <16 x i8> loop is getting a remainder where it isn't beneficial.

I don't believe that MVE is a vector target that would ever benefit from epilogue vectorization, unfortunately. Can we get some sort of target hook that allows us to disable it? Perhaps something that sets a maximum epilogue vectorization factor given a VF * UF main loop? That would allow us to set it to none, whilst others tune it for their needs, like possibly always having the fallback be a 64-bit vector under AArch64 (just a thought, not sure if that's the best idea or not, but it at least allows targets to tune things).

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization.ll
11

-> triple

bmahjour added inline comments.Nov 27 2020, 7:37 AM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

I ran some extra tests and added a mve-qabs.ll test, where again the <16 x i8> loop is getting a remainder where it isn't beneficial.

Is it degrading performance or just not beneficial (harmless)? As I mentioned before, the heuristic in this patch is not very good, but putting the cost modeling in the critical path for getting the codegen implemented is also not desirable. I had suggested disabling this transformation by default until a proper cost model is implemented, but some people disagreed.

In order to come up with a meaningful target hook, it would be helpful to know what machine characteristics of MVE cause epilogue vectorization to not be beneficial. Are there existing TTI hooks that we can use (e.g. getMaxInterleaveFactor() > 1)?

dmgreen added inline comments.Nov 27 2020, 9:09 AM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

Is it degrading performance or just not beneficial (harmless)?

Degrading performance unfortunately. It doesn't happen in a lot of tests, but it was between a 0% and 25% decrease, apparently. The M in MVE stands for microcontroller (umm, I think), so it can be somewhat constrained and can be especially hurt by inefficient codegen, that would not be as bad on other cores/vector architectures.

The max interleaving factor will be 1 for any MVE target. They also only have 128-bit vectors, with no 64-bit wide vectors as would be present in NEON. Essentially that means that for i8 we would either vectorize x16 (which is excellent), or x8 using extends if we can't do x16. The x8 would be beneficial on its own with enough iterations I think, but doing a single iteration at x8 does not overcome the additional cost from outside the vector loops.

Using getMaxInterleaveFactor to limit this for the moment would work for me. I have no strong opinions on enabling this by default or not, but you may want the very initial commit to default to false with a commit soon after to enable it. That way if someone does revert this, at least they are only reverting the flipping of the switch and not the whole patch.

bmahjour updated this revision to Diff 308114.Nov 27 2020, 2:18 PM
bmahjour marked an inline comment as done.
bmahjour added inline comments.Nov 27 2020, 2:19 PM
llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
292 ↗(On Diff #307918)

Using getMaxInterleaveFactor to limit this for the moment would work for me.

Ok, I'll use that then.
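A hedged sketch of what that gating can look like (the helper name below is illustrative, not necessarily the hook used in the final patch):

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  // Illustrative only: targets that cannot interleave (max interleave
  // factor of 1, e.g. MVE) opt out of epilogue vectorization.
  static bool preferEpilogueVectorization(const TargetTransformInfo &TTI,
                                          unsigned MainLoopVF) {
    return TTI.getMaxInterleaveFactor(MainLoopVF) > 1;
  }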

I have no strong opinions on enabling this by default or not, but you may want the very initial commit to default to false with a commit soon after to enable it. That way if someone does revert this, at least they are only reverting the flipping of the switch and not the whole patch.

Great suggestion! I'll do it that way.

This revision was landed with ongoing or failed builds.Dec 1 2020, 9:05 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Dec 1 2020, 9:11 AM

Thank you for your reviews @SjoerdMeijer @fhahn .

I totally see the benefit of D92129 as you illustrated in D92132 and I think it can be used to replace things like fixupIVUsers() which would also make it easier to extend epilogue vectorization. As for broader subclassing, it shouldn't prevent us from extending it in a way that uses VPlan. If specialization at the subclass level is not necessary (ie there is code reuse opportunity) then we should still implement it in the superclasses and inherit it in the subclasses. If specialization is not applicable to any other part of the transform then it will happen at the subclass level where it belongs, whether using VPlan or not.

Having said that there is a timing aspect. I see your point that if we tried to extend epilogue vectorization now via specialization of parts of the codegen, that don't use VPlan currently, then we'd have to change more places in the future. From that perspective a more targeted specialization would help by making it harder to extend epilogue vectorization without having improved VPlan. On the other hand, some improvements to the epilogue vectorization would become contingent on improving VPlan (which is not an issue for me, but others may disagree).

Having a skeleton creator class would require some refactoring of InnerLoopVectorizer and LoopVectorizationPlanner, so it would warrant having a separate patch for it. I can do it after this one lands.

OK great. It sounds like there's agreement on the further direction overall and we can work on that in-tree. I'll work on getting the pieces ready to handle live-ins/live-outs in VPlan as required. As for the skeleton creator, I might be able to take a look over the next week.

LGTM, thanks (meant to respond a bit earlier today :)

MaskRay added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
159

This variable is mutable.

Please use constexpr char VerboseDebug[] or const char VerboseDebug[]. Namespace-scoped const variables are already internal, no need for static

Please also use #ifndef NDEBUG to avoid -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off builds
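Concretely, the suggestion amounts to something like this (shown as an illustrative sketch; DEBUG_TYPE is the pass's debug name defined in this file):

  #ifndef NDEBUG
  // Internal by default at namespace scope; guarded so builds without
  // assertions do not emit -Wunused-variable.
  const char VerboseDebug[] = DEBUG_TYPE "-verbose";
  #endif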

OK great. It sounds like there's agreement on the further direction overall and we can work on that in-tree. I'll work on getting the pieces ready to handle live-ins/live-outs in VPlan as required. As for the skeleton creator, I might be able to take a look over the next week.

LGTM, thanks (meant to respond a bit earlier today :)

Sounds good. If I get a chance to work on the skeleton builder I'll let you know beforehand to make sure we don't work on it at the same time. Otherwise I'd be happy to help with the review. Thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
159

Right, made the changes in the latest commit. Thanks!

MaskRay added inline comments.Dec 2 2020, 11:15 AM
llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-limitations.ll
1

debug-only tests are rare. If you really want them, please add REQUIRES: asserts (which has kindly been fixed by RKSimon)

nikic added a subscriber: nikic.Dec 7 2020, 1:05 PM

Should this really be enabled by default given the current (lack of) cost modelling? Primarily concerned about the code size regression this causes.

Should this really be enabled by default given the current (lack of) cost modelling? Primarily concerned about the code size regression this causes.

Please note that functions with optsize and minsize attributes are not affected. I'm ok with turning it back off (that was my preference as well), but curious about what has regressed (since that information can help with cost-modeling). Any comments @mivnay @xbolva00 ?

mivnay added a comment.Dec 8 2020, 2:12 AM

Should this really be enabled by default given the current (lack of) cost modelling? Primarily concerned about the code size regression this causes.

Please note that functions with optsize and minsize attributes are not affected. I'm ok with turning it back off (that was my preference as well), but curious about what has regressed (since that information can help with cost-modeling). Any comments @mivnay @xbolva00 ?

@bmahjour Thanks for the patch. I will run the experiments and get back to you...