This is an archive of the discontinued LLVM Phabricator instance.

[LV][VPlan] Detect outer loops for explicit vectorization.
ClosedPublic

Authored by dcaballe on Jan 23 2018, 3:09 PM.

Details

Summary

This is the patch #2 from the Patch Series #1 to introduce outer loop vectorization support in LV using the VPlan infrastructure.
RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html
Patch #1: D40874

This patch introduces the basic infrastructure to detect, legality check and process outer loops annotated with hints for explicit vectorization:

  1. Outer loop detection: only outer loops annotated with explicit vectorization hints, including the vector length, are collected for outer loop vectorization. This includes outer loops annotated with #pragma omp simd simdlen(#) or #pragma clang vectorize(enable) vectorize_width(#)*.
  1. Outer loop legality check: only a restricted subset of simple outer loops are considered legal at this point. This subset includes outer loops that only contain uniform inner loops and uniform non-backedge branches. The uniformity property is also highly conservative (loop invariance) and will be relaxed in the future to support more complex cases.
  1. Outer loop processing: legal outer loops are processed in a new vectorization path that will build the VPlan infrastructure upfront. We denote it as VPlan-native vectorization path. This new path is integrated in LV but it's independent of the inner loop vectorization path. We followed this approach to prevent the instability of the current inner loop vectorizer while reusing code and minimize divergence from the existing infrastructure. In the VPlan-native path, legal outer loops are fed into the LoopVectorizationPlanner which only prints a debug message for now. Actual vectorization will be introduced in the subsequent patches of this series.

It's important to remark that all these changes are protected under the feature flag -enable-vplan-native-path. This should make this patch NFC for the existing inner loop vectorizer.

(*) Pragma 'clang vectorize' and pragma 'omp simd' are currently implemented with the same metadata (llvm.loop.vectorize) even though the former has auto-vectorization semantics and the latter has explicit vectorization semantics. We temporarily abuse pragma 'clang vectorize' on outer loops to denote explicit vectorization due to the shared implementation of both pragmas. This will be fixed when the native representation for pragma 'omp simd' is introduce in LLVM (WIP).

Diff Detail

Repository
rL LLVM

Event Timeline

dcaballe created this revision.Jan 23 2018, 3:09 PM
rengolin added inline comments.Mar 7 2018, 5:44 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2319 ↗(On Diff #131151)

Sorry, this is my fault, as both were done separately. We discussed adding one more metadata info to mean {"forced"/"hint"} but ended up never doing it. It should be simple to fix this, I think, just need to make sure we change all the tests correctly to what they're supposed to mean.

2363 ↗(On Diff #131151)

What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no?

If I understand correctly, containsIrreducibleCFG is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front.

How complex would it be to create the inherited attribute?

5138 ↗(On Diff #131151)

nit. I'd remove this space, as both blocks are regarding (3)

7642 ↗(On Diff #131151)

I wouldn't make this an assert, just a debug message and return

7647 ↗(On Diff #131151)

left over?

7650 ↗(On Diff #131151)

is this really the place to do predication and cost modelling?

8227 ↗(On Diff #131151)

Isn't containsIrreducibleCFG depending on this to work?

8228 ↗(On Diff #131151)

Better not to have TODOs as debug messages.

8642 ↗(On Diff #131151)

I'm really uncomfortable with all these temporary code blocks that don't do anything...

They're really just hijacking the existing infrastructure without implementing as a VPlan.

I really thought the whole point of VPlans was that you wouldn't need to hack-it-up like we used to in the old vectoriser...

dcaballe marked 3 inline comments as done.Mar 8 2018, 12:24 PM

Thanks for the comments, Renato!
Please, have a look at my inline comments and let me know what you think.

Thanks!
Diego

lib/Transforms/Vectorize/LoopVectorize.cpp
2319 ↗(On Diff #131151)

No problem. As I mention, there are going to be changes regarding the representation of #pragma omp simd. I think that's the right time to address the problem.

2363 ↗(On Diff #131151)

What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no?

Sorry, I'm not sure I understand the question. If you mean the complexity of collecting nested loops given an outer loop, it would be linear on the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.

If I understand correctly, containsIrreducibleCFG is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front. How complex would it be to create the inherited attribute?

Adding the attribute wouldn't be complicated. We would have a recursive call that processes loops from outer to inner. Once we know that the CFG of the outer loop is reducible, we can pass a flag through the recursive call to mark that nested loops are reducible and skip containsIrreducibleCFG for them. Does this make sense to you?

I had it implemented but there is a subtle detail that would make this patch non-NFC. If you look at line 8901, collected loops are processed in reverse order. This basically means that if we have:

#pragma clang loop vectorize
for i
  for j
  
  for k
}

the current code would process loops k and j first. If one of them is vectorized, we couldn't vectorize 'i', the one marked for vectorization. I see two potential solutions:

  1. Reverse the order in which loops are processed in line 8901. This is non-NFC and some existing LIT tests would have to be updated accordingly.
  2. Collect loops at different loop nest levels in post-order and loops at the same level in pre-order. This would be the collection order for the previous example: j, k, i. This would be NFC for inner loops but I find it particularly weird.

IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch.
What do you think?

7642 ↗(On Diff #131151)

I think this should be an assert because an outer loop shouldn't reach this point if the VPlan-native patch is disabled. However, I'm going to add the debug message in the caller code. Does it sound good?

7650 ↗(On Diff #131151)

This function returns the VF to be used during code generation so we would need to evaluate the cost here to return the selected VF. Cost modeling shouldn't be part of the VPlan building process. The same approach is followed in the code below. Regarding predication, it must happen before the cost evaluation. We can discuss if it should belong here or not when we introduce the actual code. If the comment is confusing, I can remove it. We decided to introduce the TODOs to give a better picture of the subsequent patches but if this is not helpful or annoying I can just get rid of all of them. Please, let me know what you think.

8227 ↗(On Diff #131151)

They are different things. containsIrreducibleCFG is used at the very beginning of the pass to collect potential loop candidates (without irreducible CFG) to be vectorized. VPlanHCFGBuilder will build a CFG out of the input IR, using the VPlan infrastructure (VPBlockBases). This VPlan CFG will be modified during the vectorization process without actually modifying the CFG of the input IR. Changes in the VPlan CFG will be materialized once the best profitable VPlan is chosen. Is it clearer now?

The VPlanHCFGBuilder is going to be introduced in the next patch.

8642 ↗(On Diff #131151)

This is the entrance to the VPlan-native vectorization path. It's not doing anything yet because we are trying to follow an incremental approach by releasing relatively small patches that are easy to digest. This code will be functional (generating vector code) soon.

The code block is temporary as long as both vectorization paths co-exist but the final goal is to converge into a single one. This approach will allow us to incrementally and easily extend all the current inner loop vectorization functionality to support outer loops and, most importantly, doing so without destabilizing inner loop vectorization. We are really concerned about the latter and we think that this approach is a reasonable trade-off between safety and temporary code blocks.

If you want to discuss this further, I would recommend to move the discussion to the RFC thread so that everybody is aware of it: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html

hsaito added a subscriber: hsaito.Mar 8 2018, 1:03 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
8642 ↗(On Diff #131151)

I'm working on the "converge into a singe one" side. At this point, I'm taking care of the ground work of moving the right things to the right places such that I don't have to include those "almost NFC" things as part of "expand VPlan's participation into innermost loop vectorization". Thank you for helping me do that with your reviews. We need to be able to build VPlan for the innermost loop vectorization right after Legal, for example, before we can remove the diverged code path at the beginning. In the meantime, the outer loop vectorization patch series will help people realize how much common things are there between innermost loop vectorization and outer loop vectorization, and more importantly, help people think how to write code that can work in both ways.
That's as much as I want to write about the approach we are taking, within this patch review. The rest of the discussions should happen on the above mentioned RFC. Thanks.

fhahn added a comment.Mar 8 2018, 2:05 PM

Outer loop detection: only outer loops annotated with explicit vectorization hints, including the vector length, are collected for outer loop vectorization. This includes outer loops annotated with #pragma omp simd simdlen(#) or #pragma clang vectorize(enable) vectorize_width(#)*.

If I understand correctly, this limitation is due to the fact that VPlan based cost-modelling is not implemented yet, right? I think for testing, it would useful to have an option to process all outer loops. The legality checks should filter out any unsupported loops and this way we could test the VPlan native code path on a much wider range of loops. I think it also would be great if we would have a bot that runs at least the test-suite with VPlan native to discover regressions.

lib/Transforms/Vectorize/LoopVectorize.cpp
29 ↗(On Diff #131151)

Is it worth mentioning docs/Proposal/VectorizationPlan.rst as well?

rengolin added inline comments.Mar 9 2018, 1:43 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
263 ↗(On Diff #131151)

Right now, this is just enabling outer-loop. Are you planning on adding more functionality to the native part of VPlan before merging the inner loop vectoriser into it? I wouldn't recommend, as we really don't want two paths in parallel for too long.

I'd recommend this to just be called "vectorize-outerloop" or something.

2319 ↗(On Diff #131151)

Yup.

2363 ↗(On Diff #131151)

it would be linear on the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.

That's what I wanted to know, thanks. :)

IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch.

Agreed.

7650 ↗(On Diff #131151)

I don't like the idea of outer-loops being in a special branch of the code, but I understand the current prototype nature of it.

I believe it's still not the time to define what goes where that hasn't been implemented yet, so better to remove the TODOs for now, in case they lead us astray in the future. Same for the debug messages, etc.

What you should do is shortly explain why outer-loop needs "special handling", and that can be a one/two line comment in the beginning of the block.

8227 ↗(On Diff #131151)

Right, as above, don't leave commented out code hanging. Feel free to add a two-line comment in the begining of the block explaining the expectation.

8642 ↗(On Diff #131151)

Ok, as above, just remove the comments and add a two-line comment summarising it.

dcaballe marked 4 inline comments as done.Mar 9 2018, 5:27 PM

Thanks you, Renato and Florian, for your comments.

this limitation is due to the fact that VPlan based cost-modelling is not implemented yet, right?

Not only. For full outer loop auto-vectorization we'd also need to extend Legal to check for data dependences that prevent the vectorization of outer loops. In loop nests with several outer loops, we'd also need to compare the cost of vectorizing each of them.

I think for testing, it would useful to have an option to process all outer loops. The legality checks should filter out any unsupported loops and this way we could test the VPlan native code path on a much wider range of loops. I think it also would be great if we would have a bot that runs at least the test-suite with VPlan native to discover regressions.

Is -vplan-build-stress-test flag in Patch #3 (D44338) aligned with what you had in mind? :)
I would need some help/guidance with the bot part since I'm not familiar with that.

lib/Transforms/Vectorize/LoopVectorize.cpp
29 ↗(On Diff #131151)

Definitely. Thanks!

263 ↗(On Diff #131151)

Inner loop vectorization is a subset of outer loop vectorization so the VPlan native path will be inherently supporting inner loops. However, it's not our intention to enable it "for production" while both paths co-exist. However, as we described in the RFC, inner loop vectorization support in the VPlan native path is indispensable for the convergence of both paths. As we start migrating and extending all the existing functionality for inner loops to outer loops in the VPlan native path, we will need to compare side-by-side where both paths stand regarding inner loop vectorization. When both paths are comparable in that regard, the migration will be completed.

Inner loop support will also be very useful for (stress) testing the VPlan native path, since some loops don't have another loop around.

For these reasons we are not using the 'outerloop' word in the flags/interfaces.
Does it make sense to you?

7650 ↗(On Diff #131151)

I believe it's still not the time to define what goes where that hasn't been implemented yet, so better to remove the TODOs for now, in case they lead us astray in the future.

at you should do is shortly explain why outer-loop needs "special handling", and that can be a one/two line comment in the beginning of the block.

Agreed. Thanks!

dcaballe updated this revision to Diff 137881.Mar 9 2018, 5:30 PM
dcaballe marked 2 inline comments as done.

Addressing previous comments.

fhahn added a comment.Mar 19 2018, 3:13 AM

Thanks you, Renato and Florian, for your comments.

this limitation is due to the fact that VPlan based cost-modelling is not implemented yet, right?

Not only. For full outer loop auto-vectorization we'd also need to extend Legal to check for data dependences that prevent the vectorization of outer loops. In loop nests with several outer loops, we'd also need to compare the cost of vectorizing each of them.

Ah yes, that's missing for now, thanks for clearing that up.

I think for testing, it would useful to have an option to process all outer loops. The legality checks should filter out any unsupported loops and this way we could test the VPlan native code path on a much wider range of loops. I think it also would be great if we would have a bot that runs at least the test-suite with VPlan native to discover regressions.

Is -vplan-build-stress-test flag in Patch #3 (D44338) aligned with what you had in mind? :)

Yep, that's along the lines I had in mind. So far the checks are quite limited, but I think it is a good starting point :)

fhahn added inline comments.Mar 28 2018, 9:14 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
1661 ↗(On Diff #137881)

Maybe add a newline to separate the 2 functions. Not sure if calling it out as helper function is necessary. In a way, most functions here are helper functions :)

5119 ↗(On Diff #137881)

Work done here is potentially done multiple times for each loop, right? E.g. for deep loop nests, this will be called multiple times for the same Lp, but with different outer loops.

Only a few checks here depend on the outer loop and I think ideally we would not check the same things again and again. For now those redundant checks are quite simple, but I think we should keep that issue in mind once we introduce more complex checks.

5128 ↗(On Diff #137881)

I think the use of getCanonicalInductionVariable is discouraged. I think it would be better to detect induction variables using SCEV, as done LoopVectorizeLegality.

8642 ↗(On Diff #131151)

I am also slightly worried that people will come along and see this code and think that cost modelling and planning already works for outer loops, as it is used in the VPlan native path. But I think the comment makes it clear now.

I am not sure if it would be clearer/nicer to have clearer separation by having the code in separate functions rather than adding even more code to those already huge functions.

test/Transforms/LoopVectorize/explicit_outer_detection.ll
222 ↗(On Diff #137881)

attributes not needed here and in the tests below, as no cost modelling is done so far.

Is -vplan-build-stress-test flag in Patch #3 (D44338) aligned with what you had in mind? :)

Yes, that's exactly what I had in mind. :)

I have no further questions, I'll let @fhahn finish this review.

Thanks!

lib/Transforms/Vectorize/LoopVectorize.cpp
263 ↗(On Diff #131151)

It does, thanks for the explanation!

Thanks for your comments, Florian and Renato!
More comments inline.

Diego.

lib/Transforms/Vectorize/LoopVectorize.cpp
1661 ↗(On Diff #137881)

Sounds good! Thanks!

5119 ↗(On Diff #137881)

Good point. OuterLp will be fixed, at least in the short term while we only support explicit vectorization. Given that we are introducing support for divergent inner loops in the patch series #4, it's more likely that we don't need this function (or at least this function as is) before we introduce the engine to evaluate different outer loops.

In any case, the proper inner loop uniformity check will depend on the outer loop we are vectorizing. Some of these "extra" checks are very specific for the patch series #1, where the supported loops are very limited. They will be progressively removed, leaving only the OuterLp dependent checks.

For these reasons, IMO, it makes sense to keep all these checks and the documentation together. I think it's easier to understand which inner loops are currently supported. I could add a comment explaining you concerns. However, I could try to split them if you think this is not enough.

Please, let me know what you think.

5128 ↗(On Diff #137881)

Could you please elaborate a bit more? Why is it discouraged? I can't find any comments in the source code.

We are trying to introduce some restrictive but simple checks. If the answer is that this interface is discouraged because it may not detect some IVs that that are canonical, that would be perfectly fine. I'm also looking at the LoopVectorizationLegality::addInductionPhi. Isn't this function doing something similar to getCanonicalInductionVariable to detect the primary induction but using InductionDescriptor?

8642 ↗(On Diff #131151)

I am not sure if it would be clearer/nicer to have clearer separation by having the code in separate functions rather than adding even more code to those already huge functions.

Agreed, I could move these code to a separate function.

rather than adding even more code to those already huge functions.

Are you talking about only this function or also some other ones?

fhahn added inline comments.Apr 9 2018, 8:17 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5119 ↗(On Diff #137881)

Yes, I think for now it is fine, but we should definitely keep that in mind, for future checks.

Another related question came to mind: How are we going to deal with nested loops where different nests can be vectorized, e.g. say we have nested loops with 3 levels and both outer-most loops can be vectorized? If we decide to vectorize the outermost loop, wouldn't we have to skip handling the other outer loop? Or have links between the VPlans to decide which level is best to vectorize?

5128 ↗(On Diff #137881)

Yes, for the very simple checks it works, but as you said we potentially miss IVs that we could support. I suppose we could use getCanonicalInductionVariable if it keeps things simple for now, but we definitely should relax that soonish.

8642 ↗(On Diff #131151)

Mostly this function and LoopVectorizationPlanner::plan. Otherwise it is already nicely separated.

dcaballe updated this revision to Diff 142382.Apr 13 2018, 5:20 AM
dcaballe marked 3 inline comments as done.

Addressing Florian's comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
5119 ↗(On Diff #137881)

Yes, I think for now it is fine, but we should definitely keep that in mind, for future checks.

Ok, I added a comment explaining the situation. I also tried to find a better place for these checks, at least to indicate it in the comment, but I couldn't find a good place at this point. We don't have the infrastructure to evaluate multiple outer loops of the same loop nest. Hopefully, we comment is clarifying enough.

If we decide to vectorize the outermost loop, wouldn't we have to skip handling the other outer loop?

Yes, if the outermost loop is finally vectorized, the outer and the inner should be marked also as vectorized (I introduced a TODO suggesting this in an earlier version of this patch, but we decided to remove it to avoid confusion and keep the code cleaner). Or we could follow any other approach that leads to the same behavior: skipping them.

Or have links between the VPlans to decide which level is best to vectorize?

The idea is to have an initial H-CFG modeling the input IR of the whole loop nest (starting form the outermost vectorizable loop) and use it as a starting point to evaluate all the candidate loops for vectorization. Does this answer your questions?

5128 ↗(On Diff #137881)

Then it's perfectly fine. Let's do it incrementally. It's the whole purpose of the approach. We may even want to support non-canonical IVs sooner than later and coming up with something more complicated for the current check might not be worth it.

8642 ↗(On Diff #131151)

Ok, thanks!. I doubted if the new 'processLoopInVPlanNativePath' should be a member function of LoopVectorizerPass (same as 'processLoop') and avoid passing most of the parameters. I decided not to do it just not to modify the public header file. Please, let me know what you think.

fhahn accepted this revision.Apr 18 2018, 10:37 AM

Thanks Diego and thanks for your patience! LGTM, but please wait a bit with committing, in case other people people want to raise any additional comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
8629 ↗(On Diff #142382)

I suppose we should never call processLoopInVPlanNativePath without the flag? Could this be an assertion?

This revision is now accepted and ready to land.Apr 18 2018, 10:37 AM
javed.absar added inline comments.Apr 18 2018, 11:25 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
4947 ↗(On Diff #142382)

Can we not simply check for isLoopSimplifyForm() ?

Thanks Diego and thanks for your patience! LGTM, but please wait a bit with committing, in case other people people want to raise any additional comments.

Thanks, Florian! I'll wait until Monday.

lib/Transforms/Vectorize/LoopVectorize.cpp
4947 ↗(On Diff #142382)

Thanks for the comment, Javed!

I guess that getNumBackEdges is more efficient? isLoopSimplifyForm is checking getLoopPreheader() && getLoopLatch() && hasDedicatedExits() and, having a quick look, only the last one is more expensive than the getNumBackEdges.

In any case, please, note that this call was already there. I'm not adding it as part of this patch.

8629 ↗(On Diff #142382)

Ok, thanks!

dcaballe updated this revision to Diff 143148.Apr 19 2018, 12:12 PM

Adding assert and rebasing diff to ToT

dcaballe added inline comments.Apr 19 2018, 12:20 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
8627 ↗(On Diff #143148)

@hsaito, there was a conflict with D45072 and I had to construct IAI here and carry over DT just for it. I wonder if it would be better to make IAI optional in CM to avoid this. It would make the code reuse easier.

hsaito added inline comments.Apr 19 2018, 12:37 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
8627 ↗(On Diff #143148)

Feel free to change IAI into a pointer that can be nullptr. After all, InterleavedAccess is an optimization step that we should be able to skip. D45072 didn't get into that, but I was planning to do that while I clean up CostModel. I kept it as a reference simply because it was a reference in Legal. If you want me to do it, I can do it quick.

hsaito added inline comments.Apr 19 2018, 4:31 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
8627 ↗(On Diff #143148)

The "Optimize" phase of the vectorizer most likely need DT anyway in a long run and thus having to carry over DT by itself is not a really bad thing. Part of the reason is your design choice of not making procesLoopInVPlanNativePath a member function of LoopVectorizePass class. For the time being, I think this part of the code can go in as is, i.e., without making IAI optional. I'll be touching CM code soon enough anyway.

dcaballe added inline comments.Apr 20 2018, 1:01 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
8627 ↗(On Diff #143148)

If you want me to do it, I can do it quick.

I think this part of the code can go in as is, i.e., without making IAI optional. I'll be touching CM code soon enough anyway.

I'm ok with either of both. If you want to quickly fix it, I can wait for that patch. I won't be committing it until next week. Otherwise, you can remove this line later when you address it.

Part of the reason is your design choice of not making procesLoopInVPlanNativePath a member function of LoopVectorizePass class.

I just tried to keep these changes away from LoopVectorize.h but the bunch of parameter is certainly inconvenient. If you think it's better, I can just make it member. Please, let me know what you think.

hsaito added inline comments.Apr 20 2018, 10:28 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
8627 ↗(On Diff #143148)

I think it's best to get this checked in and then address CostModel stuff as it gets restructured for VPlan and longer term future. I think it's okay to keep it as a static function with a lot of parameters until it is ready to take over the position of processLoop(). When most of the functionality is in place, this function would need all the analysis just like processLoop() does. So, passing DT will inevitably happen even if we don't do it now.

Thanks, Hideki.
Ok, I'll go ahead and commit it.

Diego.

rengolin accepted this revision.Apr 24 2018, 1:50 AM

Thank you! LGTM!

This revision was automatically updated to reflect the committed changes.