This is an archive of the discontinued LLVM Phabricator instance.

Differential D42447

[LV][VPlan] Detect outer loops for explicit vectorization.
ClosedPublic

Authored by dcaballe on Jan 23 2018, 3:09 PM.

Details

Reviewers

hfinkel
mkuper
rengolin
fhahn
aemerson
mssimpso

Commits

rG60f2776b2f68: [LV][VPlan] Detect outer loops for explicit vectorization.
rL330739: [LV][VPlan] Detect outer loops for explicit vectorization.

Summary

This is patch #2 of Patch Series #1, which introduces outer loop vectorization support in LV using the VPlan infrastructure.
RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html
Patch #1: D40874

This patch introduces the basic infrastructure to detect, legality check and process outer loops annotated with hints for explicit vectorization:

Outer loop detection: only outer loops annotated with explicit vectorization hints, including the vector length, are collected for outer loop vectorization. This includes outer loops annotated with #pragma omp simd simdlen(#) or #pragma clang vectorize(enable) vectorize_width(#)*.

Outer loop legality check: only a restricted subset of simple outer loops are considered legal at this point. This subset includes outer loops that only contain uniform inner loops and uniform non-backedge branches. The uniformity property is also highly conservative (loop invariance) and will be relaxed in the future to support more complex cases.

Outer loop processing: legal outer loops are processed in a new vectorization path that builds the VPlan infrastructure upfront. We call it the VPlan-native vectorization path. This new path is integrated in LV but is independent of the inner loop vectorization path. We followed this approach to avoid destabilizing the current inner loop vectorizer while reusing code and minimizing divergence from the existing infrastructure. In the VPlan-native path, legal outer loops are fed into the LoopVectorizationPlanner, which only prints a debug message for now. Actual vectorization will be introduced in the subsequent patches of this series.

It's important to remark that all these changes are protected under the feature flag -enable-vplan-native-path. This should make this patch NFC for the existing inner loop vectorizer.

(*) Pragma 'clang vectorize' and pragma 'omp simd' are currently implemented with the same metadata (llvm.loop.vectorize) even though the former has auto-vectorization semantics and the latter has explicit vectorization semantics. We temporarily abuse pragma 'clang vectorize' on outer loops to denote explicit vectorization due to the shared implementation of both pragmas. This will be fixed when the native representation for pragma 'omp simd' is introduced in LLVM (WIP).
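For illustration only, here is a minimal sketch of the kind of loop nest the summary describes. It is not taken from the patch or its tests: the function name `scaleRows` and the data layout are made up. The hint uses the full `#pragma clang loop` spelling that Clang accepts. The inner loop bounds (0 and `Cols`) do not change across iterations of the outer loop, so the inner loop is uniform in the conservative, loop-invariant sense used above. Whether a given build does anything with the outer loop still depends on `-enable-vplan-native-path` and on the later patches of the series.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: scale every element of a row-major matrix.
void scaleRows(std::vector<float> &A, std::size_t Rows, std::size_t Cols,
               float Factor) {
  // Explicit vectorization hint on the OUTER loop, including the vector length.
#pragma clang loop vectorize(enable) vectorize_width(4)
  for (std::size_t i = 0; i < Rows; ++i) {
    // Uniform inner loop: its bounds are invariant in the outer loop.
    for (std::size_t j = 0; j < Cols; ++j)
      A[i * Cols + j] *= Factor;
  }
}
```

Compilers that do not recognize the pragma ignore it, so the function computes the same result either way.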

Event Timeline

dcaballe created this revision. Jan 23 2018, 3:09 PM

Herald added a subscriber: bollu. Jan 23 2018, 3:09 PM

dcaballe mentioned this in D40874: [LV][LoopInfo] Add irreducible CFG detection for outer loops. Jan 23 2018, 3:12 PM

sguggill added a subscriber: sguggill. Jan 23 2018, 3:14 PM

rogfer01 added a subscriber: rogfer01. Jan 23 2018, 3:33 PM

a.elovikov added a subscriber: a.elovikov. Jan 23 2018, 3:46 PM

ebrevnov added a subscriber: ebrevnov. Jan 23 2018, 4:17 PM

kmitropo added a subscriber: kmitropo. Jan 24 2018, 4:35 PM

tschuett added a subscriber: tschuett. Jan 25 2018, 11:10 PM

Ping :)

rengolin added inline comments. Mar 7 2018, 5:44 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
2320	Sorry, this is my fault, as both were done separately. We discussed adding one more metadata info to mean {"forced"/"hint"} but ended up never doing it. It should be simple to fix this, I think, just need to make sure we change all the tests correctly to what they're supposed to mean.
2364–2370	What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no? If I understand correctly, `containsIrreducibleCFG` is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front. How complex would it be to create the inherited attribute?
5139	nit. I'd remove this space, as both blocks are regarding (3)
7712	I wouldn't make this an assert, just a debug message and return
7717	left over?
7720	is this really the place to do predication and cost modelling?
8282	Isn't `containsIrreducibleCFG` depending on this to work?
8283	Better not to have TODOs as debug messages.
8730	I'm really uncomfortable with all these temporary code blocks that don't do anything... They're really just hijacking the existing infrastructure without implementing as a VPlan. I really thought the whole point of VPlans was that you wouldn't need to hack-it-up like we used to in the old vectoriser...

Thanks for the comments, Renato!
Please, have a look at my inline comments and let me know what you think.

Thanks!
Diego

lib/Transforms/Vectorize/LoopVectorize.cpp
2320	No problem. As I mention, there are going to be changes regarding the representation of #pragma omp simd. I think that's the right time to address the problem.
2364–2370	> What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no?

Sorry, I'm not sure I understand the question. If you mean the complexity of collecting nested loops given an outer loop, it would be linear in the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.

> If I understand correctly, containsIrreducibleCFG is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front. How complex would it be to create the inherited attribute?

Adding the attribute wouldn't be complicated. We would have a recursive call that processes loops from outer to inner. Once we know that the CFG of the outer loop is reducible, we can pass a flag through the recursive call to mark that nested loops are reducible and skip `containsIrreducibleCFG` for them. Does this make sense to you? I had it implemented but there is a subtle detail that would make this patch non-NFC. If you look at line 8901, collected loops are processed in reverse order. This basically means that if we have:

#pragma clang loop vectorize
for i
  for j
  for k

the current code would process loops k and j first. If one of them is vectorized, we couldn't vectorize 'i', the one marked for vectorization. I see two potential solutions:

1. Reverse the order in which loops are processed in line 8901. This is non-NFC and some existing LIT tests would have to be updated accordingly.
2. Collect loops at different loop nest levels in post-order and loops at the same level in pre-order. This would be the collection order for the previous example: j, k, i. This would be NFC for inner loops but I find it particularly weird.

IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch. What do you think?
7712	I think this should be an assert because an outer loop shouldn't reach this point if the VPlan-native path is disabled. However, I'm going to add the debug message in the caller code. Does it sound good?
7720	This function returns the VF to be used during code generation so we would need to evaluate the cost here to return the selected VF. Cost modeling shouldn't be part of the VPlan building process. The same approach is followed in the code below. Regarding predication, it must happen before the cost evaluation. We can discuss if it should belong here or not when we introduce the actual code. If the comment is confusing, I can remove it. We decided to introduce the TODOs to give a better picture of the subsequent patches but if this is not helpful or annoying I can just get rid of all of them. Please, let me know what you think.
8282	They are different things. `containsIrreducibleCFG` is used at the very beginning of the pass to collect potential loop candidates (without irreducible CFG) to be vectorized. `VPlanHCFGBuilder` will build a CFG out of the input IR, using the VPlan infrastructure (VPBlockBases). This VPlan CFG will be modified during the vectorization process without actually modifying the CFG of the input IR. Changes in the VPlan CFG will be materialized once the best profitable VPlan is chosen. Is it clearer now? The VPlanHCFGBuilder is going to be introduced in the next patch.
8730	This is the entrance to the VPlan-native vectorization path. It's not doing anything yet because we are trying to follow an incremental approach by releasing relatively small patches that are easy to digest. This code will be functional (generating vector code) soon. The code block is temporary as long as both vectorization paths co-exist but the final goal is to converge into a single one. This approach will allow us to incrementally and easily extend all the current inner loop vectorization functionality to support outer loops and, most importantly, doing so without destabilizing inner loop vectorization. We are really concerned about the latter and we think that this approach is a reasonable trade-off between safety and temporary code blocks. If you want to discuss this further, I would recommend to move the discussion to the RFC thread so that everybody is aware of it: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html
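The "inherited attribute" idea discussed for lines 2364–2370 can be sketched roughly as follows. This is a toy model under stated assumptions, not the code in the patch: `ToyLoop`, `containsIrreducibleCFGToy`, `collectLoops` and the `NumChecks` counter are hypothetical stand-ins, and the toy check simply inspects a precomputed flag instead of walking a real CFG. The point it tries to show is that once an enclosing loop is known to be reducible, the flag is passed down and the check is skipped for every nested loop.

```cpp
#include <vector>

// Toy stand-in for llvm::Loop; all names here are hypothetical.
struct ToyLoop {
  const char *Name;
  bool HasIrreducibleCFG; // pretend result for this loop's own blocks
  std::vector<ToyLoop *> SubLoops;
};

// Counts how often the "expensive" check runs, to make the saving visible.
static int NumChecks = 0;

// Like the real analysis, the toy check covers the loop and everything
// nested inside it.
static bool anyIrreducible(const ToyLoop &L) {
  if (L.HasIrreducibleCFG)
    return true;
  for (const ToyLoop *Sub : L.SubLoops)
    if (anyIrreducible(*Sub))
      return true;
  return false;
}

static bool containsIrreducibleCFGToy(const ToyLoop &L) {
  ++NumChecks;
  return anyIrreducible(L);
}

// Collect loops from outer to inner (pre-order). KnownReducible is the
// inherited attribute: when an enclosing loop was already found reducible,
// the check is skipped for the nested loop.
static void collectLoops(ToyLoop &L, bool KnownReducible,
                         std::vector<ToyLoop *> &Out) {
  bool Reducible = KnownReducible || !containsIrreducibleCFGToy(L);
  if (Reducible)
    Out.push_back(&L);
  for (ToyLoop *Sub : L.SubLoops)
    collectLoops(*Sub, Reducible, Out);
}
```

For a fully reducible nest made of loops i, j and k, this collects them in the order i, j, k with a single call to the check. Walking that list in reverse visits k and j before i, which is the ordering problem described above.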

lib/Transforms/Vectorize/LoopVectorize.cpp
8730	I'm working on the "converge into a single one" side. At this point, I'm taking care of the ground work of moving the right things to the right places such that I don't have to include those "almost NFC" things as part of "expand VPlan's participation into innermost loop vectorization". Thank you for helping me do that with your reviews. We need to be able to build VPlan for the innermost loop vectorization right after Legal, for example, before we can remove the diverged code path at the beginning. In the meantime, the outer loop vectorization patch series will help people realize how much innermost loop vectorization and outer loop vectorization have in common and, more importantly, help people think about how to write code that can work in both ways. That's as much as I want to write about the approach we are taking, within this patch review. The rest of the discussions should happen on the above mentioned RFC. Thanks.

Outer loop detection: only outer loops annotated with explicit vectorization hints, including the vector length, are collected for outer loop vectorization. This includes outer loops annotated with #pragma omp simd simdlen(#) or #pragma clang vectorize(enable) vectorize_width(#)*.

If I understand correctly, this limitation is due to the fact that VPlan based cost-modelling is not implemented yet, right? I think for testing, it would be useful to have an option to process all outer loops. The legality checks should filter out any unsupported loops and this way we could test the VPlan native code path on a much wider range of loops. I think it also would be great if we would have a bot that runs at least the test-suite with VPlan native to discover regressions.

lib/Transforms/Vectorize/LoopVectorize.cpp
29	Is it worth mentioning docs/Proposal/VectorizationPlan.rst as well?

rengolin added inline comments. Mar 9 2018, 1:43 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
263	Right now, this is just enabling outer-loop. Are you planning on adding more functionality to the native part of VPlan before merging the inner loop vectoriser into it? I wouldn't recommend, as we really don't want two paths in parallel for too long. I'd recommend this to just be called "vectorize-outerloop" or something.
2320	Yup.
2364–2370	> it would be linear in the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.

That's what I wanted to know, thanks. :)

> IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch.

Agreed.
7720	I don't like the idea of outer-loops being in a special branch of the code, but I understand the current prototype nature of it. I believe it's still not the time to define what goes where that hasn't been implemented yet, so better to remove the TODOs for now, in case they lead us astray in the future. Same for the debug messages, etc. What you should do is shortly explain why outer-loop needs "special handling", and that can be a one/two line comment in the beginning of the block.
8282	Right, as above, don't leave commented out code hanging. Feel free to add a two-line comment in the beginning of the block explaining the expectation.
8730	Ok, as above, just remove the comments and add a two-line comment summarising it.

dcaballe mentioned this in D44338: [LV][VPlan] Build plain CFG with simple VPInstructions for outer loops. Mar 9 2018, 4:42 PM

Thank you, Renato and Florian, for your comments.

this limitation is due to the fact that VPlan based cost-modelling is not implemented yet, right?

Not only. For full outer loop auto-vectorization we'd also need to extend Legal to check for data dependences that prevent the vectorization of outer loops. In loop nests with several outer loops, we'd also need to compare the cost of vectorizing each of them.

I think for testing, it would be useful to have an option to process all outer loops. The legality checks should filter out any unsupported loops and this way we could test the VPlan native code path on a much wider range of loops. I think it also would be great if we would have a bot that runs at least the test-suite with VPlan native to discover regressions.

Is -vplan-build-stress-test flag in Patch #3 (D44338) aligned with what you had in mind? :)
I would need some help/guidance with the bot part since I'm not familiar with that.

lib/Transforms/Vectorize/LoopVectorize.cpp
29	Definitely. Thanks!
263	Inner loop vectorization is a subset of outer loop vectorization so the VPlan native path will be inherently supporting inner loops. However, it's not our intention to enable it "for production" while both paths co-exist. However, as we described in the RFC, inner loop vectorization support in the VPlan native path is indispensable for the convergence of both paths. As we start migrating and extending all the existing functionality for inner loops to outer loops in the VPlan native path, we will need to compare side-by-side where both paths stand regarding inner loop vectorization. When both paths are comparable in that regard, the migration will be completed. Inner loop support will also be very useful for (stress) testing the VPlan native path, since some loops don't have another loop around. For these reasons we are not using the 'outerloop' word in the flags/interfaces. Does it make sense to you?
7720	> I believe it's still not the time to define what goes where that hasn't been implemented yet, so better to remove the TODOs for now, in case they lead us astray in the future.

> What you should do is shortly explain why outer-loop needs "special handling", and that can be a one/two line comment in the beginning of the block.

Agreed. Thanks!

Addressing previous comments.

egarcia added a subscriber: egarcia. Mar 15 2018, 11:16 AM

Ping :)

In D42447#1033730, @dcaballe wrote:

Thank you, Renato and Florian, for your comments.

this limitation is due to the fact that VPlan based cost-modelling is not implemented yet, right?

Not only. For full outer loop auto-vectorization we'd also need to extend Legal to check for data dependences that prevent the vectorization of outer loops. In loop nests with several outer loops, we'd also need to compare the cost of vectorizing each of them.

Ah yes, that's missing for now, thanks for clearing that up.

I think for testing, it would be useful to have an option to process all outer loops. The legality checks should filter out any unsupported loops and this way we could test the VPlan native code path on a much wider range of loops. I think it also would be great if we would have a bot that runs at least the test-suite with VPlan native to discover regressions.

Is -vplan-build-stress-test flag in Patch #3 (D44338) aligned with what you had in mind? :)

Yep, that's along the lines I had in mind. So far the checks are quite limited, but I think it is a good starting point :)

fhahn added inline comments. Mar 28 2018, 9:14 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
1641	Maybe add a newline to separate the 2 functions. Not sure if calling it out as helper function is necessary. In a way, most functions here are helper functions :)
5110	Work done here is potentially done multiple times for each loop, right? E.g. for deep loop nests, this will be called multiple times for the same Lp, but with different outer loops. Only a few checks here depend on the outer loop and I think ideally we would not check the same things again and again. For now those redundant checks are quite simple, but I think we should keep that issue in mind once we introduce more complex checks.
5119	I think the use of getCanonicalInductionVariable is discouraged. I think it would be better to detect induction variables using SCEV, as done in LoopVectorizationLegality.
8730	I am also slightly worried that people will come along and see this code and think that cost modelling and planning already works for outer loops, as it is used in the VPlan native path. But I think the comment makes it clear now. I am not sure if it would be clearer/nicer to have clearer separation by having the code in separate functions rather than adding even more code to those already huge functions.
test/Transforms/LoopVectorize/explicit_outer_detection.ll
223	attributes not needed here and in the tests below, as no cost modelling is done so far.

In D42447#1033730, @dcaballe wrote:

Is -vplan-build-stress-test flag in Patch #3 (D44338) aligned with what you had in mind? :)

Yes, that's exactly what I had in mind. :)

I have no further questions, I'll let @fhahn finish this review.

Thanks!

lib/Transforms/Vectorize/LoopVectorize.cpp
263	It does, thanks for the explanation!

Thanks for your comments, Florian and Renato!
More comments inline.

Diego.

lib/Transforms/Vectorize/LoopVectorize.cpp
1641	Sounds good! Thanks!
5110	Good point. `OuterLp` will be fixed, at least in the short term while we only support explicit vectorization. Given that we are introducing support for divergent inner loops in the patch series #4, it's more likely that we don't need this function (or at least this function as is) before we introduce the engine to evaluate different outer loops. In any case, the proper inner loop uniformity check will depend on the outer loop we are vectorizing. Some of these "extra" checks are very specific for the patch series #1, where the supported loops are very limited. They will be progressively removed, leaving only the `OuterLp` dependent checks. For these reasons, IMO, it makes sense to keep all these checks and the documentation together. I think it's easier to understand which inner loops are currently supported. I could add a comment explaining your concerns. However, I could try to split them if you think this is not enough. Please, let me know what you think.
5119	Could you please elaborate a bit more? Why is it discouraged? I can't find any comments in the source code. We are trying to introduce some restrictive but simple checks. If the answer is that this interface is discouraged because it may not detect some IVs that are canonical, that would be perfectly fine. I'm also looking at the `LoopVectorizationLegality::addInductionPhi`. Isn't this function doing something similar to `getCanonicalInductionVariable` to detect the primary induction but using `InductionDescriptor`?
8730	> I am not sure if it would be clearer/nicer to have clearer separation by having the code in separate functions rather than adding even more code to those already huge functions.

Agreed, I could move this code to a separate function.

> rather than adding even more code to those already huge functions.

Are you talking about only this function or also some other ones?

fhahn added inline comments. Apr 9 2018, 8:17 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
5110	Yes, I think for now it is fine, but we should definitely keep that in mind, for future checks. Another related question came to mind: How are we going to deal with nested loops where different nests can be vectorized, e.g. say we have nested loops with 3 levels and both outer-most loops can be vectorized? If we decide to vectorize the outermost loop, wouldn't we have to skip handling the other outer loop? Or have links between the VPlans to decide which level is best to vectorize?
5119	Yes, for the very simple checks it works, but as you said we potentially miss IVs that we could support. I suppose we could use getCanonicalInductionVariable if it keeps things simple for now, but we definitely should relax that soonish.
8730	Mostly this function and LoopVectorizationPlanner::plan. Otherwise it is already nicely separated.

Addressing Florian's comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
5110	> Yes, I think for now it is fine, but we should definitely keep that in mind, for future checks.

Ok, I added a comment explaining the situation. I also tried to find a better place for these checks, at least to indicate it in the comment, but I couldn't find a good place at this point. We don't have the infrastructure to evaluate multiple outer loops of the same loop nest. Hopefully, the comment is clear enough.

> If we decide to vectorize the outermost loop, wouldn't we have to skip handling the other outer loop?

Yes, if the outermost loop is finally vectorized, the outer and the inner should be marked also as vectorized (I introduced a TODO suggesting this in an earlier version of this patch, but we decided to remove it to avoid confusion and keep the code cleaner). Or we could follow any other approach that leads to the same behavior: skipping them.

> Or have links between the VPlans to decide which level is best to vectorize?

The idea is to have an initial H-CFG modeling the input IR of the whole loop nest (starting from the outermost vectorizable loop) and use it as a starting point to evaluate all the candidate loops for vectorization. Does this answer your questions?
5119	Then it's perfectly fine. Let's do it incrementally. It's the whole purpose of the approach. We may even want to support non-canonical IVs sooner than later and coming up with something more complicated for the current check might not be worth it.
8730	Ok, thanks! I wasn't sure whether the new 'processLoopInVPlanNativePath' should be a member function of LoopVectorizePass (same as 'processLoop') to avoid passing most of the parameters. I decided not to do it so as not to modify the public header file. Please, let me know what you think.
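The remark that the inner loop uniformity check depends on the outer loop being vectorized can be illustrated with a small model. This is a deliberately simplified sketch, not the check implemented in the patch: `ToyLoop`, `ToyValue`, `ToyInnerLoop` and the helper functions are hypothetical, and "uniform" is reduced to "both bounds are defined outside the chosen outer loop", which mirrors the conservative loop-invariance notion from the summary.

```cpp
// Toy stand-ins; none of these are LLVM classes.
struct ToyLoop {
  const ToyLoop *Parent; // nullptr for a top-level loop
};

struct ToyValue {
  const ToyLoop *DefinedIn; // nullptr means defined outside all loops
};

struct ToyInnerLoop {
  ToyValue Lower;
  ToyValue Upper;
};

// True if L is Outer itself or is nested somewhere inside Outer.
static bool isInside(const ToyLoop *L, const ToyLoop *Outer) {
  for (; L; L = L->Parent)
    if (L == Outer)
      return true;
  return false;
}

// Conservative uniformity: a value is uniform with respect to Outer only if
// it is invariant in Outer, i.e. defined outside of it.
static bool isUniformIn(const ToyValue &V, const ToyLoop *Outer) {
  return !isInside(V.DefinedIn, Outer);
}

// An inner loop is accepted only when both of its bounds are uniform.
static bool isUniformInnerLoop(const ToyInnerLoop &IL, const ToyLoop *Outer) {
  return isUniformIn(IL.Lower, Outer) && isUniformIn(IL.Upper, Outer);
}
```

With a nest A > B > C where the upper bound of C is computed in A but outside B, the inner loop C is uniform with respect to B and not with respect to A, so the answer changes with the candidate outer loop.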

javed.absar added a subscriber: javed.absar. Apr 13 2018, 5:40 AM

Thanks Diego and thanks for your patience! LGTM, but please wait a bit with committing, in case other people want to raise any additional comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
8625	I suppose we should never call processLoopInVPlanNativePath without the flag? Could this be an assertion?

This revision is now accepted and ready to land. Apr 18 2018, 10:37 AM

javed.absar added inline comments. Apr 18 2018, 11:25 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4948	Can we not simply check for isLoopSimplifyForm() ?

In D42447#1071162, @fhahn wrote:

Thanks Diego and thanks for your patience! LGTM, but please wait a bit with committing, in case other people want to raise any additional comments.

Thanks, Florian! I'll wait until Monday.

lib/Transforms/Vectorize/LoopVectorize.cpp
4948	Thanks for the comment, Javed! I guess that `getNumBackEdges` is more efficient? `isLoopSimplifyForm` is checking `getLoopPreheader() && getLoopLatch() && hasDedicatedExits()` and, having a quick look, only the last one is more expensive than the `getNumBackEdges`. In any case, please, note that this call was already there. I'm not adding it as part of this patch.
8625	Ok, thanks!

Adding assert and rebasing diff to ToT
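A rough sketch of the guard pattern discussed for lines 7712 and 8625: the caller decides whether an outer loop may enter the VPlan-native path, and the callee asserts that it is never reached with the flag off. Everything below is hypothetical. The real option is a command-line flag inside LLVM, and the real functions take a `Loop` plus many analyses; here a plain `bool` and an enum stand in for them.

```cpp
#include <cassert>

// Stand-in for the -enable-vplan-native-path option (assumption: a plain
// global instead of a command-line option).
static bool EnableVPlanNativePathToy = false;

enum class Path { Skipped, Inner, VPlanNative };

// Callee: must never be reached when the feature flag is off.
static Path processLoopInVPlanNativePathToy() {
  assert(EnableVPlanNativePathToy && "VPlan-native path is disabled.");
  return Path::VPlanNative;
}

// Caller: inner loops keep using the existing path; outer loops are only
// forwarded when the flag is on, so the assert above cannot fire from here.
static Path processLoopToy(bool IsOuterLoop) {
  if (!IsOuterLoop)
    return Path::Inner;
  if (!EnableVPlanNativePathToy)
    return Path::Skipped; // a debug message would be emitted here
  return processLoopInVPlanNativePathToy();
}
```

The design choice is that the assert documents an invariant of the callee, while the user-visible decision (and any debug output) lives in the caller.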

dcaballe added inline comments. Apr 19 2018, 12:20 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
8627	@hsaito, there was a conflict with D45072 and I had to construct IAI here and carry over DT just for it. I wonder if it would be better to make IAI optional in CM to avoid this. It would make the code reuse easier.

hsaito added inline comments. Apr 19 2018, 12:37 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
8627	Feel free to change IAI into a pointer that can be nullptr. After all, InterleavedAccess is an optimization step that we should be able to skip. D45072 didn't get into that, but I was planning to do that while I clean up CostModel. I kept it as a reference simply because it was a reference in Legal. If you want me to do it, I can do it quick.

hsaito added inline comments. Apr 19 2018, 4:31 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
8627	The "Optimize" phase of the vectorizer will most likely need DT anyway in the long run, so having to carry over DT is not really a bad thing in itself. Part of the reason is your design choice of not making processLoopInVPlanNativePath a member function of the LoopVectorizePass class. For the time being, I think this part of the code can go in as is, i.e., without making IAI optional. I'll be touching CM code soon enough anyway.

dcaballe added inline comments. Apr 20 2018, 1:01 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
8627	> If you want me to do it, I can do it quick.

> I think this part of the code can go in as is, i.e., without making IAI optional. I'll be touching CM code soon enough anyway.

I'm ok with either. If you want to quickly fix it, I can wait for that patch. I won't be committing it until next week. Otherwise, you can remove this line later when you address it.

> Part of the reason is your design choice of not making processLoopInVPlanNativePath a member function of LoopVectorizePass class.

I just tried to keep these changes away from LoopVectorize.h but the long parameter list is certainly inconvenient. If you think it's better, I can just make it a member. Please, let me know what you think.

hsaito added inline comments. Apr 20 2018, 10:28 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
8627	I think it's best to get this checked in and then address CostModel stuff as it gets restructured for VPlan and longer term future. I think it's okay to keep it as a static function with a lot of parameters until it is ready to take over the position of processLoop(). When most of the functionality is in place, this function would need all the analysis just like processLoop() does. So, passing DT will inevitably happen even if we don't do it now.
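The suggestion to turn IAI "into a pointer that can be nullptr" could look roughly like the sketch below. It is only an illustration of the optional-dependency pattern: `ToyInterleavedAccessInfo`, `ToyCostModel`, `getMemoryCost` and the cost numbers are invented and do not reflect the real cost model interface.

```cpp
// Hypothetical stand-in for the interleaved access analysis.
struct ToyInterleavedAccessInfo {
  // Made-up rule: even instruction ids belong to an interleave group.
  bool isInterleaved(int InstrId) const { return InstrId % 2 == 0; }
};

// Hypothetical cost model that treats the analysis as optional.
class ToyCostModel {
  const ToyInterleavedAccessInfo *IAI; // may be nullptr

public:
  explicit ToyCostModel(const ToyInterleavedAccessInfo *IAI = nullptr)
      : IAI(IAI) {}

  // Made-up costs: an access is cheaper only when the optional analysis is
  // present and reports it as interleaved.
  unsigned getMemoryCost(int InstrId) const {
    if (IAI && IAI->isInterleaved(InstrId))
      return 1;
    return 2;
  }
};
```

A caller that has no interleaved access information, such as a path that skips that optimization step, can then construct the cost model without it instead of building the analysis just to satisfy a reference parameter.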

Thanks, Hideki.
Ok, I'll go ahead and commit it.

Diego.

Thank you! LGTM!

Closed by commit rL330739: [LV][VPlan] Detect outer loops for explicit vectorization. (authored by dcaballe). Apr 24 2018, 10:07 AM

This revision was automatically updated to reflect the committed changes.

dcaballe mentioned this in D41953: [LoopUnroll] Unroll and Jam. May 29 2018, 2:43 PM

Revision Contents

Path	Size
include/llvm/Transforms/Vectorize/LoopVectorize.h	8 lines
lib/Transforms/Vectorize/LoopVectorizationPlanner.h	4 lines
lib/Transforms/Vectorize/LoopVectorize.cpp	419 lines
test/Transforms/LoopVectorize/explicit_outer_detection.ll	238 lines
test/Transforms/LoopVectorize/explicit_outer_nonuniform_inner.ll	177 lines
test/Transforms/LoopVectorize/explicit_outer_uniform_diverg_branch.ll	138 lines

Diff 143148

include/llvm/Transforms/Vectorize/LoopVectorize.h

 // 2. LoopVectorizationLegality - A unit that checks for the legality
 // of the vectorization.
 // 3. InnerLoopVectorizer - A unit that performs the actual
 // widening of instructions.
 // 4. LoopVectorizationCostModel - A unit that checks for the profitability
 // of vectorization. It decides on the optimal vector width, which
 // can be one, if vectorization is not profitable.
 //
+// There is a development effort going on to migrate loop vectorizer to the
+// VPlan infrastructure and to introduce outer loop vectorization support (see
+// docs/Proposal/VectorizationPlan.rst and
+// http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html). For this
+// purpose, we temporarily introduced the VPlan-native vectorization path: an
+// alternative vectorization path that is natively implemented on top of the
+// VPlan infrastructure. See EnableVPlanNativePath for enabling.
+//
 //===----------------------------------------------------------------------===//
 //
 // The reduction-variable vectorization is based on the paper:
 // D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.
 //
 // Variable uniformity checks are inspired by:
 // Karrenberg, R. and Hack, S. Whole Function Vectorization.
 //

lib/Transforms/Vectorize/LoopVectorizationPlanner.h

 LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI,
 const TargetTransformInfo *TTI,
 LoopVectorizationLegality *Legal,
 LoopVectorizationCostModel &CM)
 : OrigLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM) {}

 /// Plan how to best vectorize, return the best VF and its cost.
 VectorizationFactor plan(bool OptForSize, unsigned UserVF);

+/// Use the VPlan-native path to plan how to best vectorize, return the best
+/// VF and its cost.
+VectorizationFactor planInVPlanNativePath(bool OptForSize, unsigned UserVF);
+
 /// Finalize the best decision and dispose of all other VPlans.
 void setBestPlan(unsigned VF, unsigned UF);

 /// Generate the IR code for the body of the vectorized loop according to the
 /// best selected VPlan.
 void executePlan(InnerLoopVectorizer &LB, DominatorTree *DT);

 void printPlans(raw_ostream &O) {

lib/Transforms/Vectorize/LoopVectorize.cpp

Show All 20 Lines
// 2. LoopVectorizationLegality - A unit that checks for the legality		// 2. LoopVectorizationLegality - A unit that checks for the legality
// of the vectorization.		// of the vectorization.
// 3. InnerLoopVectorizer - A unit that performs the actual		// 3. InnerLoopVectorizer - A unit that performs the actual
// widening of instructions.		// widening of instructions.
// 4. LoopVectorizationCostModel - A unit that checks for the profitability		// 4. LoopVectorizationCostModel - A unit that checks for the profitability
// of vectorization. It decides on the optimal vector width, which		// of vectorization. It decides on the optimal vector width, which
// can be one, if vectorization is not profitable.		// can be one, if vectorization is not profitable.
//		//
		// There is a development effort going on to migrate loop vectorizer to the
		fhahn: Is it worth mentioning docs/Proposal/VectorizationPlan.rst as well?
		dcaballe: Definitely. Thanks!
		// VPlan infrastructure and to introduce outer loop vectorization support (see
		// docs/Proposal/VectorizationPlan.rst and
		// http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html). For this
		// purpose, we temporarily introduced the VPlan-native vectorization path: an
		// alternative vectorization path that is natively implemented on top of the
		// VPlan infrastructure. See EnableVPlanNativePath for enabling.
		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// The reduction-variable vectorization is based on the paper:		// The reduction-variable vectorization is based on the paper:
// D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.		// D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.
//		//
// Variable uniformity checks are inspired by:		// Variable uniformity checks are inspired by:
// Karrenberg, R. and Hack, S. Whole Function Vectorization.		// Karrenberg, R. and Hack, S. Whole Function Vectorization.
//		//
▲ Show 20 Lines • Show All 209 Lines • ▼ Show 20 Lines	static cl::opt<unsigned> VectorizeSCEVCheckThreshold(
"vectorize-scev-check-threshold", cl::init(16), cl::Hidden,		"vectorize-scev-check-threshold", cl::init(16), cl::Hidden,
cl::desc("The maximum number of SCEV checks allowed."));		cl::desc("The maximum number of SCEV checks allowed."));

static cl::opt<unsigned> PragmaVectorizeSCEVCheckThreshold(		static cl::opt<unsigned> PragmaVectorizeSCEVCheckThreshold(
"pragma-vectorize-scev-check-threshold", cl::init(128), cl::Hidden,		"pragma-vectorize-scev-check-threshold", cl::init(128), cl::Hidden,
cl::desc("The maximum number of SCEV checks allowed with a "		cl::desc("The maximum number of SCEV checks allowed with a "
"vectorize(enable) pragma"));		"vectorize(enable) pragma"));

		static cl::opt<bool> EnableVPlanNativePath(
		"enable-vplan-native-path", cl::init(false), cl::Hidden,
		rengolin: Right now, this is just enabling outer-loop. Are you planning on adding more functionality to the native part of VPlan before merging the inner loop vectoriser into it? I wouldn't recommend, as we really don't want two paths in parallel for too long. I'd recommend this to just be called "vectorize-outerloop" or something.
		dcaballe: Inner loop vectorization is a subset of outer loop vectorization so the VPlan native path will be inherently supporting inner loops. However, it's not our intention to enable it "for production" while both paths co-exist. However, as we described in the RFC, inner loop vectorization support in the VPlan native path is indispensable for the convergence of both paths. As we start migrating and extending all the existing functionality for inner loops to outer loops in the VPlan native path, we will need to compare side-by-side where both paths stand regarding inner loop vectorization. When both paths are comparable in that regard, the migration will be completed. Inner loop support will also be very useful for (stress) testing the VPlan native path, since some loops don't have another loop around. For these reasons we are not using the 'outerloop' word in the flags/interfaces. Does it make sense to you?
		rengolin: It does, thanks for the explanation!
		cl::desc("Enable VPlan-native vectorization path with "
		"support for outer loop vectorization."));

/// Create an analysis remark that explains why vectorization failed		/// Create an analysis remark that explains why vectorization failed
///		///
/// \p PassName is the name of the pass (e.g. can be AlwaysPrint). \p		/// \p PassName is the name of the pass (e.g. can be AlwaysPrint). \p
/// RemarkName is the identifier for the remark. If \p I is passed it is an		/// RemarkName is the identifier for the remark. If \p I is passed it is an
/// instruction that prevents vectorization. Otherwise \p TheLoop is used for		/// instruction that prevents vectorization. Otherwise \p TheLoop is used for
/// the location of the remark. \return the remark object that can be		/// the location of the remark. \return the remark object that can be
/// streamed to.		/// streamed to.
static OptimizationRemarkAnalysis		static OptimizationRemarkAnalysis
▲ Show 20 Lines • Show All 1,252 Lines • ▼ Show 20 Lines
class LoopVectorizationLegality {		class LoopVectorizationLegality {
public:		public:
LoopVectorizationLegality(		LoopVectorizationLegality(
Loop *L, PredicatedScalarEvolution &PSE, DominatorTree *DT,		Loop *L, PredicatedScalarEvolution &PSE, DominatorTree *DT,
TargetLibraryInfo *TLI, AliasAnalysis *AA, Function *F,		TargetLibraryInfo *TLI, AliasAnalysis *AA, Function *F,
std::function<const LoopAccessInfo &(Loop &)> *GetLAA, LoopInfo *LI,		std::function<const LoopAccessInfo &(Loop &)> *GetLAA, LoopInfo *LI,
OptimizationRemarkEmitter *ORE, LoopVectorizationRequirements *R,		OptimizationRemarkEmitter *ORE, LoopVectorizationRequirements *R,
LoopVectorizeHints *H, DemandedBits *DB, AssumptionCache *AC)		LoopVectorizeHints *H, DemandedBits *DB, AssumptionCache *AC)
: TheLoop(L), PSE(PSE), TLI(TLI), DT(DT), GetLAA(GetLAA),		: TheLoop(L), LI(LI), PSE(PSE), TLI(TLI), DT(DT), GetLAA(GetLAA),
ORE(ORE), Requirements(R), Hints(H), DB(DB), AC(AC) {}		ORE(ORE), Requirements(R), Hints(H), DB(DB), AC(AC) {}

/// ReductionList contains the reduction descriptors for all		/// ReductionList contains the reduction descriptors for all
/// of the reductions that were found in the loop.		/// of the reductions that were found in the loop.
using ReductionList = DenseMap<PHINode *, RecurrenceDescriptor>;		using ReductionList = DenseMap<PHINode *, RecurrenceDescriptor>;

/// InductionList saves induction variables and maps them to the		/// InductionList saves induction variables and maps them to the
/// induction descriptor.		/// induction descriptor.
▲ Show 20 Lines • Show All 85 Lines • ▼ Show 20 Lines	public:

unsigned getNumStores() const { return LAI->getNumStores(); }		unsigned getNumStores() const { return LAI->getNumStores(); }
unsigned getNumLoads() const { return LAI->getNumLoads(); }		unsigned getNumLoads() const { return LAI->getNumLoads(); }

// Returns true if the NoNaN attribute is set on the function.		// Returns true if the NoNaN attribute is set on the function.
bool hasFunNoNaNAttr() const { return HasFunNoNaNAttr; }		bool hasFunNoNaNAttr() const { return HasFunNoNaNAttr; }

private:		private:
		/// Return true if the pre-header, exiting and latch blocks of \p Lp and all
		/// its nested loops are considered legal for vectorization. These legal
		/// checks are common for inner and outer loop vectorization.
		bool canVectorizeLoopNestCFG(Loop *Lp);

		fhahn: Maybe add a newline to separate the 2 functions. Not sure if calling it out as helper function is necessary. In a way, most functions here are helper functions :)
		dcaballe: Sounds good! Thanks!
		/// Return true if the pre-header, exiting and latch blocks of \p Lp
		/// (non-recursive) are considered legal for vectorization.
		bool canVectorizeLoopCFG(Loop *Lp);

/// Check if a single basic block loop is vectorizable.		/// Check if a single basic block loop is vectorizable.
/// At this point we know that this is a loop with a constant trip count		/// At this point we know that this is a loop with a constant trip count
/// and we only need to check individual instructions.		/// and we only need to check individual instructions.
bool canVectorizeInstrs();		bool canVectorizeInstrs();

/// When we vectorize loops we may change the order in which		/// When we vectorize loops we may change the order in which
/// we read and write from memory. This method checks if it is		/// we read and write from memory. This method checks if it is
/// legal to vectorize the code, considering only memory constrains.		/// legal to vectorize the code, considering only memory constrains.
/// Returns true if the loop is vectorizable		/// Returns true if the loop is vectorizable
bool canVectorizeMemory();		bool canVectorizeMemory();

/// Return true if we can vectorize this loop using the IF-conversion		/// Return true if we can vectorize this loop using the IF-conversion
/// transformation.		/// transformation.
bool canVectorizeWithIfConvert();		bool canVectorizeWithIfConvert();

		/// Return true if we can vectorize this outer loop. The method performs
		/// specific checks for outer loop vectorization.
		bool canVectorizeOuterLoop();

/// Return true if all of the instructions in the block can be speculatively		/// Return true if all of the instructions in the block can be speculatively
/// executed. \p SafePtrs is a list of addresses that are known to be legal		/// executed. \p SafePtrs is a list of addresses that are known to be legal
/// and we know that we can read from them without segfault.		/// and we know that we can read from them without segfault.
bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs);		bool blockCanBePredicated(BasicBlock *BB, SmallPtrSetImpl<Value *> &SafePtrs);

/// Updates the vectorization state by adding \p Phi to the inductions list.		/// Updates the vectorization state by adding \p Phi to the inductions list.
/// This can set \p Phi as the main induction of the loop if \p Phi is a		/// This can set \p Phi as the main induction of the loop if \p Phi is a
/// better choice for the main induction than the existing one.		/// better choice for the main induction than the existing one.
Show All 20 Lines	const ValueToValueMap *getSymbolicStrides() {
// pointer is checked to reference consecutive elements suitable for a		// pointer is checked to reference consecutive elements suitable for a
// masked access.		// masked access.
return LAI ? &LAI->getSymbolicStrides() : nullptr;		return LAI ? &LAI->getSymbolicStrides() : nullptr;
}		}

/// The loop that we evaluate.		/// The loop that we evaluate.
Loop *TheLoop;		Loop *TheLoop;

		/// Loop Info analysis.
		LoopInfo *LI;

/// A wrapper around ScalarEvolution used to add runtime SCEV checks.		/// A wrapper around ScalarEvolution used to add runtime SCEV checks.
/// Applies dynamic knowledge to simplify SCEV expressions in the context		/// Applies dynamic knowledge to simplify SCEV expressions in the context
/// of existing SCEV assumptions. The analysis will also add a minimal set		/// of existing SCEV assumptions. The analysis will also add a minimal set
/// of new predicates if this is required to enable vectorization and		/// of new predicates if this is required to enable vectorization and
/// unrolling.		/// unrolling.
PredicatedScalarEvolution &PSE;		PredicatedScalarEvolution &PSE;

/// Target Library Info.		/// Target Library Info.
▲ Show 20 Lines • Show All 587 Lines • ▼ Show 20 Lines	private:
Instruction *UnsafeAlgebraInst = nullptr;		Instruction *UnsafeAlgebraInst = nullptr;

/// Interface to emit optimization remarks.		/// Interface to emit optimization remarks.
OptimizationRemarkEmitter &ORE;		OptimizationRemarkEmitter &ORE;
};		};

} // end anonymous namespace		} // end anonymous namespace

static void addAcyclicInnerLoop(Loop &L, LoopInfo &LI,		// Return true if \p OuterLp is an outer loop annotated with hints for explicit
		// vectorization. The loop needs to be annotated with #pragma omp simd
		// simdlen(#) or #pragma clang vectorize(enable) vectorize_width(#). If the
		// vector length information is not provided, vectorization is not considered
		// explicit. Interleave hints are not allowed either. These limitations will be
		// relaxed in the future.
		// Please, note that we are currently forced to abuse the pragma 'clang
		// vectorize' semantics. This pragma provides auto-vectorization hints
		// (i.e., LV must check that vectorization is legal) whereas pragma 'omp simd'
		// provides explicit vectorization hints (LV can bypass legal checks and
		// assume that vectorization is legal). However, both hints are implemented
		// using the same metadata (llvm.loop.vectorize, processed by
		// LoopVectorizeHints). This will be fixed in the future when the native IR
		// representation for pragma 'omp simd' is introduced.
		rengolin: Sorry, this is my fault, as both were done separately. We discussed adding one more metadata info to mean {"forced"/"hint"} but ended up never doing it. It should be simple to fix this, I think, just need to make sure we change all the tests correctly to what they're supposed to mean.
		dcaballe: No problem. As I mention, there are going to be changes regarding the representation of #pragma omp simd. I think that's the right time to address the problem.
		rengolin: Yup.
		static bool isExplicitVecOuterLoop(Loop *OuterLp,
		OptimizationRemarkEmitter *ORE) {
		assert(!OuterLp->empty() && "This is not an outer loop");
		LoopVectorizeHints Hints(OuterLp, true /*DisableInterleaving*/, *ORE);

		// Only outer loops with an explicit vectorization hint are supported.
		// Unannotated outer loops are ignored.
		if (Hints.getForce() == LoopVectorizeHints::FK_Undefined)
		return false;

		Function *Fn = OuterLp->getHeader()->getParent();
		if (!Hints.allowVectorization(Fn, OuterLp, false /*AlwaysVectorize*/)) {
		DEBUG(dbgs() << "LV: Loop hints prevent outer loop vectorization.\n");
		return false;
		}

		if (!Hints.getWidth()) {
		DEBUG(dbgs() << "LV: Not vectorizing: No user vector width.\n");
		emitMissedWarning(Fn, OuterLp, Hints, ORE);
		return false;
		}

		if (Hints.getInterleave() > 1) {
		// TODO: Interleave support is future work.
		DEBUG(dbgs() << "LV: Not vectorizing: Interleave is not supported for "
		"outer loops.\n");
		emitMissedWarning(Fn, OuterLp, Hints, ORE);
		return false;
		}

		return true;
		}

		static void collectSupportedLoops(Loop &L, LoopInfo *LI,
		OptimizationRemarkEmitter *ORE,
SmallVectorImpl<Loop *> &V) {		SmallVectorImpl<Loop *> &V) {
if (L.empty()) {		// Collect inner loops and outer loops without irreducible control flow. For
		// now, only collect outer loops that have explicit vectorization hints.
		if (L.empty() || (EnableVPlanNativePath && isExplicitVecOuterLoop(&L, ORE))) {
LoopBlocksRPO RPOT(&L);		LoopBlocksRPO RPOT(&L);
RPOT.perform(&LI);		RPOT.perform(LI);
if (!containsIrreducibleCFG<const BasicBlock *>(RPOT, LI))		if (!containsIrreducibleCFG<const BasicBlock *>(RPOT, *LI)) {
V.push_back(&L);		V.push_back(&L);
		// TODO: Collect inner loops inside marked outer loops in case
		// vectorization fails for the outer loop. Do not invoke
		// 'containsIrreducibleCFG' again for inner loops when the outer loop is
		// already known to be reducible. We can use an inherited attribute for
		// that.
return;		return;
}		}
		rengolin: What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no? If I understand correctly, `containsIrreducibleCFG` is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front. How complex would it be to create the inherited attribute?
		dcaballe:
		> What's the complexity of this analysis? We'll be adding a lot of those, repeatedly, no?
		Sorry, I'm not sure I understand the question. If you mean the complexity of collecting nested loops given an outer loop, it would be linear on the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.
		> If I understand correctly, containsIrreducibleCFG is not that simple, in addition to the traversal. Not calling it unnecessarily would be a nice thing to have up front. How complex would it be to create the inherited attribute?
		Adding the attribute wouldn't be complicated. We would have a recursive call that processes loops from outer to inner. Once we know that the CFG of the outer loop is reducible, we can pass a flag through the recursive call to mark that nested loops are reducible and skip `containsIrreducibleCFG` for them. Does this make sense to you?
		I had it implemented but there is a subtle detail that would make this patch non-NFC. If you look at line 8901, collected loops are processed in reverse order. This basically means that if we have:
		#pragma clang loop vectorize
		for i
		  for j
		  for k
		}
		the current code would process loops k and j first. If one of them is vectorized, we couldn't vectorize 'i', the one marked for vectorization. I see two potential solutions:
		1. Reverse the order in which loops are processed in line 8901. This is non-NFC and some existing LIT tests would have to be updated accordingly.
		2. Collect loops at different loop nest levels in post-order and loops at the same level in pre-order. This would be the collection order for the previous example: j, k, i. This would be NFC for inner loops but I find it particularly weird.
		IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch. What do you think?
		rengolin:
		> it would be linear on the number of nested loops. We wouldn't add the same loop multiple times. Each loop in the loop nest would be added and processed just once.
		That's what I wanted to know, thanks. :)
		> IMO, option 1 seems the right approach but it's non-NFC and I wouldn't include it as part of this patch.
		Agreed.
		}
for (Loop *InnerL : L)		for (Loop *InnerL : L)
addAcyclicInnerLoop(*InnerL, LI, V);		collectSupportedLoops(*InnerL, LI, ORE, V);
}		}

namespace {		namespace {

/// The LoopVectorize Pass.		/// The LoopVectorize Pass.
struct LoopVectorize : public FunctionPass {		struct LoopVectorize : public FunctionPass {
/// Pass identification, replacement for typeid		/// Pass identification, replacement for typeid
static char ID;		static char ID;
▲ Show 20 Lines • Show All 2,530 Lines • ▼ Show 20 Lines	if (blockNeedsPredication(BB)) {
return false;		return false;
}		}
}		}

// We can if-convert this loop.		// We can if-convert this loop.
return true;		return true;
}		}

bool LoopVectorizationLegality::canVectorize() {		// Helper function to canVectorizeLoopNestCFG.
		bool LoopVectorizationLegality::canVectorizeLoopCFG(Loop *Lp) {
		assert((EnableVPlanNativePath || Lp->empty()) &&
		"VPlan-native path is not enabled.");

		// TODO: ORE should be improved to show more accurate information when an
		// outer loop can't be vectorized because a nested loop is not understood or
		// legal. Something like: "outer_loop_location: loop not vectorized:
		// (inner_loop_location) loop control flow is not understood by vectorizer".

// Store the result and return it at the end instead of exiting early, in case		// Store the result and return it at the end instead of exiting early, in case
// allowExtraAnalysis is used to report multiple reasons for not vectorizing.		// allowExtraAnalysis is used to report multiple reasons for not vectorizing.
bool Result = true;		bool Result = true;

bool DoExtraAnalysis = ORE->allowExtraAnalysis(DEBUG_TYPE);		bool DoExtraAnalysis = ORE->allowExtraAnalysis(DEBUG_TYPE);

// We must have a loop in canonical form. Loops with indirectbr in them cannot		// We must have a loop in canonical form. Loops with indirectbr in them cannot
// be canonicalized.		// be canonicalized.
if (!TheLoop->getLoopPreheader()) {		if (!Lp->getLoopPreheader()) {
DEBUG(dbgs() << "LV: Loop doesn't have a legal pre-header.\n");		DEBUG(dbgs() << "LV: Loop doesn't have a legal pre-header.\n");
ORE->emit(createMissedAnalysis("CFGNotUnderstood")		ORE->emit(createMissedAnalysis("CFGNotUnderstood")
<< "loop control flow is not understood by vectorizer");		<< "loop control flow is not understood by vectorizer");
if (DoExtraAnalysis)		if (DoExtraAnalysis)
Result = false;		Result = false;
else		else
return false;		return false;
}		}

// FIXME: The code is currently dead, since the loop gets sent to
// LoopVectorizationLegality is already an innermost loop.
//
// We can only vectorize innermost loops.
if (!TheLoop->empty()) {
ORE->emit(createMissedAnalysis("NotInnermostLoop")
<< "loop is not the innermost loop");
if (DoExtraAnalysis)
Result = false;
else
return false;
}

// We must have a single backedge.		// We must have a single backedge.
if (TheLoop->getNumBackEdges() != 1) {		if (Lp->getNumBackEdges() != 1) {
		javed.absar: Can we not simply check for isLoopSimplifyForm() ?
		dcaballe: Thanks for the comment, Javed! I guess that `getNumBackEdges` is more efficient? `isLoopSimplifyForm` is checking `getLoopPreheader() && getLoopLatch() && hasDedicatedExits()` and, having a quick look, only the last one is more expensive than the `getNumBackEdges`. In any case, please, note that this call was already there. I'm not adding it as part of this patch.
ORE->emit(createMissedAnalysis("CFGNotUnderstood")		ORE->emit(createMissedAnalysis("CFGNotUnderstood")
<< "loop control flow is not understood by vectorizer");		<< "loop control flow is not understood by vectorizer");
if (DoExtraAnalysis)		if (DoExtraAnalysis)
Result = false;		Result = false;
else		else
return false;		return false;
}		}

// We must have a single exiting block.		// We must have a single exiting block.
if (!TheLoop->getExitingBlock()) {		if (!Lp->getExitingBlock()) {
ORE->emit(createMissedAnalysis("CFGNotUnderstood")		ORE->emit(createMissedAnalysis("CFGNotUnderstood")
<< "loop control flow is not understood by vectorizer");		<< "loop control flow is not understood by vectorizer");
if (DoExtraAnalysis)		if (DoExtraAnalysis)
Result = false;		Result = false;
else		else
return false;		return false;
}		}

// We only handle bottom-tested loops, i.e. loop in which the condition is		// We only handle bottom-tested loops, i.e. loop in which the condition is
// checked at the end of each iteration. With that we can assume that all		// checked at the end of each iteration. With that we can assume that all
// instructions in the loop are executed the same number of times.		// instructions in the loop are executed the same number of times.
if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {		if (Lp->getExitingBlock() != Lp->getLoopLatch()) {
ORE->emit(createMissedAnalysis("CFGNotUnderstood")		ORE->emit(createMissedAnalysis("CFGNotUnderstood")
<< "loop control flow is not understood by vectorizer");		<< "loop control flow is not understood by vectorizer");
if (DoExtraAnalysis)		if (DoExtraAnalysis)
Result = false;		Result = false;
else		else
return false;		return false;
}		}

		return Result;
		}

		bool LoopVectorizationLegality::canVectorizeLoopNestCFG(Loop *Lp) {
		// Store the result and return it at the end instead of exiting early, in case
		// allowExtraAnalysis is used to report multiple reasons for not vectorizing.
		bool Result = true;
		bool DoExtraAnalysis = ORE->allowExtraAnalysis(DEBUG_TYPE);
		if (!canVectorizeLoopCFG(Lp)) {
		if (DoExtraAnalysis)
		Result = false;
		else
		return false;
		}

		// Recursively check whether the loop control flow of nested loops is
		// understood.
		for (Loop *SubLp : *Lp)
		if (!canVectorizeLoopNestCFG(SubLp)) {
		if (DoExtraAnalysis)
		Result = false;
		else
		return false;
		}

		return Result;
		}

		bool LoopVectorizationLegality::canVectorize() {
		// Store the result and return it at the end instead of exiting early, in case
		// allowExtraAnalysis is used to report multiple reasons for not vectorizing.
		bool Result = true;

		bool DoExtraAnalysis = ORE->allowExtraAnalysis(DEBUG_TYPE);
		// Check whether the loop-related control flow in the loop nest is expected by
		// vectorizer.
		if (!canVectorizeLoopNestCFG(TheLoop)) {
		if (DoExtraAnalysis)
		Result = false;
		else
		return false;
		}

// We need to have a loop header.		// We need to have a loop header.
DEBUG(dbgs() << "LV: Found a loop: " << TheLoop->getHeader()->getName()		DEBUG(dbgs() << "LV: Found a loop: " << TheLoop->getHeader()->getName()
<< '\n');		<< '\n');

		// Specific checks for outer loops. We skip the remaining legal checks at this
		// point because they don't support outer loops.
		if (!TheLoop->empty()) {
		assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");

		if (!canVectorizeOuterLoop()) {
		DEBUG(dbgs() << "LV: Not vectorizing: Unsupported outer loop.\n");
		// TODO: Implement DoExtraAnalysis when subsequent legal checks support
		// outer loops.
		return false;
		}

		DEBUG(dbgs() << "LV: We can vectorize this outer loop!\n");
		return Result;
		}

		assert(TheLoop->empty() && "Inner loop expected.");
// Check if we can if-convert non-single-bb loops.		// Check if we can if-convert non-single-bb loops.
unsigned NumBlocks = TheLoop->getNumBlocks();		unsigned NumBlocks = TheLoop->getNumBlocks();
if (NumBlocks != 1 && !canVectorizeWithIfConvert()) {		if (NumBlocks != 1 && !canVectorizeWithIfConvert()) {
DEBUG(dbgs() << "LV: Can't if-convert the loop.\n");		DEBUG(dbgs() << "LV: Can't if-convert the loop.\n");
if (DoExtraAnalysis)		if (DoExtraAnalysis)
Result = false;		Result = false;
else		else
return false;		return false;
Show All 40 Lines	bool LoopVectorizationLegality::canVectorize() {

// Okay! We've done all the tests. If any have failed, return false. Otherwise		// Okay! We've done all the tests. If any have failed, return false. Otherwise
// we can vectorize, and at this point we don't have any other mem analysis		// we can vectorize, and at this point we don't have any other mem analysis
// which may limit our maximum vectorization factor, so just return true with		// which may limit our maximum vectorization factor, so just return true with
// no restrictions.		// no restrictions.
return Result;		return Result;
}		}

		// Return true if the inner loop \p Lp is uniform with regard to the outer loop
		// \p OuterLp (i.e., if the outer loop is vectorized, all the vector lanes
		// executing the inner loop will execute the same iterations). This check is
		// very constrained for now but it will be relaxed in the future. \p Lp is
		// considered uniform if it meets all the following conditions:
		// 1) it has a canonical IV (starting from 0 and with stride 1),
		// 2) its latch terminator is a conditional branch and,
		// 3) its latch condition is a compare instruction whose operands are the
		// canonical IV and an OuterLp invariant.
		// This check doesn't take into account the uniformity of other conditions not
		// related to the loop latch because they don't affect the loop uniformity.
		//
		fhahn: Work done here is potentially done multiple times for each loop, right? E.g. for deep loop nests, this will be called multiple times for the same Lp, but with different outer loops. Only a few checks here depend on the outer loop and I think ideally we would not check the same things again and again. For now those redundant checks are quite simple, but I think we should keep that issue in mind once we introduce more complex checks.
		dcaballe: Good point. `OuterLp` will be fixed, at least in the short term while we only support explicit vectorization. Given that we are introducing support for divergent inner loops in the patch series #4, it's more likely that we don't need this function (or at least this function as is) before we introduce the engine to evaluate different outer loops. In any case, the proper inner loop uniformity check will depend on the outer loop we are vectorizing. Some of these "extra" checks are very specific for the patch series #1, where the supported loops are very limited. They will be progressively removed, leaving only the `OuterLp` dependent checks. For these reasons, IMO, it makes sense to keep all these checks and the documentation together. I think it's easier to understand which inner loops are currently supported. I could add a comment explaining your concerns. However, I could try to split them if you think this is not enough. Please, let me know what you think.
		fhahn: Yes, I think for now it is fine, but we should definitely keep that in mind, for future checks. Another related question came to mind: How are we going to deal with nested loops where different nests can be vectorized, e.g. say we have nested loops with 3 levels and both outer-most loops can be vectorized? If we decide to vectorize the outermost loop, wouldn't we have to skip handling the other outer loop? Or have links between the VPlans to decide which level is best to vectorize?
		dcaballe:
		> Yes, I think for now it is fine, but we should definitely keep that in mind, for future checks.
		Ok, I added a comment explaining the situation. I also tried to find a better place for these checks, at least to indicate it in the comment, but I couldn't find a good place at this point. We don't have the infrastructure to evaluate multiple outer loops of the same loop nest. Hopefully, the comment is clarifying enough.
		> If we decide to vectorize the outermost loop, wouldn't we have to skip handling the other outer loop?
		Yes, if the outermost loop is finally vectorized, the outer and the inner should be marked also as vectorized (I introduced a TODO suggesting this in an earlier version of this patch, but we decided to remove it to avoid confusion and keep the code cleaner). Or we could follow any other approach that leads to the same behavior: skipping them.
		> Or have links between the VPlans to decide which level is best to vectorize?
		The idea is to have an initial H-CFG modeling the input IR of the whole loop nest (starting from the outermost vectorizable loop) and use it as a starting point to evaluate all the candidate loops for vectorization. Does this answer your questions?
		// NOTE: We decided to keep all these checks and its associated documentation
		// together so that we can easily have a picture of the current supported loop
		// nests. However, some of the current checks don't depend on \p OuterLp and
		// would be redundantly executed for each \p Lp if we invoked this function for
		// different candidate outer loops. This is not the case for now because we
		// don't currently have the infrastructure to evaluate multiple candidate outer
		// loops and \p OuterLp will be a fixed parameter while we only support explicit
		// outer loop vectorization. It's also very likely that these checks go away
		// before introducing the aforementioned infrastructure. However, if this is not
		fhahn: I think the use of getCanonicalInductionVariable is discouraged. I think it would be better to detect induction variables using SCEV, as done in LoopVectorizationLegality.
		dcaballe (Author): Could you please elaborate a bit more? Why is it discouraged? I can't find any comments in the source code. We are trying to introduce some restrictive but simple checks. If the answer is that this interface is discouraged because it may not detect some IVs that are canonical, that would be perfectly fine. I'm also looking at `LoopVectorizationLegality::addInductionPhi`. Isn't this function doing something similar to `getCanonicalInductionVariable` to detect the primary induction but using `InductionDescriptor`?
		fhahn: Yes, for the very simple checks it works, but as you said we potentially miss IVs that we could support. I suppose we could use getCanonicalInductionVariable if it keeps things simple for now, but we definitely should relax that soonish.
		dcaballe (Author): Then it's perfectly fine. Let's do it incrementally. It's the whole purpose of the approach. We may even want to support non-canonical IVs sooner rather than later, and coming up with something more complicated for the current check might not be worth it.
		// the case, we should move the \p OuterLp independent checks to a separate
		// function that is only executed once for each \p Lp.
		static bool isUniformLoop(Loop *Lp, Loop *OuterLp) {
		assert(Lp->getLoopLatch() && "Expected loop with a single latch.");

		// If Lp is the outer loop, it's uniform by definition.
		if (Lp == OuterLp)
		return true;
		assert(OuterLp->contains(Lp) && "OuterLp must contain Lp.");

		// 1.
		PHINode *IV = Lp->getCanonicalInductionVariable();
		if (!IV) {
		DEBUG(dbgs() << "LV: Canonical IV not found.\n");
		return false;
		}

		// 2.
		BasicBlock *Latch = Lp->getLoopLatch();
		auto *LatchBr = dyn_cast<BranchInst>(Latch->getTerminator());
		rengolin: nit. I'd remove this space, as both blocks are regarding (3)
		if (!LatchBr || LatchBr->isUnconditional()) {
		DEBUG(dbgs() << "LV: Unsupported loop latch branch.\n");
		return false;
		}

		// 3.
		auto *LatchCmp = dyn_cast<CmpInst>(LatchBr->getCondition());
		if (!LatchCmp) {
		DEBUG(dbgs() << "LV: Loop latch condition is not a compare instruction.\n");
		return false;
		}

		Value *CondOp0 = LatchCmp->getOperand(0);
		Value *CondOp1 = LatchCmp->getOperand(1);
		Value *IVUpdate = IV->getIncomingValueForBlock(Latch);
		if (!(CondOp0 == IVUpdate && OuterLp->isLoopInvariant(CondOp1)) &&
		!(CondOp1 == IVUpdate && OuterLp->isLoopInvariant(CondOp0))) {
		DEBUG(dbgs() << "LV: Loop latch condition is not uniform.\n");
		return false;
		}

		return true;
		}

		// Return true if \p Lp and all its nested loops are uniform with regard to \p
		// OuterLp.
		static bool isUniformLoopNest(Loop *Lp, Loop *OuterLp) {
		if (!isUniformLoop(Lp, OuterLp))
		return false;

		// Check if nested loops are uniform.
		for (Loop *SubLp : *Lp)
		if (!isUniformLoopNest(SubLp, OuterLp))
		return false;

		return true;
		}
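
Note: the recursion above can be illustrated outside LLVM. The following is a minimal sketch, not the patch's code: `ToyLoop` and its two flags are hypothetical stand-ins for `llvm::Loop`, `getCanonicalInductionVariable()` and `OuterLp->isLoopInvariant(...)`, and the latch-branch checks are folded into a single flag.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for llvm::Loop; it only records what the check needs.
struct ToyLoop {
  bool HasCanonicalIV = true;        // models a non-null canonical IV
  bool BoundInvariantInOuter = true; // models a latch bound invariant in OuterLp
  std::vector<std::unique_ptr<ToyLoop>> SubLoops;
};

// A loop is uniform with regard to OuterLp if it is OuterLp itself, or if it
// has a canonical IV whose latch bound is invariant in OuterLp.
static bool isUniformToyLoop(const ToyLoop *Lp, const ToyLoop *OuterLp) {
  assert(Lp && OuterLp && "Expected valid loops.");
  if (Lp == OuterLp)
    return true;
  return Lp->HasCanonicalIV && Lp->BoundInvariantInOuter;
}

// Return true if Lp and all its nested loops are uniform with regard to OuterLp.
static bool isUniformToyLoopNest(const ToyLoop *Lp, const ToyLoop *OuterLp) {
  if (!isUniformToyLoop(Lp, OuterLp))
    return false;
  for (const auto &SubLp : Lp->SubLoops)
    if (!isUniformToyLoopNest(SubLp.get(), OuterLp))
      return false;
  return true;
}

// Builds a two-level nest and checks it against its own outer loop.
static bool checkNest(bool InnerBoundInvariant) {
  ToyLoop Outer;
  Outer.SubLoops.push_back(std::make_unique<ToyLoop>());
  Outer.SubLoops.back()->BoundInvariantInOuter = InnerBoundInvariant;
  return isUniformToyLoopNest(&Outer, &Outer);
}
```

In this sketch a single inner loop whose bound varies with the outer loop is enough to reject the whole nest, which mirrors the conservative behavior described in the summary.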

		bool LoopVectorizationLegality::canVectorizeOuterLoop() {
		assert(!TheLoop->empty() && "We are not vectorizing an outer loop.");
		// Store the result and return it at the end instead of exiting early, in case
		// allowExtraAnalysis is used to report multiple reasons for not vectorizing.
		bool Result = true;
		bool DoExtraAnalysis = ORE->allowExtraAnalysis(DEBUG_TYPE);

		for (BasicBlock *BB : TheLoop->blocks()) {
		// Check whether the BB terminator is a BranchInst. Any other terminator is
		// not supported yet.
		auto *Br = dyn_cast<BranchInst>(BB->getTerminator());
		if (!Br) {
		DEBUG(dbgs() << "LV: Unsupported basic block terminator.\n");
		ORE->emit(createMissedAnalysis("CFGNotUnderstood")
		<< "loop control flow is not understood by vectorizer");
		if (DoExtraAnalysis)
		Result = false;
		else
		return false;
		}

		// Check whether the BranchInst is a supported one. Only unconditional
		// branches, conditional branches with an outer loop invariant condition or
		// backedges are supported.
		if (Br && Br->isConditional() &&
		!TheLoop->isLoopInvariant(Br->getCondition()) &&
		!LI->isLoopHeader(Br->getSuccessor(0)) &&
		!LI->isLoopHeader(Br->getSuccessor(1))) {
		DEBUG(dbgs() << "LV: Unsupported conditional branch.\n");
		ORE->emit(createMissedAnalysis("CFGNotUnderstood")
		<< "loop control flow is not understood by vectorizer");
		if (DoExtraAnalysis)
		Result = false;
		else
		return false;
		}
		}

		// Check whether inner loops are uniform. At this point, we only support
		// simple outer loops scenarios with uniform nested loops.
		if (!isUniformLoopNest(TheLoop /*loop nest*/,
		TheLoop /*context outer loop*/)) {
		DEBUG(dbgs()
		<< "LV: Not vectorizing: Outer loop contains divergent loops.\n");
		ORE->emit(createMissedAnalysis("CFGNotUnderstood")
		<< "loop control flow is not understood by vectorizer");
		if (DoExtraAnalysis)
		Result = false;
		else
		return false;
		}

		return Result;
		}
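
Note: `canVectorizeOuterLoop` stores its result instead of always exiting early, so that several missed-optimization remarks can be reported when extra analysis is allowed. A minimal sketch of that pattern follows; `ToyBlock`, the remark strings and `countRemarks` are hypothetical and do not come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of a basic block: does it end in a branch, and is the
// branch condition invariant in the outer loop?
struct ToyBlock {
  bool EndsInBranch;
  bool ConditionUniform;
};

// Returns false if any block is unsupported. Without DoExtraAnalysis the
// first failure returns immediately; with it, every failure is recorded.
static bool canVectorizeToyLoop(const std::vector<ToyBlock> &Blocks,
                                bool DoExtraAnalysis,
                                std::vector<std::string> &Remarks) {
  bool Result = true;
  for (const ToyBlock &BB : Blocks) {
    if (!BB.EndsInBranch) {
      Remarks.push_back("unsupported basic block terminator");
      if (!DoExtraAnalysis)
        return false;
      Result = false;
      continue; // Nothing else to check on a block without a branch.
    }
    if (!BB.ConditionUniform) {
      Remarks.push_back("unsupported conditional branch");
      if (!DoExtraAnalysis)
        return false;
      Result = false;
    }
  }
  return Result;
}

// Counts the remarks emitted for a loop with two different problems.
static std::size_t countRemarks(bool DoExtraAnalysis) {
  std::vector<ToyBlock> Blocks = {{false, true}, {true, false}};
  std::vector<std::string> Remarks;
  bool Ok = canVectorizeToyLoop(Blocks, DoExtraAnalysis, Remarks);
  assert(!Ok && "Both blocks are unsupported.");
  (void)Ok;
  return Remarks.size();
}
```

The trade-off is the usual one: early exit is cheaper, while the accumulated result gives the user every reason at once.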

static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) {		static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) {
if (Ty->isPointerTy())		if (Ty->isPointerTy())
return DL.getIntPtrType(Ty);		return DL.getIntPtrType(Ty);

// It is possible that char's or short's overflow when we ask for the loop's		// It is possible that char's or short's overflow when we ask for the loop's
// trip count, work around this by changing the type size.		// trip count, work around this by changing the type size.
if (Ty->getScalarSizeInBits() < 32)		if (Ty->getScalarSizeInBits() < 32)
return Type::getInt32Ty(Ty->getContext());		return Type::getInt32Ty(Ty->getContext());
[2,435 lines not shown]	void LoopVectorizationCostModel::collectValuesToIgnore() {
for (auto &Induction : *Legal->getInductionVars()) {		for (auto &Induction : *Legal->getInductionVars()) {
InductionDescriptor &IndDes = Induction.second;		InductionDescriptor &IndDes = Induction.second;
const SmallVectorImpl<Instruction *> &Casts = IndDes.getCastInsts();		const SmallVectorImpl<Instruction *> &Casts = IndDes.getCastInsts();
VecValuesToIgnore.insert(Casts.begin(), Casts.end());		VecValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}
}		}

VectorizationFactor		VectorizationFactor
		LoopVectorizationPlanner::planInVPlanNativePath(bool OptForSize,
		unsigned UserVF) {
		// Width 1 means no vectorize, cost 0 means uncomputed cost.
		const VectorizationFactor NoVectorization = {1U, 0U};

		// Outer loop handling: They may require CFG and instruction level
		// transformations before even evaluating whether vectorization is profitable.
		// Since we cannot modify the incoming IR, we need to build VPlan upfront in
		// the vectorization pipeline.
		if (!OrigLoop->empty()) {
		assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");
		assert(UserVF && "Expected UserVF for outer loop vectorization.");
		assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
		DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
		buildVPlans(UserVF, UserVF);

		return {UserVF, 0};
		}

		DEBUG(dbgs() << "LV: Not vectorizing. Inner loops aren't supported in the "
		"VPlan-native path.\n");
		return NoVectorization;
		}
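
Note: at this stage the VPlan-native planner only echoes the user-provided VF for outer loops and declines everything else. A minimal sketch of that contract, assuming a hypothetical `ToyVF` struct in place of `VectorizationFactor` and a plain bit trick in place of `isPowerOf2_32`:

```cpp
#include <cassert>

// Hypothetical mirror of VectorizationFactor: width 1 means no
// vectorization, cost 0 means the cost has not been computed.
struct ToyVF {
  unsigned Width;
  unsigned Cost;
};

// True for 1, 2, 4, 8, ...; false for 0 and for non-powers of two.
static bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Outer loops get the user VF back with an uncomputed cost; inner loops are
// not handled in this path and get the "no vectorization" answer.
static ToyVF planToyNativePath(bool IsOuterLoop, unsigned UserVF) {
  const ToyVF NoVectorization = {1U, 0U};
  if (!IsOuterLoop)
    return NoVectorization;
  assert(isPowerOf2(UserVF) && "VF needs to be a power of two");
  return {UserVF, 0U};
}
```

No cost model is consulted in this sketch, in line with the patch, where the returned cost stays at 0.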

		VectorizationFactor
LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) {		LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF) {
		assert(OrigLoop->empty() && "Inner loop expected.");
// Width 1 means no vectorize, cost 0 means uncomputed cost.		// Width 1 means no vectorize, cost 0 means uncomputed cost.
const VectorizationFactor NoVectorization = {1U, 0U};		const VectorizationFactor NoVectorization = {1U, 0U};
		rengolin: I wouldn't make this an assert, just a debug message and return
		dcaballe (Author): I think this should be an assert because an outer loop shouldn't reach this point if the VPlan-native path is disabled. However, I'm going to add the debug message in the caller code. Does it sound good?
Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize);		Optional<unsigned> MaybeMaxVF = CM.computeMaxVF(OptForSize);
if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize.		if (!MaybeMaxVF.hasValue()) // Cases considered too costly to vectorize.
return NoVectorization;		return NoVectorization;

if (UserVF) {		if (UserVF) {
		rengolin: left over?
DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");		assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
// Collect the instructions (and their associated costs) that will be more		// Collect the instructions (and their associated costs) that will be more
		rengolin: is this really the place to do predication and cost modelling?
		dcaballe (Author): This function returns the VF to be used during code generation, so we would need to evaluate the cost here to return the selected VF. Cost modeling shouldn't be part of the VPlan building process. The same approach is followed in the code below. Regarding predication, it must happen before the cost evaluation. We can discuss if it should belong here or not when we introduce the actual code. If the comment is confusing, I can remove it. We decided to introduce the TODOs to give a better picture of the subsequent patches, but if this is not helpful or is annoying I can just get rid of all of them. Please let me know what you think.
		rengolin: I don't like the idea of outer-loops being in a special branch of the code, but I understand the current prototype nature of it. I believe it's still not the time to define what goes where that hasn't been implemented yet, so better to remove the TODOs for now, in case they lead us astray in the future. Same for the debug messages, etc. What you should do is shortly explain why outer-loop needs "special handling", and that can be a one/two line comment in the beginning of the block.
		dcaballe (Author):
		> I believe it's still not the time to define what goes where that hasn't been implemented yet, so better to remove the TODOs for now, in case they lead us astray in the future.
		> What you should do is shortly explain why outer-loop needs "special handling", and that can be a one/two line comment in the beginning of the block.
		Agreed. Thanks!
// profitable to scalarize.		// profitable to scalarize.
CM.selectUserVectorizationFactor(UserVF);		CM.selectUserVectorizationFactor(UserVF);
buildVPlans(UserVF, UserVF);		buildVPlans(UserVF, UserVF);
DEBUG(printPlans(dbgs()));		DEBUG(printPlans(dbgs()));
return {UserVF, 0};		return {UserVF, 0};
}		}

unsigned MaxVF = MaybeMaxVF.getValue();		unsigned MaxVF = MaybeMaxVF.getValue();
[536 lines not shown]	LoopVectorizationPlanner::createReplicateRegion(Instruction *Instr,
Pred->setOneSuccessor(Exit);		Pred->setOneSuccessor(Exit);

return Region;		return Region;
}		}

LoopVectorizationPlanner::VPlanPtr		LoopVectorizationPlanner::VPlanPtr
LoopVectorizationPlanner::buildVPlan(VFRange &Range,		LoopVectorizationPlanner::buildVPlan(VFRange &Range,
const SmallPtrSetImpl<Value *> &NeedDef) {		const SmallPtrSetImpl<Value *> &NeedDef) {
		// Outer loop handling: They may require CFG and instruction level
		// transformations before even evaluating whether vectorization is profitable.
		// Since we cannot modify the incoming IR, we need to build VPlan upfront in
		// the vectorization pipeline.
		if (!OrigLoop->empty()) {
		assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");

		// Create new empty VPlan
		auto Plan = llvm::make_unique<VPlan>();
		return Plan;
		rengolin: Isn't `containsIrreducibleCFG` depending on this to work?
		dcaballe (Author): They are different things. `containsIrreducibleCFG` is used at the very beginning of the pass to collect potential loop candidates (without irreducible CFG) to be vectorized. `VPlanHCFGBuilder` will build a CFG out of the input IR, using the VPlan infrastructure (VPBlockBases). This VPlan CFG will be modified during the vectorization process without actually modifying the CFG of the input IR. Changes in the VPlan CFG will be materialized once the most profitable VPlan is chosen. Is it clearer now? The VPlanHCFGBuilder is going to be introduced in the next patch.
		rengolin: Right, as above, don't leave commented out code hanging. Feel free to add a two-line comment in the beginning of the block explaining the expectation.
		}
		rengolin: Better not to have TODOs as debug messages.

		assert(OrigLoop->empty() && "Inner loop expected.");
EdgeMaskCache.clear();		EdgeMaskCache.clear();
BlockMaskCache.clear();		BlockMaskCache.clear();
DenseMap<Instruction *, Instruction *> &SinkAfter = Legal->getSinkAfter();		DenseMap<Instruction *, Instruction *> &SinkAfter = Legal->getSinkAfter();
DenseMap<Instruction *, Instruction *> SinkAfterInverse;		DenseMap<Instruction *, Instruction *> SinkAfterInverse;

// Collect instructions from the original loop that will become trivially dead		// Collect instructions from the original loop that will become trivially dead
// in the vectorized loop. We don't need to vectorize these instructions. For		// in the vectorized loop. We don't need to vectorize these instructions. For
// example, original induction update instructions can become dead because we		// example, original induction update instructions can become dead because we
[313 lines not shown]	void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
// Last (and currently only) operand is a mask.		// Last (and currently only) operand is a mask.
InnerLoopVectorizer::VectorParts MaskValues(State.UF);		InnerLoopVectorizer::VectorParts MaskValues(State.UF);
VPValue *Mask = User->getOperand(User->getNumOperands() - 1);		VPValue *Mask = User->getOperand(User->getNumOperands() - 1);
for (unsigned Part = 0; Part < State.UF; ++Part)		for (unsigned Part = 0; Part < State.UF; ++Part)
MaskValues[Part] = State.get(Mask, Part);		MaskValues[Part] = State.get(Mask, Part);
State.ILV->vectorizeMemoryInstruction(&Instr, &MaskValues);		State.ILV->vectorizeMemoryInstruction(&Instr, &MaskValues);
}		}

		// Process the loop in the VPlan-native vectorization path. This path builds
		// VPlan upfront in the vectorization pipeline, which allows to apply
		// VPlan-to-VPlan transformations from the very beginning without modifying the
		// input LLVM IR.
		static bool processLoopInVPlanNativePath(
		Loop *L, PredicatedScalarEvolution &PSE, LoopInfo *LI, DominatorTree *DT,
		LoopVectorizationLegality *LVL, TargetTransformInfo *TTI,
		TargetLibraryInfo *TLI, DemandedBits *DB, AssumptionCache *AC,
		OptimizationRemarkEmitter *ORE, LoopVectorizeHints &Hints) {

		assert(EnableVPlanNativePath && "VPlan-native path is disabled.");
		fhahn: I suppose we should never call processLoopInVPlanNativePath without the flag? Could this be an assertion?
		dcaballe (Author): Ok, thanks!
		Function *F = L->getHeader()->getParent();
		InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());
		dcaballe (Author): @hsaito, there was a conflict with D45072 and I had to construct IAI here and carry over DT just for it. I wonder if it would be better to make IAI optional in CM to avoid this. It would make the code reuse easier.
		hsaito: Feel free to change IAI into a pointer that can be nullptr. After all, InterleavedAccess is an optimization step that we should be able to skip. D45072 didn't get into that, but I was planning to do that while I clean up CostModel. I kept it as a reference simply because it was a reference in Legal. If you want me to do it, I can do it quick.
		hsaito: The "Optimize" phase of the vectorizer will most likely need DT anyway in the long run, and thus having to carry over DT by itself is not really a bad thing. Part of the reason is your design choice of not making processLoopInVPlanNativePath a member function of the LoopVectorizePass class. For the time being, I think this part of the code can go in as is, i.e., without making IAI optional. I'll be touching CM code soon enough anyway.
		dcaballe (Author):
		> If you want me to do it, I can do it quick.
		> I think this part of the code can go in as is, i.e., without making IAI optional. I'll be touching CM code soon enough anyway.
		I'm OK with either. If you want to quickly fix it, I can wait for that patch. I won't be committing it until next week. Otherwise, you can remove this line later when you address it.
		> Part of the reason is your design choice of not making processLoopInVPlanNativePath a member function of LoopVectorizePass class.
		I just tried to keep these changes away from LoopVectorize.h, but the bunch of parameters is certainly inconvenient. If you think it's better, I can just make it a member. Please let me know what you think.
		hsaito: I think it's best to get this checked in and then address CostModel stuff as it gets restructured for VPlan and the longer term future. I think it's okay to keep it as a static function with a lot of parameters until it is ready to take over the position of processLoop(). When most of the functionality is in place, this function would need all the analysis just like processLoop() does. So, passing DT will inevitably happen even if we don't do it now.
		LoopVectorizationCostModel CM(L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,
		&Hints, IAI);
		// Use the planner for outer loop vectorization.
		// TODO: CM is not used at this point inside the planner. Turn CM into an
		// optional argument if we don't need it in the future.
		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM);

		// Get user vectorization factor.
		unsigned UserVF = Hints.getWidth();

		// Check the function attributes to find out if this function should be
		// optimized for size.
		bool OptForSize =
		Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();

		// Plan how to best vectorize, return the best VF and its cost.
		LVP.planInVPlanNativePath(OptForSize, UserVF);

		// Returning false. We are currently not generating vector code in the VPlan
		// native path.
		return false;
		}

bool LoopVectorizePass::processLoop(Loop *L) {		bool LoopVectorizePass::processLoop(Loop *L) {
assert(L->empty() && "Only process inner loops.");		assert((EnableVPlanNativePath || L->empty()) &&
		"VPlan-native path is not enabled. Only process inner loops.");

#ifndef NDEBUG		#ifndef NDEBUG
const std::string DebugLocStr = getDebugLocString(L);		const std::string DebugLocStr = getDebugLocString(L);
#endif /* NDEBUG */		#endif /* NDEBUG */

DEBUG(dbgs() << "\nLV: Checking a loop in \""		DEBUG(dbgs() << "\nLV: Checking a loop in \""
<< L->getHeader()->getParent()->getName() << "\" from "		<< L->getHeader()->getParent()->getName() << "\" from "
<< DebugLocStr << "\n");		<< DebugLocStr << "\n");
[38 lines not shown]	if (!LVL.canVectorize()) {
return false;		return false;
}		}

// Check the function attributes to find out if this function should be		// Check the function attributes to find out if this function should be
// optimized for size.		// optimized for size.
bool OptForSize =		bool OptForSize =
Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();		Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();

		// Entrance to the VPlan-native vectorization path. Outer loops are processed
		// here. They may require CFG and instruction level transformations before
		// even evaluating whether vectorization is profitable. Since we cannot modify
		// the incoming IR, we need to build VPlan upfront in the vectorization
		// pipeline.
		if (!L->empty())
		return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,
		ORE, Hints);

		assert(L->empty() && "Inner loop expected.");
// Check the loop for a trip count threshold: vectorize loops with a tiny trip		// Check the loop for a trip count threshold: vectorize loops with a tiny trip
// count by optimizing for size, to minimize overheads.		// count by optimizing for size, to minimize overheads.
// Prefer constant trip counts over profile data, over upper bound estimate.		// Prefer constant trip counts over profile data, over upper bound estimate.
unsigned ExpectedTC = 0;		unsigned ExpectedTC = 0;
bool HasExpectedTC = false;		bool HasExpectedTC = false;
if (const SCEVConstant *ConstExits =		if (const SCEVConstant *ConstExits =
dyn_cast<SCEVConstant>(SE->getBackedgeTakenCount(L))) {		dyn_cast<SCEVConstant>(SE->getBackedgeTakenCount(L))) {
const APInt &ExitsCount = ConstExits->getAPInt();		const APInt &ExitsCount = ConstExits->getAPInt();
// We are interested in small values for ExpectedTC. Skip over those that		// We are interested in small values for ExpectedTC. Skip over those that
// can't fit an unsigned.		// can't fit an unsigned.
if (ExitsCount.ult(std::numeric_limits<unsigned>::max())) {		if (ExitsCount.ult(std::numeric_limits<unsigned>::max())) {
ExpectedTC = static_cast<unsigned>(ExitsCount.getZExtValue()) + 1;		ExpectedTC = static_cast<unsigned>(ExitsCount.getZExtValue()) + 1;
HasExpectedTC = true;		HasExpectedTC = true;
		rengolin: I'm really uncomfortable with all these temporary code blocks that don't do anything... They're really just hijacking the existing infrastructure without implementing as a VPlan. I really thought the whole point of VPlans was that you wouldn't need to hack-it-up like we used to in the old vectoriser...
		dcaballe (Author): This is the entrance to the VPlan-native vectorization path. It's not doing anything yet because we are trying to follow an incremental approach by releasing relatively small patches that are easy to digest. This code will be functional (generating vector code) soon. The code block is temporary as long as both vectorization paths co-exist, but the final goal is to converge into a single one. This approach will allow us to incrementally and easily extend all the current inner loop vectorization functionality to support outer loops and, most importantly, to do so without destabilizing inner loop vectorization. We are really concerned about the latter and we think that this approach is a reasonable trade-off between safety and temporary code blocks. If you want to discuss this further, I would recommend moving the discussion to the RFC thread so that everybody is aware of it: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html
		hsaito: I'm working on the "converge into a single one" side. At this point, I'm taking care of the ground work of moving the right things to the right places such that I don't have to include those "almost NFC" things as part of "expand VPlan's participation into innermost loop vectorization". Thank you for helping me do that with your reviews. We need to be able to build VPlan for the innermost loop vectorization right after Legal, for example, before we can remove the diverged code path at the beginning. In the meantime, the outer loop vectorization patch series will help people realize how much the innermost loop vectorization and outer loop vectorization have in common and, more importantly, help people think how to write code that can work in both ways. That's as much as I want to write about the approach we are taking within this patch review. The rest of the discussions should happen on the above mentioned RFC. Thanks.
		rengolin: Ok, as above, just remove the comments and add a two-line comment summarising it.
		fhahn: I am also slightly worried that people will come along and see this code and think that cost modelling and planning already works for outer loops, as it is used in the VPlan native path. But I think the comment makes it clear now. I am not sure if it would be clearer/nicer to have clearer separation by having the code in separate functions rather than adding even more code to those already huge functions.
		dcaballe (Author):
		> I am not sure if it would be clearer/nicer to have clearer separation by having the code in separate functions rather than adding even more code to those already huge functions.
		Agreed, I could move this code to a separate function.
		> rather than adding even more code to those already huge functions.
		Are you talking about only this function or also some other ones?
		fhahn: Mostly this function and LoopVectorizationPlanner::plan. Otherwise it is already nicely separated.
		dcaballe (Author): Ok, thanks! I was unsure whether the new 'processLoopInVPlanNativePath' should be a member function of LoopVectorizePass (same as 'processLoop') to avoid passing most of the parameters. I decided not to do it so as not to modify the public header file. Please let me know what you think.
}		}
}		}
// ExpectedTC may be large because it's bound by a variable. Check		// ExpectedTC may be large because it's bound by a variable. Check
// profiling information to validate we should vectorize.		// profiling information to validate we should vectorize.
if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {		if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {
auto EstimatedTC = getLoopEstimatedTripCount(L);		auto EstimatedTC = getLoopEstimatedTripCount(L);
if (EstimatedTC) {		if (EstimatedTC) {
ExpectedTC = *EstimatedTC;		ExpectedTC = *EstimatedTC;
[247 lines not shown]	for (auto &L : *LI)
Changed |= simplifyLoop(L, DT, LI, SE, AC, false /* PreserveLCSSA */);		Changed |= simplifyLoop(L, DT, LI, SE, AC, false /* PreserveLCSSA */);

// Build up a worklist of inner-loops to vectorize. This is necessary as		// Build up a worklist of inner-loops to vectorize. This is necessary as
// the act of vectorizing or partially unrolling a loop creates new loops		// the act of vectorizing or partially unrolling a loop creates new loops
// and can invalidate iterators across the loops.		// and can invalidate iterators across the loops.
SmallVector<Loop *, 8> Worklist;		SmallVector<Loop *, 8> Worklist;

for (Loop *L : *LI)		for (Loop *L : *LI)
addAcyclicInnerLoop(*L, *LI, Worklist);		collectSupportedLoops(*L, LI, ORE, Worklist);

LoopsAnalyzed += Worklist.size();		LoopsAnalyzed += Worklist.size();

// Now walk the identified inner loops.		// Now walk the identified inner loops.
while (!Worklist.empty()) {		while (!Worklist.empty()) {
Loop *L = Worklist.pop_back_val();		Loop *L = Worklist.pop_back_val();

// For the inner loops we actually process, form LCSSA to simplify the		// For the inner loops we actually process, form LCSSA to simplify the
[40 lines not shown]

test/Transforms/LoopVectorize/explicit_outer_detection.ll

This file was added.
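
Note: for readers who want to reproduce the input, the loop nest that this test's header comment describes can be written as the compilable sketch below. The `vectorize_width(4)` clause is an assumption inferred from the `width=4` hint checked in Case 1; `squareAll` is a hypothetical driver that is not part of the test.

```cpp
#include <cassert>
#include <vector>

// Squares an N x M row-major matrix b into a. The pragma marks the outer
// loop for explicit vectorization with a user-provided width.
void foo(int *a, int *b, int N, int M) {
  assert(N >= 0 && M >= 0 && "Expected non-negative bounds.");
#pragma clang loop vectorize(enable) vectorize_width(4)
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
      a[i * M + j] = b[i * M + j] * b[i * M + j];
    }
  }
}

// Hypothetical driver: fills b with 0..N*M-1 and returns the squared values.
std::vector<int> squareAll(int N, int M) {
  std::vector<int> B(N * M), A(N * M, 0);
  for (int k = 0; k < N * M; ++k)
    B[k] = k;
  foo(A.data(), B.data(), N, M);
  return A;
}
```

Compilers other than clang may warn about the unknown pragma and ignore it; the scalar result is the same either way.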

				; RUN: opt < %s -loop-vectorize -enable-vplan-native-path -debug-only=loop-vectorize -S 2>&1 | FileCheck %s
				; REQUIRES: asserts

				; Verify that outer loops annotated only with the expected explicit
				; vectorization hints are collected for vectorization instead of inner loops.

				; Root C/C++ source code for all the test cases
				; void foo(int *a, int *b, int N, int M)
				; {
				; int i, j;
				; #pragma clang loop vectorize(enable)
				; for (i = 0; i < N; i++) {
				; for (j = 0; j < M; j++) {
				; a[i*M+j] = b[i*M+j] * b[i*M+j];
				; }
				; }
				; }

				; Case 1: Annotated outer loop WITH vector width information must be collected.

				; CHECK-LABEL: vector_width
				; CHECK: LV: Loop hints: force=enabled width=4 unroll=0
				; CHECK: LV: We can vectorize this outer loop!
				; CHECK: LV: Using user VF 4.
				; CHECK-NOT: LV: Loop hints: force=?
				; CHECK-NOT: LV: Found a loop: inner.body

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @vector_width(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp32 = icmp sgt i32 %N, 0
				br i1 %cmp32, label %outer.ph, label %for.end15

				outer.ph: ; preds = %entry
				%cmp230 = icmp sgt i32 %M, 0
				%0 = sext i32 %M to i64
				%wide.trip.count = zext i32 %M to i64
				%wide.trip.count38 = zext i32 %N to i64
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv35 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next36, %outer.inc ]
				br i1 %cmp230, label %inner.ph, label %outer.inc

				inner.ph: ; preds = %outer.body
				%1 = mul nsw i64 %indvars.iv35, %0
				br label %inner.body

				inner.body: ; preds = %inner.body, %inner.ph
				%indvars.iv = phi i64 [ 0, %inner.ph ], [ %indvars.iv.next, %inner.body ]
				%2 = add nsw i64 %indvars.iv, %1
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %2
				%3 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul8 = mul nsw i32 %3, %3
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 %2
				store i32 %mul8, i32* %arrayidx12, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next36 = add nuw nsw i64 %indvars.iv35, 1
				%exitcond39 = icmp eq i64 %indvars.iv.next36, %wide.trip.count38
				br i1 %exitcond39, label %for.end15, label %outer.body, !llvm.loop !6

				for.end15: ; preds = %outer.inc, %entry
				ret void
				}

				; Case 2: Annotated outer loop WITHOUT vector width information must not be
				; collected.

				; CHECK-LABEL: case2
				; CHECK-NOT: LV: Loop hints: force=enabled
				; CHECK-NOT: LV: We can vectorize this outer loop!
				; CHECK: LV: Loop hints: force=?
				; CHECK: LV: Found a loop: inner.body

				define void @case2(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp32 = icmp sgt i32 %N, 0
				br i1 %cmp32, label %outer.ph, label %for.end15

				outer.ph: ; preds = %entry
				%cmp230 = icmp sgt i32 %M, 0
				%0 = sext i32 %M to i64
				%wide.trip.count = zext i32 %M to i64
				%wide.trip.count38 = zext i32 %N to i64
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv35 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next36, %outer.inc ]
				br i1 %cmp230, label %inner.ph, label %outer.inc

				inner.ph: ; preds = %outer.body
				%1 = mul nsw i64 %indvars.iv35, %0
				br label %inner.body

				inner.body: ; preds = %inner.body, %inner.ph
				%indvars.iv = phi i64 [ 0, %inner.ph ], [ %indvars.iv.next, %inner.body ]
				%2 = add nsw i64 %indvars.iv, %1
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %2
				%3 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul8 = mul nsw i32 %3, %3
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 %2
				store i32 %mul8, i32* %arrayidx12, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next36 = add nuw nsw i64 %indvars.iv35, 1
				%exitcond39 = icmp eq i64 %indvars.iv.next36, %wide.trip.count38
				br i1 %exitcond39, label %for.end15, label %outer.body, !llvm.loop !9

				for.end15: ; preds = %outer.inc, %entry
				ret void
				}

				; Case 3: Annotated outer loop WITH vector width and interleave information
				; must not be collected.

				; CHECK-LABEL: case3
				; CHECK-NOT: LV: Loop hints: force=enabled
				; CHECK-NOT: LV: We can vectorize this outer loop!
				; CHECK: LV: Loop hints: force=?
				; CHECK: LV: Found a loop: inner.body

				define void @case3(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp32 = icmp sgt i32 %N, 0
				br i1 %cmp32, label %outer.ph, label %for.end15

				outer.ph: ; preds = %entry
				%cmp230 = icmp sgt i32 %M, 0
				%0 = sext i32 %M to i64
				%wide.trip.count = zext i32 %M to i64
				%wide.trip.count38 = zext i32 %N to i64
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv35 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next36, %outer.inc ]
				br i1 %cmp230, label %inner.ph, label %outer.inc

				inner.ph: ; preds = %outer.body
				%1 = mul nsw i64 %indvars.iv35, %0
				br label %inner.body

				inner.body: ; preds = %inner.body, %inner.ph
				%indvars.iv = phi i64 [ 0, %inner.ph ], [ %indvars.iv.next, %inner.body ]
				%2 = add nsw i64 %indvars.iv, %1
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %2
				%3 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul8 = mul nsw i32 %3, %3
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 %2
				store i32 %mul8, i32* %arrayidx12, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next36 = add nuw nsw i64 %indvars.iv35, 1
				%exitcond39 = icmp eq i64 %indvars.iv.next36, %wide.trip.count38
				br i1 %exitcond39, label %for.end15, label %outer.body, !llvm.loop !11

				for.end15: ; preds = %outer.inc, %entry
				ret void
				}

				; Case 4: Outer loop without any explicit vectorization annotation must not
				; be collected.

				; CHECK-LABEL: case4
				; CHECK-NOT: LV: Loop hints: force=enabled
				; CHECK-NOT: LV: We can vectorize this outer loop!
				; CHECK: LV: Loop hints: force=?
				; CHECK: LV: Found a loop: inner.body

				define void @case4(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp32 = icmp sgt i32 %N, 0
				br i1 %cmp32, label %outer.ph, label %for.end15

				outer.ph: ; preds = %entry
				%cmp230 = icmp sgt i32 %M, 0
				%0 = sext i32 %M to i64
				%wide.trip.count = zext i32 %M to i64
				%wide.trip.count38 = zext i32 %N to i64
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv35 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next36, %outer.inc ]
				br i1 %cmp230, label %inner.ph, label %outer.inc

				inner.ph: ; preds = %outer.body
				%1 = mul nsw i64 %indvars.iv35, %0
				br label %inner.body

				inner.body: ; preds = %inner.body, %inner.ph
				%indvars.iv = phi i64 [ 0, %inner.ph ], [ %indvars.iv.next, %inner.body ]
				%2 = add nsw i64 %indvars.iv, %1
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %2
				%3 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul8 = mul nsw i32 %3, %3
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 %2
				store i32 %mul8, i32* %arrayidx12, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next36 = add nuw nsw i64 %indvars.iv35, 1
				%exitcond39 = icmp eq i64 %indvars.iv.next36, %wide.trip.count38
				br i1 %exitcond39, label %for.end15, label %outer.body

				for.end15: ; preds = %outer.inc, %entry
				ret void
				}

				!llvm.module.flags = !{!0}
				!llvm.ident = !{!1}

				fhahn (inline comment): attributes not needed here and in the tests below, as no cost modelling is done so far.
				!0 = !{i32 1, !"wchar_size", i32 4}
				!1 = !{!"clang version 6.0.0"}
				!2 = !{!3, !3, i64 0}
				!3 = !{!"int", !4, i64 0}
				!4 = !{!"omnipotent char", !5, i64 0}
				!5 = !{!"Simple C/C++ TBAA"}
				; Case 1
				!6 = distinct !{!6, !7, !8}
				!7 = !{!"llvm.loop.vectorize.width", i32 4}
				!8 = !{!"llvm.loop.vectorize.enable", i1 true}
				; Case 2
				!9 = distinct !{!9, !8}
				; Case 3
				!10 = !{!"llvm.loop.interleave.count", i32 2}
				!11 = distinct !{!11, !7, !10, !8}
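The loop metadata closing this test (!6 through !11) encodes hints that are normally written at source level. As a hedged sketch, the Case 1 and Case 3 annotations might look as follows in C++ using Clang's `#pragma clang loop` syntax. The function names are illustrative, and the exact source used to generate the IR is not part of the patch beyond the root comment at the top of the test.

```cpp
// Case 1 shape: vectorization forced with an explicit width of 4, which
// corresponds to llvm.loop.vectorize.enable plus llvm.loop.vectorize.width.
void squareRowsWidth4(int *a, const int *b, int N, int M) {
#pragma clang loop vectorize(enable) vectorize_width(4)
  for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
      a[i * M + j] = b[i * M + j] * b[i * M + j];
}

// Case 3 shape: same as Case 1 plus an interleave count of 2, which
// corresponds to the additional llvm.loop.interleave.count metadata.
void squareRowsWidth4Interleave2(int *a, const int *b, int N, int M) {
#pragma clang loop vectorize(enable) vectorize_width(4) interleave_count(2)
  for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
      a[i * M + j] = b[i * M + j] * b[i * M + j];
}
```

Both functions compute the same result; the pragmas only change which hints are attached to the outer loop. Compilers other than Clang are expected to ignore the pragma, possibly with a warning.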

test/Transforms/LoopVectorize/explicit_outer_nonuniform_inner.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -enable-vplan-native-path -pass-remarks-analysis=loop-vectorize -debug-only=loop-vectorize -S 2>&1 | FileCheck %s
				; REQUIRES: asserts

				; Verify that LV bails out on explicit vectorization outer loops that contain
				; divergent inner loops.

				; Root C/C++ source code for all the test cases
				; void foo(int *a, int *b, int N, int M)
				; {
				; int i, j;
				; #pragma clang loop vectorize(enable) vectorize_width(8)
				; for (i = 0; i < N; i++) {
				; // Tested inner loop. It will be replaced per test.
				; for (j = 0; j < M; j++) {
				; a[i*M+j] = b[i*M+j] * b[i*M+j];
				; }
				; }
				; }

				; Case 1 (for (j = i; j < M; j++)): Inner loop with divergent IV start.

				; CHECK-LABEL: iv_start
				; CHECK: LV: Not vectorizing: Outer loop contains divergent loops.
				; CHECK: LV: Not vectorizing: Unsupported outer loop.

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @iv_start(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp33 = icmp sgt i32 %N, 0
				br i1 %cmp33, label %outer.ph, label %for.end15

				outer.ph: ; preds = %entry
				%0 = sext i32 %M to i64
				%wide.trip.count = zext i32 %M to i64
				%wide.trip.count41 = zext i32 %N to i64
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv38 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next39, %outer.inc ]
				%cmp231 = icmp slt i64 %indvars.iv38, %0
				br i1 %cmp231, label %inner.ph, label %outer.inc

				inner.ph: ; preds = %outer.body
				%1 = mul nsw i64 %indvars.iv38, %0
				br label %inner.body

				inner.body: ; preds = %inner.body, %inner.ph
				%indvars.iv35 = phi i64 [ %indvars.iv38, %inner.ph ], [ %indvars.iv.next36, %inner.body ]
				%2 = add nsw i64 %indvars.iv35, %1
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %2
				%3 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul8 = mul nsw i32 %3, %3
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 %2
				store i32 %mul8, i32* %arrayidx12, align 4, !tbaa !2
				%indvars.iv.next36 = add nuw nsw i64 %indvars.iv35, 1
				%exitcond = icmp eq i64 %indvars.iv.next36, %wide.trip.count
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next39 = add nuw nsw i64 %indvars.iv38, 1
				%exitcond42 = icmp eq i64 %indvars.iv.next39, %wide.trip.count41
				br i1 %exitcond42, label %for.end15, label %outer.body, !llvm.loop !6

				for.end15: ; preds = %outer.inc, %entry
				ret void
				}


				; Case 2 (for (j = 0; j < i; j++)): Inner loop with divergent upper-bound.

				; CHECK-LABEL: loop_ub
				; CHECK: LV: Not vectorizing: Outer loop contains divergent loops.
				; CHECK: LV: Not vectorizing: Unsupported outer loop.

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @loop_ub(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp32 = icmp sgt i32 %N, 0
				br i1 %cmp32, label %outer.ph, label %for.end15

				outer.ph: ; preds = %entry
				%0 = sext i32 %M to i64
				%wide.trip.count41 = zext i32 %N to i64
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv38 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next39, %outer.inc ]
				%cmp230 = icmp eq i64 %indvars.iv38, 0
				br i1 %cmp230, label %outer.inc, label %inner.ph

				inner.ph: ; preds = %outer.body
				%1 = mul nsw i64 %indvars.iv38, %0
				br label %inner.body

				inner.body: ; preds = %inner.body, %inner.ph
				%indvars.iv = phi i64 [ 0, %inner.ph ], [ %indvars.iv.next, %inner.body ]
				%2 = add nsw i64 %indvars.iv, %1
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %2
				%3 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul8 = mul nsw i32 %3, %3
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 %2
				store i32 %mul8, i32* %arrayidx12, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %indvars.iv38
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next39 = add nuw nsw i64 %indvars.iv38, 1
				%exitcond42 = icmp eq i64 %indvars.iv.next39, %wide.trip.count41
				br i1 %exitcond42, label %for.end15, label %outer.body, !llvm.loop !6

				for.end15: ; preds = %outer.inc, %entry
				ret void
				}

				; Case 3 (for (j = 0; j < M; j+=i)): Inner loop with divergent step.

				; CHECK-LABEL: iv_step
				; CHECK: LV: Not vectorizing: Outer loop contains divergent loops.
				; CHECK: LV: Not vectorizing: Unsupported outer loop.

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @iv_step(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp33 = icmp sgt i32 %N, 0
				br i1 %cmp33, label %outer.ph, label %for.end15

				outer.ph: ; preds = %entry
				%cmp231 = icmp sgt i32 %M, 0
				%0 = sext i32 %M to i64
				%wide.trip.count = zext i32 %N to i64
				br label %outer.body

				outer.body: ; preds = %for.inc14, %outer.ph
				%indvars.iv39 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next40, %for.inc14 ]
				br i1 %cmp231, label %inner.ph, label %for.inc14

				inner.ph: ; preds = %outer.body
				%1 = mul nsw i64 %indvars.iv39, %0
				br label %inner.body

				inner.body: ; preds = %inner.ph, %inner.body
				%indvars.iv36 = phi i64 [ 0, %inner.ph ], [ %indvars.iv.next37, %inner.body ]
				%2 = add nsw i64 %indvars.iv36, %1
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %2
				%3 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%mul8 = mul nsw i32 %3, %3
				%arrayidx12 = getelementptr inbounds i32, i32* %a, i64 %2
				store i32 %mul8, i32* %arrayidx12, align 4, !tbaa !2
				%indvars.iv.next37 = add nuw nsw i64 %indvars.iv36, %indvars.iv39
				%cmp2 = icmp slt i64 %indvars.iv.next37, %0
				br i1 %cmp2, label %inner.body, label %for.inc14

				for.inc14: ; preds = %inner.body, %outer.body
				%indvars.iv.next40 = add nuw nsw i64 %indvars.iv39, 1
				%exitcond = icmp eq i64 %indvars.iv.next40, %wide.trip.count
				br i1 %exitcond, label %for.end15, label %outer.body, !llvm.loop !6

				for.end15: ; preds = %for.inc14, %entry
				ret void
				}

				!llvm.module.flags = !{!0}
				!llvm.ident = !{!1}

				!0 = !{i32 1, !"wchar_size", i32 4}
				!1 = !{!"clang version 6.0.0"}
				!2 = !{!3, !3, i64 0}
				!3 = !{!"int", !4, i64 0}
				!4 = !{!"omnipotent char", !5, i64 0}
				!5 = !{!"Simple C/C++ TBAA"}
				!6 = distinct !{!6, !7, !8}
				!7 = !{!"llvm.loop.vectorize.width", i32 8}
				!8 = !{!"llvm.loop.vectorize.enable", i1 true}
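The three cases in this test differ only in which inner-loop bound depends on the outer induction variable i. One way to see why such loops are called divergent is to compare the inner trip counts of the outer iterations that would share one vector. The sketch below is purely illustrative: the patch's legality check is static and conservative (it relies on loop invariance), whereas this code evaluates trip counts at run time. InnerBounds, tripCount and uniformAcrossLanes are hypothetical names.

```cpp
#include <cstdint>
#include <functional>

// Bounds of `for (j = Start; j < UpperBound; j += Step)`.
struct InnerBounds {
  int64_t Start, UpperBound, Step;
};

// Number of iterations of the inner loop; assumes Step > 0.
int64_t tripCount(const InnerBounds &B) {
  if (B.Start >= B.UpperBound)
    return 0;
  return (B.UpperBound - B.Start + B.Step - 1) / B.Step;
}

// Returns true when the VF consecutive outer iterations starting at FirstI
// all execute the inner loop the same number of times.
bool uniformAcrossLanes(const std::function<InnerBounds(int64_t)> &BoundsFor,
                        int64_t FirstI, int VF) {
  const int64_t Ref = tripCount(BoundsFor(FirstI));
  for (int Lane = 1; Lane < VF; ++Lane)
    if (tripCount(BoundsFor(FirstI + Lane)) != Ref)
      return false;
  return true;
}
```

With M fixed, the bounds {0, M, 1} are uniform across lanes, while {i, M, 1} (Case 1), {0, i, 1} (Case 2) and {0, M, i} (Case 3) give different trip counts for neighbouring values of i.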

test/Transforms/LoopVectorize/explicit_outer_uniform_diverg_branch.ll

This file was added.

				; RUN: opt < %s -loop-vectorize -enable-vplan-native-path -debug-only=loop-vectorize -S 2>&1 | FileCheck %s
				; REQUIRES: asserts

				; Verify that LV can handle explicit vectorization outer loops with uniform branches
				; but bails out on outer loops with divergent branches.

				; Root C/C++ source code for the test cases
				; void foo(int *a, int *b, int N, int M)
				; {
				; int i, j;
				; #pragma clang loop vectorize(enable) vectorize_width(8)
				; for (i = 0; i < N; i++) {
				; // Tested conditional branch. COND will be replaced per test.
				; if (COND)
				; for (j = 0; j < M; j++) {
				; a[i*M+j] = b[i*M+j] * b[i*M+j];
				; }
				; }
				; }

				; Case 1 (COND => M == N): Outer loop with uniform conditional branch.

				; CHECK-LABEL: uniform_branch
				; CHECK: LV: We can vectorize this outer loop!

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @uniform_branch(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp39 = icmp sgt i32 %N, 0
				br i1 %cmp39, label %outer.ph, label %for.end19

				outer.ph: ; preds = %entry
				%cmp337 = icmp slt i32 %M, 1
				%0 = sext i32 %M to i64
				%N64 = zext i32 %N to i64
				%M64 = zext i32 %M to i64
				%cmp1 = icmp ne i32 %M, %N ; Uniform condition
				%brmerge = or i1 %cmp1, %cmp337 ; Uniform condition
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv42 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next43, %outer.inc ]
				%1 = mul nsw i64 %indvars.iv42, %0
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %1
				%2 = load i32, i32* %arrayidx, align 4, !tbaa !2
				br i1 %brmerge, label %outer.inc, label %inner.ph ; Supported uniform branch

				inner.ph: ; preds = %outer.body
				br label %inner.body

				inner.body: ; preds = %inner.ph, %inner.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %inner.body ], [ 0, %inner.ph ]
				%3 = add nsw i64 %indvars.iv, %1
				%arrayidx7 = getelementptr inbounds i32, i32* %b, i64 %3
				%4 = load i32, i32* %arrayidx7, align 4, !tbaa !2
				%mul12 = mul nsw i32 %4, %4
				%arrayidx16 = getelementptr inbounds i32, i32* %a, i64 %3
				store i32 %mul12, i32* %arrayidx16, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %M64
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next43 = add nuw nsw i64 %indvars.iv42, 1
				%exitcond46 = icmp eq i64 %indvars.iv.next43, %N64
				br i1 %exitcond46, label %for.end19, label %outer.body, !llvm.loop !6

				for.end19: ; preds = %outer.inc, %entry
				ret void
				}


				; Case 2 (COND => B[i * M] == 0): Outer loop with divergent conditional branch.

				; CHECK-LABEL: divergent_branch
				; CHECK: Unsupported conditional branch.
				; CHECK: LV: Not vectorizing: Unsupported outer loop.

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				define void @divergent_branch(i32* nocapture %a, i32* nocapture readonly %b, i32 %N, i32 %M) local_unnamed_addr {
				entry:
				%cmp39 = icmp sgt i32 %N, 0
				br i1 %cmp39, label %outer.ph, label %for.end19

				outer.ph: ; preds = %entry
				%cmp337 = icmp slt i32 %M, 1
				%0 = sext i32 %M to i64
				%N64 = zext i32 %N to i64
				%M64 = zext i32 %M to i64
				br label %outer.body

				outer.body: ; preds = %outer.inc, %outer.ph
				%indvars.iv42 = phi i64 [ 0, %outer.ph ], [ %indvars.iv.next43, %outer.inc ]
				%1 = mul nsw i64 %indvars.iv42, %0
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %1
				%2 = load i32, i32* %arrayidx, align 4, !tbaa !2
				%cmp1 = icmp ne i32 %2, 0 ; Divergent condition
				%brmerge = or i1 %cmp1, %cmp337 ; Divergent condition
				br i1 %brmerge, label %outer.inc, label %inner.ph ; Unsupported divergent branch.

				inner.ph: ; preds = %outer.body
				br label %inner.body

				inner.body: ; preds = %inner.ph, %inner.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %inner.body ], [ 0, %inner.ph ]
				%3 = add nsw i64 %indvars.iv, %1
				%arrayidx7 = getelementptr inbounds i32, i32* %b, i64 %3
				%4 = load i32, i32* %arrayidx7, align 4, !tbaa !2
				%mul12 = mul nsw i32 %4, %4
				%arrayidx16 = getelementptr inbounds i32, i32* %a, i64 %3
				store i32 %mul12, i32* %arrayidx16, align 4, !tbaa !2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %M64
				br i1 %exitcond, label %outer.inc, label %inner.body

				outer.inc: ; preds = %inner.body, %outer.body
				%indvars.iv.next43 = add nuw nsw i64 %indvars.iv42, 1
				%exitcond46 = icmp eq i64 %indvars.iv.next43, %N64
				br i1 %exitcond46, label %for.end19, label %outer.body, !llvm.loop !6

				for.end19: ; preds = %outer.inc, %entry
				ret void
				}

				!llvm.module.flags = !{!0}
				!llvm.ident = !{!1}

				!0 = !{i32 1, !"wchar_size", i32 4}
				!1 = !{!"clang version 6.0.0"}
				!2 = !{!3, !3, i64 0}
				!3 = !{!"int", !4, i64 0}
				!4 = !{!"omnipotent char", !5, i64 0}
				!5 = !{!"Simple C/C++ TBAA"}
				!6 = distinct !{!6, !7, !8}
				!7 = !{!"llvm.loop.vectorize.width", i32 8}
				!8 = !{!"llvm.loop.vectorize.enable", i1 true}
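For reference, the two COND variants exercised by this test can be written out at source level. This is a sketch based on the root source comment and the case descriptions; the function names mirror the IR functions, but the code is not taken from the patch. In the first function the condition is loop-invariant, so every outer iteration takes the same path; in the second it depends on data loaded per outer iteration, so neighbouring iterations may disagree.

```cpp
// Case 1 shape: M == N does not change across outer iterations (uniform).
void uniformBranch(int *a, const int *b, int N, int M) {
#pragma clang loop vectorize(enable) vectorize_width(8)
  for (int i = 0; i < N; i++)
    if (M == N)
      for (int j = 0; j < M; j++)
        a[i * M + j] = b[i * M + j] * b[i * M + j];
}

// Case 2 shape: b[i * M] == 0 may differ between outer iterations (divergent).
void divergentBranch(int *a, const int *b, int N, int M) {
#pragma clang loop vectorize(enable) vectorize_width(8)
  for (int i = 0; i < N; i++)
    if (b[i * M] == 0)
      for (int j = 0; j < M; j++)
        a[i * M + j] = b[i * M + j] * b[i * M + j];
}
```

Per the CHECK lines of the test, the first shape is reported as vectorizable in the VPlan-native path, while the second is rejected with "Unsupported conditional branch."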


Diff 143148
