This is an archive of the discontinued LLVM Phabricator instance.

Differential D28975

[LV] Introducing VPlan to model the vectorized code and drive its transformation
Abandoned, Public

Authored by Ayal on Jan 20 2017, 5:17 PM.

Details

Reviewers

ddibyend
rengolin
anemet
mzolotukhin
jmolloy
mkuper
mssimpso
hfinkel

Summary

This patch follows our RFC[1] and presentation at the Dev Meeting[2]. Namely, it starts to address the proposal stated there:

Proposal: introduce the Vectorization Plan as an explicit model of a vectorization candidate and update the overall flow

according to the first step expressed:

The first patches we're working on are designed to have the innermost Loop Vectorizer explicitly model the control flow of its vectorized loop.

This implementation is designed to show key aspects of the VPlan model, demonstrating how it can capture precisely *all* vectorization decisions taken inside a to-be vectorized loop by the current Loop Vectorizer, and carry them out. It is therefore practically an NFC patch, with slight disclaimers listed below. The VPlan model implemented strives to be compact, addressing compile-time concerns. More technical details are documented in the rst file attached. The patch can be broken down into several hunks for incremental landing; a tentative break-down list is provided below.

Thanks to the Intel vectorization team for this joint effort,
Gil and Ayal.

Deviation from current functionality:

Debug printout of “LV: Scalarizing [and predicating]: <inst>” – VPlan carries out these decisions before Cost-Model’s printouts, unlike current behavior.
Placement of extracts moved to basic-block of users: at start of this basic-block vs. before first user; the distinct orders should be insignificant, subject to scheduling.
Redundant basic-blocks/phi’s – these are insignificant, subject to subsequent clean-up.

Tentative break-down; some tasks refactor or fix current LV, some introduce parts of VPlan:

refactor Cost-Model to provide MaxVF and early-exit methods.
refactor ILV to provide vectorizeInstruction, getScalarValue, getVectorValue, widenIntInduction, buildScalarSteps, PHIsToFix/fixCrossIterationPHIs, and possibly additional methods.
fix Unroller’s getScalarValue() to reuse ILV’s refactored getScalarValue(Part, Lane) which also sets metadata. Will simplify this patch.
unify the GEP reuse behavior between a vectorized wide load/store, and the wide load/store of an interleave group. Will simplify this patch.
have LV avoid creating redundant basic-blocks. Will help this patch be fully NFC.
have LV cache basic-block masks and reuse them. Will help this patch be fully NFC.
build initial VPlans and print them for debugging
convert ILV.vectorize to use LVP.executeBestPlan, keeping sinkScalarOperands() as a non-VPlan post-processing method.
optimize VPlans by introducing sinkScalarOperands() and print them for debugging
use VPlan’s sinkScalarOperands() instead of the non-VPlan version

[1] RFC
[2] Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer loop auto-vectorization, 2016 LLVM Developers' Meeting

Event Timeline

Ayal created this revision. Jan 20 2017, 5:17 PM

Herald added subscribers: mgorny, sanjoy. Jan 20 2017, 5:17 PM

Ayal retitled this revision from "LV: Introducing VPlan to model the vectorized code and drive its transformation" to "[LV] Introducing VPlan to model the vectorized code and drive its transformation". Jan 21 2017, 10:13 PM

delena added a subscriber: delena. Jan 21 2017, 11:24 PM

sodeh added a subscriber: sodeh. Jan 22 2017, 12:48 AM

magabari added a subscriber: magabari. Jan 22 2017, 4:03 AM

simoll added a subscriber: simoll. Jan 23 2017, 12:28 AM

ashutosh.nema added a subscriber: ashutosh.nema. Jan 23 2017, 1:38 AM

dcaballe added a subscriber: dcaballe. Jan 23 2017, 9:02 AM

rogfer01 added a subscriber: rogfer01. Jan 23 2017, 9:09 AM

Attaching the rst in pdf form for convenience.

VectorizationPlan.pdf (528 KB)

Hi Ayal,

I just started looking into this, but I can't see the PDF you attached, nor the images. :)

Can you put this somewhere public and share the link?

cheers,
--renato

In D28975#656385, @rengolin wrote:

Can you put this somewhere public and share the link?

Ignore me, got it through some convoluted Phabricator loops. :)

Sorry for the inconvenience; the png's were easy to upload as part of the patch, but seem hard to download from Phabricator. Attaching them here instead.

Thanks, Ayal, Gil!

Two meta-comments about the review:

I suggest we use this review for higher-level points and issues with the design. If there are no significant objections, we can then review the code piece-by-piece, split into the chunks the review comment suggests. This review can then be used as a reference to how those hunks fit together, if the purpose of some of them is unclear, as a stand-alone patch. Does this make sense to everyone?
I think this is major enough to give a heads-up to llvm-dev, and give more people opportunity to take a look at the design.

Having said that, I think this is generally the right way forward. The way legal/cost/transformation interact in the current LV wasn't as much designed, as evolved, and not necessarily in the right direction...

Some preliminary comments from a first pass inline. I still need to understand how the recipes are going to work.

docs/VectorizationPlan.rst
11	As @chandlerc pointed out to me, this isn't actually NFC. It's expected not to change the output of the vectorizer, but it's a huge change in how the vectorizer functions.
28	By this do you mean that making a cost decision will "record" the transformation we expect to make?
37	This sounds reasonable to me. I assume this is in sync with what Elena is doing in, say, D27919, right?
61	How is "can construct, optimize and discard one or more VPlans" different from just "generates VPlans"? (I'm trying to understand if this adds information I'm missing. :-) )
112	contraints -> constraints. Also, I understand what constraints are in this context (e.g. dependence distance), but what do you mean by artifacts?
132	Could you give a more concrete example for what optimizing an LV VPlan would mean? I haven't read the TSLP paper (although I should), but skimming it, it looks like it's basically constructing a lot of subtrees (each of which could be considered, in our context, its own VPlan) and choosing the best among those. So this isn't exactly "optimizing" a plan, it's just recursively generating plans and choosing the cheapest one. Or should I just go and read the paper properly?
141	I don't believe anything in this design raises the importance of explicit outer loop vectorization. I think what you're trying to say is that if we want to support explicit outer loop vectorization (which I believe we do), the best way to do it would be using this design. Am I right?
252	If this happens, it would be fantastic. How would this design help support this?
272	So, given the text below, by "a given IR code" you mean "a given IR SESE region", right?
282	Each Region is still required to be SESE, right?
288	How can they originate from more than one basic block? Can you be more explicit about the relationship between IR basic blocks and VP basic blocks?
290	Something here is garbled.
296	What do you mean by supporting disjoint Regions?
388	What kind of information?
486	The long-term plan is to have LVP produce a VPlan that contains both UF and VF, and this is just a transitional state, right?

In D28975#658218, @mkuper wrote:

I suggest we use this review for higher-level points and issues with the design. If there are no significant objections, we can then review the code piece-by-piece, split into the chunks the review comment suggests. This review can then be used as a reference to how those hunks fit together, if the purpose of some of them is unclear, as a stand-alone patch. Does this make sense to everyone?

Yup.

I think this is major enough to give a heads-up to llvm-dev, and give more people opportunity to take a look at the design.

I think both can happen in parallel. If we treat this as just an RFC, we can make bold comments about the code without the fear that a merge will be imminent, and discuss the more moderate design plans on the list.

I'd also suggest that the real merge request be done on a new revision, so we don't get lost in weeks of discussion when finally reviewing the actual patch(es).

Having said that, I think this is generally the right way forward. The way legal/cost/transformation interact in the current LV wasn't as much designed, as evolved, and not necessarily in the right direction...

Ditto.

--renato

rengolin added inline comments. Jan 27 2017, 4:01 AM

docs/VectorizationPlan.rst
141	From the original RFC (linked in the description), the idea is that outer loops will add a lot more complexity to the vectoriser, which is already getting side-tracked by the multiple strategies. This effort is the first step in joining the strategies, which will be the sanest way forward to adding more functionality to the loop vectoriser, including outer loops and better interactions with Polly.

rengolin added inline comments. Jan 27 2017, 4:01 AM

docs/VectorizationPlan.rst
241	I believe this is not possible, but that's ok. What I think you mean is that we can start with a single VPlan that does exactly what the current cost model does, in which case, it will have no noticeable impact to the users (this is not the same as NFC, as Michael said). But just doing that is just replacing six of one by half-a-dozen of the other. This is only worth doing if we can add more (different) VPlans, in which case, there will be a noticeable impact in higher optimisation levels. But again, that's ok. We currently change the behaviour of the vectoriser depending on the -O flag and we can continue doing it. -O2 means one VPlan, -O3 means more. We could even break the invisible 3 barrier and add an actual 4 which is exhaust all options. My point is: this cannot be a design guideline.
245	Aligning cost and codegen will always have to rely on heuristics, even if we codegen multiple variations, as this is modelling IR, not assembly. We should respect the cost analysis and we should strive to do the best we can, yes, but we shouldn't try to estimate the generated code accurately over other improvements.
261	This is the holy grail, where a lot of attention to detail and benchmarking will have to be done.
296	If this is related to detecting in-loop disjoint regions, wouldn't it be easier to do that transformation before the vectoriser and generate two loops, so that they're always SESE or not worth looking?
308	This makes sense. And VRecipes consult the target information based on instructions and sizes. It should be straightforward to have cross-target VRecipes just based on isLegal() and friends, which would prune all unsupported functionality really early.
314	This is less clear. I understand how one VRecipe that knows about interleaved access would worry about this, but I don't understand how we're going to tell the others that have no such concept not to ignore it and to calculate the costs correctly. Our current interleave cost calculations may not be enough to bar unrelated changes.
319	I'd advise against doing code modifications in an already vectorised block. You may be able to get an extra oomph, but legality analysis might have to be done all over again, which can significantly add costs to an already expensive process. Not to mention that the complexity of the task will likely leak illegal transformations through.
410	This looks mostly ok...
415	What do you mean by `vectorize()`? Is that `legal+cost+transform`? To avoid paying cost analysis and transformation costs up-front, wouldn't it make more sense to do each step for all plans as a group? Assuming VPlans may expose different legality issues, you should: vec<VPlan> TODO; for (auto P : Plans) { if (P->isLegal()) TODO.push_back(P); } Then, compute the costs first, and only `vectorize()` the assumed cheapest: unsigned MinCost = ScalarCost; VPlan Best; for (auto P : TODO) { unsigned Cost = P->cost(); if(MinCost > Cost) { Best = P; MinCost = Cost; } } Best->transform();
426	Couldn't different VPlans expose different legality issues? For example, for different combination of UFs and VFs?
473	Right, so this is my "cost loop" above.
516	And this is my `Best.transform()` above. Looks good.
557	This is just analysis, right?
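
The ordering suggested in the comment on line 415 (discard illegal plans, cost the rest, transform only the cheapest) can be sketched in standalone C++. This is an illustration under assumptions, not code from the patch: `PlanSketch`, its `Legal` and `Cost` fields, and `pickBest` are hypothetical names, and costs are assumed to be directly comparable with the scalar cost.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one vectorization candidate; this is not the
// VPlan class from the patch.
struct PlanSketch {
  unsigned VF;   // vectorization factor the candidate assumes
  bool Legal;    // outcome of an assumed per-plan legality check
  unsigned Cost; // estimated cost, assumed comparable with ScalarCost
};

// Returns the cheapest legal plan that beats ScalarCost, or nullptr when
// keeping the scalar loop is the better choice.
const PlanSketch *pickBest(const std::vector<PlanSketch> &Plans,
                           unsigned ScalarCost) {
  const PlanSketch *Best = nullptr;
  unsigned MinCost = ScalarCost;
  for (const PlanSketch &P : Plans) {
    if (!P.Legal)
      continue; // Step 1: discard illegal candidates before costing them.
    if (P.Cost < MinCost) { // Step 2: keep the cheapest plan seen so far.
      MinCost = P.Cost;
      Best = &P;
    }
  }
  return Best; // Step 3: the caller transforms only this plan.
}
```

Only the returned plan would be handed to the transformation step, so no IR is generated for the candidates that lose the comparison.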

mkuper added inline comments. Jan 27 2017, 10:52 AM

docs/VectorizationPlan.rst
141	All of that is correct, I didn't mean to imply we should be handling outer loops now. :-) I was just confused about the "raising the importance" part.
245	I think "codegen" here means "the IR generated by the vectorizer", not "backend code generation". See line 123. Actually getting the cost model to accurately estimate the cost of complex IR constructs is an orthogonal problem.
426	I think this is why the design says that Legal would "encode constraints". I guess the LVP would have to consult Legal per-potential-VPlan to see whether it's feasible or not?

ashahid added a subscriber: ashahid. Jan 29 2017, 10:47 PM

rengolin added inline comments. Jan 30 2017, 9:28 AM

docs/VectorizationPlan.rst
245	Right, but costs are associated with what assembly instructions will be generated by those IR instructions. So we associate zero cost to a lot of shuffles because we know they'll be merged into the load or add. But the same shuffles, in a different pattern, will demand a sequence of insert/extract instructions that will cost a lot more than zero. This is my point, that accuracy is not possible at this level and that, as you say, it's an orthogonal problem to solve. Maybe the word we're looking for is "reliably"?
426	Right, this is what I didn't understand. We can do it both ways: one legal consulting all plans, or each plan consulting legal. I'd prefer we act on plans for everything, as it would be a cleaner concept and an easier move. As I wrote in another comment: first we iterate through all plans and discard all illegals, then we calculate the costs, sort and pick the best. We could even (in the far future) have multiple costs per plan (VF, UF, code-size, hazards, etc) and sort by a formula that takes all of them into account.

Ayal added inline comments. Feb 1 2017, 7:23 AM

docs/VectorizationPlan.rst
11	Agreed, patch is more than "a pure refactoring/cleanup". We meant to say "no change in output intended".
28	That could be one way to think of it. The aim is to record each decision once, and have both cost estimate and transformation stem from the same recording.
37	Right; Legal should be refactored as well to reflect its true Legality nature, by adjacent patches.
61	They are the same if "generates VPlans" is understood as an iterative process, which includes taking already generated VPlans and modifying them to produce new candidates for consideration.
112	Right, a constraint can be, e.g., an upper bound on the VF. Artifacts are pieces of information collected by Legal which are useful for the vectorization process. An artifact can be, e.g., the set of reduction phi's; Legal has to identify them to make sure they are all vectorizable. The distinction may be somewhat vague; e.g., Legal has to make sure runtime guards can be constructed for possible memory dependencies. Another way of putting this, w/o constraints nor artifacts: Legal can be made responsible for either determining that a loop cannot be vectorizable, or else for providing initial VPlan(s) as constructive proof showing how it can be vectorized, albeit sub-optimally.
132	An example of optimizing an LV VPlan is LoopVectorizationPlanner::optimizePredicatedInstructions(). Another should be the recognition of interleave groups. The reference to TSLP is by analogy, which you've summarized quite well. The point was that there could be various candidates worth considering, in this case all with the same VF. Furthermore, constructing TSLP subtrees is somewhat analogous to LV's isProfitableToScalarize(). (BTW, you can also watch the TSLP video recording ;-)
141	Comment tries to say that it'll be harder to continue and make "right" decisions as the scope of vectorization increases to outer loops, with the proposed design (or any other). Explicit vectorization tries to mitigate this expected hardship.
241	The guideline requires VPlan to be designed such that it can support any decision taken by current LV; but of course not be limited to it.
245	Right, sorry for the confusion. "CodeGen" indeed means here "the IR code generated by the vectorizer".
252	The original thought of a Recipe should also match an SLP 'bundle'. An interleave-group recipe for instance takes care of a group of related loads or stores, albeit doing more than stitching them together to produce a single vector load or store. Modeling SLP requires more attention.
272	Right.
282	Right.
290	How so?
296	Two VPRegions are disjoint (or else contain one another) in the sense that they share no common VP(Basic)Block. Say we have two if-then-else hammocks one after the other, and we wish to wrap each in a VPRegion. To avoid having the Exit of the first be the Entry of the second, it should be split somehow, possibly resulting in an empty block.
308	VPRecipes should indeed be cross-target. If something in the loop is illegal we avoid building any VPlans for it in the first place, as a form of early pruning. OTOH, it should conceptually be possible, and arguably profitable, to scalarize any instruction.
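
The disjoint-or-nested property described in the reply on line 296 can be checked with a small standalone sketch. Regions are modeled here simply as sets of block identifiers; `BlockId`, `RegionSketch`, and the three helper functions are hypothetical names chosen for illustration and do not appear in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

using BlockId = int;                    // hypothetical id of a VP(Basic)Block
using RegionSketch = std::set<BlockId>; // a region as the set of its blocks

// True when every block of Inner is also a block of Outer.
bool contains(const RegionSketch &Outer, const RegionSketch &Inner) {
  return std::includes(Outer.begin(), Outer.end(), Inner.begin(), Inner.end());
}

// True when the two regions have at least one block in common.
bool shareBlock(const RegionSketch &A, const RegionSketch &B) {
  for (BlockId Id : A)
    if (B.count(Id) != 0)
      return true;
  return false;
}

// The property under discussion: two regions either share no block at all,
// or one of them contains the other.
bool disjointOrNested(const RegionSketch &A, const RegionSketch &B) {
  return !shareBlock(A, B) || contains(A, B) || contains(B, A);
}
```

With two back-to-back hammocks `{1, 2, 3, 4}` and `{4, 5, 6, 7}`, block 4 is both the exit of the first and the entry of the second, so the check fails. Giving the second hammock its own (possibly empty) entry block, for example `{8, 5, 6, 7}`, makes the check pass.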

Removed png's, made a few cleanups and minor fixes to the code.

Ayal added inline comments. Feb 1 2017, 8:02 AM

docs/VectorizationPlan.rst
288	The sentence above meant to emphasize that VPBasicBlocks are not necessarily maximal. A VPBasicBlock can have a single successor VPBasicBlock, which in turn has a single predecessor. Eventually both will contribute their instructions into a common IR basic block of the vectorized version.
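
A rough sketch of what "not necessarily maximal" implies for code generation, assuming blocks are identified by their index in a vector. `VPBlockSketch`, `foldsIntoPredecessor`, and `countIRBlocks` are hypothetical names; how the patch itself decides where instructions land may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified block: only predecessor/successor indices.
struct VPBlockSketch {
  std::vector<std::size_t> Preds;
  std::vector<std::size_t> Succs;
};

// True when block B has a single predecessor whose only successor is B,
// i.e. both would contribute instructions to one common IR basic block.
bool foldsIntoPredecessor(const std::vector<VPBlockSketch> &Blocks,
                          std::size_t B) {
  const VPBlockSketch &Blk = Blocks[B];
  if (Blk.Preds.size() != 1)
    return false;
  return Blocks[Blk.Preds[0]].Succs.size() == 1;
}

// Counts how many IR basic blocks the VP blocks would occupy under the
// folding rule above.
std::size_t countIRBlocks(const std::vector<VPBlockSketch> &Blocks) {
  std::size_t Count = 0;
  for (std::size_t B = 0; B < Blocks.size(); ++B)
    if (!foldsIntoPredecessor(Blocks, B))
      ++Count;
  return Count;
}
```

For a straight chain of three blocks followed by a two-way branch, the three chained blocks fold into one IR basic block and each branch target gets its own.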

bader added a subscriber: bader. Feb 9 2017, 7:32 AM

Ayal added inline comments. Feb 13 2017, 4:50 PM

docs/VectorizationPlan.rst
245	Yes, "reliably" or "consistently" better describe the intent here. The "accuracy" still depends on TTI. Fix will appear in next upload.
314	Another way to view this: each instruction that will eventually appear in the vectorized loop will be generated by a VPRecipe. This recipe also takes care of estimating its cost. It is possible for several such instructions to be generated by one common VPRecipe, as in the case of an interleave group recipe.
319	The loop vectorizer already moves code effectively, when it forms interleave groups and sinks scalar operands. There may be additional opportunities during vectorization that involve code modification. These admittedly must be done carefully, but potentially impact performance significantly (by more than an extra oomph) and so ought to be represented and estimated reliably.
388	For instance: where to generate the next instruction(s), by passing an IRBuilder; and, when generating scalarized predicated instructions, which lane to generate a scalar instance for.
415	By "vectorize()" we mean "transform". The latter verb can be used instead if clearer? Yes, we only call "vectorize()" / "transform()" on Best VPlan, after finding it according to relative costs, as suggested above. Currently VPlans are built only after passing all "isLegal()" checks. So every VPlan is legal by construction.
426	The current design performs all Legal checks before starting to work with VPlans, retaining the separation between correctness and performance considerations. This could be revisited if desired. If Legal determines that a loop cannot be vectorized at all, no VPlans are built. If Legal determines that a loop can be vectorized but only under certain restrictions, all VPlans built will comply with all these restrictions, and offer distinct alternatives how to vectorize. Current restrictions may be: VF small enough, pass runtime guards regarding potential aliasing and non-unit strides, handle cross-iteration dependencies such as reductions, 1st order recurrences, live-outs. How to calculate the cost(s) of each VPlan and find the best one is subject to a separate, follow-up patch.
473	Right, LVP.plan takes care of building the relevant VPlans and running the "cost loop", returning the best VF. This "cost loop" currently still uses the existing CostModel, but is planned to transition to use VPlans in the aforementioned follow-up patch.
486	VPlan already addresses both UF and VF in this patch, in the sense that when instructed to do so, by calling "vectorize()", it will generate vector code based on both. Determining the best VF and the best UF is still done based on existing CostModel in this patch, by first finding the best VF and then finding the best UF. Follow-up patch is devoted to having both these decisions be based on VPlans as well.
516	Right, this will perform the actual transformation. :-)
557	Right, no IR has been transformed nor constructed yet.
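
The view given in the reply on line 314 (every emitted instruction comes from a recipe, the recipe also estimates its cost, and one recipe may emit several instructions) can be sketched as a small class hierarchy. All names below (`RecipeSketch`, `WidenSketch`, `InterleaveGroupSketch`, `planCost`) and all cost numbers are made up for illustration; they are not the VPRecipe classes or TTI costs from the patch.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical recipe interface: each recipe both emits the instructions it
// is responsible for and estimates their cost.
struct RecipeSketch {
  virtual ~RecipeSketch() = default;
  virtual unsigned cost() const = 0;
  virtual void emit(std::vector<std::string> &Out) const = 0;
};

// One recipe, one widened instruction.
struct WidenSketch : RecipeSketch {
  std::string Op;
  explicit WidenSketch(std::string Op) : Op(std::move(Op)) {}
  unsigned cost() const override { return 1; } // made-up unit cost
  void emit(std::vector<std::string> &Out) const override {
    Out.push_back("wide." + Op);
  }
};

// One recipe, several instructions: a wide load plus one shuffle per member.
struct InterleaveGroupSketch : RecipeSketch {
  unsigned Members;
  explicit InterleaveGroupSketch(unsigned Members) : Members(Members) {}
  unsigned cost() const override { return 1 + Members; } // made-up cost
  void emit(std::vector<std::string> &Out) const override {
    Out.push_back("wide.load");
    for (unsigned I = 0; I < Members; ++I)
      Out.push_back("shuffle." + std::to_string(I));
  }
};

// The cost of a plan is the sum of the costs reported by its recipes.
unsigned planCost(const std::vector<std::unique_ptr<RecipeSketch>> &Recipes) {
  unsigned Total = 0;
  for (const auto &R : Recipes)
    Total += R->cost();
  return Total;
}
```

Because the same object answers both `cost()` and `emit()`, the estimate and the generated code stem from a single recorded decision, which is the consistency argument made in the replies above.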

RKSimon added a subscriber: RKSimon. Feb 14 2017, 1:30 PM

spatel added a subscriber: spatel. Feb 15 2017, 6:03 AM

oren_ben_simhon added a subscriber: oren_ben_simhon. Feb 22 2017, 1:40 AM

oren_ben_simhon added inline comments.

lib/Transforms/Vectorize/VPlan.cpp
11	Please follow LLVM guidelines for file documentation: http://llvm.org/docs/CodingStandards.html#file-headers
lib/Transforms/Vectorize/VPlan.h
2	This file is very big and will probably gain some more weight in time. I think that some utility classes could be moved to a separate file.
69	IMHO, VPlanUtils should not be a friend class, because you don't access any of its private/protected members. Instead use VPlanUtils statically.
73	use ///< for post member documentation.
76	I didn't follow the whole logic, but are you sure that you need the recipe parent? Also, I think that "owner" describes the member better than "parent".
192	The function should be const.
311	I know that the initial size of SmallVectors has minimal effect; still, I believe that the default size of the Predecessors/Successors should be one.
316	The \brief will have no effect because anyway the comment until the first dot will be considered as brief by Doxygen.
515	No need for the brief command.
523	The function receives a parameter that it doesn't use. Also it is static although it returns a non-static class member.
593	Please consider using a single list/vector instead of two members Entry/Exit.
715	The plan is not used anywhere in the class.
718	This function member is not referenced anywhere.
721	This class should be a singleton class that should not be initialized. As such the constructor should be private.
724	All functions that do not relate to a non-static member, should be static and const.
796	I think that the function name should be more descriptive to reflect the nature of the function, for example "setConditionalSuccessors".
861	Maybe use a doxygen style comment to all class documentations.

Hi,

I have been working with improving cost estimates in the LoopVectorizer a bit. I wonder if VPlan will improve cost estimation for different VFs, including 1?

One issue currently that I don't know yet quite how to tackle, is that two scalarized instructions (def->use), have too big of scalarization costs computed, since the inserts from the first, and the extracts of the second will be optimized away. There are also several other minor issues I have found so far. I guess I should probably wait for VPlan to arrive before trying anything... Is that going to be anytime soon?

Looking forward to this

/Jonas

In D28975#684385, @jonpa wrote:

I have been working with improving cost estimates in the LoopVectorizer a bit. I wonder if VPlan will improve cost estimation for different VFs, including 1?

Hi Jonas,

AFAICT, VPlan is completely orthogonal to the cost computation, ie. the exact same cost functions will be used (including 1).

One issue currently that I don't know yet quite how to tackle, is that two scalarized instructions (def->use), have too big of scalarization costs computed, since the inserts from the first, and the extracts of the second will be optimized away. There are also several other minor issues I have found so far.

This is a known issue with any heuristics-based approach, which is the case here. The high costs (for specific architectures) are expecting shuffles to be generated, but on other architectures you get them for free. Overriding the cost per arch (or sub-arch) helps a bit (and that's what we use), but it doesn't cover cases based on the right sequence of instructions, at least not in any formalised way.

The only generic way of doing this is to allow targets to inspect the basic block for the expected sequences (so for example, ext+add has a free ext but ext+mov doesn't). This would require table gen descriptions of the patterns and could considerably increase the cost computation (but that's a target-specific decision, so it's ok).

I guess I should probably wait for VPlan to arrive before trying anything... Is that going to be anytime soon?

Being orthogonal, I don't expect your cost investigations to have a large impact on the VPlan implementation. Maybe if you could open a new RFC with your proposal, we could go through it and see how much it'll actually impact.

cheers,
--renato

The only generic way of doing this is to allow targets to inspect the basic block for the expected sequences (so for example, ext+add has a free ext but ext+mov doesn't). This would require table gen descriptions of the patterns and could considerably increase the cost computation (but that's a target-specific decision, so it's ok).

One thing I did was actually a check if ISD::ZEXTLOAD / ISD::SEXTLOAD is legal, which I think should work without any extra work, since those ISD nodes are already there. For in-memory operations, I suppose that's a bit trickier; maybe a target hook that returns true/false depending on whether there is an instruction that will fold the load. I will put my patch up for review when it's more mature, hopefully. Or perhaps it is best to do as you say, and allow the target to adjust the sum by inspecting the BB...

The main thought I had at this moment, was that I thought perhaps if the scalarization costs were modeled in a better way, the LoopVectorizer should be able to for example hold the scalarization costs for each instruction as a tuple of {inserts, extracts}, and then get a more accurate final cost estimate sum by checking interdependencies of scalarized / vectorized instructions. It should only add inserts if the user was vectorized, and so on. I was hoping maybe VPlan perhaps might build a model with instruction costs and sum them after all individual costs are there, or so.
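
The {inserts, extracts} idea can be sketched as follows. This is only one possible reading of the proposal, with hypothetical names (`InstSketch`, `totalCost`) and a deliberately simple rule: a scalarized instruction pays its insert cost only if some user is vectorized, and pays its extract cost only if some operand is vectorized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-instruction record; operands refer to other entries of
// the same vector by index.
struct InstSketch {
  bool Scalarized;   // true if the instruction is replicated per lane
  unsigned BaseCost; // cost of the instruction(s) themselves
  unsigned Inserts;  // cost of packing scalar results into a vector
  unsigned Extracts; // cost of unpacking vector operands into scalars
  std::vector<std::size_t> Operands;
};

// Sums base costs, adding insert/extract costs only across boundaries
// between scalarized and vectorized instructions.
unsigned totalCost(const std::vector<InstSketch> &Insts) {
  // First pass: find definitions that feed at least one vectorized user.
  std::vector<bool> HasVectorUser(Insts.size(), false);
  for (const InstSketch &I : Insts)
    if (!I.Scalarized)
      for (std::size_t Op : I.Operands)
        HasVectorUser[Op] = true;

  unsigned Total = 0;
  for (std::size_t Idx = 0; Idx < Insts.size(); ++Idx) {
    const InstSketch &I = Insts[Idx];
    Total += I.BaseCost;
    if (!I.Scalarized)
      continue;
    if (HasVectorUser[Idx])
      Total += I.Inserts; // results must be packed for a vector user
    bool HasVectorOperand = false;
    for (std::size_t Op : I.Operands)
      if (!Insts[Op].Scalarized)
        HasVectorOperand = true;
    if (HasVectorOperand)
      Total += I.Extracts; // operands must be unpacked from a vector
  }
  return Total;
}
```

For a scalarized def feeding a scalarized use, neither the inserts nor the extracts are charged, which is the def->use case raised earlier in the thread.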

In D28975#684539, @jonpa wrote:

The main thought I had at this moment, was that I thought perhaps if the scalarization costs were modeled in a better way, the LoopVectorizer should be able to for example hold the scalarization costs for each instruction as a tuple of {inserts, extracts}, and then get a more accurate final cost estimate sum by checking interdependencies of scalarized / vectorized instructions. It should only add inserts if the user was vectorized, and so on. I was hoping maybe VPlan perhaps might build a model with instruction costs and sum them after all individual costs are there, or so.

This was my first thought, since they are the highest unknowns in vectorisation. But then you may have extending adds but not extending saturating adds, and then a three-instruction-pattern (ext+add+max) cannot be matched, but it will match the (ext+add) and remove the ext cost, when there is actually an extra instruction for that.

There are also costs related to moving to and from vector registers. For example, on NEON, GPR->NEON is free, but NEON->GPR has a ~10-cycle stall. That cannot be modelled without understanding about the surrounding instructions (say, a scalar reduction on every loop would make a 2-lane vector add useless).

We could probably go slow and fiddle with the scalar costs now (with lots of benchmark results in the affected arches), and maybe have a half-baked solution for shuffles, since they're the most obvious problems, but it would be good to not destroy the plan to be able to look at the context, hopefully based on table gen pattern matches.

cheers,
--renato

Hi Jonas and Renato,

In D28975#684462, @rengolin wrote:

In D28975#684385, @jonpa wrote:

I have been working with improving cost estimates in the LoopVectorizer a bit. I wonder if VPlan will improve cost estimation for different VFs, including 1?

Hi Jonas,

AFAICT, VPlan is completely orthogonal to the cost computation, ie. the exact same cost functions will be used (including 1).

Improving the "accuracy" of TTI's cost estimates for IR patterns is indeed an orthogonal effort. VPlan is designed to provide an explicit representation of the IR patterns that the vectorizer plans to emit, for all VF's under consideration. This includes representing scalarized instructions, vectorized instructions, packing/unpacking instructions and where they are placed, as well as idioms such as interleave-groups. Doing so will help provide a more "consistent", "reliable" and straightforward cost computation. The plan is to introduce VPlan incrementally, starting with this patch driving the transformation, to be followed by having VPlan also drive the cost computation.

Determining where to best draw the line between scalarized and vectorized instructions is indeed an interesting challenge, cf. the work on Throttled SLP.

VPlan is designed to provide an explicit representation of the IR patterns that the vectorizer plans to emit, for all VF's under consideration. This includes representing scalarized instructions, vectorized instructions, packing/unpacking instructions and where they are placed, as well as idioms such as interleave-groups.

That was just what I wanted to hear, sounds great :-) I suppose then one shouldn't put too much effort into this before VPlan arrives.

Updated to ToT, including the recording of InterleaveGroup decisions per VF.

Ayal marked 5 inline comments as done. Mar 2 2017, 1:51 PM

Maybe if you could open a new RFC with your proposal, we could go through it and see how much it'll actually impact.

I have now two minor revisions open for the LoopVectorizer, which are in need of review:
https://reviews.llvm.org/D30732
https://reviews.llvm.org/D30680

hfinkel mentioned this in D31186: Changing TargetTransformInfo::getGEPCost to take GetElementPtrInst as parameter. Mar 23 2017, 6:17 AM

Updated to ToT, including first incremental patches that were committed.

tschuett added a subscriber: tschuett. Apr 13 2017, 1:38 PM

jprice added a subscriber: jprice. Apr 13 2017, 3:29 PM

gilr mentioned this in rL300310: [LV] Remove implicit single basic block assumption. Apr 14 2017, 12:43 AM

gilr mentioned this in rL300557: [LV] Cache block mask values. Apr 18 2017, 8:00 AM

gilr mentioned this in rL301293: [LV] Remove redundant basic block split. Apr 24 2017, 11:10 PM

gilr mentioned this in rL301345: [LV] Make LIT test insensitive to basic block numbering. Apr 25 2017, 11:27 AM

So, I think the document is so good and can be very helpful in further communications, that I think it should be committed separately.

Once we all agree that the document explains what we really want (I mostly do), we can look at the transformations.

I mean, this is a big change and we want to get it right, so minimising the patch size and intrusiveness is a plus. About half of this patch is the document. :)

Makes sense?
--renato

docs/VectorizationPlan.rst
11	I'll echo @mkuper here: drop the NFC. Also, this is a document that will outlive the patch, so it should also not mention the patch at all. If we want a document to help us design or explain the VPlan, then we need a VPlan document, not a document for a patch. Don't get me wrong, not many people write documents, especially not for patches, so this is awesome! But we need to think about its lifetime and format.
426	Right, this is what I meant. Filter all VPlans by legality first. I do like the idea that no illegal VPlan can exist. Costs are for a later patch, I agree.

In D28975#739893, @rengolin wrote:

So, I think the document is so good and can be very helpful in further communications, that I think it should be committed separately.

Once we all agree that the document explains what we really want (I mostly do), we can look at the transformations.

I mean, this is a big change and we want to get it right, so minimising the patch size and intrusiveness is a plus. About half of this patch is the document. :)

Makes sense?
--renato

Yes, thanks. We're refining this document to contain only the general design of VPlan; the rest of the documentation may better suit a review-summary / commit-message of the first introductory patch.

docs/VectorizationPlan.rst
11	OK, you're right. We're refining this document to contain only the general design of VPlan. The parts that specifically concern the first introductory patch will be taken out to form the SUMMARY and commit-message of this first patch, to be uploaded for review shortly. The document was originally written with everything together to make it self-contained and hopefully easier to review.

Ayal mentioned this in D32871: [LV] Using VPlan to model the vectorized code and drive its transformation. Jun 12 2017, 2:35 AM

bollu added a subscriber: bollu. Oct 30 2017, 12:52 AM

fhahn added a subscriber: fhahn. Oct 31 2017, 2:37 AM

Hi Ayal,

This functionality has been submitted already, right? If so, please close this review.

Thanks,
--renato

In D28975#946240, @rengolin wrote:

Hi Ayal,

This functionality has been submitted already, right? If so, please close this review.

Thanks,
--renato

Sure Renato. This approach of gradually changing and improving the loop vectorizer in-place has served well. Thanks for your support and appreciation!

Revision Contents

Path

Size

docs/

VectorizationPlan.rst

574 lines

Vectorizers.rst

12 lines

lib/

Transforms/

Vectorize/

1 line

10638 lines

922 lines

400 lines

test/

Transforms/

LoopVectorize/

AArch64/

aarch64-predication.ll

12 lines

predication_costs.ll

22 lines

if-pred-non-void.ll

12 lines

induction.ll

20 lines

Diff 90376

docs/VectorizationPlan.rst

This file was added.

				+++++
				VPlan
				+++++

				Goal of initial VPlan patch
				+++++++++++++++++++++++++++
				The design and implementation of VPlan follow our RFC [10]_ and presentation
				[11]_. The initial patch is designed to:

				- be a lightweight NFC patch;
				- show key aspects of VPlan's Hierarchical CFG concept;
				mkuper: As @chandlerc pointed out to me, this isn't actually NFC. It's expected not to change the output of the vectorizer, but it's a huge change in how the vectorizer functions.
				Ayal: Agreed, patch is more than "a pure refactoring/cleanup". We meant to say "no change in output intended".
				rengolin: I'll echo @mkuper here: drop the NFC. Also, this is a document that will outlive the patch, so it should also not mention the patch at all. If we want a document to help us design or explain the VPlan, then we need a VPlan document, not a document for a patch. Don't get me wrong, not many people write documents, especially not for patches, so this is awesome! But we need to think about its lifetime and format.
				Ayal: OK, you're right. We're refining this document to contain only the general design of VPlan. The parts that specifically concern the first introductory patch will be taken out to form the SUMMARY and commit-message of this first patch, to be uploaded for review shortly. The document was originally written with everything together to make it self-contained and hopefully easier to review.
				- demonstrate how VPlan can

				* capture all current vectorization decisions: which instructions are to

				+ be vectorized "on their own", or
				+ be part of an interleave group, or
				+ be scalarized, and optionally have scalar instances moved down to other
				basic blocks and under a condition; and
				+ be packed or unpacked (at the definition rather than at its uses) to
				provide both scalarized and vectorized forms; and

				* represent all control-flow within loop body of vectorized code version.

				- Be a step towards

				* aligning Cost step with Transformation step,
				* representing entire code being transformed,
				mkuper: By this do you mean that making a cost decision will "record" the transformation we expect to make?
				Ayal: That could be one way to think of it. The aim is to record each decision once, and have both cost estimate and transformation stem from the same recording.
				* adding optimizations:

				+ optimize conditional scalarization further,
				+ retaining uniform control-flow,
				+ vectorizing outerloops,
				+ and more.

				Out of scope for initial patch:

				mkuper: This sounds reasonable to me. I assume this is in sync with what Elena is doing in, say, D27919, right?
				Ayal: Right; Legal should be refactored as well to reflect its true Legality nature, by adjacent patches.
				- changing how a loop is checked if it can be vectorized - "Legal";
				- changing how a loop is checked if it should be vectorized - "Cost".


				==================
				Vectorization Plan
				==================

				.. contents::
				:local:

				Overview
				========
				The Vectorization Plan is an explicit recipe for describing a vectorization
				candidate. It serves both for estimating the cost reliably and for performing
				the translation, and facilitates dealing with multiple vectorization candidates.

				The overall structure consists of:

				1. One LoopVectorizationPlanner for each attempt to vectorize a loop or a loop
				nest.

				2. A LoopVectorizationPlanner can construct, optimize and discard one or more
				VPlans, providing different ways to vectorize the loop or the loop nest.
				mkuper: How is "can construct, optimize and discard one or more VPlans" different from just "generates VPlans"? (I'm trying to understand if this adds information I'm missing. :-) )
				Ayal: They are the same if "generates VPlans" is understood as an iterative process, which includes taking already generated VPlans and modifying them to produce new candidates for consideration.

				3. Once the best VPlan is determined, including the best vectorization factor
				and unroll factor, this VPlan drives the vector code generation using a
				VPTransformState object.

				4. Each VPlan represents the loop or the loop nest using a hierarchical CFG.

				5. At the bottom level of the hierarchical CFG are VPBasicBlocks.

				6. Each VPBasicBlock consists of one or more VPRecipes to generate Instructions
				for it.
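The containment described in the list above can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the patch's implementation: `PlannerSketch`, `PlanNode`, `BasicBlockNode` and `RecipeNode` are hypothetical stand-ins for LoopVectorizationPlanner, VPlan, VPBasicBlock and VPRecipe, and the block list is flattened here rather than hierarchical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a VPRecipe: describes how to emit some instructions.
struct RecipeNode {
  std::string Description;
};

// Hypothetical stand-in for a VPBasicBlock: a leaf holding recipes in order.
struct BasicBlockNode {
  std::string Name;
  std::vector<RecipeNode> Recipes;
};

// Hypothetical stand-in for a VPlan: one vectorization candidate.
struct PlanNode {
  unsigned VF;                        // vectorization factor of this candidate
  std::vector<BasicBlockNode> Blocks; // flattened; the real CFG is hierarchical

  std::size_t numRecipes() const {
    std::size_t N = 0;
    for (const BasicBlockNode &B : Blocks)
      N += B.Recipes.size();
    return N;
  }
};

// Hypothetical stand-in for LoopVectorizationPlanner: owns every candidate
// built for one loop and may discard candidates again without side effects.
class PlannerSketch {
  std::vector<std::unique_ptr<PlanNode>> Plans;

public:
  // Plans live behind unique_ptr so references stay valid as more are added.
  PlanNode &addPlan(unsigned VF) {
    Plans.push_back(std::unique_ptr<PlanNode>(new PlanNode{VF, {}}));
    return *Plans.back();
  }

  // Drops every candidate built for the given VF.
  void discard(unsigned VF) {
    Plans.erase(std::remove_if(Plans.begin(), Plans.end(),
                               [VF](const std::unique_ptr<PlanNode> &P) {
                                 return P->VF == VF;
                               }),
                Plans.end());
  }

  std::size_t numPlans() const { return Plans.size(); }
};
```

A planner would typically hold several such candidates at once and discard the losers once the best one is known.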

				Motivation
				----------
				The vectorization transformation can be rather complicated, involving several
				potential alternatives, especially for outer loops [1]_ but also possibly for
				innermost loops. These alternatives may have significant performance impact,
				both positive and negative. A cost model is therefore employed to identify the
				best alternative, including the alternative of avoiding any transformation
				altogether.

				The process of vectorization traditionally involves three major steps: Legal,
				Cost, and Transform. This is the general case in LLVM's LoopVectorizer:

				1. Legal Step: check if loop can be legally vectorized; encode constraints and
				artifacts if so.
				2. Cost Step: compute the relative cost of vectorizing it along possible
				vectorization and unroll factors (VF, UF).
				3. Transform Step: vectorize the loop according to best VF and UF.

				This design, which works only directly on the original LLVM-IR, has some
				implications:

				1. Cost Step tries to predict what the vectorized loop will look like and how
				much it will cost, independently of what the Transform Step will eventually
				do. It's hard to keep the two in sync.
				2. Cost Step essentially considers a single vectorization candidate. Any
				alternatives are immediately evaluated and resolved.
				3. Legal Step does more than check for vectorizability; e.g., it records
				auxiliary artifacts such as collectLoopUniforms() and InterleaveInfo.
				4. Transform Step first populates the single basic block of the vectorized loop
				and later revisits scalarized instructions to predicate them one by one, as
				needed.

				The Vectorization Plan is designed to explicitly model a vectorization
				candidate to overcome the above constraints, which is especially important for
				the vectorization of outer-loops. This affects the overall process by
				essentially splitting the Transform Step into a Plan Step and a Code-Gen Step:

				1. Legal Step: check if loop can be legally vectorized; encode contraints and
				artifacts if so. Initiate Vectorization Plan showing how the loop can be
				mkuper: contraints -> constraints. Also, I understand what constraints are in this context (e.g. dependence distance), but what do you mean by artifacts?
				Ayal: Right, a constraint can be, e.g., an upper bound on the VF. Artifacts are pieces of information collected by Legal which are useful for the vectorization process. An artifact can be, e.g., the set of reduction phi's; Legal has to identify them to make sure they are all vectorizable. The distinction may be somewhat vague; e.g., Legal has to make sure runtime guards can be constructed for possible memory dependencies. Another way of putting this, w/o constraints nor artifacts: Legal can be made responsible for either determining that a loop cannot be vectorizable, or else for providing initial VPlan(s) as constructive proof showing how it can be vectorized, albeit sub-optimally.
				vectorized only after passing Legal, to save redundant construction.
				2. Plan Step:

				a. Build initial Vectorization Plans following the constraints and
				decisions taken by Legal.
				b. Explore ways to optimize the vectorization plan, complying with
				all legal constraints, possibly constructing several plans following
				tentative vectorization decisions.
				3. Cost Step: compute the relative cost of each plan. This step can be applied
				repeatedly by Plan Step 2.b.
				4. Code-Gen Step: materialize the best plan. Note that only this step modifies
				the IR, as in the current Loop Vectorizer.

				The Cost Step can also be split into an Early-Pruning Step(s) and a
				"Cost-Gen" Step, where the former applies quick yet inaccurate estimates to
				prune obviously-unpromising candidates, and the latter applies more accurate
				estimates based on a full Plan.

				One can compare with LLVM's existing SLP vectorizer, where TSLP [3]_ adds
				Step 2.b.
				mkuper: Could you give a more concrete example for what optimizing an LV VPlan would mean? I haven't read the TSLP paper (although I should), but skimming it, it looks like it's basically constructing a lot of subtrees (each of which could be considered, in our context, its own VPlan) and choose the best among those. So this isn't exactly "optimizing" a plan, it's just recursively generating plans and choosing the cheapest one. Or should I just go and read the paper properly?
				Ayal: An example of optimizing an LV VPlan is LoopVectorizationPlanner::optimizePredicatedInstructions(). Another should be the recognition of interleave groups. The reference to TSLP is by analogy, which you've summarized quite well. The point was that there could be various candidates worth considering, in this case all with the same VF. Furthermore, constructing TSLP subtrees is somewhat analogous to LV's isProfitableToScalarize(). (BTW, you can also watch the TSLP video recording ;-)
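The Legal, Plan, Cost and Code-Gen flow described above, including the split of costing into early pruning and a fuller estimate, can be sketched roughly as follows. Everything here is an assumption made for illustration: `candidateVFs`, `selectVF`, the two caller-supplied cost functions and the pruning threshold are hypothetical, a candidate is reduced to just its VF, and a `MaxLegalVF` below 2 stands for "Legal found nothing to vectorize".

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Plan Step (sketch): one candidate per power-of-two VF, bounded by the
// constraint produced by the Legal Step. The VF != 0 guard stops the loop if
// doubling ever wraps around.
inline std::vector<unsigned> candidateVFs(unsigned MaxLegalVF) {
  std::vector<unsigned> VFs;
  for (unsigned VF = 2; VF != 0 && VF <= MaxLegalVF; VF *= 2)
    VFs.push_back(VF);
  return VFs;
}

// Cost Step (sketch): a quick estimate prunes unpromising candidates, and a
// fuller estimate ranks the survivors against the scalar loop (VF = 1).
// Both estimators are assumptions supplied by the caller; they return a cost
// per original loop iteration. Nothing in this function modifies any IR.
inline unsigned selectVF(unsigned MaxLegalVF,
                         const std::function<double(unsigned)> &QuickCost,
                         const std::function<double(unsigned)> &FullCost,
                         double PruneAbove) {
  unsigned BestVF = 1; // the alternative of avoiding any transformation
  double BestCost = FullCost(1);
  for (unsigned VF : candidateVFs(MaxLegalVF)) {
    if (QuickCost(VF) > PruneAbove)
      continue; // early pruning: skip the fuller estimate entirely
    double Cost = FullCost(VF);
    if (Cost < BestCost) {
      BestCost = Cost;
      BestVF = VF;
    }
  }
  return BestVF; // only a later Code-Gen step would act on this choice
}
```

The design point the sketch tries to show is that selection is side-effect free: discarding every candidate simply leaves the scalar loop in place.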

				As the scope of vectorization grows from innermost to outer loops, so do the
				uncertainty and complexity of each step. One way to mitigate the shortcomings
				of the Legal and Cost steps is to rely on programmers to indicate which loops
				can and/or should be vectorized. This is implicit for certain loops in
				data-parallel languages such as OpenCL [4]_, [5]_ and explicit in others such as
				OpenMP [6]_. This design to extend the Loop Vectorizer to outer loops supports
				and raises the importance of explicit vectorization beyond the current
				capabilities of Clang and LLVM. Namely, from currently forcing the
				mkuper: I don't believe anything in this design raises the importance of explicit outer loop vectorization. I think what you're trying to say is that if we want to support explicit outer loop vectorization (which I believe we do), the best way to do it would be using this design. Am I right?
				rengolin: From the original RFC (linked in the description), the idea is that outer loops will add a lot more complexity to the vectoriser, which is already getting side-tracked by the multiple strategies. This effort is the first step in joining the strategies, which will be the sanest way forward to adding more functionality to the loop vectoriser, including outer loops and better interactions with Polly.
				mkuper: All of that is correct, I didn't mean to imply we should be handling outer loops now. :-) I was just confused about the "raising the importance" part.
				Ayal: Comment tries to say that it'll be harder to continue and make "right" decisions as the scope of vectorization increases to outerloops, with the proposed design (or any other). Explicit vectorization tries to mitigate this expected hardship.
				vectorization of innermost loops according to prescribed width and/or
				interleaving count, to supporting OpenMP's "#pragma omp simd" construct and
				associated clauses, including vectorizing across function boundaries [2]_.

				References
				----------
				.. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
				Nuzman and Ayal Zaks, PACT 2008.

				.. [2] "Proposal for function vectorization and loop vectorization with function
				calls", Xinmin Tian, [`cfe-dev
				<http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
				March 2, 2016.
				See also `review <https://reviews.llvm.org/D22792>`_.

				.. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
				Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.

				.. [4] "Intel OpenCL SDK Vectorizer", Nadav Rotem, LLVM Developers' Meeting 2011.

				.. [5] "Automatic SIMD Vectorization of SSA-based Control Flow Graphs", Ralf
				Karrenberg, Springer 2015. See also "Improving Performance of OpenCL on
				CPUs", LLVM Developers' Meeting 2012.

				.. [6] "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on
				Multicore-SIMD Processors", Xinmin Tian and Hideki Saito et al.,
				IPDPSW 2012.

				.. [7] "Exploiting mixed SIMD parallelism by reducing data reorganization
				overhead", Hao Zhou and Jingling Xue, CGO 2016.

				.. [8] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
				Brian Koblenz, PLDI 1991

				.. [9] "Structural analysis: A new approach to flow analysis in optimizing
				compilers", M. Sharir, Journal of Computer Languages, Jan. 1980

				.. [10] "RFC: Extending LV to vectorize outerloops", [`llvm-dev
				<http://lists.llvm.org/pipermail/llvm-dev/2016-September/105057.html>`_],
				September 21, 2016.

				.. [11] "Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer
				loop auto-vectorization", Hideki Saito, `LLVM Developers' Meeting 2016
				<https://www.youtube.com/watch?v=XXAvdUwO7kQ>`_, November 3, 2016.

				Examples
				--------
				An example with a single predicated scalarized instruction - integer division:

				.. code-block:: c

				void foo(int* a, int b, int* c) {
				#pragma simd
				for (int i = 0; i < 10000; ++i)
				if (a[i] > 777)
				a[i] = b - (c[i] + a[i] / b);
				}


				IR Dump Before Loop Vectorization:

				.. code-block:: LLVM
				:emphasize-lines: 6,11

				for.body: ; preds = %for.inc, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4, !tbaa !1
				%cmp1 = icmp sgt i32 %0, 777
				br i1 %cmp1, label %if.then, label %for.inc

				if.then: ; preds = %for.body
				%arrayidx3 = getelementptr inbounds i32, i32* %c, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx3, align 4, !tbaa !1
				%div = sdiv i32 %0, %b
				%add.neg = sub i32 %b, %1
				%sub = sub i32 %add.neg, %div
				store i32 %sub, i32* %arrayidx, align 4, !tbaa !1
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 10000
				br i1 %exitcond, label %for.cond.cleanup, label %for.body

				The VPlan that is built initially:

				.. image:: VPlanPrinter.png
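To make the phrase "predicated scalarized instruction" concrete, the following C++ sketch mimics, in plain scalar code, the shape a VF = 4 version of the example loop could take. It is an assumption-laden illustration rather than the vectorizer's actual output: `fooScalar` and `fooVF4` are hypothetical names, the trip count is turned into a parameter so the two versions can be compared, and the "vector" operations are written as short per-lane loops.

```cpp
#include <cassert>
#include <cstddef>

// Scalar reference: the loop from the example, with the trip count made a
// parameter.
inline void fooScalar(int *a, int b, const int *c, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (a[i] > 777)
      a[i] = b - (c[i] + a[i] / b);
}

// Sketch of a VF = 4 shape: the compare produces a per-lane mask, the
// division is scalarized and executed only for lanes whose mask bit is set,
// and the remaining arithmetic and the store are applied under the same mask.
inline void fooVF4(int *a, int b, const int *c, std::size_t n) {
  constexpr std::size_t VF = 4;
  std::size_t i = 0;
  for (; i + VF <= n; i += VF) {
    bool mask[VF];
    int div[VF] = {0, 0, 0, 0};
    for (std::size_t l = 0; l < VF; ++l) // "vector" compare
      mask[l] = a[i + l] > 777;
    for (std::size_t l = 0; l < VF; ++l) // predicated scalar divisions
      if (mask[l])
        div[l] = a[i + l] / b;
    for (std::size_t l = 0; l < VF; ++l) // arithmetic and store under the mask
      if (mask[l])
        a[i + l] = b - (c[i + l] + div[l]);
  }
  fooScalar(a + i, b, c + i, n - i); // leftover iterations stay scalar
}
```

Guarding each division by its lane's mask bit is what keeps the transformation safe: a lane that the original loop would have skipped never executes a division.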

				Design Guidelines
				=================
				1. Analysis-like: building and manipulating the Vectorization Plan must not
				modify the IR. In particular, if a VPlan is discarded
				compilation should proceed as if the VPlan had not been built.

				2. Support all current capabilities: the Vectorization Plan must be capable of
				representing the exact functionality of LLVM's existing Loop Vectorizer.
				In particular, the transition can start with an NFC patch.
				In particular, VPlan must support efficient selection of VF and/or UF.

				rengolin: I believe this is not possible, but that's ok. What I think you mean is that we can start with a single VPlan that does exactly what the current cost model does, in which case, it will have no noticeable impact to the users (this is not the same as NFC, as Michael said). But just doing that is just replacing six of one by half-a-dozen of the other. This is only worth doing if we can add more (different) VPlans, in which case, there will be a noticeable impact in higher optimisation levels. But again, that's ok. We currently change the behaviour of the vectoriser depending on the -O flag and we can continue doing it. -O2 means one VPlan, -O3 means more. We could even break the invisible 3 barrier and add an actual 4 which is exhaust all options. My point is: this cannot be a design guideline.
				Ayal: The guideline requires VPlan to be designed such that it can support any decision taken by current LV; but of-course not be limited to it.
				3. Align Cost & CodeGen: the Vectorization Plan must serve both the cost
				model and the code generation phases, where the cost estimation must
				evaluate the to-be-generated code reliably.

				rengolin: Aligning cost and codegen will always have to rely on heuristics, even if we codegen multiple variations, as this is modelling IR, not assembly. We should respect the cost analysis and we should strive to do the best we can, yes, but we shouldn't try to estimate the generated code accurately over other improvements.
				mkuper: I think "codegen" here means "the IR generated by the vectorizer", not "backend code generation". See line 123. Actually getting the cost model to accurately estimate the cost of complex IR constructs is an orthogonal problem.
				Ayal: Right, sorry for the confusion. "CodeGen" indeed means here "the IR code generated by the vectorizer".
				rengolin: Right, but costs are associated with what assembly instructions will be generated by those IR instructions. So we associate zero cost to a lot of shuffles because we know they'll be merged into the load or add. But the same shuffles, in a different pattern, will demand a sequence of insert/extract instructions that will cost a lot more than zero. This is my point, that accuracy is not possible at this level and that, as you say, it's an orthogonal problem to solve. Maybe the word we're looking for is "reliably"?
				Ayal: Yes, "reliably" or "consistently" better describe the intent here. The "accuracy" still depends on TTI. Fix will appear in next upload.
				4. Support vectorizing additional constructs:

				a. vectorization of Outer-loops.
				In particular, VPlan must be able to represent the control-flow of a
				vectorized loop which may include multiple basic-blocks and nested loops.
				b. SLP vectorization.
				c. Combinations of the above, including nested vectorization: vectorizing
				mkuper: If this happens, it would be fantastic. How would this design help support this?
				Ayal: The original thought of a Recipe should also match an SLP 'bundle'. An interleave-group recipe for instance takes care of a group of related loads or stores, albeit doing more than stitching them together to produce a single vector load or store. Modeling SLP requires more attention.
				both an inner loop and an outerloop at the same time (each with its own
				VF and UF), mixed vectorization: vectorizing a loop and SLP patterns
				inside [7]_, (re)vectorizing vector code.

				5. Support multiple candidates efficiently:
				In particular, similar candidates related to a range of possible VF's and
				UF's must be represented efficiently.
				In particular, potential versionings must be supported efficiently.

				rengolin: This is the holy grail, where a lot of attention to detail and benchmarking will have to be done.
				6. Compact: the Vectorization Plan must be efficient and provide as compact a
				representation as possible. In particular where the transformation is
				straightforward, and where the plan is to reuse existing IR (e.g.,
				leftover iterations).
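One way to picture guideline 5, representing similar candidates for a range of VF's compactly, is to let a single plan cover a whole range of VF's for which every recorded decision comes out the same, and to split the range only where some decision changes. The sketch below illustrates that idea only; `VFRangeSketch`, `decideAndClamp` and `partitionVFs` are hypothetical names, and a "decision" is reduced to a caller-supplied yes/no predicate on the VF.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical half-open range [Start, End) of power-of-two VFs that can
// share one plan because every decision is identical across the range.
struct VFRangeSketch {
  unsigned Start;
  unsigned End;
};

// Evaluates Decision at Range.Start, then shrinks Range.End to the first VF
// whose answer differs, so the returned answer holds for the whole range.
inline bool decideAndClamp(const std::function<bool(unsigned)> &Decision,
                           VFRangeSketch &Range) {
  assert(Range.Start >= 1 && Range.Start < Range.End && "bad VF range");
  bool AtStart = Decision(Range.Start);
  for (unsigned VF = Range.Start * 2; VF < Range.End; VF *= 2)
    if (Decision(VF) != AtStart) {
      Range.End = VF;
      break;
    }
  return AtStart;
}

// Splits the VFs MinVF, 2*MinVF, ... up to MaxVF into maximal sub-ranges on
// which Decision is constant; one plan per sub-range would then suffice.
inline std::vector<VFRangeSketch>
partitionVFs(unsigned MinVF, unsigned MaxVF,
             const std::function<bool(unsigned)> &Decision) {
  assert(MinVF >= 1 && "VF must be positive");
  std::vector<VFRangeSketch> Ranges;
  for (unsigned VF = MinVF; VF <= MaxVF;) {
    VFRangeSketch R{VF, MaxVF + 1};
    decideAndClamp(Decision, R);
    Ranges.push_back(R);
    VF = R.End; // continue right after the clamped range
  }
  return Ranges;
}
```

With several decisions per loop, each one would clamp the same range in turn, so the number of plans grows with the number of points where some decision flips rather than with the number of VF's.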

				VPlan Classes: Definitions
				==========================

				:VPlan:
				A recipe for generating a vectorized version from a given IR code.
				Takes a "scenario-based approach" to vectorization planning.
				mkuper: So, given the text below, by "a given IR code" you mean "a given IR SESE region", right?
				Ayal: Right.
				The given IR code is required to be SESE, mainly to simplify dominance
				information. This vectorized version is represented using a Hierarchical CFG.

				:Hierarchical CFG:
				A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's.
				The Hierarchical CFG data structure we use is similar to the Tile Tree [8]_,
				where cross-Tile edges are lifted to connect Tiles instead of the original
				basic-blocks as in Sharir [9]_, promoting the Tile encapsulation. We use the
				terms Region and Block rather than Tile [8]_ to avoid confusion with loop
				tiling.
				mkuper: Each Region is still required to be SESE, right?
				Ayal: Right.
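A Hierarchical CFG of this kind can be sketched with a single node type whose edges connect siblings only, so that an edge entering or leaving a Region attaches to the Region itself rather than to a block inside it. The code below is a minimal illustration under assumptions: `BlockSketch`, `collectLeaves` and `leafNames` are hypothetical names, one struct serves as both basic-block and Region, and the traversal is a plain depth-first walk rather than anything the patch prescribes.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node of a hierarchical CFG. A node with a non-null Entry is a
// Region wrapping a nested single-entry single-exit sub-graph; any other node
// is a leaf basic-block.
struct BlockSketch {
  std::string Name;
  std::vector<BlockSketch *> Successors; // edges to sibling nodes only
  BlockSketch *Entry = nullptr;          // set only for Regions
  BlockSketch *Exit = nullptr;           // set only for Regions

  explicit BlockSketch(std::string N) : Name(std::move(N)) {}
  bool isRegion() const { return Entry != nullptr; }
};

// Depth-first walk that descends into Regions and records leaf names.
inline void collectLeaves(BlockSketch *B, std::set<BlockSketch *> &Visited,
                          std::vector<std::string> &Out) {
  if (!B || !Visited.insert(B).second)
    return;
  if (B->isRegion())
    collectLeaves(B->Entry, Visited, Out); // inner edges stay inside
  else
    Out.push_back(B->Name);
  for (BlockSketch *S : B->Successors)
    collectLeaves(S, Visited, Out);
}

inline std::vector<std::string> leafNames(BlockSketch *Top) {
  std::set<BlockSketch *> Visited;
  std::vector<std::string> Out;
  collectLeaves(Top, Visited, Out);
  return Out;
}
```

Because the Region's Exit has no successors of its own, code walking the outer level never needs to know what the Region contains, which is the encapsulation the definition above refers to.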

				:VPBasicBlock:
				Serves as the leaf of the Hierarchical CFG. Represents a sequence of
				instructions that will appear consecutively in a basic block of the vectorized
				version. The instructions of such a basic block originate from one or more
				VPBasicBlocks.
				mkuper: How can they originate from more than one basic block? Can you be more explicit about the relationship between IR basic blocks and VP basic blocks?
				Ayal: The sentence above meant to emphasize that VPBasicBlocks are not necessarily maximal. A VPBasicBlock can have a single successor VPBasicBlock, which in turn has a single predecessor. Eventually both will contribute their instructions into a common IR basic block of the vectorized version.
				The VPBasicBlock takes care of the control-flow
				relations with other VPBasicBlock's and Regions.
				mkuper: Something here is garbled.
				Ayal: How so?
				Holds a sequence of zero or more
				VPRecipe's that take care of representing the instructions.
				A VPBasicBlock that holds no VPRecipe's represents no instructions; this
				may happen, e.g., to support disjoint Regions and to ensure Regions have a
				single exit, possibly an empty one.

				mkuper: What do you mean by supporting disjoint Regions?
				rengolin: If this is related to detecting in-loop disjoint regions, wouldn't it be easier to do that transformation before the vectoriser and generate two loops, so that they're always SESE or not worth looking?
				Ayal: Two VPRegions are disjoint (or else contain one another) in the sense that they share no common VP(Basic)Block. Say we have two if-then-else hammocks one after the other, and we wish to wrap each in a VPRegion. To avoid having the Exit of the first be the Entry of the second, it should be split somehow, possibly resulting in an empty block.
				:VPRecipeBase:
				A base class describing one or more instructions that will appear
				consecutively in the vectorized version, based on Instructions from the given
				IR.
				These Instructions are referred to as the "Ingredients" of the Recipe.
				A Recipe specifies how its ingredients are to be vectorized: e.g.,
				copy or reuse them as uniform, scalarize or vectorize them according to an
				enclosing loop dimension, vectorize them according to internal SLP dimension.

				Design principle: in order to reason about how to vectorize an Instruction
				or how much it would cost, one has to consult the VPRecipe holding it.

				rengolin: This makes sense. And VRecipes consult the target information based on instructions and sizes. It should be straightforward to have cross-target VRecipes just based on isLegal() and friends, which would prune all unsupported functionality really early.
				Ayal: VPRecipes should indeed be cross-target. If something in the loop is illegal we avoid building any VPlans for it in the first place, as a form of early pruning. OTOH, it should conceptually be possible, and arguably profitable, to scalarize any instruction.
				Design principle: when a sequence of instructions conveys additional
				information as a group, we use a VPRecipe to encapsulate them and attach
				this information to the VPRecipe. For instance a VPRecipe can model an
				interleave group of loads or stores with additional information for
				calculating their cost and performing code-gen, as a group.

				rengolin: This is less clear. I understand how one VRecipe that knows about interleave access to worry about this, but I don't understand how we're going to tell the others that have no such concept to not ignore it and calculate the costs correctly. Our current interleave cost calculations may not be enough to bar unrelated changes.
				Ayal: Another way to view this: each instruction that will eventually appear in the vectorized loop will be generated by a VPRecipe. This recipe also takes care of estimating its cost. It is possible for several such instructions to be generated by one common VPRecipe, as in the case of an interleave group recipe.
				Design principle: where possible a VPRecipe should reuse the existing
				container of its ingredients. A new container should be opened on-demand,
				e.g., to facilitate changing the order of Instructions between original
				and vectorized versions.

				rengolin: I'd advise against doing code modifications in an already vectorised block. You may be able to get an extra oomph, but legality analysis might have to be done all over again, which can significantly add costs to an already expensive process. Not to mention that the complexity of the task will likely leak illegal transformations through.
				Ayal: The loop vectorizer already moves code effectively, when it forms interleave groups and sinks scalar operands. There may be additional opportunities during vectorization that involve code modification. These admittedly must be done carefully, but potentially impact performance significantly (by more than an extra oomph) and so ought to be represented and estimated reliably.

				:VPOneByOneRecipeBase:
				Represents recipes which transform each Instruction in their Ingredients
				independently, in order.
				The Ingredients are a sub-sequence of original Instructions, which reside in
				the same IR BasicBlock and in the same order. The Ingredients are
				accessed by a pointer to the first and last Instruction in their original IR
				basic block. Serves as a base class for the concrete sub-classes
				VPScalarizeOneByOneRecipe and VPVectorizeOneByOneRecipe.

				:VPScalarizeOneByOneRecipe:
				A concrete VPRecipe which scalarizes each ingredient, generating either
				instances of lane 0 for a uniform instruction, or instances for a range of
				lanes otherwise.

				:VPVectorizeOneByOneRecipe:
				A concrete VPRecipe which vectorizes each ingredient.

				:VPInterleaveRecipe:
				A concrete VPRecipe which transforms an interleave group of loads or stores
				into one wide load/store and shuffles.

				:VPConditionBitRecipeBase:
				A base class for VPRecipes which provide the condition bit feeding a
				conditional branch. Such cases correspond to scalarized or uniform branches.

				:VPExtractMaskBitRecipe:
				A concrete VPRecipe which represents the extraction of a bit from a mask,
				needed when scalarizing a conditional branch.
				Such branches are needed to guard scalarized and predicated instructions.

				:VPMergeScalarizeBranchRecipe:
				A concrete VPRecipe which represents the phis needed when control converges
				back from a scalarized branch.
				Such phis are needed to merge live-out values that are set under a
				scalarized branch. They can be scalar or vector, depending on the user of the
				live-out value.

				:VPWidenIntInductionRecipe:
				A concrete VPRecipe which widens integer inductions, producing their vector
				values and computing the necessary values for producing their scalar values.
				The scalar values themselves are generated, possibly elsewhere, by the
				complementing VPBuildScalarStepsRecipe.

				:VPBuildScalarStepsRecipe:
				A concrete VPRecipe complementing the handling of integer induction variables,
				responsible for generating the scalar values used by the IV's scalar users.

				:VPRegionBlock:
				A collection of VPBasicBlocks and VPRegionBlocks which form a
				single-entry-single-exit subgraph of the CFG in the vectorized code.

				Design principle: When some additional information relates to an SESE set
				of VPBlocks, we use a VPRegionBlock to wrap them and attach the information to
				it. For example, a VPRegionBlock can be used to indicate that a scalarized
				SESE region is to be replicated. It is also designed to support predicating
				divergent branches while retaining uniform branches as much as possible or
				desirable, and to represent inner loops.

				:VPBlockBase:
				The building block of the Hierarchical CFG. A VPBlockBase can be either a
				VPBasicBlock or a VPRegionBlock.
				A VPBlockBase may indicate that its contents are
				to be replicated several times. This is designed to support scalarizing
				VPBlockBases, which generate VF replicas of their instructions that in turn
				remain scalar, and to do so using a single VPlan for multiple candidate VFs.

				:VPTransformState:
				Stores information used for code generation, passed from the Planner to its
				selected VPlan for execution, and used to pass additional information down
				from VPBlocks to the VPRecipes.
				mkuper: What kind of information?
				Ayal: For instance: where to generate the next instruction(s), by passing an IRBuilder. When generating scalarized predicated instructions, passing which lane to generate a scalar instance for.

				:VPlanUtils:
				Contains a collection of methods for the construction and modification of
				abstract VPlans.

				:VPlanUtilsLoopVectorizer:
				Derived from VPlanUtils, providing additional methods for the construction and
				modification of VPlans.

				:LoopVectorizationPlanner:
				The object in charge of creating and manipulating VPlans for a given IR code.


				VPlan Classes: Diagram
				======================

				The classes of VPlan with main fields and methods; sub-classes of VPRecipeBase
				are shown in a separate figure:

				.. image:: VPlanUML.png

				rengolin: This looks mostly ok...

				The class hierarchy of VPlan's VPRecipeBase class:

				.. image:: VPlanRecipesUML.png

				rengolin: What do you mean by `vectorize()`? Is that `legal+cost+transform`? To avoid paying cost analysis and transformation costs up-front, wouldn't it make more sense to do each step for all plans as a group? Assuming VPlans may expose different legality issues, you should:

				    vec<VPlan> TODO;
				    for (auto P : Plans) {
				      if (P->isLegal())
				        TODO.push_back(P);
				    }

				Then, compute the costs first, and only `vectorize()` the assumed cheapest:

				    unsigned MinCost = ScalarCost;
				    VPlan Best;
				    for (auto P : TODO) {
				      unsigned Cost = P->cost();
				      if (MinCost > Cost) {
				        Best = P;
				        MinCost = Cost;
				      }
				    }
				    Best->transform();

				Ayal: By "vectorize()" we mean "transform". The latter verb can be used instead if clearer? Yes, we only call "vectorize()" / "transform()" on Best VPlan, after finding it according to relative costs, as suggested above. Currently VPlans are built only after passing all "isLegal()" checks. So every VPlan is legal by construction.
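
The flow agreed on in this exchange (every plan is legal by construction, pick the cheapest, transform only that one) can be sketched as follows. This is an illustrative sketch, not code from the patch: `CandidatePlan` and `selectBestVF` are hypothetical names, and normalizing each plan's cost by its VF before comparing against the scalar cost is an assumption made here for the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of one candidate plan; not an LLVM type.
struct CandidatePlan {
  unsigned VF; // vectorization factor of this candidate
  double Cost; // estimated cost of one vector iteration (assumed given)
};

// Return the VF whose cost per original iteration is lowest, or 1 when no
// candidate beats the scalar loop. No legality filter is needed because
// every candidate is assumed to be legal by construction.
unsigned selectBestVF(const std::vector<CandidatePlan> &Plans,
                      double ScalarCost) {
  unsigned BestVF = 1;
  double MinCost = ScalarCost;
  for (const CandidatePlan &P : Plans) {
    assert(P.VF > 1 && "candidates are expected to be vector plans");
    double PerIteration = P.Cost / P.VF;
    if (PerIteration < MinCost) {
      MinCost = PerIteration;
      BestVF = P.VF;
    }
  }
  return BestVF;
}
```

Only the plan matching the returned VF would then be handed to the transform step; a result of 1 means the scalar loop is kept.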

				Integration with LoopVectorize.cpp/processLoop()
				================================================

				Here's the integration within LoopVectorize.cpp's existing flow, in
				LoopVectorizePass::processLoop(Loop \*L):

				1. Plan only after passing all early bail-outs:

				a. including those that take place after Legal, which is kept intact;
				b. including those that use the Cost Model - refactor it slightly to expose
				its MaxVF upper bound and canVectorize() early exit:

				rengolin: Couldn't different VPlans expose different legality issues? For example, for different combination of UFs and VFs?
				mkuper: I think this is why the design says that Legal would "encode constraints". I guess the LVP would have to consult Legal per-potential-VPlan to see whether it's feasible or not?
				rengolin: Right, this is what I didn't understand. We can do it both ways: one legal consulting all plans, or each plan consulting legal. I'd prefer we act on plans for everything, as it would be a cleaner concept and an easier move. As I wrote in another comment: first we iterate through all plans and discard all illegals, then we calculate the costs, sort and pick the best. We could even (in the far future) have multiple costs per plan (VF, UF, code-size, hazards, etc) and sort by a formula that takes all of them into account.
				Ayal: The current design performs all Legal checks before starting to work with VPlans, retaining the separation between correctness and performance considerations. This could be revisited if desired. If Legal determines that a loop cannot be vectorized at all, no VPlans are built. If Legal determines that a loop can be vectorized but only under certain restrictions, all VPlans built will comply with all these restrictions, and offer distinct alternatives how to vectorize. Current restrictions may be: VF small enough, pass runtime guards regarding potential aliasing and non-unit strides, handle cross-iteration dependencies such as reductions, 1st order recurrences, live-outs. How to calculate the cost(s) of each VPlan and find the best one is subject to a separate, follow-up patch.
				rengolin: Right, this is what I meant. Filter all VPlans by legality first. I do like the idea that no illegal VPlan can exist. Costs are for a later patch, I agree.

				.. code-block:: c++

				// Check if the target supports potentially unsafe FP vectorization.
				// FIXME: Add a check for the type of safety issue (denormal, signaling)
				// for the target we're vectorizing for, to make sure none of the
				// additional fp-math flags can help.
				if (Hints.isPotentiallyUnsafe() &&
				TTI->isFPVectorizationPotentiallyUnsafe()) {
				DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n");
				ORE->emit(
				createMissedAnalysis(Hints.vectorizeAnalysisPassName(), "UnsafeFP", L)
				<< "loop not vectorized due to unsafe FP support.");
				emitMissedWarning(F, L, Hints, ORE);
				return false;
				}

				if (!CM.canVectorize(OptForSize))
				return false;

				// Early prune excessive VF's
				unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize);

				// If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail.
				if (OptForSize && CM.requiresTail(MaxVF))
				return false;

				2. Plan:

				a. build VPlans for relevant VF's and optimize them,
				b. compute best cost using Cost Model as before,
				c. compute best interleave-count using Cost Model as before. Steps (a)
				and (b) are refactored into LVP.plan() (see step 4 below):

				.. code-block:: c++

				// Use the planner.
				LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM);

				// Get user vectorization factor.
				unsigned UserVF = Hints.getWidth();

				// Select the vectorization factor.
				LoopVectorizationCostModel::VectorizationFactor VF =
				LVP.plan(OptForSize, UserVF, MaxVF);
				bool VectorizeLoop = (VF.Width > 1);
				rengolin: Right, so this is my "cost loop" above.
				Ayal: Right, LVP.plan takes care of building the relevant VPlans and running the "cost loop", returning the best VF. This "cost loop" currently still uses the existing CostModel, but is planned to transition to use VPlans in the aforementioned follow-up patch.

				std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;

				if (!UserVF && !VectorizeLoop) {
				DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
				VecDiagMsg = std::make_pair(
				"VectorizationNotBeneficial",
				"the cost-model indicates that vectorization is not beneficial");
				}

				// Select the interleave count.
				unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);

				mkuper: The long-term plan is to have LVP produce a VPlan that contains both UF and VF, and this is just a transitional state, right?
				Ayal: VPlan already addresses both UF and VF in this patch, in the sense that when instructed to do so, by calling "vectorize()", it will generate vector code based on both. Determining the best VF and the best UF is still done based on existing CostModel in this patch, by first finding the best VF and then finding the best UF. Follow-up patch is devoted to having both these decisions be based on VPlans as well.
				// Get user interleave count.
				unsigned UserIC = Hints.getInterleave();

				3. Transform:

				a. invoke an Unroller to unroll the loop (as before), or
				b. invoke LVP.executeBestPlan() to vectorize the loop:

				.. code-block:: c++

				if (!VectorizeLoop) {
				assert(IC > 1 && "interleave count should not be 1 or 0");
				// If we decided that it is not legal to vectorize the loop, then
				// interleave it.
				InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
				&CM);
				Unroller.vectorize();

				ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),
				L->getHeader())
				<< "interleaved loop (interleaved count: "
				<< NV("InterleaveCount", IC) << ")");
				} else {

				// If we decided that it is *legal* to vectorize the loop, then do it.
				InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
				&LVL, &CM);

				LVP.executeBestPlan(LB);

				rengolin: And this is my `Best.transform()` above. Looks good.
				Ayal: Right, this will perform the actual transformation. :-)
				++LoopsVectorized;

				// Add metadata to disable runtime unrolling a scalar loop when there are
				// no runtime checks about strides and memory. A scalar loop that is
				// rarely used is not worth unrolling.
				if (!LB.areSafetyChecksAdded())
				AddRuntimeUnrollDisableMetaData(L);

				// Report the vectorization decision.
				ORE->emit(OptimizationRemark(LV_NAME, "Vectorized", L->getStartLoc(),
				L->getHeader())
				<< "vectorized loop (vectorization width: "
				<< NV("VectorizationFactor", VF.Width)
				<< ", interleaved count: " << NV("InterleaveCount", IC) << ")");
				}

				// Mark the loop as already vectorized to avoid vectorizing again.
				Hints.setAlreadyVectorized();

				4. Plan, refactored into LVP.plan():

				a. build VPlans for relevant VF's and optimize them,
				b. compute best cost using Cost Model as before:

				.. code-block:: c++

				LoopVectorizationCostModel::VectorizationFactor
				LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF,
				unsigned MaxVF) {
				if (UserVF) {
				DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
				if (UserVF == 1)
				return {UserVF, 0};
				assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
				// Collect the instructions (and their associated costs) that will be more
				// profitable to scalarize.
				CM->collectInstsToScalarize(UserVF);
				buildInitialVPlans(UserVF, UserVF);
				DEBUG(printCurrentPlans("Initial VPlans", dbgs()));
				optimizePredicatedInstructions();
				DEBUG(printCurrentPlans("After optimize predicated instructions",dbgs()));
				rengolin: This is just analysis, right?
				Ayal: Right, no IR has been transformed nor constructed yet.
				return {UserVF, 0};
				}
				if (MaxVF == 1)
				return {1, 0};

				assert(MaxVF > 1 && "MaxVF is zero.");
				// Collect the instructions (and their associated costs) that will be more
				// profitable to scalarize.
				for (unsigned i = 2; i <= MaxVF; i = i+i)
				CM->collectInstsToScalarize(i);
				buildInitialVPlans(2, MaxVF);
				DEBUG(printCurrentPlans("Initial VPlans", dbgs()));
				optimizePredicatedInstructions();
				DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs()));
				// Select the optimal vectorization factor.
				return CM->selectVectorizationFactor(OptForSize, MaxVF);
				}
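
The choice of candidate VFs made by `plan()` above can be isolated into a small standalone sketch. `candidateVFs` is a hypothetical helper introduced only for illustration; it mirrors the control flow shown (a user-provided VF is the only candidate, otherwise powers of two from 2 up to MaxVF) and adds an overflow guard that the original loop does not need for realistic MaxVF values.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: list the VFs for which plans would be built.
// UserVF == 0 means the user gave no vectorization-factor hint.
std::vector<unsigned> candidateVFs(unsigned UserVF, unsigned MaxVF) {
  std::vector<unsigned> VFs;
  if (UserVF) {
    // A user VF of 1 requests no vectorization, so there is nothing to plan.
    if (UserVF > 1)
      VFs.push_back(UserVF);
    return VFs;
  }
  assert(MaxVF >= 1 && "MaxVF is zero.");
  for (unsigned VF = 2; VF <= MaxVF; VF *= 2) {
    VFs.push_back(VF);
    if (VF > MaxVF / 2) // doubling again would exceed MaxVF (or overflow)
      break;
  }
  return VFs;
}
```

For example, with no user hint and MaxVF of 8 the sketch yields 2, 4 and 8, which matches the range passed to `buildInitialVPlans(2, MaxVF)` in the listing above.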

docs/Vectorizers.rst

	The Y-axis shows the time in msec. Lower is better. The last column shows the geomean of all the kernels.			The Y-axis shows the time in msec. Lower is better. The last column shows the geomean of all the kernels.

	.. image:: gcc-loops.png			.. image:: gcc-loops.png

	And Linpack-pc with the same configuration. Result is Mflops, higher is better.			And Linpack-pc with the same configuration. Result is Mflops, higher is better.

	.. image:: linpack-pc.png			.. image:: linpack-pc.png

				Internals
				---------

				.. toctree::
				:hidden:

				VectorizationPlan

				:doc:`VectorizationPlan`
				The loop vectorizer is based on an abstract representation called Vectorization Plan.
				This document describes its philosophy and design.

	.. _slp-vectorizer:			.. _slp-vectorizer:

	The SLP Vectorizer			The SLP Vectorizer
	==================			==================

	Details			Details
	-------			-------


lib/Transforms/Vectorize/CMakeLists.txt

	add_llvm_library(LLVMVectorize			add_llvm_library(LLVMVectorize
	BBVectorize.cpp			BBVectorize.cpp
	LoadStoreVectorizer.cpp			LoadStoreVectorizer.cpp
	LoopVectorize.cpp			LoopVectorize.cpp
	SLPVectorizer.cpp			SLPVectorizer.cpp
	Vectorize.cpp			Vectorize.cpp
				VPlan.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms			${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms

	DEPENDS			DEPENDS
	intrinsics_gen			intrinsics_gen
	)			)

lib/Transforms/Vectorize/LoopVectorize.cpp


// A. Zaks and D. Nuzman. Autovectorization in GCC-two years later.		// A. Zaks and D. Nuzman. Autovectorization in GCC-two years later.
//		//
// S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of		// S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of
// Vectorizing Compilers.		// Vectorizing Compilers.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/Transforms/Vectorize/LoopVectorize.h"		#include "llvm/Transforms/Vectorize/LoopVectorize.h"
		#include "VPlan.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/Hashing.h"		#include "llvm/ADT/Hashing.h"
#include "llvm/ADT/MapVector.h"		#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/SCCIterator.h"		#include "llvm/ADT/SCCIterator.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
Show All 34 Lines
#include "llvm/Transforms/Scalar.h"		#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"		#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/Local.h"		#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Utils/LoopSimplify.h"		#include "llvm/Transforms/Utils/LoopSimplify.h"
#include "llvm/Transforms/Utils/LoopUtils.h"		#include "llvm/Transforms/Utils/LoopUtils.h"
#include "llvm/Transforms/Utils/LoopVersioning.h"		#include "llvm/Transforms/Utils/LoopVersioning.h"
#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"
#include <algorithm>		#include <algorithm>
		#include <functional>
#include <map>		#include <map>
#include <tuple>		#include <tuple>

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;

#define LV_NAME "loop-vectorize"		#define LV_NAME "loop-vectorize"
#define DEBUG_TYPE LV_NAME		#define DEBUG_TYPE LV_NAME
▲ Show 20 Lines • Show All 286 Lines • ▼ Show 20 Lines
/// * Scalarization (implementation using scalars) of un-vectorizable		/// * Scalarization (implementation using scalars) of un-vectorizable
/// instructions.		/// instructions.
/// InnerLoopVectorizer does not perform any vectorization-legality		/// InnerLoopVectorizer does not perform any vectorization-legality
/// checks, and relies on the caller to check for the different legality		/// checks, and relies on the caller to check for the different legality
/// aspects. The InnerLoopVectorizer relies on the		/// aspects. The InnerLoopVectorizer relies on the
/// LoopVectorizationLegality class to provide information about the induction		/// LoopVectorizationLegality class to provide information about the induction
/// and reduction variables that were found to a given vectorization factor.		/// and reduction variables that were found to a given vectorization factor.
class InnerLoopVectorizer {		class InnerLoopVectorizer {
		friend class LoopVectorizationPlanner;
		friend class llvm::VPlan;

public:		public:
InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,		InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
LoopInfo *LI, DominatorTree *DT,		LoopInfo *LI, DominatorTree *DT,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
const TargetTransformInfo *TTI, AssumptionCache *AC,		const TargetTransformInfo *TTI, AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, unsigned VecWidth,		OptimizationRemarkEmitter *ORE, unsigned VecWidth,
unsigned UnrollFactor, LoopVectorizationLegality *LVL,		unsigned UnrollFactor, LoopVectorizationLegality *LVL,
LoopVectorizationCostModel *CM)		LoopVectorizationCostModel *CM)
Show All 30 Lines	protected:
/// original loop, when scalarized, is represented by UF x VF scalar values		/// original loop, when scalarized, is represented by UF x VF scalar values
/// in the new unrolled loop, where UF is the unroll factor and VF is the		/// in the new unrolled loop, where UF is the unroll factor and VF is the
/// vectorization factor.		/// vectorization factor.
typedef SmallVector<SmallVector<Value *, 4>, 2> ScalarParts;		typedef SmallVector<SmallVector<Value *, 4>, 2> ScalarParts;

// When we if-convert we need to create edge masks. We have to cache values		// When we if-convert we need to create edge masks. We have to cache values
// so that we don't end up with exponential recursion/IR.		// so that we don't end up with exponential recursion/IR.
typedef DenseMap<std::pair<BasicBlock *, BasicBlock *>, VectorParts>		typedef DenseMap<std::pair<BasicBlock *, BasicBlock *>, VectorParts>
EdgeMaskCache;		EdgeMaskCacheTy;
		typedef DenseMap<BasicBlock *, VectorParts> BlockMaskCacheTy;

/// Create an empty loop, based on the loop ranges of the old loop.		/// Create an empty loop, based on the loop ranges of the old loop.
void createEmptyLoop();		void createEmptyLoop();

/// Set up the values of the IVs correctly when exiting the vector loop.		/// Set up the values of the IVs correctly when exiting the vector loop.
void fixupIVUsers(PHINode *OrigPhi, const InductionDescriptor &II,		void fixupIVUsers(PHINode *OrigPhi, const InductionDescriptor &II,
Value *CountRoundDown, Value *EndValue,		Value *CountRoundDown, Value *EndValue,
BasicBlock *MiddleBlock);		BasicBlock *MiddleBlock);

/// Create a new induction variable inside L.		/// Create a new induction variable inside L.
PHINode *createInductionVariable(Loop *L, Value *Start, Value *End,		PHINode *createInductionVariable(Loop *L, Value *Start, Value *End,
Value *Step, Instruction *DL);		Value *Step, Instruction *DL);
/// Copy and widen the instructions from the old loop.		/// Copy and widen the instructions from the old loop.
virtual void vectorizeLoop();		virtual void vectorizeLoop();

		/// Handle all cross-iteration phis in the header.
		void fixCrossIterationPHIs();

/// Fix a first-order recurrence. This is the second phase of vectorizing		/// Fix a first-order recurrence. This is the second phase of vectorizing
/// this phi node.		/// this phi node.
void fixFirstOrderRecurrence(PHINode *Phi);		void fixFirstOrderRecurrence(PHINode *Phi);

		/// Fix a reduction cross-iteration phi. This is the second phase of
		/// vectorizing this phi node.
		void fixReduction(PHINode *Phi);

/// \brief The Loop exit block may have single value PHI nodes where the		/// \brief The Loop exit block may have single value PHI nodes where the
/// incoming value is 'Undef'. While vectorizing we only handled real values		/// incoming value is 'Undef'. While vectorizing we only handled real values
/// that were defined inside the loop. Here we fix the 'undef case'.		/// that were defined inside the loop. Here we fix the 'undef case'.
/// See PR14725.		/// See PR14725.
void fixLCSSAPHIs();		void fixLCSSAPHIs();

/// Iteratively sink the scalarized operands of a predicated instruction into
/// the block that was created for it.
void sinkScalarOperands(Instruction *PredInst);

/// Predicate conditional instructions that require predication on their
/// respective conditions.
void predicateInstructions();

/// Collect the instructions from the original loop that would be trivially		/// Collect the instructions from the original loop that would be trivially
/// dead in the vectorized loop if generated.		/// dead in the vectorized loop if generated.
void collectTriviallyDeadInstructions();		static void collectTriviallyDeadInstructions(
		Loop *OrigLoop, LoopVectorizationLegality *Legal,
		SmallPtrSetImpl<Instruction *> &DeadInstructions);

/// Shrinks vector element sizes to the smallest bitwidth they can be legally		/// Shrinks vector element sizes to the smallest bitwidth they can be legally
/// represented as.		/// represented as.
void truncateToMinimalBitwidths();		void truncateToMinimalBitwidths();

		public:
/// A helper function that computes the predicate of the block BB, assuming		/// A helper function that computes the predicate of the block BB, assuming
/// that the header block of the loop is set to True. It returns the entry		/// that the header block of the loop is set to True. It returns the entry
/// mask for the block BB.		/// mask for the block BB.
VectorParts createBlockInMask(BasicBlock *BB);		VectorParts createBlockInMask(BasicBlock *BB);

		protected:
/// A helper function that computes the predicate of the edge between SRC		/// A helper function that computes the predicate of the edge between SRC
/// and DST.		/// and DST.
VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst);		VectorParts createEdgeMask(BasicBlock *Src, BasicBlock *Dst);

/// A helper function to vectorize a single BB within the innermost loop.
void vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV);

/// Vectorize a single PHINode in a block. This method handles the induction		/// Vectorize a single PHINode in a block. This method handles the induction
/// variable canonicalization. It supports both VF = 1 for unrolled loops and		/// variable canonicalization. It supports both VF = 1 for unrolled loops and
/// arbitrary length vectors.		/// arbitrary length vectors.
void widenPHIInstruction(Instruction *PN, unsigned UF, unsigned VF,		void widenPHIInstruction(Instruction *PN, unsigned UF, unsigned VF,
PhiVector *PV);		PhiVector *PV);

/// Insert the new loop to the loop hierarchy and pass manager		/// Insert the new loop to the loop hierarchy and pass manager
/// and update the analysis passes.		/// and update the analysis passes.
void updateAnalysis();		void updateAnalysis();

/// This instruction is un-vectorizable. Implement it as a sequence		public:
/// of scalars. If \p IfPredicateInstr is true we need to 'hide' each		/// A helper function to vectorize a single Instruction in the innermost loop.
/// scalarized instruction behind an if block predicated on the control		virtual void vectorizeInstruction(Instruction &I);
/// dependence of the instruction.
virtual void scalarizeInstruction(Instruction *Instr,		/// A helper function to scalarize a single Instruction in the innermost loop.
bool IfPredicateInstr = false);		/// Generates a sequence of scalar instances for each lane between \p MinLane
		/// and \p MaxLane, times each part between \p MinPart and \p MaxPart,
		/// inclusive.
		void scalarizeInstruction(Instruction *Instr, unsigned MinPart,
		unsigned MaxPart, unsigned MinLane,
		unsigned MaxLane);

		/// Return a value in the new loop corresponding to \p V from the original
		/// loop at unroll index \p Part and vector index \p Lane. If the value has
		/// been vectorized but not scalarized, the necessary extractelement
		/// instruction will be generated.
		Value *getScalarValue(Value *V, unsigned Part, unsigned Lane);

		/// Set a value in the new loop corresponding to \p V from the original
		/// loop at unroll index \p Part and vector index \p Lane. The scalar parts
		/// for this value must already be initialized.
		void setScalarValue(Value *V, unsigned Part, unsigned Lane, Value *Scalar) {
		assert(VectorLoopValueMap.hasScalar(V) &&
		"Cannot set an uninitialized scalar value");
		VectorLoopValueMap.ScalarMapStorage[V][Part][Lane] = Scalar;
		}

		/// Return a value in the new loop corresponding to \p V from the original
		/// loop at unroll index \p Part. If there isn't one, return a null pointer.
		/// Note that the value returned may also be a null pointer if that specific
		/// part has not been generated yet.
		Value *getVectorValue(Value *V, unsigned Part) {
		if (!VectorLoopValueMap.hasVector(V))
		return nullptr;
		return VectorLoopValueMap.VectorMapStorage[V][Part];
		}

		/// Set a value in the new loop corresponding to \p V from the original
		/// loop at unroll index \p Part. The vector parts for this value must already
		/// be initialized.
		void setVectorValue(Value *V, unsigned Part, Value *Vector) {
		assert(VectorLoopValueMap.hasVector(V) &&
		"Cannot set an uninitialized vector value");
		VectorLoopValueMap.VectorMapStorage[V][Part] = Vector;
		}

		/// Construct the vector value of a scalarized value \p V one lane at a time.
		/// This method is for predicated instructions where we'd like the
		/// insert-element instructions to reside in the predicated block to have
		/// them execute only if needed.
		void constructVectorValue(Value *V, unsigned Part, unsigned Lane);

		/// Return a constant reference to the VectorParts corresponding to \p V from
		/// the original loop. If the value has already been vectorized, the
		/// corresponding vector entry in VectorLoopValueMap is returned. If,
		/// however, the value has a scalar entry in VectorLoopValueMap, we construct
		/// new vector values on-demand by inserting the scalar values into vectors
		/// with an insertelement sequence. If the value has been neither vectorized
		/// nor scalarized, it must be loop invariant, so we simply broadcast the
		/// value into vectors.
		const VectorParts &getVectorValue(Value *V);

		protected:
/// Vectorize Load and Store instructions.		/// Vectorize Load and Store instructions.
virtual void vectorizeMemoryInstruction(Instruction *Instr);		virtual void vectorizeMemoryInstruction(Instruction *Instr);

/// Create a broadcast instruction. This method generates a broadcast		/// Create a broadcast instruction. This method generates a broadcast
/// instruction (shuffle) for loop invariant values and for the induction		/// instruction (shuffle) for loop invariant values and for the induction
/// value. If this is the induction variable then we extend it to N, N+1, ...		/// value. If this is the induction variable then we extend it to N, N+1, ...
/// this is needed because each iteration in the loop corresponds to a SIMD		/// this is needed because each iteration in the loop corresponds to a SIMD
/// element.		/// element.
virtual Value *getBroadcastInstrs(Value *V);		virtual Value *getBroadcastInstrs(Value *V);

/// This function adds (StartIdx, StartIdx + Step, StartIdx + 2*Step, ...)		/// This function adds (StartIdx, StartIdx + Step, StartIdx + 2*Step, ...)
/// to each vector element of Val. The sequence starts at StartIndex.		/// to each vector element of Val. The sequence starts at StartIndex.
/// \p Opcode is relevant for FP induction variable.		/// \p Opcode is relevant for FP induction variable.
virtual Value *getStepVector(Value *Val, int StartIdx, Value *Step,		virtual Value *getStepVector(Value *Val, int StartIdx, Value *Step,
Instruction::BinaryOps Opcode =		Instruction::BinaryOps Opcode =
Instruction::BinaryOpsEnd);		Instruction::BinaryOpsEnd);

/// Compute scalar induction steps. \p ScalarIV is the scalar induction
/// variable on which to base the steps, \p Step is the size of the step, and
/// \p EntryVal is the value from the original loop that maps to the steps.
/// Note that \p EntryVal doesn't have to be an induction variable (e.g., it
/// can be a truncate instruction).
void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal);

/// Create a vector induction phi node based on an existing scalar one. This		/// Create a vector induction phi node based on an existing scalar one. This
/// currently only works for integer induction variables with a constant		/// currently only works for integer induction variables with a constant
/// step. \p EntryVal is the value from the original loop that maps to the		/// step. \p EntryVal is the value from the original loop that maps to the
/// vector phi node. If \p EntryVal is a truncate instruction, instead of		/// vector phi node. If \p EntryVal is a truncate instruction, instead of
/// widening the original IV, we widen a version of the IV truncated to \p		/// widening the original IV, we widen a version of the IV truncated to \p
/// EntryVal's type.		/// EntryVal's type.
void createVectorIntInductionPHI(const InductionDescriptor &II,		void createVectorIntInductionPHI(const InductionDescriptor &II,
Instruction *EntryVal);		Instruction *EntryVal);

/// Widen an integer induction variable \p IV. If \p Trunc is provided, the
/// induction variable will first be truncated to the corresponding type.
void widenIntInduction(PHINode *IV, TruncInst *Trunc = nullptr);

/// Returns true if an instruction \p I should be scalarized instead of		/// Returns true if an instruction \p I should be scalarized instead of
/// vectorized for the chosen vectorization factor.		/// vectorized for the chosen vectorization factor.
bool shouldScalarizeInstruction(Instruction *I) const;		bool shouldScalarizeInstruction(Instruction *I) const;

/// Returns true if we should generate a scalar version of \p IV.		/// Returns true if we should generate a scalar version of \p IV.
bool needsScalarInduction(Instruction *IV) const;		bool needsScalarInduction(Instruction *IV) const;

/// Return a constant reference to the VectorParts corresponding to \p V from		public:
/// the original loop. If the value has already been vectorized, the
/// corresponding vector entry in VectorLoopValueMap is returned. If,
/// however, the value has a scalar entry in VectorLoopValueMap, we construct
/// new vector values on-demand by inserting the scalar values into vectors
/// with an insertelement sequence. If the value has been neither vectorized
/// nor scalarized, it must be loop invariant, so we simply broadcast the
/// value into vectors.
const VectorParts &getVectorValue(Value *V);

/// Return a value in the new loop corresponding to \p V from the original
/// loop at unroll index \p Part and vector index \p Lane. If the value has
/// been vectorized but not scalarized, the necessary extractelement
/// instruction will be generated.
Value *getScalarValue(Value *V, unsigned Part, unsigned Lane);

/// Try to vectorize the interleaved access group that \p Instr belongs to.		/// Try to vectorize the interleaved access group that \p Instr belongs to.
void vectorizeInterleaveGroup(Instruction *Instr);		void vectorizeInterleaveGroup(Instruction *Instr);

		/// Widen an integer induction variable \p IV. If \p Trunc is provided, the
		/// induction variable will first be truncated to the corresponding type.
		std::pair<Value *, Value *> widenIntInduction(bool NeedsScalarIV, PHINode *IV,
		TruncInst *Trunc = nullptr);

		/// Compute scalar induction steps. \p ScalarIV is the scalar induction
		/// variable on which to base the steps, \p Step is the size of the step, and
		/// \p EntryVal is the value from the original loop that maps to the steps.
		/// Note that \p EntryVal doesn't have to be an induction variable (e.g., it
		/// can be a truncate instruction).
		void buildScalarSteps(Value *ScalarIV, Value *Step, Value *EntryVal,
		unsigned MinPart, unsigned MaxPart, unsigned MinLane,
		unsigned MaxLane);

		protected:
/// Generate a shuffle sequence that will reverse the vector Vec.		/// Generate a shuffle sequence that will reverse the vector Vec.
virtual Value *reverseVector(Value *Vec);		virtual Value *reverseVector(Value *Vec);

/// Returns (and creates if needed) the original loop trip count.		/// Returns (and creates if needed) the original loop trip count.
Value *getOrCreateTripCount(Loop *NewLoop);		Value *getOrCreateTripCount(Loop *NewLoop);

/// Returns (and creates if needed) the trip count of the widened loop.		/// Returns (and creates if needed) the trip count of the widened loop.
Value *getOrCreateVectorTripCount(Loop *NewLoop);		Value *getOrCreateVectorTripCount(Loop *NewLoop);
...	const ScalarParts &initScalar(Value *Key, const ScalarParts &Entry) {
[&](const SmallVectorImpl<Value *> &Values) -> bool {		[&](const SmallVectorImpl<Value *> &Values) -> bool {
return Values.size() == VF;		return Values.size() == VF;
}) &&		}) &&
"ScalarParts has wrong dimensions");		"ScalarParts has wrong dimensions");
ScalarMapStorage[Key] = Entry;		ScalarMapStorage[Key] = Entry;
return ScalarMapStorage[Key];		return ScalarMapStorage[Key];
}		}

		ScalarParts &getOrCreateScalar(Value *Key, unsigned Lanes) {
		if (!hasScalar(Key)) {
		ScalarParts Entry(UF);
		for (unsigned Part = 0; Part < UF; ++Part)
		Entry[Part].resize(Lanes);
		ScalarMapStorage[Key] = Entry;
		}
		return ScalarMapStorage[Key];
		}

/// \return A reference to the vector map entry corresponding to \p Key.		/// \return A reference to the vector map entry corresponding to \p Key.
/// The key should already be in the map. This function should only be used		/// The key should already be in the map. This function should only be used
/// when it's necessary to update values that have already been vectorized.		/// when it's necessary to update values that have already been vectorized.
/// This is the case for "fix-up" operations including type truncation and		/// This is the case for "fix-up" operations including type truncation and
/// the second phase of recurrence vectorization. If a non-const reference		/// the second phase of recurrence vectorization. If a non-const reference
/// isn't required, getVectorValue should be used instead.		/// isn't required, getVectorValue should be used instead.
VectorParts &getVector(Value *Key) {		VectorParts &getVector(Value *Key) {
assert(hasVector(Key) && "Vector entry not initialized");		assert(hasVector(Key) && "Vector entry not initialized");
return VectorMapStorage.find(Key)->second;		return VectorMapStorage.find(Key)->second;
}		}

/// Retrieve an entry from the vector or scalar maps. The preferred way to		/// Retrieve an entry from the vector or scalar maps. The preferred way to
/// access an existing mapped entry is with getVectorValue or		/// access an existing mapped entry is with getVectorValue or
/// getScalarValue from InnerLoopVectorizer. Until those functions can be		/// getScalarValue from InnerLoopVectorizer. Until those functions can be
/// moved inside ValueMap, we have to declare them as friends.		/// moved inside ValueMap, we have to declare them as friends.
friend const VectorParts &InnerLoopVectorizer::getVectorValue(Value *V);		friend const VectorParts &InnerLoopVectorizer::getVectorValue(Value *V);
friend Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part,		friend Value *InnerLoopVectorizer::getScalarValue(Value *V, unsigned Part,
unsigned Lane);		unsigned Lane);
		friend Value *InnerLoopVectorizer::getVectorValue(Value *V, unsigned Part);
		friend void InnerLoopVectorizer::setScalarValue(Value *V, unsigned Part,
		unsigned Lane,
		Value *Scalar);
		friend void InnerLoopVectorizer::setVectorValue(Value *V, unsigned Part,
		Value *Vector);
		friend void InnerLoopVectorizer::constructVectorValue(Value *V,
		unsigned Part,
		unsigned Lane);

private:		private:
/// The unroll factor. Each entry in the vector map contains UF vector		/// The unroll factor. Each entry in the vector map contains UF vector
/// values.		/// values.
unsigned UF;		unsigned UF;

/// The vectorization factor. Each entry in the scalar map contains UF x VF		/// The vectorization factor. Each entry in the scalar map contains UF x VF
/// scalar values.		/// scalar values.
...	protected:
/// vector elements.		/// vector elements.
unsigned VF;		unsigned VF;

protected:		protected:
/// The vectorization unroll factor to use. Each scalar is vectorized to this		/// The vectorization unroll factor to use. Each scalar is vectorized to this
/// many different vector instructions.		/// many different vector instructions.
unsigned UF;		unsigned UF;

		public:
/// The builder that we use		/// The builder that we use
IRBuilder<> Builder;		IRBuilder<> Builder;

		protected:
// --- Vectorization state ---		// --- Vectorization state ---

/// The vector-loop preheader.		/// The vector-loop preheader.
BasicBlock *LoopVectorPreHeader;		BasicBlock *LoopVectorPreHeader;
/// The scalar-loop preheader.		/// The scalar-loop preheader.
BasicBlock *LoopScalarPreHeader;		BasicBlock *LoopScalarPreHeader;
/// Middle Block between the vector and the scalar.		/// Middle Block between the vector and the scalar.
BasicBlock *LoopMiddleBlock;		BasicBlock *LoopMiddleBlock;
...	protected:
PHINode *OldInduction;		PHINode *OldInduction;

/// Maps values from the original loop to their corresponding values in the		/// Maps values from the original loop to their corresponding values in the
/// vectorized loop. A key value can map to either vector values, scalar		/// vectorized loop. A key value can map to either vector values, scalar
/// values or both kinds of values, depending on whether the key was		/// values or both kinds of values, depending on whether the key was
/// vectorized and scalarized.		/// vectorized and scalarized.
ValueMap VectorLoopValueMap;		ValueMap VectorLoopValueMap;

/// Store instructions that should be predicated, as a pair		EdgeMaskCacheTy EdgeMaskCache;
/// <StoreInst, Predicate>		BlockMaskCacheTy BlockMaskCache;
SmallVector<std::pair<Instruction *, Value *>, 4> PredicatedInstructions;
EdgeMaskCache MaskCache;
/// Trip count of the original loop.		/// Trip count of the original loop.
Value *TripCount;		Value *TripCount;
/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))		/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))
Value *VectorTripCount;		Value *VectorTripCount;

/// The legality analysis.		/// The legality analysis.
LoopVectorizationLegality *Legal;		LoopVectorizationLegality *Legal;

/// The profitability analysis.		/// The profitability analysis.
LoopVectorizationCostModel *Cost;		LoopVectorizationCostModel *Cost;

// Record whether runtime checks are added.		// Record whether runtime checks are added.
bool AddedSafetyChecks;		bool AddedSafetyChecks;

// Holds instructions from the original loop whose counterparts in the
// vectorized loop would be trivially dead if generated. For example,
// original induction update instructions can become dead because we
// separately emit induction "steps" when generating code for the new loop.
// Similarly, we create a new latch condition when setting up the structure
// of the new loop, so the old one can become dead.
SmallPtrSet<Instruction *, 4> DeadInstructions;

// Holds the end values for each induction variable. We save the end values		// Holds the end values for each induction variable. We save the end values
// so we can later fix-up the external users of the induction variables.		// so we can later fix-up the external users of the induction variables.
DenseMap<PHINode *, Value *> IVEndValues;		DenseMap<PHINode *, Value *> IVEndValues;
};		};

class InnerLoopUnroller : public InnerLoopVectorizer {		class InnerLoopUnroller : public InnerLoopVectorizer {
public:		public:
InnerLoopUnroller(Loop *OrigLoop, PredicatedScalarEvolution &PSE,		InnerLoopUnroller(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
LoopInfo *LI, DominatorTree *DT,		LoopInfo *LI, DominatorTree *DT,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
const TargetTransformInfo *TTI, AssumptionCache *AC,		const TargetTransformInfo *TTI, AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, unsigned UnrollFactor,		OptimizationRemarkEmitter *ORE, unsigned UnrollFactor,
LoopVectorizationLegality *LVL,		LoopVectorizationLegality *LVL,
LoopVectorizationCostModel *CM)		LoopVectorizationCostModel *CM)
: InnerLoopVectorizer(OrigLoop, PSE, LI, DT, TLI, TTI, AC, ORE, 1,		: InnerLoopVectorizer(OrigLoop, PSE, LI, DT, TLI, TTI, AC, ORE, 1,
UnrollFactor, LVL, CM) {}		UnrollFactor, LVL, CM) {}

private:		private:
void scalarizeInstruction(Instruction *Instr,		void vectorizeInstruction(Instruction &I) override;
bool IfPredicateInstr = false) override;		void scalarizeInstruction(Instruction *Instr, bool IfPredicateInstr = false);
void vectorizeMemoryInstruction(Instruction *Instr) override;		void vectorizeMemoryInstruction(Instruction *Instr) override;
Value *getBroadcastInstrs(Value *V) override;		Value *getBroadcastInstrs(Value *V) override;
Value *getStepVector(Value *Val, int StartIdx, Value *Step,		Value *getStepVector(Value *Val, int StartIdx, Value *Step,
Instruction::BinaryOps Opcode =		Instruction::BinaryOps Opcode =
Instruction::BinaryOpsEnd) override;		Instruction::BinaryOpsEnd) override;
Value *reverseVector(Value *Vec) override;		Value *reverseVector(Value *Vec) override;

		void vectorizeLoop() override;

		/// Iteratively sink the scalarized operands of a predicated instruction into
		/// the block that was created for it.
		void sinkScalarOperands(Instruction *PredInst);

		/// Predicate conditional instructions that require predication on their
		/// respective conditions.
		void predicateInstructions();

		/// Store instructions that should be predicated, as a pair
		/// <StoreInst, Predicate>
		SmallVector<std::pair<Instruction *, Value *>, 4> PredicatedInstructions;

		// Holds instructions from the original loop whose counterparts in the
		// vectorized loop would be trivially dead if generated. For example,
		// original induction update instructions can become dead because we
		// separately emit induction "steps" when generating code for the new loop.
		// Similarly, we create a new latch condition when setting up the structure
		// of the new loop, so the old one can become dead.
		SmallPtrSet<Instruction *, 4> DeadInstructions;
};		};

/// \brief Look for a meaningful debug location on the instruction or its		/// \brief Look for a meaningful debug location on the instruction or its
/// operands.		/// operands.
static Instruction *getDebugLocFromInstOrOperands(Instruction *I) {		static Instruction *getDebugLocFromInstOrOperands(Instruction *I) {
if (!I)		if (!I)
return I;		return I;

...	LoopVectorizationCostModel(Loop *L, PredicatedScalarEvolution &PSE,
: TheLoop(L), PSE(PSE), LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB),		: TheLoop(L), PSE(PSE), LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB),
AC(AC), ORE(ORE), TheFunction(F), Hints(Hints) {}		AC(AC), ORE(ORE), TheFunction(F), Hints(Hints) {}

/// Information about vectorization costs		/// Information about vectorization costs
struct VectorizationFactor {		struct VectorizationFactor {
unsigned Width; // Vector width with best cost		unsigned Width; // Vector width with best cost
unsigned Cost; // Cost of the loop with that width		unsigned Cost; // Cost of the loop with that width
};		};

		bool canVectorize(bool OptForSize);

		bool requiresTail(unsigned MaxVectorSize);

		/// \return An upper bound for the vectorization factor.
		unsigned computeMaxVectorizationFactor(bool OptForSize);

/// \return The most profitable vectorization factor and the cost of that VF.		/// \return The most profitable vectorization factor and the cost of that VF.
/// This method checks every power of two up to VF. If UserVF is not ZERO		/// This method checks every power of two up to VF. If UserVF is not ZERO
/// then this vectorization factor will be selected if vectorization is		/// then this vectorization factor will be selected if vectorization is
/// possible.		/// possible.
VectorizationFactor selectVectorizationFactor(bool OptForSize);		VectorizationFactor selectVectorizationFactor(bool OptForSize,
		unsigned MaxVF);

/// \return The size (in bits) of the smallest and widest types in the code		/// \return The size (in bits) of the smallest and widest types in the code
/// that needs to be vectorized. We ignore values that remain scalar such as		/// that needs to be vectorized. We ignore values that remain scalar such as
/// 64 bit loop indices.		/// 64 bit loop indices.
std::pair<unsigned, unsigned> getSmallestAndWidestTypes();		std::pair<unsigned, unsigned> getSmallestAndWidestTypes();

/// \return The desired interleave count.		/// \return The desired interleave count.
/// If interleave count has been specified by metadata it will be returned.		/// If interleave count has been specified by metadata it will be returned.
...	public:
/// type.		/// type.
const MapVector<Instruction *, uint64_t> &getMinimalBitwidths() const {		const MapVector<Instruction *, uint64_t> &getMinimalBitwidths() const {
return MinBWs;		return MinBWs;
}		}

/// \returns True if it is more profitable to scalarize instruction \p I for		/// \returns True if it is more profitable to scalarize instruction \p I for
/// vectorization factor \p VF.		/// vectorization factor \p VF.
bool isProfitableToScalarize(Instruction *I, unsigned VF) const {		bool isProfitableToScalarize(Instruction *I, unsigned VF) const {
		// Unroller also calls this method, but does not collectInstsToScalarize.
		if (VF == 1)
		return true;
auto Scalars = InstsToScalarize.find(VF);		auto Scalars = InstsToScalarize.find(VF);
assert(Scalars != InstsToScalarize.end() &&		assert(Scalars != InstsToScalarize.end() &&
"VF not yet analyzed for scalarization profitability");		"VF not yet analyzed for scalarization profitability");
return Scalars->second.count(I);		return Scalars->second.count(I);
}		}

/// Returns true if \p I is known to be uniform after vectorization.		/// Returns true if \p I is known to be uniform after vectorization.
bool isUniformAfterVectorization(Instruction *I, unsigned VF) const {		bool isUniformAfterVectorization(Instruction *I, unsigned VF) const {
...	private:
/// Returns the expected difference in cost from scalarizing the expression		/// Returns the expected difference in cost from scalarizing the expression
/// feeding a predicated instruction \p PredInst. The instructions to		/// feeding a predicated instruction \p PredInst. The instructions to
/// scalarize and their scalar costs are collected in \p ScalarCosts. A		/// scalarize and their scalar costs are collected in \p ScalarCosts. A
/// non-negative return value implies the expression will be scalarized.		/// non-negative return value implies the expression will be scalarized.
/// Currently, only single-use chains are considered for scalarization.		/// Currently, only single-use chains are considered for scalarization.
int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,		int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,
unsigned VF);		unsigned VF);

		public:
/// Collects the instructions to scalarize for each predicated instruction in		/// Collects the instructions to scalarize for each predicated instruction in
/// the loop.		/// the loop.
void collectInstsToScalarize(unsigned VF);		void collectInstsToScalarize(unsigned VF);

		private:
/// Collect the instructions that are uniform after vectorization. An		/// Collect the instructions that are uniform after vectorization. An
/// instruction is uniform if we represent it with a single scalar value in		/// instruction is uniform if we represent it with a single scalar value in
/// the vectorized loop corresponding to each vector iteration. Examples of		/// the vectorized loop corresponding to each vector iteration. Examples of
/// uniform instructions include pointer operands of consecutive or		/// uniform instructions include pointer operands of consecutive or
/// interleaved memory accesses. Note that although uniformity implies an		/// interleaved memory accesses. Note that although uniformity implies an
/// instruction will be scalar, the reverse is not true. In general, a		/// instruction will be scalar, the reverse is not true. In general, a
/// scalarized instruction will be represented by VF scalar values in the		/// scalarized instruction will be represented by VF scalar values in the
/// vectorized loop, each corresponding to an iteration of the original		/// vectorized loop, each corresponding to an iteration of the original
/// scalar loop.		/// scalar loop.
void collectLoopUniforms(unsigned VF);		void collectLoopUniforms(unsigned VF);

/// Collect the instructions that are scalar after vectorization. An		/// Collect the instructions that are scalar after vectorization. An
/// instruction is scalar if it is known to be uniform or will be scalarized		/// instruction is scalar if it is known to be uniform or will be scalarized
/// during vectorization. Non-uniform scalarized instructions will be		/// during vectorization. Non-uniform scalarized instructions will be
/// represented by VF values in the vectorized loop, each corresponding to an		/// represented by VF values in the vectorized loop, each corresponding to an
/// iteration of the original scalar loop.		/// iteration of the original scalar loop.
void collectLoopScalars(unsigned VF);		void collectLoopScalars(unsigned VF);

		public:
/// Collect Uniform and Scalar values for the given \p VF.		/// Collect Uniform and Scalar values for the given \p VF.
/// The sets depend on CM decision for Load/Store instructions		/// The sets depend on CM decision for Load/Store instructions
/// that may be vectorized as interleave, gather-scatter or scalarized.		/// that may be vectorized as interleave, gather-scatter or scalarized.
void collectUniformsAndScalars(unsigned VF) {		void collectUniformsAndScalars(unsigned VF) {
// Do the analysis once.		// Do the analysis once.
if (VF == 1 || Uniforms.count(VF))		if (VF == 1 || Uniforms.count(VF))
return;		return;
setCostBasedWideningDecision(VF);		setCostBasedWideningDecision(VF);
collectLoopUniforms(VF);		collectLoopUniforms(VF);
collectLoopScalars(VF);		collectLoopScalars(VF);
}		}

		private:
/// Keeps cost model vectorization decision and cost for instructions.		/// Keeps cost model vectorization decision and cost for instructions.
/// Right now it is used for memory instructions only.		/// Right now it is used for memory instructions only.
typedef DenseMap<std::pair<Instruction *, unsigned>,		typedef DenseMap<std::pair<Instruction *, unsigned>,
std::pair<InstWidening, unsigned>>		std::pair<InstWidening, unsigned>>
DecisionList;		DecisionList;

DecisionList WideningDecisions;		DecisionList WideningDecisions;

...	public:
/// Loop Vectorize Hint.		/// Loop Vectorize Hint.
const LoopVectorizeHints *Hints;		const LoopVectorizeHints *Hints;
/// Values to ignore in the cost model.		/// Values to ignore in the cost model.
SmallPtrSet<const Value *, 16> ValuesToIgnore;		SmallPtrSet<const Value *, 16> ValuesToIgnore;
/// Values to ignore in the cost model when VF > 1.		/// Values to ignore in the cost model when VF > 1.
SmallPtrSet<const Value *, 16> VecValuesToIgnore;		SmallPtrSet<const Value *, 16> VecValuesToIgnore;
};		};

		/// LoopVectorizationPlanner - builds and optimizes the Vectorization Plans
		/// which record the decisions on how to vectorize the given loop.
		/// In particular, represent the control-flow of the vectorized version,
		/// the replication of instructions that are to be scalarized, and interleave
		/// access groups.
		class LoopVectorizationPlanner {
		public:
		LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI,
		const TargetTransformInfo *TTI,
		LoopVectorizationLegality *Legal,
		LoopVectorizationCostModel *CM)
		: TheLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM),
		ILV(nullptr), BestVF(0), BestUF(0) {}

		~LoopVectorizationPlanner() {}

		/// Plan how to best vectorize, return the best VF and its cost.
		LoopVectorizationCostModel::VectorizationFactor
		plan(bool OptForSize, unsigned UserVF, unsigned MaxVF);

		/// Finalize the best decision and dispose of all other VPlans.
		void setBestPlan(unsigned VF, unsigned UF);

		/// Generate the IR code for the body of the vectorized loop according to the
		/// best selected VPlan.
		void executeBestPlan(InnerLoopVectorizer &LB);

		VPlan *getVPlanForVF(unsigned VF) { return VPlans[VF].get(); }

		void printCurrentPlans(const std::string &Title, raw_ostream &O);

		/// Test a predicate on a range of VFs.
		/// The returned value reflects the result for a prefix of the range, with \p
		/// EndRangeVF modified accordingly.
		bool testVFRange(const std::function<bool(unsigned)> &Predicate,
		unsigned StartRangeVF, unsigned &EndRangeVF);

		protected:
		/// Build initial VPlans according to the information gathered by Legal
		/// when it checked if it is legal to vectorize this loop.
		/// Returns the number of VPlans built, zero if failed.
		unsigned buildInitialVPlans(unsigned MinVF, unsigned MaxVF);

		/// On VPlan construction, each instruction marked for predication by Legal
		/// gets its own basic block guarded by an if-then. This initial planning
		/// is legal, but is not optimal. This function attempts to leverage the
		/// necessary conditional execution of the predicated instruction in favor
		/// of other related instructions. The function applies these optimizations
		/// to all VPlans.
		void optimizePredicatedInstructions();

		private:
		/// Build an initial VPlan according to the information gathered by Legal
		/// when it checked if it is legal to vectorize this loop. \return a VPlan
		/// that corresponds to vectorization factors starting from the given
		/// \p StartRangeVF and up to \p EndRangeVF, exclusive, possibly decreasing
		/// the given \p EndRangeVF.
		std::shared_ptr<VPlan> buildInitialVPlan(unsigned StartRangeVF,
		unsigned &EndRangeVF);

		std::pair<VPRecipeBase *, VPRecipeBase *>
		widenIntInduction(VPlan *Plan, unsigned StartRangeVF, unsigned &EndRangeVF,
		PHINode *IV, TruncInst *Trunc = nullptr);

		/// Determine whether \p I will be scalarized in a given range of VFs.
		/// The returned value reflects the result for a prefix of the range, with \p
		/// EndRangeVF modified accordingly.
		bool willBeScalarized(Instruction *I, unsigned StartRangeVF,
		unsigned &EndRangeVF);

		/// Iteratively sink the scalarized operands of a predicated instruction into
		/// the block that was created for it.
		void sinkScalarOperands(Instruction *PredInst, VPlan *Plan);

		/// Determine whether a newly-created recipe adds a second user to one of the
		/// variants of the values its ingredients use. This may cause the defining
		/// recipe to generate that variant itself to serve all such users.
		void assignScalarVectorConversions(Instruction *PredInst, VPlan *Plan);

		/// Returns true if an instruction \p I should be scalarized instead of
		/// vectorized for the chosen vectorization factor.
		bool shouldScalarizeInstruction(Instruction *I, unsigned VF) const;

		private:
		/// The loop that we evaluate.
		Loop *TheLoop;

		/// Loop Info analysis.
		LoopInfo *LI;

		/// Target Library Info.
		const TargetLibraryInfo *TLI;

		/// Target Transform Info.
		const TargetTransformInfo *TTI;

		/// The legality analysis.
		LoopVectorizationLegality *Legal;

		/// The profitability analysis.
		LoopVectorizationCostModel *CM;

		InnerLoopVectorizer *ILV;

		// Holds instructions from the original loop that we predicated. Such
		// instructions reside in their own conditioned VPBasicBlock and represent
		// an optimization opportunity for sinking their scalarized operands thus
		// reducing their cost by the predicate's probability.
		SmallPtrSet<Instruction *, 4> PredicatedInstructions;

		/// VPlans are shared between VFs, use smart pointers.
		DenseMap<unsigned, std::shared_ptr<VPlan>> VPlans;

		unsigned BestVF;

		unsigned BestUF;

		// Holds instructions from the original loop whose counterparts in the
		// vectorized loop would be trivially dead if generated. For example,
		// original induction update instructions can become dead because we
		// separately emit induction "steps" when generating code for the new loop.
		// Similarly, we create a new latch condition when setting up the structure
		// of the new loop, so the old one can become dead.
		SmallPtrSet<Instruction *, 4> DeadInstructions;
		};

		class VPLaneRange {
		private:
		static const unsigned VF = INT_MAX;
		unsigned MinLane = 0;
		unsigned MaxLane = VF - 1;
		void dumpLane(raw_ostream &O, unsigned Lane) const {
		if (Lane == VF - 1)
		O << "VF-1";
		else
		O << Lane;
		}

		public:
		VPLaneRange() {}
		VPLaneRange(unsigned Min) : MinLane(Min) {}
		VPLaneRange(unsigned Min, unsigned Max) : MinLane(Min), MaxLane(Max) {}
		unsigned getMinLane() const { return MinLane; }
		unsigned getMaxLane() const { return MaxLane; }
		bool isEmpty() const { return MinLane > MaxLane; }
		bool isFull() const { return MinLane == 0 && MaxLane == VF - 1; }
		void print(raw_ostream &O) const {
		dumpLane(O, MinLane);
		O << "..";
		dumpLane(O, MaxLane);
		}
		static VPLaneRange intersect(const VPLaneRange &One, const VPLaneRange &Two) {
		return VPLaneRange(std::max(One.MinLane, Two.MinLane),
		std::min(One.MaxLane, Two.MaxLane));
		}
		};
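The lane-range semantics above (inclusive bounds, a symbolic `VF` upper bound, emptiness when the bounds cross) can be checked in isolation. The following is only a minimal sketch: `LaneRange` and the free function `intersect` are hypothetical stand-ins written for illustration, not the patch's `VPLaneRange`.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical stand-in for VPLaneRange; both bounds are inclusive.
struct LaneRange {
  // Symbolic "VF": the concrete vectorization factor is not known here,
  // so the last lane is represented as VF - 1 with VF = INT_MAX.
  static constexpr unsigned VF = INT_MAX;
  unsigned MinLane;
  unsigned MaxLane;

  explicit LaneRange(unsigned Min = 0, unsigned Max = VF - 1)
      : MinLane(Min), MaxLane(Max) {
    assert(Max <= VF - 1 && "lane index beyond the symbolic VF");
  }

  // A range is empty once its bounds cross.
  bool isEmpty() const { return MinLane > MaxLane; }
  // A range is full when it covers lanes 0..VF-1.
  bool isFull() const { return MinLane == 0 && MaxLane == VF - 1; }
};

// Intersection keeps the larger lower bound and the smaller upper bound.
inline LaneRange intersect(const LaneRange &A, const LaneRange &B) {
  return LaneRange(std::max(A.MinLane, B.MinLane),
                   std::min(A.MaxLane, B.MaxLane));
}
```

Under this sketch, intersecting the lane-zero range `(0, 0)` with a range starting at lane 1 yields an empty range, which is presumably how a recipe restricted to lane zero and one restricted to the remaining lanes would be recognized as disjoint.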

		/// VPScalarizeOneByOneRecipe is a VPOneByOneRecipeBase which scalarizes each
		/// Instruction in its ingredients independently, in order. The scalarization
		/// is performed in one of two methods: a) by generating a single uniform scalar
		/// Instruction. b) by generating multiple Instructions, each one for a
		/// respective lane.
		class VPScalarizeOneByOneRecipe : public VPOneByOneRecipeBase {
		friend class VPlanUtilsLoopVectorizer;

		private:
		/// Do the actual code generation for a single instruction.
		void transformIRInstruction(Instruction *I, VPTransformState &State) override;

		VPLaneRange DesignatedLanes;

		public:
		VPScalarizeOneByOneRecipe(const BasicBlock::iterator B,
		const BasicBlock::iterator E, VPlan *Plan)
		: VPOneByOneRecipeBase(VPScalarizeOneByOneSC, B, E, Plan) {}

		~VPScalarizeOneByOneRecipe() {}

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPRecipeBase *V) {
		return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC;
		}

		const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; }

		/// Print the recipe.
		void print(raw_ostream &O) const override {
		O << "Scalarize";
		if (!DesignatedLanes.isFull()) {
		O << " ";
		DesignatedLanes.print(O);
		}
		O << ":";
		for (auto It = Begin; It != End; ++It) {
		O << '\n' << *It;
		if (willAlsoPackOrUnpack(&*It))
		O << " (S->V)";
		}
		}
		};

		/// VPVectorizeOneByOneRecipe is a VPOneByOneRecipeBase which transforms by
		/// vectorizing each Instruction in its ingredients independently, in order.
		/// This recipe covers most of the traditional vectorization cases where
		/// each ingredient produces a vectorized version of itself.
		class VPVectorizeOneByOneRecipe : public VPOneByOneRecipeBase {
		friend class VPlanUtilsLoopVectorizer;

		private:
		/// Do the actual code generation for a single instruction.
		void transformIRInstruction(Instruction *I, VPTransformState &State) override;

		public:
		VPVectorizeOneByOneRecipe(const BasicBlock::iterator B,
		const BasicBlock::iterator E, VPlan *Plan)
		: VPOneByOneRecipeBase(VPVectorizeOneByOneSC, B, E, Plan) {}

		~VPVectorizeOneByOneRecipe() {}

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPRecipeBase *V) {
		return V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC;
		}

		/// Print the recipe.
		void print(raw_ostream &O) const override {
		O << "Vectorize:";
		for (auto It = Begin; It != End; ++It) {
		O << '\n' << *It;
		if (willAlsoPackOrUnpack(&*It))
		O << " (S->V)";
		}
		}
		};

		/// A recipe which widens integer inductions, producing their vector values
		/// and computing the necessary values for producing their scalar values.
		/// The scalar values themselves are generated by a complementing
		/// VPBuildScalarStepsRecipe.
		class VPWidenIntInductionRecipe : public VPRecipeBase {
		private:
		bool NeedsScalarIV;
		PHINode *IV;
		TruncInst *Trunc;
		Value *ScalarIV = nullptr;
		Value *Step = nullptr;

		public:
		VPWidenIntInductionRecipe(bool NeedsScalarIV, PHINode *IV,
		TruncInst *Trunc = nullptr)
		: VPRecipeBase(VPWidenIntInductionSC), NeedsScalarIV(NeedsScalarIV),
		IV(IV), Trunc(Trunc) {}

		~VPWidenIntInductionRecipe() {}

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPRecipeBase *V) {
		return V->getVPRecipeID() == VPRecipeBase::VPWidenIntInductionSC;
		}

		/// The method which generates the vector values of the induction and
		/// computes the values needed for producing its scalar values, corresponding
		/// to this VPWidenIntInductionRecipe in the vectorized version, thereby
		/// "executing" the VPlan.
		void vectorize(VPTransformState &State) override;

		/// Print the recipe.
		void print(raw_ostream &O) const override;

		Value *getScalarIV() {
		assert(ScalarIV && "ScalarIV does not exist yet");
		return ScalarIV;
		}

		Value *getStep() {
		assert(Step && "Step does not exist yet");
		return Step;
		}
		};

		/// This is a complementing recipe for handling integer induction variables,
		/// responsible for generating the scalar values used by the IV's scalar users.
		class VPBuildScalarStepsRecipe : public VPRecipeBase {
		friend class VPlanUtilsLoopVectorizer;

		private:
		VPWidenIntInductionRecipe *WII;
		Instruction *EntryVal;
		VPLaneRange DesignatedLanes;

		public:
		VPBuildScalarStepsRecipe(VPWidenIntInductionRecipe *WII,
		Instruction *EntryVal, VPlan *Plan)
		: VPRecipeBase(VPBuildScalarStepsSC), WII(WII), EntryVal(EntryVal) {
		Plan->setInst2Recipe(EntryVal, this);
		}

		~VPBuildScalarStepsRecipe() {}

		const VPLaneRange &getDesignatedLanes() const { return DesignatedLanes; }

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPRecipeBase *V) {
		return V->getVPRecipeID() == VPRecipeBase::VPBuildScalarStepsSC;
		}

		/// The method which generates the scalar values used by the IV's scalar
		/// users, corresponding to this VPBuildScalarStepsRecipe in the vectorized
		/// version, thereby "executing" the VPlan.
		void vectorize(VPTransformState &State) override;

		/// Print the recipe.
		void print(raw_ostream &O) const override;
		};

		/// A VPInterleaveRecipe is a VPRecipe which transforms an interleave group of
		/// loads or stores into one wide load/store and shuffles.
		class VPInterleaveRecipe : public VPRecipeBase {
		private:
		const InterleaveGroup *IG;

		public:
		VPInterleaveRecipe(const InterleaveGroup *IG, VPlan *Plan)
		: VPRecipeBase(VPInterleaveSC), IG(IG) {
		for (unsigned I = 0, E = IG->getNumMembers(); I < E; ++I)
		Plan->setInst2Recipe(IG->getMember(I), this);
		}

		~VPInterleaveRecipe() {}

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPRecipeBase *V) {
		return V->getVPRecipeID() == VPRecipeBase::VPInterleaveSC;
		}

		/// The method which generates the wide load or store and shuffles that
		/// correspond to this VPInterleaveRecipe in the vectorized version, thereby
		/// "executing" the VPlan.
		void vectorize(VPTransformState &State) override;

		/// Print the recipe.
		void print(raw_ostream &O) const override;

		const InterleaveGroup *getInterleaveGroup() { return IG; }
		};

		/// A VPExtractMaskBitRecipe is a VPConditionBitRecipe which supports a
		/// scalarized conditional branch. Such branches are needed to guard scalarized
		/// instructions with possible side-effects that are predicated under a
		/// condition. This recipe is in charge of generating the instruction that
		/// computes the condition for this branch in the vectorized version.
		class VPExtractMaskBitRecipe : public VPConditionBitRecipeBase {
		private:
		/// The original IR basic block in which the scalarized and predicated
		/// instruction(s) reside. Needed for generating the mask of the block
		/// and from it the desired condition bit.
		BasicBlock *MaskedBasicBlock;

		public:
		/// Construct a VPExtractMaskBitRecipe given the IR BasicBlock whose mask
		/// should provide the desired bit. This recipe has no Instructions as
		/// ingredients, hence does not call Plan->setInst2Recipe().
		VPExtractMaskBitRecipe(BasicBlock *BB)
		: VPConditionBitRecipeBase(VPExtractMaskBitSC), MaskedBasicBlock(BB) {}

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPRecipeBase *V) {
		return V->getVPRecipeID() == VPRecipeBase::VPExtractMaskBitSC;
		}

		/// The method which generates the comparison and related mask management
		/// instructions leading to computing the desired condition bit, corresponding
		/// to this VPExtractMaskBitRecipe in the vectorized version, thereby
		/// "executing" the VPlan.
		void vectorize(VPTransformState &State) override;

		/// Print the recipe.
		void print(raw_ostream &O) const override {
		O << "Extract Mask Bit:\n" << MaskedBasicBlock->getName();
		}

		StringRef getName() const override { return MaskedBasicBlock->getName(); }
		};

		/// A VPMergeScalarizeBranchRecipe is a VPRecipe which represents the Phi's
		/// needed when control converges back from a scalarized branch. Such phi's are
		/// needed to merge live-out values that are set under a scalarized conditional
		/// branch. They can be scalar or vector, depending on the user of the
		/// live-out value. This recipe works in concert with VPExtractMaskBitRecipe.
		class VPMergeScalarizeBranchRecipe : public VPRecipeBase {
		private:
		Instruction *LiveOut;

		public:
		// Construct a VPMergeScalarizeBranchRecipe given \p LiveOut whose value needs
		// a Phi after merging back from a scalarized branch.
		// LiveOut is mapped to the recipe vectorizing it, instead of this recipe
		// which provides it with PHIs; hence no call to Plan->setInst2Recipe() here.
		VPMergeScalarizeBranchRecipe(Instruction *LiveOut)
		: VPRecipeBase(VPMergeScalarizeBranchSC), LiveOut(LiveOut) {}

		~VPMergeScalarizeBranchRecipe() {}

		/// Method to support type inquiry through isa, cast, and dyn_cast.
		static inline bool classof(const VPRecipeBase *V) {
		return V->getVPRecipeID() == VPRecipeBase::VPMergeScalarizeBranchSC;
		}

		/// The method which generates Phi instructions for live-outs as needed to
		/// retain SSA form, corresponding to this VPMergeScalarizeBranchRecipe in the
		/// vectorized version, thereby "executing" the VPlan.
		void vectorize(VPTransformState &State) override;

		/// Print the recipe.
		void print(raw_ostream &O) const override {
		O << "Merge Scalarize Branch:\n" << *LiveOut;
		}
		};

		class VPlanUtilsLoopVectorizer : public VPlanUtils {
		public:
		VPlanUtilsLoopVectorizer(VPlan *Plan) : VPlanUtils(Plan) {}

		~VPlanUtilsLoopVectorizer() {}

		VPOneByOneRecipeBase *createOneByOneRecipe(const BasicBlock::iterator B,
		const BasicBlock::iterator E,
		VPlan *Plan, bool isScalarizing);

		bool appendInstruction(VPOneByOneRecipeBase *Recipe, Instruction *Instr);

		VPOneByOneRecipeBase *splitRecipe(Instruction *Split);

		void insertBefore(Instruction *Inst, Instruction *Before,
		unsigned MinLane = 0);

		void removeInstruction(Instruction *Inst, unsigned FromLane = 0);

		void sinkInstruction(Instruction *Inst, VPBasicBlock *To,
		unsigned MinLane = 0);

		template <typename T> void designateLaneZero(T &Recipe) {
		Recipe->DesignatedLanes = VPLaneRange(0, 0);
		}
		};

/// \brief This holds vectorization requirements that must be verified late in		/// \brief This holds vectorization requirements that must be verified late in
/// the process. The requirements are set by legalize and costmodel. Once		/// the process. The requirements are set by legalize and costmodel. Once
/// vectorization has been determined to be possible and profitable the		/// vectorization has been determined to be possible and profitable the
/// requirements can be verified by looking for metadata or compiler options.		/// requirements can be verified by looking for metadata or compiler options.
/// For example, some loops require FP commutativity which is only allowed if		/// For example, some loops require FP commutativity which is only allowed if
/// vectorization is explicitly specified or if the fast-math compiler option		/// vectorization is explicitly specified or if the fast-math compiler option
/// has been provided.		/// has been provided.
/// Late evaluation of these requirements allows helpful diagnostics to be		/// Late evaluation of these requirements allows helpful diagnostics to be
[... 118 lines not shown ...]	void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addPreserved<BasicAAWrapperPass>();		AU.addPreserved<BasicAAWrapperPass>();
AU.addPreserved<GlobalsAAWrapperPass>();		AU.addPreserved<GlobalsAAWrapperPass>();
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Implementation of LoopVectorizationLegality, InnerLoopVectorizer and		// Implementation of LoopVectorizationLegality, InnerLoopVectorizer,
// LoopVectorizationCostModel.		// LoopVectorizationCostModel and LoopVectorizationPlanner.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

Value *InnerLoopVectorizer::getBroadcastInstrs(Value *V) {		Value *InnerLoopVectorizer::getBroadcastInstrs(Value *V) {
// We need to place the broadcast of invariant variables outside the loop.		// We need to place the broadcast of invariant variables outside the loop.
Instruction *Instr = dyn_cast<Instruction>(V);		Instruction *Instr = dyn_cast<Instruction>(V);
bool NewInstr = (Instr && Instr->getParent() == LoopVectorBody);		bool NewInstr = (Instr && Instr->getParent() == LoopVectorBody);
bool Invariant = OrigLoop->isLoopInvariant(V) && !NewInstr;		bool Invariant = OrigLoop->isLoopInvariant(V) && !NewInstr;

[... 66 lines not shown ...]	if (shouldScalarizeInstruction(IV))
return true;		return true;
auto isScalarInst = [&](User *U) -> bool {		auto isScalarInst = [&](User *U) -> bool {
auto *I = cast<Instruction>(U);		auto *I = cast<Instruction>(U);
return (OrigLoop->contains(I) && shouldScalarizeInstruction(I));		return (OrigLoop->contains(I) && shouldScalarizeInstruction(I));
};		};
return any_of(IV->users(), isScalarInst);		return any_of(IV->users(), isScalarInst);
}		}

void InnerLoopVectorizer::widenIntInduction(PHINode *IV, TruncInst *Trunc) {		std::pair<Value *, Value *>
		InnerLoopVectorizer::widenIntInduction(bool NeedsScalarIV, PHINode *IV,
		TruncInst *Trunc) {

auto II = Legal->getInductionVars()->find(IV);		auto II = Legal->getInductionVars()->find(IV);
assert(II != Legal->getInductionVars()->end() && "IV is not an induction");		assert(II != Legal->getInductionVars()->end() && "IV is not an induction");

auto ID = II->second;		auto ID = II->second;
assert(IV->getType() == ID.getStartValue()->getType() && "Types must match");		assert(IV->getType() == ID.getStartValue()->getType() && "Types must match");

// The scalar value to broadcast. This will be derived from the canonical		// The scalar value to broadcast. This will be derived from the canonical
// induction variable.		// induction variable.
Value *ScalarIV = nullptr;		Value *ScalarIV = nullptr;

// The step of the induction.		// The step of the induction.
Value *Step = nullptr;		Value *Step = nullptr;

// The value from the original loop to which we are mapping the new induction		// The value from the original loop to which we are mapping the new induction
// variable.		// variable.
Instruction *EntryVal = Trunc ? cast<Instruction>(Trunc) : IV;		Instruction *EntryVal = Trunc ? cast<Instruction>(Trunc) : IV;

// True if we have vectorized the induction variable.		// True if we have vectorized the induction variable.
auto VectorizedIV = false;		auto VectorizedIV = false;

// Determine if we want a scalar version of the induction variable. This is
// true if the induction variable itself is not widened, or if it has at
// least one user in the loop that is not widened.
auto NeedsScalarIV = VF > 1 && needsScalarInduction(EntryVal);

// If the induction variable has a constant integer step value, go ahead and		// If the induction variable has a constant integer step value, go ahead and
// get it now.		// get it now.
if (ID.getConstIntStepValue())		if (ID.getConstIntStepValue())
Step = ID.getConstIntStepValue();		Step = ID.getConstIntStepValue();

// Try to create a new independent vector induction variable. If we can't		// Try to create a new independent vector induction variable. If we can't
// create the phi node, we will splat the scalar induction variable in each		// create the phi node, we will splat the scalar induction variable in each
// loop iteration.		// loop iteration.
[... 38 lines not shown ...]	if (!VectorizedIV) {
for (unsigned Part = 0; Part < UF; ++Part)		for (unsigned Part = 0; Part < UF; ++Part)
Entry[Part] = getStepVector(Broadcasted, VF * Part, Step);		Entry[Part] = getStepVector(Broadcasted, VF * Part, Step);
VectorLoopValueMap.initVector(EntryVal, Entry);		VectorLoopValueMap.initVector(EntryVal, Entry);
if (Trunc)		if (Trunc)
addMetadata(Entry, Trunc);		addMetadata(Entry, Trunc);
}		}

// If an induction variable is only used for counting loop iterations or		// If an induction variable is only used for counting loop iterations or
// calculating addresses, it doesn't need to be widened. Create scalar steps		// calculating addresses, it doesn't need to be widened.
// that can be used by instructions we will later scalarize. Note that the
// addition of the scalar steps will not increase the number of instructions		return std::make_pair(ScalarIV, Step);
// in the loop in the common case prior to InstCombine. We will be trading
// one vector extract for each scalar step.
if (NeedsScalarIV)
buildScalarSteps(ScalarIV, Step, EntryVal);
}		}

Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step,		Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step,
Instruction::BinaryOps BinOp) {		Instruction::BinaryOps BinOp) {
// Create and check the types.		// Create and check the types.
assert(Val->getType()->isVectorTy() && "Must be a vector");		assert(Val->getType()->isVectorTy() && "Must be a vector");
int VLen = Val->getType()->getVectorNumElements();		int VLen = Val->getType()->getVectorNumElements();

[... 43 lines not shown ...]	Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step,

Value *BOp = Builder.CreateBinOp(BinOp, Val, MulOp, "induction");		Value *BOp = Builder.CreateBinOp(BinOp, Val, MulOp, "induction");
if (isa<Instruction>(BOp))		if (isa<Instruction>(BOp))
cast<Instruction>(BOp)->setFastMathFlags(Flags);		cast<Instruction>(BOp)->setFastMathFlags(Flags);
return BOp;		return BOp;
}		}

void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step,		void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step,
Value *EntryVal) {		Value *EntryVal, unsigned MinPart,
		unsigned MaxPart, unsigned MinLane,
		unsigned MaxLane) {

// We shouldn't have to build scalar steps if we aren't vectorizing.		// We shouldn't have to build scalar steps if we aren't vectorizing.
assert(VF > 1 && "VF should be greater than one");		assert(VF > 1 && "VF should be greater than one");

// Get the value type and ensure it and the step have the same integer type.		// Get the value type and ensure it and the step have the same integer type.
Type *ScalarIVTy = ScalarIV->getType()->getScalarType();		Type *ScalarIVTy = ScalarIV->getType()->getScalarType();
assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() &&		assert(ScalarIVTy->isIntegerTy() && ScalarIVTy == Step->getType() &&
"Val and Step should have the same integer type");		"Val and Step should have the same integer type");

// Determine the number of scalars we need to generate for each unroll		ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(EntryVal, VF);
// iteration. If EntryVal is uniform, we only need to generate the first
// lane. Otherwise, we generate all VF values.
unsigned Lanes =
Cost->isUniformAfterVectorization(cast<Instruction>(EntryVal), VF) ? 1 : VF;

// Compute the scalar steps and save the results in VectorLoopValueMap.		// Compute the scalar steps and save the results in VectorLoopValueMap.
ScalarParts Entry(UF);		for (unsigned Part = MinPart; Part <= MaxPart; ++Part) {
for (unsigned Part = 0; Part < UF; ++Part) {
Entry[Part].resize(VF);		Entry[Part].resize(VF);
for (unsigned Lane = 0; Lane < Lanes; ++Lane) {		for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) {
auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane);		auto *StartIdx = ConstantInt::get(ScalarIVTy, VF * Part + Lane);
auto *Mul = Builder.CreateMul(StartIdx, Step);		auto *Mul = Builder.CreateMul(StartIdx, Step);
auto *Add = Builder.CreateAdd(ScalarIV, Mul);		auto *Add = Builder.CreateAdd(ScalarIV, Mul);
Entry[Part][Lane] = Add;		Entry[Part][Lane] = Add;
}		}
}		}
VectorLoopValueMap.initScalar(EntryVal, Entry);
}		}
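For reference, the value produced for each (part, lane) pair in the loop above is `ScalarIV + (VF * Part + Lane) * Step`, now restricted to the window `[MinPart, MaxPart] x [MinLane, MaxLane]`. The following is a minimal integer sketch of that arithmetic, assuming plain `long long` values in place of IR `Value`s; the name `scalarSteps` and the use of 0 as a placeholder for lanes outside the window are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical integer model of buildScalarSteps. Returns Steps[Part][Lane];
// entries outside the requested part/lane window keep the placeholder 0.
std::vector<std::vector<long long>>
scalarSteps(long long ScalarIV, long long Step, unsigned VF, unsigned UF,
            unsigned MinPart, unsigned MaxPart, unsigned MinLane,
            unsigned MaxLane) {
  // Mirrors the assertion in the patch: no scalar steps without vectorizing.
  assert(VF > 1 && "VF should be greater than one");
  assert(MaxPart < UF && MaxLane < VF && "window must fit in UF x VF");

  std::vector<std::vector<long long>> Steps(UF,
                                            std::vector<long long>(VF, 0));
  for (unsigned Part = MinPart; Part <= MaxPart; ++Part) {
    for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) {
      // Index of this lane counted across all unroll parts.
      long long StartIdx = static_cast<long long>(VF) * Part + Lane;
      Steps[Part][Lane] = ScalarIV + StartIdx * Step;
    }
  }
  return Steps;
}
```

Passing `MinLane = MaxLane = 0` models the uniform case, where only lane zero of each part is generated; in the old code this was decided inside the function via `isUniformAfterVectorization`, whereas here the caller supplies the lane window.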

int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) {		int LoopVectorizationLegality::isConsecutivePtr(Value *Ptr) {

const ValueToValueMap &Strides = getSymbolicStrides() ? *getSymbolicStrides() :		const ValueToValueMap &Strides = getSymbolicStrides() ? *getSymbolicStrides() :
ValueToValueMap();		ValueToValueMap();

int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, true, false);		int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, true, false);
if (Stride == 1 || Stride == -1)		if (Stride == 1 || Stride == -1)
return Stride;		return Stride;
return 0;		return 0;
}		}

bool LoopVectorizationLegality::isUniform(Value *V) {		bool LoopVectorizationLegality::isUniform(Value *V) {
return LAI->isUniform(V);		return LAI->isUniform(V);
}		}

		void InnerLoopVectorizer::constructVectorValue(Value *V, unsigned Part,
		unsigned Lane) {
		assert(V != Induction && "The new induction variable should not be used.");
		assert(!V->getType()->isVectorTy() && "Can't widen a vector");
		assert(!V->getType()->isVoidTy() && "Type does not produce a value");

		if (!VectorLoopValueMap.hasVector(V)) {
		VectorParts Entry(UF);
		for (unsigned P = 0; P < UF; ++P)
		Entry[P] = nullptr;
		VectorLoopValueMap.initVector(V, Entry);
		}

		VectorParts &Parts = VectorLoopValueMap.VectorMapStorage[V];

		assert(VectorLoopValueMap.hasScalar(V) && "Expected scalar values to exist");

		auto *ScalarInst = cast<Instruction>(getScalarValue(V, Part, Lane));

		Value *VectorValue = nullptr;

		// If we're constructing lane 0, start from undef; otherwise, start from the
		// last value created.
		if (Lane == 0)
		VectorValue = UndefValue::get(VectorType::get(V->getType(), VF));
		else
		VectorValue = Parts[Part];

		VectorValue = Builder.CreateInsertElement(VectorValue, ScalarInst,
		Builder.getInt32(Lane));
		Parts[Part] = VectorValue;
		}
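The packing scheme in `constructVectorValue` is incremental: lane 0 starts from an undef vector, and every later lane inserts into whatever was built for the previous lanes of the same part. The following is a minimal sketch of that scheme for a single unroll part, assuming integers in place of IR values; `PackedValue` and `insertLane` are hypothetical names, and a `Defined` flag stands in for undef lanes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a partially built vector for one unroll part.
// Defined[i] == false stands for an undef lane.
struct PackedValue {
  std::vector<int> Lanes;
  std::vector<bool> Defined;
};

// Models one insertelement step: returns the vector obtained by placing
// Scalar into the given lane.
PackedValue insertLane(const PackedValue &Prev, int Scalar, unsigned Lane,
                       unsigned VF) {
  assert(Lane < VF && "lane out of range");
  PackedValue Result;
  if (Lane == 0) {
    // Lane 0 starts from an all-undef vector, ignoring any previous value.
    Result.Lanes.assign(VF, 0);
    Result.Defined.assign(VF, false);
  } else {
    // Later lanes continue from the last value created for this part.
    assert(Prev.Lanes.size() == VF && "expected a value built by earlier lanes");
    Result = Prev;
  }
  Result.Lanes[Lane] = Scalar;
  Result.Defined[Lane] = true;
  return Result;
}
```

One consequence visible in the sketch: calling it again for lane 0 discards whatever was packed before, so the scheme relies on lanes being constructed in increasing order within a part.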

const InnerLoopVectorizer::VectorParts &		const InnerLoopVectorizer::VectorParts &
InnerLoopVectorizer::getVectorValue(Value *V) {		InnerLoopVectorizer::getVectorValue(Value *V) {
assert(V != Induction && "The new induction variable should not be used.");		assert(V != Induction && "The new induction variable should not be used.");
assert(!V->getType()->isVectorTy() && "Can't widen a vector");		assert(!V->getType()->isVectorTy() && "Can't widen a vector");
assert(!V->getType()->isVoidTy() && "Type does not produce a value");		assert(!V->getType()->isVoidTy() && "Type does not produce a value");

// If we have a stride that is replaced by one, do it here.		// If we have a stride that is replaced by one, do it here.
if (Legal->hasStride(V))		if (Legal->hasStride(V))
[... 27 lines not shown ...]	if (VectorLoopValueMap.hasScalar(V)) {
// of the last unroll iteration. Otherwise, the last instruction is the one		// of the last unroll iteration. Otherwise, the last instruction is the one
// we created for the last vector lane of the last unroll iteration.		// we created for the last vector lane of the last unroll iteration.
unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1;		unsigned LastLane = Cost->isUniformAfterVectorization(I, VF) ? 0 : VF - 1;
auto *LastInst = cast<Instruction>(getScalarValue(V, UF - 1, LastLane));		auto *LastInst = cast<Instruction>(getScalarValue(V, UF - 1, LastLane));

// Set the insert point after the last scalarized instruction. This ensures		// Set the insert point after the last scalarized instruction. This ensures
// the insertelement sequence will directly follow the scalar definitions.		// the insertelement sequence will directly follow the scalar definitions.
auto OldIP = Builder.saveIP();		auto OldIP = Builder.saveIP();
auto NewIP = std::next(BasicBlock::iterator(LastInst));		auto NextInsertionPoint = std::next(BasicBlock::iterator(LastInst));
Builder.SetInsertPoint(&*NewIP);		if (NextInsertionPoint != LastInst->getParent()->end())
		Builder.SetInsertPoint(&*NextInsertionPoint);
		else
		Builder.SetInsertPoint(LastInst->getParent());

// However, if we are vectorizing, we need to construct the vector values.		// However, if we are vectorizing, we need to construct the vector values.
// If the value is known to be uniform after vectorization, we can just		// If the value is known to be uniform after vectorization, we can just
// broadcast the scalar value corresponding to lane zero for each unroll		// broadcast the scalar value corresponding to lane zero for each unroll
// iteration. Otherwise, we construct the vector values using insertelement		// iteration. Otherwise, we construct the vector values using insertelement
// instructions. Since the resulting vectors are stored in		// instructions. Since the resulting vectors are stored in
// VectorLoopValueMap, we will only generate the insertelements once.		// VectorLoopValueMap, we will only generate the insertelements once.
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
[... 244 lines not shown ...]	void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) {
unsigned Alignment = getMemInstAlignment(Instr);		unsigned Alignment = getMemInstAlignment(Instr);
// An alignment of 0 means target abi alignment. We need to use the scalar's		// An alignment of 0 means target abi alignment. We need to use the scalar's
// target abi alignment in such a case.		// target abi alignment in such a case.
const DataLayout &DL = Instr->getModule()->getDataLayout();		const DataLayout &DL = Instr->getModule()->getDataLayout();
if (!Alignment)		if (!Alignment)
Alignment = DL.getABITypeAlignment(ScalarDataTy);		Alignment = DL.getABITypeAlignment(ScalarDataTy);
unsigned AddressSpace = getMemInstAddressSpace(Instr);		unsigned AddressSpace = getMemInstAddressSpace(Instr);

// Scalarize the memory instruction if necessary.
if (Decision == LoopVectorizationCostModel::CM_Scalarize)
return scalarizeInstruction(Instr, Legal->isScalarWithPredication(Instr));

// Determine if the pointer operand of the access is either consecutive or		// Determine if the pointer operand of the access is either consecutive or
// reverse consecutive.		// reverse consecutive.
int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);		int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
bool Reverse = ConsecutiveStride < 0;		bool Reverse = ConsecutiveStride < 0;
bool CreateGatherScatter =		bool CreateGatherScatter =
(Decision == LoopVectorizationCostModel::CM_GatherScatter);		(Decision == LoopVectorizationCostModel::CM_GatherScatter);

VectorParts VectorGep;		VectorParts VectorGep;
[... 151 lines not shown ...]	if (CreateGatherScatter) {
Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI;		Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI;
}		}
addMetadata(NewLI, LI);		addMetadata(NewLI, LI);
}		}
VectorLoopValueMap.initVector(Instr, Entry);		VectorLoopValueMap.initVector(Instr, Entry);
}		}

void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,		void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,
bool IfPredicateInstr) {		unsigned MinPart,
		unsigned MaxPart,
		unsigned MinLane,
		unsigned MaxLane) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
DEBUG(dbgs() << "LV: Scalarizing"
<< (IfPredicateInstr ? " and predicating:" : ":") << *Instr
<< '\n');
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);

// Does this instruction return a value ?		// Does this instruction return a value ?
bool IsVoidRetTy = Instr->getType()->isVoidTy();		bool IsVoidRetTy = Instr->getType()->isVoidTy();

// Initialize a new scalar map entry.		// Initialize a new scalar map entry.
ScalarParts Entry(UF);		ScalarParts &Entry = VectorLoopValueMap.getOrCreateScalar(Instr, VF);

VectorParts Cond;
if (IfPredicateInstr)
Cond = createBlockInMask(Instr->getParent());

// Determine the number of scalars we need to generate for each unroll
// iteration. If the instruction is uniform, we only need to generate the
// first lane. Otherwise, we generate all VF values.
unsigned Lanes = Cost->isUniformAfterVectorization(Instr, VF) ? 1 : VF;

// For each vector unroll 'part':		// For each vector unroll 'part':
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = MinPart; Part <= MaxPart; ++Part) {
Entry[Part].resize(VF);
// For each scalar that we create:		// For each scalar that we create:
for (unsigned Lane = 0; Lane < Lanes; ++Lane) {		for (unsigned Lane = MinLane; Lane <= MaxLane; ++Lane) {

// Start if-block.
Value *Cmp = nullptr;
if (IfPredicateInstr) {
Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Lane));
Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp,
ConstantInt::get(Cmp->getType(), 1));
}

Instruction *Cloned = Instr->clone();		Instruction *Cloned = Instr->clone();
if (!IsVoidRetTy)		if (!IsVoidRetTy)
Cloned->setName(Instr->getName() + ".cloned");		Cloned->setName(Instr->getName() + ".cloned");

// Replace the operands of the cloned instructions with their scalar		// Replace the operands of the cloned instructions with their scalar
// equivalents in the new loop.		// equivalents in the new loop.
for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) {		for (unsigned op = 0, e = Instr->getNumOperands(); op != e; ++op) {
auto *NewOp = getScalarValue(Instr->getOperand(op), Part, Lane);		auto *NewOp = getScalarValue(Instr->getOperand(op), Part, Lane);
Cloned->setOperand(op, NewOp);		Cloned->setOperand(op, NewOp);
}		}
addNewMetadata(Cloned, Instr);		addNewMetadata(Cloned, Instr);

// Place the cloned scalar in the new loop.		// Place the cloned scalar in the new loop.
Builder.Insert(Cloned);		Builder.Insert(Cloned);

// Add the cloned scalar to the scalar map entry.		// Add the cloned scalar to the scalar map entry.
Entry[Part][Lane] = Cloned;		Entry[Part][Lane] = Cloned;

// If we just cloned a new assumption, add it to the assumption cache.		// If we just cloned a new assumption, add it to the assumption cache.
if (auto *II = dyn_cast<IntrinsicInst>(Cloned))		if (auto *II = dyn_cast<IntrinsicInst>(Cloned))
if (II->getIntrinsicID() == Intrinsic::assume)		if (II->getIntrinsicID() == Intrinsic::assume)
AC->registerAssumption(II);		AC->registerAssumption(II);

// End if-block.
if (IfPredicateInstr)
PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp));
}		}
}		}
VectorLoopValueMap.initScalar(Instr, Entry);
}		}

PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,		PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,
Value *End, Value *Step,		Value *End, Value *Step,
Instruction *DL) {		Instruction *DL) {
BasicBlock *Header = L->getHeader();		BasicBlock *Header = L->getHeader();
BasicBlock *Latch = L->getLoopLatch();		BasicBlock *Latch = L->getLoopLatch();
// As we're just creating this loop, it's possible no latch exists		// As we're just creating this loop, it's possible no latch exists
[... 737 lines not shown ...]	for (Value *&I : Parts) {
Inst->eraseFromParent();		Inst->eraseFromParent();
I = NewI;		I = NewI;
}		}
}		}
}		}
}		}

void InnerLoopVectorizer::vectorizeLoop() {		void InnerLoopVectorizer::vectorizeLoop() {

//===------------------------------------------------===//		//===------------------------------------------------===//
//		//
// Notice: any optimization or new instruction that goes		// Notice: any optimization or new instruction that goes
// into the code below should also be implemented in		// into the code below should also be implemented in
// the cost-model.		// the cost-model.
//		//
//===------------------------------------------------===//		//===------------------------------------------------===//
Constant *Zero = Builder.getInt32(0);

		// Insert truncates and extends for any truncated instructions as hints to
		// InstCombine.
		if (VF > 1)
		truncateToMinimalBitwidths();

		fixCrossIterationPHIs();

		// Update the dominator tree.
		//
		// FIXME: After creating the structure of the new loop, the dominator tree is
		// no longer up-to-date, and it remains that way until we update it
		// here. An out-of-date dominator tree is problematic for SCEV,
		// because SCEVExpander uses it to guide code generation. The
		// vectorizer uses SCEVExpanders in several places. Instead, we should
		// keep the dominator tree up-to-date as we go.
		updateAnalysis();

		// Fix-up external users of the induction variables.
		for (auto &Entry : *Legal->getInductionVars())
		fixupIVUsers(Entry.first, Entry.second,
		getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)),
		IVEndValues[Entry.first], LoopMiddleBlock);

		fixLCSSAPHIs();

		// Remove redundant induction instructions.
		cse(LoopVectorBody);
		}

		void InnerLoopVectorizer::fixCrossIterationPHIs() {
// In order to support recurrences we need to be able to vectorize Phi nodes.		// In order to support recurrences we need to be able to vectorize Phi nodes.
// Phi nodes have cycles, so we need to vectorize them in two stages. First,		// Phi nodes have cycles, so we need to vectorize them in two stages. First,
// we create a new vector PHI node with no incoming edges. We use this value		// we create a new vector PHI node with no incoming edges. We use this value
// when we vectorize all of the instructions that use the PHI. Next, after		// when we vectorize all of the instructions that use the PHI. Next, after
// all of the instructions in the block are complete we add the new incoming		// all of the instructions in the block are complete we add the new incoming
// edges to the PHI. At this point all of the instructions in the basic block		// edges to the PHI. At this point all of the instructions in the basic block
// are vectorized, so we can use them to construct the PHI.		// are vectorized, so we can use them to construct the PHI.
PhiVector PHIsToFix;

// Collect instructions from the original loop that will become trivially
// dead in the vectorized loop. We don't need to vectorize these
// instructions.
collectTriviallyDeadInstructions();

// Scan the loop in a topological order to ensure that defs are vectorized
// before users.
LoopBlocksDFS DFS(OrigLoop);
DFS.perform(LI);

// Vectorize all of the blocks in the original loop.
for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO()))
vectorizeBlockInLoop(BB, &PHIsToFix);

// Insert truncates and extends for any truncated instructions as hints to
// InstCombine.
if (VF > 1)
truncateToMinimalBitwidths();

// At this point every instruction in the original loop is widened to a		// At this point every instruction in the original loop is widened to a
// vector form. Now we need to fix the recurrences in PHIsToFix. These PHI		// vector form. Now we need to fix the recurrences. These PHI nodes are
// nodes are currently empty because we did not want to introduce cycles.		// currently empty because we did not want to introduce cycles.
// This is the second stage of vectorizing recurrences.		// This is the second stage of vectorizing recurrences.
for (PHINode *Phi : PHIsToFix) {		for (Instruction &I : *OrigLoop->getHeader()) {
assert(Phi && "Unable to recover vectorized PHI");		PHINode *Phi = dyn_cast<PHINode>(&I);
		if (!Phi)
// Handle first-order recurrences that need to be fixed.		break;
if (Legal->isFirstOrderRecurrence(Phi)) {		// Handle first-order recurrences and reductions that need to be fixed.
		if (Legal->isFirstOrderRecurrence(Phi))
fixFirstOrderRecurrence(Phi);		fixFirstOrderRecurrence(Phi);
continue;		else if (Legal->isReductionVariable(Phi))
		fixReduction(Phi);
}		}
		}

		void InnerLoopVectorizer::fixReduction(PHINode *Phi) {
		Constant *Zero = Builder.getInt32(0);

// If the phi node is not a first-order recurrence, it must be a reduction.		// Get the reduction variable descriptor.
// Get its reduction variable descriptor.
assert(Legal->isReductionVariable(Phi) &&
"Unable to find the reduction variable");
RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi];		RecurrenceDescriptor RdxDesc = (*Legal->getReductionVars())[Phi];

RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind();		RecurrenceDescriptor::RecurrenceKind RK = RdxDesc.getRecurrenceKind();
TrackingVH<Value> ReductionStartValue = RdxDesc.getRecurrenceStartValue();		TrackingVH<Value> ReductionStartValue = RdxDesc.getRecurrenceStartValue();
Instruction *LoopExitInst = RdxDesc.getLoopExitInstr();		Instruction *LoopExitInst = RdxDesc.getLoopExitInstr();
RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind =		RecurrenceDescriptor::MinMaxRecurrenceKind MinMaxKind =
RdxDesc.getMinMaxRecurrenceKind();		RdxDesc.getMinMaxRecurrenceKind();
setDebugLocFromInst(Builder, ReductionStartValue);		setDebugLocFromInst(Builder, ReductionStartValue);

// We need to generate a reduction vector from the incoming scalar.		// We need to generate a reduction vector from the incoming scalar.
// To do so, we need to generate the 'identity' vector and override		// To do so, we need to generate the 'identity' vector and override
// one of the elements with the incoming scalar reduction. We need		// one of the elements with the incoming scalar reduction. We need
// to do it in the vector-loop preheader.		// to do it in the vector-loop preheader.
Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator());		Builder.SetInsertPoint(LoopBypassBlocks[1]->getTerminator());

// This is the vector-clone of the value that leaves the loop.		// This is the vector-clone of the value that leaves the loop.
const VectorParts &VectorExit = getVectorValue(LoopExitInst);		const VectorParts &VectorExit = getVectorValue(LoopExitInst);
Type *VecTy = VectorExit[0]->getType();		Type *VecTy = VectorExit[0]->getType();

// Find the reduction identity variable. Zero for addition, or, xor,		// Find the reduction identity variable. Zero for addition, or, xor,
// one for multiplication, -1 for And.		// one for multiplication, -1 for And.
Value *Identity;		Value *Identity;
Value *VectorStart;		Value *VectorStart;
if (RK == RecurrenceDescriptor::RK_IntegerMinMax ||		if (RK == RecurrenceDescriptor::RK_IntegerMinMax ||
RK == RecurrenceDescriptor::RK_FloatMinMax) {		RK == RecurrenceDescriptor::RK_FloatMinMax) {
// MinMax reductions have the start value as their identity.		// MinMax reductions have the start value as their identity.
if (VF == 1) {		if (VF == 1) {
VectorStart = Identity = ReductionStartValue;		VectorStart = Identity = ReductionStartValue;
} else {		} else {
VectorStart = Identity =		VectorStart = Identity =
Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident");		Builder.CreateVectorSplat(VF, ReductionStartValue, "minmax.ident");
}		}
} else {		} else {
// Handle other reduction kinds:		// Handle other reduction kinds:
Constant *Iden = RecurrenceDescriptor::getRecurrenceIdentity(		Constant *Iden =
RK, VecTy->getScalarType());		RecurrenceDescriptor::getRecurrenceIdentity(RK, VecTy->getScalarType());
if (VF == 1) {		if (VF == 1) {
Identity = Iden;		Identity = Iden;
// This vector is the Identity vector where the first element is the		// This vector is the Identity vector where the first element is the
// incoming scalar reduction.		// incoming scalar reduction.
VectorStart = ReductionStartValue;		VectorStart = ReductionStartValue;
} else {		} else {
Identity = ConstantVector::getSplat(VF, Iden);		Identity = ConstantVector::getSplat(VF, Iden);

// This vector is the Identity vector where the first element is the		// This vector is the Identity vector where the first element is the
// incoming scalar reduction.		// incoming scalar reduction.
VectorStart =		VectorStart =
Builder.CreateInsertElement(Identity, ReductionStartValue, Zero);		Builder.CreateInsertElement(Identity, ReductionStartValue, Zero);
}		}
}		}

// Fix the vector-loop phi.		// Fix the vector-loop phi.

// Reductions do not have to start at zero. They can start with		// Reductions do not have to start at zero. They can start with
// any loop invariant values.		// any loop invariant values.
const VectorParts &VecRdxPhi = getVectorValue(Phi);		const VectorParts &VecRdxPhi = getVectorValue(Phi);
BasicBlock *Latch = OrigLoop->getLoopLatch();		BasicBlock *Latch = OrigLoop->getLoopLatch();
Value *LoopVal = Phi->getIncomingValueForBlock(Latch);		Value *LoopVal = Phi->getIncomingValueForBlock(Latch);
const VectorParts &Val = getVectorValue(LoopVal);		const VectorParts &Val = getVectorValue(LoopVal);
for (unsigned part = 0; part < UF; ++part) {		for (unsigned part = 0; part < UF; ++part) {
// Make sure to add the reduction start value only to the		// Make sure to add the reduction start value only to the
// first unroll part.		// first unroll part.
Value *StartVal = (part == 0) ? VectorStart : Identity;		Value *StartVal = (part == 0) ? VectorStart : Identity;
		cast<PHINode>(VecRdxPhi[part])->addIncoming(StartVal, LoopVectorPreHeader);
cast<PHINode>(VecRdxPhi[part])		cast<PHINode>(VecRdxPhi[part])
->addIncoming(StartVal, LoopVectorPreHeader);		->addIncoming(Val[part],
cast<PHINode>(VecRdxPhi[part])		LI->getLoopFor(LoopVectorBody)->getLoopLatch());
->addIncoming(Val[part], LoopVectorBody);
}		}

// Before each round, move the insertion point right between		// Before each round, move the insertion point right between
// the PHIs and the values we are going to write.		// the PHIs and the values we are going to write.
// This allows us to write both PHINodes and the extractelement		// This allows us to write both PHINodes and the extractelement
// instructions.		// instructions.
Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt());		Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt());

VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst);		VectorParts &RdxParts = VectorLoopValueMap.getVector(LoopExitInst);
setDebugLocFromInst(Builder, LoopExitInst);		setDebugLocFromInst(Builder, LoopExitInst);

// If the vector reduction can be performed in a smaller type, we truncate		// If the vector reduction can be performed in a smaller type, we truncate
// then extend the loop exit value to enable InstCombine to evaluate the		// then extend the loop exit value to enable InstCombine to evaluate the
// entire expression in the smaller type.		// entire expression in the smaller type.
if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) {		if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) {
Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF);		Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF);
Builder.SetInsertPoint(LoopVectorBody->getTerminator());		Builder.SetInsertPoint(LoopVectorBody->getTerminator());
for (unsigned part = 0; part < UF; ++part) {		for (unsigned part = 0; part < UF; ++part) {
Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy);		Value *Trunc = Builder.CreateTrunc(RdxParts[part], RdxVecTy);
Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy)		Value *Extnd = RdxDesc.isSigned() ? Builder.CreateSExt(Trunc, VecTy)
: Builder.CreateZExt(Trunc, VecTy);		: Builder.CreateZExt(Trunc, VecTy);
for (Value::user_iterator UI = RdxParts[part]->user_begin();		for (Value::user_iterator UI = RdxParts[part]->user_begin();
UI != RdxParts[part]->user_end();)		UI != RdxParts[part]->user_end();)
if (*UI != Trunc) {		if (*UI != Trunc) {
(*UI++)->replaceUsesOfWith(RdxParts[part], Extnd);		(*UI++)->replaceUsesOfWith(RdxParts[part], Extnd);
RdxParts[part] = Extnd;		RdxParts[part] = Extnd;
} else {		} else {
++UI;		++UI;
}		}
}		}
Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt());		Builder.SetInsertPoint(&*LoopMiddleBlock->getFirstInsertionPt());
for (unsigned part = 0; part < UF; ++part)		for (unsigned part = 0; part < UF; ++part)
RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy);		RdxParts[part] = Builder.CreateTrunc(RdxParts[part], RdxVecTy);
}		}

// Reduce all of the unrolled parts into a single vector.		// Reduce all of the unrolled parts into a single vector.
Value *ReducedPartRdx = RdxParts[0];		Value *ReducedPartRdx = RdxParts[0];
unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK);		unsigned Op = RecurrenceDescriptor::getRecurrenceBinOp(RK);
setDebugLocFromInst(Builder, ReducedPartRdx);		setDebugLocFromInst(Builder, ReducedPartRdx);
for (unsigned part = 1; part < UF; ++part) {		for (unsigned part = 1; part < UF; ++part) {
if (Op != Instruction::ICmp && Op != Instruction::FCmp)		if (Op != Instruction::ICmp && Op != Instruction::FCmp)
// Floating point operations had to be 'fast' to enable the reduction.		// Floating point operations had to be 'fast' to enable the reduction.
ReducedPartRdx = addFastMathFlag(		ReducedPartRdx = addFastMathFlag(
Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part],		Builder.CreateBinOp((Instruction::BinaryOps)Op, RdxParts[part],
ReducedPartRdx, "bin.rdx"));		ReducedPartRdx, "bin.rdx"));
else		else
ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp(		ReducedPartRdx = RecurrenceDescriptor::createMinMaxOp(
Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]);		Builder, MinMaxKind, ReducedPartRdx, RdxParts[part]);
}		}

if (VF > 1) {		if (VF > 1) {
// VF is a power of 2 so we can emit the reduction using log2(VF) shuffles		// VF is a power of 2 so we can emit the reduction using log2(VF) shuffles
// and vector ops, reducing the set of values being computed by half each		// and vector ops, reducing the set of values being computed by half each
// round.		// round.
assert(isPowerOf2_32(VF) &&		assert(isPowerOf2_32(VF) &&
"Reduction emission only supported for pow2 vectors!");		"Reduction emission only supported for pow2 vectors!");
Value *TmpVec = ReducedPartRdx;		Value *TmpVec = ReducedPartRdx;
SmallVector<Constant *, 32> ShuffleMask(VF, nullptr);		SmallVector<Constant *, 32> ShuffleMask(VF, nullptr);
for (unsigned i = VF; i != 1; i >>= 1) {		for (unsigned i = VF; i != 1; i >>= 1) {
// Move the upper half of the vector to the lower half.		// Move the upper half of the vector to the lower half.
for (unsigned j = 0; j != i / 2; ++j)		for (unsigned j = 0; j != i / 2; ++j)
ShuffleMask[j] = Builder.getInt32(i / 2 + j);		ShuffleMask[j] = Builder.getInt32(i / 2 + j);

// Fill the rest of the mask with undef.		// Fill the rest of the mask with undef.
std::fill(&ShuffleMask[i / 2], ShuffleMask.end(),		std::fill(&ShuffleMask[i / 2], ShuffleMask.end(),
UndefValue::get(Builder.getInt32Ty()));		UndefValue::get(Builder.getInt32Ty()));

Value *Shuf = Builder.CreateShuffleVector(		Value *Shuf = Builder.CreateShuffleVector(
TmpVec, UndefValue::get(TmpVec->getType()),		TmpVec, UndefValue::get(TmpVec->getType()),
ConstantVector::get(ShuffleMask), "rdx.shuf");		ConstantVector::get(ShuffleMask), "rdx.shuf");

if (Op != Instruction::ICmp && Op != Instruction::FCmp)		if (Op != Instruction::ICmp && Op != Instruction::FCmp)
// Floating point operations had to be 'fast' to enable the reduction.		// Floating point operations had to be 'fast' to enable the reduction.
TmpVec = addFastMathFlag(Builder.CreateBinOp(		TmpVec = addFastMathFlag(Builder.CreateBinOp((Instruction::BinaryOps)Op,
(Instruction::BinaryOps)Op, TmpVec, Shuf, "bin.rdx"));		TmpVec, Shuf, "bin.rdx"));
else		else
TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind,		TmpVec = RecurrenceDescriptor::createMinMaxOp(Builder, MinMaxKind,
TmpVec, Shuf);		TmpVec, Shuf);
}		}

// The result is in the first element of the vector.		// The result is in the first element of the vector.
ReducedPartRdx =		ReducedPartRdx =
Builder.CreateExtractElement(TmpVec, Builder.getInt32(0));		Builder.CreateExtractElement(TmpVec, Builder.getInt32(0));

// If the reduction can be performed in a smaller type, we need to extend		// If the reduction can be performed in a smaller type, we need to extend
// the reduction to the wider type before we branch to the original loop.		// the reduction to the wider type before we branch to the original loop.
if (Phi->getType() != RdxDesc.getRecurrenceType())		if (Phi->getType() != RdxDesc.getRecurrenceType())
ReducedPartRdx =		ReducedPartRdx =
RdxDesc.isSigned()		RdxDesc.isSigned()
? Builder.CreateSExt(ReducedPartRdx, Phi->getType())		? Builder.CreateSExt(ReducedPartRdx, Phi->getType())
: Builder.CreateZExt(ReducedPartRdx, Phi->getType());		: Builder.CreateZExt(ReducedPartRdx, Phi->getType());
}		}

// Create a phi node that merges control-flow from the backedge-taken check		// Create a phi node that merges control-flow from the backedge-taken check
// block and the middle block.		// block and the middle block.
PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx",		PHINode *BCBlockPhi = PHINode::Create(Phi->getType(), 2, "bc.merge.rdx",
LoopScalarPreHeader->getTerminator());		LoopScalarPreHeader->getTerminator());
for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I)		for (unsigned I = 0, E = LoopBypassBlocks.size(); I != E; ++I)
BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]);		BCBlockPhi->addIncoming(ReductionStartValue, LoopBypassBlocks[I]);
BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock);		BCBlockPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock);

// Now, we need to fix the users of the reduction variable		// Now, we need to fix the users of the reduction variable
// inside and outside of the scalar remainder loop.		// inside and outside of the scalar remainder loop.
// We know that the loop is in LCSSA form. We need to update the		// We know that the loop is in LCSSA form. We need to update the
// PHI nodes in the exit blocks.		// PHI nodes in the exit blocks.
for (BasicBlock::iterator LEI = LoopExitBlock->begin(),		for (BasicBlock::iterator LEI = LoopExitBlock->begin(),
LEE = LoopExitBlock->end();		LEE = LoopExitBlock->end();
LEI != LEE; ++LEI) {		LEI != LEE; ++LEI) {
PHINode *LCSSAPhi = dyn_cast<PHINode>(LEI);		PHINode *LCSSAPhi = dyn_cast<PHINode>(LEI);
if (!LCSSAPhi)		if (!LCSSAPhi)
break;		break;

// All PHINodes need to have a single entry edge, or two if		// All PHINodes need to have a single entry edge, or two if
// we already fixed them.		// we already fixed them.
assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI");		assert(LCSSAPhi->getNumIncomingValues() < 3 && "Invalid LCSSA PHI");

// We found a reduction value exit-PHI. Update it with the		// We found a reduction value exit-PHI. Update it with the
// incoming bypass edge.		// incoming bypass edge.
if (LCSSAPhi->getIncomingValue(0) == LoopExitInst)		if (LCSSAPhi->getIncomingValue(0) == LoopExitInst)
LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock);		LCSSAPhi->addIncoming(ReducedPartRdx, LoopMiddleBlock);
} // end of the LCSSA phi scan.		} // end of the LCSSA phi scan.

// Fix the scalar loop reduction variable with the incoming reduction sum		// Fix the scalar loop reduction variable with the incoming reduction sum
// from the vector body and from the backedge value.		// from the vector body and from the backedge value.
int IncomingEdgeBlockIdx =		int IncomingEdgeBlockIdx =
Phi->getBasicBlockIndex(OrigLoop->getLoopLatch());		Phi->getBasicBlockIndex(OrigLoop->getLoopLatch());
assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index");		assert(IncomingEdgeBlockIdx >= 0 && "Invalid block index");
// Pick the other block.		// Pick the other block.
int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1);		int SelfEdgeBlockIdx = (IncomingEdgeBlockIdx ? 0 : 1);
Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi);		Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi);
Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);		Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);
} // end of for each Phi in PHIsToFix.

// Update the dominator tree.
//
// FIXME: After creating the structure of the new loop, the dominator tree is
// no longer up-to-date, and it remains that way until we update it
// here. An out-of-date dominator tree is problematic for SCEV,
// because SCEVExpander uses it to guide code generation. The
// vectorizer uses SCEVExpanders in several places. Instead, we should
// keep the dominator tree up-to-date as we go.
updateAnalysis();

// Fix-up external users of the induction variables.
for (auto &Entry : *Legal->getInductionVars())
fixupIVUsers(Entry.first, Entry.second,
getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)),
IVEndValues[Entry.first], LoopMiddleBlock);

fixLCSSAPHIs();
predicateInstructions();

// Remove redundant induction instructions.
cse(LoopVectorBody);
}		}
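
The reduction fix-up above has three numeric steps: build a start vector (identity in every lane, with lane 0 overridden by the scalar start value, and only for the first unroll part), combine the UF unrolled parts lane by lane, then fold the vector in log2(VF) halving rounds. A minimal sketch of that arithmetic for an integer add reduction, in plain C++ rather than LLVM IR, might look like the following. The names `makeStartVector` and `reduceAdd` are hypothetical and do not appear in the patch; the shuffle with undef upper lanes is modelled simply by ignoring the upper half after each round.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: start vector for unroll part `part`.
// Part 0 is the identity vector with lane 0 replaced by the scalar start
// value; every other part is the plain identity vector.
std::vector<int> makeStartVector(std::size_t VF, unsigned part, int identity,
                                 int start) {
  std::vector<int> v(VF, identity);
  if (part == 0)
    v[0] = start;
  return v;
}

// Hypothetical helper: reduce UF parts of width VF to one scalar (add only).
int reduceAdd(const std::vector<std::vector<int>> &parts) {
  assert(!parts.empty());
  std::vector<int> acc = parts[0];
  const std::size_t VF = acc.size();
  assert(VF != 0 && (VF & (VF - 1)) == 0 && "VF must be a power of two");

  // Step 1: combine the unrolled parts lane by lane ("bin.rdx" in the diff).
  for (std::size_t p = 1; p < parts.size(); ++p)
    for (std::size_t lane = 0; lane < VF; ++lane)
      acc[lane] += parts[p][lane];

  // Step 2: log2(VF) rounds; each round adds the upper half of the live
  // lanes onto the lower half, mirroring the "rdx.shuf" mask i/2 + j.
  for (std::size_t i = VF; i != 1; i >>= 1)
    for (std::size_t j = 0; j != i / 2; ++j)
      acc[j] += acc[i / 2 + j];

  // The result is in lane 0, as in the extractelement above.
  return acc[0];
}
```

Seeding only part 0 with the start value matters because the parts are later added together: seeding every part would count the start value UF times.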

void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) {		void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) {

// This is the second phase of vectorizing first-order recurrences. An		// This is the second phase of vectorizing first-order recurrences. An
// overview of the transformation is described below. Suppose we have the		// overview of the transformation is described below. Suppose we have the
// following loop.		// following loop.
//		//
[148 lines not shown]	for (Instruction &LEI : *LoopExitBlock) {
if (!LCSSAPhi)		if (!LCSSAPhi)
break;		break;
if (LCSSAPhi->getNumIncomingValues() == 1)		if (LCSSAPhi->getNumIncomingValues() == 1)
LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()),		LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()),
LoopMiddleBlock);		LoopMiddleBlock);
}		}
}		}

void InnerLoopVectorizer::collectTriviallyDeadInstructions() {		void InnerLoopVectorizer::collectTriviallyDeadInstructions(
		Loop *OrigLoop, LoopVectorizationLegality *Legal,
		SmallPtrSetImpl<Instruction *> &DeadInstructions) {
BasicBlock *Latch = OrigLoop->getLoopLatch();		BasicBlock *Latch = OrigLoop->getLoopLatch();

// We create new control-flow for the vectorized loop, so the original		// We create new control-flow for the vectorized loop, so the original
// condition will be dead after vectorization if it's only used by the		// condition will be dead after vectorization if it's only used by the
// branch.		// branch.
auto *Cmp = dyn_cast<Instruction>(Latch->getTerminator()->getOperand(0));		auto *Cmp = dyn_cast<Instruction>(Latch->getTerminator()->getOperand(0));
if (Cmp && Cmp->hasOneUse())		if (Cmp && Cmp->hasOneUse())
DeadInstructions.insert(Cmp);		DeadInstructions.insert(Cmp);

// We create new "steps" for induction variable updates to which the original		// We create new "steps" for induction variable updates to which the original
// induction variables map. An original update instruction will be dead if		// induction variables map. An original update instruction will be dead if
// all its users except the induction variable are dead.		// all its users except the induction variable are dead.
for (auto &Induction : *Legal->getInductionVars()) {		for (auto &Induction : *Legal->getInductionVars()) {
PHINode *Ind = Induction.first;		PHINode *Ind = Induction.first;
auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));		auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));
if (all_of(IndUpdate->users(), [&](User *U) -> bool {		if (all_of(IndUpdate->users(), [&](User *U) -> bool {
return U == Ind || DeadInstructions.count(cast<Instruction>(U));		return U == Ind || DeadInstructions.count(cast<Instruction>(U));
}))		}))
DeadInstructions.insert(IndUpdate);		DeadInstructions.insert(IndUpdate);
}		}
}		}
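
The rule in collectTriviallyDeadInstructions can be illustrated without LLVM: the latch compare is dead if the branch is its only user, and an induction update is dead if each of its users is either the induction phi itself or already marked dead. The sketch below is only a model under stated assumptions: instructions are integer ids, `Inst` and `collectDead` are hypothetical names, and "has one use" is approximated as "has exactly one user".

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for an instruction: just the ids of its users.
struct Inst {
  std::vector<int> users;
};

// `latchCmp` is the id of the latch compare; `inductions` holds
// (phi id, update id) pairs. Returns the ids considered trivially dead.
std::set<int> collectDead(const std::vector<Inst> &insts, int latchCmp,
                          const std::vector<std::pair<int, int>> &inductions) {
  std::set<int> dead;

  // The original exit condition dies if only the latch branch uses it.
  if (insts[latchCmp].users.size() == 1)
    dead.insert(latchCmp);

  // An induction update dies if all users are the phi or already dead.
  for (const auto &iv : inductions) {
    const int phi = iv.first;
    const int update = iv.second;
    const std::vector<int> &users = insts[update].users;
    if (std::all_of(users.begin(), users.end(), [&](int u) {
          return u == phi || dead.count(u) != 0;
        }))
      dead.insert(update);
  }
  return dead;
}
```

The order matters in this model: the compare is examined first, so an update whose only non-phi user is that compare is also found dead.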

void InnerLoopVectorizer::sinkScalarOperands(Instruction *PredInst) {		void InnerLoopUnroller::sinkScalarOperands(Instruction *PredInst) {

// The basic block and loop containing the predicated instruction.		// The basic block and loop containing the predicated instruction.
auto *PredBB = PredInst->getParent();		auto *PredBB = PredInst->getParent();
auto *VectorLoop = LI->getLoopFor(PredBB);		auto *VectorLoop = LI->getLoopFor(PredBB);

// Initialize a worklist with the operands of the predicated instruction.		// Initialize a worklist with the operands of the predicated instruction.
SetVector<Value *> Worklist(PredInst->op_begin(), PredInst->op_end());		SetVector<Value *> Worklist(PredInst->op_begin(), PredInst->op_end());

[49 lines not shown]	while (!Worklist.empty()) {

// The sinking may have enabled other instructions to be sunk, so we will		// The sinking may have enabled other instructions to be sunk, so we will
// need to iterate.		// need to iterate.
Changed = true;		Changed = true;
}		}
} while (Changed);		} while (Changed);
}		}

void InnerLoopVectorizer::predicateInstructions() {		void InnerLoopUnroller::vectorizeLoop() {

		// Collect instructions from the original loop that will become trivially
		// dead in the vectorized loop. We don't need to vectorize these
		// instructions.
		collectTriviallyDeadInstructions(OrigLoop, Legal, DeadInstructions);

		// Scan the loop in a topological order to ensure that defs are vectorized
		// before users.
		LoopBlocksDFS DFS(OrigLoop);
		DFS.perform(LI);

		// Vectorize all of the blocks in the original loop.
		for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO()))
		for (Instruction &I : *BB) {
		if (!DeadInstructions.count(&I))
		vectorizeInstruction(I);
		}

		fixCrossIterationPHIs();

		// Update the dominator tree.
		//
		// FIXME: After creating the structure of the new loop, the dominator tree is
		// no longer up-to-date, and it remains that way until we update it
		// here. An out-of-date dominator tree is problematic for SCEV,
		// because SCEVExpander uses it to guide code generation. The
		// vectorizer uses SCEVExpanders in several places. Instead, we should
		// keep the dominator tree up-to-date as we go.
		updateAnalysis();

		// Fix-up external users of the induction variables.
		for (auto &Entry : *Legal->getInductionVars())
		fixupIVUsers(Entry.first, Entry.second,
		getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)),
		IVEndValues[Entry.first], LoopMiddleBlock);

		fixLCSSAPHIs();
		predicateInstructions();

		// Remove redundant induction instructions.
		cse(LoopVectorBody);
		}

		void InnerLoopUnroller::predicateInstructions() {

// For each instruction I marked for predication on value C, split I into its		// For each instruction I marked for predication on value C, split I into its
// own basic block to form an if-then construct over C. Since I may be fed by		// own basic block to form an if-then construct over C. Since I may be fed by
// an extractelement instruction or other scalar operand, we try to		// an extractelement instruction or other scalar operand, we try to
// iteratively sink its scalar operands into the predicated block. If I feeds		// iteratively sink its scalar operands into the predicated block. If I feeds
// an insertelement instruction, we try to move this instruction into the		// an insertelement instruction, we try to move this instruction into the
// predicated block as well. For non-void types, a phi node will be created		// predicated block as well. For non-void types, a phi node will be created
// for the resulting value (either vector or scalar).		// for the resulting value (either vector or scalar).
[108 lines not shown]
}		}

InnerLoopVectorizer::VectorParts		InnerLoopVectorizer::VectorParts
InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {		InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {
assert(is_contained(predecessors(Dst), Src) && "Invalid edge");		assert(is_contained(predecessors(Dst), Src) && "Invalid edge");

// Look for cached value.		// Look for cached value.
std::pair<BasicBlock *, BasicBlock *> Edge(Src, Dst);		std::pair<BasicBlock *, BasicBlock *> Edge(Src, Dst);
EdgeMaskCache::iterator ECEntryIt = MaskCache.find(Edge);		EdgeMaskCacheTy::iterator ECEntryIt = EdgeMaskCache.find(Edge);
if (ECEntryIt != MaskCache.end())		if (ECEntryIt != EdgeMaskCache.end())
return ECEntryIt->second;		return ECEntryIt->second;

VectorParts SrcMask = createBlockInMask(Src);		VectorParts SrcMask = createBlockInMask(Src);

// The terminator has to be a branch inst!		// The terminator has to be a branch inst!
BranchInst *BI = dyn_cast<BranchInst>(Src->getTerminator());		BranchInst *BI = dyn_cast<BranchInst>(Src->getTerminator());
assert(BI && "Unexpected terminator found");		assert(BI && "Unexpected terminator found");

if (BI->isConditional()) {		if (BI->isConditional()) {
VectorParts EdgeMask = getVectorValue(BI->getCondition());		VectorParts EdgeMask = getVectorValue(BI->getCondition());

if (BI->getSuccessor(0) != Dst)		if (BI->getSuccessor(0) != Dst)
for (unsigned part = 0; part < UF; ++part)		for (unsigned part = 0; part < UF; ++part)
EdgeMask[part] = Builder.CreateNot(EdgeMask[part]);		EdgeMask[part] = Builder.CreateNot(EdgeMask[part]);

for (unsigned part = 0; part < UF; ++part)		for (unsigned part = 0; part < UF; ++part)
EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]);		EdgeMask[part] = Builder.CreateAnd(EdgeMask[part], SrcMask[part]);

MaskCache[Edge] = EdgeMask;		EdgeMaskCache[Edge] = EdgeMask;
return EdgeMask;		return EdgeMask;
}		}

MaskCache[Edge] = SrcMask;		EdgeMaskCache[Edge] = SrcMask;
return SrcMask;		return SrcMask;
}		}

InnerLoopVectorizer::VectorParts		InnerLoopVectorizer::VectorParts
InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) {		InnerLoopVectorizer::createBlockInMask(BasicBlock *BB) {
assert(OrigLoop->contains(BB) && "Block is not a part of a loop");		assert(OrigLoop->contains(BB) && "Block is not a part of a loop");

		// Look for cached value.
		BlockMaskCacheTy::iterator BCEntryIt = BlockMaskCache.find(BB);
		if (BCEntryIt != BlockMaskCache.end())
		return BCEntryIt->second;

// Loop incoming mask is all-one.		// Loop incoming mask is all-one.
if (OrigLoop->getHeader() == BB) {		if (OrigLoop->getHeader() == BB) {
Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1);		Value *C = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 1);
return getVectorValue(C);		return getVectorValue(C);
}		}

// This is the block mask. We OR all incoming edge masks, starting from zero.		// This is the block mask. We OR all incoming edge masks, starting from zero.
Value *Zero = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 0);		Value *Zero = ConstantInt::get(IntegerType::getInt1Ty(BB->getContext()), 0);
VectorParts BlockMask = getVectorValue(Zero);		VectorParts BlockMask = getVectorValue(Zero);

// For each pred:		// For each pred:
for (pred_iterator it = pred_begin(BB), e = pred_end(BB); it != e; ++it) {		for (pred_iterator it = pred_begin(BB), e = pred_end(BB); it != e; ++it) {
VectorParts EM = createEdgeMask(*it, BB);		VectorParts EM = createEdgeMask(*it, BB);
for (unsigned part = 0; part < UF; ++part)		for (unsigned part = 0; part < UF; ++part)
BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]);		BlockMask[part] = Builder.CreateOr(BlockMask[part], EM[part]);
}		}

		BlockMaskCache[BB] = BlockMask;
return BlockMask;		return BlockMask;
}		}
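
createEdgeMask and createBlockInMask are mutually recursive: a block's mask is the OR of its incoming edge masks, and an edge's mask is the source block's mask ANDed with the branch condition (negated when the edge is the false successor). The right-hand side of the diff adds a cache for block masks next to the existing edge-mask cache. A minimal sketch of that scheme, assuming a tiny CFG with integer block ids and one bit per vector lane, is shown below. `Block`, `MaskBuilder` and their members are hypothetical names, not the patch's types.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Mask = std::uint8_t; // one bit per vector lane (assumption: VF <= 8)

// Hypothetical CFG node: predecessors plus an optional conditional branch.
struct Block {
  std::vector<int> preds;
  bool conditional = false;
  Mask cond = 0;     // per-lane branch condition
  int trueSucc = -1; // successor taken where the condition is true
};

struct MaskBuilder {
  std::vector<Block> blocks;
  int header = 0;
  Mask allOnes = 0;
  std::map<std::pair<int, int>, Mask> edgeCache;
  std::map<int, Mask> blockCache;

  Mask blockMask(int bb) {
    // Look for a cached value first.
    auto it = blockCache.find(bb);
    if (it != blockCache.end())
      return it->second;
    // The mask entering the loop header is all-ones.
    if (bb == header)
      return allOnes;
    // OR all incoming edge masks, starting from zero.
    Mask m = 0;
    for (int p : blocks[bb].preds)
      m |= edgeMask(p, bb);
    blockCache[bb] = m;
    return m;
  }

  Mask edgeMask(int src, int dst) {
    const std::pair<int, int> key(src, dst);
    auto it = edgeCache.find(key);
    if (it != edgeCache.end())
      return it->second;
    Mask m = blockMask(src);
    const Block &b = blocks[src];
    if (b.conditional) {
      Mask c = b.cond;
      if (b.trueSucc != dst) // false edge: negate the condition
        c = static_cast<Mask>(~c & allOnes);
      m &= c;
    }
    edgeCache[key] = m;
    return m;
  }
};
```

On a diamond (header branches to two blocks that rejoin), the two side blocks get complementary masks and the join block's mask comes back to all-ones.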

void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN, unsigned UF,		void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN, unsigned UF,
unsigned VF, PhiVector *PV) {		unsigned VF, PhiVector *PV) {
PHINode *P = cast<PHINode>(PN);		PHINode *P = cast<PHINode>(PN);
// Handle recurrences.		// Handle recurrences.
if (Legal->isReductionVariable(P) || Legal->isFirstOrderRecurrence(P)) {		if (Legal->isReductionVariable(P) || Legal->isFirstOrderRecurrence(P)) {
[56 lines not shown]	void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN, unsigned UF,
const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout();		const DataLayout &DL = OrigLoop->getHeader()->getModule()->getDataLayout();

// FIXME: The newly created binary instructions should contain nsw/nuw flags,		// FIXME: The newly created binary instructions should contain nsw/nuw flags,
// which can be found from the original scalar operations.		// which can be found from the original scalar operations.
switch (II.getKind()) {		switch (II.getKind()) {
case InductionDescriptor::IK_NoInduction:		case InductionDescriptor::IK_NoInduction:
llvm_unreachable("Unknown induction");		llvm_unreachable("Unknown induction");
case InductionDescriptor::IK_IntInduction:		case InductionDescriptor::IK_IntInduction:
return widenIntInduction(P);		widenIntInduction(needsScalarInduction(P), P); // Used only by Unroller
		return;
case InductionDescriptor::IK_PtrInduction: {		case InductionDescriptor::IK_PtrInduction: {
// Handle the pointer induction variable case.		// Handle the pointer induction variable case.
assert(P->getType()->isPointerTy() && "Unexpected type.");		assert(P->getType()->isPointerTy() && "Unexpected type.");
// This is the normalized GEP that starts counting at zero.		// This is the normalized GEP that starts counting at zero.
Value *PtrInd = Induction;		Value *PtrInd = Induction;
PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType());		PtrInd = Builder.CreateSExtOrTrunc(PtrInd, II.getStep()->getType());
// Determine the number of scalars we need to generate for each unroll		// Determine the number of scalars we need to generate for each unroll
// iteration. If the instruction is uniform, we only need to generate the		// iteration. If the instruction is uniform, we only need to generate the
[55 lines not shown]	assert((I.getOpcode() == Instruction::UDiv ||
I.getOpcode() == Instruction::URem ||		I.getOpcode() == Instruction::URem ||
I.getOpcode() == Instruction::SRem) &&		I.getOpcode() == Instruction::SRem) &&
"Unexpected instruction");		"Unexpected instruction");
Value *Divisor = I.getOperand(1);		Value *Divisor = I.getOperand(1);
auto *CInt = dyn_cast<ConstantInt>(Divisor);		auto *CInt = dyn_cast<ConstantInt>(Divisor);
return !CInt \|\| CInt->isZero();		return !CInt \|\| CInt->isZero();
}		}

void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {		void InnerLoopVectorizer::vectorizeInstruction(Instruction &I) {
// For each instruction in the old loop.
for (Instruction &I : *BB) {

// If the instruction will become trivially dead when vectorized, we don't
// need to generate it.
if (DeadInstructions.count(&I))
continue;

// Scalarize instructions that should remain scalar after vectorization.
if (VF > 1 &&
!(isa<BranchInst>(&I) || isa<PHINode>(&I) ||
isa<DbgInfoIntrinsic>(&I)) &&
shouldScalarizeInstruction(&I)) {
scalarizeInstruction(&I, Legal->isScalarWithPredication(&I));
continue;
}

switch (I.getOpcode()) {		switch (I.getOpcode()) {
case Instruction::Br:
// Nothing to do for PHIs and BR, since we already took care of the
// loop control flow instructions.
continue;
case Instruction::PHI: {		case Instruction::PHI: {
// Vectorize PHINodes.		// Vectorize PHINodes.
widenPHIInstruction(&I, UF, VF, PV);		PhiVector PV; // Records Reduction and FirstOrderRecurrence header Phis.
continue;		widenPHIInstruction(&I, UF, VF, &PV);
		break;
} // End of PHI.		} // End of PHI.

case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::SRem:		case Instruction::SRem:
case Instruction::URem:		case Instruction::URem:
// Scalarize with predication if this instruction may divide by zero and
// block execution is conditional, otherwise fallthrough.
if (Legal->isScalarWithPredication(&I)) {
scalarizeInstruction(&I, true);
continue;
}
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// Just widen binops.		// Just widen binops.
auto *BinOp = cast<BinaryOperator>(&I);		auto *BinOp = cast<BinaryOperator>(&I);
setDebugLocFromInst(Builder, BinOp);		setDebugLocFromInst(Builder, BinOp);
const VectorParts &A = getVectorValue(BinOp->getOperand(0));		const VectorParts &A = getVectorValue(BinOp->getOperand(0));
const VectorParts &B = getVectorValue(BinOp->getOperand(1));		const VectorParts &B = getVectorValue(BinOp->getOperand(1));

// Use this vector value for all users of the original instruction.		// Use this vector value for all users of the original instruction.
VectorParts Entry(UF);		VectorParts Entry(UF);
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]);		Value *V = Builder.CreateBinOp(BinOp->getOpcode(), A[Part], B[Part]);

if (BinaryOperator *VecOp = dyn_cast<BinaryOperator>(V))		if (BinaryOperator *VecOp = dyn_cast<BinaryOperator>(V))
VecOp->copyIRFlags(BinOp);		VecOp->copyIRFlags(BinOp);

Entry[Part] = V;		Entry[Part] = V;
}		}

VectorLoopValueMap.initVector(&I, Entry);		VectorLoopValueMap.initVector(&I, Entry);
addMetadata(Entry, BinOp);		addMetadata(Entry, BinOp);
break;		break;
}		}
case Instruction::Select: {		case Instruction::Select: {
// Widen selects.		// Widen selects.
// If the selector is loop invariant we can create a select		// If the selector is loop invariant we can create a select
// instruction with a scalar condition. Otherwise, use vector-select.		// instruction with a scalar condition. Otherwise, use vector-select.
auto *SE = PSE.getSE();		auto *SE = PSE.getSE();
bool InvariantCond =		bool InvariantCond =
SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop);		SE->isLoopInvariant(PSE.getSCEV(I.getOperand(0)), OrigLoop);
setDebugLocFromInst(Builder, &I);		setDebugLocFromInst(Builder, &I);

// The condition can be loop invariant but still defined inside the		// The condition can be loop invariant but still defined inside the
// loop. This means that we can't just use the original 'cond' value.		// loop. This means that we can't just use the original 'cond' value.
// We have to take the 'vectorized' value and pick the first lane.		// We have to take the 'vectorized' value and pick the first lane.
// Instcombine will make this a no-op.		// Instcombine will make this a no-op.
const VectorParts &Cond = getVectorValue(I.getOperand(0));		const VectorParts &Cond = getVectorValue(I.getOperand(0));
const VectorParts &Op0 = getVectorValue(I.getOperand(1));		const VectorParts &Op0 = getVectorValue(I.getOperand(1));
const VectorParts &Op1 = getVectorValue(I.getOperand(2));		const VectorParts &Op1 = getVectorValue(I.getOperand(2));

auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0);		auto *ScalarCond = getScalarValue(I.getOperand(0), 0, 0);

VectorParts Entry(UF);		VectorParts Entry(UF);
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Entry[Part] = Builder.CreateSelect(		Entry[Part] = Builder.CreateSelect(
InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]);		InvariantCond ? ScalarCond : Cond[Part], Op0[Part], Op1[Part]);
}		}

VectorLoopValueMap.initVector(&I, Entry);		VectorLoopValueMap.initVector(&I, Entry);
addMetadata(Entry, &I);		addMetadata(Entry, &I);
break;		break;
}		}

case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
// Widen compares. Generate vector compares.		// Widen compares. Generate vector compares.
bool FCmp = (I.getOpcode() == Instruction::FCmp);		bool FCmp = (I.getOpcode() == Instruction::FCmp);
auto *Cmp = dyn_cast<CmpInst>(&I);		auto *Cmp = dyn_cast<CmpInst>(&I);
setDebugLocFromInst(Builder, Cmp);		setDebugLocFromInst(Builder, Cmp);
const VectorParts &A = getVectorValue(Cmp->getOperand(0));		const VectorParts &A = getVectorValue(Cmp->getOperand(0));
const VectorParts &B = getVectorValue(Cmp->getOperand(1));		const VectorParts &B = getVectorValue(Cmp->getOperand(1));
VectorParts Entry(UF);		VectorParts Entry(UF);
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Value *C = nullptr;		Value *C = nullptr;
if (FCmp) {		if (FCmp) {
C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]);		C = Builder.CreateFCmp(Cmp->getPredicate(), A[Part], B[Part]);
cast<FCmpInst>(C)->copyFastMathFlags(Cmp);		cast<FCmpInst>(C)->copyFastMathFlags(Cmp);
} else {		} else {
C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]);		C = Builder.CreateICmp(Cmp->getPredicate(), A[Part], B[Part]);
}		}
Entry[Part] = C;		Entry[Part] = C;
}		}

VectorLoopValueMap.initVector(&I, Entry);		VectorLoopValueMap.initVector(&I, Entry);
addMetadata(Entry, &I);		addMetadata(Entry, &I);
break;		break;
}		}

case Instruction::Store:		case Instruction::Store:
case Instruction::Load:		case Instruction::Load:
vectorizeMemoryInstruction(&I);		vectorizeMemoryInstruction(&I);
break;		break;
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
case Instruction::SIToFP:		case Instruction::SIToFP:
case Instruction::UIToFP:		case Instruction::UIToFP:
case Instruction::Trunc:		case Instruction::Trunc:
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
auto *CI = dyn_cast<CastInst>(&I);		auto *CI = dyn_cast<CastInst>(&I);
setDebugLocFromInst(Builder, CI);		setDebugLocFromInst(Builder, CI);

// Optimize the special case where the source is a constant integer
// induction variable. Notice that we can only optimize the 'trunc' case
// because (a) FP conversions lose precision, (b) sext/zext may wrap, and
// (c) other casts depend on pointer size.
if (Cost->isOptimizableIVTruncate(CI, VF)) {
widenIntInduction(cast<PHINode>(CI->getOperand(0)),
cast<TruncInst>(CI));
break;
}

/// Vectorize casts.		/// Vectorize casts.
Type *DestTy =		Type *DestTy =
(VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF);		(VF == 1) ? CI->getType() : VectorType::get(CI->getType(), VF);

const VectorParts &A = getVectorValue(CI->getOperand(0));		const VectorParts &A = getVectorValue(CI->getOperand(0));
VectorParts Entry(UF);		VectorParts Entry(UF);
for (unsigned Part = 0; Part < UF; ++Part)		for (unsigned Part = 0; Part < UF; ++Part)
Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy);		Entry[Part] = Builder.CreateCast(CI->getOpcode(), A[Part], DestTy);
VectorLoopValueMap.initVector(&I, Entry);		VectorLoopValueMap.initVector(&I, Entry);
addMetadata(Entry, &I);		addMetadata(Entry, &I);
break;		break;
}		}

case Instruction::Call: {		case Instruction::Call: {
// Ignore dbg intrinsics.		// Ignore dbg intrinsics.
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
break;		break;
setDebugLocFromInst(Builder, &I);		setDebugLocFromInst(Builder, &I);

Module *M = BB->getParent()->getParent();		Module *M = I.getParent()->getParent()->getParent();
auto *CI = cast<CallInst>(&I);		auto *CI = cast<CallInst>(&I);

StringRef FnName = CI->getCalledFunction()->getName();		StringRef FnName = CI->getCalledFunction()->getName();
Function *F = CI->getCalledFunction();		Function *F = CI->getCalledFunction();
Type *RetTy = ToVectorTy(CI->getType(), VF);		Type *RetTy = ToVectorTy(CI->getType(), VF);
SmallVector<Type *, 4> Tys;		SmallVector<Type *, 4> Tys;
for (Value *ArgOperand : CI->arg_operands())		for (Value *ArgOperand : CI->arg_operands())
Tys.push_back(ToVectorTy(ArgOperand->getType(), VF));		Tys.push_back(ToVectorTy(ArgOperand->getType(), VF));

Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end ||		bool NeedToScalarize; // Redundant, needed for UseVectorIntrinsic.
ID == Intrinsic::lifetime_start)) {
scalarizeInstruction(&I);
break;
}
// The flag shows whether we use Intrinsic or a usual Call for vectorized
// version of the instruction.
// Is it beneficial to perform intrinsic call compared to lib call?
bool NeedToScalarize;
unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize);		unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize);
bool UseVectorIntrinsic =		bool UseVectorIntrinsic =
ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost;		ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost;
if (!UseVectorIntrinsic && NeedToScalarize) {
scalarizeInstruction(&I);
break;
}

VectorParts Entry(UF);		VectorParts Entry(UF);
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
SmallVector<Value *, 4> Args;		SmallVector<Value *, 4> Args;
for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {		for (unsigned i = 0, ie = CI->getNumArgOperands(); i != ie; ++i) {
Value *Arg = CI->getArgOperand(i);		Value *Arg = CI->getArgOperand(i);
// Some intrinsics have a scalar argument - don't replace it with a		// Some intrinsics have a scalar argument - don't replace it with a
// vector.		// vector.
if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) {		if (!UseVectorIntrinsic || !hasVectorInstrinsicScalarOpd(ID, i)) {
const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i));		const VectorParts &VectorArg = getVectorValue(CI->getArgOperand(i));
Arg = VectorArg[Part];		Arg = VectorArg[Part];
}		}
Args.push_back(Arg);		Args.push_back(Arg);
}		}

Function *VectorF;		Function *VectorF;
if (UseVectorIntrinsic) {		if (UseVectorIntrinsic) {
// Use vector version of the intrinsic.		// Use vector version of the intrinsic.
Type *TysForDecl[] = {CI->getType()};		Type *TysForDecl[] = {CI->getType()};
if (VF > 1)		if (VF > 1)
TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF);		TysForDecl[0] = VectorType::get(CI->getType()->getScalarType(), VF);
VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl);		VectorF = Intrinsic::getDeclaration(M, ID, TysForDecl);
} else {		} else {
// Use vector version of the library call.		// Use vector version of the library call.
StringRef VFnName = TLI->getVectorizedFunction(FnName, VF);		StringRef VFnName = TLI->getVectorizedFunction(FnName, VF);
assert(!VFnName.empty() && "Vector function name is empty.");		assert(!VFnName.empty() && "Vector function name is empty.");
VectorF = M->getFunction(VFnName);		VectorF = M->getFunction(VFnName);
if (!VectorF) {		if (!VectorF) {
// Generate a declaration		// Generate a declaration
FunctionType *FTy = FunctionType::get(RetTy, Tys, false);		FunctionType *FTy = FunctionType::get(RetTy, Tys, false);
VectorF =		VectorF =
Function::Create(FTy, Function::ExternalLinkage, VFnName, M);		Function::Create(FTy, Function::ExternalLinkage, VFnName, M);
VectorF->copyAttributesFrom(F);		VectorF->copyAttributesFrom(F);
}		}
}		}
assert(VectorF && "Can't create vector function.");		assert(VectorF && "Can't create vector function.");

SmallVector<OperandBundleDef, 1> OpBundles;		SmallVector<OperandBundleDef, 1> OpBundles;
CI->getOperandBundlesAsDefs(OpBundles);		CI->getOperandBundlesAsDefs(OpBundles);
CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles);		CallInst *V = Builder.CreateCall(VectorF, Args, OpBundles);

if (isa<FPMathOperator>(V))		if (isa<FPMathOperator>(V))
V->copyFastMathFlags(CI);		V->copyFastMathFlags(CI);

Entry[Part] = V;		Entry[Part] = V;
}		}

VectorLoopValueMap.initVector(&I, Entry);		VectorLoopValueMap.initVector(&I, Entry);
addMetadata(Entry, &I);		addMetadata(Entry, &I);
break;		break;
}		}

default:		default:
// All other instructions are unsupported. Scalarize them.		// All other instructions are scalarized.
scalarizeInstruction(&I);		DEBUG(dbgs() << "LV: Found an unhandled instruction: " << I);
break;		llvm_unreachable("Unhandled instruction!");
} // end of switch.		} // end of switch.
} // end of for_each instr.
}		}

void InnerLoopVectorizer::updateAnalysis() {		void InnerLoopVectorizer::updateAnalysis() {
// Forget the original basic block.		// Forget the original basic block.
PSE.getSE()->forgetLoop(OrigLoop);		PSE.getSE()->forgetLoop(OrigLoop);

// Update the dominator tree information.		// Update the dominator tree information.
assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) &&		assert(DT->properlyDominates(LoopBypassBlocks.front(), LoopExitBlock) &&
"Entry does not dominate exit.");		"Entry does not dominate exit.");

// We don't predicate stores by this point, so the vector body should be a		if (!DT->getNode(LoopVectorBody)) // For InnerLoopUnroller.
// single loop.
DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader);		DT->addNewBlock(LoopVectorBody, LoopVectorPreHeader);
		auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch();
DT->addNewBlock(LoopMiddleBlock, LoopVectorBody);		DT->addNewBlock(LoopMiddleBlock, LoopVectorLatch);
DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]);		DT->addNewBlock(LoopScalarPreHeader, LoopBypassBlocks[0]);
DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader);		DT->changeImmediateDominator(LoopScalarBody, LoopScalarPreHeader);
DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]);		DT->changeImmediateDominator(LoopExitBlock, LoopBypassBlocks[0]);

DEBUG(DT->verifyDomTree());		DEBUG(DT->verifyDomTree());
}		}

/// \brief Check whether it is safe to if-convert this phi node.		/// \brief Check whether it is safe to if-convert this phi node.
///		///
/// Phi nodes with constant expressions that can trap are not safe to if		/// Phi nodes with constant expressions that can trap are not safe to if
/// convert.		/// convert.
static bool canIfConvertPHINodes(BasicBlock *BB) {		static bool canIfConvertPHINodes(BasicBlock *BB) {
[1,075 lines not shown]	if (LastMember) {
continue;		continue;
}		}
DEBUG(dbgs() << "LV: Interleaved group requires epilogue iteration.\n");		DEBUG(dbgs() << "LV: Interleaved group requires epilogue iteration.\n");
RequiresScalarEpilogue = true;		RequiresScalarEpilogue = true;
}		}
}		}
}		}

LoopVectorizationCostModel::VectorizationFactor		bool LoopVectorizationCostModel::canVectorize(bool OptForSize) {
LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
// Width 1 means no vectorize
VectorizationFactor Factor = {1U, 0U};
if (OptForSize && Legal->getRuntimePointerChecking()->Need) {		if (OptForSize && Legal->getRuntimePointerChecking()->Need) {
ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")		ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
<< "runtime pointer checks needed. Enable vectorization of this "		<< "runtime pointer checks needed. Enable vectorization of this "
"loop with '#pragma clang loop vectorize(enable)' when "		"loop with '#pragma clang loop vectorize(enable)' when "
"compiling with -Os/-Oz");		"compiling with -Os/-Oz");
DEBUG(dbgs()		DEBUG(dbgs()
<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");		<< "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");
return Factor;		return false;
}		}

if (!EnableCondStoresVectorization && Legal->getNumPredStores()) {		if (!EnableCondStoresVectorization && Legal->getNumPredStores()) {
ORE->emit(createMissedAnalysis("ConditionalStore")		ORE->emit(createMissedAnalysis("ConditionalStore")
<< "store that is conditionally executed prevents vectorization");		<< "store that is conditionally executed prevents vectorization");
DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n");		DEBUG(dbgs() << "LV: No vectorization. There are conditional stores.\n");
return Factor;		return false;
		}

		// If we optimize the program for size, avoid creating the tail loop.
		if (OptForSize) {
		unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
		DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

		// If we don't know the precise trip count, don't try to vectorize.
		if (TC < 2) {
		ORE->emit(
		createMissedAnalysis("UnknownLoopCountComplexCFG")
		<< "unable to calculate the loop count due to complex control flow");
		DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
		return false;
		}
		}
		return true;
}		}

		unsigned
		LoopVectorizationCostModel::computeMaxVectorizationFactor(bool OptForSize) {
MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);		MinBWs = computeMinimumValueSizes(TheLoop->getBlocks(), *DB, &TTI);
unsigned SmallestType, WidestType;		unsigned SmallestType, WidestType;
std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();		std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();
unsigned WidestRegister = TTI.getRegisterBitWidth(true);		unsigned WidestRegister = TTI.getRegisterBitWidth(true);
unsigned MaxSafeDepDist = -1U;		unsigned MaxSafeDepDist = -1U;

// Get the maximum safe dependence distance in bits computed by LAA. If the		// Get the maximum safe dependence distance in bits computed by LAA. If the
// loop contains any interleaved accesses, we divide the dependence distance		// loop contains any interleaved accesses, we divide the dependence distance
[18 lines not shown]	if (MaxVectorSize == 0) {
DEBUG(dbgs() << "LV: The target has no vector registers.\n");		DEBUG(dbgs() << "LV: The target has no vector registers.\n");
MaxVectorSize = 1;		MaxVectorSize = 1;
}		}

assert(MaxVectorSize <= 64 && "Did not expect to pack so many elements"		assert(MaxVectorSize <= 64 && "Did not expect to pack so many elements"
" into one vector!");		" into one vector!");

unsigned VF = MaxVectorSize;		unsigned VF = MaxVectorSize;

if (MaximizeBandwidth && !OptForSize) {		if (MaximizeBandwidth && !OptForSize) {
// Collect all viable vectorization factors.		// Collect all viable vectorization factors.
SmallVector<unsigned, 8> VFs;		SmallVector<unsigned, 8> VFs;
unsigned NewMaxVectorSize = WidestRegister / SmallestType;		unsigned NewMaxVectorSize = WidestRegister / SmallestType;
for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2)		for (unsigned VS = MaxVectorSize; VS <= NewMaxVectorSize; VS *= 2)
VFs.push_back(VS);		VFs.push_back(VS);

// For each VF calculate its register usage.		// For each VF calculate its register usage.
auto RUs = calculateRegisterUsage(VFs);		auto RUs = calculateRegisterUsage(VFs);

// Select the largest VF which doesn't require more registers than existing		// Select the largest VF which doesn't require more registers than existing
// ones.		// ones.
unsigned TargetNumRegisters = TTI.getNumberOfRegisters(true);		unsigned TargetNumRegisters = TTI.getNumberOfRegisters(true);
for (int i = RUs.size() - 1; i >= 0; --i) {		for (int i = RUs.size() - 1; i >= 0; --i) {
if (RUs[i].MaxLocalUsers <= TargetNumRegisters) {		if (RUs[i].MaxLocalUsers <= TargetNumRegisters) {
VF = VFs[i];		VF = VFs[i];
break;		break;
}		}
}		}
}		}
		return VF;
		}

// If we optimize the program for size, avoid creating the tail loop.		bool LoopVectorizationCostModel::requiresTail(unsigned MaxVectorSize) {
if (OptForSize) {
unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);		unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');		DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

// If we don't know the precise trip count, don't try to vectorize.
if (TC < 2) {
ORE->emit(
createMissedAnalysis("UnknownLoopCountComplexCFG")
<< "unable to calculate the loop count due to complex control flow");
DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
return Factor;
}

// Find the maximum SIMD width that can fit within the trip count.		// Find the maximum SIMD width that can fit within the trip count.
VF = TC % MaxVectorSize;		unsigned VF = TC % MaxVectorSize;

if (VF == 0)		if (VF == 0)
VF = MaxVectorSize;		return false;
else {
// If the trip count that we found modulo the vectorization factor is not		// If the trip count that we found modulo the vectorization factor is not
// zero then we require a tail.		// zero then we require a tail.
ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize")		ORE->emit(createMissedAnalysis("NoTailLoopWithOptForSize")
<< "cannot optimize for size and vectorize at the "		<< "cannot optimize for size and vectorize at the "
"same time. Enable vectorization of this loop "		"same time. Enable vectorization of this loop "
"with '#pragma clang loop vectorize(enable)' "		"with '#pragma clang loop vectorize(enable)' "
"when compiling with -Os/-Oz");		"when compiling with -Os/-Oz");
DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");		DEBUG(dbgs() << "LV: Aborting. A tail loop is required with -Os/-Oz.\n");
return Factor;		return true;
}
}		}
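
The tail check above reduces to a remainder test on the constant trip count. A minimal standalone sketch, assuming a hypothetical free function `needsTail` that takes plain unsigned values rather than reading them from the cost model:

```cpp
#include <cassert>

// Hypothetical stand-in for the remainder test: a scalar tail loop is needed
// when the trip count is not a multiple of the vector width.
// Assumes TripCount is a known constant and MaxVectorSize is non-zero.
bool needsTail(unsigned TripCount, unsigned MaxVectorSize) {
  assert(MaxVectorSize != 0 && "vector width must be non-zero");
  return TripCount % MaxVectorSize != 0;
}
```

For example, a trip count of 16 with a width of 4 leaves no remainder, while a trip count of 18 leaves two scalar iterations that need a tail.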

int UserVF = Hints->getWidth();		LoopVectorizationCostModel::VectorizationFactor
if (UserVF != 0) {		LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize,
assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");		unsigned VF) {
DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		// Width 1 means no vectorize
		VectorizationFactor Factor = {1U, 0U};
Factor.Width = UserVF;

collectUniformsAndScalars(UserVF);
collectInstsToScalarize(UserVF);
return Factor;
}

float Cost = expectedCost(1).first;		float Cost = expectedCost(1).first;
#ifndef NDEBUG		#ifndef NDEBUG
const float ScalarCost = Cost;		const float ScalarCost = Cost;
#endif /* NDEBUG */		#endif /* NDEBUG */
unsigned Width = 1;		unsigned Width = 1;
DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n");		DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n");

[391 lines not shown]	for (unsigned i = 0, e = VFs.size(); i < e; ++i) {
RUs[i] = RU;		RUs[i] = RU;
}		}

return RUs;		return RUs;
}		}

void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) {		void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) {

// If we aren't vectorizing the loop, or if we've already collected the		// Function should not be called for the scalar case.
		assert(VF >= 2 && "Function called for the scalar loop");

		// If we've already collected the
// instructions to scalarize, there's nothing to do. Collection may already		// instructions to scalarize, there's nothing to do. Collection may already
// have occurred if we have a user-selected VF and are now computing the		// have occurred if we have a user-selected VF and are now computing the
// expected cost for interleaving.		// expected cost for interleaving.
if (VF < 2 || InstsToScalarize.count(VF))		if (InstsToScalarize.count(VF))
return;		return;

// Initialize a mapping for VF in InstsToScalarize. If we find that it's		// Initialize a mapping for VF in InstsToScalarize. If we find that it's
// not profitable to scalarize any instructions, the presence of VF in the		// not profitable to scalarize any instructions, the presence of VF in the
// map will indicate that we've analyzed it already.		// map will indicate that we've analyzed it already.
ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF];		ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF];

// Find all the instructions that are scalar with predication in the loop and		// Find all the instructions that are scalar with predication in the loop and
[127 lines not shown]	int LoopVectorizationCostModel::computePredInstDiscount(

return Discount;		return Discount;
}		}

LoopVectorizationCostModel::VectorizationCostTy		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::expectedCost(unsigned VF) {		LoopVectorizationCostModel::expectedCost(unsigned VF) {
VectorizationCostTy Cost;		VectorizationCostTy Cost;

// Collect Uniform and Scalar instructions after vectorization with VF.
collectUniformsAndScalars(VF);

// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.
collectInstsToScalarize(VF);

// For each block.		// For each block.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
VectorizationCostTy BlockCost;		VectorizationCostTy BlockCost;

// For each instruction in the old loop.		// For each instruction in the old loop.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
// Skip dbg intrinsics.		// Skip dbg intrinsics.
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
[520 lines not shown]	void LoopVectorizationCostModel::collectValuesToIgnore() {
// detection.		// detection.
for (auto &Reduction : *Legal->getReductionVars()) {		for (auto &Reduction : *Legal->getReductionVars()) {
RecurrenceDescriptor &RedDes = Reduction.second;		RecurrenceDescriptor &RedDes = Reduction.second;
SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();		SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
VecValuesToIgnore.insert(Casts.begin(), Casts.end());		VecValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}
}		}

		LoopVectorizationCostModel::VectorizationFactor
		LoopVectorizationPlanner::plan(bool OptForSize, unsigned UserVF,
		unsigned MaxVF) {
		if (UserVF) {
		DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
		if (UserVF == 1)
		return {UserVF, 0};
		assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
		// Collect Uniform and Scalar instructions after vectorization with VF.
		CM->collectUniformsAndScalars(UserVF);
		// Collect the instructions (and their associated costs) that will be more
		// profitable to scalarize.
		CM->collectInstsToScalarize(UserVF);
		buildInitialVPlans(UserVF, UserVF);
		DEBUG(printCurrentPlans("Initial VPlans", dbgs()));
		optimizePredicatedInstructions();
		DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs()));
		return {UserVF, 0};
		}
		if (MaxVF == 1)
		return {1, 0};

		assert(MaxVF > 1 && "MaxVF is zero.");
		for (unsigned i = 2; i <= MaxVF; i *= 2) {
		// Collect Uniform and Scalar instructions after vectorization with VF.
		CM->collectUniformsAndScalars(i);
		// Collect the instructions (and their associated costs) that will be more
		// profitable to scalarize.
		CM->collectInstsToScalarize(i);
		}
		buildInitialVPlans(2, MaxVF);
		DEBUG(printCurrentPlans("Initial VPlans", dbgs()));
		optimizePredicatedInstructions();
		DEBUG(printCurrentPlans("After optimize predicated instructions", dbgs()));
		// Select the optimal vectorization factor.
		return CM->selectVectorizationFactor(OptForSize, MaxVF);
		}
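
When no user VF is given, the planner above visits every power-of-two VF from 2 up to MaxVF before building plans for that range. A minimal sketch of just that enumeration, assuming a hypothetical `candidateVFs` helper and a small MaxVF (the cost model elsewhere asserts at most 64 elements per vector):

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: list the power-of-two vectorization factors in
// [2, MaxVF]. Returns an empty list when MaxVF < 2 (scalar only).
std::vector<unsigned> candidateVFs(unsigned MaxVF) {
  assert(MaxVF <= 64 && "sketch assumes a small maximum VF");
  std::vector<unsigned> VFs;
  for (unsigned VF = 2; VF <= MaxVF; VF *= 2)
    VFs.push_back(VF);
  return VFs;
}
```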

		void LoopVectorizationPlanner::printCurrentPlans(const std::string &Title,
		raw_ostream &O) {
		auto printPlan = [&](VPlan *Plan, const SmallVectorImpl<unsigned> &VFs,
		const std::string &Prefix) {
		std::string Title;
		raw_string_ostream RSO(Title);
		RSO << Prefix << " for VF=";
		if (VFs.size() == 1)
		RSO << VFs[0];
		else {
		RSO << "{";
		bool First = true;
		for (unsigned VF : VFs) {
		if (!First)
		RSO << ",";
		RSO << VF;
		First = false;
		}
		RSO << "}";
		}
		VPlanPrinter PlanPrinter(O, *Plan);
		PlanPrinter.dump(RSO.str());
		};

		if (VPlans.empty())
		return;

		VPlan *Current = VPlans.begin()->second.get();

		SmallVector<unsigned, 4> VFs;
		for (auto &Entry : VPlans) {
		VPlan *Plan = Entry.second.get();
		if (Plan != Current) {
		// Hit another VPlan. Print the current VPlan for the VFs it served thus
		// far and move on to the VPlan we just encountered.
		printPlan(Current, VFs, Title);
		Current = Plan;
		VFs.clear();
		}
		// Add VF to the list of VFs served by current VPlan.
		VFs.push_back(Entry.first);
		}
		// Print the current VPlan.
		printPlan(Current, VFs, Title);
		}
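
For illustration, the grouping that printCurrentPlans performs (consecutive VFs that share one VPlan are printed under a single title) can be sketched outside LLVM. This is a minimal sketch under stated assumptions: `Plan`, `formatTitle` and `planTitles` are hypothetical names introduced here, and the map is assumed to be ordered by VF, which the excerpt above does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Plan {}; // Hypothetical stand-in for VPlan.

// Builds the title printPlan would emit, e.g. "Prefix for VF={2,4}".
static std::string formatTitle(const std::string &Prefix,
                               const std::vector<unsigned> &VFs) {
  assert(!VFs.empty() && "Expected at least one VF per plan.");
  std::string Title = Prefix + " for VF=";
  if (VFs.size() == 1)
    return Title + std::to_string(VFs[0]);
  Title += "{";
  for (std::size_t I = 0; I < VFs.size(); ++I) {
    if (I != 0)
      Title += ",";
    Title += std::to_string(VFs[I]);
  }
  return Title + "}";
}

// Walks the VF-ordered map and emits one title per distinct plan, listing
// the run of VFs that plan serves (assumes sharing VFs are adjacent).
static std::vector<std::string>
planTitles(const std::map<unsigned, std::shared_ptr<Plan>> &Plans,
           const std::string &Prefix) {
  std::vector<std::string> Titles;
  if (Plans.empty())
    return Titles;
  const Plan *Current = Plans.begin()->second.get();
  std::vector<unsigned> VFs;
  for (const auto &Entry : Plans) {
    const Plan *P = Entry.second.get();
    if (P != Current) {
      // Hit another plan: flush the VFs collected for the previous one.
      Titles.push_back(formatTitle(Prefix, VFs));
      Current = P;
      VFs.clear();
    }
    VFs.push_back(Entry.first);
  }
  Titles.push_back(formatTitle(Prefix, VFs));
  return Titles;
}
```

With a map {2 -> A, 4 -> A, 8 -> B} and the prefix "Initial VPlans", this sketch yields the two titles "Initial VPlans for VF={2,4}" and "Initial VPlans for VF=8".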

		std::pair<VPRecipeBase *, VPRecipeBase *>
		LoopVectorizationPlanner::widenIntInduction(VPlan *Plan, unsigned StartRangeVF,
		unsigned &EndRangeVF, PHINode *IV,
		TruncInst *Trunc) {
		// The value from the original loop to which we are mapping the new
		// induction variable.
		Instruction *EntryVal = Trunc ? cast<Instruction>(Trunc) : IV;
		// Determine if we want a scalar version of the induction variable. This
		// is true if the induction variable itself is not widened, or if it has
		// at least one user in the loop that is not widened.
		auto NeedsScalarInduction = [&](unsigned VF) -> bool {
		if (shouldScalarizeInstruction(IV, VF))
		return true;
		auto isScalarInst = [&](User *U) -> bool {
		auto *I = cast<Instruction>(U);
		return (TheLoop->contains(I) && shouldScalarizeInstruction(I, VF));
		};
		return any_of(IV->users(), isScalarInst);
		};
		bool NeedsScalarIV =
		testVFRange(NeedsScalarInduction, StartRangeVF, EndRangeVF);
		// Generate the widening recipe.
		auto *WIIRecipe = new VPWidenIntInductionRecipe(NeedsScalarIV, IV, Trunc);
		if (!NeedsScalarIV)
		return std::make_pair<VPRecipeBase *, VPRecipeBase *>(WIIRecipe, nullptr);

		// Create scalar steps that can be used by instructions we will later
		// scalarize. Note that the addition of the scalar steps will not
		// increase the number of instructions in the loop in the common case
		// prior to InstCombine. We will be trading one vector extract for
		// each scalar step.
		auto *BSSRecipe = new VPBuildScalarStepsRecipe(WIIRecipe, EntryVal, Plan);
		// Determine the number of scalars we need to generate for each unroll
		// iteration. If EntryVal is uniform, we only need to generate the
		// first lane. Otherwise, we generate all VF values.
		auto isUniformAfterVectorization = [&](unsigned VF) -> bool {
		return CM->isUniformAfterVectorization(cast<Instruction>(EntryVal), VF);
		};
		if (testVFRange(isUniformAfterVectorization, StartRangeVF, EndRangeVF)) {
		VPlanUtilsLoopVectorizer PlanUtils(Plan);
		PlanUtils.designateLaneZero(BSSRecipe);
		}
		return std::make_pair<VPRecipeBase *, VPRecipeBase *>(WIIRecipe, BSSRecipe);
		}

		// Determine if a given instruction will remain scalar after vectorization,
		// for VF \p StartRangeVF. Reset \p EndRangeVF to the minimal VF where this
		// decision does not hold, if it's less than the given \p EndRangeVF.
		bool LoopVectorizationPlanner::willBeScalarized(Instruction *I,
		unsigned StartRangeVF,
		unsigned &EndRangeVF) {
		if (!isa<PHINode>(I)) {
		auto isScalarAfterVectorization = [&](unsigned VF) -> bool {
		return CM->isScalarAfterVectorization(I, VF);
		};
		if (testVFRange(isScalarAfterVectorization, StartRangeVF, EndRangeVF))
		return true;
		}

		if (isa<CallInst>(I)) {

		auto *CI = cast<CallInst>(I);
		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
		if (ID && (ID == Intrinsic::assume || ID == Intrinsic::lifetime_end ||
		ID == Intrinsic::lifetime_start))
		return true;

		// The following case may be scalarized depending on the VF.
		// The flag shows whether we use Intrinsic or a usual Call for vectorized
		// version of the instruction.
		// Is it beneficial to perform intrinsic call compared to lib call?
		auto WillBeScalarized = [&](unsigned VF) -> bool {
		bool NeedToScalarize;
		unsigned CallCost = getVectorCallCost(CI, VF, *TTI, TLI, NeedToScalarize);
		bool UseVectorIntrinsic =
		ID && getVectorIntrinsicCost(CI, VF, *TTI, TLI) <= CallCost;
		return !UseVectorIntrinsic && NeedToScalarize;
		};
		return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF);
		}

		if (isa<LoadInst>(I) || isa<StoreInst>(I)) {

		// TODO: refactor memoryInstructionMustBeScalarized() to invoke only the
		// (last) part that depends on VF.
		auto WillBeScalarized = [&](unsigned VF) -> bool {
		LoopVectorizationCostModel::InstWidening Decision =
		CM->getWideningDecision(I, VF);
		assert(Decision != LoopVectorizationCostModel::CM_Unknown &&
		"CM decision should be taken at this point");
		return Decision == LoopVectorizationCostModel::CM_Scalarize;
		};
		return testVFRange(WillBeScalarized, StartRangeVF, EndRangeVF);
		}

		static DenseSet<unsigned> VectorizableOpcodes = {
		Instruction::Br, Instruction::PHI, Instruction::UDiv,
		Instruction::SDiv, Instruction::SRem, Instruction::URem,
		Instruction::Add, Instruction::FAdd, Instruction::Sub,
		Instruction::FSub, Instruction::Mul, Instruction::FMul,
		Instruction::FDiv, Instruction::FRem, Instruction::Shl,
		Instruction::LShr, Instruction::AShr, Instruction::And,
		Instruction::Or, Instruction::Xor, Instruction::Select,
		Instruction::ICmp, Instruction::FCmp, Instruction::Store,
		Instruction::Load, Instruction::ZExt, Instruction::SExt,
		Instruction::FPToUI, Instruction::FPToSI, Instruction::FPExt,
		Instruction::PtrToInt, Instruction::IntToPtr, Instruction::SIToFP,
		Instruction::UIToFP, Instruction::Trunc, Instruction::FPTrunc,
		Instruction::BitCast, Instruction::Call};

		if (!VectorizableOpcodes.count(I->getOpcode()))
		return true;

		// Scalarize instructions found to be more profitable if scalarized. Limit
		// EndRangeVF to the last VF this is continuously true for.
		auto isProfitableToScalarize = [&](unsigned VF) -> bool {
		return CM->isProfitableToScalarize(I, VF);
		};
		return testVFRange(isProfitableToScalarize, StartRangeVF, EndRangeVF);
		}

		unsigned LoopVectorizationPlanner::buildInitialVPlans(unsigned MinVF,
		unsigned MaxVF) {
		ILV->collectTriviallyDeadInstructions(TheLoop, Legal, DeadInstructions);

		unsigned StartRangeVF = MinVF;
		unsigned EndRangeVF = MaxVF + 1;

		unsigned i = 0;
		for (; StartRangeVF < EndRangeVF; ++i) {
		std::shared_ptr<VPlan> Plan = buildInitialVPlan(StartRangeVF, EndRangeVF);

		for (unsigned TmpVF = StartRangeVF; TmpVF < EndRangeVF; TmpVF *= 2)
		VPlans[TmpVF] = Plan;

		StartRangeVF = EndRangeVF;
		EndRangeVF = MaxVF + 1;
		}

		return i;
		}

		bool LoopVectorizationPlanner::testVFRange(
		const std::function<bool(unsigned)> &Predicate, unsigned StartRangeVF,
		unsigned &EndRangeVF) {
		bool StartResult = Predicate(StartRangeVF);

		for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) {
		bool TmpResult = Predicate(TmpVF);
		if (TmpResult != StartResult) {
		EndRangeVF = TmpVF;
		break;
		}
		}

		return StartResult;
		}
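
For illustration, here is a minimal standalone sketch of how testVFRange clamps a range of power-of-two VFs, and how a caller can then split [MinVF, MaxVF] into sub-ranges in the way buildInitialVPlans does. It uses plain standard C++ rather than LLVM types; `partitionVFs` is a hypothetical helper introduced here, not part of the patch.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Mirrors testVFRange above: evaluate Predicate at StartRangeVF and shrink
// EndRangeVF to the first doubled VF at which the answer differs.
static bool testVFRange(const std::function<bool(unsigned)> &Predicate,
                        unsigned StartRangeVF, unsigned &EndRangeVF) {
  assert(StartRangeVF > 0 && "VF must be positive.");
  bool StartResult = Predicate(StartRangeVF);
  for (unsigned TmpVF = StartRangeVF * 2; TmpVF < EndRangeVF; TmpVF *= 2) {
    if (Predicate(TmpVF) != StartResult) {
      EndRangeVF = TmpVF;
      break;
    }
  }
  return StartResult;
}

// Hypothetical helper: split [MinVF, MaxVF] into half-open sub-ranges
// [Start, End) over which Predicate is constant. Each sub-range corresponds
// to one VPlan in the loop of buildInitialVPlans.
static std::vector<std::pair<unsigned, unsigned>>
partitionVFs(const std::function<bool(unsigned)> &Predicate, unsigned MinVF,
             unsigned MaxVF) {
  std::vector<std::pair<unsigned, unsigned>> Ranges;
  unsigned Start = MinVF;
  while (Start <= MaxVF) {
    unsigned End = MaxVF + 1; // Reset the upper bound for every sub-range.
    testVFRange(Predicate, Start, End);
    Ranges.push_back({Start, End});
    Start = End;
  }
  return Ranges;
}
```

With a predicate that holds only for VF >= 8, MinVF = 2 and MaxVF = 16, the sketch produces the sub-ranges [2, 8) and [8, 17): VFs 2 and 4 would share one plan, and VFs 8 and 16 another.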

		std::shared_ptr<VPlan>
		LoopVectorizationPlanner::buildInitialVPlan(unsigned StartRangeVF,
		unsigned &EndRangeVF) {

		std::shared_ptr<VPlan> SharedPlan = std::make_shared<VPlan>();
		VPlan *Plan = SharedPlan.get();
		VPlanUtilsLoopVectorizer PlanUtils(Plan);

		// Create a dummy entry VPBasicBlock to start building the VPlan.
		VPBlockBase *PreviousVPBlock = PlanUtils.createBasicBlock();
		VPBlockBase *PreEntry = PreviousVPBlock;
		Plan->setEntry(PreEntry); // only to support printing during construction.

		// Return the interleave group a given instruction is part of in the context
		// of a specific VF.
		auto getInterleaveGroup = [&](Instruction *I,
		unsigned VF) -> const InterleaveGroup * {
		if (VF < 2)
		return nullptr; // Query is illegal for VF == 1
		LoopVectorizationCostModel::InstWidening Decision =
		CM->getWideningDecision(I, VF);
		if (Decision != LoopVectorizationCostModel::CM_Interleave)
		return nullptr;
		const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(I);
		assert(IG && "Instruction to interleave not part of any group");
		return IG;
		};

		// Check if given Instruction should open an interleave group.
		auto isPrimaryIGMember =
		[&](Instruction *I) -> std::function<bool(unsigned)> {
		return [=](unsigned VF) -> bool {
		const InterleaveGroup *IG = getInterleaveGroup(I, VF);
		return IG && I == IG->getInsertPos();
		};
		};

		// Check if given Instruction is handled as part of an interleave group.
		auto isAdjunctIGMember =
		[&](Instruction *I) -> std::function<bool(unsigned)> {
		return [=](unsigned VF) -> bool {
		const InterleaveGroup *IG = getInterleaveGroup(I, VF);
		return IG && I != IG->getInsertPos();
		};
		};

		/// Determine whether \p K is a truncation based on an induction variable that
		/// can be optimized.
		auto isOptimizableIVTruncate =
		[&](Instruction *K) -> std::function<bool(unsigned)> {
		return
		[=](unsigned VF) -> bool { return CM->isOptimizableIVTruncate(K, VF); };
		};

		// Scan the body of the loop in a topological order to visit each basic block
		// after having visited its predecessor basic blocks.
		LoopBlocksDFS DFS(TheLoop);
		DFS.perform(LI);

		for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO())) {
		// Relevant instructions from basic block BB will be grouped into VPRecipe
		// ingredients and fill a new VPBasicBlock.
		VPBasicBlock *VPBB = nullptr;
		VPOneByOneRecipeBase *LastOBORecipe = nullptr;

		auto appendRecipe = [&](VPRecipeBase *Recipe) -> void {
		if (VPBB)
		PlanUtils.appendRecipeToBasicBlock(Recipe, VPBB);
		else {
		VPBB = PlanUtils.createBasicBlock(Recipe);
		PlanUtils.setSuccessor(PreviousVPBlock, VPBB);
		PreviousVPBlock = VPBB;
		}
		LastOBORecipe = dyn_cast<VPOneByOneRecipeBase>(Recipe);
		};

		for (auto I = BB->begin(), E = BB->end(); I != E; ++I) {
		Instruction *Instr = &*I;

		// Filter out irrelevant instructions.
		if (DeadInstructions.count(Instr) || isa<BranchInst>(Instr) ||
		isa<DbgInfoIntrinsic>(Instr))
		continue;

		if (isa<LoadInst>(Instr) || isa<StoreInst>(Instr)) {
		// Ignore IG's adjunct members - will be handled by the interleave group
		// recipe to be generated by the primary member of the interleave group
		// which is the insertion point and bears the cost for the entire group.
		if (testVFRange(isAdjunctIGMember(Instr), StartRangeVF, EndRangeVF))
		continue;

		if (testVFRange(isPrimaryIGMember(Instr), StartRangeVF, EndRangeVF)) {
		// Instr points to the insert position of an interleave group: first
		// load or last store.
		const InterleaveGroup *IG = Legal->getInterleavedAccessGroup(Instr);
		appendRecipe(new VPInterleaveRecipe(IG, Plan));
		continue;
		}
		}

		if (Legal->isScalarWithPredication(Instr)) {
		// Instructions marked for predication are scalarized and placed under
		// an if-then construct to prevent side-effects.
		DEBUG(dbgs() << "LV: Scalarizing and predicating:" << *Instr << '\n');

		// Build the triangular if-then region. Start with VPBB holding Instr.
		BasicBlock::iterator J = I;
		VPRecipeBase *Recipe = new VPScalarizeOneByOneRecipe(I, ++J, Plan);
		VPBB = PlanUtils.createBasicBlock(Recipe);

		// Build the entry and exit VPBB's of the triangle.
		VPRegionBlock *Region = PlanUtils.createRegion(true);
		VPExtractMaskBitRecipe *R = new VPExtractMaskBitRecipe(&*BB);
		VPBasicBlock *Entry = PlanUtils.createBasicBlock(R);
		Recipe = new VPMergeScalarizeBranchRecipe(Instr);
		VPBasicBlock *Exit = PlanUtils.createBasicBlock(Recipe);
		// Note: first set Entry as region entry and then connect successors
		// starting from it in order, to propagate the "parent" of each
		// VPBasicBlock.
		PlanUtils.setRegionEntry(Region, Entry);
		PlanUtils.setRegionExit(Region, Exit);
		PlanUtils.setTwoSuccessors(Entry, R, VPBB, Exit);
		PlanUtils.setSuccessor(VPBB, Exit);
		PlanUtils.setSuccessor(PreviousVPBlock, Region);
		PreviousVPBlock = Region;

		// Next instructions should start forming a VPBasicBlock of their own.
		VPBB = nullptr;
		LastOBORecipe = nullptr;

		// Record predicated instructions for later optimizations.
		PredicatedInstructions.insert(&*I);

		continue;
		}

		// Check if this is an integer induction. If so, build the recipes that
		// produce its scalar and vector values.

		if (PHINode *Phi = dyn_cast<PHINode>(Instr)) {
		InductionDescriptor II = Legal->getInductionVars()->lookup(Phi);
		if (II.getKind() == InductionDescriptor::IK_IntInduction) {
		auto Recipes = widenIntInduction(Plan, StartRangeVF, EndRangeVF, Phi);
		appendRecipe(Recipes.first);
		if (Recipes.second)
		appendRecipe(Recipes.second);
		continue;
		}
		}

		// Optimize the special case where the source is a constant integer
		// induction variable. Notice that we can only optimize the 'trunc' case
		// because (a) FP conversions lose precision, (b) sext/zext may wrap, and
		// (c) other casts depend on pointer size.
		if (isa<TruncInst>(Instr) && testVFRange(isOptimizableIVTruncate(Instr),
		StartRangeVF, EndRangeVF)) {
		auto *InductionPhi = cast<PHINode>(Instr->getOperand(0));
		auto Recipes = widenIntInduction(Plan, StartRangeVF, EndRangeVF,
		InductionPhi, cast<TruncInst>(Instr));
		appendRecipe(Recipes.first);
		if (Recipes.second)
		appendRecipe(Recipes.second);
		continue;
		}

		// Check if instruction is to be replicated.
		bool Scalarized = willBeScalarized(Instr, StartRangeVF, EndRangeVF);
		DEBUG(if (Scalarized) dbgs() << "LV: Scalarizing:" << *Instr << "\n");

		// Default: vectorize/scalarize this instruction using a one-by-one
		// recipe. We optimize the common case where consecutive instructions
		// can be represented by a single OBO recipe.
		if (!LastOBORecipe || LastOBORecipe->isScalarizing() != Scalarized ||
		!PlanUtils.appendInstruction(LastOBORecipe, Instr)) {
		auto J = I;
		appendRecipe(PlanUtils.createOneByOneRecipe(I, ++J, Plan, Scalarized));
		}
		}
		}
		// PreviousVPBlock now holds the exit block of Plan.
		// Set entry block of Plan to the successor of PreEntry, and discard PreEntry.
		assert(PreEntry->getSuccessors().size() == 1 && "Plan has no single entry.");
		VPBlockBase *Entry = PreEntry->getSuccessors().front();
		PlanUtils.disconnectBlocks(PreEntry, Entry);
		Plan->setEntry(Entry);
		delete PreEntry;

		// FOR STRESS TESTING, uncomment the following:
		// EndRangeVF = StartRangeVF * 2;

		return SharedPlan;
		}

		void LoopVectorizationPlanner::sinkScalarOperands(Instruction *PredInst,
		VPlan *Plan) {
		VPlanUtilsLoopVectorizer PlanUtils(Plan);

		// The VPBasicBlock containing the predicated instruction.
		VPBasicBlock *PredBB = Plan->getBasicBlock(PredInst);

		// Initialize a worklist with the operands of the predicated instruction.
		SetVector<Value *> Worklist(PredInst->op_begin(), PredInst->op_end());

		// Holds instructions that we need to analyze again. An instruction may be
		// reanalyzed if we don't yet know if we can sink it or not.
		SmallVector<Instruction *, 8> InstsToReanalyze;

		// Iteratively sink the scalarized operands of the predicated instruction
		// into the block we created for it. When an instruction is sunk, its
		// operands are then added to the worklist. The algorithm ends after one pass
		// through the worklist doesn't sink a single instruction.
		bool Changed;
		do {

		// Add the instructions that need to be reanalyzed to the worklist, and
		// reset the changed indicator.
		Worklist.insert(InstsToReanalyze.begin(), InstsToReanalyze.end());
		InstsToReanalyze.clear();
		Changed = false;

		while (!Worklist.empty()) {
		auto *I = dyn_cast<Instruction>(Worklist.pop_back_val());
		if (!I)
		continue;

		// We do not sink other predicated instructions.
		if (Legal->isScalarWithPredication(I))
		continue;

		VPRecipeBase *Recipe = Plan->getRecipe(I);

		// We can't sink live-ins.
		if (!Recipe)
		continue;
		VPBasicBlock *BasicBlock = Recipe->getParent();
		assert(BasicBlock && "Recipe not in any basic block");

		// We can't sink an instruction that isn't being scalarized.
		if (!isa<VPScalarizeOneByOneRecipe>(Recipe) &&
		!isa<VPBuildScalarStepsRecipe>(Recipe))
		continue;

		// We can't sink an instruction if it is already in the predicated block,
		// is not in the VPlan, or may have side effects.
		if (BasicBlock == PredBB || I->mayHaveSideEffects())
		continue;

		// Handle phi nodes last to make sure that any user they may have has sunk
		// by now. This is relevant for induction variables that feed uniform GEPs
		// which may or may not sink.
		if (isa<PHINode>(I)) {
		auto IsNotAPhi = [&](Value *V) -> bool { return !isa<PHINode>(V); };
		if (any_of(Worklist, IsNotAPhi) ||
		any_of(InstsToReanalyze, IsNotAPhi)) {
		InstsToReanalyze.push_back(I);
		continue;
		}
		}

		bool HasVectorizedUses = false;
		bool AllScalarizedUsesInPredicatedBlock = true;
		unsigned MinLaneToSink = 0;
		for (auto &U : I->uses()) {
		auto *UI = cast<Instruction>(U.getUser());
		VPRecipeBase *UserRecipe = Plan->getRecipe(UI);
		// Generated scalarized instructions don't serve users outside of the
		// VPlan, so we can safely ignore users that have no recipe.
		if (!UserRecipe)
		continue;

		// GEPs used as the uniform address of a wide memory operation must not
		// sink lane zero.
		if (isa<VPInterleaveRecipe>(UserRecipe)) {
		assert(isa<GetElementPtrInst>(I) &&
		"Non-GEP used in interleave group");
		MinLaneToSink = std::max(MinLaneToSink, 1u);
		continue;
		}

		// Wide memory operations do not use any of the scalarized GEPs but
		// generate their own GEPs.
		if (isa<VPVectorizeOneByOneRecipe>(UserRecipe) &&
		isa<GetElementPtrInst>(I) &&
		(isa<LoadInst>(UI) || isa<StoreInst>(UI)) &&
		Legal->isConsecutivePtr(I)) {
		continue;
		}

		if (!(isa<VPScalarizeOneByOneRecipe>(UserRecipe) ||
		isa<VPBuildScalarStepsRecipe>(UserRecipe))) {
		// All of I's lanes are used by an instruction we can't sink.
		HasVectorizedUses = true;
		break;
		}

		// Induction variables feeding consecutive GEPs can be indirectly used
		// by vectorized load/stores which generate their own GEP rather than
		// reuse the scalarized one (unlike load/store in interleave groups).
		// In such a case, we can sink all lanes but lane zero. Note that we
		// can do this whether or not the GEP is used within the predicated
		// block (i.e. whether it will sink its own lanes 1..VF-1).
		if (isa<GetElementPtrInst>(UI) && Legal->isConsecutivePtr(UI) &&
		isa<VPBuildScalarStepsRecipe>(Recipe)) {
		auto IsVectorizedMemoryOperation = [&](User *U) -> bool {
		if (!(isa<LoadInst>(U) \|\| isa<StoreInst>(U)))
		return false;
		VPRecipeBase *Recipe = Plan->getRecipe(cast<Instruction>(U));
		return Recipe && isa<VPVectorizeOneByOneRecipe>(Recipe);
		};

		if (any_of(UI->users(), IsVectorizedMemoryOperation)) {
		MinLaneToSink = std::max(MinLaneToSink, 1u);
		continue;
		}
		}

		if (UserRecipe->getParent() != PredBB) {
		// Don't make a decision until all scalarized users have sunk.
		AllScalarizedUsesInPredicatedBlock = false;
		continue;
		}

		// Ok to sink w.r.t this use, but no more lanes than what the user
		// itself has sunk.
		VPLaneRange DesignatedLanes;
		if (auto *BSS = dyn_cast<VPBuildScalarStepsRecipe>(UserRecipe))
		DesignatedLanes = BSS->getDesignatedLanes();
		else
		DesignatedLanes =
		cast<VPScalarizeOneByOneRecipe>(UserRecipe)->getDesignatedLanes();
		VPLaneRange SinkableLanes =
		VPLaneRange::intersect(VPLaneRange(MinLaneToSink), DesignatedLanes);
		MinLaneToSink = SinkableLanes.getMinLane();
		}

		if (HasVectorizedUses)
		continue; // This instruction cannot be sunk.

		// It's legal to sink the instruction if all its uses occur in the
		// predicated block. Otherwise, there's nothing to do yet, and we may
		// need to reanalyze the instruction.
		if (!AllScalarizedUsesInPredicatedBlock) {
		InstsToReanalyze.push_back(I);
		continue;
		}

		// Move the instruction to the beginning of the predicated block, and add
		// its operands to the worklist (except for phi nodes).
		PlanUtils.sinkInstruction(I, PredBB, MinLaneToSink);
		if (!isa<PHINode>(I))
		Worklist.insert(I->op_begin(), I->op_end());

		// The sinking may have enabled other instructions to be sunk, so we will
		// need to iterate.
		Changed = true;
		}
		} while (Changed);
		}

		void LoopVectorizationPlanner::assignScalarVectorConversions(
		Instruction *PredInst, VPlan *Plan) {

		// NFC: Let Def's recipe generate the vector version of Def, but only
		// if all of Def's users are vectorized. This is the equivalent to the
		// previous predicateInstructions by which an insert-element got hoisted
		// into the matching predicated basic block if it is the only user of
		// the predicated instruction.

		if (PredInst->use_empty())
		return;

		for (User *U : PredInst->users()) {
		Instruction *UserInst = dyn_cast<Instruction>(U);
		if (!UserInst)
		continue;

		VPRecipeBase *UserRecipe = Plan->getRecipe(UserInst);
		if (!UserRecipe) // User is not part of the plan.
		return;

		if (isa<VPVectorizeOneByOneRecipe>(UserRecipe))
		continue;

		// Found a user that will not be using the vector form of the predicated
		// instruction. The insert-element is not going to be the only user, so
		// do not hoist it.
		return;
		}

		Plan->getRecipe(PredInst)->addAlsoPackOrUnpack(PredInst);
		}

		bool LoopVectorizationPlanner::shouldScalarizeInstruction(Instruction *I,
		unsigned VF) const {
		return CM->isScalarAfterVectorization(I, VF) ||
		CM->isProfitableToScalarize(I, VF);
		}

		void LoopVectorizationPlanner::optimizePredicatedInstructions() {
		VPlan *PrevPlan = nullptr;
		for (auto &It : VPlans) {
		VPlan *Plan = It.second.get();
		if (Plan == PrevPlan)
		continue;
		for (auto *PredInst : PredicatedInstructions) {
		sinkScalarOperands(PredInst, Plan);
		assignScalarVectorConversions(PredInst, Plan);
		}
		PrevPlan = Plan;
		}
		}

		void LoopVectorizationPlanner::setBestPlan(unsigned VF, unsigned UF) {
		DEBUG(dbgs() << "Setting best plan to VF=" << VF << ", UF=" << UF << '\n');
		BestVF = VF;
		BestUF = UF;

		assert(VPlans.count(VF) && "Best VF does not have a VPlan.");
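		// Delete all other VPlans. Collect the keys first, because erasing from
		// VPlans while iterating over it would invalidate the iterator.
		SmallVector<unsigned, 4> VFsToErase;
		for (auto &Entry : VPlans)
		if (Entry.first != VF)
		VFsToErase.push_back(Entry.first);
		for (unsigned OtherVF : VFsToErase)
		VPlans.erase(OtherVF);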
		}

		void LoopVectorizationPlanner::executeBestPlan(InnerLoopVectorizer &LB) {
		ILV = &LB;

		// Perform the actual loop widening (vectorization).
		// 1. Create a new empty loop. Unlink the old loop and connect the new one.
		ILV->createEmptyLoop();

		// 2. Widen each instruction in the old loop to a new one in the new loop.

		VPTransformState State{BestVF, BestUF, LI, ILV->DT,
		ILV->Builder, ILV, Legal, CM};
		State.CFG.PrevBB = ILV->LoopVectorPreHeader;

		VPlan *Plan = getVPlanForVF(BestVF);

		Plan->vectorize(&State);

		// 3. Take care of phi's to fix: reduction, 1st-order-recurrence, loop-closed.
		ILV->vectorizeLoop();
		}

		void VPVectorizeOneByOneRecipe::transformIRInstruction(
		Instruction *I, VPTransformState &State) {
		assert(I && "No instruction to vectorize.");
		State.ILV->vectorizeInstruction(*I);
		if (willAlsoPackOrUnpack(I)) { // Unpack instruction
		for (unsigned Part = 0; Part < State.UF; ++Part)
		for (unsigned Lane = 0; Lane < State.VF; ++Lane)
		State.ILV->getScalarValue(I, Part, Lane);
		}
		}

		void VPScalarizeOneByOneRecipe::transformIRInstruction(
		Instruction *I, VPTransformState &State) {
		assert(I && "No instruction to vectorize.");
		// By default generate scalar instances for all VF lanes of all UF parts.
		// If the instruction is uniform, generate only the first lane for each
		// of the UF parts.
		bool IsUniform = State.Cost->isUniformAfterVectorization(I, State.VF);
		unsigned MinLane = 0;
		unsigned MaxLane = IsUniform ? 0 : State.VF - 1;
		unsigned MinPart = 0;
		unsigned MaxPart = State.UF - 1;

		if (State.Instance) {
		// Asked to create an instance for a specific lane and a specific part.
		assert(!IsUniform &&
		"Uniform instruction vectorized for a specific instance.");
		MinLane = State.Instance->Lane;
		MaxLane = MinLane;
		MinPart = State.Instance->Part;
		MaxPart = MinPart;
		}

		// Intersect requested lanes with the designated lanes for this recipe.
		VPLaneRange ActiveLanes(MinLane, MaxLane);
		VPLaneRange EffectiveLanes =
		VPLaneRange::intersect(ActiveLanes, DesignatedLanes);
		if (EffectiveLanes.isEmpty())
		return; // None of the requested lanes is designated for this recipe.

		// Generate relevant lanes.
		State.ILV->scalarizeInstruction(I, MinPart, MaxPart,
		EffectiveLanes.getMinLane(),
		EffectiveLanes.getMaxLane());
		if (willAlsoPackOrUnpack(I)) {
		if (State.Instance)
		// Insert scalar instance packing it into a vector.
		State.ILV->constructVectorValue(I, MinPart, MinLane);
		else
		// Broadcast or group together all instances into a vector.
		State.ILV->getVectorValue(I);
		}
		}

		void VPWidenIntInductionRecipe::vectorize(VPTransformState &State) {
		assert(State.Instance == nullptr && "Int induction being replicated");
		auto BuildScalarInfo = State.ILV->widenIntInduction(NeedsScalarIV, IV, Trunc);
		ScalarIV = BuildScalarInfo.first;
		Step = BuildScalarInfo.second;
		}

		void VPWidenIntInductionRecipe::print(raw_ostream &O) const {
		O << "Widen int induction";
		if (NeedsScalarIV)
		O << " (needs scalars)";
		O << ":\n";
		O << *IV;
		if (Trunc)
		O << "\n" << *Trunc;
		}

		void VPBuildScalarStepsRecipe::vectorize(VPTransformState &State) {
		// By default generate scalar instances for all VF lanes of all UF parts.
		// If the instruction is uniform, generate only the first lane for each
		// of the UF parts.
		bool IsUniform = State.Cost->isUniformAfterVectorization(EntryVal, State.VF);
		unsigned MinLane = 0;
		unsigned MaxLane = IsUniform ? 0 : State.VF - 1;
		unsigned MinPart = 0;
		unsigned MaxPart = State.UF - 1;

		if (State.Instance) {
		// Asked to create an instance for a specific lane and a specific part.
		MinLane = State.Instance->Lane;
		MaxLane = MinLane;
		MinPart = State.Instance->Part;
		MaxPart = MinPart;
		}

		// Intersect requested lanes with the designated lanes for this recipe.
		VPLaneRange ActiveLanes(MinLane, MaxLane);
		VPLaneRange EffectiveLanes =
		VPLaneRange::intersect(ActiveLanes, DesignatedLanes);
		if (EffectiveLanes.isEmpty())
		return; // None of the requested lanes is designated for this recipe.

		// Generate relevant lanes.
		State.ILV->buildScalarSteps(WII->getScalarIV(), WII->getStep(), EntryVal,
		MinPart, MaxPart, EffectiveLanes.getMinLane(),
		EffectiveLanes.getMaxLane());
		}

		void VPBuildScalarStepsRecipe::print(raw_ostream &O) const {
		O << "Build scalar steps";
		if (!DesignatedLanes.isFull()) {
		O << " ";
		DesignatedLanes.print(O);
		}
		O << ":\n" << *EntryVal;
		}

		void VPInterleaveRecipe::vectorize(VPTransformState &State) {
		assert(State.Instance == nullptr && "Interleave group being replicated");
		State.ILV->vectorizeInterleaveGroup(IG->getInsertPos());
		}

		void VPInterleaveRecipe::print(raw_ostream &O) const {
		O << "InterleaveGroup factor:" << IG->getFactor() << '\n';
		for (unsigned i = 0; i < IG->getFactor(); ++i)
		if (Instruction *I = IG->getMember(i)) {
		if (I == IG->getInsertPos())
		O << i << "=]" << *I;
		else
		O << i << " ]" << *I;
		if (willAlsoPackOrUnpack(I))
		O << " (V->S)";
		}
		}

		void VPExtractMaskBitRecipe::vectorize(VPTransformState &State) {
		assert(State.Instance && "Extract Mask Bit works only on single instance.");

		unsigned Part = State.Instance->Part;
		unsigned Lane = State.Instance->Lane;

		typedef SmallVector<Value *, 2> VectorParts;

		VectorParts Cond = State.ILV->createBlockInMask(MaskedBasicBlock);

		ConditionBit = State.Builder.CreateExtractElement(
		Cond[Part], State.ILV->Builder.getInt32(Lane));
		ConditionBit =
		State.Builder.CreateICmp(ICmpInst::ICMP_EQ, ConditionBit,
		ConstantInt::get(ConditionBit->getType(), 1));
		DEBUG(dbgs() << "\nLV: vectorizing ConditionBit recipe"
		<< MaskedBasicBlock->getName());
		}

		void VPMergeScalarizeBranchRecipe::vectorize(VPTransformState &State) {
		assert(State.Instance &&
		"Merge Scalarize Branch works only on single instance.");

		Type *LiveOutType = LiveOut->getType();
		unsigned Part = State.Instance->Part;
		unsigned Lane = State.Instance->Lane;

		// Rename the predicated and merged basic blocks for backwards compatibility.
		Instruction *ScalarLiveOut =
		cast<Instruction>(State.ILV->getScalarValue(LiveOut, Part, Lane));
		BasicBlock *PredicatedBB = ScalarLiveOut->getParent();
		BasicBlock *PredicatingBB = PredicatedBB->getSinglePredecessor();
		assert(PredicatingBB && "Predicated block has no single predecessor");
		PredicatedBB->setName(Twine("pred.") + LiveOut->getOpcodeName() + ".if");
		PredicatedBB->getSingleSuccessor()->setName(
		Twine("pred.") + LiveOut->getOpcodeName() + ".continue");
		if (LiveOutType->isVoidTy())
		return;

		// Generate a phi node for the scalarized instruction.
		PHINode *Phi = State.ILV->Builder.CreatePHI(LiveOutType, 2);
		Phi->addIncoming(UndefValue::get(ScalarLiveOut->getType()), PredicatingBB);
		Phi->addIncoming(ScalarLiveOut, PredicatedBB);
		State.ILV->setScalarValue(LiveOut, Part, Lane, Phi);

		// If this instruction also generated the complementing form then we also need
		// to create a phi for the vector value of this part & lane and update the
		// vector values cache accordingly.
		Value *VectorValue = State.ILV->getVectorValue(LiveOut, Part);
		if (!VectorValue)
		return;

		InsertElementInst *IEI = cast<InsertElementInst>(VectorValue);
		PHINode *VPhi = State.ILV->Builder.CreatePHI(IEI->getType(), 2);
		VPhi->addIncoming(IEI->getOperand(0), PredicatingBB); // the unmodified vector
		VPhi->addIncoming(IEI, PredicatedBB); // new vector with the inserted element
		State.ILV->setVectorValue(LiveOut, Part, VPhi);
		}

		/// Creates a new VPScalarizeOneByOneRecipe or VPVectorizeOneByOneRecipe based
		/// on the isScalarizing parameter respectively.
		VPOneByOneRecipeBase *VPlanUtilsLoopVectorizer::createOneByOneRecipe(
		const BasicBlock::iterator B, const BasicBlock::iterator E, VPlan *Plan,
		bool isScalarizing) {
		if (isScalarizing)
		return new VPScalarizeOneByOneRecipe(B, E, Plan);
		return new VPVectorizeOneByOneRecipe(B, E, Plan);
		}

		bool VPlanUtilsLoopVectorizer::appendInstruction(VPOneByOneRecipeBase *Recipe,
		Instruction *Instr) {
		if (Recipe->End != Instr->getIterator())
		return false;

		Recipe->End++;
		Plan->setInst2Recipe(Instr, Recipe);
		return true;
		}

		/// Given a \p Split instruction assumed to reside in a VPOneByOneRecipeBase
		/// -- where VPOneByOneRecipeBase is either VPScalarizeOneByOneRecipe or
		/// VPVectorizeOneByOneRecipe -- update that recipe to start from \p Split
		/// and move all preceding instructions to a new VPOneByOneRecipeBase.
		/// \return the newly created VPOneByOneRecipeBase, which is added to the
		/// VPBasicBlock of the original recipe, right before it.
		VPOneByOneRecipeBase *
		VPlanUtilsLoopVectorizer::splitRecipe(Instruction *Split) {
		VPOneByOneRecipeBase *Recipe =
		cast<VPOneByOneRecipeBase>(Plan->getRecipe(Split));
		auto SplitPos = Split->getIterator();

		assert(SplitPos != Recipe->Begin &&
		"Nothing to split before first instruction.");
		assert(SplitPos != Recipe->End && "Nothing to split after last instruction.");

		// Build a new recipe for all instructions up to the given Split.
		VPBasicBlock *BasicBlock = Recipe->getParent();
		VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe(
		Recipe->Begin, SplitPos, Plan, Recipe->isScalarizing());

		// Insert the new recipe before the split point.
		BasicBlock->addRecipe(NewRecipe, Recipe);

		// Update the old recipe to start from the given split point.
		Recipe->Begin = SplitPos;

		return NewRecipe;
		}

		/// Insert a given instruction \p Inst into a VPBasicBlock before another
		/// given instruction \p Before. Assumes \p Inst does not belong to any
		/// recipe, and that \p Before belongs to a VPOneByOneRecipeBase.
		void VPlanUtilsLoopVectorizer::insertBefore(Instruction *Inst,
		Instruction *Before,
		unsigned MinLane) {
		assert(!Plan->getRecipe(Inst) && "Instruction already in recipe.");
		VPRecipeBase *Recipe = Plan->getRecipe(Before);
		assert(Recipe && "Insertion point not in any recipe.");
		VPOneByOneRecipeBase *OBORecipe = cast<VPOneByOneRecipeBase>(Recipe);
		bool PartialInsertion = MinLane > 0;
		bool IndicesMatch = true;

		if (PartialInsertion) {
		VPScalarizeOneByOneRecipe *SOBO =
		dyn_cast<VPScalarizeOneByOneRecipe>(Recipe);
		if (!SOBO || SOBO->DesignatedLanes.getMinLane() != MinLane)
		IndicesMatch = false;
		}

		// Can we insert \p Inst by augmenting the existing recipe of \p Before?
		// Only if \p Inst is immediately followed by \p Before:
		Instruction *NextInst = Inst->getNextNode();
		if (NextInst == Before && IndicesMatch) {
		// This must imply that \p Before is the first ingredient in its recipe.
		assert(Before == &*OBORecipe->Begin &&
		"Trying to insert but Before is not first in its recipe.");
		// Yes, extend the range to include the previous instruction.
		OBORecipe->Begin--;
		Plan->setInst2Recipe(Inst, Recipe);
		return;
		}
		// Note that it is not possible to augment the end of Recipe by having
		// Inst == &*Recipe->End, because to do that Before would need to be
		// Recipe->End, which means that Before does not belong to this Recipe.

		// No, the instruction needs to have its own recipe.

		// If we're not inserting right before the Recipe's first instruction,
		// split the Recipe to allow placing the new recipe right before the
		// given insertion point. This new recipe is also added to BasicBlock.
		if (Before != &*OBORecipe->Begin)
		splitRecipe(Before);

		// TODO: VPlanUtils::addOneByOneToBasicBlock()
		auto InstBegin = Inst->getIterator();
		auto InstEnd = InstBegin;
		VPBasicBlock *BasicBlock = Recipe->getParent();
		VPOneByOneRecipeBase *NewRecipe = nullptr;
		if (PartialInsertion) {
		NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true);
		cast<VPScalarizeOneByOneRecipe>(NewRecipe)->DesignatedLanes =
		VPLaneRange(MinLane);
		} else
		NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan,
		OBORecipe->isScalarizing());
		Plan->setInst2Recipe(Inst, NewRecipe);
		BasicBlock->addRecipe(NewRecipe, OBORecipe);
		}

		/// Remove a given instruction \p Inst from its recipe, if exists. We only
		/// support removal from VPOneByOneRecipeBase at this time.
		void VPlanUtilsLoopVectorizer::removeInstruction(Instruction *Inst,
		unsigned FromLane) {
		VPRecipeBase *Recipe = Plan->getRecipe(Inst);
		if (!Recipe)
		return; // Nothing to do, no recipe to remove the instruction from.
		VPOneByOneRecipeBase *OBORecipe = cast<VPOneByOneRecipeBase>(Recipe);
		// First check if OBORecipe can be shortened to exclude Inst.
		bool InstructionWasLast = false;
		if (&*OBORecipe->Begin == Inst)
		OBORecipe->Begin++;
		else if (&*OBORecipe->End == Inst) {
		OBORecipe->End--;
		InstructionWasLast = true;
		}
		// Otherwise split OBORecipe at Inst.
		else {
		splitRecipe(Inst);
		OBORecipe->Begin++;
		}
		if (FromLane > 0) {
		// This is a partial removal. Leave lanes 0..FromLane-1 in the original
		// basic block in a new, unregistered recipe.
		VPOneByOneRecipeBase *NewRecipe = createOneByOneRecipe(
		Inst->getIterator(), ++(Inst->getIterator()), Plan, true);
		cast<VPScalarizeOneByOneRecipe>(NewRecipe)->DesignatedLanes =
		VPLaneRange(0, FromLane - 1);
		Recipe->getParent()->addRecipe(NewRecipe,
		InstructionWasLast ? nullptr : Recipe);
		}
		Plan->resetInst2Recipe(Inst);
		}
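
The shrink-or-split logic of removeInstruction can be illustrated on a plain container. This is a hedged sketch with hypothetical names (`Range`, `removeFromRange`), not the patch's code. It assumes `End` is exclusive, as the `It != End` loop in `VPOneByOneRecipeBase::vectorize` suggests, so the last element of a range is the one just before `End`.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <vector>

using Iter = std::list<int>::iterator;

// Hypothetical half-open range [Begin, End) over a list it does not own.
struct Range {
  Iter Begin;
  Iter End;
};

// Exclude *Pos from R and return the ranges that remain, in order:
// shrink at the front, shrink at the back, or split around Pos.
inline std::vector<Range> removeFromRange(Range R, Iter Pos) {
  assert(R.Begin != R.End && "Cannot remove from an empty range.");
  std::vector<Range> Out;
  if (Pos == R.Begin) {
    R.Begin = std::next(Pos); // Drop the first element.
    if (R.Begin != R.End)
      Out.push_back(R);
  } else if (Pos == std::prev(R.End)) {
    R.End = Pos; // Drop the last element.
    Out.push_back(R);
  } else {
    Out.push_back(Range{R.Begin, Pos});          // Elements before Pos.
    Out.push_back(Range{std::next(Pos), R.End}); // Elements after Pos.
  }
  return Out;
}
```

Removing a middle element yields two ranges, which corresponds to the `splitRecipe(Inst)` followed by `OBORecipe->Begin++` path above.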

		// Given an instruction \p Inst and a VPBasicBlock \p To, remove \p Inst from
		// its current residence and add it as the first instruction of \p To.
		// We currently support removal from and insertion to
		// VPOneByOneRecipeBase's only.
		// TODO: this is an over-simplistic implementation that assumes we can make
		// the new instruction the first instruction of the first recipe in the
		// basic block. This is true for the sinkScalarOperands use-case, but for a
		// general basic block a getFirstInsertionPt() logic is required.
		void VPlanUtilsLoopVectorizer::sinkInstruction(Instruction *Inst,
		VPBasicBlock *To,
		unsigned MinLane) {
		RecipeListTy *Recipes = getRecipes(To);

		VPRecipeBase *FromRecipe = Plan->getRecipe(Inst);
		if (auto *FromBSSRecipe = dyn_cast<VPBuildScalarStepsRecipe>(FromRecipe)) {
		VPBuildScalarStepsRecipe *SunkRecipe = nullptr;
		if (MinLane == 0) {
		// Sink the entire recipe.
		VPBasicBlock *From = FromRecipe->getParent();
		assert(From && "Recipe to sink not assigned to any basic block");
		From->removeRecipe(FromBSSRecipe);
		SunkRecipe = FromBSSRecipe;
		} else {
		// Partially sink lanes MinLane..VF-1
		SunkRecipe = new VPBuildScalarStepsRecipe(FromBSSRecipe->WII,
		FromBSSRecipe->EntryVal, Plan);
		SunkRecipe->DesignatedLanes = VPLaneRange(MinLane);
		FromBSSRecipe->DesignatedLanes = VPLaneRange(0, MinLane - 1);
		}
		To->addRecipe(SunkRecipe, &*Recipes->begin());
		return;
		}

		assert(Plan->getRecipe(Inst) &&
		isa<VPOneByOneRecipeBase>(Plan->getRecipe(Inst)) &&
		"Unsupported recipe to sink instructions from");

		// Remove instruction from its source recipe.
		removeInstruction(Inst, MinLane);

		auto ToRecipe = dyn_cast<VPOneByOneRecipeBase>(&*Recipes->begin());
		if (ToRecipe) {
		// Try to sink the instruction into an existing recipe, default to a new
		// recipe.
		assert(ToRecipe->isScalarizing() &&
		"Cannot sink into a non-scalarizing recipe.");

		// Add it before the first ingredient of To.
		insertBefore(Inst, &*ToRecipe->Begin, MinLane);
		} else {
		// Instruction has to go into its own one-by-one recipe.
		auto InstBegin = Inst->getIterator();
		auto InstEnd = InstBegin;
		auto *NewRecipe = createOneByOneRecipe(InstBegin, ++InstEnd, Plan, true);
		if (MinLane > 0) // Partial sink
		cast<VPScalarizeOneByOneRecipe>(NewRecipe)->DesignatedLanes =
		VPLaneRange(MinLane);
		To->addRecipe(NewRecipe, &*Recipes->begin());
		}
		}
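
A partial sink divides the lanes of one instruction between two recipes. The sketch below shows only that arithmetic. `LaneRange` and `splitLanes` are hypothetical names, not the patch's `VPLaneRange`, and the sketch assumes lanes are numbered 0..VF-1 with inclusive bounds, following the "lanes MinLane..VF-1" comment in sinkInstruction.

```cpp
#include <cassert>
#include <utility>

// Hypothetical inclusive lane interval [Min, Max].
struct LaneRange {
  unsigned Min;
  unsigned Max;
};

// Split lanes 0..VF-1 at MinLane. The first range stays in the original
// block; the second range is the part that gets sunk.
inline std::pair<LaneRange, LaneRange> splitLanes(unsigned VF,
                                                  unsigned MinLane) {
  assert(MinLane > 0 && MinLane < VF && "Partial sink needs 0 < MinLane < VF.");
  return std::make_pair(LaneRange{0, MinLane - 1}, LaneRange{MinLane, VF - 1});
}
```

For example, with VF = 4 and MinLane = 1, lane 0 stays behind and lanes 1..3 are sunk.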

		void InnerLoopUnroller::vectorizeInstruction(Instruction &I) {
		switch (I.getOpcode()) {
		case Instruction::Br:
		// Nothing to do for branches since we already took care of the
		// loop control flow instructions.
		break;

		case Instruction::GetElementPtr:
		scalarizeInstruction(&I, false);
		break;

		case Instruction::UDiv:
		case Instruction::SDiv:
		case Instruction::SRem:
		case Instruction::URem:
		// Scalarize with predication if this instruction may divide by zero and
		// block execution is conditional, otherwise fallthrough.
		if (Legal->isScalarWithPredication(&I)) {
		scalarizeInstruction(&I, true);
		break;
		}

		case Instruction::Trunc: {
		auto *CI = dyn_cast<CastInst>(&I);
		// Optimize the special case where the source is a constant integer
		// induction variable. Notice that we can only optimize the 'trunc' case
		// because (a) FP conversions lose precision, (b) sext/zext may wrap, and
		// (c) other casts depend on pointer size.
		if (Cost->isOptimizableIVTruncate(CI, VF)) {
		setDebugLocFromInst(Builder, CI);
		widenIntInduction(true, cast<PHINode>(CI->getOperand(0)),
		cast<TruncInst>(CI));
		break;
		}
		}

		default:
		InnerLoopVectorizer::vectorizeInstruction(I);
		}
		}

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateInstr) {		bool IfPredicateInstr) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);

▲ Show 20 Lines • Show All 227 Lines • ▼ Show 20 Lines	if (Hints.isPotentiallyUnsafe() &&
DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n");		DEBUG(dbgs() << "LV: Potentially unsafe FP op prevents vectorization.\n");
ORE->emit(		ORE->emit(
createMissedAnalysis(Hints.vectorizeAnalysisPassName(), "UnsafeFP", L)		createMissedAnalysis(Hints.vectorizeAnalysisPassName(), "UnsafeFP", L)
<< "loop not vectorized due to unsafe FP support.");		<< "loop not vectorized due to unsafe FP support.");
emitMissedWarning(F, L, Hints, ORE);		emitMissedWarning(F, L, Hints, ORE);
return false;		return false;
}		}

// Select the optimal vectorization factor.		if (!CM.canVectorize(OptForSize))
const LoopVectorizationCostModel::VectorizationFactor VF =		return false;
CM.selectVectorizationFactor(OptForSize);
		// Early prune excessive VF's
		unsigned MaxVF = CM.computeMaxVectorizationFactor(OptForSize);

		// If OptForSize, MaxVF is the only VF we consider. Abort if it needs a tail.
		if (OptForSize && CM.requiresTail(MaxVF))
		return false;

		// Use the planner.
		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, &CM);

		// Get user vectorization factor.
		unsigned UserVF = Hints.getWidth();

		// Select the vectorization factor.
		LoopVectorizationCostModel::VectorizationFactor VF =
		LVP.plan(OptForSize, UserVF, MaxVF);
		bool VectorizeLoop = (VF.Width > 1);

		std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;

		if (!UserVF && !VectorizeLoop) {
		DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
		VecDiagMsg = std::make_pair(
		"VectorizationNotBeneficial",
		"the cost-model indicates that vectorization is not beneficial");
		}

// Select the interleave count.		// Select the interleave count.
unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);		unsigned IC = CM.selectInterleaveCount(OptForSize, VF.Width, VF.Cost);

// Get user interleave count.		// Get user interleave count.
unsigned UserIC = Hints.getInterleave();		unsigned UserIC = Hints.getInterleave();

// Identify the diagnostic messages that should be produced.		// Identify the diagnostic messages that should be produced.
std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
bool VectorizeLoop = true, InterleaveLoop = true;
if (Requirements.doesNotMeet(F, L, Hints)) {		if (Requirements.doesNotMeet(F, L, Hints)) {
DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization "		DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization "
"requirements.\n");		"requirements.\n");
emitMissedWarning(F, L, Hints, ORE);		emitMissedWarning(F, L, Hints, ORE);
return false;		return false;
}		}

if (VF.Width == 1) {		bool InterleaveLoop = true;
DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
VecDiagMsg = std::make_pair(
"VectorizationNotBeneficial",
"the cost-model indicates that vectorization is not beneficial");
VectorizeLoop = false;
}

if (IC == 1 && UserIC <= 1) {		if (IC == 1 && UserIC <= 1) {
// Tell the user interleaving is not beneficial.		// Tell the user interleaving is not beneficial.
DEBUG(dbgs() << "LV: Interleaving is not beneficial.\n");		DEBUG(dbgs() << "LV: Interleaving is not beneficial.\n");
IntDiagMsg = std::make_pair(		IntDiagMsg = std::make_pair(
"InterleavingNotBeneficial",		"InterleavingNotBeneficial",
"the cost-model indicates that interleaving is not beneficial");		"the cost-model indicates that interleaving is not beneficial");
InterleaveLoop = false;		InterleaveLoop = false;
if (UserIC == 1) {		if (UserIC == 1) {
IntDiagMsg.first = "InterleavingNotBeneficialAndDisabled";		IntDiagMsg.first = "InterleavingNotBeneficialAndDisabled";
IntDiagMsg.second +=		IntDiagMsg.second +=
" and is explicitly disabled or interleave count is set to 1";		" and is explicitly disabled or interleave count is set to 1";
}		}
} else if (IC > 1 && UserIC == 1) {		} else if (IC > 1 && UserIC == 1) {
// Tell the user interleaving is beneficial, but it is explicitly disabled.		// Tell the user interleaving is beneficial, but it is explicitly disabled.
DEBUG(dbgs()		DEBUG(
<< "LV: Interleaving is beneficial but is explicitly disabled.");		dbgs() << "LV: Interleaving is beneficial but is explicitly disabled.");
IntDiagMsg = std::make_pair(		IntDiagMsg = std::make_pair(
"InterleavingBeneficialButDisabled",		"InterleavingBeneficialButDisabled",
"the cost-model indicates that interleaving is beneficial "		"the cost-model indicates that interleaving is beneficial "
"but is explicitly disabled or interleave count is set to 1");		"but is explicitly disabled or interleave count is set to 1");
InterleaveLoop = false;		InterleaveLoop = false;
}		}

// Override IC if user provided an interleave count.		// Override IC if user provided an interleave count.
IC = UserIC > 0 ? UserIC : IC;		IC = UserIC > 0 ? UserIC : IC;

		if (VectorizeLoop)
		LVP.setBestPlan(VF.Width, IC);

// Emit diagnostic messages, if any.		// Emit diagnostic messages, if any.
const char *VAPassName = Hints.vectorizeAnalysisPassName();		const char *VAPassName = Hints.vectorizeAnalysisPassName();
if (!VectorizeLoop && !InterleaveLoop) {		if (!VectorizeLoop && !InterleaveLoop) {
// Do not vectorize or interleave the loop.		// Do not vectorize or interleave the loop.
ORE->emit(OptimizationRemarkAnalysis(VAPassName, VecDiagMsg.first,		ORE->emit(OptimizationRemarkAnalysis(VAPassName, VecDiagMsg.first,
L->getStartLoc(), L->getHeader())		L->getStartLoc(), L->getHeader())
<< VecDiagMsg.second);		<< VecDiagMsg.second);
ORE->emit(OptimizationRemarkAnalysis(LV_NAME, IntDiagMsg.first,		ORE->emit(OptimizationRemarkAnalysis(LV_NAME, IntDiagMsg.first,
Show All 26 Lines	InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
&CM);		&CM);
Unroller.vectorize();		Unroller.vectorize();

ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),		ORE->emit(OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),
L->getHeader())		L->getHeader())
<< "interleaved loop (interleaved count: "		<< "interleaved loop (interleaved count: "
<< NV("InterleaveCount", IC) << ")");		<< NV("InterleaveCount", IC) << ")");
} else {		} else {

// If we decided that it is legal to vectorize the loop, then do it.		// If we decided that it is legal to vectorize the loop, then do it.
InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,		InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
&LVL, &CM);		&LVL, &CM);
LB.vectorize();
		LVP.executeBestPlan(LB);

++LoopsVectorized;		++LoopsVectorized;

// Add metadata to disable runtime unrolling a scalar loop when there are		// Add metadata to disable runtime unrolling a scalar loop when there are
// no runtime checks about strides and memory. A scalar loop that is		// no runtime checks about strides and memory. A scalar loop that is
// rarely used is not worth unrolling.		// rarely used is not worth unrolling.
if (!LB.areSafetyChecksAdded())		if (!LB.areSafetyChecksAdded())
AddRuntimeUnrollDisableMetaData(L);		AddRuntimeUnrollDisableMetaData(L);


lib/Transforms/Vectorize/VPlan.h

This file was added.

				//===- VPlan.h - Represent A Vectorizer Plan ------------------------------===//
				//
				[Inline comment, oren_ben_simhon, Not Done] This file is very big and will probably gain some more weight in time. I think that some utility classes could be moved to a separate file.
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains the declarations of the Vectorization Plan base classes:
				// 1. VPBasicBlock and VPRegionBlock that inherit from a common pure virtual
				// VPBlockBase, together implementing a Hierarchical CFG;
				// 2. Specializations of GraphTraits that allow VPBlockBase graphs to be treated
				// as proper graphs for generic algorithms;
				// 3. Pure virtual VPRecipeBase and its pure virtual sub-classes
				// VPConditionBitRecipeBase and VPOneByOneRecipeBase that
				// represent base classes for recipes contained within VPBasicBlocks;
				// 4. The VPlan class holding a candidate for vectorization;
				// 5. The VPlanUtils class providing methods for building plans;
				// 6. The VPlanPrinter class providing a way to print a plan in dot format.
				// These are documented in docs/VectorizationPlan.rst.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
				#define LLVM_TRANSFORMS_VECTORIZE_VPLAN_H

				#include "llvm/ADT/GraphTraits.h"
				#include "llvm/ADT/ilist.h"
				#include "llvm/ADT/ilist_node.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/Support/raw_ostream.h"

				// The (re)use of existing LoopVectorize classes is subject to future VPlan
				// refactoring.
				namespace {
				class InnerLoopVectorizer;
				class LoopVectorizationLegality;
				class LoopVectorizationCostModel;
				}

				namespace llvm {

				class VPBasicBlock;

				/// VPRecipeBase is a base class describing one or more instructions that will
				/// appear consecutively in the vectorized version, based on Instructions from
				/// the given IR. These Instructions are referred to as the "Ingredients" of
				/// the Recipe. A Recipe specifies how its ingredients are to be vectorized:
				/// e.g., copy or reuse them as uniform, scalarize or vectorize them according
				/// to an enclosing loop dimension, vectorize them according to internal SLP
				/// dimension.
				///
				/// Design principle: in order to reason about how to vectorize an
				/// Instruction or how much it would cost, one has to consult the VPRecipe
				/// holding it.
				///
				/// Design principle: when a sequence of instructions conveys additional
				/// information as a group, we use a VPRecipe to encapsulate them and attach
				/// this information to the VPRecipe. For instance a VPRecipe can model an
				/// interleave group of loads or stores with additional information for
				/// calculating their cost and for performing IR code generation, as a group.
				///
				/// Design principle: a VPRecipe should reuse existing containers of its
				/// ingredients, i.e., iterators of basic blocks, to be lightweight. A new
				/// container should be opened on-demand, e.g., to avoid excessive recipes
				/// each holding an interval of ingredients.
				class VPRecipeBase : public ilist_node_with_parent<VPRecipeBase, VPBasicBlock> {
				friend class VPlanUtils;
				[Inline comment, oren_ben_simhon, Not Done] IMHO, VPlanUtils should not be a friend class, because you don't access any of its private/protected members. Instead use VPlanUtils statically.
				friend class VPBasicBlock;

				private:
				const unsigned char VRID; /// Subclass identifier (for isa/dyn_cast)
				[Inline comment, oren_ben_simhon, Done] Use ///< for post member documentation.

				/// Each VPRecipe is contained in a single VPBasicBlock.
				class VPBasicBlock *Parent;
				[Inline comment, oren_ben_simhon, Not Done] I didn't follow the whole logic but are you sure that you need recipe parent? Also I think that "owner" is a better descriptive of the member than parent.

				/// Record which Instructions would require generating their complementing
				/// form as well, providing a vector-to-scalar or scalar-to-vector conversion.
				SmallPtrSet<Instruction *, 1> AlsoPackOrUnpack;

				public:
				/// An enumeration for keeping track of the concrete subclass of VPRecipeBase
				/// that is actually instantiated. Values of this enumeration are kept in the
				/// VPRecipe classes VRID field. They are used for concrete type
				/// identification.
				typedef enum {
				VPVectorizeOneByOneSC,
				VPScalarizeOneByOneSC,
				VPWidenIntInductionSC,
				VPBuildScalarStepsSC,
				VPInterleaveSC,
				VPExtractMaskBitSC,
				VPMergeScalarizeBranchSC,
				} VPRecipeTy;

				VPRecipeBase(const unsigned char SC) : VRID(SC), Parent(nullptr) {}

				virtual ~VPRecipeBase() {}

				/// \return an ID for the concrete type of this object.
				/// This is used to implement the classof checks. This should not be used
				/// for any other purpose, as the values may change as LLVM evolves.
				unsigned getVPRecipeID() const { return VRID; }

				/// \return the VPBasicBlock which this VPRecipe belongs to.
				class VPBasicBlock *getParent() {
				return Parent;
				}

				/// The method which generates the new IR instructions that correspond to
				/// this VPRecipe in the vectorized version, thereby "executing" the VPlan.
				virtual void vectorize(struct VPTransformState &State) = 0;

				/// Each recipe prints itself.
				virtual void print(raw_ostream &O) const = 0;

				/// Add an instruction to the set of instructions for which a vector-to-
				/// scalar or scalar-to-vector conversion is needed, in addition to
				/// vectorizing or scalarizing the instruction itself, respectively.
				void addAlsoPackOrUnpack(Instruction *I) { AlsoPackOrUnpack.insert(I); }

				/// Indicates if a given instruction requires vector-to-scalar or scalar-to-
				/// vector conversion.
				bool willAlsoPackOrUnpack(Instruction *I) const {
				return AlsoPackOrUnpack.count(I);
				}
				};

				/// A VPConditionBitRecipeBase is a pure virtual VPRecipe which supports a
				/// conditional branch. Concrete sub-classes of this recipe are in charge of
				/// generating the instructions that compute the condition for this branch in
				/// the vectorized version.
				class VPConditionBitRecipeBase : public VPRecipeBase {
				protected:
				/// The actual condition bit that was generated. Holds null until the
				/// value/instructions are generated by the vectorize() method.
				Value *ConditionBit;

				public:
				/// Construct a VPConditionBitRecipeBase, simply propagating its concrete type.
				VPConditionBitRecipeBase(const unsigned char SC)
				: VPRecipeBase(SC), ConditionBit(nullptr) {}

				/// \return the actual bit that was generated, to be plugged into the IR
				/// conditional branch, or null if the code computing the actual bit has not
				/// been generated yet.
				Value *getConditionBit() { return ConditionBit; }

				virtual StringRef getName() const = 0;
				};

				/// VPOneByOneRecipeBase is a VPRecipeBase which handles each Instruction in its
				/// ingredients independently, in order. The ingredients are either all
				/// vectorized, or all scalarized.
				/// A VPOneByOneRecipeBase is a virtual base recipe which can be materialized
				/// by one of two sub-classes, namely VPVectorizeOneByOneRecipe or
				/// VPScalarizeOneByOneRecipe for Vectorizing or Scalarizing all ingredients,
				/// respectively.
				/// The ingredients are held as a sub-sequence of original Instructions, which
				/// reside in the same IR BasicBlock and in the same order. The Ingredients are
				/// accessed by a pointer to the first and last Instruction.
				class VPOneByOneRecipeBase : public VPRecipeBase {
				friend class VPlanUtilsLoopVectorizer;

				public:
				/// Hold the ingredients by pointing to their original BasicBlock location.
				BasicBlock::iterator Begin;
				BasicBlock::iterator End;

				protected:
				VPOneByOneRecipeBase() = delete;

				VPOneByOneRecipeBase(unsigned char SC, const BasicBlock::iterator B,
				const BasicBlock::iterator E, class VPlan *Plan);

				/// Do the actual code generation for a single instruction.
				/// This function is to be implemented and specialized by the respective
				/// sub-class.
				virtual void transformIRInstruction(Instruction *I,
				struct VPTransformState &State) = 0;

				public:
				~VPOneByOneRecipeBase() {}

				/// Method to support type inquiry through isa, cast, and dyn_cast.
				static inline bool classof(const VPRecipeBase *V) {
				return V->getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC ||
				V->getVPRecipeID() == VPRecipeBase::VPVectorizeOneByOneSC;
				}

				bool isScalarizing() const {
				[Inline comment, oren_ben_simhon, Done] The function should be const.
				return getVPRecipeID() == VPRecipeBase::VPScalarizeOneByOneSC;
				}

				/// The method which generates all new IR instructions that correspond to
				/// this VPOneByOneRecipeBase in the vectorized version, thereby
				/// "executing" the VPlan.
				/// VPOneByOneRecipeBase may either scalarize or vectorize all Instructions.
				void vectorize(struct VPTransformState &State) override {
				for (auto It = Begin; It != End; ++It)
				transformIRInstruction(&*It, State);
				}

				const BasicBlock::iterator &begin() { return Begin; }

				const BasicBlock::iterator &end() { return End; }
				};
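
The `classof` method above follows the LLVM-style RTTI pattern: each object stores a subclass ID, and `isa`/`dyn_cast` consult a static `classof` instead of C++ RTTI. The sketch below reproduces the pattern without any LLVM dependency. All names (`RecipeBase`, `OneByOneBase`, `InterleaveRecipe`, `isaKind`, `dynCast`) are hypothetical simplifications and are not the patch's classes or LLVM's templates.

```cpp
#include <cassert>

// Base class carrying a subclass identifier, like VRID in VPRecipeBase.
struct RecipeBase {
  enum Kind { VectorizeOneByOne, ScalarizeOneByOne, Interleave };
  explicit RecipeBase(Kind K) : ID(K) {}
  virtual ~RecipeBase() = default;
  Kind getID() const { return ID; }

private:
  const Kind ID;
};

// An intermediate base that matches two concrete kinds, as
// VPOneByOneRecipeBase::classof does.
struct OneByOneBase : RecipeBase {
  using RecipeBase::RecipeBase;
  static bool classof(const RecipeBase *R) {
    return R->getID() == ScalarizeOneByOne || R->getID() == VectorizeOneByOne;
  }
  bool isScalarizing() const { return getID() == ScalarizeOneByOne; }
};

struct InterleaveRecipe : RecipeBase {
  InterleaveRecipe() : RecipeBase(Interleave) {}
  static bool classof(const RecipeBase *R) { return R->getID() == Interleave; }
};

// Minimal stand-ins for llvm::isa<> and llvm::dyn_cast<>.
template <typename To> bool isaKind(const RecipeBase *R) {
  return To::classof(R);
}
template <typename To> To *dynCast(RecipeBase *R) {
  return To::classof(R) ? static_cast<To *>(R) : nullptr;
}
```

Because an intermediate class such as `OneByOneBase` accepts a set of IDs, adding a third one-by-one subclass would require extending its `classof`; that is the main maintenance cost of this scheme.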

				/// Hold the indices of a specific scalar instruction. A VPIterationInstance
				/// spans the iterations of the original loop that correspond to a single
				/// iteration of the vectorized loop.
				struct VPIterationInstance {
				unsigned Part;
				unsigned Lane;
				};
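
One way to read the (Part, Lane) pair is as a position among the original-loop iterations covered by one vectorized iteration. The sketch below assumes the usual layout in which unrolled part `Part` holds a VF-wide vector and lane `Lane` indexes into it, so the offset is `Part * VF + Lane`. The patch text does not spell this mapping out, so treat it as an assumption; `IterationInstance`, `scalarIterationOffset` and `instanceForOffset` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical mirror of VPIterationInstance.
struct IterationInstance {
  unsigned Part;
  unsigned Lane;
};

// Offset of the original-loop iteration within one vectorized iteration,
// assuming parts are laid out consecutively, each VF lanes wide.
inline unsigned scalarIterationOffset(IterationInstance I, unsigned VF) {
  assert(I.Lane < VF && "Lane out of range for this VF.");
  return I.Part * VF + I.Lane;
}

// Inverse mapping: recover (Part, Lane) from an offset.
inline IterationInstance instanceForOffset(unsigned Offset, unsigned VF) {
  assert(VF > 0 && "VF must be positive.");
  return IterationInstance{Offset / VF, Offset % VF};
}
```

With VF = 4 and UF = 2, one vectorized iteration covers offsets 0..7, and the instance with Part 1 and Lane 2 sits at offset 6.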

				// Forward declaration.
				class BasicBlock;

				/// Hold additional information passed down when "executing" a VPlan, that is
				/// needed for generating IR. Also facilitates reuse of existing LV
				/// functionality.
				struct VPTransformState {

				VPTransformState(unsigned VF, unsigned UF, class LoopInfo *LI,
				class DominatorTree *DT, IRBuilder<> &Builder,
				InnerLoopVectorizer *ILV, LoopVectorizationLegality *Legal,
				LoopVectorizationCostModel *Cost)
				: VF(VF), UF(UF), Instance(nullptr), LI(LI), DT(DT), Builder(Builder),
				ILV(ILV), Legal(Legal), Cost(Cost) {}

				/// Record the selected vectorization and unroll factors of the single loop
				/// being vectorized.
				unsigned VF;
				unsigned UF;

				/// Hold the indices to generate a specific scalar instruction. Null indicates
				/// that all instances are to be generated, using either scalar or vector
				/// instructions.
				VPIterationInstance *Instance;

				/// Hold state information used when constructing the CFG of the vectorized
				/// Loop, traversing the VPBasicBlocks and generating corresponding IR
				/// BasicBlocks.
				struct CFGState {
				/// The previous VPBasicBlock visited. In the beginning set to null.
				VPBasicBlock *PrevVPBB;
				/// The previous IR BasicBlock created or reused. In the beginning set to
				/// the new header BasicBlock.
				BasicBlock *PrevBB;
				/// The last IR BasicBlock of the loop body. Set to the new latch
				/// BasicBlock, used for placing the newly created BasicBlocks.
				BasicBlock *LastBB;
				/// A mapping of each VPBasicBlock to the corresponding BasicBlock. In case
				/// of replication, maps the BasicBlock of the last replica created.
				SmallDenseMap<class VPBasicBlock *, class BasicBlock *> VPBB2IRBB;

				CFGState() : PrevVPBB(nullptr), PrevBB(nullptr), LastBB(nullptr) {}
				} CFG;

				/// Hold pointer to LoopInfo to register new basic blocks in the loop.
				class LoopInfo *LI;

				/// Hold pointer to Dominator Tree to register new basic blocks in the loop.
				class DominatorTree *DT;

				/// Hold a reference to the IRBuilder used to generate IR code.
				IRBuilder<> &Builder;

				/// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.
				class InnerLoopVectorizer *ILV;

				/// Hold a pointer to LoopVectorizationLegality
				class LoopVectorizationLegality *Legal;

				/// Hold a pointer to LoopVectorizationCostModel to access its
				/// IsUniformAfterVectorization method.
				LoopVectorizationCostModel *Cost;
				};

				/// VPBlockBase is the building block of the Hierarchical CFG. A VPBlockBase
				/// can be either a VPBasicBlock or a VPRegionBlock.
				///
				/// The Hierarchical CFG is a control-flow graph whose nodes are basic-blocks
				/// or Hierarchical CFG's. The Hierarchical CFG data structure we use is similar
				/// to the Tile Tree [1], where cross-Tile edges are lifted to connect Tiles
				/// instead of the original basic-blocks as in Sharir [2], promoting the Tile
				/// encapsulation. We use the terms Region and Block rather than Tile [1] to
				/// avoid confusion with loop tiling.
				///
				/// [1] "Register Allocation via Hierarchical Graph Coloring", David Callahan
				/// and Brian Koblenz, PLDI 1991
				///
				/// [2] "Structural analysis: A new approach to flow analysis in optimizing
				/// compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
				///
				/// Note that in contrast to the IR BasicBlock, a VPBlockBase models its
				/// control-flow edges with successor and predecessor VPBlockBase directly,
				/// rather than through a Terminator branch or through predecessor branches that
				/// Use the VPBlockBase.
				class VPBlockBase {
				friend class VPlanUtils;

				private:
				const unsigned char VBID; /// Subclass identifier (for isa/dyn_cast).

				std::string Name;

				/// The immediate VPRegionBlock which this VPBlockBase belongs to, or null if
				/// it is a topmost VPBlockBase.
				[Inline comment, oren_ben_simhon, Done] I know that the initial size of SmallVectors has minimum effect, still I believe that the default size of the Predecessors/Successors should be one.
				class VPRegionBlock *Parent;

				/// List of predecessor blocks.
				SmallVector<VPBlockBase *, 1> Predecessors;

				[Inline comment, oren_ben_simhon, Done] The \brief will have no effect because anyway the comment until the first dot will be considered as brief by Doxygen.
				/// List of successor blocks.
				SmallVector<VPBlockBase *, 1> Successors;

				/// Successor selector, null for zero or single successor blocks.
				VPConditionBitRecipeBase *ConditionBitRecipe;

				/// Add \p Successor as the last successor to this block.
				void appendSuccessor(VPBlockBase *Successor) {
				assert(Successor && "Cannot add nullptr successor!");
				Successors.push_back(Successor);
				}

				/// Add \p Predecessor as the last predecessor to this block.
				void appendPredecessor(VPBlockBase *Predecessor) {
				assert(Predecessor && "Cannot add nullptr predecessor!");
				Predecessors.push_back(Predecessor);
				}

				/// Remove \p Predecessor from the predecessors of this block.
				void removePredecessor(VPBlockBase *Predecessor) {
				auto Pos = std::find(Predecessors.begin(), Predecessors.end(), Predecessor);
				assert(Pos != Predecessors.end() && "Predecessor does not exist");
				Predecessors.erase(Pos);
				}

				/// Remove \p Successor from the successors of this block.
				void removeSuccessor(VPBlockBase *Successor) {
				auto Pos = std::find(Successors.begin(), Successors.end(), Successor);
				assert(Pos != Successors.end() && "Successor does not exist");
				Successors.erase(Pos);
				}

				protected:
				VPBlockBase(const unsigned char SC, const std::string &N)
				: VBID(SC), Name(N), Parent(nullptr), ConditionBitRecipe(nullptr) {}

				public:
				/// An enumeration for keeping track of the concrete subclass of VPBlockBase
				/// that is actually instantiated. Values of this enumeration are kept in the
				/// VPBlockBase classes VBID field. They are used for concrete type
				/// identification.
				typedef enum { VPBasicBlockSC, VPRegionBlockSC } VPBlockTy;

				virtual ~VPBlockBase() {}

				const std::string &getName() const { return Name; }

				/// \return an ID for the concrete type of this object.
				/// This is used to implement the classof checks. This should not be used
				/// for any other purpose, as the values may change as LLVM evolves.
				unsigned getVPBlockID() const { return VBID; }

				const class VPRegionBlock *getParent() const { return Parent; }

				/// \return the VPBasicBlock that is the entry of this VPBlockBase,
				/// recursively, if the latter is a VPRegionBlock. Otherwise, if this
				/// VPBlockBase is a VPBasicBlock, it is returned.
				const class VPBasicBlock *getEntryBasicBlock() const;

				/// \return the VPBasicBlock that is the exit of this VPBlockBase,
				/// recursively, if the latter is a VPRegionBlock. Otherwise, if this
				/// VPBlockBase is a VPBasicBlock, it is returned.
				const class VPBasicBlock *getExitBasicBlock() const;
				class VPBasicBlock *getExitBasicBlock();

				const SmallVectorImpl<VPBlockBase *> &getSuccessors() const {
				return Successors;
				}

				const SmallVectorImpl<VPBlockBase *> &getPredecessors() const {
				return Predecessors;
				}

				SmallVectorImpl<VPBlockBase *> &getSuccessors() { return Successors; }

				SmallVectorImpl<VPBlockBase *> &getPredecessors() { return Predecessors; }

				/// \return the successor of this VPBlockBase if it has a single successor.
				/// Otherwise return a null pointer.
				VPBlockBase *getSingleSuccessor() const {
				return (Successors.size() == 1 ? *Successors.begin() : nullptr);
				}

				/// \return the predecessor of this VPBlockBase if it has a single
				/// predecessor. Otherwise return a null pointer.
				VPBlockBase *getSinglePredecessor() const {
				return (Predecessors.size() == 1 ? *Predecessors.begin() : nullptr);
				}

				/// Returns the closest ancestor starting from "this", which has successors.
				/// Returns the root ancestor if all ancestors have no successors.
				VPBlockBase *getAncestorWithSuccessors();

				/// Returns the closest ancestor starting from "this", which has predecessors.
				/// Returns the root ancestor if all ancestors have no predecessors.
				VPBlockBase *getAncestorWithPredecessors();

				/// \return the successors either attached directly to this VPBlockBase or, if
				/// this VPBlockBase is the exit block of a VPRegionBlock and has no
				/// successors of its own, search recursively for the first enclosing
				/// VPRegionBlock that has successors and return them. If no such
				/// VPRegionBlock exists, return the (empty) successors of the topmost
				/// VPBlockBase reached.
				const SmallVectorImpl<VPBlockBase *> &getHierarchicalSuccessors() {
				return getAncestorWithSuccessors()->getSuccessors();
				}

				/// \return the hierarchical successor of this VPBlockBase if it has a single
				/// hierarchical successor. Otherwise return a null pointer.
				VPBlockBase *getSingleHierarchicalSuccessor() {
				return getAncestorWithSuccessors()->getSingleSuccessor();
				}

				/// \return the predecessors either attached directly to this VPBlockBase or,
				/// if this VPBlockBase is the entry block of a VPRegionBlock and has no
				/// predecessors of its own, search recursively for the first enclosing
				/// VPRegionBlock that has predecessors and return them. If no such
				/// VPRegionBlock exists, return the (empty) predecessors of the topmost
				/// VPBlockBase reached.
				const SmallVectorImpl<VPBlockBase *> &getHierarchicalPredecessors() {
				return getAncestorWithPredecessors()->getPredecessors();
				}

				/// \return the hierarchical predecessor of this VPBlockBase if it has a
				/// single hierarchical predecessor. Otherwise return a null pointer.
				VPBlockBase *getSingleHierarchicalPredecessor() {
				return getAncestorWithPredecessors()->getSinglePredecessor();
				}

				/// If a VPBlockBase has two successors, this is the Recipe that will generate
				/// the condition bit selecting the successor, and feeding the terminating
				/// conditional branch. Otherwise this is null.
				VPConditionBitRecipeBase *getConditionBitRecipe() {
				return ConditionBitRecipe;
				}

				const VPConditionBitRecipeBase *getConditionBitRecipe() const {
				return ConditionBitRecipe;
				}

				void setConditionBitRecipe(VPConditionBitRecipeBase *R) {
				ConditionBitRecipe = R;
				}

				/// The method which generates all new IR instructions that correspond to
				/// this VPBlockBase in the vectorized version, thereby "executing" the VPlan.
				virtual void vectorize(struct VPTransformState *State) = 0;

				/// Delete all blocks reachable from a given VPBlockBase, inclusive.
				static void deleteCFG(VPBlockBase *Entry);
				};

				/// VPBasicBlock serves as the leaf of the Hierarchical CFG. It represents a
				/// sequence of instructions that will appear consecutively in a basic block
				/// of the vectorized version. The VPBasicBlock takes care of the control-flow
				/// relations with other VPBasicBlock's and Regions. It holds a sequence of zero
				/// or more VPRecipe's that take care of representing the instructions.
				/// A VPBasicBlock that holds no VPRecipe's represents no instructions; this
				/// may happen, e.g., to support disjoint Regions and to ensure Regions have a
				/// single exit, possibly an empty one.
				///
				/// Note that in contrast to the IR BasicBlock, a VPBasicBlock models its
				/// control-flow edges with successor and predecessor VPBlockBase directly,
				/// rather than through a Terminator branch or through predecessor branches that
				/// "use" the VPBasicBlock.
				class VPBasicBlock : public VPBlockBase {
				friend class VPlanUtils;

				public:
				typedef iplist<VPRecipeBase> RecipeListTy;

				private:
				/// The list of VPRecipes, held in order of instructions to generate.
				RecipeListTy Recipes;

				public:
				/// Instruction iterators...
				typedef RecipeListTy::iterator iterator;
				typedef RecipeListTy::const_iterator const_iterator;
				typedef RecipeListTy::reverse_iterator reverse_iterator;
				typedef RecipeListTy::const_reverse_iterator const_reverse_iterator;

				//===--------------------------------------------------------------------===//
				/// Recipe iterator methods
				///
				inline iterator begin() { return Recipes.begin(); }
				inline const_iterator begin() const { return Recipes.begin(); }
				inline iterator end() { return Recipes.end(); }
				inline const_iterator end() const { return Recipes.end(); }

				inline reverse_iterator rbegin() { return Recipes.rbegin(); }
				inline const_reverse_iterator rbegin() const { return Recipes.rbegin(); }
				inline reverse_iterator rend() { return Recipes.rend(); }
				inline const_reverse_iterator rend() const { return Recipes.rend(); }

				inline size_t size() const { return Recipes.size(); }
				inline bool empty() const { return Recipes.empty(); }
				inline const VPRecipeBase &front() const { return Recipes.front(); }
				inline VPRecipeBase &front() { return Recipes.front(); }
				oren_ben_simhon: No need for the brief command.
				inline const VPRecipeBase &back() const { return Recipes.back(); }
				inline VPRecipeBase &back() { return Recipes.back(); }

				/// Return the underlying instruction list container.
				///
				/// Currently you need to access the underlying instruction list container
				/// directly if you want to modify it.
				const RecipeListTy &getInstList() const { return Recipes; }
				oren_ben_simhon: The function receives a parameter that it doesn't use. Also it is static although it returns a non-static class member.
				RecipeListTy &getInstList() { return Recipes; }

				/// Returns a pointer to a member of the instruction list.
				static RecipeListTy VPBasicBlock::*getSublistAccess(VPRecipeBase *) {
				return &VPBasicBlock::Recipes;
				}

				VPBasicBlock(const std::string &Name) : VPBlockBase(VPBasicBlockSC, Name) {}

				~VPBasicBlock() { Recipes.clear(); }

				/// Method to support type inquiry through isa, cast, and dyn_cast.
				static inline bool classof(const VPBlockBase *V) {
				return V->getVPBlockID() == VPBlockBase::VPBasicBlockSC;
				}

				/// Augment the existing recipes of a VPBasicBlock with an additional
				/// \p Recipe at a position given by an existing recipe \p Before. If
				/// \p Before is null, \p Recipe is appended as the last recipe.
				void addRecipe(VPRecipeBase *Recipe, VPRecipeBase *Before = nullptr) {
				Recipe->Parent = this;
				if (!Before) {
				Recipes.push_back(Recipe);
				return;
				}
				assert(Before->Parent == this &&
				"Insertion before point not in this basic block.");
				Recipes.insert(Before->getIterator(), Recipe);
				}

				void removeRecipe(VPRecipeBase *Recipe) {
				assert(Recipe->Parent == this &&
				"Recipe to remove not in this basic block.");
				Recipes.remove(Recipe);
				Recipe->Parent = nullptr;
				}

				/// The method which generates all new IR instructions that correspond to
				/// this VPBasicBlock in the vectorized version, thereby "executing" the
				/// VPlan.
				void vectorize(struct VPTransformState *State) override;

				/// Retrieve the list of VPRecipes that belong to this VPBasicBlock.
				const RecipeListTy &getRecipes() const { return Recipes; }

				private:
				/// Create an IR BasicBlock to hold the instructions vectorized from this
				/// VPBasicBlock, and return it. Update the CFGState accordingly.
				BasicBlock *createEmptyBasicBlock(VPTransformState::CFGState &CFG);
				};

				/// VPRegionBlock represents a collection of VPBasicBlocks and VPRegionBlocks
				/// which form a single-entry-single-exit subgraph of the CFG in the vectorized
				/// code.
				///
				/// A VPRegionBlock may indicate that its contents are to be replicated several
				/// times. This is designed to support predicated scalarization, in which a
				/// scalar if-then code structure needs to be generated VF * UF times. Having
				/// this replication indicator helps to keep a single VPlan for multiple
				/// candidate VF's; the actual replication takes place only once the desired VF
				/// and UF have been determined.
				///
				/// Design principle: when some additional information relates to an SESE
				/// set of VPBlockBase, we use a VPRegionBlock to wrap them and attach the
				/// information to it. For example, a VPRegionBlock can be used to indicate that
				/// a scalarized SESE region is to be replicated, and that a vectorized SESE
				/// region can retain its internal control-flow, independent of the control-flow
				/// external to the region.
				class VPRegionBlock : public VPBlockBase {
				friend class VPlanUtils;
				oren_ben_simhon: Please consider using a single list/vector instead of two members Entry/Exit.

				private:
				/// Hold the Single Entry of the SESE region represented by the VPRegionBlock.
				VPBlockBase *Entry;

				/// Hold the Single Exit of the SESE region represented by the VPRegionBlock.
				VPBlockBase *Exit;

				/// A VPRegionBlock can represent either a single instance of its
				/// VPBlockBases, or multiple (VF * UF) replicated instances. The latter is
				/// used when the internal SESE region handles a single scalarized lane.
				bool IsReplicator;

				public:
				VPRegionBlock(const std::string &Name)
				: VPBlockBase(VPRegionBlockSC, Name), Entry(nullptr), Exit(nullptr),
				IsReplicator(false) {}

				~VPRegionBlock() {
				if (Entry)
				deleteCFG(Entry);
				}

				/// Method to support type inquiry through isa, cast, and dyn_cast.
				static inline bool classof(const VPBlockBase *V) {
				return V->getVPBlockID() == VPBlockBase::VPRegionBlockSC;
				}

				VPBlockBase *getEntry() { return Entry; }

				VPBlockBase *getExit() { return Exit; }

				const VPBlockBase *getEntry() const { return Entry; }

				const VPBlockBase *getExit() const { return Exit; }

				/// An indicator if the VPRegionBlock represents single or multiple instances.
				bool isReplicator() const { return IsReplicator; }

				void setReplicator(bool ToReplicate) { IsReplicator = ToReplicate; }

				/// The method which generates the new IR instructions that correspond to
				/// this VPRegionBlock in the vectorized version, thereby "executing" the
				/// VPlan.
				void vectorize(struct VPTransformState *State) override;
				};

				/// A VPlan represents a candidate for vectorization, encoding various decisions
				/// taken to produce efficient vector code, including: which instructions are to
				/// be vectorized or scalarized, and which branches are to appear in the
				/// vectorized version. It models the control-flow of the candidate vectorized version
				/// explicitly, and holds prescriptions for generating the code for this version
				/// from a given IR code.
				/// VPlan takes a "scenario-based approach" to vectorization planning - different
				/// scenarios, corresponding to making different decisions, can be modeled using
				/// different VPlans.
				/// The corresponding IR code is required to be SESE.
				/// The vectorized version is represented using a Hierarchical CFG.
				class VPlan {
				friend class VPlanUtils;
				friend class VPlanUtilsLoopVectorizer;

				private:
				/// Hold the single entry to the Hierarchical CFG of the VPlan.
				VPBlockBase *Entry;

				/// The IR instructions which are to be transformed to fill the vectorized
				/// version are held as ingredients inside the VPRecipe's of the VPlan. Hold a
				/// reverse mapping to locate the VPRecipe an IR instruction belongs to. This
				/// serves optimizations that operate on the VPlan.
				DenseMap<Instruction *, VPRecipeBase *> Inst2Recipe;

				public:
				VPlan() : Entry(nullptr) {}

				~VPlan() {
				if (Entry)
				VPBlockBase::deleteCFG(Entry);
				}

				/// Generate the IR code for this VPlan.
				void vectorize(struct VPTransformState *State);

				VPBlockBase *getEntry() { return Entry; }
				const VPBlockBase *getEntry() const { return Entry; }

				void setEntry(VPBlockBase *Block) { Entry = Block; }

				/// Retrieve the VPRecipe a given instruction \p Inst belongs to in the VPlan.
				/// Returns null if it belongs to no VPRecipe.
				VPRecipeBase *getRecipe(Instruction *Inst) {
				auto It = Inst2Recipe.find(Inst);
				if (It == Inst2Recipe.end())
				return nullptr;
				return It->second;
				}

				void setInst2Recipe(Instruction *I, VPRecipeBase *R) { Inst2Recipe[I] = R; }

				void resetInst2Recipe(Instruction *I) { Inst2Recipe.erase(I); }

				/// Retrieve the VPBasicBlock a given instruction \p Inst belongs to in the
				/// VPlan. Returns null if it belongs to no VPRecipe.
				VPBasicBlock *getBasicBlock(Instruction *Inst) {
				VPRecipeBase *Recipe = getRecipe(Inst);
				if (!Recipe)
				return nullptr;
				return Recipe->getParent();
				}

				private:
				/// Add to the given dominator tree the header block and every new basic block
				/// that was created between it and the latch block, inclusive.
				void updateDominatorTree(class DominatorTree *DT, BasicBlock *LoopPreHeaderBB,
				BasicBlock *LoopLatchBB);
				};

				/// The VPlanUtils class provides interfaces for the construction and
				/// manipulation of a VPlan.
				class VPlanUtils {
				private:
				/// Unique ID generator.
				oren_ben_simhon: The plan is not used anywhere in the class.
				static unsigned NextOrdinal;

				protected:
				oren_ben_simhon: This function member is not referenced anywhere.
				VPlan *Plan;

				typedef iplist<VPRecipeBase> RecipeListTy;
				oren_ben_simhon: This class should be a singleton class that should not be initialized. As such the constructor should be private.
				RecipeListTy *getRecipes(VPBasicBlock *Block) { return &Block->Recipes; }

				public:
				oren_ben_simhon: All functions that do not relate to a non-static member, should be static and const.
				VPlanUtils(VPlan *Plan) : Plan(Plan) {}

				~VPlanUtils() {}

				/// Create a unique name for a new VPlan entity such as a VPBasicBlock or
				/// VPRegionBlock.
				std::string createUniqueName(const char *Prefix) {
				std::string S;
				raw_string_ostream RSO(S);
				RSO << Prefix << NextOrdinal++;
				return RSO.str();
				}

				/// Add a given \p Recipe as the last recipe of a given VPBasicBlock.
				void appendRecipeToBasicBlock(VPRecipeBase *Recipe, VPBasicBlock *ToVPBB) {
				assert(Recipe && "No recipe to append.");
				assert(!Recipe->Parent && "Recipe already in VPlan");
				ToVPBB->addRecipe(Recipe);
				}

				/// Create a new empty VPBasicBlock and return it.
				VPBasicBlock *createBasicBlock() {
				VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB"));
				return BasicBlock;
				}

				/// Create a new VPBasicBlock with a single \p Recipe and return it.
				VPBasicBlock *createBasicBlock(VPRecipeBase *Recipe) {
				VPBasicBlock *BasicBlock = new VPBasicBlock(createUniqueName("BB"));
				appendRecipeToBasicBlock(Recipe, BasicBlock);
				return BasicBlock;
				}

				/// Create a new, empty VPRegionBlock, with no blocks.
				VPRegionBlock *createRegion(bool IsReplicator) {
				VPRegionBlock *Region = new VPRegionBlock(createUniqueName("region"));
				setReplicator(Region, IsReplicator);
				return Region;
				}

				/// Set the entry VPBlockBase of a given VPRegionBlock to a given \p Block.
				/// Block is to have no predecessors.
				void setRegionEntry(VPRegionBlock *Region, VPBlockBase *Block) {
				assert(Block->Predecessors.empty() &&
				"Entry block cannot have predecessors.");
				Region->Entry = Block;
				Block->Parent = Region;
				}

				/// Set the exit VPBlockBase of a given VPRegionBlock to a given \p Block.
				/// Block is to have no successors.
				void setRegionExit(VPRegionBlock *Region, VPBlockBase *Block) {
				assert(Block->Successors.empty() && "Exit block cannot have successors.");
				Region->Exit = Block;
				Block->Parent = Region;
				}

				void setReplicator(VPRegionBlock *Region, bool ToReplicate) {
				Region->setReplicator(ToReplicate);
				}

				/// Sets a given VPBlockBase \p Successor as the single successor of another
				/// VPBlockBase \p Block. The parent of \p Block is copied to be the parent of
				/// \p Successor.
				void setSuccessor(VPBlockBase *Block, VPBlockBase *Successor) {
				assert(Block->getSuccessors().empty() && "Block successors already set.");
				Block->appendSuccessor(Successor);
				Successor->appendPredecessor(Block);
				Successor->Parent = Block->Parent;
				}

				/// Sets two given VPBlockBases \p IfTrue and \p IfFalse to be the two
				oren_ben_simhon: I think that the function name should be more descriptive to reflect the nature of the function, for example "setConditionalSuccessors".
				/// successors of another VPBlockBase \p Block. A given
				/// VPConditionBitRecipeBase provides the control selector. The parent of
				/// \p Block is copied to be the parent of \p IfTrue and \p IfFalse.
				void setTwoSuccessors(VPBlockBase *Block, VPConditionBitRecipeBase *R,
				VPBlockBase *IfTrue, VPBlockBase *IfFalse) {
				assert(Block->getSuccessors().empty() && "Block successors already set.");
				Block->setConditionBitRecipe(R);
				Block->appendSuccessor(IfTrue);
				Block->appendSuccessor(IfFalse);
				IfTrue->appendPredecessor(Block);
				IfFalse->appendPredecessor(Block);
				IfTrue->Parent = Block->Parent;
				IfFalse->Parent = Block->Parent;
				}

				/// Given two VPBlockBases \p From and \p To, disconnect them from each other.
				void disconnectBlocks(VPBlockBase *From, VPBlockBase *To) {
				From->removeSuccessor(To);
				To->removePredecessor(From);
				}
				};

				/// VPlanPrinter prints a given VPlan to a given output stream. The printing is
				/// indented and follows the dot format.
				class VPlanPrinter {
				private:
				raw_ostream &OS;
				const VPlan &Plan;
				unsigned Depth;
				unsigned TabLength = 2;
				std::string Indent;

				/// Handle indentation.
				void buildIndent() { Indent = std::string(Depth * TabLength, ' '); }
				void resetDepth() {
				Depth = 1;
				buildIndent();
				}
				void increaseDepth() {
				++Depth;
				buildIndent();
				}
				void decreaseDepth() {
				--Depth;
				buildIndent();
				}

				/// Dump each element of VPlan.
				void dumpBlock(const VPBlockBase *Block);
				void dumpEdges(const VPBlockBase *Block);
				void dumpBasicBlock(const VPBasicBlock *BasicBlock);
				void dumpRegion(const VPRegionBlock *Region);

				const char *getNodePrefix(const VPBlockBase *Block);
				const std::string &getReplicatorString(const VPRegionBlock *Region);
				void drawEdge(const VPBlockBase *From, const VPBlockBase *To, bool Hidden,
				const Twine &Label);

				public:
				VPlanPrinter(raw_ostream &O, const VPlan &P) : OS(O), Plan(P) {}
				void dump(const std::string &Title = "");
				};

				//===--------------------------------------------------------------------===//
				// GraphTraits specializations for VPlan/VPRegionBlock Control-Flow Graphs //
				oren_ben_simhon: Maybe use a doxygen style comment to all class documentations.
				//===--------------------------------------------------------------------===//

				// Provide specializations of GraphTraits to be able to treat a VPRegionBlock
				// as a graph of VPBlockBases...

				template <> struct GraphTraits<VPBlockBase *> {
				typedef VPBlockBase *NodeRef;
				typedef SmallVectorImpl<VPBlockBase *>::iterator ChildIteratorType;

				static NodeRef getEntryNode(NodeRef N) { return N; }

				static inline ChildIteratorType child_begin(NodeRef N) {
				return N->getSuccessors().begin();
				}

				static inline ChildIteratorType child_end(NodeRef N) {
				return N->getSuccessors().end();
				}
				};

				template <> struct GraphTraits<const VPBlockBase *> {
				typedef const VPBlockBase *NodeRef;
				typedef SmallVectorImpl<VPBlockBase *>::const_iterator ChildIteratorType;

				static NodeRef getEntryNode(NodeRef N) { return N; }

				static inline ChildIteratorType child_begin(NodeRef N) {
				return N->getSuccessors().begin();
				}

				static inline ChildIteratorType child_end(NodeRef N) {
				return N->getSuccessors().end();
				}
				};

				// Provide specializations of GraphTraits to be able to treat a VPRegionBlock as
				// a graph of VPBasicBlocks... and to walk it in inverse order. Inverse order
				// for a VPRegionBlock is considered to be when traversing the predecessor edges
				// of a VPBlockBase instead of the successor edges.
				//

				template <> struct GraphTraits<Inverse<VPBlockBase *>> {
				typedef VPBlockBase *NodeRef;
				typedef SmallVectorImpl<VPBlockBase *>::iterator ChildIteratorType;

				static Inverse<VPBlockBase *> getEntryNode(Inverse<VPBlockBase *> B) {
				return B;
				}

				static inline ChildIteratorType child_begin(NodeRef N) {
				return N->getPredecessors().begin();
				}

				static inline ChildIteratorType child_end(NodeRef N) {
				return N->getPredecessors().end();
				}
				};

				} // namespace llvm

				#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
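
The hierarchical successor lookup declared above (getAncestorWithSuccessors and getSingleHierarchicalSuccessor) can be illustrated with a minimal, standalone sketch. The `Block` struct and the `demoHierarchicalSuccessor` helper below are hypothetical stand-ins invented for illustration; they are not part of the patch and model only the Parent, Exit and Successors relations, assuming a region's exit block has no successors of its own.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for VPBlockBase / VPRegionBlock.
// Only the fields needed for the hierarchical successor lookup are modeled.
struct Block {
  std::string Name;
  Block *Parent = nullptr;         // Enclosing region, if any.
  Block *Exit = nullptr;           // Set only on region-like blocks.
  std::vector<Block *> Successors; // Successors attached directly to this block.

  explicit Block(std::string N) : Name(std::move(N)) {}

  // Climb through enclosing regions until a block with successors is found,
  // or the root is reached (same shape as getAncestorWithSuccessors()).
  Block *ancestorWithSuccessors() {
    if (!Successors.empty() || !Parent)
      return this;
    assert(Parent->Exit == this &&
           "Block w/o successors not the exit of its parent.");
    return Parent->ancestorWithSuccessors();
  }

  // Single hierarchical successor, or nullptr if there is none or several.
  Block *singleHierarchicalSuccessor() {
    Block *A = ancestorWithSuccessors();
    return A->Successors.size() == 1 ? A->Successors.front() : nullptr;
  }
};

// Builds: region { entry -> exit } -> after, and returns the name of the
// hierarchical successor of the region's exit block.
std::string demoHierarchicalSuccessor() {
  Block Region("region"), Entry("entry"), Exit("exit"), After("after");
  Entry.Parent = &Region;
  Exit.Parent = &Region;
  Region.Exit = &Exit;
  Entry.Successors.push_back(&Exit);
  Region.Successors.push_back(&After);

  Block *S = Exit.singleHierarchicalSuccessor();
  return S ? S->Name : std::string();
}
```

In this sketch the exit block has no direct successors, so the lookup climbs to the enclosing region and returns the region's successor. This is the property createEmptyBasicBlock and VPBasicBlock::vectorize appear to rely on when wiring edges across region boundaries.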

lib/Transforms/Vectorize/VPlan.cpp

This file was added.

				//===- VPlan.cpp - Vectorizer Plan ----------------------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This is the LLVM vectorization plan. It represents a candidate for
				// vectorization, allowing to plan and optimize how to vectorize a given loop
				oren_ben_simhon: Please follow LLVM guidelines for file documentation: http://llvm.org/docs/CodingStandards.html#file-headers
				// before generating LLVM-IR.
				// The vectorizer uses vectorization plans to estimate the costs of potential
				// candidates and, if profitable, to execute the desired plan, generating vector
				// LLVM-IR code.
				//
				//===----------------------------------------------------------------------===//

				#include "VPlan.h"
				#include "llvm/ADT/PostOrderIterator.h"
				#include "llvm/Analysis/LoopInfo.h"
				#include "llvm/IR/BasicBlock.h"
				#include "llvm/IR/Dominators.h"
				#include "llvm/Support/GraphWriter.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"

				using namespace llvm;

				#define DEBUG_TYPE "vplan"

				unsigned VPlanUtils::NextOrdinal = 1;

				VPOneByOneRecipeBase::VPOneByOneRecipeBase(unsigned char SC,
				const BasicBlock::iterator B,
				const BasicBlock::iterator E,
				class VPlan *Plan)
				: VPRecipeBase(SC), Begin(B), End(E) {
				for (auto It = B; It != E; ++It)
				Plan->setInst2Recipe(&*It, this);
				}

				/// \return the VPBasicBlock that is the entry of Block, possibly indirectly.
				const VPBasicBlock *VPBlockBase::getEntryBasicBlock() const {
				const VPBlockBase *Block = this;
				while (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
				Block = Region->getEntry();
				return cast<VPBasicBlock>(Block);
				}

				/// \return the VPBasicBlock that is the exit of Block, possibly indirectly.
				const VPBasicBlock *VPBlockBase::getExitBasicBlock() const {
				const VPBlockBase *Block = this;
				while (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
				Block = Region->getExit();
				return cast<VPBasicBlock>(Block);
				}

				VPBasicBlock *VPBlockBase::getExitBasicBlock() {
				VPBlockBase *Block = this;
				while (VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
				Block = Region->getExit();
				return cast<VPBasicBlock>(Block);
				}

				/// Returns the closest ancestor, starting from "this", which has successors.
				/// Returns the root ancestor if all ancestors have no successors.
				VPBlockBase *VPBlockBase::getAncestorWithSuccessors() {
				if (!Successors.empty() || !Parent)
				return this;
				assert(Parent->getExit() == this &&
				"Block w/o successors not the exit of its parent.");
				return Parent->getAncestorWithSuccessors();
				}

				/// Returns the closest ancestor, starting from "this", which has predecessors.
				/// Returns the root ancestor if all ancestors have no predecessors.
				VPBlockBase *VPBlockBase::getAncestorWithPredecessors() {
				if (!Predecessors.empty() || !Parent)
				return this;
				assert(Parent->getEntry() == this &&
				"Block w/o predecessors not the entry of its parent.");
				return Parent->getAncestorWithPredecessors();
				}

				void VPBlockBase::deleteCFG(VPBlockBase *Entry) {
				SmallVector<VPBlockBase *, 8> Blocks;
				for (VPBlockBase *Block : depth_first(Entry))
				Blocks.push_back(Block);

				for (VPBlockBase *Block : Blocks)
				delete Block;
				}

				BasicBlock *
				VPBasicBlock::createEmptyBasicBlock(VPTransformState::CFGState &CFG) {
				// BB stands for IR BasicBlocks. VPBB stands for VPlan VPBasicBlocks.
				// Pred stands for Predecessor. Prev stands for Previous, last visited/created.
				BasicBlock *PrevBB = CFG.PrevBB;
				BasicBlock *NewBB = BasicBlock::Create(PrevBB->getContext(), "VPlannedBB",
				PrevBB->getParent(), CFG.LastBB);
				DEBUG(dbgs() << "LV: created " << NewBB->getName() << '\n');

				// Hook up the new basic block to its predecessors.
				for (VPBlockBase *PredVPBlock : getHierarchicalPredecessors()) {
				VPBasicBlock *PredVPBB = PredVPBlock->getExitBasicBlock();
				BasicBlock *PredBB = CFG.VPBB2IRBB[PredVPBB];
				DEBUG(dbgs() << "LV: draw edge from " << PredBB->getName() << '\n');
				if (isa<UnreachableInst>(PredBB->getTerminator())) {
				PredBB->getTerminator()->eraseFromParent();
				BranchInst::Create(NewBB, PredBB);
				} else {
				// Replace old unconditional branch with new conditional branch.
				// Note: we rely on traversing the successors in order.
				BasicBlock *FirstSuccBB = PredBB->getSingleSuccessor();
				PredBB->getTerminator()->eraseFromParent();
				Value *Bit = PredVPBlock->getConditionBitRecipe()->getConditionBit();
				assert(Bit && "Cannot create conditional branch with empty bit.");
				BranchInst::Create(FirstSuccBB, NewBB, Bit, PredBB);
				}
				}
				return NewBB;
				}

				void VPBasicBlock::vectorize(VPTransformState *State) {
				VPIterationInstance *I = State->Instance;
				bool Replica = I && !(I->Part == 0 && I->Lane == 0);
				VPBasicBlock *PrevVPBB = State->CFG.PrevVPBB;
				VPBlockBase *SingleHPred = nullptr;
				BasicBlock *NewBB = State->CFG.PrevBB; // Reuse it if possible.

				// 1. Create an IR basic block, or reuse the last one if possible.
				// The last IR basic block is reused in three cases:
				// A. the first VPBB reuses the header BB - when PrevVPBB is null;
				// B. when the current VPBB has a single (hierarchical) predecessor which
				// is PrevVPBB and the latter has a single (hierarchical) successor; and
				// C. when the current VPBB is an entry of a region replica - where PrevVPBB
				// is the exit of this region from a previous instance.
				if (PrevVPBB && /* A */
				!((SingleHPred = getSingleHierarchicalPredecessor()) &&
				SingleHPred->getExitBasicBlock() == PrevVPBB &&
				PrevVPBB->getSingleHierarchicalSuccessor()) && /* B */
				!(Replica && getPredecessors().empty())) { /* C */

				NewBB = createEmptyBasicBlock(State->CFG);
				State->Builder.SetInsertPoint(NewBB);
				// Temporarily terminate with unreachable until CFG is rewired.
				UnreachableInst *Terminator = State->Builder.CreateUnreachable();
				State->Builder.SetInsertPoint(Terminator);
				// Register NewBB in its loop. In innermost loops it's the same for all BB's.
				Loop *L = State->LI->getLoopFor(State->CFG.LastBB);
				L->addBasicBlockToLoop(NewBB, *State->LI);
				State->CFG.PrevBB = NewBB;
				}

				// 2. Fill the IR basic block with IR instructions.
				DEBUG(dbgs() << "LV: vectorizing VPBB:" << getName()
				<< " in BB:" << NewBB->getName() << '\n');

				State->CFG.VPBB2IRBB[this] = NewBB;
				State->CFG.PrevVPBB = this;

				for (VPRecipeBase &Recipe : Recipes)
				Recipe.vectorize(*State);

				DEBUG(dbgs() << "LV: filled BB:" << *NewBB);
				}

				void VPRegionBlock::vectorize(VPTransformState *State) {
				ReversePostOrderTraversal<VPBlockBase *> RPOT(Entry);
				typedef typename std::vector<VPBlockBase *>::reverse_iterator rpo_iterator;

				if (!isReplicator()) {
				// Visit the VPBlocks connected to \p this, starting from it.
				for (rpo_iterator I = RPOT.begin(); I != RPOT.end(); ++I) {
				DEBUG(dbgs() << "LV: VPBlock in RPO " << (*I)->getName() << '\n');
				(*I)->vectorize(State);
				}
				return;
				}

				assert(!State->Instance &&
				"Replicating a Region only in null context instance.");
				VPIterationInstance I;
				State->Instance = &I;

				for (I.Part = 0; I.Part < State->UF; ++I.Part)
				for (I.Lane = 0; I.Lane < State->VF; ++I.Lane)
				// Visit the VPBlocks connected to \p this, starting from it.
				for (rpo_iterator I = RPOT.begin(); I != RPOT.end(); ++I) {
				DEBUG(dbgs() << "LV: VPBlock in RPO " << (*I)->getName() << '\n');
				(*I)->vectorize(State);
				}

				State->Instance = nullptr;
				}

				/// Generate the code inside the body of the vectorized loop. Assumes a single
				/// LoopVectorBody basic block was created for this; introduces additional
				/// basic blocks as needed, and fills them all.
				void VPlan::vectorize(VPTransformState *State) {
				BasicBlock *VectorPreHeaderBB = State->CFG.PrevBB;
				BasicBlock *VectorHeaderBB = VectorPreHeaderBB->getSingleSuccessor();
				assert(VectorHeaderBB && "Loop preheader does not have a single successor.");
				BasicBlock *VectorLatchBB = VectorHeaderBB;
				auto CurrIP = State->Builder.saveIP();

				// 1. Make room to generate basic blocks inside loop body if needed.
				VectorLatchBB = VectorHeaderBB->splitBasicBlock(
				VectorHeaderBB->getFirstInsertionPt(), "vector.body.latch");
				Loop *L = State->LI->getLoopFor(VectorHeaderBB);
				L->addBasicBlockToLoop(VectorLatchBB, *State->LI);
				// Remove the edge between Header and Latch to allow other connections.
				// Temporarily terminate with unreachable until CFG is rewired.
				// Note: this asserts xform code's assumption that getFirstInsertionPt()
				// can be dereferenced into an Instruction.
				VectorHeaderBB->getTerminator()->eraseFromParent();
				State->Builder.SetInsertPoint(VectorHeaderBB);
				UnreachableInst *Terminator = State->Builder.CreateUnreachable();
				State->Builder.SetInsertPoint(Terminator);

				// 2. Generate code in loop body of vectorized version.
				State->CFG.PrevVPBB = nullptr;
				State->CFG.PrevBB = VectorHeaderBB;
				State->CFG.LastBB = VectorLatchBB;

				for (VPBlockBase *CurrentBlock = Entry; CurrentBlock != nullptr;
				CurrentBlock = CurrentBlock->getSingleSuccessor()) {
				assert(CurrentBlock->getSuccessors().size() <= 1 &&
				"Multiple successors at top level.");
				CurrentBlock->vectorize(State);
				}

				// 3. Merge the temporary latch created with the last basic block filled.
				BasicBlock *LastBB = State->CFG.PrevBB;
				// Connect LastBB to VectorLatchBB to facilitate their merge.
				assert(isa<UnreachableInst>(LastBB->getTerminator()) &&
				"Expected VPlan CFG to terminate with unreachable");
				LastBB->getTerminator()->eraseFromParent();
				BranchInst::Create(VectorLatchBB, LastBB);

				// Merge LastBB with Latch.
				bool merged = MergeBlockIntoPredecessor(VectorLatchBB, nullptr, State->LI);
				assert(merged && "Could not merge last basic block with latch.");
				VectorLatchBB = LastBB;

				updateDominatorTree(State->DT, VectorPreHeaderBB, VectorLatchBB);
				State->Builder.restoreIP(CurrIP);
				}

				void VPlan::updateDominatorTree(DominatorTree *DT, BasicBlock *LoopPreHeaderBB,
				BasicBlock *LoopLatchBB) {
				BasicBlock *LoopHeaderBB = LoopPreHeaderBB->getSingleSuccessor();
				assert(LoopHeaderBB && "Loop preheader does not have a single successor.");
				DT->addNewBlock(LoopHeaderBB, LoopPreHeaderBB);
				// The vector body may be more than a single basic block by this point.
				// Update the dominator tree information inside the vector body by propagating
				// it from header to latch, expecting only triangular control-flow, if any.
				BasicBlock *PostDomSucc = nullptr;
				for (auto *BB = LoopHeaderBB; BB != LoopLatchBB; BB = PostDomSucc) {
				// Get the list of successors of this block.
				std::vector<BasicBlock *> Succs(succ_begin(BB), succ_end(BB));
				assert(Succs.size() <= 2 &&
				"Basic block in vector loop has more than 2 successors.");
				PostDomSucc = Succs[0];
				if (Succs.size() == 1) {
				assert(PostDomSucc->getSinglePredecessor() &&
				"PostDom successor has more than one predecessor.");
				DT->addNewBlock(PostDomSucc, BB);
				continue;
				}
				BasicBlock *InterimSucc = Succs[1];
				if (PostDomSucc->getSingleSuccessor() == InterimSucc) {
				PostDomSucc = Succs[1];
				InterimSucc = Succs[0];
				}
				assert(InterimSucc->getSingleSuccessor() == PostDomSucc &&
				"One successor of a basic block does not lead to the other.");
				assert(InterimSucc->getSinglePredecessor() &&
				"Interim successor has more than one predecessor.");
				assert(std::distance(pred_begin(PostDomSucc), pred_end(PostDomSucc)) == 2 &&
				"PostDom successor has more than two predecessors.");
				DT->addNewBlock(InterimSucc, BB);
				DT->addNewBlock(PostDomSucc, BB);
				}
				}
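
The header-to-latch walk in `VPlan::updateDominatorTree` above relies on the vector body containing only straight-line or triangular control flow. A minimal standalone sketch of that propagation, assuming a toy CFG keyed by block name, is given here. The `CFG` alias and the `singleSuccessor` and `immediateDominators` helpers are hypothetical, and the predecessor-count checks of the original are omitted:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy CFG: block name -> ordered list of successor names.
using CFG = std::map<std::string, std::vector<std::string>>;

// Returns the only successor of BB, or "" if BB does not have exactly one.
static std::string singleSuccessor(const CFG &G, const std::string &BB) {
  auto It = G.find(BB);
  return (It != G.end() && It->second.size() == 1) ? It->second[0] : "";
}

// Walks from Header to Latch and records the immediate dominator of every
// block reached, expecting at most a triangle at each step.
std::map<std::string, std::string>
immediateDominators(const CFG &G, const std::string &Header,
                    const std::string &Latch) {
  std::map<std::string, std::string> IDom;
  std::string BB = Header;
  while (BB != Latch) {
    const std::vector<std::string> &Succs = G.at(BB);
    assert(!Succs.empty() && Succs.size() <= 2 &&
           "Expected one or two successors.");
    std::string PostDomSucc = Succs[0];
    if (Succs.size() == 2) {
      std::string InterimSucc = Succs[1];
      // The interim block is the one whose single successor is the other.
      if (singleSuccessor(G, PostDomSucc) == InterimSucc)
        std::swap(PostDomSucc, InterimSucc);
      assert(singleSuccessor(G, InterimSucc) == PostDomSucc &&
             "One successor does not lead to the other.");
      IDom[InterimSucc] = BB;
    }
    // Both arms of a triangle are immediately dominated by its apex.
    IDom[PostDomSucc] = BB;
    BB = PostDomSucc;
  }
  return IDom;
}
```

For a body shaped `header -> {if, cont}`, `if -> cont`, `cont -> latch`, the sketch records `header` as the immediate dominator of both `if` and `cont`, and `cont` as that of `latch`.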

				const char *VPlanPrinter::getNodePrefix(const VPBlockBase *Block) {
				if (isa<VPBasicBlock>(Block))
				return "";
				assert(isa<VPRegionBlock>(Block) && "Unsupported kind of VPBlock.");
				return "cluster_";
				}

				const std::string &
				VPlanPrinter::getReplicatorString(const VPRegionBlock *Region) {
				static std::string ReplicatorString(DOT::EscapeString("<xVFxUF>"));
				static std::string NonReplicatorString(DOT::EscapeString("<x1>"));
				return Region->isReplicator() ? ReplicatorString : NonReplicatorString;
				}

				void VPlanPrinter::dump(const std::string &Title) {
				resetDepth();
				OS << "digraph VPlan {\n";
				OS << "graph [labelloc=t, fontsize=30; label=\"Vectorization Plan";
				if (!Title.empty())
				OS << "\\n" << DOT::EscapeString(Title);
				OS << "\"]\n";
				OS << "node [shape=record]\n";
				OS << "compound=true\n";

				for (const VPBlockBase *CurrentBlock = Plan.getEntry();
				CurrentBlock != nullptr;
				CurrentBlock = CurrentBlock->getSingleSuccessor())
				dumpBlock(CurrentBlock);

				OS << "}\n";
				}

				void VPlanPrinter::dumpBlock(const VPBlockBase *Block) {
				if (const VPBasicBlock *BasicBlock = dyn_cast<VPBasicBlock>(Block))
				dumpBasicBlock(BasicBlock);
				else if (const VPRegionBlock *Region = dyn_cast<VPRegionBlock>(Block))
				dumpRegion(Region);
				else
				llvm_unreachable("Unsupported kind of VPBlock.");
				}

				/// Print the information related to a CFG edge between two VPBlockBases.
				void VPlanPrinter::drawEdge(const VPBlockBase *From, const VPBlockBase *To,
				bool Hidden, const Twine &Label) {
				// Due to "dot" we print an edge between two regions as an edge between the
				// exit basic block and the entry basic block of the respective regions.
				const VPBlockBase *Tail = From->getExitBasicBlock();
				const VPBlockBase *Head = To->getEntryBasicBlock();
				OS << Indent << getNodePrefix(Tail) << DOT::EscapeString(Tail->getName())
				<< " -> " << getNodePrefix(Head) << DOT::EscapeString(Head->getName());
				OS << " [ label=\"" << Label << '\"';
				if (Tail != From)
				OS << " ltail=" << getNodePrefix(From)
				<< DOT::EscapeString(From->getName());
				if (Head != To)
				OS << " lhead=" << getNodePrefix(To) << DOT::EscapeString(To->getName());
				if (Hidden)
				OS << "; splines=none";
				OS << "]\n";
				}

				/// Print the information related to the CFG edges going out of a given
				/// \p Block, followed by printing the successor blocks themselves.
				void VPlanPrinter::dumpEdges(const VPBlockBase *Block) {
				std::string Cond = "";
				if (auto *ConditionBitRecipe = Block->getConditionBitRecipe())
				Cond = ConditionBitRecipe->getName().str();
				unsigned SuccessorNumber = 1;
				for (auto *Successor : Block->getSuccessors()) {
				drawEdge(Block, Successor, false,
				Twine() + (SuccessorNumber == 2 ? "!" : "") + Twine(Cond));
				++SuccessorNumber;
				}
				}

				/// Print a VPBasicBlock, including its VPRecipes, followed by printing its
				/// successor blocks.
				void VPlanPrinter::dumpBasicBlock(const VPBasicBlock *BasicBlock) {
				std::string Indent(Depth * TabLength, ' ');
				OS << Indent << getNodePrefix(BasicBlock)
				<< DOT::EscapeString(BasicBlock->getName()) << " [label = \"{"
				<< DOT::EscapeString(BasicBlock->getName());

				for (const VPRecipeBase &Recipe : BasicBlock->getRecipes()) {
				OS << " | ";
				std::string RecipeString;
				raw_string_ostream RSO(RecipeString);
				Recipe.print(RSO);
				OS << DOT::EscapeString(RSO.str());
				}

				OS << "}\"]\n";
				dumpEdges(BasicBlock);
				}

				/// Print a given \p Region of the VPlan.
				void VPlanPrinter::dumpRegion(const VPRegionBlock *Region) {
				OS << Indent << "subgraph " << getNodePrefix(Region)
				<< DOT::EscapeString(Region->getName()) << " {\n";
				increaseDepth();
				OS << Indent;
				OS << "label = \"" << getReplicatorString(Region) << " "
				<< DOT::EscapeString(Region->getName()) << "\"\n\n";

				// Dump the blocks of the region.
				assert(Region->getEntry() && "Region contains no inner blocks.");

				for (const VPBlockBase *Block : depth_first(Region->getEntry()))
				dumpBlock(Block);

				decreaseDepth();
				OS << Indent << "}\n";
				dumpEdges(Region);
				}
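
The printer above emits each VPBasicBlock as a Graphviz record node whose fields are the block name followed by one field per recipe, separated by `|`. A minimal standalone sketch of that node statement follows. The helpers `escapeRecord` and `recordNode` are hypothetical stand-ins rather than the LLVM `DOT::EscapeString` implementation, and the node name is assumed to be a plain identifier that needs no escaping:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical escaping helper: prefixes characters that are special inside
// a DOT record label (or that would end the quoted label) with a backslash.
std::string escapeRecord(const std::string &S) {
  std::string Out;
  for (char C : S) {
    if (C == '{' || C == '}' || C == '<' || C == '>' || C == '|' ||
        C == '"' || C == '\\')
      Out += '\\';
    Out += C;
  }
  return Out;
}

// Builds one node statement: the block name, then one record field per recipe.
std::string recordNode(const std::string &Name,
                       const std::vector<std::string> &Recipes) {
  assert(!Name.empty() && "A node needs a name.");
  std::string Out = Name + " [label = \"{" + escapeRecord(Name);
  for (const std::string &Recipe : Recipes)
    Out += " | " + escapeRecord(Recipe);
  Out += "}\"]\n";
  return Out;
}
```

Escaping each field keeps a recipe that itself contains `|` or `<` from being misread as an extra field or a port.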

test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll

	[... 9 lines not shown ...]
	; user-specified vectorization factor when interleaving is disabled. We use the			; user-specified vectorization factor when interleaving is disabled. We use the
	; "optsize" attribute to disable all interleaving calculations.			; "optsize" attribute to disable all interleaving calculations.
	;			;
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK: %wide.load = load <2 x i64>, <2 x i64>* {{.*}}, align 4			; CHECK: %wide.load = load <2 x i64>, <2 x i64>* {{.*}}, align 4
	; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]			; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
	; CHECK: [[IF0]]:			; CHECK: [[IF0]]:
	; CHECK: %[[T00:.+]] = extractelement <2 x i64> %wide.load, i32 0			; CHECK: %[[T00:.+]] = extractelement <2 x i64> %wide.load, i32 0
	; CHECK: %[[T01:.+]] = extractelement <2 x i64> %wide.load, i32 0			; CHECK: %[[T01:.+]] = add nsw i64 %[[T00]], %x
	; CHECK: %[[T02:.+]] = add nsw i64 %[[T01]], %x			; CHECK: %[[T02:.+]] = extractelement <2 x i64> %wide.load, i32 0
	; CHECK: %[[T03:.+]] = udiv i64 %[[T00]], %[[T02]]			; CHECK: %[[T03:.+]] = udiv i64 %[[T02]], %[[T01]]
	; CHECK: %[[T04:.+]] = insertelement <2 x i64> undef, i64 %[[T03]], i32 0			; CHECK: %[[T04:.+]] = insertelement <2 x i64> undef, i64 %[[T03]], i32 0
	; CHECK: br label %[[CONT0]]			; CHECK: br label %[[CONT0]]
	; CHECK: [[CONT0]]:			; CHECK: [[CONT0]]:
	; CHECK: %[[T05:.+]] = phi <2 x i64> [ undef, %vector.body ], [ %[[T04]], %[[IF0]] ]			; CHECK: %[[T05:.+]] = phi <2 x i64> [ undef, %vector.body ], [ %[[T04]], %[[IF0]] ]
	; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]			; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
	; CHECK: [[IF1]]:			; CHECK: [[IF1]]:
	; CHECK: %[[T06:.+]] = extractelement <2 x i64> %wide.load, i32 1			; CHECK: %[[T06:.+]] = extractelement <2 x i64> %wide.load, i32 1
	; CHECK: %[[T07:.+]] = extractelement <2 x i64> %wide.load, i32 1			; CHECK: %[[T07:.+]] = add nsw i64 %[[T06]], %x
	; CHECK: %[[T08:.+]] = add nsw i64 %[[T07]], %x			; CHECK: %[[T08:.+]] = extractelement <2 x i64> %wide.load, i32 1
	; CHECK: %[[T09:.+]] = udiv i64 %[[T06]], %[[T08]]			; CHECK: %[[T09:.+]] = udiv i64 %[[T08]], %[[T07]]
	; CHECK: %[[T10:.+]] = insertelement <2 x i64> %[[T05]], i64 %[[T09]], i32 1			; CHECK: %[[T10:.+]] = insertelement <2 x i64> %[[T05]], i64 %[[T09]], i32 1
	; CHECK: br label %[[CONT1]]			; CHECK: br label %[[CONT1]]
	; CHECK: [[CONT1]]:			; CHECK: [[CONT1]]:
	; CHECK: phi <2 x i64> [ %[[T05]], %[[CONT0]] ], [ %[[T10]], %[[IF1]] ]			; CHECK: phi <2 x i64> [ %[[T05]], %[[CONT0]] ], [ %[[T10]], %[[IF1]] ]
	; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body			; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

	define i64 @predicated_udiv_scalarized_operand(i64* %a, i1 %c, i64 %x) optsize {			define i64 @predicated_udiv_scalarized_operand(i64* %a, i1 %c, i64 %x) optsize {
	entry:			entry:
	[... 25 lines not shown ...]

test/Transforms/LoopVectorize/AArch64/predication_costs.ll

	[... 12 lines not shown ...]
	;			;
	; This test checks that we correctly compute the cost of the predicated udiv			; This test checks that we correctly compute the cost of the predicated udiv
	; instruction. If we assume the block probability is 50%, we compute the cost			; instruction. If we assume the block probability is 50%, we compute the cost
	; as:			; as:
	;			;
	; Cost of udiv:			; Cost of udiv:
	; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5			; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
	;			;
	; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
	; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3			; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
				; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
	;			;
	define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {			define i32 @predicated_udiv(i32* %a, i32* %b, i1 %c, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	%r = phi i32 [ 0, %entry ], [ %tmp6, %for.inc ]			%r = phi i32 [ 0, %entry ], [ %tmp6, %for.inc ]
	[... 23 lines not shown ...]
	;			;
	; This test checks that we correctly compute the cost of the predicated store			; This test checks that we correctly compute the cost of the predicated store
	; instruction. If we assume the block probability is 50%, we compute the cost			; instruction. If we assume the block probability is 50%, we compute the cost
	; as:			; as:
	;			;
	; Cost of store:			; Cost of store:
	; (store(4) + extractelement(3)) / 2 = 3			; (store(4) + extractelement(3)) / 2 = 3
	;			;
	; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
	; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
				; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
	;			;
	define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {			define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i			%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
	[... 21 lines not shown ...]
	; inside the predicated block. If we assume the block probability is 50%, we			; inside the predicated block. If we assume the block probability is 50%, we
	; compute the cost as:			; compute the cost as:
	;			;
	; Cost of add:			; Cost of add:
	; (add(2) + extractelement(3)) / 2 = 2			; (add(2) + extractelement(3)) / 2 = 2
	; Cost of udiv:			; Cost of udiv:
	; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4			; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4
	;			;
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
	; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
	; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x			; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
	; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3			; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
				; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
	;			;
	define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {			define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	%r = phi i32 [ 0, %entry ], [ %tmp6, %for.inc ]			%r = phi i32 [ 0, %entry ], [ %tmp6, %for.inc ]
	[... 25 lines not shown ...]
	; inside the predicated block. If we assume the block probability is 50%, we			; inside the predicated block. If we assume the block probability is 50%, we
	; compute the cost as:			; compute the cost as:
	;			;
	; Cost of add:			; Cost of add:
	; (add(2) + extractelement(3)) / 2 = 2			; (add(2) + extractelement(3)) / 2 = 2
	; Cost of store:			; Cost of store:
	; store(4) / 2 = 2			; store(4) / 2 = 2
	;			;
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
	; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x			; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x
	; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
	;			;
	define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {			define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i			%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
	[... 29 lines not shown ...]
	; (sdiv(2) + extractelement(6) + insertelement(3)) / 2 = 5			; (sdiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
	; Cost of udiv:			; Cost of udiv:
	; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5			; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
	; Cost of sub:			; Cost of sub:
	; (sub(2) + extractelement(3)) / 2 = 2			; (sub(2) + extractelement(3)) / 2 = 2
	; Cost of store:			; Cost of store:
	; store(4) / 2 = 2			; store(4) / 2 = 2
	;			;
	; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
	; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
	; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
	; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
	; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x			; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x
	; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2			; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2
	; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2			; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
	; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x			; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
	; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4			; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4
				; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
				; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
				; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
	;			;
	define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {			define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i			%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
	[... 19 lines not shown ...]
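
The cost comments in this test all follow one pattern: add the cost of the scalarized instruction to its extractelement/insertelement overhead, then divide by 2 for the assumed 50% block probability, using integer division. A minimal standalone sketch of that arithmetic follows; the function `predicatedCost` and its parameters are illustrative assumptions, not part of the LLVM cost-model API:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: sums the listed component costs and divides by the
// reciprocal block probability (2 models a block assumed to execute 50% of
// the time). Integer division truncates, as in the test comments.
unsigned predicatedCost(const std::vector<unsigned> &Components,
                        unsigned ReciprocalBlockProb = 2) {
  assert(ReciprocalBlockProb != 0 && "Probability divisor must be non-zero.");
  unsigned Sum = std::accumulate(Components.begin(), Components.end(), 0u);
  return Sum / ReciprocalBlockProb;
}
```

For example, `predicatedCost({2, 6, 3})` reproduces the udiv case, (2 + 6 + 3) / 2 = 5, and `predicatedCost({4, 3})` reproduces the store case, (4 + 3) / 2 = 3.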

test/Transforms/LoopVectorize/if-pred-non-void.ll

[... 213 lines not shown ...]	entry:
br label %for.body		br label %for.body

; CHECK-LABEL: predicated_udiv_scalarized_operand		; CHECK-LABEL: predicated_udiv_scalarized_operand
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK: %wide.load = load <2 x i32>, <2 x i32>* {{.*}}, align 4		; CHECK: %wide.load = load <2 x i32>, <2 x i32>* {{.*}}, align 4
; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]		; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
; CHECK: [[IF0]]:		; CHECK: [[IF0]]:
; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0		; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0
; CHECK: %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0		; CHECK: %[[T01:.+]] = add nsw i32 %[[T00]], %x
; CHECK: %[[T02:.+]] = add nsw i32 %[[T01]], %x		; CHECK: %[[T02:.+]] = extractelement <2 x i32> %wide.load, i32 0
; CHECK: %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]]		; CHECK: %[[T03:.+]] = udiv i32 %[[T02]], %[[T01]]
; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0		; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0
; CHECK: br label %[[CONT0]]		; CHECK: br label %[[CONT0]]
; CHECK: [[CONT0]]:		; CHECK: [[CONT0]]:
; CHECK: %[[T05:.+]] = phi <2 x i32> [ undef, %vector.body ], [ %[[T04]], %[[IF0]] ]		; CHECK: %[[T05:.+]] = phi <2 x i32> [ undef, %vector.body ], [ %[[T04]], %[[IF0]] ]
; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]		; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
; CHECK: [[IF1]]:		; CHECK: [[IF1]]:
; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1		; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1
; CHECK: %[[T07:.+]] = extractelement <2 x i32> %wide.load, i32 1		; CHECK: %[[T07:.+]] = add nsw i32 %[[T06]], %x
; CHECK: %[[T08:.+]] = add nsw i32 %[[T07]], %x		; CHECK: %[[T08:.+]] = extractelement <2 x i32> %wide.load, i32 1
; CHECK: %[[T09:.+]] = udiv i32 %[[T06]], %[[T08]]		; CHECK: %[[T09:.+]] = udiv i32 %[[T08]], %[[T07]]
; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1		; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1
; CHECK: br label %[[CONT1]]		; CHECK: br label %[[CONT1]]
; CHECK: [[CONT1]]:		; CHECK: [[CONT1]]:
; CHECK: phi <2 x i32> [ %[[T05]], %[[CONT0]] ], [ %[[T10]], %[[IF1]] ]		; CHECK: phi <2 x i32> [ %[[T05]], %[[CONT0]] ], [ %[[T10]], %[[IF1]] ]
; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body		; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

for.body:		for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]		%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
[... 21 lines not shown ...]

test/Transforms/LoopVectorize/induction.ll

	[... 303 lines not shown ...]
	; int x = a[i];			; int x = a[i];
	; if (c)			; if (c)
	; x /= i;			; x /= i;
	; sum += x;			; sum += x;
	; }			; }
	;			;
	; CHECK-LABEL: @scalarize_induction_variable_05(			; CHECK-LABEL: @scalarize_induction_variable_05(
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]			; CHECK: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
	; CHECK: %[[I0:.+]] = add i32 %index, 0			; CHECK: %[[I0:.+]] = add i32 %index, 0
	; CHECK: getelementptr inbounds i32, i32* %a, i32 %[[I0]]			; CHECK: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
	; CHECK: pred.udiv.if:			; CHECK: pred.udiv.if:
	; CHECK: udiv i32 {{.*}}, %[[I0]]			; CHECK: udiv i32 {{.*}}, %[[I0]]
	; CHECK: pred.udiv.if1:			; CHECK: pred.udiv.if3:
	; CHECK: %[[I1:.+]] = add i32 %index, 1			; CHECK: %[[I1:.+]] = add i32 %index, 1
	; CHECK: udiv i32 {{.*}}, %[[I1]]			; CHECK: udiv i32 {{.*}}, %[[I1]]
	;			;
	; UNROLL-NO_IC-LABEL: @scalarize_induction_variable_05(			; UNROLL-NO_IC-LABEL: @scalarize_induction_variable_05(
	; UNROLL-NO-IC: vector.body:			; UNROLL-NO-IC: vector.body:
	; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]			; UNROLL-NO-IC: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
	; UNROLL-NO-IC: %[[I0:.+]] = add i32 %index, 0			; UNROLL-NO-IC: %[[I0:.+]] = add i32 %index, 0
	; UNROLL-NO-IC: %[[I2:.+]] = add i32 %index, 2			; UNROLL-NO-IC: %[[I2:.+]] = add i32 %index, 2
	; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I0]]			; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I0]]
	; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I2]]			; UNROLL-NO-IC: getelementptr inbounds i32, i32* %a, i32 %[[I2]]
	; UNROLL-NO-IC: pred.udiv.if:			; UNROLL-NO-IC: pred.udiv.if:
	; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I0]]			; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I0]]
	; UNROLL-NO-IC: pred.udiv.if6:			; UNROLL-NO-IC: pred.udiv.if6:
	; UNROLL-NO-IC: %[[I1:.+]] = add i32 %index, 1			; UNROLL-NO-IC: %[[I1:.+]] = add i32 %index, 1
	; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I1]]			; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I1]]
	; UNROLL-NO-IC: pred.udiv.if8:			; UNROLL-NO-IC: pred.udiv.if9:
	; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I2]]			; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I2]]
	; UNROLL-NO-IC: pred.udiv.if10:			; UNROLL-NO-IC: pred.udiv.if12:
	; UNROLL-NO-IC: %[[I3:.+]] = add i32 %index, 3			; UNROLL-NO-IC: %[[I3:.+]] = add i32 %index, 3
	; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I3]]			; UNROLL-NO-IC: udiv i32 {{.*}}, %[[I3]]
	;			;
	; IND-LABEL: @scalarize_induction_variable_05(			; IND-LABEL: @scalarize_induction_variable_05(
	; IND: vector.body:			; IND: vector.body:
	; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue2 ]			; IND: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue4 ]
	; IND: %[[E0:.+]] = sext i32 %index to i64			; IND: %[[E0:.+]] = sext i32 %index to i64
	; IND: getelementptr inbounds i32, i32* %a, i64 %[[E0]]			; IND: getelementptr inbounds i32, i32* %a, i64 %[[E0]]
	; IND: pred.udiv.if:			; IND: pred.udiv.if:
	; IND: udiv i32 {{.*}}, %index			; IND: udiv i32 {{.*}}, %index
	; IND: pred.udiv.if1:			; IND: pred.udiv.if3:
	; IND: %[[I1:.+]] = or i32 %index, 1			; IND: %[[I1:.+]] = or i32 %index, 1
	; IND: udiv i32 {{.*}}, %[[I1]]			; IND: udiv i32 {{.*}}, %[[I1]]
	;			;
	; UNROLL-LABEL: @scalarize_induction_variable_05(			; UNROLL-LABEL: @scalarize_induction_variable_05(
	; UNROLL: vector.body:			; UNROLL: vector.body:
	; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue11 ]			; UNROLL: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %pred.udiv.continue13 ]
	; UNROLL: %[[I2:.+]] = or i32 %index, 2			; UNROLL: %[[I2:.+]] = or i32 %index, 2
	; UNROLL: %[[E0:.+]] = sext i32 %index to i64			; UNROLL: %[[E0:.+]] = sext i32 %index to i64
	; UNROLL: %[[G0:.+]] = getelementptr inbounds i32, i32* %a, i64 %[[E0]]			; UNROLL: %[[G0:.+]] = getelementptr inbounds i32, i32* %a, i64 %[[E0]]
	; UNROLL: getelementptr i32, i32* %[[G0]], i64 2			; UNROLL: getelementptr i32, i32* %[[G0]], i64 2
	; UNROLL: pred.udiv.if:			; UNROLL: pred.udiv.if:
	; UNROLL: udiv i32 {{.*}}, %index			; UNROLL: udiv i32 {{.*}}, %index
	; UNROLL: pred.udiv.if6:			; UNROLL: pred.udiv.if6:
	; UNROLL: %[[I1:.+]] = or i32 %index, 1			; UNROLL: %[[I1:.+]] = or i32 %index, 1
	; UNROLL: udiv i32 {{.*}}, %[[I1]]			; UNROLL: udiv i32 {{.*}}, %[[I1]]
	; UNROLL: pred.udiv.if8:			; UNROLL: pred.udiv.if9:
	; UNROLL: udiv i32 {{.*}}, %[[I2]]			; UNROLL: udiv i32 {{.*}}, %[[I2]]
	; UNROLL: pred.udiv.if10:			; UNROLL: pred.udiv.if12:
	; UNROLL: %[[I3:.+]] = or i32 %index, 3			; UNROLL: %[[I3:.+]] = or i32 %index, 3
	; UNROLL: udiv i32 {{.*}}, %[[I3]]			; UNROLL: udiv i32 {{.*}}, %[[I3]]

	define i32 @scalarize_induction_variable_05(i32* %a, i32 %x, i1 %c, i32 %n) {			define i32 @scalarize_induction_variable_05(i32* %a, i32 %x, i1 %c, i32 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	[... 434 lines not shown ...]

Diff 90376