Page MenuHomePhabricator

[LV] Introducing VPlan to model the vectorized code and drive its transformation
AbandonedPublic

Authored by Ayal on Jan 20 2017, 5:17 PM.

Details

Summary

This patch follows our RFC[1] and presentation at the Dev Meeting[2]. Namely, it starts to address the proposal stated there:

Proposal: introduce the Vectorization Plan as an explicit model of a vectorization candidate and update the overall flow

according to the first step expressed:

The first patches we're working on are designed to have the innermost Loop Vectorizer explicitly model the control flow of its vectorized loop.

This implementation is designed to show key aspects of the VPlan model, demonstrating how it can capture precisely *all* vectorization decisions taken inside a to-be vectorized loop by the current Loop Vectorizer, and carry them out. It is therefore practically an NFC patch, with slight disclaimers listed below. The VPlan model implemented strives to be compact , addressing compile-time concerns. More technical details are documented in the rst file attached. The patch can be broken down into several hunks for incremental landing; a tentative break-down list is provided below.

Thanks to the Intel vectorization team for this joint effort,
Gil and Ayal.

Deviation from current functionality:

  • Debug printout of “LV: Scalarizing [and predicating]: <inst>” – VPlan carries out these decisions before Cost-Model’s printouts, unlike current behavior.
  • Placement of extracts moved to basic-block of users: at start of this basic-block vs. before first user; the distinct orders should be insignificant, subject to scheduling.
  • Redundant basic-blocks/phi’s – these are insignificant, subject to subsequent clean-up.

Tentative break-down; some tasks refactor or fix current LV, some introduce parts of VPlan:

  • refactor Cost-Model to provide MaxVF and early-exit methods.
  • refactor ILV to provide vectorizeInstruction, getScalarValue, getVectorValue, widenIntInduction, buildScalarSteps, PHIsToFix/fixCrossIterationPHIs, and possibly additional methods.
  • fix Unroller’s getScalarValue() to reuse ILV’s refactored getScalarValue(Part, Lane) which also sets metadata. Will simplify this patch.
  • unify the GEP reuse behavior between a vectorized wide load/store, and the wide load/store of an interleave group. Will simplify this patch.
  • have LV avoid creating redundant basic-blocks. Will help this patch be fully NFC.
  • have LV cache basic-block masks and reuse them. Will help this patch be fully NFC.
  • build initial VPlans and print them for debugging
  • convert ILV.vectorize to use LVP.executeBestPlan, keeping sinkScalarOperands() as a non-VPlan post-processing method.
  • optimize VPlans by introducing sinkScalarOperands() and print them for debugging
  • use VPlan’s sinkScalarOperands() instead of the non-VPlan version

[1] RFC
[2] Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer loop auto-vectorization, 2016 LLVM Developers' Meeting

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
rengolin added inline comments.Jan 27 2017, 4:01 AM
docs/VectorizationPlan.rst
241

I believe this is not possible, but that's ok.

What I think you mean is that we can start with a single VPlan that does exactly what the current cost model does, in which case, it will have no noticeable impact to the users (this is not the same as NFC, as Michael said). But just doing that is just replacing six of one by half-a-dozen of the other.

This is only worth doing if we can add more (different) VPlans, in which case, there will be a noticeable impact in higher optimisation levels.

But again, that's ok. We currently change the behaviour of the vectoriser depending on the -O flag and we can continue doing it. -O2 means one VPlan, -O3 means more. We could even break the invisible 3 barrier and add an actual 4 which is exhaust all options.

My point is: this cannot be a design guideline.

245

Aligning cost and codegen will always have to rely on heuristics, even if we codegen multiple variations, as this is modelling IR, not assembly.

We should respect the cost analysis and we should strive to do the best we can, yes, but we shouldn't try to estimate the generated code accurately over other improvements.

261

This is the holly grail, where a lot of attention to detail and benchmarking will have to be done.

296

If this is related to detecting in-loop disjoint regions, wouldn't it be easier to do that transformation before the vectoriser and generate two loops, so that they're always SESE or not worth looking?

308

This makes sense. And VRecipes consult the target information based on instructions and sizes.

It should be straightforward to have cross-target VRecipes just based on isLegal() and friends, which would prune all unsupported functionality really early.

314

This is less clear. I understand how one VRecipe that knows about interleave access to worry about this, but I don't understand how we're going to tell the others that have no such concept to not ignore it and calculate the costs correctly.

Our current interleave cost calculations may not be enough to bar unrelated changes.

319

I'd advise against doing code modifications in an already vectorised block. You may be able to get an extra oomph, but legality analysis might have to be done all over again, which can significantly add costs to an already expensive process.

Not to mention tht the complexity of the task will likely leak illegal transformations through.

410

This looks mostly ok...

415

What do you mean by vectorize()? Is that legal+cost+transform?

To avoid paying cost analysis and transformation costs up-front, wouldn't it make more sense to do each step for all plans as a group?

Assuming VPlans may expose different legality issues, you should:

vec<VPlan> TODO;
for (auto P : Plans) {
  if (P->isLegal())
    TODO.push_back(P);
}

Then, compute the costs first, and only vectorize() the assumed cheapest:

unsigned MinCost = ScalarCost;
VPlan Best;
for (auto P : TODO) {
  unsigned Cost = P->cost();
  if(MinCost > Cost) {
    Best = P;
    MinCost = Cost;
  }
}
Best->transform();
426

Couldn't different VPlans expose different legality issues? For example, for different combination of UFs and VFs?

473

Right, so this is my "cost loop" above.

516

And this is my Best.transform() above. Looks good.

557

This is just analysis, right?

mkuper added inline comments.Jan 27 2017, 10:52 AM
docs/VectorizationPlan.rst
141

All of that is correct, I didn't mean to imply we should be handling outer loops *now*. :-)
I was just confused about the "raising the importance" part.

245

I think "codegen" here means "the IR generated by the vectorizer", not "backend code generation". See line 123.
Actually getting the cost model to accurately estimate the cost of complex IR constructs is an orthogonal problem.

426

I think this is why the design says that Legal would "encode constraints". I guess the LVP would have to consult Legal per-potential-VPlan to see whether it's feasible or not?

rengolin added inline comments.Jan 30 2017, 9:28 AM
docs/VectorizationPlan.rst
245

Right, but costs are associated with what assembly instructions will be generated by those IR instructions. So we associate zero cost to a lot of shuffles because we know they'll be merged into the load or add. But the same shuffles, in a different pattern, will demand a sequence of insert/extract instructions that will cost a lot more than zero.

This is my point, that accuracy is not possible at this level and that, as you say, it's an orthogonal problem to solve.

Maybe the word we're looking for is "reliably"?

426

Right, this is what I didn't understand. We can do it both ways: one legal consulting all plans, or each plan consulting legal.

I'd prefer we act on plans for everything, as it would be a cleaner concept and an easier move.

As I wrote in another comment: first we iterate through all plans and discard all illegals, then we calculate the costs, sort and pick the best.

We could even (in the far future) have multiple costs per plan (VF, UF, code-size, hazards, etc) and sort by a formula that takes all of them into account.

Ayal added inline comments.Feb 1 2017, 7:23 AM
docs/VectorizationPlan.rst
11

Agreed, patch is more than "a pure refactoring/cleanup". We meant to say "no change in output intended".

28

That could be one way to think of it. The aim is to record each decision once, and have both cost estimate and transformation stem from the same recording.

37

Right; Legal should be refactored as well to reflect its true Legality nature, by adjacent patches.

61

They are the same if "generates VPlans" is understood as an iterative process, which includes taking already generated VPlans and modifying them to produce new candidates for consideration.

112

Right, a constraint can be, e.g., an upper bound on the VF.
Artifacts are pieces of information collected by Legal which are useful for the vectorization process. An artifact can be, e.g., the set of reduction phi's; Legal has to identify them to make sure they are all vectorizable.
The distinction may be somewhat vague; e.g., Legal has to make sure runtime guards can be constructed for possible memory dependencies.

Another way of putting this, w/o constraints nor artifacts: Legal can be made responsible for either determining that a loop cannot be vectorizable, or else for providing initial VPlan(s) as constructive proof showing how it can be vectorized, albeit sub-optimally.

132

An example of optimizing an LV VPlan is LoopVectorizationPlanner::optimizePredicatedInstructions(). Another should be the recognition of interleave groups.

The reference to TSLP is by analogy, which you've summarized quite well. The point was that there could be various candidates worth considering, in this case all with the same VF. Furthermore, constructing TSLP subtrees is somewhat analogous to LV's isProfitableToScalarize().

(BTW, you can also watch the TSLP video recording ;-)

141

Comment tries to say that it'll be harder to continue and make "right" decisions as the scope of vectorization increases to outerloops, with the proposed design (or any other). Explicit vectorization tries to mitigate this expected hardship.

241

The guideline requires VPlan to be designed such that it can support any decision taken by current LV; but of-course not be limited to it.

245

Right, sorry for the confusion. "CodeGen" indeed means here "the IR code generated by the vectorizer".

252

The original thought of a Recipe should also match an SLP 'bundle'. An interleave-group recipe for instance takes care of a group of related loads or stores, albeit doing more than stitching them together to produce a single vector load or store. Modeling SLP requires more attention.

272

Right.

282

Right.

290

How so?

296

Two VPRegions are disjoint (or else contain one another) in the sense that they share no common VP(Basic)Block. Say we have two if-the-else hammocks one after the other, and we wish to wrap each in a VPRegion. To avoid having the Exit of the first be the Entry of the second, it should be split somehow, possibly resulting in an empty block.

308

VPRecipes should indeed be cross-target. If something in the loop is illegal we avoid building any VPlans for it in the first place, as a form of early pruning. OTOH, it should conceptually be possible, and arguably profitable, to scalarize any instruction.

Ayal updated this revision to Diff 86642.Feb 1 2017, 8:00 AM

Removed png's, made a few cleanups and minor fixes to the code.

Ayal added inline comments.Feb 1 2017, 8:02 AM
docs/VectorizationPlan.rst
288

The sentence above meant to emphasize that VPBasicBlocks are not necessarily maximal. A VPBasicBlock can have a single successor VPBasicBlock, which in turn has a single predecessor. Eventually both will contribute their instructions into a common IR basic block of the vectorized version.

bader added a subscriber: bader.Feb 9 2017, 7:32 AM
Ayal added inline comments.Feb 13 2017, 4:50 PM
docs/VectorizationPlan.rst
245

Yes, "reliably" or "consistently" better describe the intent here. The "accuracy" still depends on TTI. Fix will appear in next upload.

314

Another way to view this: each instruction that will eventually appear in the vectorized loop will be generated by a VPRecipe. This recipe also takes care of estimating its cost. It is possible for several such instructions to be generated by one common VPRecipe, as in the case of an interleave group recipe.

319

The loop vectorizer already moves code effectively, when it forms interleave groups and sinks scalar operands. There may be additional opportunities during vectorization that involve code modification. These admittedly must be done carefully, but potentially impact performance significantly (by more than an extra oomph) and so ought to be represented and estimated reliably.

388

For instances: where to generate the next instruction(s), by passing an IRBuilder. When generating scalarized predicated instructions, passing which lane to generate scalar instance for.

415

By "vectorize()" we mean "transform". The latter verb can be used instead if clearer?

Yes, we only call "vectorize()" / "transform()" on Best VPlan, after finding it according to relative costs, as suggested above.

Currently VPlans are built only after passing all "isLegal()" checks. So every VPlan is legal by construction.

426

The current design performs all Legal checks before starting to work with VPlans, retaining the separation between correctness and performance considerations. This could be revisited if desired.

If Legal determines that a loop cannot be vectorized at all, no VPlans are built.

If Legal determines that a loop can be vectorized but only under certain restrictions, all VPlans built will comply with all these restrictions, and offer distinct alternatives how to vectorize. Current restriction may be: VF small enough, pass runtime guards regarding potential aliasing and non-unit strides, handle cross-iteration dependencies such as reductions, 1st order recurrences, live-outs.

How to calculate the cost(s) of each VPlan and find the best one is subject to a separate, follow-up patch.

473

Right, LVP.plan takes care of building the relevant VPlans and running the "cost loop", returning the best VF. This "cost loop" currently still uses the existing CostModel, but is planned to transition to use VPlans in the aforementioned follow-up patch.

486

VPlan already addresses both UF and VF in this patch, in the sense that when instructed to do so, by calling "vectorize()", it will generate vector code based on both. Determining the best VF and the best UF is still done based on existing CostModel in this patch, by first finding the best VF and then finding the best UF. Follow-up patch is devoted to having both these decisions be based on VPlans as well.

516

Right, this will perform the actual transformation. :-)

557

Right, no IR has been transformed nor constructed yet.

spatel added a subscriber: spatel.Feb 15 2017, 6:03 AM
oren_ben_simhon added inline comments.
lib/Transforms/Vectorize/VPlan.cpp
11

Please follow LLVM guidelines for file documentation:
http://llvm.org/docs/CodingStandards.html#file-headers

lib/Transforms/Vectorize/VPlan.h
2

This file is very big and will probably gain some more weight in time. I think that some utility classes could be moved to a separate file.

69

IMHO, VPlanUtils should not be a friend class, because you don't access any of its private/protected members. Instead use VPlanUtils statically.

73

use ///< for post member documentation.

76

I didn't follow the whole logic but are you sure that you need recipe parent?
Also i think that "owner" is a better descriptive of the member than parent.

192

The function should be const.

311

I know that the initial size of SmallVectors has minimum effect, still i believe that the default size of the predecessors/Successors should be one.

316

The \brief will have no effect because anyway the comment until the first dot will be considered as brief by Doxygen.

515

No need for the brief command.

523

The function receives a parameter that it doesn't use.
Also it is static although it returns a non-static class member.

593

Please consider using a single list/vector instead of two members Entry/Exit.

715

The plan is not used anywhere in the class.

718

This function member is not referenced anywhere.

721

This class should be a singleton class that should not be initialized. As such the constructor should be private.

724

All functions that do not relate to a non-static member, should be static and const.

796

I think that the function name should be more descriptive to reflect the nature of the function, for example "setConditionalSuccessors".

861

Maybe use a doxygen style comment to all class documentations.

jonpa added a subscriber: jonpa.Feb 22 2017, 10:55 PM

Hi,

I have been working with improving cost estimates in the LoopVectorizer a bit. I wonder if VPlan will improve cost estimation for different VFs, including 1?

One issue currently that I don't know yet quite how to tackle, is that two scalarized instructions (def->use), have too big of scalarization costs computed, since the inserts from the first, and the extracts of the second will be optimized away. There are also several other minor issues I have found so far. I guess I should probably wait for VPlan to arrive before trying anything... Is that going to be anytime soon?

Looking forward to this

/Jonas

I have been working with improving cost estimates in the LoopVectorizer a bit. I wonder if VPlan will improve cost estimation for different VFs, including 1?

Hi Jonas,

AFAICT, VPlan is completely orthogonal to the cost computation, ie. the exact same cost functions will be used (including 1).

One issue currently that I don't know yet quite how to tackle, is that two scalarized instructions (def->use), have too big of scalarization costs computed, since the inserts from the first, and the extracts of the second will be optimized away. There are also several other minor issues I have found so far.

This is a known issue with any heuristics based approach, which is the case here. The high costs (for specific architectures) are expecting shuffles to be generated, but on other architectures you get for free. Overriding the cost per arch (or sub-arch) helps a bit (and that's what we use), but it doesn't cover cases based on the right sequence of instructions, at least not in any formalised way.

The only generic way of doing this is to allow targets to inspect the basic block for the expected sequences (so for example, ext+add has a free ext but ex+mov doesn't). This would require table gen descriptions of the patterns and could considerably increase the cost computation (but that's a target-specific decision, so it's ok).

I guess I should probably wait for VPlan to arrive before trying anything... Is that going to be anytime soon?

Being orthogonal, I don't expect your cost investigations to have a large impact on the VPlan implementation. Maybe if you could open a new RFC with your proposal, we could go through it and see how much it'll actually impact.

cheers,
--renato

jonpa added a comment.Feb 23 2017, 3:18 AM

The only generic way of doing this is to allow targets to inspect the basic block for the expected sequences (so for example, ext+add has a free ext but ex+mov doesn't). This would require table gen descriptions of the patterns and could considerably increase the cost computation (but that's a target-specific decision, so it's ok).

On thing I did was actually a check if ISD::ZEXTLOAD / ISD::SEXTLOAD is legal, which I think should work without any extra work, since those ISD nodes are already there. For in-memory operations, I suppose that's a bit tricker, maybe a target hook or that returns true/false depending on if there is an instruction that will fold the load. I will put my patch up for review when it's more mature, hopefully. Or perhaps it is best to do as you say, and allow the target to adjust the sum by inspecting the BB...

The main thought I had at this moment, was that I thought perhaps if the scalarization costs were modeled in a better way, the LoopVectorizer should be able to for example hold the scalarization costs for each instruction as a tuple of {inserts, extracts}, and then get a more accurate final cost estimate sum by checking interdependencies of scalarized / vectorized instructions. It should only add inserts if the user was vectorized, and so on. I was hoping maybe VPlan perhaps might build a model with instruction costs and sum them after all individual costs are there, or so.

The main thought I had at this moment, was that I thought perhaps if the scalarization costs were modeled in a better way, the LoopVectorizer should be able to for example hold the scalarization costs for each instruction as a tuple of {inserts, extracts}, and then get a more accurate final cost estimate sum by checking interdependencies of scalarized / vectorized instructions. It should only add inserts if the user was vectorized, and so on. I was hoping maybe VPlan perhaps might build a model with instruction costs and sum them after all individual costs are there, or so.

This was my first thought, since they are the highest unknowns in vectorisation. But then you may have extending adds but not extending saturating adds, and then a three-instruction-pattern (ext+add+max) cannot be matched, but it will match the (ext+add) and remove the ext cost, when there is actually an extra instruction for that.

There are also costs related to moving to and from vector registers. For example, on NEON, GPR->NEON is free, but NEON->GPR has a ~10-cycle stall. That cannot be modelled without understanding about the surrounding instructions (say, a scalar reduction on every loop would make a 2-lane vector add useless).

We could probably go slow and fiddle with the scalar costs now (with lost of benchmark results in the affected arches), and maybe have a half-baked solution for shuffles, since they're the most obvious problems, but it would be good to not destroy the plan to be able to look at the context, hopefully based on table gen pattern matches.

cheers,
--renato

Ayal added a comment.Feb 23 2017, 1:14 PM

Hi Jonas and Renato,

I have been working with improving cost estimates in the LoopVectorizer a bit. I wonder if VPlan will improve cost estimation for different VFs, including 1?

Hi Jonas,

AFAICT, VPlan is completely orthogonal to the cost computation, ie. the exact same cost functions will be used (including 1).

Improving the "accuracy" of TTI's cost estimates for IR patterns is indeed an orthogonal effort. VPlan is designed to provide an explicit representation of the IR patterns that the vectorizer plans to emit, for all VF's under consideration. This includes representing scalarized instructions, vectorized instructions, packing/unpacked instructions and where they are placed, as well as idioms such as interleave-groups. Doing so will help provide a more "consistent", "reliable" and straightforward cost computation. The plan is to introduce VPlan incrementally, starting with this patch driving the transformation, to be followed by having VPlan also drive the cost computation.

Determining where to best draw the line between scalarized and vectorized instructions is indeed an interesting challenge, cf. the work on Throttled SLP.

jonpa added a comment.Feb 24 2017, 5:18 AM

VPlan is designed to provide an explicit representation of the IR patterns that the vectorizer plans to emit, for all VF's under consideration. This includes representing scalarized instructions, vectorized instructions, packing/unpacked instructions and where they are placed, as well as idioms such as interleave-groups.

That was just what I wanted to hear, sounds great :-) I suppose then one shouldn't put too much effort into this before VPlan arrives.

Ayal updated this revision to Diff 90376.Mar 2 2017, 12:49 PM

Updated to ToT, including the recording of InterleaveGroup decisions per VF.

Ayal marked 5 inline comments as done.Mar 2 2017, 1:51 PM
jonpa added a comment.Mar 8 2017, 3:59 AM

Maybe if you could open a new RFC with your proposal, we could go through it and see how much it'll actually impact.

I have now two minor revisions open for the LoopVectorizer, which are in need of review:
https://reviews.llvm.org/D30732
https://reviews.llvm.org/D30680

Ayal updated this revision to Diff 95186.Apr 13 2017, 12:37 PM

Updated to ToT, including first incremental patches that were committed.

jprice added a subscriber: jprice.Apr 13 2017, 3:29 PM

So, I think the document is so good and can be very helpful in further communications, that I think it should be committed separately.

Once we all agree that the document explains what we really want (I mostly do), we can look at the transformations.

I mean, this is a big change and we want to get it right, so minimising the patch size and intrusiveness is a plus. About half of this patch is the document. :)

Makes sense?
--renato

docs/VectorizationPlan.rst
10

I'll echo @mkuper here: drop the NFC.

Also, this is a document that will outlive the patch, so it should also not mention the patch at all.

If we want a document to help us design or explain the VPlan, then we need a VPlan document, not a document for a patch.

Don't get me wrong, not many people write documents, especially not for patches, so this is awesome!

But we need to think about its lifetime and format.

426

Right, this is what I meant. Filer all VPlans by legality first. I do like the idea that no illegal VPlan can exist.

Costs are for a later patch, I agree.

Ayal marked an inline comment as done.Apr 30 2017, 12:53 PM

So, I think the document is so good and can be very helpful in further communications, that I think it should be committed separately.

Once we all agree that the document explains what we really want (I mostly do), we can look at the transformations.

I mean, this is a big change and we want to get it right, so minimising the patch size and intrusiveness is a plus. About half of this patch is the document. :)

Makes sense?
--renato

Yes, thanks. We're refining this document to contain only the general design of VPlan; the rest of the documentation may better suits a review-summary / commit-message of the first introductory patch.

docs/VectorizationPlan.rst
10

OK, you're right. We're refining this document to contain only the general design of VPlan. The parts that specifically concern the first introductory patch will be taken out to form the SUMMARY and commit-message of this first patch, to be uploaded for review shortly.

The document was originally written with everything together to make it self-contained and hopefully easier to review.

bollu added a subscriber: bollu.Oct 30 2017, 12:52 AM
fhahn added a subscriber: fhahn.Oct 31 2017, 2:37 AM

Hi Ayal,

This functionality has been submitted already, right? If so, please close this review.

Thanks,
--renato

Ayal abandoned this revision.Dec 7 2017, 8:21 AM

Hi Ayal,

This functionality has been submitted already, right? If so, please close this review.

Thanks,
--renato

Sure Renato. This approach of gradually changing and improving the loop vectorizer in-place has served well. Thanks for your support and appreciation!