This is an archive of the discontinued LLVM Phabricator instance.

Paths

lib/Transforms/Vectorize/
- CMakeLists.txt
- VPlan.h
- VPlan.cpp
- VPlanSLP.cpp
- VPlanValue.h

unittests/Transforms/Vectorize/
- CMakeLists.txt
- VPlanSlpTest.cpp

Differential D49491

[RFC][VPlan, SLP] Add simple SLP analysis on top of VPlan.
ClosedPublic

Authored by fhahn on Jul 18 2018, 8:50 AM.


Details

Reviewers

Ayal
mssimpso
rengolin
mkuper
hfinkel
hsaito
dcaballe
vporpo
RKSimon
ABataev

Summary

This patch adds an initial implementation of the look-ahead SLP tree
construction described in 'Look-Ahead SLP: Auto-vectorization in the Presence
of Commutative Operations' (CGO 2018) by Vasileios Porpodas, Rodrigo C. O. Rocha,
Luís F. W. Góes.

It returns an SLP tree represented as VPInstructions, with combined
instructions represented as a single, wider VPInstruction.

This initial version does not support instructions with multiple
different users (either inside or outside the SLP tree) or
non-instruction operands; it won't generate any shuffles or
insertelement instructions.

It also just adds the analysis that builds an SLP tree rooted in a set
of stores. It does not include any cost modeling or memory legality
checks. The plan is to integrate it with VPlan based cost modeling, once
available and to only apply it to operations that can be widened.
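The bottom-up construction described above can be sketched on a toy IR: starting from a bundle of stores (one per lane), lanes are combined when they share an opcode, have the same number of operands and have a single user, and the operands at each index form the next bundle. `Inst`, `SLPNode` and `buildTree` are hypothetical names, not the patch's VPInstruction API; address operands, look-ahead operand reordering and memory legality are left out, and a load simply ends a branch of the tree.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical scalar instruction: opcode, operands, number of users.
// Store address operands are not modelled; a store's only operand is its value.
struct Inst {
  std::string Opcode;
  std::vector<Inst *> Operands;
  unsigned NumUsers = 1;
};

// One wide operation standing for a bundle of isomorphic scalars (one per lane).
struct SLPNode {
  std::string Opcode;
  std::vector<Inst *> Lanes;
  std::vector<std::unique_ptr<SLPNode>> Children;
};

// Lanes can be combined if they share opcode and operand count and, as in the
// initial version described above, no lane has multiple users.
static bool canCombine(const std::vector<Inst *> &Bundle) {
  assert(!Bundle.empty() && "empty bundle");
  const Inst *First = Bundle[0];
  for (const Inst *I : Bundle) {
    if (!I || I->Opcode != First->Opcode ||
        I->Operands.size() != First->Operands.size())
      return false;
    if (I->NumUsers > 1)
      return false;
  }
  return true;
}

// Recursively builds the tree; returns nullptr if any bundle cannot be combined.
static std::unique_ptr<SLPNode> buildTree(const std::vector<Inst *> &Bundle) {
  if (!canCombine(Bundle))
    return nullptr;
  auto Node = std::make_unique<SLPNode>();
  Node->Opcode = Bundle[0]->Opcode;
  Node->Lanes = Bundle;
  if (Node->Opcode == "load")
    return Node; // loads are leaves in this sketch
  // Gather operand OpIdx of every lane into a new bundle and recurse.
  for (std::size_t OpIdx = 0, E = Bundle[0]->Operands.size(); OpIdx < E; ++OpIdx) {
    std::vector<Inst *> OpBundle;
    for (Inst *Lane : Bundle)
      OpBundle.push_back(Lane->Operands[OpIdx]);
    std::unique_ptr<SLPNode> Child = buildTree(OpBundle);
    if (!Child)
      return nullptr;
    Node->Children.push_back(std::move(Child));
  }
  return Node;
}

// Number of combined (wide) nodes in the tree.
static std::size_t countNodes(const SLPNode &N) {
  std::size_t Count = 1;
  for (const auto &C : N.Children)
    Count += countNodes(*C);
  return Count;
}
```

For two lanes of `store(add(load, load))`, the tree has four wide nodes: one store, one add and two loads; if one lane uses `mul` instead of `add`, construction fails.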

A follow-up patch will add support for replacing instructions in a
VPlan with their SLP counterparts.


Event Timeline

fhahn created this revision. Jul 18 2018, 8:50 AM

Herald added subscribers: rogfer01, rkruppe, tschuett and 2 others. Jul 18 2018, 8:50 AM

fhahn added a parent revision: D49489: [VPlan] VPlan version of InterleavedAccessInfo. Jul 18 2018, 8:50 AM

What is your plan for this vs SLPVectorizer? How do they compare wrt vectorized codegen?

rcorcs added a subscriber: rcorcs. Jul 18 2018, 9:20 AM

In D49491#1166680, @RKSimon wrote:

What is your plan for this vs SLPVectorizer? How do they compare wrt vectorized codegen?

The initial motivation is to improve vectorization in cases where currently the loop vectorizer only considers interleaving memory accesses, which either results in sub-optimal code or no vectorization in case the interleaved accesses are too expensive on the target. It should also come in handy when vectorizing for architectures with scalable vector registers, where SLP-style vectorization of an unrolled loop is harder, for example.

Overall I think this should fit nicely into the emerging VPlan infrastructure: applying SLP-style vectorization is one strategy/plan among others that are evaluated against each other before choosing the most profitable overall. We can add the SLP analysis and tooling to create a VPlan with the SLP combinations applied, and just need small tweaks to VPlan-based cost modelling and code generation to make it aware of the new combined load/store instructions.

In the long term, I think there is potential to share infrastructure with the SLPVectorizer on different levels: for example, we could potentially re-use VPlan based code generation, if the SLPVectorizer would emit VPInstructions; share the infrastructure to discover consecutive loads/stores; share different combination algorithms between VPlan and SLPVectorizer; potentially re-use part of the VPlan based cost model. One thing I want to try out is how easy/hard it would be to make the analysis work on VPInstruction and Instruction based basic blocks.

That being said, some of the VPlan infrastructure is still emerging: for example, initial VPInstruction-based code generation and cost modelling are currently being worked on. However, I think considering SLP-style vectorization as a VPlan2VPlan transformation (and others) early on would help make sure the design of the VPlan infrastructure is general enough to cover a wide range of use cases.

I hope that answers your question.

My tuppence...

In D49491#1166927, @fhahn wrote:

The initial motivation is to improve vectorization in cases where currently the loop vectorizer only considers interleaving memory accesses, which either results in sub-optimal code or no vectorization in case the interleaved accesses are too expensive on the target. It should also come in handy when vectorizing for architectures with scalable vector registers, where SLP-style vectorization of an unrolled loop is harder, for example.

This was part of the VPlan from the beginning, but we never discussed in detail how the implementation would work, especially considering that SLP will still be there for quite a while as is.

Overall I think this should fit nicely into the emerging VPlan infrastructure: applying SLP-style vectorization is one strategy/plan among others that are evaluated against each other before choosing the most profitable overall. We can add the SLP analysis and tooling to create a VPlan with the SLP combinations applied, and just need small tweaks to VPlan-based cost modelling and code generation to make it aware of the new combined load/store instructions.

This is indeed the biggest benefit of having SLP analysis in VPlan, especially with the VPlan-to-VPlan transformations.

In the long term, I think there is potential to share infrastructure with the SLPVectorizer on different levels: for example, we could potentially re-use VPlan based code generation, if the SLPVectorizer would emit VPInstructions; share the infrastructure to discover consecutive loads/stores; share different combination algorithms between VPlan and SLPVectorizer; potentially re-use part of the VPlan based cost model. One thing I want to try out is how easy/hard it would be to make the analysis work on VPInstruction and Instruction based basic blocks.

This is why we haven't gone too deep in the analysis yet. Sharing code between SLPVec, which operates on IR, and VPlanSLP, which operates on VPInstructions, can be confusing, limiting or even impossible. The analysis part mostly works, because VPInstruction is similar enough to Inst, but the implementation and, worse, the heuristics can go wrong very quickly.

I think long term we basically have two options:

1. We cannibalise SLPVec, hoisting analyses and transformations, make them generic (Inst/VPInst), and make sure we never hurt performance or compile time. This is hard, slow and painful, but it's the most stable solution going forward.

2. We implement VPlanSLP in parallel, create a flag to flip between the two (not having both at the same time), and when VPlanSLP does all that and more, we flip the default. This is much easier short-term but risks never truly getting flipped and dividing the usage.

I don't have a good view on which would be better right now. VPlan is still largely monolithic, VP-to-VP is too fresh, and SLP needs to understand loop and outer loop boundaries to operate correctly in VPlan.

That being said, some of the VPlan infrastructure is still emerging: for example, initial VPInstruction-based code generation and cost modelling are currently being worked on. However, I think considering SLP-style vectorization as a VPlan2VPlan transformation (and others) early on would help make sure the design of the VPlan infrastructure is general enough to cover a wide range of use cases.

I can see the appeal in a proof-of-concept, and I don't oppose having it. But I'm not strongly in favour either.

If more people think that option #2 above is the way to go, then this could turn out fine.

But if more people prefer option #1, then we would want to see what gets hoisted and what the local implementations will look like before we add VPlanSLP.

cheers,
--renato

In D49491#1166978, @rengolin wrote:

In D49491#1166927, @fhahn wrote:

That being said, some of the VPlan infrastructure is still emerging: for example, initial VPInstruction-based code generation and cost modelling are currently being worked on. However, I think considering SLP-style vectorization as a VPlan2VPlan transformation (and others) early on would help make sure the design of the VPlan infrastructure is general enough to cover a wide range of use cases.

I can see the appeal in a proof-of-concept, and I don't oppose having it. But I'm not strongly in favour either.

If more people think that option #2 above is the way to go, then this could turn out fine.

But if more people prefer option #1, then we would want to see what gets hoisted and what the local implementations will look like before we add VPlanSLP.

The fact of the matter is that the loop vectorization has a need to understand SLP and SLP vectorizer needs to understand Loop. As such, unless we want to build/maintain separate LoopVectorize+SLP and SLPVectorize+Loop, consolidation of LoopVectorization and SLPVectorization will inevitably happen sooner or later. From that perspective, ensuring that VPlan is the right infrastructure for such consolidation is a very important thing for us to do.
We don't necessarily have to choose between #1 and #2. Once we conclude that VPlan is promising enough for SLP, we can start VPlanizing the SLP vectorizer (#1) while #2 moves forward. We just need to make a conscious effort to converge the two, starting by sharing small chunks of code and then steadily increasing sharing.

In D49491#1167024, @hsaito wrote:

The fact of the matter is that the loop vectorization has a need to understand SLP and SLP vectorizer needs to understand Loop. As such, unless we want to build/maintain separate LoopVectorize+SLP and SLPVectorize+Loop, consolidation of LoopVectorization and SLPVectorization will inevitably happen sooner or later. From that perspective, ensuring that VPlan is the right infrastructure for such consolidation is a very important thing for us to do.

I'm totally on board with you. I want SLP to be in VPlan (have been advocating for it since the beginning).

We don't necessarily have to choose between #1 and #2. Once we conclude that VPlan is promising enough for SLP, we can start VPlanizing the SLP vectorizer (#1) while #2 moves forward. We just need to make a conscious effort to converge the two, starting by sharing small chunks of code and then steadily increasing sharing.

A merge between 1 and 2 sounds equally plausible. I just want everyone to agree on the strategy.

vchuravy added a subscriber: vchuravy. Jul 18 2018, 9:25 PM

I have no strong opinions on the best approach. My concern is my current work revolves around a number of issues:

1 - Generalize alternate vectorisation paths (multiple different vector ops + select/shuffle merges).
2 - Supporting 'copyable' elements (PR30787).
3 - Pull TTI/vectorization costs from scheduling models (PR36550).
4 - Using dereferenceable_or_null metadata to vectorise loads with missing elements (PR21780).
5 - Revectorisation of 128-bit vector code to 256-bit vector code (make most of YMM ops now that Jaguar model is treating them nicely).

All of these are large pieces of work and I don't want to find myself implementing them in SLPVectorizer just for all the work to be lost and we're back to a very basic SLP system again.

How quickly do you expect VPlanSLP to be close to current SLPVectorizer codegen? Ideally I'd like to see slp tests run on both asap.

I'll let Florian/Hideki reply about timeframes and strategies, and will just focus on specific items you list.

In D49491#1167789, @RKSimon wrote:

1 - Generalize alternate vectorisation paths (multiple different vector ops + select/shuffle merges).
2 - Supporting 'copyable' elements (PR30787).
4 - Using dereferenceable_or_null metadata to vectorise loads with missing elements (PR21780).

These all look transferable to a VPlan-based SLP.

3 - Pull TTI/vectorization costs from scheduling models (PR36550).

This is really important for VPlan, too. There is some work (can't remember which review now) to get better cost analysis, looking at a longer use-chain.

5 - Revectorisation of 128-bit vector code to 256-bit vector code (make most of YMM ops now that Jaguar model is treating them nicely).

This sounds like a large undertaking to do in SLP proper. I remember having briefly discussed re-vectorisation as a VP2VP transformation, which I think makes more sense.

But having it in SLP for now is perfectly fine if you have cases in mind where it's obviously beneficial. It's probably going to take a while to get the VP2VP stuff, anyway.

All of these are large pieces of work and I don't want to find myself implementing them in SLPVectorizer just for all the work to be lost and we're back to a very basic SLP system again.

That was my primary concern, too. But I don't think anyone is proposing to just dump the existing SLP unless the new one has *all* of it (and more).

And that also means either merging or commoning-up the features above (and all others).

rkruppe mentioned this in D49489: [VPlan] VPlan version of InterleavedAccessInfo. Jul 19 2018, 5:39 AM

Thanks for all the feedback! My initial plan was to make VPlan based loop vectorization SLP aware, to improve on the cases mentioned earlier, where currently we do not do the best thing in LoopVectorize. That should take a bit of load off SLPVectorize, but would not replace the current SLPVectorizer for now.

I think in the long term, we should work towards re-using as much of the VPlan infrastructure in SLPVectorize as is sensible and possible. Merging #1 and #2 seems like a good approach: we can get some benefits for loop vectorization relatively quickly and in the meantime evolve the whole VPlan infrastructure around it, while making SLPVectorize and VPlanSLP share common infrastructure where it makes sense and is beneficial. For example, we could start from the bottom up, initially using VPlan for code generation and moving up from there to bundle scheduling, cost modelling and auxiliary analyses such as interleaved load/store analysis. In the end we might end up with "competing" algorithms/SLP strategies, which get evaluated against each other before creating (and executing) the final VPlan.

I have no strong opinions on the best approach. My concern is my current work revolves around a number of issues:

1 - Generalize alternate vectorisation paths (multiple different vector ops + select/shuffle merges).

Representing alternate vectorization paths/strategies in a way that can be easily evaluated against each other is one major benefit of VPlan, so I assume it could be helpful here.

2 - Supporting 'copyable' elements (PR30787).
3 - Pull TTI/vectorization costs from scheduling models (PR36550).

This is great and we should use the same API here in the VPlan cost modelling and SLPVectorizer.

4 - Using dereferenceable_or_null metadata to vectorise loads with missing elements (PR21780).
5 - Revectorisation of 128-bit vector code to 256-bit vector code (make most of YMM ops now that Jaguar model is treating them nicely).

All of these are large pieces of work and I don't want to find myself implementing them in SLPVectorizer just for all the work to be lost and we're back to a very basic SLP system again.

How quickly do you expect VPlanSLP to be close to current SLPVectorizer codegen? Ideally I'd like to see slp tests run on both asap.

This depends on when we get initial VPlan-native codegen and cost modelling. I hope those things get submitted in the next few months. As mentioned earlier, I could put together a VPlanSLP version operating on scalar code outside loops, so we have something more concrete to compare.

That was my primary concern, too. But I don't think anyone is proposing to just dump the existing SLP unless the new one has *all* of it (and more).

And that also means either merging or commoning-up the features above (and all others).

Given the amount of tuning that went into SLPVectorize over the years, it will take a while to reach parity for a potential VPlan based replacement. But it would be great if we could agree on the general direction & approach and I would be more than happy to try to make sure the issues described above fit well into VPlanSLP. I guess with the problems mentioned above, the strategies and concepts are trickier to get right than the implementation details, so collaborating would be very valuable IMO.

Some minor/style comments as I get up to speed on the code.

lib/Transforms/Vectorize/VPlan.cpp
690	for (unsigned i = 0, e = User->getNumOperands(); i < e; i++) {
694	Remove unnecessary braces?
lib/Transforms/Vectorize/VPlan.h
679	No need for assert - cast will catch it. Could you use cast_or_null<Instruction>(getUnderlyingValue()) directly?
1587	Please can you add description comments for all these member variables and public functions.
lib/Transforms/Vectorize/VPlanSLP.cpp
60	static?
91	Should you use VPSLP for debug messages to avoid confusion with SLPVectorizer?
155	auto?
162	Reuse Instruction::isCommutative(cast<VPInstruction>(Values[0])->getOpcode())?
183	llvm_unreachable?
190	This would be cleaner if you pulled out `cast<VPInstruction>(Values[0])->getNumOperands()`: `unsigned NumOps = cast<VPInstruction>(Values[0])->getNumOperands(); for (unsigned i = 0; i < NumOps; ++i)`
239	(style) Score
241	Don't re-evaluate the getNumOperands() calls: `for (unsigned i = 0, e1 = cast<VPUser>(V1)->getNumOperands(); i < e1; i++)` `for (unsigned j = 0, e2 = cast<VPUser>(V2)->getNumOperands(); j < e2; j++)`
258	(style) auto for casts
337	for (unsigned Op = 0, E = MultiNodeOps.size(); Op < E; ++Op) {
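The nested loop on line 241 belongs to the look-ahead scoring from the paper cited in the summary: roughly, a candidate pair is scored by recursively pairing up the operands of both instructions down to a maximum depth. A simplified sketch with the loop bounds hoisted as suggested, assuming a hypothetical `Inst` type in which leaf pairs match when they share an opcode (and, for loads, access consecutive addresses). `lookAheadScore` is an illustrative name, not the patch's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical opcodes and instruction; Addr stands in for a load's element index.
enum : unsigned { Load = 1, Add = 2 };

struct Inst {
  unsigned Opcode;
  std::vector<const Inst *> Operands;
  int Addr = 0;
};

// Leaf test: same opcode and, for loads, consecutive addresses.
static bool isMatchingPair(const Inst *A, const Inst *B) {
  if (A->Opcode != B->Opcode)
    return false;
  if (A->Opcode == Load)
    return B->Addr == A->Addr + 1;
  return true;
}

// At depth 0, score the pair itself; otherwise sum the scores of all operand
// pairs one level down. Bounds are computed once (E1/E2), as suggested in review.
static unsigned lookAheadScore(const Inst *V1, const Inst *V2, unsigned MaxLevel) {
  assert(V1 && V2 && "null instruction");
  if (MaxLevel == 0)
    return isMatchingPair(V1, V2) ? 1 : 0;
  unsigned Score = 0;
  for (std::size_t I = 0, E1 = V1->Operands.size(); I < E1; ++I)
    for (std::size_t J = 0, E2 = V2->Operands.size(); J < E2; ++J)
      Score += lookAheadScore(V1->Operands[I], V2->Operands[J], MaxLevel - 1);
  return Score;
}
```

For `add(a[0], b[0])` and `add(a[1], b[1])`, depth 0 scores the adds themselves (1) and depth 1 finds the two consecutive load pairs (2).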

Address comments, thanks! I will add more comments tomorrow.

fhahn marked 13 inline comments as done. Jul 19 2018, 10:27 AM

fhahn added inline comments.

lib/Transforms/Vectorize/VPlan.h
1587	Will do tomorrow!

RKSimon added inline comments. Jul 20 2018, 3:09 AM

lib/Transforms/Vectorize/VPlanSLP.cpp
176	You can move this above the switch statement and remove the duplicate cast

ABataev added inline comments. Jul 20 2018, 7:20 AM

lib/Transforms/Vectorize/VPlan.cpp
690	`i`->`I`, `e`->`E` per coding standard. Use `++I` instead of `i++` per coding standard
lib/Transforms/Vectorize/VPlan.h
652–655	`VPInstruction *clone() const`
675	Make the function `const`
1590	No need to initialize `BundleToCombined` here; it will be default-initialized
1596	`const`
lib/Transforms/Vectorize/VPlanSLP.cpp
85	`const` member function
165–166	`static`
177	`i`->`I`, `numOps`->`NumOps`, `++I`
192	Maybe it is better to use `llvm::Optional<unsigned>` instead of a magic number for the non-matching case?
235	`static`
263	Why `5`? Use enum or const instead of magic number.
288–293	`std::remove(Candidates.begin(), Candidates.end(), Best);`?
313	`Lane = 1, E = MultiNodeOps[0].second.size(); Lane < E;` Use preincrement
325	preincrement
331	Please, use the actual type here
343	`ArrayRef<VPValue *> Values`
399–401	No braces here
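Two of the suggestions above, sketched with standard-library types: `std::optional` stands in for `llvm::Optional`, and `getBestCandidate` and `eraseValue` are hypothetical helpers. Note that the `std::remove` call suggested for lines 288–293 only moves the kept elements forward and returns the new logical end; pairing it with `erase()` is what actually removes `Best` from `Candidates`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Index of the highest-scoring candidate, or std::nullopt if no candidate
// scores above zero (instead of a magic sentinel value). Ties keep the first.
static std::optional<std::size_t>
getBestCandidate(const std::vector<unsigned> &Scores) {
  std::optional<std::size_t> Best;
  for (std::size_t I = 0, E = Scores.size(); I < E; ++I)
    if (Scores[I] > 0 && (!Best || Scores[I] > Scores[*Best]))
      Best = I;
  return Best;
}

// Erase-remove idiom: std::remove shifts the kept elements to the front,
// erase() then drops the leftover tail so the vector really shrinks.
template <typename T>
static void eraseValue(std::vector<T> &V, const T &Value) {
  V.erase(std::remove(V.begin(), V.end(), Value), V.end());
}
```

Callers can then test the result directly (`if (auto Best = getBestCandidate(Scores))`) instead of comparing against a sentinel. This sketch requires C++17.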

Address next round of review comments, thanks! Also add some comments.

fhahn marked 15 inline comments as done. Jul 20 2018, 10:19 AM

fhahn added inline comments.

lib/Transforms/Vectorize/VPlan.h
675	Agreed that it should be const. Before I can change it here, I need to make some of the underlying/connected functions const.

ABataev added inline comments. Jul 20 2018, 10:53 AM

lib/Transforms/Vectorize/VPlan.cpp
702–712	`const auto *`
lib/Transforms/Vectorize/VPlanSLP.cpp
65	`ArrayRef<VPValue *> Operands`
150	`ArrayRef<VPValue *> Values`
160	`ArrayRef<VPValue *> Values`
166	`ArrayRef<VPValue *> Values`
187	`ArrayRef<VPValue *> Values`
228–229	Capitalize variable names + preincrement
236	`ArrayRef<VPValue *> Candidates`
349	`ArrayRef<VPValue *> Values`

fhahn updated this revision to Diff 156555. Jul 20 2018, 11:44 AM

Address review comments, thanks!

lib/Transforms/Vectorize/VPlanSLP.cpp
65	Operands is used as key for the BundleToCombined map, which expects a SmallVector<VPValue *, 4> key.
236	We remove elements from Candidates which is not possible with ArrayRef I think
263	Replaced with a variable
349	Values is used as key for the BundleToCombined map, which expects a SmallVector<VPValue *, 4> key.

ABataev added inline comments. Jul 20 2018, 12:58 PM

lib/Transforms/Vectorize/VPlanSLP.cpp
65	LLVM has a special function, `llvm::to_vector<4>()`, that converts an `ArrayRef` to a `SmallVector`
179–183	Do we need braces here?
195	Use `llvm::None` instead
303	Is it possible to replace it with `emplace_back(Operands.first, Operands.second[0]);`
303–309	You can preallocate the memory for `Mode` and `FinalOrder` vectors, because you know the number of elements is `MultiNodeOps.size()`
319	Also can preallocate the memory for the vector
335	use `emplace_back()`

fhahn updated this revision to Diff 156591. Jul 20 2018, 1:29 PM

fhahn updated this revision to Diff 156605. Jul 20 2018, 2:09 PM

fhahn marked 7 inline comments as done.

fhahn added inline comments.

lib/Transforms/Vectorize/VPlanSLP.cpp
65	Neat, thanks for pointing that out!
179–183	Not anymore, thanks
303	I don't think so, it fails to deduce the correct type for the second argument (with and without braces)
335	Is there a benefit to using emplace_back here? I don't think we are constructing any objects here.
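The thread does not show FinalOrder's element type, so this is a sketch under an assumption: each entry pairs an unsigned with a small vector of values (which would explain why a bare pointer cannot be passed as the second constructor argument). `Value`, `Entry` and `collectFirstOperands` are hypothetical. It shows the `reserve()` suggestion and why a braced argument works with `push_back` but not with `emplace_back`: `emplace_back` is a variadic template with forwarding-reference parameters, and a braced-init-list gives template argument deduction nothing to deduce from.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Value {}; // hypothetical stand-in for VPValue

// Assumed element type: a mode/index paired with a bundle of values.
using Entry = std::pair<unsigned, std::vector<Value *>>;

static std::vector<Entry>
collectFirstOperands(const std::vector<Entry> &MultiNodeOps) {
  std::vector<Entry> FinalOrder;
  // The final size is known up front, so allocate once.
  FinalOrder.reserve(MultiNodeOps.size());
  for (const Entry &Operands : MultiNodeOps) {
    assert(!Operands.second.empty() && "expected at least one operand");
    // FinalOrder.emplace_back(Operands.first, {Operands.second[0]}); does not
    // compile: the braced list cannot be deduced for emplace_back's Args&&...
    // push_back takes value_type, so the braces initialize the pair directly.
    FinalOrder.push_back({Operands.first, {Operands.second[0]}});
  }
  return FinalOrder;
}
```

Writing `emplace_back(Operands.first, std::vector<Value *>{Operands.second[0]})` would also compile. When the element is already constructed, `emplace_back` and `push_back` do essentially the same work.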

ashutosh.nema added a subscriber: ashutosh.nema. Jul 22 2018, 10:39 PM

Ping. Thanks for all the comments so far and the initial code review.

What do you think is the best way going forward overall: 1) evolving loop aware SLP initially in VPlan (independent of SLPVectorizer) and start using VPlan infrastructure for SLPVectorizer as it becomes available or 2) trying to share as much code between both from the start? Personally I think 1) would be better, as it allows us to evolve the VPlan parts quicker and I would like to avoid unnecessary code churn in SLPVectorizer and start using VPlan infrastructure there as it becomes clearly beneficial.

I'll also add some cleanup code shortly to destroy the created instructions if they are not used.

Abhilash added a subscriber: Abhilash. Aug 1 2018, 11:09 PM

a.elovikov added a subscriber: a.elovikov. Aug 6 2018, 7:25 AM

Thanks for working on this, Florian, and sorry for my delayed response. I added some initial comments. I'll come back soon.

In D49491#1184938, @fhahn wrote:

Ping. Thanks for all the comments so far and the initial code review.

What do you think is the best way going forward overall: 1) evolving loop aware SLP initially in VPlan (independent of SLPVectorizer) and start using VPlan infrastructure for SLPVectorizer as it becomes available or 2) trying to share as much code between both from the start? Personally I think 1) would be better, as it allows us to evolve the VPlan parts quicker and I would like to avoid unnecessary code churn in SLPVectorizer and start using VPlan infrastructure there as it becomes clearly beneficial.

I agree with you. I think it's better to have an initial end-to-end loop aware SLP PoC, properly define the representation of SLP operations, define how SLP transformation comes into play and interacts with the rest of the VPlan infrastructure, etc., before we move forward on SLPVectorizer. In the VPlan native path we have some room for experimentation. We can take advantage of that and start with SLPVectorizer once we have a solid model for SLP in VPlan.

In D49491#1184938, @fhahn wrote:

In the long term, I think there is potential to share infrastructure with the SLPVectorizer on different levels: for example, we could potentially re-use VPlan based code generation, if the SLPVectorizer would emit VPInstructions; share the infrastructure to discover consecutive loads/stores; share different combination algorithms between VPlan and SLPVectorizer; potentially re-use part of the VPlan based cost model. One thing I want to try out is how easy/hard it would be to make the analysis work on VPInstruction and Instruction based basic blocks.

In D49491#1166978, @rengolin wrote:

This is why we haven't gone too deep in the analysis yet. Sharing code between SLPVec, which operates on IR, and VPlanSLP, which operates on VPInstructions, can be confusing, limiting or even impossible. The analysis part mostly works, because VPInstruction is similar enough to Inst, but the implementation and, worse, the heuristics can go wrong very quickly.

In terms of re-use, Value/VPValue templatization could work for some analyses. Templatizing IRBuilder/VPlanBuilder might also work for some parts of the implementation. However, we may want to think carefully. If templatization required the whole VPValue hierarchy and its internal details to be a carbon copy of the Value hierarchy, we would lose all the flexibility that VPValue/VPInstruction bring to VPlan. Of course, we would have to investigate this a bit more. Another alternative I had in mind is introducing some kind of InstructionTraits. That would give us a unified interface while allowing more flexibility at the VPValue/Value implementation level. Unfortunately, we haven't had time to investigate this approach further.
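The InstructionTraits idea could look roughly like this: an analysis is written once against a traits interface, and each instruction representation supplies its own specialization, so neither representation has to mirror the other's class hierarchy. `InstructionTraits`, `IRInst`, `PlanInst` and `areIsomorphic` are hypothetical, not existing LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two unrelated, hypothetical instruction representations.
struct IRInst {
  unsigned Opc;
  std::vector<const IRInst *> Ops;
};
struct PlanInst {
  unsigned Opcode;
  const PlanInst *Operands[2];
  unsigned NumOperands;
};

// The analysis only sees this interface.
template <typename InstT> struct InstructionTraits;

template <> struct InstructionTraits<IRInst> {
  static unsigned getOpcode(const IRInst &I) { return I.Opc; }
  static std::size_t getNumOperands(const IRInst &I) { return I.Ops.size(); }
  static const IRInst &getOperand(const IRInst &I, std::size_t N) {
    assert(N < I.Ops.size() && "operand index out of range");
    return *I.Ops[N];
  }
};

template <> struct InstructionTraits<PlanInst> {
  static unsigned getOpcode(const PlanInst &I) { return I.Opcode; }
  static std::size_t getNumOperands(const PlanInst &I) { return I.NumOperands; }
  static const PlanInst &getOperand(const PlanInst &I, std::size_t N) {
    assert(N < I.NumOperands && "operand index out of range");
    return *I.Operands[N];
  }
};

// One analysis for both representations: same opcode and operand count,
// recursively, down to Depth levels.
template <typename InstT, typename Traits = InstructionTraits<InstT>>
bool areIsomorphic(const InstT &A, const InstT &B, unsigned Depth) {
  if (Traits::getOpcode(A) != Traits::getOpcode(B) ||
      Traits::getNumOperands(A) != Traits::getNumOperands(B))
    return false;
  if (Depth == 0)
    return true;
  for (std::size_t I = 0, E = Traits::getNumOperands(A); I < E; ++I)
    if (!areIsomorphic<InstT, Traits>(Traits::getOperand(A, I),
                                      Traits::getOperand(B, I), Depth - 1))
      return false;
  return true;
}
```

Compared with templatizing over the class hierarchies themselves, the traits layer only fixes the queries an analysis needs, leaving each representation free to store operands however it likes.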

lib/Transforms/Vectorize/VPlan.h
620	To be more specific, what about SLPLoad/SLPStore instead?
663	I wonder if having this interface is safe. Wouldn't it be problematic if we change the opcode of a VPInstruction object to an opcode of an operation which is represented with a VPInstruction subclass? Maybe it's better to not allow this and just create a VPInstruction using VPlanBuilder?
675	Make it protected per design principle described in getUnderlyingValue in VPlanValue.h? Same for setUnderlyingInstr.
1538	(parts) of -> (parts of)?
1543	SCEV -> VPValue? I guess we couldn't have a single templatized version for sorted SmallVectors of any pointer, right?
1595	formatting
1606	Something to think about is the long-term future of this class. I wonder if it would be better to split it into multiple ones, matching what we have for loop vectorization. For example, we may want to have a special VPlan (VPlanSLP) to represent the SLP planning information, a VPlanSLPBuilder, a VPlanSLPContext (if necessary)... We don't have to do it as part of this patch but it's something to keep in mind.
lib/Transforms/Vectorize/VPlanSLP.cpp
63	Would it make sense to move this to VPValue?
67	Some doc would be great. Same for some of the methods below.
211	If opcodes are the same at this point, shouldn't we keep only A or B checks but not both? You could add `assert(A->getOpcode() == B->getOpcode())` to be safe.
220	It would be great if you could elaborate a bit on what 'score' is exactly.

ashahid added a subscriber: ashahid. Aug 6 2018, 10:29 PM

Addressed comments, thanks!

lib/Transforms/Vectorize/VPlan.h
663	Removed, it was a leftover from an earlier version.
1543	I think it is possible to provide a single implementation for SmallVectors. I've removed the implementation here and will upload a separate patch for it soon.

pranaviyala added a subscriber: pranaviyala. Aug 13 2018, 12:27 PM

In D49491#1190505, @dcaballe wrote:

In D49491#1184938, @fhahn wrote:

What do you think is the best way going forward overall: 1) evolving loop aware SLP initially in VPlan (independent of SLPVectorizer) and start using VPlan infrastructure for SLPVectorizer as it becomes available or 2) trying to share as much code between both from the start? Personally I think 1) would be better, as it allows us to evolve the VPlan parts quicker and I would like to avoid unnecessary code churn in SLPVectorizer and start using VPlan infrastructure there as it becomes clearly beneficial.

I agree with you. I think it's better to have an initial end-to-end loop aware SLP PoC, properly define the representation of SLP operations, define how SLP transformation comes into play and interacts with the rest of the VPlan infrastructure, etc., before we move forward on SLPVectorizer. In the VPlan native path we have some room for experimentation. We can take advantage of that and start with SLPVectorizer once we have a solid model for SLP in VPlan.

Yep! I think it would be good to get wider agreement of the direction. Is there anything I should look at/elaborate in more detail/summarize? Concerns with this approach that need to be addressed? It would be great to get some more feedback on this :)

In D49491#1166978, @rengolin wrote:

This is why we haven't gone too deep in the analysis yet. Sharing code between SLPVec, which operates on IR, and VPlanSLP, which operates on VPInstructions, can be confusing, limiting or even impossible. The analysis part mostly works, because VPInstruction is similar enough to Inst, but the implementation and, worse, the heuristics can go wrong very quickly.

In terms of re-use, Value/VPValue templatization could work for some analyses. Templatizing IRBuilder/VPlanBuilder might also work for some parts of the implementation. However, we may want to think carefully. If templatization required the whole VPValue hierarchy and its internal details to be a carbon copy of the Value hierarchy, we would lose all the flexibility that VPValue/VPInstruction bring to VPlan. Of course, we would have to investigate this a bit more. Another alternative I had in mind is introducing some kind of InstructionTraits. That would give us a unified interface while allowing more flexibility at the VPValue/Value implementation level. Unfortunately, we haven't had time to investigate this approach further.

Right, I do not think we should restrict ourselves too much by coupling things together too early. In terms of analysis, we currently have access to def/uses, operands and opcodes in Instruction/VPInstruction, and also to DominatorTrees and LoopInfo for IR/VPlan, which should be enough for a set of analyses.

ping

Rebased and added comments.

@RKSimon, @rengolin, @ABataev: after the previous discussion, does getting this analysis in for VPlan only initially make sense to you? Or is there anything more I should investigate/write up?

fhahn added a parent revision: D52312: [DenseMapInfo] Add implementation for SmallVector of pointers. Sep 20 2018, 9:25 AM

fhahn added inline comments.

lib/Transforms/Vectorize/VPlan.h
1543	I've added a patch implementing DenseMap key info for SmallVector: D52312
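D52312 adds DenseMapInfo for SmallVectors of pointers so that operand bundles can key the BundleToCombined map. The same idea with standard containers, as a sketch: `BundleHash` and `getOrCreateCombined` are hypothetical, and the hash-combining constant is a common convention rather than what LLVM uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <vector>

struct Value {}; // hypothetical stand-in for VPValue

// Hash a bundle by combining the per-element pointer hashes, similar in
// purpose to a DenseMapInfo specialization for SmallVector<T *, N>.
struct BundleHash {
  std::size_t operator()(const std::vector<Value *> &Bundle) const {
    std::size_t H = Bundle.size();
    for (Value *V : Bundle)
      H ^= std::hash<Value *>{}(V) + 0x9e3779b9 + (H << 6) + (H >> 2);
    return H;
  }
};

// Maps an ordered operand bundle to the combined (wide) value created for it.
using BundleMap = std::unordered_map<std::vector<Value *>, Value *, BundleHash>;

// Returns the existing combined value for Bundle, or creates one. Owned keeps
// the created values alive; their addresses stay stable across push_back.
static Value *getOrCreateCombined(BundleMap &BundleToCombined,
                                  const std::vector<Value *> &Bundle,
                                  std::vector<std::unique_ptr<Value>> &Owned) {
  assert(!Bundle.empty() && "empty bundle");
  auto [It, Inserted] = BundleToCombined.try_emplace(Bundle, nullptr);
  if (Inserted) {
    Owned.push_back(std::make_unique<Value>());
    It->second = Owned.back().get();
  }
  return It->second;
}
```

Bundles are ordered per lane, so the same values in reversed lane order form a different key. This sketch requires C++17.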

vporpo added inline comments. Sep 20 2018, 4:18 PM

lib/Transforms/Vectorize/VPlanSLP.cpp
135	Could we have other non-store VPInstructions that could touch memory (e.g., calls)? Maybe it is safer to add a check whether the underlying IR instruction mayWriteToMemory().
141	I think we should also check that the operands are (i) all in the same BB, and (ii) in the same BB as the seed instructions (at least for now). A lit test for this would also be nice.
346	If I am not mistaken, since we are only allowing single user nodes in the graph, the graph is actually a tree. So this should not happen. We should have an assertion here to check this.
351	Use LLVM_DEBUG for dumpBundle()
364	Use LLVM_DEBUG for dumpBundle()
406	Hmm why no buildGraph() recursion here ?

Thanks Vasileios! I hope I addressed all comments. The only follow up I need to do is the check that everything is in the same BB as the seed instructions, which I'll do tomorrow.

lib/Transforms/Vectorize/VPlanSLP.cpp
135	I've added a simple implementation for mayWriteToMemory to VPInstruction. We could just use mayWriteToMemory of the underlying instruction, but I think this would unnecessarily increase coupling of VPInstruction and IR instructions. What do you think?
141	I've added a check for (i) and I'll add a check for (ii) tomorrow. I'll think a bit more about the best way to do it. I've added a unit test for (i), and will add IR tests once VPlan SLP is hooked up to the VPlan-native code path.
346	Currently we check for unique users. For example, in testSlpReuse_1 we have 2 add instructions that add the same value. In that case we should re-use the already created combined instruction, I think.
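A minimal sketch of the two checks discussed for lines 135 and 141, assuming a hypothetical opcode enum and block type in place of VPInstruction/VPBasicBlock: an opcode-based `mayWriteToMemory()` that stays conservative for stores and calls without consulting an underlying IR instruction, and a test that every instruction in a bundle lives in one given block. None of these names are the patch's API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical opcodes and block type.
enum class Opcode { Load, Store, Call, Add, Mul };
struct Block {};

struct Inst {
  Opcode Op;
  const Block *Parent;
};

// Decided from the opcode alone; stores and calls are assumed to write.
static bool mayWriteToMemory(const Inst &I) {
  switch (I.Op) {
  case Opcode::Load:
  case Opcode::Add:
  case Opcode::Mul:
    return false;
  case Opcode::Store:
  case Opcode::Call:
    return true;
  }
  return true; // be conservative for anything unexpected
}

// For now, all instructions of a bundle must be in the seeds' block.
static bool allInBlock(const std::vector<const Inst *> &Bundle, const Block *BB) {
  assert(BB && "expected a block");
  for (const Inst *I : Bundle)
    if (I->Parent != BB)
      return false;
  return true;
}
```

Keeping the opcode switch exhaustive means a newly added opcode has to be classified explicitly instead of silently being treated as read-only.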

Hi Florian, yes I am happy with the changes, thanks!
Just please try to add an assertion that checks that the graph is a tree (see inline comment) to make sure that the graph is the one we expect.

lib/Transforms/Vectorize/VPlanSLP.cpp
135	I agree it looks better this way.
346	Ah yes, you are checking for unique users, so it should work the way you are describing. If I understand correctly, the graph can still be considered a tree since the only case when two nodes are connected by more than one path is when the user has two identical operands. If this is the case, I still think that there should be some kind of assertion here checking that the graph is a tree and not a DAG, because SLP on a DAG is slightly different and would need a few more components to work correctly. I think that the only case where (I != BundleToCombined.end()) is true is the one you are describing: when the user instructions have identical operands. So maybe add an assertion that checks that the single users of Values have identical operands and a comment like "For now buildGraph should form a tree, not a DAG.".
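A minimal sketch of the property vporpo describes, under the assumption that the combined nodes can be modelled as a plain operand graph; `Node` and `isTreeModuloDuplicateOperands` are hypothetical names, and this is not the assertion that landed in the patch. Every node may have at most one distinct user: a user listing the same operand twice (as in `add x, x`) still passes, while an operand shared by two different users turns the graph into a DAG and fails the check.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical combined node: only the nodes it uses as operands.
struct Node {
  std::vector<Node *> Operands;
};

// Returns true if every node reachable from Root has at most one distinct
// user. A user that lists the same operand twice counts once, so that case
// passes; a node shared by two different users makes the graph a DAG.
bool isTreeModuloDuplicateOperands(Node *Root) {
  std::map<Node *, std::set<Node *>> Users;
  std::set<Node *> Visited = {Root};
  std::vector<Node *> Worklist = {Root};
  while (!Worklist.empty()) {
    Node *N = Worklist.back();
    Worklist.pop_back();
    for (Node *Op : N->Operands) {
      std::set<Node *> &OpUsers = Users[Op];
      OpUsers.insert(N);
      if (OpUsers.size() > 1)
        return false;
      if (Visited.insert(Op).second)
        Worklist.push_back(Op);
    }
  }
  return true;
}
```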

Added an assertion to ensure we only re-use nodes if the users of all values are equal, and limited the instructions considered to a single BB for now. I think this should now address all of @vporpo's comments, thank you very much!

fhahn marked an inline comment as done. Oct 3 2018, 10:23 AM

ping

ABataev added inline comments. Oct 31 2018, 10:52 AM

lib/Transforms/Vectorize/VPlanSLP.cpp
119	Seems to me the comment does not match the functionality
301	Try `FinalOrder.emplace_back(Operands.first, Operands.second[0]);`

Address comments, thanks!

fhahn added inline comments. Nov 1 2018, 1:29 PM

lib/Transforms/Vectorize/VPlanSLP.cpp
119	I've updated the comment and removed the code to check if all instructions are in the same BB. That's checked earlier on. Does it make sense now?
301	I tried, but neither that nor `emplace_back(Operands.first, {Operands.second[0]});` gets accepted, unfortunately.
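Why neither spelling compiles can be shown with a standard-library analogue. This sketch assumes the element type is a pair of a pointer and a small vector, like `MultiNodeOpTy`; `Value`, `OpTy`, and `buildFinalOrder` are hypothetical. A braced-init-list has no type, so the forwarding parameters of `emplace_back` cannot deduce anything from it, and a vector has no constructor taking a single pointer. `push_back` works because its parameter type is the already-known element type.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Value {};  // hypothetical stand-in for VPValue

// Hypothetical analogue of MultiNodeOpTy: a user and its operand bundle.
using OpTy = std::pair<Value *, std::vector<Value *>>;

std::vector<OpTy> buildFinalOrder(Value *User, Value *Op) {
  std::vector<OpTy> FinalOrder;
  // Ill-formed: {Op} has no type, so emplace_back cannot deduce it.
  //   FinalOrder.emplace_back(User, {Op});
  // Ill-formed: std::vector<Value *> cannot be constructed from a Value *.
  //   FinalOrder.emplace_back(User, Op);

  // OK: push_back takes an OpTy, so the nested braces initialize the pair.
  FinalOrder.push_back({User, {Op}});
  // OK: spell out the type of the second element for emplace_back.
  FinalOrder.emplace_back(User, std::vector<Value *>{Op});
  return FinalOrder;
}
```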

What about non-vectorizable loads/stores, e.g. atomic or volatile ones? Is it aware of those kinds of instructions?

lib/Transforms/Vectorize/VPlanSLP.cpp
72	use `try_emplace(to_vector<4>(Operands), New);` instead
119	`are no instructions`?
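A standard-library analogue of `addCombined()` with the suggested `try_emplace` call, assuming a `std::map` keyed by `std::vector` can stand in for a `DenseMap` keyed by `to_vector<4>(Operands)`. `Value`, `Combined`, and `BundleMap` are hypothetical. `try_emplace` inserts only when the key is absent and reports through `.second` whether it inserted, which is what the assertion checks; the width bookkeeping mirrors `WidestBundleBits`. It needs C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-ins: a scalar value with a bit width, and a combined node.
struct Value {
  unsigned Bits;
};
struct Combined {};

struct BundleMap {
  std::map<std::vector<Value *>, Combined *> BundleToCombined;
  unsigned WidestBundleBits = 0;

  // Record New as the combined node for bundle Ops and track the total width
  // in bits of the widest bundle seen so far.
  void addCombined(const std::vector<Value *> &Ops, Combined *New) {
    unsigned BundleSize = 0;
    for (const Value *V : Ops)
      BundleSize += V->Bits;
    WidestBundleBits = std::max(WidestBundleBits, BundleSize);

    auto Res = BundleToCombined.try_emplace(Ops, New);
    assert(Res.second && "Already created a combined node for the bundle");
    (void)Res;
  }
};
```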

Address comments and add checks for simple loads/stores in areVectorizable().
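`LoadInst::isSimple()` and `StoreInst::isSimple()` return true only for accesses that are neither atomic nor volatile, so requiring `isSimple()` on every lane rejects the cases ABataev raised. A standalone sketch of that bundle-level check; `MemAccess` and `allSimple` are hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of a memory access; LLVM's LoadInst/StoreInst expose
// the same information through isAtomic() and isVolatile().
struct MemAccess {
  bool Atomic = false;
  bool Volatile = false;
  // Same meaning as isSimple(): neither atomic nor volatile.
  bool isSimple() const { return !Atomic && !Volatile; }
};

// A bundle of loads (or stores) is only a candidate if every lane is simple.
bool allSimple(const std::vector<MemAccess> &Bundle) {
  return std::all_of(Bundle.begin(), Bundle.end(),
                     [](const MemAccess &A) { return A.isSimple(); });
}
```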

Herald added a subscriber: jfb. Nov 7 2018, 9:54 AM

Just one more question: how do you estimate that the vectorized code is more effective than the scalar one? What's the criterion?

lib/Transforms/Vectorize/VPlanValue.h
169	Capitalize `i`, must be `I`

In D49491#1291680, @ABataev wrote:

Just one more question: how do you estimate that the vectorized code is more effective than the scalar one? What's the criterion?

As a first step, this patch only adds support for the transform, without cost modelling. The transform is intended as a VPlan-to-VPlan (VP2VP) transformation, and the general idea is to create multiple VPlans (say, one with all the SLP opportunities applied and one with none), evaluate the cost of the resulting VPlans at the end, and choose the best one.

Together with @sguggill , we gave a brief overview of VP2VP transforms at the dev meeting (http://llvm.org/devmtg/2018-10/talk-abstracts.html#talk21) and this is something I want to write up a bit better in the actual VPlan design documentation.
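The plan-selection idea can be sketched generically. This is an illustration only: the patch itself adds no cost model, and `Candidate`, its `Cost` field, and `pickCheapest` are hypothetical. Each candidate VPlan (for example, one with all SLP opportunities applied and one with none) carries an estimated cost, and the cheapest one is kept.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of a candidate VPlan after a VP2VP transform.
struct Candidate {
  std::string Name;
  unsigned Cost;  // estimate from some future VPlan-based cost model
};

// Return the index of the cheapest candidate; ties keep the earlier one.
std::size_t pickCheapest(const std::vector<Candidate> &Plans) {
  assert(!Plans.empty() && "need at least one plan");
  std::size_t Best = 0;
  for (std::size_t I = 1; I < Plans.size(); ++I)
    if (Plans[I].Cost < Plans[Best].Cost)
      Best = I;
  return Best;
}
```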

I have no more comments.

In D49491#1291871, @ABataev wrote:

I have no more comments.

Great, thanks for all the comments!

@RKSimon @rengolin @ABataev are you happy with the direction of getting this in as VPlan-only transform initially?

From the VPlan point of view, LGTM. I don't have any other comments.

Thanks, Florian!
Diego

Just had a look again and it's looking great, thanks Florian!

The unit tests help a lot in understanding what's the gist of the transformation and I really like where this is going.

I think time-wise, @RKSimon's concerns were addressed (i.e. we're not going to switch to this any time soon, and certainly not without making sure all his work is kept/moved in).

Given that most people are happy with it, and there are no objections, I'll go on and approve it.

Thanks everyone!

This revision is now accepted and ready to land. Nov 12 2018, 12:53 PM

Thanks, everyone! Important first step towards the great converged Loop+SLP vectorizer.

In D49491#1295967, @rengolin wrote:

I think time-wise, @RKSimon's concerns were addressed (i.e. we're not going to switch to this any time soon, and certainly not without making sure all his work is kept/moved in).

Yup, I have no objections - cheers!

Thanks for all the review!

I've pulled in the DenseMapInfo changes from D52312 into this patch again, as making it generic requires some additional work. I will open an issue to follow up on this.

I'll commit this patch once D49489 gets accepted and committed.

fhahn mentioned this in D52312: [DenseMapInfo] Add implementation for SmallVector of pointers. Nov 13 2018, 7:30 AM

Committed as rL346857, thanks for all the comments! I will create a few tickets for follow up work.

Revision Contents

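Path	Size
lib/Transforms/Vectorize/CMakeLists.txt	1 line
lib/Transforms/Vectorize/VPlan.h	125 lines
lib/Transforms/Vectorize/VPlan.cpp	13 lines
lib/Transforms/Vectorize/VPlanSLP.cpp	469 lines
lib/Transforms/Vectorize/VPlanValue.h	16 lines
unittests/Transforms/Vectorize/CMakeLists.txt	1 line
unittests/Transforms/Vectorize/VPlanSlpTest.cpp	899 lines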

Diff 173830

lib/Transforms/Vectorize/CMakeLists.txt

	add_llvm_library(LLVMVectorize			add_llvm_library(LLVMVectorize
	LoadStoreVectorizer.cpp			LoadStoreVectorizer.cpp
	LoopVectorizationLegality.cpp			LoopVectorizationLegality.cpp
	LoopVectorize.cpp			LoopVectorize.cpp
	SLPVectorizer.cpp			SLPVectorizer.cpp
	Vectorize.cpp			Vectorize.cpp
	VPlan.cpp			VPlan.cpp
	VPlanHCFGBuilder.cpp			VPlanHCFGBuilder.cpp
	VPlanHCFGTransforms.cpp			VPlanHCFGTransforms.cpp
				VPlanSLP.cpp
	VPlanVerifier.cpp			VPlanVerifier.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms			${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms

	DEPENDS			DEPENDS
	intrinsics_gen			intrinsics_gen
	)			)

lib/Transforms/Vectorize/VPlan.h

(54 unchanged lines not shown)
class InnerLoopVectorizer;		class InnerLoopVectorizer;
template <class T> class InterleaveGroup;		template <class T> class InterleaveGroup;
class LoopInfo;		class LoopInfo;
class raw_ostream;		class raw_ostream;
class Value;		class Value;
class VPBasicBlock;		class VPBasicBlock;
class VPRegionBlock;		class VPRegionBlock;
class VPlan;		class VPlan;
		class VPlanSlp;

/// A range of powers-of-2 vectorization factors with fixed start and		/// A range of powers-of-2 vectorization factors with fixed start and
/// adjustable end. The range includes start and excludes end, e.g.,:		/// adjustable end. The range includes start and excludes end, e.g.,:
/// [1, 9) = {1, 2, 4, 8}		/// [1, 9) = {1, 2, 4, 8}
struct VFRange {		struct VFRange {
// A power of 2.		// A power of 2.
const unsigned Start;		const unsigned Start;

(533 unchanged lines not shown)
};		};

/// This is a concrete Recipe that models a single VPlan-level instruction.		/// This is a concrete Recipe that models a single VPlan-level instruction.
/// While as any Recipe it may generate a sequence of IR instructions when		/// While as any Recipe it may generate a sequence of IR instructions when
/// executed, these instructions would always form a single-def expression as		/// executed, these instructions would always form a single-def expression as
/// the VPInstruction is also a single def-use vertex.		/// the VPInstruction is also a single def-use vertex.
class VPInstruction : public VPUser, public VPRecipeBase {		class VPInstruction : public VPUser, public VPRecipeBase {
friend class VPlanHCFGTransforms;		friend class VPlanHCFGTransforms;
		friend class VPlanSlp;

public:		public:
/// VPlan opcodes, extending LLVM IR with idiomatics instructions.		/// VPlan opcodes, extending LLVM IR with idiomatics instructions.
enum { Not = Instruction::OtherOpsEnd + 1, ICmpULE };		enum {
		Not = Instruction::OtherOpsEnd + 1,
		ICmpULE,
		SLPLoad,
		dcaballe: To be more specific, what about SLPLoad/SLPStore instead?
		SLPStore,
		};

private:		private:
typedef unsigned char OpcodeTy;		typedef unsigned char OpcodeTy;
OpcodeTy Opcode;		OpcodeTy Opcode;

/// Utility method serving execute(): generates a single instance of the		/// Utility method serving execute(): generates a single instance of the
/// modeled instruction.		/// modeled instruction.
void generateInstruction(VPTransformState &State, unsigned Part);		void generateInstruction(VPTransformState &State, unsigned Part);

		protected:
		Instruction *getUnderlyingInstr() {
		return cast_or_null<Instruction>(getUnderlyingValue());
		}

		void setUnderlyingInstr(Instruction *I) { setUnderlyingValue(I); }

public:		public:
VPInstruction(unsigned Opcode, ArrayRef<VPValue *> Operands)		VPInstruction(unsigned Opcode, ArrayRef<VPValue *> Operands)
: VPUser(VPValue::VPInstructionSC, Operands),		: VPUser(VPValue::VPInstructionSC, Operands),
VPRecipeBase(VPRecipeBase::VPInstructionSC), Opcode(Opcode) {}		VPRecipeBase(VPRecipeBase::VPInstructionSC), Opcode(Opcode) {}

VPInstruction(unsigned Opcode, std::initializer_list<VPValue *> Operands)		VPInstruction(unsigned Opcode, std::initializer_list<VPValue *> Operands)
: VPInstruction(Opcode, ArrayRef<VPValue *>(Operands)) {}		: VPInstruction(Opcode, ArrayRef<VPValue *>(Operands)) {}

/// Method to support type inquiry through isa, cast, and dyn_cast.		/// Method to support type inquiry through isa, cast, and dyn_cast.
static inline bool classof(const VPValue *V) {		static inline bool classof(const VPValue *V) {
return V->getVPValueID() == VPValue::VPInstructionSC;		return V->getVPValueID() == VPValue::VPInstructionSC;
}		}

		VPInstruction *clone() const {
		SmallVector<VPValue *, 2> Operands(operands());
		return new VPInstruction(Opcode, Operands);
		}
		ABataev: `VPInstruction *clone() const`

/// Method to support type inquiry through isa, cast, and dyn_cast.		/// Method to support type inquiry through isa, cast, and dyn_cast.
static inline bool classof(const VPRecipeBase *R) {		static inline bool classof(const VPRecipeBase *R) {
return R->getVPRecipeID() == VPRecipeBase::VPInstructionSC;		return R->getVPRecipeID() == VPRecipeBase::VPInstructionSC;
}		}

unsigned getOpcode() const { return Opcode; }		unsigned getOpcode() const { return Opcode; }

		dcaballe: I wonder if having this interface is safe. Wouldn't it be problematic if we change the opcode of a VPInstruction object to an opcode of an operation which is represented with a VPInstruction subclass? Maybe it's better to not allow this and just create a VPInstruction using VPlanBuilder?
		fhahn: Removed, it was a leftover from an earlier version.
/// Generate the instruction.		/// Generate the instruction.
/// TODO: We currently execute only per-part unless a specific instance is		/// TODO: We currently execute only per-part unless a specific instance is
/// provided.		/// provided.
void execute(VPTransformState &State) override;		void execute(VPTransformState &State) override;

/// Print the Recipe.		/// Print the Recipe.
void print(raw_ostream &O, const Twine &Indent) const override;		void print(raw_ostream &O, const Twine &Indent) const override;

/// Print the VPInstruction.		/// Print the VPInstruction.
void print(raw_ostream &O) const;		void print(raw_ostream &O) const;

		/// Return true if this instruction may modify memory.
		ABataev: Make the function `const`
		fhahn: Agreed that it should be const. Before I can change that here, I need to make some of the underlying/connected functions const.
		dcaballe: Make it protected, per the design principle described in getUnderlyingValue in VPlanValue.h? Same for setUnderlyingInstr.
		bool mayWriteToMemory() const {
		// TODO: we can use attributes of the called function to rule out memory
		// modifications.
		return Opcode == Instruction::Store || Opcode == Instruction::Call ||
		RKSimon: No need for assert - cast will catch it. Could you use cast_or_null<Instruction>(getUnderlyingValue()) directly?
		Opcode == Instruction::Invoke || Opcode == SLPStore;
		}
};		};

/// VPWidenRecipe is a recipe for producing a copy of vector type for each		/// VPWidenRecipe is a recipe for producing a copy of vector type for each
/// Instruction in its ingredients independently, in order. This recipe covers		/// Instruction in its ingredients independently, in order. This recipe covers
/// most of the traditional vectorization cases where each ingredient transforms		/// most of the traditional vectorization cases where each ingredient transforms
/// into a vectorized version of itself.		/// into a vectorized version of itself.
class VPWidenRecipe : public VPRecipeBase {		class VPWidenRecipe : public VPRecipeBase {
private:		private:
(840 unchanged lines not shown)	public:
InterleaveGroup<VPInstruction> *		InterleaveGroup<VPInstruction> *
getInterleaveGroup(VPInstruction *Instr) const {		getInterleaveGroup(VPInstruction *Instr) const {
if (InterleaveGroupMap.count(Instr))		if (InterleaveGroupMap.count(Instr))
return InterleaveGroupMap.find(Instr)->second;		return InterleaveGroupMap.find(Instr)->second;
return nullptr;		return nullptr;
}		}
};		};

		/// Class that maps (parts of) an existing VPlan to trees of combined
		dcaballe: (parts) of -> (parts of)?
		/// VPInstructions.
		class VPlanSlp {
		private:
		enum class OpMode { Failed, Load, Opcode };

		dcaballe: SCEV -> VPValue? I guess we couldn't have a single templatized version for sorted SmallVectors of any pointer, right?
		fhahn: I think it is possible to provide a single implementation for SmallVectors. I've removed the implementation here and will upload a separate patch for it soon.
		fhahn: I've added a patch implementing DenseMap key info for SmallVector: D52312
		/// A DenseMapInfo implementation for using SmallVector<VPValue *, 4> as
		/// DenseMap keys.
		struct BundleDenseMapInfo {
		static SmallVector<VPValue *, 4> getEmptyKey() {
		return {reinterpret_cast<VPValue *>(-1)};
		}

		static SmallVector<VPValue *, 4> getTombstoneKey() {
		return {reinterpret_cast<VPValue *>(-2)};
		}

		static unsigned getHashValue(const SmallVector<VPValue *, 4> &V) {
		return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));
		}

		static bool isEqual(const SmallVector<VPValue *, 4> &LHS,
		const SmallVector<VPValue *, 4> &RHS) {
		return LHS == RHS;
		}
		};

		/// Mapping of values in the original VPlan to a combined VPInstruction.
		DenseMap<SmallVector<VPValue *, 4>, VPInstruction *, BundleDenseMapInfo>
		BundleToCombined;

		VPInterleavedAccessInfo &IAI;

		/// Basic block to operate on. For now, only instructions in a single BB are
		/// considered.
		const VPBasicBlock &BB;

		/// Indicates whether we managed to combine all visited instructions or not.
		bool CompletelySLP = true;

		/// Width of the widest combined bundle in bits.
		unsigned WidestBundleBits = 0;

		using MultiNodeOpTy =
		typename std::pair<VPInstruction *, SmallVector<VPValue *, 4>>;

		// Input operand bundles for the current multi node. Each multi node operand
		// bundle contains values not matching the multi node's opcode. They will
		// be reordered in reorderMultiNodeOps, once we completed building a
		// multi node.
		RKSimon: Please can you add description comments for all these member variables and public functions?
		fhahn: Will do tomorrow!
		SmallVector<MultiNodeOpTy, 4> MultiNodeOps;

		/// Indicates whether we are building a multi node currently.
		ABataev: No need to initialize `BundleToCombined` here, it will be initialized by default
		bool MultiNodeActive = false;

		/// Check if we can vectorize Operands together.
		bool areVectorizable(ArrayRef<VPValue *> Operands) const;

		dcaballe: formatting
		/// Add combined instruction \p New for the bundle \p Operands.
		ABataev: `const`
		void addCombined(ArrayRef<VPValue *> Operands, VPInstruction *New);

		/// Indicate we hit a bundle we failed to combine. Returns nullptr for now.
		VPInstruction *markFailed();

		/// Reorder operands in the multi node to maximize sequential memory access
		/// and commutative operations.
		SmallVector<MultiNodeOpTy, 4> reorderMultiNodeOps();

		/// Choose the best candidate to use for the lane after \p Last. The set of
		dcaballe: Something to think about is the long-term future of this class. I wonder if it would be better to split it into multiple ones, matching what we have for loop vectorization. For example, we may want to have a special VPlan (VPlanSLP) to represent the SLP planning information, a VPlanSLPBuilder, a VPlanSLPContext (if necessary)... We don't have to do it as part of this patch but it's something to keep in mind.
		/// candidates to choose from are values with an opcode matching \p Last's
		/// or loads consecutive to \p Last.
		std::pair<OpMode, VPValue *> getBest(OpMode Mode, VPValue *Last,
		SmallVectorImpl<VPValue *> &Candidates,
		VPInterleavedAccessInfo &IAI);

		/// Print bundle \p Values to dbgs().
		void dumpBundle(ArrayRef<VPValue *> Values);

		public:
		VPlanSlp(VPInterleavedAccessInfo &IAI, VPBasicBlock &BB) : IAI(IAI), BB(BB) {}

		~VPlanSlp() {
		for (auto &KV : BundleToCombined)
		delete KV.second;
		}

		/// Tries to build an SLP tree rooted at \p Operands and returns a
		/// VPInstruction combining \p Operands, if they can be combined.
		VPInstruction *buildGraph(ArrayRef<VPValue *> Operands);

		/// Return the width of the widest combined bundle in bits.
		unsigned getWidestBundleBits() const { return WidestBundleBits; }

		/// Return true if all visited instruction can be combined.
		bool isCompletelySLP() const { return CompletelySLP; }
		};
} // end namespace llvm		} // end namespace llvm

#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H		#endif // LLVM_TRANSFORMS_VECTORIZE_VPLAN_H
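`BundleDenseMapInfo` lets a `DenseMap` be keyed by a whole bundle: bundles with the same values in the same lane order must hash and compare equal, and two reserved keys mark empty and deleted slots. A standard-library analogue needs only the hash and the equality, because `std::unordered_map` has no sentinel keys. In the sketch below, `Value`, `Bundle`, `BundleHash`, and `BundleToCombinedMap` are hypothetical, and the boost-style mixing step is an assumption standing in for LLVM's `hash_combine_range`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

struct Value {};  // hypothetical stand-in for VPValue

using Bundle = std::vector<Value *>;

// Order-sensitive hash over the pointers in a bundle.
struct BundleHash {
  std::size_t operator()(const Bundle &B) const {
    std::size_t H = 0;
    for (Value *V : B)
      H ^= std::hash<Value *>()(V) + 0x9e3779b9 + (H << 6) + (H >> 2);
    return H;
  }
};

// Equality is std::vector's operator==, like isEqual in BundleDenseMapInfo.
// The mapped Value * stands in for the combined VPInstruction *.
using BundleToCombinedMap = std::unordered_map<Bundle, Value *, BundleHash>;
```

Because the hash and equality are order-sensitive, `{a, b}` and `{b, a}` are different keys, which matches treating lane order as significant.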

lib/Transforms/Vectorize/VPlan.cpp

(332 unchanged lines not shown)	void VPInstruction::print(raw_ostream &O) const {

switch (getOpcode()) {		switch (getOpcode()) {
case VPInstruction::Not:		case VPInstruction::Not:
O << "not";		O << "not";
break;		break;
case VPInstruction::ICmpULE:		case VPInstruction::ICmpULE:
O << "icmp ule";		O << "icmp ule";
break;		break;
		case VPInstruction::SLPLoad:
		O << "combined load";
		break;
		case VPInstruction::SLPStore:
		O << "combined store";
		break;
default:		default:
O << Instruction::getOpcodeName(getOpcode());		O << Instruction::getOpcodeName(getOpcode());
}		}

for (const VPValue *Operand : operands()) {		for (const VPValue *Operand : operands()) {
O << " ";		O << " ";
Operand->printAsOperand(O);		Operand->printAsOperand(O);
}		}
(327 unchanged lines not shown)	if (User) {
O << ", ";		O << ", ";
User->getOperand(0)->printAsOperand(O);		User->getOperand(0)->printAsOperand(O);
}		}
O << "\\l\"";		O << "\\l\"";
}		}

template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);		template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);

		void VPValue::replaceAllUsesWith(VPValue *New) {
		RKSimon: `for (unsigned i = 0, e = User->getNumOperands(); i < e; i++) {`
		ABataev: 1. `i` -> `I`, `e` -> `E` per coding standard. 2. Use `++I` instead of `i++` per coding standard.
		for (VPUser *User : users())
		for (unsigned I = 0, E = User->getNumOperands(); I < E; ++I)
		if (User->getOperand(I) == this)
		User->setOperand(I, New);
		RKSimon: Remove unnecessary braces?
		}

void VPInterleavedAccessInfo::visitRegion(VPRegionBlock *Region,		void VPInterleavedAccessInfo::visitRegion(VPRegionBlock *Region,
Old2NewTy &Old2New,		Old2NewTy &Old2New,
InterleavedAccessInfo &IAI) {		InterleavedAccessInfo &IAI) {
ReversePostOrderTraversal<VPBlockBase *> RPOT(Region->getEntry());		ReversePostOrderTraversal<VPBlockBase *> RPOT(Region->getEntry());
for (VPBlockBase *Base : RPOT) {		for (VPBlockBase *Base : RPOT) {
visitBlock(Base, Old2New, IAI);		visitBlock(Base, Old2New, IAI);
}		}
}		}

void VPInterleavedAccessInfo::visitBlock(VPBlockBase *Block, Old2NewTy &Old2New,		void VPInterleavedAccessInfo::visitBlock(VPBlockBase *Block, Old2NewTy &Old2New,
InterleavedAccessInfo &IAI) {		InterleavedAccessInfo &IAI) {
if (VPBasicBlock *VPBB = dyn_cast<VPBasicBlock>(Block)) {		if (VPBasicBlock *VPBB = dyn_cast<VPBasicBlock>(Block)) {
for (VPRecipeBase &VPI : *VPBB) {		for (VPRecipeBase &VPI : *VPBB) {
assert(isa<VPInstruction>(&VPI) && "Can only handle VPInstructions");		assert(isa<VPInstruction>(&VPI) && "Can only handle VPInstructions");
auto *VPInst = cast<VPInstruction>(&VPI);		auto *VPInst = cast<VPInstruction>(&VPI);
auto *Inst = cast<Instruction>(VPInst->getUnderlyingValue());		auto *Inst = cast<Instruction>(VPInst->getUnderlyingValue());
		ABataev: `const auto *`
auto *IG = IAI.getInterleaveGroup(Inst);		auto *IG = IAI.getInterleaveGroup(Inst);
if (!IG)		if (!IG)
continue;		continue;

auto NewIGIter = Old2New.find(IG);		auto NewIGIter = Old2New.find(IG);
if (NewIGIter == Old2New.end())		if (NewIGIter == Old2New.end())
Old2New[IG] = new InterleaveGroup<VPInstruction>(		Old2New[IG] = new InterleaveGroup<VPInstruction>(
IG->getFactor(), IG->isReverse(), IG->getAlignment());		IG->getFactor(), IG->isReverse(), IG->getAlignment());
(20 unchanged lines not shown)
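`VPValue::replaceAllUsesWith()` in the diff above rewrites every operand slot of every user that refers to the old value, including repeated slots such as `add x, x`. A minimal standalone sketch of the same idea; `Val`, `User`, and the free function are hypothetical, and moving the user records from the old value to the new one is an assumption of this sketch rather than something the diff shows.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct User;

// Hypothetical value that records each distinct user once.
struct Val {
  std::vector<User *> Users;
};

// Hypothetical user with an ordered list of operand slots.
struct User {
  std::vector<Val *> Operands;
};

// Point every operand slot that refers to Old at New, then hand Old's user
// records over to New. Iterating Old->Users is safe because the first loop
// only writes operand slots, never a Users vector.
void replaceAllUsesWith(Val *Old, Val *New) {
  if (Old == New)
    return;
  for (User *U : Old->Users)
    for (Val *&Op : U->Operands)
      if (Op == Old)
        Op = New;
  for (User *U : Old->Users)
    if (std::find(New->Users.begin(), New->Users.end(), U) == New->Users.end())
      New->Users.push_back(U);
  Old->Users.clear();
}
```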

lib/Transforms/Vectorize/VPlanSLP.cpp

This file was added.

				//===- VPlanSLP.cpp - SLP Analysis based on VPlan -------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				/// This file implements SLP analysis based on VPlan. The analysis is based on
				/// the ideas described in
				///
				/// Look-ahead SLP: auto-vectorization in the presence of commutative
				/// operations, CGO 2018 by Vasileios Porpodas, Rodrigo C. O. Rocha,
				/// Luís F. W. Góes
				///
				//===----------------------------------------------------------------------===//

				#include "VPlan.h"
				#include "llvm/ADT/DepthFirstIterator.h"
				#include "llvm/ADT/PostOrderIterator.h"
				#include "llvm/ADT/SmallVector.h"
				#include "llvm/ADT/Twine.h"
				#include "llvm/Analysis/LoopInfo.h"
				#include "llvm/Analysis/VectorUtils.h"
				#include "llvm/IR/BasicBlock.h"
				#include "llvm/IR/CFG.h"
				#include "llvm/IR/Dominators.h"
				#include "llvm/IR/InstrTypes.h"
				#include "llvm/IR/Instruction.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/Type.h"
				#include "llvm/IR/Value.h"
				#include "llvm/Support/Casting.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/ErrorHandling.h"
				#include "llvm/Support/GraphWriter.h"
				#include "llvm/Support/raw_ostream.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include <cassert>
				#include <iterator>
				#include <string>
				#include <vector>

				using namespace llvm;

				#define DEBUG_TYPE "vplan-slp"

				// Number of levels to look ahead when re-ordering multi node operands.
				static unsigned LookaheadMaxDepth = 5;

				VPInstruction *VPlanSlp::markFailed() {
				// FIXME: Currently this is used to signal we hit instructions we cannot
				// trivially SLP'ize.
				CompletelySLP = false;
				return nullptr;
				}

				void VPlanSlp::addCombined(ArrayRef<VPValue *> Operands, VPInstruction *New) {
				if (all_of(Operands, [](VPValue *V) {
				return cast<VPInstruction>(V)->getUnderlyingInstr();
				RKSimon: static?
				})) {
				unsigned BundleSize = 0;
				for (VPValue *V : Operands) {
				dcaballe: Would it make sense to move this to VPValue?
				Type *T = cast<VPInstruction>(V)->getUnderlyingInstr()->getType();
				assert(!T->isVectorTy() && "Only scalar types supported for now");
				ABataev: `ArrayRef<VPValue *> Operands`
				fhahn: Operands is used as key for the BundleToCombined map, which expects a SmallVector<VPValue *, 4> key.
				ABataev: LLVM has a special function, `llvm::to_vector<4>()`, that converts an `ArrayRef` to a `SmallVector`.
				fhahn: Neat, thanks for pointing that out!
				BundleSize += T->getScalarSizeInBits();
				}
				dcaballe: Some doc would be great. Same for some of the methods below.
				WidestBundleBits = std::max(WidestBundleBits, BundleSize);
				}

				auto Res = BundleToCombined.try_emplace(to_vector<4>(Operands), New);
				assert(Res.second &&
				ABataev: use `try_emplace(to_vector<4>(Operands), New);` instead
				"Already created a combined instruction for the operand bundle");
				(void)Res;
				}

				bool VPlanSlp::areVectorizable(ArrayRef<VPValue *> Operands) const {
				// Currently we only support VPInstructions.
				if (!all_of(Operands, [](VPValue *Op) {
				return Op && isa<VPInstruction>(Op) &&
				cast<VPInstruction>(Op)->getUnderlyingInstr();
				})) {
				LLVM_DEBUG(dbgs() << "VPSLP: not all operands are VPInstructions\n");
				return false;
				}
				ABataev: `const` member function

				// Check if opcodes and type width agree for all instructions in the bundle.
				// FIXME: Differing widths/opcodes can be handled by inserting additional
				// instructions.
				// FIXME: Deal with non-primitive types.
				const Instruction *OriginalInstr =
				RKSimon: Should you use VPSLP for debug messages to avoid confusion with SLPVectorizer?
				cast<VPInstruction>(Operands[0])->getUnderlyingInstr();
				unsigned Opcode = OriginalInstr->getOpcode();
				unsigned Width = OriginalInstr->getType()->getPrimitiveSizeInBits();
				if (!all_of(Operands, [Opcode, Width](VPValue *Op) {
				const Instruction *I = cast<VPInstruction>(Op)->getUnderlyingInstr();
				return I->getOpcode() == Opcode &&
				I->getType()->getPrimitiveSizeInBits() == Width;
				})) {
				LLVM_DEBUG(dbgs() << "VPSLP: Opcodes do not agree \n");
				return false;
				}

				// For now, all operands must be defined in the same BB.
				if (any_of(Operands, [this](VPValue *Op) {
				return cast<VPInstruction>(Op)->getParent() != &this->BB;
				})) {
				LLVM_DEBUG(dbgs() << "VPSLP: operands in different BBs\n");
				return false;
				}

				if (any_of(Operands,
				[](VPValue *Op) { return Op->hasMoreThanOneUniqueUser(); })) {
				LLVM_DEBUG(dbgs() << "VPSLP: Some operands have multiple users.\n");
				return false;
				}

				// For loads, check that there are no instructions writing to memory in
				// between them.
				ABataev: Seems to me the comment does not match the functionality
				fhahn: I've updated the comment and removed the code to check if all instructions are in the same BB. That's checked earlier on. Does it make sense now?
				ABataev: `are no instructions`?
				// TODO: we only have to forbid instructions writing to memory that could
				// interfere with any of the loads in the bundle
				if (Opcode == Instruction::Load) {
				unsigned LoadsSeen = 0;
				VPBasicBlock *Parent = cast<VPInstruction>(Operands[0])->getParent();
				for (auto &I : *Parent) {
				auto *VPI = cast<VPInstruction>(&I);
				if (VPI->getOpcode() == Instruction::Load &&
				std::find(Operands.begin(), Operands.end(), VPI) != Operands.end())
				LoadsSeen++;

				if (LoadsSeen == Operands.size())
				break;
				if (LoadsSeen > 0 && VPI->mayWriteToMemory()) {
				LLVM_DEBUG(
				dbgs() << "VPSLP: instruction modifying memory between loads\n");
				vporpo: Could we have other non-Store VPInstructions that could touch memory (e.g., calls)? Maybe it is safer to add a check whether the underlying IR instruction mayWriteToMemory().
				fhahn: I've added a simple implementation for mayWriteToMemory to VPInstruction. We could just use mayWriteToMemory of the underlying instruction, but I think this would unnecessarily increase coupling of VPInstruction and IR instructions. What do you think?
				vporpo: I agree it looks better this way.
				return false;
				}
				}

				if (!all_of(Operands, [](VPValue *Op) {
				return cast<LoadInst>(cast<VPInstruction>(Op)->getUnderlyingInstr())
				vporpo: I think we should also check that the operands are (i) all in the same BB, and (ii) in the same BB as the seed instructions (at least for now). A lit test for this would also be nice.
				fhahn: I've added a check for (i), and I'll add a check for (ii) tomorrow; I'll think a bit more about the best way to do it. I've added a unit test for (i) and will add IR tests once VPlan SLP is hooked up to the VPlan-native code path.
				->isSimple();
				})) {
				LLVM_DEBUG(dbgs() << "VPSLP: only simple loads are supported.\n");
				return false;
				}
				}

				if (Opcode == Instruction::Store)
				if (!all_of(Operands, [](VPValue *Op) {
				ABataev: `ArrayRef<VPValue *> Values`
				return cast<StoreInst>(cast<VPInstruction>(Op)->getUnderlyingInstr())
				->isSimple();
				})) {
				LLVM_DEBUG(dbgs() << "VPSLP: only simple stores are supported.\n");
				return false;
				RKSimon: auto?
				}

				return true;
				}
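To make the load-bundle memory check above easier to follow, here is a minimal standalone sketch of the same scan over a simplified instruction model. `ToyInst` and `noWritesBetweenLoads` are hypothetical names, not VPlan APIs. Counting starts at the first bundle load, any memory-writing instruction seen after that point rejects the bundle, and the scan stops once every bundle load has been seen.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a VPInstruction: only the two
// properties the check needs (not a VPlan API).
struct ToyInst {
  bool IsLoad;
  bool WritesMemory;
};

// Returns true if no memory-writing instruction appears in Block (program
// order) after the first load of Bundle and before the last one.
bool noWritesBetweenLoads(const std::vector<ToyInst *> &Block,
                          const std::vector<ToyInst *> &Bundle) {
  std::size_t LoadsSeen = 0;
  for (ToyInst *I : Block) {
    // Count loads that belong to the bundle.
    if (I->IsLoad &&
        std::find(Bundle.begin(), Bundle.end(), I) != Bundle.end())
      ++LoadsSeen;
    // All bundle loads seen: later instructions cannot interfere.
    if (LoadsSeen == Bundle.size())
      break;
    // A write after the first bundle load may clobber a later one.
    if (LoadsSeen > 0 && I->WritesMemory)
      return false;
  }
  return true;
}
```

For example, a block [load A, add, load B, store] accepts the bundle {load A, load B}, while [load A, store, load B] rejects it. Like the patch, the sketch is conservative: it rejects any intervening write, not only writes that may alias the loads (see the TODO above).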

				ABataev: `ArrayRef<VPValue *> Values`
				static SmallVector<VPValue *, 4> getOperands(ArrayRef<VPValue *> Values,
				unsigned OperandIndex) {
				RKSimon: Reuse Instruction::isCommutative(cast<VPInstruction>(Values[0])->getOpcode())?
				SmallVector<VPValue *, 4> Operands;
				for (VPValue *V : Values) {
				auto *U = cast<VPUser>(V);
				Operands.push_back(U->getOperand(OperandIndex));
				ABataev: `static`
				ABataev: `ArrayRef<VPValue *> Values`
				}
				return Operands;
				}

				static bool areCommutative(ArrayRef<VPValue *> Values) {
				return Instruction::isCommutative(
				cast<VPInstruction>(Values[0])->getOpcode());
				}

				static SmallVector<SmallVector<VPValue *, 4>, 4>
				RKSimon: You can move this above the switch statement and remove the duplicate cast
				getOperands(ArrayRef<VPValue *> Values) {
				ABataev: `i`->`I`, `numOps`->`NumOps`, `++I`
				SmallVector<SmallVector<VPValue *, 4>, 4> Result;
				auto *VPI = cast<VPInstruction>(Values[0]);

				switch (VPI->getOpcode()) {
				case Instruction::Load:
				llvm_unreachable("Loads terminate a tree, no need to get operands");
				RKSimon: llvm_unreachable?
				ABataev: Do we need braces here?
				fhahn: Not anymore, thanks
				case Instruction::Store:
				Result.push_back(getOperands(Values, 0));
				break;
				default:
				ABataev: `ArrayRef<VPValue *> Values`
				for (unsigned I = 0, NumOps = VPI->getNumOperands(); I < NumOps; ++I)
				Result.push_back(getOperands(Values, I));
				break;
				RKSimon: This would be cleaner if you pulled out cast<VPInstruction>(Values[0])->getNumOperands(): `unsigned NumOps = cast<VPInstruction>(Values[0])->getNumOperands(); for (unsigned i = 0; i < NumOps; ++i)`
				}

				ABataev: Maybe it is better to use `llvm::Optional<unsigned>` instead of some magic numbers for the non-matching case?
				return Result;
				}

				ABataev: Use `llvm::None` instead
				/// Returns the opcode of Values or None if they do not all agree.
				static Optional<unsigned> getOpcode(ArrayRef<VPValue *> Values) {
				unsigned Opcode = cast<VPInstruction>(Values[0])->getOpcode();
				if (any_of(Values, [Opcode](VPValue *V) {
				return cast<VPInstruction>(V)->getOpcode() != Opcode;
				}))
				return None;
				return {Opcode};
				}
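A small C++17 sketch of the same "common opcode or nothing" pattern, using `std::optional` as a rough analogue of `llvm::Optional`/`llvm::None`. `commonOpcode` is a hypothetical name, and opcodes are modeled as plain `unsigned` values. Returning an empty optional avoids reserving a magic value such as `~0` for the mismatch case.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Returns the opcode shared by every element of Opcodes, or std::nullopt if
// they disagree or the list is empty.
std::optional<unsigned> commonOpcode(const std::vector<unsigned> &Opcodes) {
  if (Opcodes.empty())
    return std::nullopt;
  const unsigned First = Opcodes.front();
  if (std::any_of(Opcodes.begin(), Opcodes.end(),
                  [First](unsigned Op) { return Op != First; }))
    return std::nullopt;
  return First;
}
```

Callers test the result before using it, much as buildGraph() asserts on getOpcode(Values) before calling getValue().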

				/// Returns true if A and B have the same opcode and, if they are loads or
				/// stores, access consecutive memory locations.
				static bool areConsecutiveOrMatch(VPInstruction *A, VPInstruction *B,
				VPInterleavedAccessInfo &IAI) {
				if (A->getOpcode() != B->getOpcode())
				return false;
				dcaballe: If opcodes are the same at this point, shouldn't we keep only A or B checks but not both? You could add `assert(A->getOpcode() == B->getOpcode())` to be safe.

				if (A->getOpcode() != Instruction::Load &&
				A->getOpcode() != Instruction::Store)
				return true;
				auto *GA = IAI.getInterleaveGroup(A);
				auto *GB = IAI.getInterleaveGroup(B);

				return GA && GB && GA == GB && GA->getIndex(A) + 1 == GB->getIndex(B);
				}
				dcaballe: It would be great if you could elaborate a bit on what 'score' is exactly.

				/// Implements getLAScore from Listing 7 in the paper.
				/// Traverses and compares operands of V1 and V2 to MaxLevel.
				static unsigned getLAScore(VPValue *V1, VPValue *V2, unsigned MaxLevel,
				VPInterleavedAccessInfo &IAI) {
				if (!isa<VPInstruction>(V1) || !isa<VPInstruction>(V2))
				return 0;

				if (MaxLevel == 0)
				ABataev: Capitalize variable names + preincrement
				return (unsigned)areConsecutiveOrMatch(cast<VPInstruction>(V1),
				cast<VPInstruction>(V2), IAI);

				unsigned Score = 0;
				for (unsigned I = 0, EV1 = cast<VPUser>(V1)->getNumOperands(); I < EV1; ++I)
				for (unsigned J = 0, EV2 = cast<VPUser>(V2)->getNumOperands(); J < EV2; ++J)
				ABataev: `static`
				Score += getLAScore(cast<VPUser>(V1)->getOperand(I),
				ABataev: `ArrayRef<VPValue *> Candidates`
				fhahn: We remove elements from Candidates, which is not possible with ArrayRef, I think.
				cast<VPUser>(V2)->getOperand(J), MaxLevel - 1, IAI);
				return Score;
				}
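Since the meaning of the look-ahead score came up in review, here is a minimal standalone sketch of the recursion on a toy operand tree rather than VPlan types. `Node`, `LoadOpc`, `consecutiveOrMatch` and `laScore` are hypothetical; `Index` stands in for the element a load reads, so two loads "match" only when the second reads the element right after the first (a simplification of the interleave-group check). At depth 0 the score is 1 for a matching pair and 0 otherwise; at a deeper level it is the sum of the next-shallower scores over all operand pairs, so a higher score means more of the nearby operand structure lines up.

```cpp
#include <cassert>
#include <vector>

constexpr int LoadOpc = 1; // hypothetical opcode value for loads

// Toy operand tree. Index is only meaningful for loads and models the
// element a load reads (a stand-in for the interleave-group position).
struct Node {
  int Opcode;
  int Index;
  std::vector<const Node *> Ops;
};

// Depth-0 comparison: the opcodes must agree, and loads must also read
// consecutive elements (B right after A).
bool consecutiveOrMatch(const Node *A, const Node *B) {
  if (A->Opcode != B->Opcode)
    return false;
  if (A->Opcode != LoadOpc)
    return true;
  return B->Index == A->Index + 1;
}

// Look-ahead score: 1/0 for a matching/non-matching pair at depth 0;
// otherwise the sum of the scores of all operand pairs one level down.
unsigned laScore(const Node *V1, const Node *V2, unsigned MaxLevel) {
  if (MaxLevel == 0)
    return consecutiveOrMatch(V1, V2) ? 1u : 0u;
  unsigned Score = 0;
  for (const Node *Op1 : V1->Ops)
    for (const Node *Op2 : V2->Ops)
      Score += laScore(Op1, Op2, MaxLevel - 1);
  return Score;
}
```

For `add(A[0], B[0])` versus `add(A[1], B[1])`, depth 0 gives 1 (same opcode) and depth 1 gives 2, because the A-loads and the B-loads are each consecutive while the cross pairs are not.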
				RKSimon: (style) Score

				std::pair<VPlanSlp::OpMode, VPValue *>
				RKSimon: Don't re-evaluate the getNumOperands() calls: `for (unsigned i = 0, e1 = cast<VPUser>(V1)->getNumOperands(); j < e1; i++) for (unsigned j = 0, e2 = cast<VPUser>(V2)->getNumOperands(); j < e2; j++)`
				VPlanSlp::getBest(OpMode Mode, VPValue *Last,
				SmallVectorImpl<VPValue *> &Candidates,
				VPInterleavedAccessInfo &IAI) {
				LLVM_DEBUG(dbgs() << " getBest\n");
				VPValue *Best = Candidates[0];
				SmallVector<VPValue *, 4> BestCandidates;

				LLVM_DEBUG(dbgs() << " Candidates for "
				<< *cast<VPInstruction>(Last)->getUnderlyingInstr() << " ");
				for (auto *Candidate : Candidates) {
				auto *LastI = cast<VPInstruction>(Last);
				auto *CandidateI = cast<VPInstruction>(Candidate);
				if (areConsecutiveOrMatch(LastI, CandidateI, IAI)) {
				LLVM_DEBUG(dbgs() << *cast<VPInstruction>(Candidate)->getUnderlyingInstr()
				<< " ");
				BestCandidates.push_back(Candidate);
				}
				RKSimon: (style) auto for casts
				}
				LLVM_DEBUG(dbgs() << "\n");

				if (BestCandidates.empty())
				return {OpMode::Failed, nullptr};
				ABataev: Why `5`? Use an enum or const instead of a magic number.
				fhahn: Replaced with a variable

				if (BestCandidates.size() == 1)
				return {Mode, BestCandidates[0]};

				if (Mode == OpMode::Opcode) {
				unsigned BestScore = 0;
				for (unsigned Depth = 1; Depth < LookaheadMaxDepth; Depth++) {
				unsigned PrevScore = ~0u;
				bool AllSame = true;

				// FIXME: Avoid visiting the same operands multiple times.
				for (auto *Candidate : BestCandidates) {
				unsigned Score = getLAScore(Last, Candidate, Depth, IAI);
				if (PrevScore == ~0u)
				PrevScore = Score;
				if (PrevScore != Score)
				AllSame = false;
				PrevScore = Score;

				if (Score > BestScore) {
				BestScore = Score;
				Best = Candidate;
				}
				}
				if (!AllSame)
				break;
				}
				}
				LLVM_DEBUG(dbgs() << "Found best "
				<< *cast<VPInstruction>(Best)->getUnderlyingInstr()
				ABataev: `std::remove(Candidates.begin(), Candidates.end(), Best);`?
				<< "\n");
				Candidates.erase(std::remove(Candidates.begin(), Candidates.end(), Best),
				Candidates.end());

				return {Mode, Best};
				}
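The selection loop in getBest() can be summarized with a generic sketch, assuming the candidates were already filtered (as BestCandidates is) and that a caller-supplied callback replaces getLAScore; `pickBest` and the callback are hypothetical. Scores are computed at increasing depth while all candidates tie, and the search stops at the first depth that tells them apart. The chosen value is then removed with the erase-remove idiom, since std::remove on its own only shifts elements and leaves the vector's size unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the candidate with the highest score and removes it from Candidates.
// Score(C, Depth) is a caller-supplied (hypothetical) look-ahead scoring
// callback. Candidates are assumed to be pre-filtered and non-empty.
template <typename T, typename ScoreFn>
T pickBest(std::vector<T> &Candidates, ScoreFn Score, unsigned MaxDepth) {
  assert(!Candidates.empty() && "need at least one candidate");
  T Best = Candidates.front(); // fallback if no candidate ever scores > 0
  unsigned BestScore = 0;
  for (unsigned Depth = 1; Depth < MaxDepth; ++Depth) {
    const unsigned FirstScore = Score(Candidates.front(), Depth);
    bool AllSame = true;
    for (const T &C : Candidates) {
      const unsigned S = Score(C, Depth);
      if (S != FirstScore)
        AllSame = false;
      if (S > BestScore) {
        BestScore = S;
        Best = C;
      }
    }
    // Stop at the first depth whose scores distinguish the candidates.
    if (!AllSame)
      break;
  }
  // Erase-remove: std::remove alone would leave the size unchanged.
  Candidates.erase(std::remove(Candidates.begin(), Candidates.end(), Best),
                   Candidates.end());
  return Best;
}
```

With a score that ties at depth 1 and favors one candidate at depth 2, that candidate is returned and the vector shrinks by one.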

				SmallVector<VPlanSlp::MultiNodeOpTy, 4> VPlanSlp::reorderMultiNodeOps() {
				SmallVector<MultiNodeOpTy, 4> FinalOrder;
				ABataev: Try `FinalOrder.emplace_back(Operands.first, Operands.second[0]);`
				fhahn: I tried, but neither that nor `emplace_back(Operands.first, {Operands.second[0])};` gets accepted, unfortunately.
				SmallVector<OpMode, 4> Mode;
				FinalOrder.reserve(MultiNodeOps.size());
				ABataev: Is it possible to replace it with `emplace_back(Operands.first, Operands.second[0]);`
				fhahn: I don't think so, it fails to deduce the correct type for the second argument (with and without braces)
				Mode.reserve(MultiNodeOps.size());

				LLVM_DEBUG(dbgs() << "Reordering multinode\n");

				for (auto &Operands : MultiNodeOps) {
				FinalOrder.push_back({Operands.first, {Operands.second[0]}});
				ABataev: You can preallocate the memory for the `Mode` and `FinalOrder` vectors, because you know the number of elements is `MultiNodeOps.size()`
				if (cast<VPInstruction>(Operands.second[0])->getOpcode() ==
				Instruction::Load)
				Mode.push_back(OpMode::Load);
				else
				ABataev: 1. Use preincrement. 2. `Lane = 1, E = MultiNodeOps[0].second.size(); Lane < E;`
				Mode.push_back(OpMode::Opcode);
				}

				for (unsigned Lane = 1, E = MultiNodeOps[0].second.size(); Lane < E; ++Lane) {
				LLVM_DEBUG(dbgs() << " Finding best value for lane " << Lane << "\n");
				SmallVector<VPValue *, 4> Candidates;
				ABataev: Also can preallocate the memory for the vector
				Candidates.reserve(MultiNodeOps.size());
				LLVM_DEBUG(dbgs() << " Candidates ");
				for (auto Ops : MultiNodeOps) {
				LLVM_DEBUG(
				dbgs() << *cast<VPInstruction>(Ops.second[Lane])->getUnderlyingInstr()
				<< " ");
				ABataev: preincrement
				Candidates.push_back(Ops.second[Lane]);
				}
				LLVM_DEBUG(dbgs() << "\n");

				for (unsigned Op = 0, E = MultiNodeOps.size(); Op < E; ++Op) {
				LLVM_DEBUG(dbgs() << " Checking " << Op << "\n");
				ABataev: Please, use the actual type here
				if (Mode[Op] == OpMode::Failed)
				continue;

				VPValue *Last = FinalOrder[Op].second[Lane - 1];
				ABataev: use `emplace_back()`
				fhahn: Is there any benefit to using emplace_back here? We are not constructing any objects here, I think.
				std::pair<OpMode, VPValue *> Res =
				getBest(Mode[Op], Last, Candidates, IAI);
				RKSimon: `for (unsigned Op = 0, E = MultiNodeOps.size(); Op < E; ++Op) {`
				if (Res.second)
				FinalOrder[Op].second.push_back(Res.second);
				else
				// TODO: handle this case
				FinalOrder[Op].second.push_back(markFailed());
				}
				ABataev: `ArrayRef<VPValue *> Values`
				}

				return FinalOrder;
				vporpo: If I am not mistaken, since we are only allowing single user nodes in the graph, the graph is actually a tree. So this should not happen. We should have an assertion here to check this.
				fhahn: Currently we check for unique users. For example, in testSlpReuse_1 we have 2 add instructions which add the same value. In that case we should re-use the already created combined instruction, I think.
				vporpo: Ah yes, you are checking for unique users, so it should work the way you are describing. If I understand correctly, the graph can still be considered a tree since the only case when two nodes are connected by more than one path is when the user has two identical operands. If this is the case, I still think that there should be some kind of assertion here checking that the graph is a tree and not a DAG, because SLP on a DAG is slightly different and would need a few more components to work correctly. I think that the only case where (I != BundleToCombined.end()) is true is the one you are describing: when the user instructions have identical operands. So maybe add an assertion that checks that the single users of Values have identical operands and a comment like "For now buildGraph should form a tree, not a DAG.".
				}
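A rough, self-contained sketch of the lane-by-lane reordering, under simplifying assumptions: values are plain ints, `reorderLanes` is a hypothetical name, a first-match predicate replaces getBest() (no look-ahead tie-breaking), and a caller-chosen `Failed` value plays the role of markFailed(). Lane 0 fixes the operand order; for each later lane, every operand slot takes a remaining candidate that matches the value it picked in the previous lane, and each candidate is used at most once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Ops[Slot][Lane] is the value feeding operand slot Slot in lane Lane; all
// slots are assumed to have the same number of lanes. Matches(Prev, Cand) is
// a hypothetical predicate standing in for getBest().
template <typename T, typename MatchFn>
std::vector<std::vector<T>>
reorderLanes(const std::vector<std::vector<T>> &Ops, MatchFn Matches,
             const T &Failed) {
  assert(!Ops.empty() && "need at least one operand slot");
  std::vector<std::vector<T>> Final;
  for (const auto &SlotOps : Ops)
    Final.push_back({SlotOps[0]}); // lane 0 keeps its original order
  for (std::size_t Lane = 1; Lane < Ops[0].size(); ++Lane) {
    // Every value available in this lane, in original slot order.
    std::vector<T> Candidates;
    for (const auto &SlotOps : Ops)
      Candidates.push_back(SlotOps[Lane]);
    for (auto &Slot : Final) {
      const T Prev = Slot[Lane - 1];
      auto It = Candidates.end();
      if (Prev != Failed) // a failed slot stays failed
        It = std::find_if(Candidates.begin(), Candidates.end(),
                          [&](const T &C) { return Matches(Prev, C); });
      if (It == Candidates.end()) {
        Slot.push_back(Failed); // nothing matched in this lane
        continue;
      }
      Slot.push_back(*It);
      Candidates.erase(It); // each candidate is used by at most one slot
    }
  }
  return Final;
}
```

With a predicate meaning "the candidate is the previous value plus one", the commuted lane-1 operands {{10, 21}, {20, 11}} come back as {{10, 11}, {20, 21}}.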

				void VPlanSlp::dumpBundle(ArrayRef<VPValue *> Values) {
				ABataev: `ArrayRef<VPValue *> Values`
				fhahn: Values is used as key for the BundleToCombined map, which expects a SmallVector<VPValue *, 4> key.
				LLVM_DEBUG(dbgs() << " Ops: ");
				vporpo: Use LLVM_DEBUG for dumpBundle()
				for (auto *Op : Values) {
				auto *VPInstr = cast_or_null<VPInstruction>(Op);
				if (auto *Instr = VPInstr ? VPInstr->getUnderlyingInstr() : nullptr)
				LLVM_DEBUG(dbgs() << *Instr << " | ");
				else
				LLVM_DEBUG(dbgs() << " nullptr | ");
				}
				LLVM_DEBUG(dbgs() << "\n");
				}

				VPInstruction *VPlanSlp::buildGraph(ArrayRef<VPValue *> Values) {
				assert(!Values.empty() && "Need some operands!");

				// If we already visited this instruction bundle, re-use the existing node
				auto I = BundleToCombined.find(to_vector<4>(Values));
				if (I != BundleToCombined.end()) {
				vporpo: Use LLVM_DEBUG for dumpBundle()
				#ifndef NDEBUG
				// Check that the resulting graph is a tree. If we re-use a node, this means
				// its values have multiple users. We only allow this if all users of each
				// value are the same instruction.
				for (auto *V : Values) {
				auto UI = V->user_begin();
				auto *FirstUser = *UI++;
				while (UI != V->user_end()) {
				assert(*UI == FirstUser && "Currently we only support SLP trees.");
				UI++;
				}
				}
				#endif
				return I->second;
				}

				// Dump inputs
				LLVM_DEBUG({
				dbgs() << "buildGraph: ";
				dumpBundle(Values);
				});

				if (!areVectorizable(Values))
				return markFailed();

				assert(getOpcode(Values) && "Opcodes for all values must match");
				unsigned ValuesOpcode = getOpcode(Values).getValue();

				SmallVector<VPValue *, 4> CombinedOperands;
				if (areCommutative(Values)) {
				bool MultiNodeRoot = !MultiNodeActive;
				MultiNodeActive = true;
				for (auto &Operands : getOperands(Values)) {
				LLVM_DEBUG({
				dbgs() << " Visiting Commutative";
				dumpBundle(Operands);
				});
				ABataev: No braces here

				auto OperandsOpcode = getOpcode(Operands);
				if (OperandsOpcode && OperandsOpcode == getOpcode(Values)) {
				LLVM_DEBUG(dbgs() << " Same opcode, continue building\n");
				CombinedOperands.push_back(buildGraph(Operands));
				vporpo: Hmm, why no buildGraph() recursion here?
				} else {
				LLVM_DEBUG(dbgs() << " Adding multinode Ops\n");
				// Create a dummy VPInstruction, which we will later replace with the
				// re-ordered operand.
				VPInstruction *Op = new VPInstruction(0, {});
				CombinedOperands.push_back(Op);
				MultiNodeOps.emplace_back(Op, Operands);
				}
				}

				if (MultiNodeRoot) {
				LLVM_DEBUG(dbgs() << "Reorder \n");
				MultiNodeActive = false;

				auto FinalOrder = reorderMultiNodeOps();

				MultiNodeOps.clear();
				for (auto &Ops : FinalOrder) {
				VPInstruction *NewOp = buildGraph(Ops.second);
				Ops.first->replaceAllUsesWith(NewOp);
				for (unsigned I = 0, E = CombinedOperands.size(); I < E; ++I)
				if (CombinedOperands[I] == Ops.first)
				CombinedOperands[I] = NewOp;
				delete Ops.first;
				Ops.first = NewOp;
				}
				LLVM_DEBUG(dbgs() << "Found final order\n");
				}
				} else {
				LLVM_DEBUG(dbgs() << " NonCommutative\n");
				if (ValuesOpcode == Instruction::Load)
				for (VPValue *V : Values)
				CombinedOperands.push_back(cast<VPInstruction>(V)->getOperand(0));
				else
				for (auto &Operands : getOperands(Values))
				CombinedOperands.push_back(buildGraph(Operands));
				}

				unsigned Opcode;
				switch (ValuesOpcode) {
				case Instruction::Load:
				Opcode = VPInstruction::SLPLoad;
				break;
				case Instruction::Store:
				Opcode = VPInstruction::SLPStore;
				break;
				default:
				Opcode = ValuesOpcode;
				break;
				}

				if (!CompletelySLP)
				return markFailed();

				assert(CombinedOperands.size() > 0 && "Need some operands");
				auto *VPI = new VPInstruction(Opcode, CombinedOperands);
				VPI->setUnderlyingInstr(cast<VPInstruction>(Values[0])->getUnderlyingInstr());

				LLVM_DEBUG(dbgs() << "Create VPInstruction "; VPI->print(dbgs());
				cast<VPInstruction>(Values[0])->print(dbgs()); dbgs() << "\n");
				addCombined(Values, VPI);
				return VPI;
				}
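The BundleToCombined look-up at the start of buildGraph() is plain memoization keyed on the bundle. A minimal standalone sketch, assuming bundles are modeled as `std::vector<int>` and that `Combined` and `BundleCache` are hypothetical stand-ins: requesting the same bundle twice yields the same node instead of a second copy. This is what happens in testSlpReuse_1, where each add uses the same loaded value for both operands, so the load bundle is requested once per operand position.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical combined node; it just remembers the bundle it was built from.
struct Combined {
  std::vector<int> Bundle;
};

// Creates at most one combined node per distinct bundle.
class BundleCache {
  std::map<std::vector<int>, std::unique_ptr<Combined>> BundleToCombined;

public:
  Combined *getOrCreate(const std::vector<int> &Bundle) {
    auto It = BundleToCombined.find(Bundle);
    if (It != BundleToCombined.end())
      return It->second.get(); // bundle seen before: re-use its node
    auto Node = std::make_unique<Combined>(Combined{Bundle});
    Combined *Raw = Node.get();
    BundleToCombined.emplace(Bundle, std::move(Node));
    return Raw;
  }

  std::size_t size() const { return BundleToCombined.size(); }
};
```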

lib/Transforms/Vectorize/VPlanValue.h

...
  user_iterator user_begin() { return Users.begin(); }
  const_user_iterator user_begin() const { return Users.begin(); }
  user_iterator user_end() { return Users.end(); }
  const_user_iterator user_end() const { return Users.end(); }
  user_range users() { return user_range(user_begin(), user_end()); }
  const_user_range users() const {
    return const_user_range(user_begin(), user_end());
  }

  /// Returns true if the value has more than one unique user.
  bool hasMoreThanOneUniqueUser() {
    if (getNumUsers() == 0)
      return false;

    // Check if all users match the first user.
    auto Current = std::next(user_begin());
    while (Current != user_end() && *user_begin() == *Current)
      Current++;
    return Current != user_end();
  }

  void replaceAllUsesWith(VPValue *New);
};

typedef DenseMap<Value *, VPValue *> Value2VPValueTy;
typedef DenseMap<VPValue *, Value *> VPValue2ValueTy;

raw_ostream &operator<<(raw_ostream &OS, const VPValue &V);

/// This class augments VPValue with operands which provide the inverse def-use
...
  }

  unsigned getNumOperands() const { return Operands.size(); }
  inline VPValue *getOperand(unsigned N) const {
    assert(N < Operands.size() && "Operand index out of bounds");
    return Operands[N];
  }

  void setOperand(unsigned I, VPValue *New) { Operands[I] = New; }
ABataev: Capitalize `i`, must be `I`

  typedef SmallVectorImpl<VPValue *>::iterator operand_iterator;
  typedef SmallVectorImpl<VPValue *>::const_iterator const_operand_iterator;
  typedef iterator_range<operand_iterator> operand_range;
  typedef iterator_range<const_operand_iterator> const_operand_range;

  operand_iterator op_begin() { return Operands.begin(); }
  const_operand_iterator op_begin() const { return Operands.begin(); }
  operand_iterator op_end() { return Operands.end(); }
...
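A standalone sketch of the unique-user check over a plain vector of user pointers, written as a hypothetical free function template rather than the VPValue member. The key detail is comparing the dereferenced iterators (the users themselves); comparing the iterators would compare positions in the list, which differ for every element.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Returns true if Users (which may list the same user several times) holds
// more than one distinct user.
template <typename UserT>
bool hasMoreThanOneUniqueUser(const std::vector<UserT *> &Users) {
  if (Users.empty())
    return false;
  // Skip over entries that are the same user as the first one.
  auto Current = std::next(Users.begin());
  while (Current != Users.end() && *Users.begin() == *Current)
    ++Current;
  return Current != Users.end();
}
```

A value used twice by the same instruction (for example `add %v, %v`) therefore has one unique user.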

unittests/Transforms/Vectorize/CMakeLists.txt

set(LLVM_LINK_COMPONENTS
  Analysis
  Core
  Vectorize
  AsmParser
  )

add_llvm_unittest(VectorizeTests
  VPlanDominatorTreeTest.cpp
  VPlanLoopInfoTest.cpp
  VPlanTest.cpp
  VPlanHCFGTest.cpp
  VPlanSlpTest.cpp
  )

unittests/Transforms/Vectorize/VPlanSlpTest.cpp

This file was added.

				//===- llvm/unittest/Transforms/Vectorize/VPlanSlpTest.cpp ---------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "../lib/Transforms/Vectorize/VPlan.h"
				#include "../lib/Transforms/Vectorize/VPlanHCFGBuilder.h"
				#include "../lib/Transforms/Vectorize/VPlanHCFGTransforms.h"
				#include "VPlanTestBase.h"
				#include "llvm/Analysis/VectorUtils.h"
				#include "gtest/gtest.h"

				namespace llvm {
				namespace {

				class VPlanSlpTest : public VPlanTestBase {
				protected:
				TargetLibraryInfoImpl TLII;
				TargetLibraryInfo TLI;
				DataLayout DL;

				std::unique_ptr<AssumptionCache> AC;
				std::unique_ptr<ScalarEvolution> SE;
				std::unique_ptr<AAResults> AARes;
				std::unique_ptr<BasicAAResult> BasicAA;
				std::unique_ptr<LoopAccessInfo> LAI;
				std::unique_ptr<PredicatedScalarEvolution> PSE;
				std::unique_ptr<InterleavedAccessInfo> IAI;

				VPlanSlpTest()
				: TLII(), TLI(TLII),
				DL("e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-"
				"f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:"
				"16:32:64-S128") {}

				VPInterleavedAccessInfo getInterleavedAccessInfo(Function &F, Loop *L,
				VPlan &Plan) {
				AC.reset(new AssumptionCache(F));
				SE.reset(new ScalarEvolution(F, TLI, *AC, *DT, *LI));
				BasicAA.reset(new BasicAAResult(DL, F, TLI, *AC, &*DT, &*LI));
				AARes.reset(new AAResults(TLI));
				AARes->addAAResult(*BasicAA);
				PSE.reset(new PredicatedScalarEvolution(*SE, *L));
				LAI.reset(new LoopAccessInfo(L, &*SE, &TLI, &*AARes, &*DT, &*LI));
				IAI.reset(new InterleavedAccessInfo(*PSE, L, &*DT, &*LI, &*LAI));
				IAI->analyzeInterleaving(false);
				return {Plan, *IAI};
				}
				};

				TEST_F(VPlanSlpTest, testSlpSimple_2) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"%struct.Test3 = type { i32, i32, i32 }\n"
				"%struct.Test4xi8 = type { i8, i8, i8 }\n"
				"define void @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vB0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vB1\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add0, i32* %C0, align 4\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);
				auto VPIAI = getInterleavedAccessInfo(*F, LI->getLoopFor(LoopHeader), *Plan);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 12));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 14));

				VPlanSlp Slp(VPIAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				VPInstruction *CombinedStore = Slp.buildGraph(StoreRoot);
				EXPECT_EQ(64u, Slp.getWidestBundleBits());
				EXPECT_EQ(VPInstruction::SLPStore, CombinedStore->getOpcode());

				auto *CombinedAdd = cast<VPInstruction>(CombinedStore->getOperand(0));
				EXPECT_EQ(Instruction::Add, CombinedAdd->getOpcode());

				auto *CombinedLoadA = cast<VPInstruction>(CombinedAdd->getOperand(0));
				auto *CombinedLoadB = cast<VPInstruction>(CombinedAdd->getOperand(1));
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadA->getOpcode());
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadB->getOpcode());
				}

				TEST_F(VPlanSlpTest, testSlpSimple_3) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"%struct.Test3 = type { i32, i32, i32 }\n"
				"%struct.Test4xi8 = type { i8, i8, i8 }\n"
				"define void @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr %struct.Test, %struct.Test* %A, i64 "
				" %indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				" %indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vB0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				" %indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				" %indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vB1\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				" %indvars.iv, i32 0\n"
				" store i32 %add0, i32* %C0, align 4\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				" %indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 12));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 14));

				auto VPIAI = getInterleavedAccessInfo(*F, LI->getLoopFor(LoopHeader), *Plan);

				VPlanSlp Slp(VPIAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				VPInstruction *CombinedStore = Slp.buildGraph(StoreRoot);
				EXPECT_EQ(64u, Slp.getWidestBundleBits());
				EXPECT_EQ(VPInstruction::SLPStore, CombinedStore->getOpcode());

				auto *CombinedAdd = cast<VPInstruction>(CombinedStore->getOperand(0));
				EXPECT_EQ(Instruction::Add, CombinedAdd->getOpcode());

				auto *CombinedLoadA = cast<VPInstruction>(CombinedAdd->getOperand(0));
				auto *CombinedLoadB = cast<VPInstruction>(CombinedAdd->getOperand(1));
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadA->getOpcode());
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadB->getOpcode());

				VPInstruction *GetA = cast<VPInstruction>(&*std::next(Body->begin(), 1));
				VPInstruction *GetB = cast<VPInstruction>(&*std::next(Body->begin(), 3));
				EXPECT_EQ(GetA, CombinedLoadA->getOperand(0));
				EXPECT_EQ(GetB, CombinedLoadB->getOperand(0));
				}

				TEST_F(VPlanSlpTest, testSlpReuse_1) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"define void @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vA0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vA1\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add0, i32* %C0, align 4\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);
				auto VPIAI = getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 8));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 10));

				VPlanSlp Slp(VPIAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				VPInstruction *CombinedStore = Slp.buildGraph(StoreRoot);
				EXPECT_EQ(64u, Slp.getWidestBundleBits());
				EXPECT_EQ(VPInstruction::SLPStore, CombinedStore->getOpcode());

				auto *CombinedAdd = cast<VPInstruction>(CombinedStore->getOperand(0));
				EXPECT_EQ(Instruction::Add, CombinedAdd->getOpcode());

				auto *CombinedLoadA = cast<VPInstruction>(CombinedAdd->getOperand(0));
				EXPECT_EQ(CombinedLoadA, CombinedAdd->getOperand(1));
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadA->getOpcode());
				}

				TEST_F(VPlanSlpTest, testSlpReuse_2) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"define i32 @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vA0\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add0, i32* %C0, align 4\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vA1\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret i32 %vA1\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);
				auto VPIAI = getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 5));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 10));

				VPlanSlp Slp(VPIAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				Slp.buildGraph(StoreRoot);
				EXPECT_FALSE(Slp.isCompletelySLP());
				}

				static void checkReorderExample(VPInstruction *Store1, VPInstruction *Store2,
				VPBasicBlock *Body,
				VPInterleavedAccessInfo &&IAI) {
				VPlanSlp Slp(IAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				VPInstruction *CombinedStore = Slp.buildGraph(StoreRoot);

				EXPECT_TRUE(Slp.isCompletelySLP());
				EXPECT_EQ(CombinedStore->getOpcode(), VPInstruction::SLPStore);

				VPInstruction *CombinedAdd =
				cast<VPInstruction>(CombinedStore->getOperand(0));
				EXPECT_EQ(CombinedAdd->getOpcode(), Instruction::Add);

				VPInstruction *CombinedMulAB =
				cast<VPInstruction>(CombinedAdd->getOperand(0));
				VPInstruction *CombinedMulCD =
				cast<VPInstruction>(CombinedAdd->getOperand(1));
				EXPECT_EQ(CombinedMulAB->getOpcode(), Instruction::Mul);

				VPInstruction *CombinedLoadA =
				cast<VPInstruction>(CombinedMulAB->getOperand(0));
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadA->getOpcode());
				VPInstruction *LoadvA0 = cast<VPInstruction>(&*std::next(Body->begin(), 2));
				VPInstruction *LoadvA1 = cast<VPInstruction>(&*std::next(Body->begin(), 12));
				EXPECT_EQ(LoadvA0->getOperand(0), CombinedLoadA->getOperand(0));
				EXPECT_EQ(LoadvA1->getOperand(0), CombinedLoadA->getOperand(1));

				VPInstruction *CombinedLoadB =
				cast<VPInstruction>(CombinedMulAB->getOperand(1));
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadB->getOpcode());
				VPInstruction *LoadvB0 = cast<VPInstruction>(&*std::next(Body->begin(), 4));
				VPInstruction *LoadvB1 = cast<VPInstruction>(&*std::next(Body->begin(), 14));
				EXPECT_EQ(LoadvB0->getOperand(0), CombinedLoadB->getOperand(0));
				EXPECT_EQ(LoadvB1->getOperand(0), CombinedLoadB->getOperand(1));

				EXPECT_EQ(CombinedMulCD->getOpcode(), Instruction::Mul);

				VPInstruction *CombinedLoadC =
				cast<VPInstruction>(CombinedMulCD->getOperand(0));
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadC->getOpcode());
				VPInstruction *LoadvC0 = cast<VPInstruction>(&*std::next(Body->begin(), 7));
				VPInstruction *LoadvC1 = cast<VPInstruction>(&*std::next(Body->begin(), 17));
				EXPECT_EQ(LoadvC0->getOperand(0), CombinedLoadC->getOperand(0));
				EXPECT_EQ(LoadvC1->getOperand(0), CombinedLoadC->getOperand(1));

				VPInstruction *CombinedLoadD =
				cast<VPInstruction>(CombinedMulCD->getOperand(1));
				EXPECT_EQ(VPInstruction::SLPLoad, CombinedLoadD->getOpcode());
				VPInstruction *LoadvD0 = cast<VPInstruction>(&*std::next(Body->begin(), 9));
				VPInstruction *LoadvD1 = cast<VPInstruction>(&*std::next(Body->begin(), 19));
				EXPECT_EQ(LoadvD0->getOperand(0), CombinedLoadD->getOperand(0));
				EXPECT_EQ(LoadvD1->getOperand(0), CombinedLoadD->getOperand(1));
				}

				TEST_F(VPlanSlpTest, testSlpReorder_1) {
				LLVMContext Ctx;
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"define void @add_x3(%struct.Test* %A, %struct.Test* %B, %struct.Test* "
				"%C, %struct.Test* %D, %struct.Test* %E) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %mul11 = mul nsw i32 %vA0, %vB0\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" %vC0 = load i32, i32* %C0, align 4\n"
				" %D0 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 0\n"
				" %vD0 = load i32, i32* %D0, align 4\n"
				" %mul12 = mul nsw i32 %vC0, %vD0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %mul21 = mul nsw i32 %vA1, %vB1\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" %vC1 = load i32, i32* %C1, align 4\n"
				" %D1 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 1\n"
				" %vD1 = load i32, i32* %D1, align 4\n"
				" %mul22 = mul nsw i32 %vC1, %vD1\n"
				" %add1 = add nsw i32 %mul11, %mul12\n"
				" %add2 = add nsw i32 %mul22, %mul21\n"
				" %E0 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add1, i32* %E0, align 4\n"
				" %E1 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add2, i32* %E1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x3");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 24));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 26));

				checkReorderExample(
				Store1, Store2, Body,
				getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan));
				}

				TEST_F(VPlanSlpTest, testSlpReorder_2) {
				LLVMContext Ctx;
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"define void @add_x3(%struct.Test* %A, %struct.Test* %B, %struct.Test* "
				"%C, %struct.Test* %D, %struct.Test* %E) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %mul11 = mul nsw i32 %vA0, %vB0\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" %vC0 = load i32, i32* %C0, align 4\n"
				" %D0 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 0\n"
				" %vD0 = load i32, i32* %D0, align 4\n"
				" %mul12 = mul nsw i32 %vC0, %vD0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %mul21 = mul nsw i32 %vB1, %vA1\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" %vC1 = load i32, i32* %C1, align 4\n"
				" %D1 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 1\n"
				" %vD1 = load i32, i32* %D1, align 4\n"
				" %mul22 = mul nsw i32 %vD1, %vC1\n"
				" %add1 = add nsw i32 %mul11, %mul12\n"
				" %add2 = add nsw i32 %mul22, %mul21\n"
				" %E0 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add1, i32* %E0, align 4\n"
				" %E1 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add2, i32* %E1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x3");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 24));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 26));

				checkReorderExample(
				Store1, Store2, Body,
				getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan));
				}

				TEST_F(VPlanSlpTest, testSlpReorder_3) {
				LLVMContext Ctx;
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"define void @add_x3(%struct.Test* %A, %struct.Test* %B, %struct.Test* "
				"%C, %struct.Test* %D, %struct.Test* %E) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %mul11 = mul nsw i32 %vA1, %vB0\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" %vC0 = load i32, i32* %C0, align 4\n"
				" %D0 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 0\n"
				" %vD0 = load i32, i32* %D0, align 4\n"
				" %mul12 = mul nsw i32 %vC0, %vD0\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %mul21 = mul nsw i32 %vB1, %vA0\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" %vC1 = load i32, i32* %C1, align 4\n"
				" %D1 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 1\n"
				" %vD1 = load i32, i32* %D1, align 4\n"
				" %mul22 = mul nsw i32 %vD1, %vC1\n"
				" %add1 = add nsw i32 %mul11, %mul12\n"
				" %add2 = add nsw i32 %mul22, %mul21\n"
				" %E0 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add1, i32* %E0, align 4\n"
				" %E1 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add2, i32* %E1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x3");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 24));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 26));

				auto VPIAI = getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan);
				VPlanSlp Slp(VPIAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				EXPECT_EQ(nullptr, Slp.buildGraph(StoreRoot));

				// FIXME: Need to select a better first value for lane 0.
				EXPECT_FALSE(Slp.isCompletelySLP());
				}

				TEST_F(VPlanSlpTest, testSlpReorder_4) {
				LLVMContext Ctx;
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"define void @add_x3(%struct.Test* %A, %struct.Test* %B, %struct.Test* "
				"%C, %struct.Test* %D, %struct.Test* %E) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %mul11 = mul nsw i32 %vA0, %vB0\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" %vC0 = load i32, i32* %C0, align 4\n"
				" %D0 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 0\n"
				" %vD0 = load i32, i32* %D0, align 4\n"
				" %mul12 = mul nsw i32 %vC0, %vD0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %mul21 = mul nsw i32 %vA1, %vB1\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" %vC1 = load i32, i32* %C1, align 4\n"
				" %D1 = getelementptr inbounds %struct.Test, %struct.Test* %D, i64 "
				"%indvars.iv, i32 1\n"
				" %vD1 = load i32, i32* %D1, align 4\n"
				" %mul22 = mul nsw i32 %vC1, %vD1\n"
				" %add1 = add nsw i32 %mul11, %mul12\n"
				" %add2 = add nsw i32 %mul22, %mul21\n"
				" %E0 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add1, i32* %E0, align 4\n"
				" %E1 = getelementptr inbounds %struct.Test, %struct.Test* %E, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add2, i32* %E1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x3");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 24));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 26));

				checkReorderExample(
				Store1, Store2, Body,
				getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan));
				}

				// Make sure we do not combine instructions with operands in different BBs.
				TEST_F(VPlanSlpTest, testInstrsInDifferentBBs) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"%struct.Test3 = type { i32, i32, i32 }\n"
				"%struct.Test4xi8 = type { i8, i8, i8 }\n"
				"define void @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vB0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" br label %bb2\n"
				"bb2:\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vB1\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add0, i32* %C0, align 4\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);
				auto VPIAI = getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();
				VPBasicBlock *BB2 = Body->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(BB2->begin(), 3));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(BB2->begin(), 5));

				VPlanSlp Slp(VPIAI, *BB2);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				EXPECT_EQ(nullptr, Slp.buildGraph(StoreRoot));
				EXPECT_EQ(0u, Slp.getWidestBundleBits());
				}

				// Make sure we do not combine instructions with operands in different BBs.
				TEST_F(VPlanSlpTest, testInstrsInDifferentBBs2) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"%struct.Test3 = type { i32, i32, i32 }\n"
				"%struct.Test4xi8 = type { i8, i8, i8 }\n"
				"define void @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vB0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vB1\n"
				" br label %bb2\n"
				"bb2:\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add0, i32* %C0, align 4\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);
				auto VPIAI = getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();
				VPBasicBlock *BB2 = Body->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(BB2->begin(), 1));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(BB2->begin(), 3));

				VPlanSlp Slp(VPIAI, *BB2);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				EXPECT_EQ(nullptr, Slp.buildGraph(StoreRoot));
				EXPECT_EQ(0u, Slp.getWidestBundleBits());
				}

				TEST_F(VPlanSlpTest, testSlpAtomicLoad) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"%struct.Test3 = type { i32, i32, i32 }\n"
				"%struct.Test4xi8 = type { i8, i8, i8 }\n"
				"define void @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load atomic i32, i32* %A0 monotonic, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vB0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vB1\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" store i32 %add0, i32* %C0, align 4\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);
				auto VPIAI = getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 12));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 14));

				VPlanSlp Slp(VPIAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				EXPECT_EQ(nullptr, Slp.buildGraph(StoreRoot));
				EXPECT_FALSE(Slp.isCompletelySLP());
				}

				TEST_F(VPlanSlpTest, testSlpAtomicStore) {
				const char *ModuleString =
				"%struct.Test = type { i32, i32 }\n"
				"%struct.Test3 = type { i32, i32, i32 }\n"
				"%struct.Test4xi8 = type { i8, i8, i8 }\n"
				"define void @add_x2(%struct.Test* nocapture readonly %A, %struct.Test* "
				"nocapture readonly %B, %struct.Test* nocapture %C) {\n"
				"entry:\n"
				" br label %for.body\n"
				"for.body: ; preds = %for.body, "
				"%entry\n"
				" %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]\n"
				" %A0 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 0\n"
				" %vA0 = load i32, i32* %A0, align 4\n"
				" %B0 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 0\n"
				" %vB0 = load i32, i32* %B0, align 4\n"
				" %add0 = add nsw i32 %vA0, %vB0\n"
				" %A1 = getelementptr inbounds %struct.Test, %struct.Test* %A, i64 "
				"%indvars.iv, i32 1\n"
				" %vA1 = load i32, i32* %A1, align 4\n"
				" %B1 = getelementptr inbounds %struct.Test, %struct.Test* %B, i64 "
				"%indvars.iv, i32 1\n"
				" %vB1 = load i32, i32* %B1, align 4\n"
				" %add1 = add nsw i32 %vA1, %vB1\n"
				" %C0 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 0\n"
				" store atomic i32 %add0, i32* %C0 monotonic, align 4\n"
				" %C1 = getelementptr inbounds %struct.Test, %struct.Test* %C, i64 "
				"%indvars.iv, i32 1\n"
				" store i32 %add1, i32* %C1, align 4\n"
				" %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1\n"
				" %exitcond = icmp eq i64 %indvars.iv.next, 1024\n"
				" br i1 %exitcond, label %for.cond.cleanup, label %for.body\n"
				"for.cond.cleanup: ; preds = %for.body\n"
				" ret void\n"
				"}\n";

				Module &M = parseModule(ModuleString);

				Function *F = M.getFunction("add_x2");
				BasicBlock *LoopHeader = F->getEntryBlock().getSingleSuccessor();
				auto Plan = buildHCFG(LoopHeader);
				auto VPIAI = getInterleavedAccessInfo(F, LI->getLoopFor(LoopHeader), Plan);

				VPBlockBase *Entry = Plan->getEntry()->getEntryBasicBlock();
				EXPECT_NE(nullptr, Entry->getSingleSuccessor());
				VPBasicBlock *Body = Entry->getSingleSuccessor()->getEntryBasicBlock();

				VPInstruction *Store1 = cast<VPInstruction>(&*std::next(Body->begin(), 12));
				VPInstruction *Store2 = cast<VPInstruction>(&*std::next(Body->begin(), 14));

				VPlanSlp Slp(VPIAI, *Body);
				SmallVector<VPValue *, 4> StoreRoot = {Store1, Store2};
				Slp.buildGraph(StoreRoot);
				EXPECT_FALSE(Slp.isCompletelySLP());
				}

				} // namespace
				} // namespace llvm
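The testSlpReorder_* tests above exercise the look-ahead operand reordering described in the Porpodas et al. paper. In testSlpReorder_2, lane 1's add and muls have their operands swapped relative to lane 0, yet checkReorderExample still expects %vA0 to pair with %vA1, %vB0 with %vB1, and so on. The C++17 sketch below illustrates the idea on a hypothetical, simplified IR. `Node`, `score`, `reorderLane1`, `reorderTestSlpReorder2` and the score values 2/1/0 are all assumptions for illustration; they are not the VPlanSlp API or its exact scoring. For a commutative operation in lane 1, the sketch compares the straight and the swapped operand pairing against lane 0 with a bounded look-ahead and keeps whichever scores higher.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>

// Hypothetical, simplified IR node for illustration only; not LLVM's
// VPInstruction. It carries just enough information for the sketch.
struct Node {
  enum Kind { Load, Mul, Add };
  Kind K;
  std::string Base;           // loads only: base pointer name ("A", "B", ...)
  int Offset = 0;             // loads only: struct field index
  const Node *Op0 = nullptr;  // binary operations only
  const Node *Op1 = nullptr;
};

// Assumed look-ahead score for pairing lane-0 value A with lane-1 value B:
// 2 for loads from the same base at consecutive offsets, 1 for values of the
// same kind, 0 otherwise. For binary operations (all commutative here), the
// score of the better operand pairing is added, up to Depth levels deep.
int score(const Node *A, const Node *B, int Depth) {
  if (A->K != B->K)
    return 0;
  if (A->K == Node::Load)
    return (A->Base == B->Base && B->Offset == A->Offset + 1) ? 2 : 1;
  if (Depth == 0)
    return 1;
  int Straight = score(A->Op0, B->Op0, Depth - 1) + score(A->Op1, B->Op1, Depth - 1);
  int Swapped = score(A->Op0, B->Op1, Depth - 1) + score(A->Op1, B->Op0, Depth - 1);
  return 1 + std::max(Straight, Swapped);
}

// For a commutative operation, swap lane 1's operands if that pairs them
// better with lane 0's operands. Lane 0 is never changed.
// Returns true if lane 1 was swapped.
bool reorderLane1(const Node &Lane0, Node &Lane1, int Depth) {
  assert(Lane0.Op0 && Lane0.Op1 && Lane1.Op0 && Lane1.Op1 &&
         "expected two binary operations");
  int Straight = score(Lane0.Op0, Lane1.Op0, Depth) + score(Lane0.Op1, Lane1.Op1, Depth);
  int Swapped = score(Lane0.Op0, Lane1.Op1, Depth) + score(Lane0.Op1, Lane1.Op0, Depth);
  if (Swapped <= Straight)
    return false;
  std::swap(Lane1.Op0, Lane1.Op1);
  return true;
}

// Models the two lanes of testSlpReorder_2, reorders the add, then the
// lane-1 mul paired with %mul11.
// Returns {add operands swapped, mul operands swapped}.
std::pair<bool, bool> reorderTestSlpReorder2() {
  Node vA0{Node::Load, "A", 0}, vB0{Node::Load, "B", 0};
  Node vC0{Node::Load, "C", 0}, vD0{Node::Load, "D", 0};
  Node vA1{Node::Load, "A", 1}, vB1{Node::Load, "B", 1};
  Node vC1{Node::Load, "C", 1}, vD1{Node::Load, "D", 1};
  Node Mul11{Node::Mul, "", 0, &vA0, &vB0};
  Node Mul12{Node::Mul, "", 0, &vC0, &vD0};
  Node Mul21{Node::Mul, "", 0, &vB1, &vA1};     // swapped vs. lane 0
  Node Mul22{Node::Mul, "", 0, &vD1, &vC1};     // swapped vs. lane 0
  Node Add1{Node::Add, "", 0, &Mul11, &Mul12};
  Node Add2{Node::Add, "", 0, &Mul22, &Mul21};  // swapped vs. lane 0

  bool AddSwapped = reorderLane1(Add1, Add2, /*Depth=*/2);
  assert(Add2.Op0 == &Mul21 && "Mul21 should now pair with Mul11");
  bool MulSwapped = reorderLane1(Mul11, Mul21, /*Depth=*/2);
  return {AddSwapped, MulSwapped};
}
```

Here `reorderTestSlpReorder2()` returns `{true, true}`. Lane 1's add is swapped so that %mul21 pairs with %mul11. Then %mul21's own operands are swapped so that the A and B loads line up lane by lane, which is the pairing checkReorderExample expects. The sketch never changes lane 0, so it has a limitation similar to the one the FIXME in testSlpReorder_3 notes. When lane 0 multiplies %vA1 by %vB0, the best pairing it finds still leaves the A loads in reversed order.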

Revision Contents

Diff 173830

lib/Transforms/Vectorize/CMakeLists.txt

lib/Transforms/Vectorize/VPlan.h

lib/Transforms/Vectorize/VPlan.cpp

lib/Transforms/Vectorize/VPlanSLP.cpp

lib/Transforms/Vectorize/VPlanValue.h

unittests/Transforms/Vectorize/CMakeLists.txt

unittests/Transforms/Vectorize/VPlanSlpTest.cpp
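Several tests in VPlanSlpTest.cpp expect buildGraph to fail, either by returning nullptr or by leaving isCompletelySLP() false: testInstrsInDifferentBBs, testInstrsInDifferentBBs2, testSlpAtomicLoad and testSlpAtomicStore. They cover bundles whose members live in different basic blocks or include atomic memory operations. The successful tests expect getWidestBundleBits() to report 64 for a bundle of two i32 values. The following is a minimal C++ sketch of such per-bundle checks, built on a hypothetical `LaneInst` record. `LaneInst`, `canBundle` and `bundleBits` are illustrative assumptions, not LLVM's API, and the actual checks in VPlanSLP.cpp may differ and cover more conditions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified description of one lane of a candidate bundle.
// Not LLVM's API; it records only what the checks below need.
struct LaneInst {
  std::string Opcode;  // e.g. "load", "store", "add"
  int Block;           // id of the basic block containing the instruction
  bool IsAtomic;       // true for `load atomic` / `store atomic`
  unsigned TypeBits;   // scalar width in bits (for stores: the stored value)
};

// Returns true if all lanes may be combined into one wide instruction:
// same opcode, same basic block, same scalar width, and no atomics.
bool canBundle(const std::vector<LaneInst> &Lanes) {
  if (Lanes.empty())
    return false;
  const LaneInst &First = Lanes.front();
  for (const LaneInst &L : Lanes) {
    if (L.Opcode != First.Opcode || L.Block != First.Block ||
        L.TypeBits != First.TypeBits || L.IsAtomic)
      return false;
  }
  return true;
}

// Width in bits covered by one bundle: lanes x scalar width,
// e.g. 2 x i32 = 64. Returns 0 if the lanes cannot be bundled.
unsigned bundleBits(const std::vector<LaneInst> &Lanes) {
  if (!canBundle(Lanes))
    return 0;
  return static_cast<unsigned>(Lanes.size()) * Lanes.front().TypeBits;
}
```

In this model, the %add0/%add1 bundle of testInstrsInDifferentBBs fails the same-block check, and the %vA0/%vA1 bundle of testSlpAtomicLoad fails the atomic check. A tree-wide "widest bundle" would be the maximum of `bundleBits` over every bundle that was successfully built.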
