This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

- llvm/include/llvm/Analysis/LoopAccessAnalysis.h
- llvm/lib/Analysis/LoopAccessAnalysis.cpp
- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
- llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll
- llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
- llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll
- llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll
- llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll

Differential D36130

[SLP] Vectorize jumbled memory loads.
AcceptedPublic

Authored by • ashahid on Aug 1 2017, 12:31 AM.

Details

Reviewers

mkuper
loladiro
Ayal
zvi
danielcdh
ABataev

Commits

rGdbd30edb7ff8: [SLP] Vectorize jumbled memory loads.
rG1d5422f27f60: [SLP] Vectorize jumbled memory loads.
rG2b281de5769e: [SLP] Vectorize jumbled memory loads.
rGf8db9bd85791: [SLP] Vectorize jumbled memory loads.
rL320548: [SLP] Vectorize jumbled memory loads.
rL314806: [SLP] Vectorize jumbled memory loads.
rL313771: [SLP] Vectorize jumbled memory loads.
rL313736: [SLP] Vectorize jumbled memory loads.

Summary

This patch tries to vectorize loads from consecutive memory locations that are
accessed in a non-consecutive (jumbled) order. An earlier attempt, patch D26905,
was reverted because of a basic issue with representing the 'use mask' of
jumbled accesses.

This patch fixes the mask representation by recording the 'use mask' in the user tree entry.

Change-Id: I9fe7f5045f065d84c126fa307ef6ebe0787296df

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

There are a very large number of changes, so older changes are hidden.

Patch update to fix the build bot failure:

This fix changes the placeholder for the shuffle mask from a fixed array of 3 elements
to a std::map. The need arises because a PHI node can have
any number of operands as incoming values.

Tests performed:
LLVM lit tests, a 3-stage bootstrap build, and LNT (thanks to Hans and Daniel)

Harbormaster completed remote builds in B12741: Diff 125470.Dec 4 2017, 9:17 PM

In D36130#944703, @ashahid wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Good catch. Add a LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
736 ↗	(On Diff #125470)	The fixed array SmallVector<unsigned, 4> ShuffleMask[3]; of the previous version indeed cannot account for all operands. How about holding a SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask; instead of holding a map from 0,1,2,..,numOperands ?
766 ↗	(On Diff #125470)	Are both conditions really needed, or suffice say to check for -1 and assert positive indices are not too large?
3054 ↗	(On Diff #125470)	May be simpler to check instead ShuffleMask.count(OpdNum)
3085 ↗	(On Diff #125470)	clang-format

In D36130#945306, @hans wrote:

In D36130#944703, @ashahid wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Thanks Hans for triage.

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in a few of the LNT MultiSource benchmarks. How can I extract that into a LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
736 ↗	(On Diff #125470)	I think this can be done. I will try.
766 ↗	(On Diff #125470)	Sure I will check. I am thinking 30000 as large indices threshold, do you have any number in mind?
3054 ↗	(On Diff #125470)	Quite right.

• ashahid added inline comments.Dec 8 2017, 8:12 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
766 ↗	(On Diff #125470)	I tried, but it seems both conditions are needed, as I am getting the assertion "Idx < size()" for SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask.

Updated the review comments.

Herald added a subscriber: mgrang. · Dec 9 2017, 12:50 AM

Minor commented code clean up done.

In D36130#946236, @ashahid wrote:

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in a few of the LNT MultiSource benchmarks. How can I extract that into a LIT test?

Suffice to have a phi with 4 predecessors, where (at-least) the 4th needs a shuffle-mask.

lib/Transforms/Vectorize/SLPVectorizer.cpp
678–679 ↗	(On Diff #112854)	Code below still uses emplace_back contrary to the discussion above. May need to call UserTreeEntry->ShuffleMask.resize() if OpdNum is larger than its initial/current size, before setting UserTreeEntry->ShuffleMask[OpdNum] = tempMask. (Otherwise the original "LNT Multisource bench mark" asserts should trigger again?) Suggest to add a test where the first operand does not need a shuffle but the second one does.
766 ↗	(On Diff #125470)	UserTreeIdx is the index of the User entry as we build the tree bottom-up, so it should always be between 0 and VectorizableTree.size()-1, except for -1 when creating the new entry for the root, which is User-less. So it should suffice to check if Idx is -1, and otherwise assert that Idx < size(), if desired, right?
2801 ↗	(On Diff #126266)	See above discussion about replacing second condition with an assert.
3044 ↗	(On Diff #126266)	ditto
test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗	(On Diff #126266)	Why add -debug?
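
To make the emplace_back concern above concrete, here is a minimal sketch in plain C++. It uses std::vector as a stand-in for SmallVector, and the names `MaskTable`, `recordWithEmplaceBack` and `recordAtOperand` are hypothetical, not taken from the patch. It only illustrates why appending can put a mask in the wrong slot when an earlier operand needs no shuffle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<unsigned>;
using MaskTable = std::vector<Mask>; // hypothetical: one slot per user operand

// Appending ignores OpdNum: the mask lands in the next free slot.
inline void recordWithEmplaceBack(MaskTable &Table, std::size_t /*OpdNum*/,
                                  const Mask &M) {
  Table.emplace_back(M);
}

// Growing the table first keeps "slot index == operand number".
inline void recordAtOperand(MaskTable &Table, std::size_t OpdNum,
                            const Mask &M) {
  if (OpdNum >= Table.size())
    Table.resize(OpdNum + 1); // new slots are empty, meaning "no shuffle"
  assert(Table[OpdNum].empty() && "mask already recorded for this operand");
  Table[OpdNum] = M;
}
```

Starting from an empty table, `recordWithEmplaceBack(Table, 1, M)` stores the mask in slot 0, whereas `recordAtOperand(Table, 1, M)` leaves slot 0 empty and fills slot 1. This is the situation the suggested test (first operand unshuffled, second operand shuffled) is meant to exercise.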

Review comments updated and added lit tests.

Harbormaster completed remote builds in B12974: Diff 126374.Dec 11 2017, 8:34 AM

• ashahid added inline comments.Dec 11 2017, 8:41 AM

test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗	(On Diff #126266)	My bad, not intended.

This looks good to me, with a couple of last minor fixes.

Hope it stays in this time...

lib/Transforms/Vectorize/SLPVectorizer.cpp
773 ↗	(On Diff #126374)	alrea[d]y
3090 ↗	(On Diff #126374)	Can simply do `for (unsigned Entry : ShuffleMask[OpdNum])` instead of iterating explicitly over all lanes and retrieving each `UserTreeEntry->ShuffleMask[OpdNum][Lane]`.
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
31 ↗	(On Diff #126374)	Suggested to also have a test where the 2nd operand is a shuffle but the 1st one isn't, which will fail if shuffles are added using emplace_back().

Updated test and review comment.

Bootstrap and LNT test underway.

• ashahid closed this revision.Dec 12 2017, 7:09 PM

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

In D36130#955158, @eastig wrote:

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

A problem report: https://bugs.llvm.org/show_bug.cgi?id=35673

eastig mentioned this in D41324: [SLPVectorizer] Add shuffle instruction cost for jumbled load.Dec 18 2017, 4:11 AM

sanjoy added a subscriber: sanjoy.Dec 19 2017, 4:03 PM

sanjoy added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1125 ↗	(On Diff #126571)	This should be a `cast<>`.
1153 ↗	(On Diff #126571)	LLVM style is to avoid using curly braces on single-line for loops. Using `std::iota` would be even better.
lib/Transforms/Vectorize/SLPVectorizer.cpp
774 ↗	(On Diff #126571)	I think you should be able to do: auto &OperandMask = UserTreeEntry->ShuffleMask[OpdNum]; assert(OperandMask.empty()); OperandMask.insert(OperandMask.end(), ShuffleMask.begin(), ShuffleMask.end());
1666 ↗	(On Diff #126571)	Not sure why you need `NewVL` here -- doesn't just using `Sorted` work?
3054 ↗	(On Diff #126571)	Might be cleaner to abstract `(unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() && !UserTreeEntry->ShuffleMask[OpdNum].empty()` into a `UserTreeEntry->hasShuffleMaskForOp(Index)` helper.
3091 ↗	(On Diff #126571)	The cast to `Value *` should not be necessary.
3319 ↗	(On Diff #126571)	`dyn_cast<XXX>(f)->g()` should never be necessary. Either the `dyn_cast` can return null in which case you should check for that, or it can't and you should use `cast<>`. Also the cast of `Vec` to `Instruction` seems unnecessary: `ShuffleVectorInst` is an `Instruction`.
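
A rough sketch of the two suggestions for lines 774 and 3054, written against std::vector so that it compiles without LLVM headers. `TreeEntrySketch` and `setShuffleMaskForOp` are hypothetical names; only `hasShuffleMaskForOp` and the `OperandMask` idiom come from the comments above, and the real tree entry carries more state than shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for the user tree entry.
struct TreeEntrySketch {
  std::vector<std::vector<unsigned>> ShuffleMask; // indexed by operand number

  // True only if a non-empty mask was recorded for operand OpdNum.
  bool hasShuffleMaskForOp(std::size_t OpdNum) const {
    return OpdNum < ShuffleMask.size() && !ShuffleMask[OpdNum].empty();
  }

  // Assumes the caller has already sized ShuffleMask to cover OpdNum.
  void setShuffleMaskForOp(std::size_t OpdNum,
                           const std::vector<unsigned> &Mask) {
    assert(OpdNum < ShuffleMask.size() && "operand slot must exist");
    auto &OperandMask = ShuffleMask[OpdNum];
    assert(OperandMask.empty() && "mask recorded twice for one operand");
    OperandMask.insert(OperandMask.end(), Mask.begin(), Mask.end());
  }
};
```

The helper keeps the bounds check and the emptiness check in one place, so call sites can ask a single question instead of repeating both conditions.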

Ayal added inline comments.Dec 21 2017, 3:25 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
774 ↗	(On Diff #126571)	While we're at it, this should move under the `if (UserTreeIdx != -1)` to avoid checking if `&VectorizableTree[UserTreeIdx]` is null, as commented in https://reviews.llvm.org/D41324#inline-361435
1675 ↗	(On Diff #126571)	Should probably also check here that UserTreeIdx is not -1, to avoid creating a mask for the root with no place to hang it, as @sanjoy observed.

• ashahid added inline comments.Dec 22 2017, 6:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
774 ↗	(On Diff #126571)	If we check for if (UserTreeIdx != -1 && ShuffledLoad) before the call of newTreeEntry(), we can avoid "UserTreeIdx != -1" check completely inside newTreeEntry().
1675 ↗	(On Diff #126571)	Yes, I had planned to do exactly this.

• ashahid reopened this revision.Dec 28 2017, 11:04 PM

• ashahid marked 8 inline comments as done.

• ashahid added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3319 ↗	(On Diff #126571)	Here I am trying to ensure that the instructions are "ShuffleVectorInst" and "LoadInst" respectively. The cast of Vec to Instruction is there so that getOperand() is available; without it the compiler reports an error.

This revision is now accepted and ready to land.Dec 28 2017, 11:04 PM

Updates review comments.

Regression test and LNT passes, 3 stage bootstrap test underway.

Ayal added inline comments.Dec 29 2017, 7:31 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp

3327 ↗

(On Diff #128320)

Use isa instead of dyn_cast here:
if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {

or alternatively do something like:

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updated Ayal's comment accordingly

• ashahid marked an inline comment as done.Jan 1 2018, 8:01 AM

Ping!

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

In D36130#973399, @ashahid wrote:

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

Ah, right, sorry, missed it.

This looks good to me, with only minor comments about the testcase.

Please see that @sanjoy approves too, as this mostly addresses issues he raised.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
27 ↗	(On Diff #128388)	"SINK" is defined redundantly, as it is not used. Could this be simplified by removing the float-to-int casts? In general, it may suffice to check that there's no load of <4 x i32>, which would be jumbled. Checking that two of the lanes have been vectorized may be fragile, in case a modified cost model will decide it ain't worth it.

sanjoy accepted this revision.Jan 13 2018, 2:40 PM

sanjoy added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗	(On Diff #128388)	The indent looks off here; can you please run clang-format?
lib/Transforms/Vectorize/SLPVectorizer.cpp
723 ↗	(On Diff #128388)	Optional: you can write `return X;` instead of `if (X) return true; return false;`.
1675 ↗	(On Diff #128388)	Nit: s/usefull/useful/
3326 ↗	(On Diff #128388)	I think you can rewrite this more cleanly using an immediately-invoked function expression:

Value *Vec = [&]() {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
    if (auto *LI = dyn_cast<LoadInst>(SVI->getOperand(0)))
      return LI->getOperand(0);
  return E->VectorizedValue;
}();

• ashahid marked an inline comment as not done.Jan 15 2018, 9:01 PM

• ashahid added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3326 ↗	(On Diff #128388)	I tried this IIFE, however I am getting an assertion "Tried to create extractelement operation on non-vector type!" for jumbled-load-multiuse.ll test. Do you see any issue in this code?

sanjoy added inline comments.Jan 15 2018, 10:54 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
3326 ↗	(On Diff #128388)	Yes, I think I should have written:

Value *Vec = [&]() {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
    if (isa<LoadInst>(SVI->getOperand(0)))
      return SVI->getOperand(0);
  return E->VectorizedValue;
}();
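
The difference between the two IIFE versions can be reproduced with a toy class hierarchy. This is only a sketch: `ToyValue`, `ToyLoad`, `ToyShuffle` and `lookThroughShuffle` are made-up stand-ins, and `dynamic_cast` plays the role of `dyn_cast`/`isa`. The first version returned operand 0 of the load, which is the pointer being loaded from rather than the loaded vector; that would be consistent with the "non-vector type" assertion reported above.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for llvm::Value and friends; none of this is LLVM's API.
struct ToyValue {
  std::vector<ToyValue *> Operands;
  virtual ~ToyValue() = default;
  ToyValue *getOperand(unsigned I) const { return Operands[I]; }
};
struct ToyLoad : ToyValue {};    // operand 0 is the pointer being loaded
struct ToyShuffle : ToyValue {}; // operand 0 is the vector being shuffled

// Look through a shuffle whose input is a load and return the load itself.
inline ToyValue *lookThroughShuffle(ToyValue *Vectorized) {
  assert(Vectorized && "Can't find vectorizable value");
  ToyValue *Vec = [&]() -> ToyValue * {
    if (auto *SVI = dynamic_cast<ToyShuffle *>(Vectorized))
      if (dynamic_cast<ToyLoad *>(SVI->getOperand(0)))
        return SVI->getOperand(0); // the load, not the load's pointer operand
    return Vectorized;
  }();
  return Vec;
}
```

The immediately-invoked lambda lets `Vec` be initialized once from a multi-branch computation instead of being assigned and then conditionally overwritten.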

Ayal added inline comments.Jan 15 2018, 11:49 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp

3326 ↗

(On Diff #128388)

Yes, this simplifies the below "alternatively do something like:"

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updates test case and stylistic review comments

Herald added a subscriber: llvm-commits. · Jan 16 2018, 8:51 AM

Ping!

Hi Ayal, Sanjoy,

The review of the last update has been pending for a long time. Of late, SLP has had a lot of changes, so I will have to rebase, but before rebasing please see whether any more changes are required in its current form.

Thanks in advance.

RKSimon added a reviewer: ABataev.Feb 10 2018, 8:56 AM

In D36130#1004306, @ashahid wrote:

Hi Ayal, Sanjoy,

The review of the last update has been pending for a long time. Of late, SLP has had a lot of changes, so I will have to rebase, but before rebasing please see whether any more changes are required in its current form.

Thanks in advance.

This looks good to me, as commented earlier, but please see that @sanjoy approves too, as this mostly addresses issues he raised.

I don't have any more coding style comments. I've not reviewed the actual semantic changes.

lib/Analysis/LoopAccessAnalysis.cpp
1166 ↗	(On Diff #129968)	Can you use `std::iota` here?

ABataev added inline comments.Feb 12 2018, 8:04 AM

lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗	(On Diff #129968)	This function can be used for stores also, it is better to make it universal for stores/loads.
1156 ↗	(On Diff #129968)	It is better to use `stable_sort` rather than `sort`
1169 ↗	(On Diff #129968)	`stable_sort`
lib/Transforms/Vectorize/SLPVectorizer.cpp
1661 ↗	(On Diff #129968)	Is it possible at all that `VL` has less than 4 elements here?
1666 ↗	(On Diff #129968)	`i`->`I`, `e`->`E`. Variables must have Camel-like names.
2229–2235 ↗	(On Diff #129968)	You don't need so many shuffles, it is enough just to have just one.

ABataev added inline comments.Feb 12 2018, 8:04 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗	(On Diff #129968)	Why you can't have just one shuffle here for all external uses?
1677–1678 ↗	(On Diff #129968)	Bad decision. It is better to use original `VL` here, rather than `Sorted`, and add an additional array of sorted indices. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.
3324 ↗	(On Diff #129968)	I think you can have default capture by value here rather than by reference.

Hi Alexey,

As I was trying to rebase this patch, it seems this overlaps with your "reverse load" patch. Could you take a look in this patch?

courbet added a subscriber: courbet.Feb 13 2018, 2:12 AM

Hi Alexey,

Thanks for looking into it. I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗	(On Diff #129968)	I plan to do such improvement in separate patches.
lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗	(On Diff #129968)	This is for in-tree multi uses of a single vector load where the uses has different masks/permutation. This section of comment https://reviews.llvm.org/D36130#inline-326711 discussed it earlier. Also there is figure attached.
1661 ↗	(On Diff #129968)	I think yes, for example a couple of i64 loads considering minimum register width as 128-bit. However, this check here was basically meant to indicate jumbled loads of size 2 is essentially a reversed load.
1677–1678 ↗	(On Diff #129968)	In fact earlier design in patch (https://reviews.llvm.org/D26905) was to use original VL, however there was counter argument to that which I don't remember exactly.
2229–2235 ↗	(On Diff #129968)	This is basically for multiple in-tree uses with different masks/permutation.

In D36130#1006202, @ashahid wrote:

Hi Alexey,

Thanks for looking into it. I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗	(On Diff #129968)	I just suggest to make universal at the very beginning, that's it
lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗	(On Diff #129968)	I still don't understand what the problem is here. You need to perform the loads in some order. You sort the loads to be in the sequentially direct order and perform the vector load starting from the lowest address. You reshuffle the loaded vector value to the original order. That's it, you have your loads in the required order. Just one shuffle is required. Why do you need more? Also, I don't understand why you need so many changes, why you need additional indices, etc.
1661 ↗	(On Diff #129968)	It is going to be handled by the reverse loads patch
1677–1678 ↗	(On Diff #129968)	It is better to use original `VL` here, otherwise it will end with a lot of trouble and will require a whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes across the whole module to handle jumbled loads.
3063 ↗	(On Diff #129968)	Is this correct? `E->Scalars[0]` is exactly `VL0`

Updates review comments and a test case.

Harbormaster completed remote builds in B14963: Diff 134170.Feb 14 2018, 1:38 AM

Minor clean up.

In D36130#1006238, @ABataev wrote:

In D36130#1006202, @ashahid wrote:

Hi Alexey,

Thanks for looking into it. I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

No worries, it was a merge issue; it's fixed.

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗	(On Diff #129968)	Updated jumbled-load.ll captures this case where instead of gathering the second operand of MUL we can have required shuffle of the same loaded vector
1661 ↗	(On Diff #129968)	Yes, this check is no longer required.
1677–1678 ↗	(On Diff #129968)	In the context where we can have multiple user of loaded vector with different shuffle mask, the design is to represent these different shuffle mask for each user corresponding to the user's operand number. Having single sorted indices will not be sufficient for this. Given the objective of handling multiple out of order uses changes are not that big I feel.
3063 ↗	(On Diff #129968)	Ah, both are same.

ABataev added inline comments.Feb 14 2018, 6:50 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678 ↗	(On Diff #129968)	Now I see what do you want to do. But I don't think that this the correct way to implement it. It complicates the whole vectorization process. I'd suggest to create different tree entries for each particular order of the loads and exclude loads from the check that the same instruction is used several times in different tree entries. If you worry about several different loads of the same values, I think they will be optimized by instruction combiner.

• ashahid added inline comments.Feb 16 2018, 9:46 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678 ↗	(On Diff #129968)	Of course this could have been a better solution, but I was not sure of the impact it may have by breaking the single tree entry assumption. One problem I see is the TreeEntry lookup if multiple nodes with the same scalar values are present. I can use the isSame() check to make sure the correct tree entry is found; however, it may become costly in the case of a PHI instruction fed by the same vector load.

ABataev added inline comments.Feb 16 2018, 10:29 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678 ↗	(On Diff #129968)	I think it is better to start with handling of single tree entry rather than trying to handle all possible situations in a single patch. I suggest to split this patch into 2 parts at least: 1. handling of tree entry with jumbled loads. 2. further improvements.
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
7–10 ↗	(On Diff #134178)	These checks are not autogenerated, fix it. Moreover, it is recommended to commit these tests separately with the checks for the original version of the compiler and the update checks with the fixed version to demonstrate improvements.

Updated the patch to accommodate the review comments.

Harbormaster completed remote builds in B15472: Diff 136070.Feb 27 2018, 6:29 AM

As suggested, now the reordering mask will be part of each tree entry. Also this update does not consider to optimize the reordered load for multiple operand for now.

By the way, take a look at my D43776 that does the same but in more general way

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗	(On Diff #136070)	Why you can do this only if `ReuseShuffleIndicies.empty()`?
1649–1654 ↗	(On Diff #136070)	It is enough just to compare `VL` and `Sorted`. If they are the same, the loads are not shuffled
1657 ↗	(On Diff #136070)	Why you can't do to add vectorized tree entry if `UserTreeIdx == -1`?
1660 ↗	(On Diff #136070)	Each `true` or `false` argument must be preceded by a comment with the name of the function parameter that the argument corresponds to.
2279 ↗	(On Diff #136070)	You can remove the last argument here
2899 ↗	(On Diff #136070)	Why do you need this condition?
3251 ↗	(On Diff #136070)	Restore the original code here
3287 ↗	(On Diff #136070)	Remove this empty line
3528–3533 ↗	(On Diff #136070)	I rather doubt you need all that stuff. You can use original code
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
1 ↗	(On Diff #136070)	You need to add this test separately and show changes in it.
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
1 ↗	(On Diff #136070)	You need to add this test separately and show changes in it
test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll
1 ↗	(On Diff #136070)	You need to add this test separately and show changes in it
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
64 ↗	(On Diff #136070)	You need to add this test separately and show changes in it

Will commit the tests as NFC.

Seems like I am not getting the mails from phabricator, what shall I do to get the mails?

Checked the patch D43776, seems it will make this patch redundant.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗	(On Diff #136070)	This is to avoid the overlapping the UniqueValues reuse logic of your changes.
1649–1654 ↗	(On Diff #136070)	Sure it is, but this avoids the compare. So I thought having a boolean is preferable.
1657 ↗	(On Diff #136070)	My bad, this is not required.
1660 ↗	(On Diff #136070)	Ok
2279 ↗	(On Diff #136070)	Sure
2899 ↗	(On Diff #136070)	In the 2nd test of jumbled-load.ll the two operands of MUL is fed from the same loaded vector. The 1st operand is SHUFFLE of LOAD and the 2nd operand is the gather of the same scalar loads. Query to getTreeEntry() will always return the node with the same vectorized value and hence both the operand of MUL will be fed the shuffled load. This check is to avoid this scenario.
3251 ↗	(On Diff #136070)	Thanks
3528–3533 ↗	(On Diff #136070)	This is required otherwise multiuse.ll test as well as PR32086.ll will fail because the lanes were recorded according to the order of scalar loads.

Updated further review comments.

Harbormaster completed remote builds in B15525: Diff 136311.Feb 28 2018, 9:30 AM

Hope this is fine.

ABataev added inline comments.Feb 28 2018, 9:49 AM

lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗	(On Diff #129968)	What about this comment? Do you really need Sorted argument?
1125 ↗	(On Diff #136311)	`PointerType `->`auto `
1129–1131 ↗	(On Diff #136311)	I think there must be an assertion instead of this check.
1141 ↗	(On Diff #136311)	`const SCEVConstant `->`const auto `
1146–1148 ↗	(On Diff #136311)	This check better to move to SLPVectorizer.cpp, because the function can be used for masked load/store.
1161 ↗	(On Diff #136311)	`for (unsigned I = 0, E = VL.size(); I < E; ++I)`
1166 ↗	(On Diff #136311)	Actually `Mask` is a full copy of `UseOrder`, you don't need all that complex stuff here
lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗	(On Diff #136070)	Why you can't handle it? What's the problem?
1649–1654 ↗	(On Diff #136070)	Why do we need the compare?
2899 ↗	(On Diff #136070)	This scenario should not happen in your patch: the instruction is either vectorized or gathered, but not both.
3528–3533 ↗	(On Diff #136070)	Again, it just may not happen in this patch

ABataev added inline comments.Feb 28 2018, 11:07 AM

lib/Analysis/LoopAccessAnalysis.cpp
1166 ↗	(On Diff #136311)	Oops, no, `Mask` is not a copy of `UseOrder`. But you can create it much more simply:

for (unsigned I = 0, E = VL.size(); I < E; ++I)
  Mask[UseOrder[I]] = I;
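
For readers following the `UseOrder`/`Mask` discussion, here is a self-contained sketch of the idea using only the standard library. It assumes that `UseOrder` holds program-order indices sorted by address offset, which is an inference from the thread rather than something this page shows; `SortedAccessSketch`, `sortByOffset` and `applyMask` are hypothetical names. It also uses `std::iota` and `std::stable_sort`, as suggested in earlier comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical result type for the sketch.
struct SortedAccessSketch {
  std::vector<unsigned> UseOrder; // UseOrder[j] = program-order index of the
                                  // access with the j-th lowest offset
  std::vector<unsigned> Mask;     // Mask[k] = sorted lane read by use k
};

// Offsets[k] is the element offset of the k-th load in program order.
inline SortedAccessSketch sortByOffset(const std::vector<std::int64_t> &Offsets) {
  SortedAccessSketch R;
  R.UseOrder.resize(Offsets.size());
  std::iota(R.UseOrder.begin(), R.UseOrder.end(), 0u);
  std::stable_sort(R.UseOrder.begin(), R.UseOrder.end(),
                   [&Offsets](unsigned L, unsigned Rhs) {
                     return Offsets[L] < Offsets[Rhs];
                   });
  R.Mask.resize(Offsets.size());
  for (unsigned I = 0, E = static_cast<unsigned>(Offsets.size()); I < E; ++I)
    R.Mask[R.UseOrder[I]] = I; // inverse permutation of UseOrder
  return R;
}

// Emulates the shuffle: lane k of the result is lane Mask[k] of the loaded vector.
inline std::vector<std::int64_t> applyMask(const std::vector<std::int64_t> &Loaded,
                                           const std::vector<unsigned> &Mask) {
  assert(Loaded.size() == Mask.size() && "mask must cover every lane");
  std::vector<std::int64_t> Out;
  Out.reserve(Mask.size());
  for (unsigned Lane : Mask)
    Out.push_back(Loaded[Lane]);
  return Out;
}
```

For the header comment's example (program order a[i+2], a[i+0], a[i+1], a[i+3]), calling `sortByOffset({2, 0, 1, 3})` gives `UseOrder` = {1, 2, 0, 3} and `Mask` = {2, 0, 1, 3}, matching the documented mask <2,0,1,3>.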

sanjoy removed a reviewer: sanjoy.Feb 28 2018, 11:34 AM

sanjoy removed a subscriber: sanjoy.

• ashahid added inline comments.Feb 28 2018, 11:30 PM

lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗	(On Diff #129968)	Yes, otherwise my test fails. Seems it breaks some assumption.
1166 ↗	(On Diff #136311)	Thanks
lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗	(On Diff #136070)	It was a thought; I have not checked yet. I will check.
1649–1654 ↗	(On Diff #136070)	I meant, if we don't use the ShuffledLoad flag we have to compare VL vs Sorted instead.
2899 ↗	(On Diff #136070)	This check is to avoid feeding the generated SHUFFLE to both operand of MUL which is not the intention of the test case.
3528–3533 ↗	(On Diff #136070)	It does happen and this test fails.

ABataev added inline comments.Mar 2 2018, 10:59 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗	(On Diff #136311)	No, use original `VL` here, do not use `Sorted`. In this case you won't need an additional argument in `sortLoadAccesses` and you don't need all that complex stuff with the lambda on line 3528

• ashahid added inline comments.Mar 5 2018, 10:39 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗	(On Diff #136311)	If I am not wrong, for LOADs, VL0 must be the 1st element of the buffer whose base address will be used for vector load. So using VL will break this assumption.

ABataev added inline comments.Mar 6 2018, 6:18 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗	(On Diff #136311)	Why? And why you can't choose the right VL0 during vectorization?

• ashahid added inline comments.Mar 6 2018, 8:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗	(On Diff #136311)	For example, if we have two arrays A[4] and B[1] lying one after another in memory, and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use, the generated vector load will load the elements A[1], A[2], A[3], B[1], which is not desired. Of course we can choose the right VL0 during vectorization, but we have to compute it again here using the mask, which can be avoided if we use Sorted VL. Am I missing something?

ABataev added inline comments.Mar 6 2018, 8:42 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗	(On Diff #136311)	You already store the mask in the tree entry and you can choose the right VL0 using this mask. Using Sorted VL complicates the whole vectorization process and, thus, adds some extra points for the incorrect vectorization. That's why I insist to use original VL and choose the correct VL0 during codegen.
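
One possible reading of "choose the right VL0 using this mask", sketched in plain C++. It assumes the mask convention from the header comment (entry k is the sorted lane read by the k-th scalar in program order); `findBaseLoadIndex` is a hypothetical helper, not code from either patch. Under that assumption, the scalar whose mask entry is 0 is the lowest-address load, and its pointer is where the vector load should start.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the program-order index of the scalar load with the lowest address,
// assuming Mask[k] is the sorted-vector lane read by the k-th scalar.
inline std::size_t findBaseLoadIndex(const std::vector<unsigned> &Mask) {
  auto It = std::find(Mask.begin(), Mask.end(), 0u);
  assert(It != Mask.end() && "a permutation mask must contain lane 0");
  return static_cast<std::size_t>(It - Mask.begin());
}
```

For the A[1], A[2], A[0], A[3] example above, the mask would be {1, 2, 0, 3} and the helper returns index 2, i.e. the load of A[0], so the vector load would not run past the end of A.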

• ashahid added inline comments.Mar 6 2018, 9:08 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗	(On Diff #136311)	Got it. Since you already have these improvements in this patch https://reviews.llvm.org/D43776 , I think it is better to get that through.

fhahn mentioned this in D37738: [SLPVectorizer] Generalize vectorizeStores to support loads as well NFC. .Mar 22 2018, 10:50 AM

fhahn mentioned this in D37737: [SLPVectorizer] Merge subsequent gather loads..

@ashahid What's happening to this patch?

Closed by commit rGdbd30edb7ff8: [SLP] Vectorize jumbled memory loads. (authored by • ashahid). · Oct 7 2019, 5:02 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. · Oct 7 2019, 5:02 AM

Herald added a subscriber: hiraditya.

RKSimon reopened this revision.Oct 7 2019, 6:08 AM

This revision is now accepted and ready to land.Oct 7 2019, 6:08 AM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/LoopAccessAnalysis.h	15 lines
llvm/lib/Analysis/LoopAccessAnalysis.cpp	71 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	278 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll	24 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll	125 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll	225 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll	37 lines
llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll	25 lines

Diff 223515

llvm/include/llvm/Analysis/LoopAccessAnalysis.h

	Show First 20 Lines • Show All 661 Lines • ▼ Show 20 Lines
	/// If necessary this method will version the stride of the pointer according			/// If necessary this method will version the stride of the pointer according
	/// to \p PtrToStride and therefore add further predicates to \p PSE.			/// to \p PtrToStride and therefore add further predicates to \p PSE.
	/// The \p Assume parameter indicates if we are allowed to make additional			/// The \p Assume parameter indicates if we are allowed to make additional
	/// run-time assumptions.			/// run-time assumptions.
	int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value Ptr, const Loop Lp,			int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value Ptr, const Loop Lp,
	const ValueToValueMap &StridesMap = ValueToValueMap(),			const ValueToValueMap &StridesMap = ValueToValueMap(),
	bool Assume = false, bool ShouldCheckWrap = true);			bool Assume = false, bool ShouldCheckWrap = true);

				/// \brief Attempt to sort the 'loads' in \p VL and return the sorted values in
				/// \p Sorted.
				///
				/// Returns 'false' if sorting is not legal or feasible, otherwise returns
				/// 'true'. If \p Mask is not null, it also returns the \p Mask which is the
				/// shuffle mask for actual memory access order.
				///
				/// For example, for a given VL of memory accesses in program order, a[i+2],
				/// a[i+0], a[i+1] and a[i+3], this function will sort the VL and save the
				/// sorted value in 'Sorted' as a[i+0], a[i+1], a[i+2], a[i+3] and saves the
				/// mask for actual memory accesses in program order in 'Mask' as <2,0,1,3>
				bool sortLoadAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
				ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted,
				SmallVectorImpl<unsigned> *Mask = nullptr);

	/// \brief Returns true if the memory operations \p A and \p B are consecutive.			/// \brief Returns true if the memory operations \p A and \p B are consecutive.
	/// This is a simple API that does not depend on the analysis pass.			/// This is a simple API that does not depend on the analysis pass.
	bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,			bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
	ScalarEvolution &SE, bool CheckType = true);			ScalarEvolution &SE, bool CheckType = true);

	/// \brief This analysis provides dependence information for the memory accesses			/// \brief This analysis provides dependence information for the memory accesses
	/// of a loop.			/// of a loop.
	///			///
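The example in the sortLoadAccesses comment above can be reproduced outside LLVM. The following is a minimal sketch, not the patch's code: it works on plain integer offsets instead of load instructions, and the names JumbledOrder and computeJumbledOrder are hypothetical. It uses the same two-step sort as the patch (sort the indices by offset, then sort a second index list by that order) and assumes all offsets are distinct.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for what sortLoadAccesses produces, using positions
// only (no llvm::Value involved).
struct JumbledOrder {
  // UseOrder[k] is the program-order index of the access with the k-th
  // lowest offset.
  std::vector<unsigned> UseOrder;
  // Mask[p] is the lane of the address-sorted vector that program-order
  // access p reads, i.e. the inverse permutation of UseOrder.
  std::vector<unsigned> Mask;
};

// Offsets are element offsets relative to some base, listed in program order.
// Assumption: offsets are pairwise distinct.
JumbledOrder computeJumbledOrder(const std::vector<int64_t> &Offsets) {
  JumbledOrder R;
  R.UseOrder.resize(Offsets.size());
  std::iota(R.UseOrder.begin(), R.UseOrder.end(), 0u);
  // Step 1: sort the indices by the offset they refer to.
  std::sort(R.UseOrder.begin(), R.UseOrder.end(),
            [&Offsets](unsigned Left, unsigned Right) {
              return Offsets[Left] < Offsets[Right];
            });

  R.Mask.resize(Offsets.size());
  std::iota(R.Mask.begin(), R.Mask.end(), 0u);
  // Step 2: sort a fresh index list by UseOrder to invert the permutation.
  const std::vector<unsigned> &UseOrder = R.UseOrder;
  std::sort(R.Mask.begin(), R.Mask.end(),
            [&UseOrder](unsigned Left, unsigned Right) {
              return UseOrder[Left] < UseOrder[Right];
            });
  assert(R.Mask.size() == Offsets.size() && "one mask lane per access");
  return R;
}
```

For the accesses a[i+2], a[i+0], a[i+1], a[i+3], calling computeJumbledOrder({2, 0, 1, 3}) yields the mask <2,0,1,3> quoted in the comment: shuffling the sorted vector with that mask puts the lanes back into program order.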

llvm/lib/Analysis/LoopAccessAnalysis.cpp

	static unsigned getAddressSpaceOperand(Value *I) {			static unsigned getAddressSpaceOperand(Value *I) {
	if (LoadInst *L = dyn_cast<LoadInst>(I))			if (LoadInst *L = dyn_cast<LoadInst>(I))
	return L->getPointerAddressSpace();			return L->getPointerAddressSpace();
	if (StoreInst *S = dyn_cast<StoreInst>(I))			if (StoreInst *S = dyn_cast<StoreInst>(I))
	return S->getPointerAddressSpace();			return S->getPointerAddressSpace();
	return -1;			return -1;
	}			}

				// TODO:This API can be improved by using the permutation of given width as the
				// accesses are entered into the map.
				bool llvm::sortLoadAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
				ScalarEvolution &SE,
				SmallVectorImpl<Value *> &Sorted,
				SmallVectorImpl<unsigned> *Mask) {
				SmallVector<std::pair<int64_t, Value *>, 4> OffValPairs;
				OffValPairs.reserve(VL.size());
				Sorted.reserve(VL.size());

				// Walk over the pointers, and map each of them to an offset relative to
				// first pointer in the array.
				Value *Ptr0 = getPointerOperand(VL[0]);
				const SCEV *Scev0 = SE.getSCEV(Ptr0);
				Value *Obj0 = GetUnderlyingObject(Ptr0, DL);
				PointerType *PtrTy = dyn_cast<PointerType>(Ptr0->getType());
				uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType());

				for (auto *Val : VL) {
				// The only kind of access we care about here is load.
				if (!isa<LoadInst>(Val))
				return false;

				Value *Ptr = getPointerOperand(Val);
				assert(Ptr && "Expected value to have a pointer operand.");
				// If a pointer refers to a different underlying object, bail - the
				// pointers are by definition incomparable.
				Value *CurrObj = GetUnderlyingObject(Ptr, DL);
				if (CurrObj != Obj0)
				return false;

				const SCEVConstant *Diff =
				dyn_cast<SCEVConstant>(SE.getMinusSCEV(SE.getSCEV(Ptr), Scev0));
				// The pointers may not have a constant offset from each other, or SCEV
				// may just not be smart enough to figure out they do. Regardless,
				// there's nothing we can do.
				if (!Diff || static_cast<unsigned>(Diff->getAPInt().abs().getSExtValue()) >
				(VL.size() - 1) * Size)
				return false;

				OffValPairs.emplace_back(Diff->getAPInt().getSExtValue(), Val);
				}
				SmallVector<unsigned, 4> UseOrder(VL.size());
				for (unsigned i = 0; i < VL.size(); i++) {
				UseOrder[i] = i;
				}

				// Sort the memory accesses and keep the order of their uses in UseOrder.
				std::sort(UseOrder.begin(), UseOrder.end(),
				[&OffValPairs](unsigned Left, unsigned Right) {
				return OffValPairs[Left].first < OffValPairs[Right].first;
				});

				for (unsigned i = 0; i < VL.size(); i++)
				Sorted.emplace_back(OffValPairs[UseOrder[i]].second);

				// Sort UseOrder to compute the Mask.
				if (Mask) {
				Mask->reserve(VL.size());
				for (unsigned i = 0; i < VL.size(); i++)
				Mask->emplace_back(i);
				std::sort(Mask->begin(), Mask->end(),
				[&UseOrder](unsigned Left, unsigned Right) {
				return UseOrder[Left] < UseOrder[Right];
				});
				}

				return true;
				}


	/// Returns true if the memory operations \p A and \p B are consecutive.			/// Returns true if the memory operations \p A and \p B are consecutive.
	bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,			bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
	ScalarEvolution &SE, bool CheckType) {			ScalarEvolution &SE, bool CheckType) {
	Value *PtrA = getPointerOperand(A);			Value *PtrA = getPointerOperand(A);
	Value *PtrB = getPointerOperand(B);			Value *PtrB = getPointerOperand(B);
	unsigned ASA = getAddressSpaceOperand(A);			unsigned ASA = getAddressSpaceOperand(A);
	unsigned ASB = getAddressSpaceOperand(B);			unsigned ASB = getAddressSpaceOperand(B);

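The checks that decide whether a jumbled bundle can become one wide load can be sketched on plain byte offsets. This is a rough illustration under stated assumptions, not the patch's implementation: isJumbledConsecutive is a hypothetical name, and offsets are taken relative to the first pointer. The patch itself works on SCEV differences in sortLoadAccesses and runs the adjacency test in SLPVectorizer.cpp through isConsecutiveAccess.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if the byte offsets (relative to the first pointer, in program
// order) describe ElemSize-byte accesses that are consecutive once sorted.
// Hypothetical helper: the window test mirrors the bail-out in
// sortLoadAccesses, the adjacency test mirrors the loop over the sorted list.
bool isJumbledConsecutive(std::vector<int64_t> Offsets, int64_t ElemSize) {
  assert(ElemSize > 0 && "element size must be positive");
  if (Offsets.size() < 2)
    return false;

  // Window test: no access may lie further than (N - 1) elements from the
  // first pointer, otherwise the accesses cannot form one contiguous run.
  const int64_t Window = static_cast<int64_t>(Offsets.size() - 1) * ElemSize;
  for (int64_t Off : Offsets) {
    const int64_t Abs = Off < 0 ? -Off : Off;
    if (Abs > Window)
      return false;
  }

  // Adjacency test: after sorting, neighbours must be exactly one element
  // apart. This also rejects duplicate addresses.
  std::sort(Offsets.begin(), Offsets.end());
  for (std::size_t I = 0; I + 1 < Offsets.size(); ++I)
    if (Offsets[I + 1] - Offsets[I] != ElemSize)
      return false;
  return true;
}
```

With 4-byte elements, the offsets {8, 0, 4, 12} pass both tests, while {0, 4, 8, 16} fails the window test because 16 exceeds (4 - 1) * 4 = 12.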

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Show First 20 Lines • Show All 640 Lines • ▼ Show 20 Lines	private:

/// Checks if all users of \p I are the part of the vectorization tree.		/// Checks if all users of \p I are the part of the vectorization tree.
bool areAllUsersVectorized(Instruction *I) const;		bool areAllUsersVectorized(Instruction *I) const;

/// \returns the cost of the vectorizable entry.		/// \returns the cost of the vectorizable entry.
int getEntryCost(TreeEntry *E);		int getEntryCost(TreeEntry *E);

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int UserIndx = -1,
		int OpdNum = 0);

/// \returns True if the ExtractElement/ExtractValue instructions in VL can		/// \returns True if the ExtractElement/ExtractValue instructions in VL can
/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).		/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;		bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree.\p OpdNum indicate the ordinality of
Value *vectorizeTree(TreeEntry *E);		/// operand corresponding to this tree entry \p E for the user tree entry
		/// indicated by \p UserIndx.
/// Vectorize a single entry in the tree, starting in \p VL.		// In other words, "E == TreeEntry[UserIndx].getOperand(OpdNum)".
Value *vectorizeTree(ArrayRef<Value *> VL);		Value *vectorizeTree(TreeEntry *E, int OpdNum = 0, int UserIndx = -1);

		/// Vectorize a single entry in the tree, starting in \p VL.\p OpdNum indicate
		/// the ordinality of operand corresponding to the \p VL of scalar values for the
		/// user indicated by \p UserIndx this \p VL feeds into.
		Value *vectorizeTree(ArrayRef<Value *> VL, int OpdNum = 0, int UserIndx = -1);

/// \returns the pointer to the vectorized value if \p VL is already		/// \returns the pointer to the vectorized value if \p VL is already
/// vectorized, or NULL. They may happen in cycles.		/// vectorized, or NULL. They may happen in cycles.
Value *alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const;		Value *alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const;

/// \returns the scalarization cost for this type. Scalarization in this		/// \returns the scalarization cost for this type. Scalarization in this
/// context means the creation of vectors from a group of scalars.		/// context means the creation of vectors from a group of scalars.
int getGatherCost(Type *Ty);		int getGatherCost(Type *Ty);
Show All 29 Lines	struct TreeEntry {
TreeEntry(std::vector<TreeEntry> &Container) : Container(Container) {}		TreeEntry(std::vector<TreeEntry> &Container) : Container(Container) {}

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
assert(VL.size() == Scalars.size() && "Invalid size");		assert(VL.size() == Scalars.size() && "Invalid size");
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
}		}

		/// \returns true if the scalars in VL are found in this tree entry.
		bool isFoundJumbled(ArrayRef<Value *> VL, const DataLayout &DL,
		ScalarEvolution &SE) const {
		assert(VL.size() == Scalars.size() && "Invalid size");
		SmallVector<Value *, 8> List;
		if (!sortLoadAccesses(VL, DL, SE, List))
		return false;
		return std::equal(List.begin(), List.end(), Scalars.begin());
		}

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
bool NeedToGather = false;		bool NeedToGather = false;

		/// Records optional shuffle mask for the uses of jumbled memory accesses.
		/// For example, a non-empty ShuffleMask[1] represents the permutation of
		/// lanes that operand #1 of this vectorized instruction should undergo
		/// before feeding this vectorized instruction, whereas an empty
		/// ShuffleMask[0] indicates that the lanes of operand #0 of this vectorized
		/// instruction need not be permuted at all.
		SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
std::vector<TreeEntry> &Container;		std::vector<TreeEntry> &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<int, 1> UserTreeIndices;		SmallVector<int, 1> UserTreeIndices;
};		};

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized,		TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized,
int &UserTreeIdx) {		int &UserTreeIdx, const InstructionsState &S,
		ArrayRef<unsigned> ShuffleMask = None,
		int OpdNum = 0) {
		assert((!Vectorized || S.Opcode != 0) &&
		"Vectorized TreeEntry without opcode");
VectorizableTree.emplace_back(VectorizableTree);		VectorizableTree.emplace_back(VectorizableTree);

int idx = VectorizableTree.size() - 1;		int idx = VectorizableTree.size() - 1;
TreeEntry *Last = &VectorizableTree[idx];		TreeEntry *Last = &VectorizableTree[idx];
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->NeedToGather = !Vectorized;		Last->NeedToGather = !Vectorized;

		TreeEntry *UserTreeEntry = nullptr;
		if (UserTreeIdx != -1)
		UserTreeEntry = &VectorizableTree[UserTreeIdx];

		if (UserTreeEntry && !ShuffleMask.empty()) {
		if ((unsigned)OpdNum >= UserTreeEntry->ShuffleMask.size())
		UserTreeEntry->ShuffleMask.resize(OpdNum + 1);
		assert(UserTreeEntry->ShuffleMask[OpdNum].empty() &&
		"Mask already present");
		using mask = SmallVector<unsigned, 4>;
		mask tempMask(ShuffleMask.begin(), ShuffleMask.end());
		UserTreeEntry->ShuffleMask[OpdNum] = tempMask;
		}
if (Vectorized) {		if (Vectorized) {
for (int i = 0, e = VL.size(); i != e; ++i) {		for (int i = 0, e = VL.size(); i != e; ++i) {
assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");		assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");
ScalarToTreeEntry[VL[i]] = idx;		ScalarToTreeEntry[VL[i]] = idx;
}		}
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}
▲ Show 20 Lines • Show All 636 Lines • ▼ Show 20 Lines	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Lane << " from " << *Scalar << ".\n");		Lane << " from " << *Scalar << ".\n");
ExternalUses.push_back(ExternalUser(Scalar, U, Lane));		ExternalUses.push_back(ExternalUser(Scalar, U, Lane));
}		}
}		}
}		}
}		}

void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
int UserTreeIdx) {		int UserTreeIdx, int OpdNum) {
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

InstructionsState S = getSameOpcode(VL);		InstructionsState S = getSameOpcode(VL);
if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}

// Don't handle vectors.		// Don't handle vectors.
if (S.OpValue->getType()->isVectorTy()) {		if (S.OpValue->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}

if (StoreInst *SI = dyn_cast<StoreInst>(S.OpValue))		if (StoreInst *SI = dyn_cast<StoreInst>(S.OpValue))
if (SI->getValueOperand()->getType()->isVectorTy()) {		if (SI->getValueOperand()->getType()->isVectorTy()) {
DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");		DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}

// If all of the operands are identical or constant we have a simple solution.		// If all of the operands are identical or constant we have a simple solution.
if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !S.Opcode) {		if (allConstant(VL) || isSplat(VL) || !allSameBlock(VL) || !S.Opcode) {
DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");		DEBUG(dbgs() << "SLP: Gathering due to C,S,B,O. \n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}

// We now know that this is a vector of instructions of the same type from		// We now know that this is a vector of instructions of the same type from
// the same block.		// the same block.

// Don't vectorize ephemeral values.		// Don't vectorize ephemeral values.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (EphValues.count(VL[i])) {		if (EphValues.count(VL[i])) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is ephemeral.\n");		") is ephemeral.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}

// Check if this is a duplicate of another entry.		// Check if this is a duplicate of another entry.
if (TreeEntry *E = getTreeEntry(S.OpValue)) {		if (TreeEntry *E = getTreeEntry(S.OpValue)) {
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");		DEBUG(dbgs() << "SLP: \tChecking bundle: " << *VL[i] << ".\n");
if (E->Scalars[i] != VL[i]) {		if (E->Scalars[i] != VL[i]) {
DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");		DEBUG(dbgs() << "SLP: Gathering due to partial overlap.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}
// Record the reuse of the tree node. FIXME, currently this is only used to		// Record the reuse of the tree node. FIXME, currently this is only used to
// properly draw the graph rather than for the actual vectorization.		// properly draw the graph rather than for the actual vectorization.
E->UserTreeIndices.push_back(UserTreeIdx);		E->UserTreeIndices.push_back(UserTreeIdx);
DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *S.OpValue << ".\n");		DEBUG(dbgs() << "SLP: Perfect diamond merge at " << *S.OpValue << ".\n");
return;		return;
}		}

// Check that none of the instructions in the bundle are already in the tree.		// Check that none of the instructions in the bundle are already in the tree.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
auto *I = dyn_cast<Instruction>(VL[i]);		auto *I = dyn_cast<Instruction>(VL[i]);
if (!I)		if (!I)
continue;		continue;
if (getTreeEntry(I)) {		if (getTreeEntry(I)) {
DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<		DEBUG(dbgs() << "SLP: The instruction (" << *VL[i] <<
") is already in tree.\n");		") is already in tree.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}

// If any of the scalars is marked as a value that needs to stay scalar, then		// If any of the scalars is marked as a value that needs to stay scalar, then
// we need to gather the scalars.		// we need to gather the scalars.
for (unsigned i = 0, e = VL.size(); i != e; ++i) {		for (unsigned i = 0, e = VL.size(); i != e; ++i) {
if (MustGather.count(VL[i])) {		if (MustGather.count(VL[i])) {
DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");		DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}

// Check that all of the users of the scalars that we want to vectorize are		// Check that all of the users of the scalars that we want to vectorize are
// schedulable.		// schedulable.
auto *VL0 = cast<Instruction>(S.OpValue);		auto *VL0 = cast<Instruction>(S.OpValue);
BasicBlock *BB = VL0->getParent();		BasicBlock *BB = VL0->getParent();

if (!DT->isReachableFromEntry(BB)) {		if (!DT->isReachableFromEntry(BB)) {
// Don't go into unreachable blocks. They may contain instructions with		// Don't go into unreachable blocks. They may contain instructions with
// dependency cycles which confuse the final scheduling.		// dependency cycles which confuse the final scheduling.
DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");		DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}

// Check that every instruction appears once in this bundle.		// Check that every instruction appears once in this bundle.
for (unsigned i = 0, e = VL.size(); i < e; ++i)		for (unsigned i = 0, e = VL.size(); i < e; ++i)
for (unsigned j = i + 1; j < e; ++j)		for (unsigned j = i + 1; j < e; ++j)
if (VL[i] == VL[j]) {		if (VL[i] == VL[j]) {
DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");		DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}

auto &BSRef = BlocksSchedules[BB];		auto &BSRef = BlocksSchedules[BB];
if (!BSRef)		if (!BSRef)
BSRef = llvm::make_unique<BlockScheduling>(BB);		BSRef = llvm::make_unique<BlockScheduling>(BB);

BlockScheduling &BS = *BSRef.get();		BlockScheduling &BS = *BSRef.get();

if (!BS.tryScheduleBundle(VL, this, S.OpValue)) {		if (!BS.tryScheduleBundle(VL, this, S.OpValue)) {
DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");		DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");
assert((!BS.getScheduleData(VL0) ||		assert((!BS.getScheduleData(VL0) ||
!BS.getScheduleData(VL0)->isPartOfBundle()) &&		!BS.getScheduleData(VL0)->isPartOfBundle()) &&
"tryScheduleBundle should cancelScheduling on failure");		"tryScheduleBundle should cancelScheduling on failure");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");		DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");

unsigned ShuffleOrOp = S.IsAltShuffle ?		unsigned ShuffleOrOp = S.IsAltShuffle ?
(unsigned) Instruction::ShuffleVector : S.Opcode;		(unsigned) Instruction::ShuffleVector : S.Opcode;
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI: {		case Instruction::PHI: {
PHINode *PH = dyn_cast<PHINode>(VL0);		PHINode *PH = dyn_cast<PHINode>(VL0);

// Check for terminator values (e.g. invoke).		// Check for terminator values (e.g. invoke).
for (unsigned j = 0; j < VL.size(); ++j)		for (unsigned j = 0; j < VL.size(); ++j)
for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
TerminatorInst *Term = dyn_cast<TerminatorInst>(		TerminatorInst *Term = dyn_cast<TerminatorInst>(
cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));		cast<PHINode>(VL[j])->getIncomingValueForBlock(PH->getIncomingBlock(i)));
if (Term) {		if (Term) {
DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");		DEBUG(dbgs() << "SLP: Need to swizzle PHINodes (TerminatorInst use).\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}

newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");		DEBUG(dbgs() << "SLP: added a vector of PHINodes.\n");

for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(		Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = canReuseExtract(VL, VL0);		bool Reuse = canReuseExtract(VL, VL0);
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");
} else {		} else {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
}		}
newTreeEntry(VL, Reuse, UserTreeIdx);		newTreeEntry(VL, Reuse, UserTreeIdx, S);
return;		return;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load. For example, we don't want to vectorize loads that are smaller		// load. For example, we don't want to vectorize loads that are smaller
// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM		// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
// treats loading/storing it as an i8 struct. If we vectorize loads/stores		// treats loading/storing it as an i8 struct. If we vectorize loads/stores
// from such a struct, we read/write packed bits disagreeing with the		// from such a struct, we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();

if (DL->getTypeSizeInBits(ScalarTy) !=		if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {		DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");		DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;		return;
}		}

// Make sure all loads in the bundle are simple - we can't vectorize		// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.		// atomic or volatile loads.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
LoadInst *L = cast<LoadInst>(VL[i]);		LoadInst *L = cast<LoadInst>(VL[i]);
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
}		}

// Check if the loads are consecutive, reversed, or neither.		// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
bool Consecutive = true;		bool Consecutive = true;
bool ReverseConsecutive = true;		bool ReverseConsecutive = true;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
Consecutive = false;		Consecutive = false;
break;		break;
} else {		} else {
ReverseConsecutive = false;		ReverseConsecutive = false;
}		}
}		}

if (Consecutive) {		if (Consecutive) {
++NumLoadsWantToKeepOrder;		++NumLoadsWantToKeepOrder;
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a vector of loads.\n");		DEBUG(dbgs() << "SLP: added a vector of loads.\n");
return;		return;
}		}

// If none of the load pairs were consecutive when checked in order,		// If none of the load pairs were consecutive when checked in order,
// check the reverse order.		// check the reverse order.
if (ReverseConsecutive)		if (ReverseConsecutive)
for (unsigned i = VL.size() - 1; i > 0; --i)		for (unsigned i = VL.size() - 1; i > 0; --i)
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;		ReverseConsecutive = false;
break;		break;
}		}

BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);

if (ReverseConsecutive) {		if (ReverseConsecutive) {
++NumLoadsWantToChangeOrder;
DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");		DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");
} else {		++NumLoadsWantToChangeOrder;
DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		BS.cancelScheduling(VL, VL0);
		newTreeEntry(VL, false, UserTreeIdx, S);
		return;
}		}

		if (VL.size() > 2) {
		bool ShuffledLoads = true;
		SmallVector<Value *, 8> Sorted;
		SmallVector<unsigned, 4> Mask;
		if (sortLoadAccesses(VL, DL, SE, Sorted, &Mask)) {
		auto NewVL = makeArrayRef(Sorted.begin(), Sorted.end());
		for (unsigned i = 0, e = NewVL.size() - 1; i < e; ++i) {
		if (!isConsecutiveAccess(NewVL[i], NewVL[i + 1], DL, SE)) {
		ShuffledLoads = false;
		break;
		}
		}
		// TODO: Tracking how many loads want to have arbitrary shuffled order
		// would be useful.
		if (ShuffledLoads) {
		DEBUG(dbgs() << "SLP: added a vector of loads which needs "
		"permutation of loaded lanes.\n");
		newTreeEntry(NewVL, true, UserTreeIdx, S,
		makeArrayRef(Mask.begin(), Mask.end()), OpdNum);
		return;
		}
		}
		}

		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
		BS.cancelScheduling(VL, VL0);
		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
case Instruction::SIToFP:		case Instruction::SIToFP:
case Instruction::UIToFP:		case Instruction::UIToFP:
case Instruction::Trunc:		case Instruction::Trunc:
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
Type *SrcTy = VL0->getOperand(0)->getType();		Type *SrcTy = VL0->getOperand(0)->getType();
for (unsigned i = 0; i < VL.size(); ++i) {		for (unsigned i = 0; i < VL.size(); ++i) {
Type *Ty = cast<Instruction>(VL[i])->getOperand(0)->getType();		Type *Ty = cast<Instruction>(VL[i])->getOperand(0)->getType();
if (Ty != SrcTy || !isValidElementType(Ty)) {		if (Ty != SrcTy || !isValidElementType(Ty)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");		DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");
return;		return;
}		}
}		}
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a vector of casts.\n");		DEBUG(dbgs() << "SLP: added a vector of casts.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
// Check that all of the compares have the same predicate.		// Check that all of the compares have the same predicate.
CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Type *ComparedTy = VL0->getOperand(0)->getType();		Type *ComparedTy = VL0->getOperand(0)->getType();
for (unsigned i = 1, e = VL.size(); i < e; ++i) {		for (unsigned i = 1, e = VL.size(); i < e; ++i) {
CmpInst *Cmp = cast<CmpInst>(VL[i]);		CmpInst *Cmp = cast<CmpInst>(VL[i]);
if (Cmp->getPredicate() != P0 ||		if (Cmp->getPredicate() != P0 ||
Cmp->getOperand(0)->getType() != ComparedTy) {		Cmp->getOperand(0)->getType() != ComparedTy) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");		DEBUG(dbgs() << "SLP: Gathering cmp with different predicate.\n");
return;		return;
}		}
}		}

newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a vector of compares.\n");		DEBUG(dbgs() << "SLP: added a vector of compares.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::Select:		case Instruction::Select:
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::URem:		case Instruction::URem:
case Instruction::SRem:		case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor:		case Instruction::Xor:
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a vector of bin op.\n");		DEBUG(dbgs() << "SLP: added a vector of bin op.\n");

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(S.Opcode, VL, Left, Right);		reorderInputsAccordingToOpcode(S.Opcode, VL, Left, Right);
buildTree_rec(Left, Depth + 1, UserTreeIdx);		buildTree_rec(Left, Depth + 1, UserTreeIdx);
buildTree_rec(Right, Depth + 1, UserTreeIdx);		buildTree_rec(Right, Depth + 1, UserTreeIdx, 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;

case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
// We don't combine GEPs with complicated (nested) indexing.		// We don't combine GEPs with complicated (nested) indexing.
for (unsigned j = 0; j < VL.size(); ++j) {		for (unsigned j = 0; j < VL.size(); ++j) {
if (cast<Instruction>(VL[j])->getNumOperands() != 2) {		if (cast<Instruction>(VL[j])->getNumOperands() != 2) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}

// We can't combine several GEPs into one vector if they operate on		// We can't combine several GEPs into one vector if they operate on
// different types.		// different types.
Type *Ty0 = VL0->getOperand(0)->getType();		Type *Ty0 = VL0->getOperand(0)->getType();
for (unsigned j = 0; j < VL.size(); ++j) {		for (unsigned j = 0; j < VL.size(); ++j) {
Type *CurTy = cast<Instruction>(VL[j])->getOperand(0)->getType();		Type *CurTy = cast<Instruction>(VL[j])->getOperand(0)->getType();
if (Ty0 != CurTy) {		if (Ty0 != CurTy) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (different types).\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}

// We don't combine GEPs with non-constant indexes.		// We don't combine GEPs with non-constant indexes.
for (unsigned j = 0; j < VL.size(); ++j) {		for (unsigned j = 0; j < VL.size(); ++j) {
auto Op = cast<Instruction>(VL[j])->getOperand(1);		auto Op = cast<Instruction>(VL[j])->getOperand(1);
if (!isa<ConstantInt>(Op)) {		if (!isa<ConstantInt>(Op)) {
DEBUG(		DEBUG(
dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");		dbgs() << "SLP: not-vectorizable GEP (non-constant indexes).\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
return;		return;
}		}
}		}

newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");		DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");
for (unsigned i = 0, e = 2; i < e; ++i) {		for (unsigned i = 0, e = 2; i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::Store: {		case Instruction::Store: {
// Check if the stores are consecutive or of we need to swizzle them.		// Check if the stores are consecutive or of we need to swizzle them.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: Non-consecutive store.\n");		DEBUG(dbgs() << "SLP: Non-consecutive store.\n");
return;		return;
}		}

newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a vector of stores.\n");		DEBUG(dbgs() << "SLP: added a vector of stores.\n");

ValueList Operands;		ValueList Operands;
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(0));		Operands.push_back(cast<Instruction>(j)->getOperand(0));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx);
return;		return;
}		}
case Instruction::Call: {		case Instruction::Call: {
// Check if the calls are all to the same vectorizable intrinsic.		// Check if the calls are all to the same vectorizable intrinsic.
CallInst *CI = cast<CallInst>(VL0);		CallInst *CI = cast<CallInst>(VL0);
// Check if this is an Intrinsic call or something that can be		// Check if this is an Intrinsic call or something that can be
// represented by an intrinsic call		// represented by an intrinsic call
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
if (!isTriviallyVectorizable(ID)) {		if (!isTriviallyVectorizable(ID)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");		DEBUG(dbgs() << "SLP: Non-vectorizable call.\n");
return;		return;
}		}
Function *Int = CI->getCalledFunction();		Function *Int = CI->getCalledFunction();
Value *A1I = nullptr;		Value *A1I = nullptr;
if (hasVectorInstrinsicScalarOpd(ID, 1))		if (hasVectorInstrinsicScalarOpd(ID, 1))
A1I = CI->getArgOperand(1);		A1I = CI->getArgOperand(1);
for (unsigned i = 1, e = VL.size(); i != e; ++i) {		for (unsigned i = 1, e = VL.size(); i != e; ++i) {
CallInst *CI2 = dyn_cast<CallInst>(VL[i]);		CallInst *CI2 = dyn_cast<CallInst>(VL[i]);
if (!CI2 || CI2->getCalledFunction() != Int ||		if (!CI2 || CI2->getCalledFunction() != Int ||
getVectorIntrinsicIDForCall(CI2, TLI) != ID ||		getVectorIntrinsicIDForCall(CI2, TLI) != ID ||
!CI->hasIdenticalOperandBundleSchema(*CI2)) {		!CI->hasIdenticalOperandBundleSchema(*CI2)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: mismatched calls:" << *CI << "!=" << *VL[i]		DEBUG(dbgs() << "SLP: mismatched calls:" << *CI << "!=" << *VL[i]
<< "\n");		<< "\n");
return;		return;
}		}
// ctlz,cttz and powi are special intrinsics whose second argument		// ctlz,cttz and powi are special intrinsics whose second argument
// should be same in order for them to be vectorized.		// should be same in order for them to be vectorized.
if (hasVectorInstrinsicScalarOpd(ID, 1)) {		if (hasVectorInstrinsicScalarOpd(ID, 1)) {
Value *A1J = CI2->getArgOperand(1);		Value *A1J = CI2->getArgOperand(1);
if (A1I != A1J) {		if (A1I != A1J) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI		DEBUG(dbgs() << "SLP: mismatched arguments in call:" << *CI
<< " argument "<< A1I<<"!=" << A1J		<< " argument "<< A1I<<"!=" << A1J
<< "\n");		<< "\n");
return;		return;
}		}
}		}
// Verify that the bundle operands are identical between the two calls.		// Verify that the bundle operands are identical between the two calls.
if (CI->hasOperandBundles() &&		if (CI->hasOperandBundles() &&
!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),		!std::equal(CI->op_begin() + CI->getBundleOperandsStartIndex(),
CI->op_begin() + CI->getBundleOperandsEndIndex(),		CI->op_begin() + CI->getBundleOperandsEndIndex(),
CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {		CI2->op_begin() + CI2->getBundleOperandsStartIndex())) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="		DEBUG(dbgs() << "SLP: mismatched bundle operands in calls:" << *CI << "!="
<< *VL[i] << '\n');		<< *VL[i] << '\n');
return;		return;
}		}
}		}

newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {		for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL) {		for (Value *j : VL) {
CallInst *CI2 = dyn_cast<CallInst>(j);		CallInst *CI2 = dyn_cast<CallInst>(j);
Operands.push_back(CI2->getArgOperand(i));		Operands.push_back(CI2->getArgOperand(i));
}		}
buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::ShuffleVector:		case Instruction::ShuffleVector:
// If this is not an alternate sequence of opcode like add-sub		// If this is not an alternate sequence of opcode like add-sub
// then do not vectorize this instruction.		// then do not vectorize this instruction.
if (!S.IsAltShuffle) {		if (!S.IsAltShuffle) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");		DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");
return;		return;
}		}
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");		DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");

// Reorder operands if reordering would enable vectorization.		// Reorder operands if reordering would enable vectorization.
if (isa<BinaryOperator>(VL0)) {		if (isa<BinaryOperator>(VL0)) {
ValueList Left, Right;		ValueList Left, Right;
reorderAltShuffleOperands(S.Opcode, VL, Left, Right);		reorderAltShuffleOperands(S.Opcode, VL, Left, Right);
buildTree_rec(Left, Depth + 1, UserTreeIdx);		buildTree_rec(Left, Depth + 1, UserTreeIdx);
buildTree_rec(Right, Depth + 1, UserTreeIdx);		buildTree_rec(Right, Depth + 1, UserTreeIdx, 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;

default:		default:
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx, S);
DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}

unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
unsigned N;		unsigned N;
Type *EltTy;		Type *EltTy;
[... 821 lines not shown ...]
Value *BoUpSLP::alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const {		Value *BoUpSLP::alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const {
if (const TreeEntry *En = getTreeEntry(OpValue)) {		if (const TreeEntry *En = getTreeEntry(OpValue)) {
if (En->isSame(VL) && En->VectorizedValue)		if (En->isSame(VL) && En->VectorizedValue)
return En->VectorizedValue;		return En->VectorizedValue;
}		}
return nullptr;		return nullptr;
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL, int OpdNum, int UserIndx) {
InstructionsState S = getSameOpcode(VL);		InstructionsState S = getSameOpcode(VL);
if (S.Opcode) {		if (S.Opcode) {
if (TreeEntry *E = getTreeEntry(S.OpValue)) {		if (TreeEntry *E = getTreeEntry(S.OpValue)) {
if (E->isSame(VL))		TreeEntry *UserTreeEntry = nullptr;
return vectorizeTree(E);		if (UserIndx != -1)
		UserTreeEntry = &VectorizableTree[UserIndx];

		if (E->isSame(VL) ||
		(UserTreeEntry &&
		(unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() &&
		!UserTreeEntry->ShuffleMask[OpdNum].empty() &&
		E->isFoundJumbled(VL, DL, SE)))
		return vectorizeTree(E, OpdNum, UserIndx);
}		}
}		}

Type *ScalarTy = S.OpValue->getType();		Type *ScalarTy = S.OpValue->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(S.OpValue))		if (StoreInst *SI = dyn_cast<StoreInst>(S.OpValue))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, VL.size());		VectorType *VecTy = VectorType::get(ScalarTy, VL.size());

return Gather(VL, VecTy);		return Gather(VL, VecTy);
}		}

Value *BoUpSLP::vectorizeTree(TreeEntry *E) {		Value *BoUpSLP::vectorizeTree(TreeEntry *E, int OpdNum, int UserIndx) {
IRBuilder<>::InsertPointGuard Guard(Builder);		IRBuilder<>::InsertPointGuard Guard(Builder);

		TreeEntry *UserTreeEntry = nullptr;
if (E->VectorizedValue) {		if (E->VectorizedValue) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

InstructionsState S = getSameOpcode(E->Scalars);		InstructionsState S = getSameOpcode(E->Scalars);
Instruction *VL0 = cast<Instruction>(E->Scalars[0]);		Instruction *VL0 = cast<Instruction>(E->Scalars[0]);
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL0))		if (StoreInst *SI = dyn_cast<StoreInst>(VL0))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, E->Scalars.size());		VectorType *VecTy = VectorType::get(ScalarTy, E->Scalars.size());

if (E->NeedToGather) {		if (E->NeedToGather) {
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);
auto *V = Gather(E->Scalars, VecTy);		auto *V = Gather(E->Scalars, VecTy);
E->VectorizedValue = V;		E->VectorizedValue = V;
return V;		return V;
}		}

		assert(ScalarToTreeEntry.count(E->Scalars[0]) &&
		"Expected user tree entry, missing!");
		int CurrIndx = ScalarToTreeEntry[E->Scalars[0]];

unsigned ShuffleOrOp = S.IsAltShuffle ?		unsigned ShuffleOrOp = S.IsAltShuffle ?
(unsigned) Instruction::ShuffleVector : S.Opcode;		(unsigned) Instruction::ShuffleVector : S.Opcode;
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI: {		case Instruction::PHI: {
PHINode *PH = dyn_cast<PHINode>(VL0);		PHINode *PH = dyn_cast<PHINode>(VL0);
Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());		Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());
Builder.SetCurrentDebugLocation(PH->getDebugLoc());		Builder.SetCurrentDebugLocation(PH->getDebugLoc());
PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());		PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());
[... 13 lines not shown ...]	case Instruction::PHI: {
}		}

// Prepare the operand vector.		// Prepare the operand vector.
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
Operands.push_back(cast<PHINode>(V)->getIncomingValueForBlock(IBB));		Operands.push_back(cast<PHINode>(V)->getIncomingValueForBlock(IBB));

Builder.SetInsertPoint(IBB->getTerminator());		Builder.SetInsertPoint(IBB->getTerminator());
Builder.SetCurrentDebugLocation(PH->getDebugLoc());		Builder.SetCurrentDebugLocation(PH->getDebugLoc());
Value *Vec = vectorizeTree(Operands);		Value *Vec = vectorizeTree(Operands, i, CurrIndx);
NewPhi->addIncoming(Vec, IBB);		NewPhi->addIncoming(Vec, IBB);
}		}

assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&		assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&
"Invalid number of incoming values");		"Invalid number of incoming values");
return NewPhi;		return NewPhi;
}		}

[... 36 lines not shown ...]	switch (ShuffleOrOp) {
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
ValueList INVL;		ValueList INVL;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
INVL.push_back(cast<Instruction>(V)->getOperand(0));		INVL.push_back(cast<Instruction>(V)->getOperand(0));

setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

Value *InVec = vectorizeTree(INVL);		Value *InVec = vectorizeTree(INVL, 0, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

CastInst *CI = dyn_cast<CastInst>(VL0);		CastInst *CI = dyn_cast<CastInst>(VL0);
Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);		Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::FCmp:		case Instruction::FCmp:
case Instruction::ICmp: {		case Instruction::ICmp: {
ValueList LHSV, RHSV;		ValueList LHSV, RHSV;
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
LHSV.push_back(cast<Instruction>(V)->getOperand(0));		LHSV.push_back(cast<Instruction>(V)->getOperand(0));
RHSV.push_back(cast<Instruction>(V)->getOperand(1));		RHSV.push_back(cast<Instruction>(V)->getOperand(1));
}		}

setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

Value *L = vectorizeTree(LHSV);		Value *L = vectorizeTree(LHSV, 0, CurrIndx);
Value *R = vectorizeTree(RHSV);		Value *R = vectorizeTree(RHSV, 1, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Value *V;		Value *V;
if (S.Opcode == Instruction::FCmp)		if (S.Opcode == Instruction::FCmp)
V = Builder.CreateFCmp(P0, L, R);		V = Builder.CreateFCmp(P0, L, R);
[... 10 lines not shown ...]	case Instruction::Select: {
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
CondVec.push_back(cast<Instruction>(V)->getOperand(0));		CondVec.push_back(cast<Instruction>(V)->getOperand(0));
TrueVec.push_back(cast<Instruction>(V)->getOperand(1));		TrueVec.push_back(cast<Instruction>(V)->getOperand(1));
FalseVec.push_back(cast<Instruction>(V)->getOperand(2));		FalseVec.push_back(cast<Instruction>(V)->getOperand(2));
}		}

setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

Value *Cond = vectorizeTree(CondVec);		Value *Cond = vectorizeTree(CondVec, 0, CurrIndx);
Value *True = vectorizeTree(TrueVec);		Value *True = vectorizeTree(TrueVec, 1, CurrIndx);
Value *False = vectorizeTree(FalseVec);		Value *False = vectorizeTree(FalseVec, 2, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

Value *V = Builder.CreateSelect(Cond, True, False);		Value *V = Builder.CreateSelect(Cond, True, False);
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
[... 24 lines not shown ...]	case Instruction::Xor: {
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
auto *I = cast<Instruction>(V);		auto *I = cast<Instruction>(V);
LHSVL.push_back(I->getOperand(0));		LHSVL.push_back(I->getOperand(0));
RHSVL.push_back(I->getOperand(1));		RHSVL.push_back(I->getOperand(1));
}		}

setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

Value *LHS = vectorizeTree(LHSVL);		Value *LHS = vectorizeTree(LHSVL, 0, CurrIndx);
Value *RHS = vectorizeTree(RHSVL);		Value *RHS = vectorizeTree(RHSVL, 1, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

Value *V = Builder.CreateBinOp(		Value *V = Builder.CreateBinOp(
static_cast<Instruction::BinaryOps>(S.Opcode), LHS, RHS);		static_cast<Instruction::BinaryOps>(S.Opcode), LHS, RHS);
E->VectorizedValue = V;		E->VectorizedValue = V;
propagateIRFlags(E->VectorizedValue, E->Scalars, VL0);		propagateIRFlags(E->VectorizedValue, E->Scalars, VL0);
++NumVectorInstructions;		++NumVectorInstructions;

if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
return propagateMetadata(I, E->Scalars);		return propagateMetadata(I, E->Scalars);

return V;		return V;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Loads are inserted at the head of the tree because we don't want to		// Loads are inserted at the head of the tree because we don't want to
// sink them all the way down past store instructions.		// sink them all the way down past store instructions.
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

LoadInst *LI = cast<LoadInst>(VL0);		if (UserIndx != -1)
		UserTreeEntry = &VectorizableTree[UserIndx];

		bool isJumbled = false;
		LoadInst *LI = NULL;
		if (UserTreeEntry &&
		(unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() &&
		!UserTreeEntry->ShuffleMask[OpdNum].empty()) {
		isJumbled = true;
		LI = cast<LoadInst>(E->Scalars[0]);
		} else {
		LI = cast<LoadInst>(VL0);
		}

Type *ScalarLoadTy = LI->getType();		Type *ScalarLoadTy = LI->getType();
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();

Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));

unsigned Alignment = LI->getAlignment();		unsigned Alignment = LI->getAlignment();
LI = Builder.CreateLoad(VecPtr);		LI = Builder.CreateLoad(VecPtr);
if (!Alignment) {		if (!Alignment) {
Alignment = DL->getABITypeAlignment(ScalarLoadTy);		Alignment = DL->getABITypeAlignment(ScalarLoadTy);
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
E->VectorizedValue = LI;		E->VectorizedValue = LI;
++NumVectorInstructions;		++NumVectorInstructions;
return propagateMetadata(LI, E->Scalars);		propagateMetadata(LI, E->Scalars);

		if (isJumbled) {
		SmallVector<Constant *, 8> Mask;
		for (unsigned LaneEntry : UserTreeEntry->ShuffleMask[OpdNum])
		Mask.push_back(Builder.getInt32(LaneEntry));
		// Generate shuffle for jumbled memory access
		Value *Undef = UndefValue::get(VecTy);
		Value *Shuf = Builder.CreateShuffleVector((Value *)LI, Undef,
		ConstantVector::get(Mask));
		E->VectorizedValue = Shuf;
		++NumVectorInstructions;
		return Shuf;
		}
		return LI;
}		}
case Instruction::Store: {		case Instruction::Store: {
StoreInst *SI = cast<StoreInst>(VL0);		StoreInst *SI = cast<StoreInst>(VL0);
unsigned Alignment = SI->getAlignment();		unsigned Alignment = SI->getAlignment();
unsigned AS = SI->getPointerAddressSpace();		unsigned AS = SI->getPointerAddressSpace();

ValueList ScalarStoreValues;		ValueList ScalarStoreValues;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
ScalarStoreValues.push_back(cast<StoreInst>(V)->getValueOperand());		ScalarStoreValues.push_back(cast<StoreInst>(V)->getValueOperand());

setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

Value *VecValue = vectorizeTree(ScalarStoreValues);		Value *VecValue = vectorizeTree(ScalarStoreValues, 0, CurrIndx);
Value *ScalarPtr = SI->getPointerOperand();		Value *ScalarPtr = SI->getPointerOperand();
Value *VecPtr = Builder.CreateBitCast(ScalarPtr, VecTy->getPointerTo(AS));		Value *VecPtr = Builder.CreateBitCast(ScalarPtr, VecTy->getPointerTo(AS));
StoreInst *S = Builder.CreateStore(VecValue, VecPtr);		StoreInst *S = Builder.CreateStore(VecValue, VecPtr);

// The pointer operand uses an in-tree scalar, so add the new BitCast to		// The pointer operand uses an in-tree scalar, so add the new BitCast to
// ExternalUses to make sure that an extract will be generated in the		// ExternalUses to make sure that an extract will be generated in the
// future.		// future.
if (getTreeEntry(ScalarPtr))		if (getTreeEntry(ScalarPtr))
[... 9 lines not shown ...]	switch (ShuffleOrOp) {
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

ValueList Op0VL;		ValueList Op0VL;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
Op0VL.push_back(cast<GetElementPtrInst>(V)->getOperand(0));		Op0VL.push_back(cast<GetElementPtrInst>(V)->getOperand(0));

Value *Op0 = vectorizeTree(Op0VL);		Value *Op0 = vectorizeTree(Op0VL, 0, CurrIndx);

std::vector<Value *> OpVecs;		std::vector<Value *> OpVecs;
for (int j = 1, e = cast<GetElementPtrInst>(VL0)->getNumOperands(); j < e;		for (int j = 1, e = cast<GetElementPtrInst>(VL0)->getNumOperands(); j < e;
++j) {		++j) {
ValueList OpVL;		ValueList OpVL;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
OpVL.push_back(cast<GetElementPtrInst>(V)->getOperand(j));		OpVL.push_back(cast<GetElementPtrInst>(V)->getOperand(j));

Value *OpVec = vectorizeTree(OpVL);		Value *OpVec = vectorizeTree(OpVL, j, CurrIndx);
OpVecs.push_back(OpVec);		OpVecs.push_back(OpVec);
}		}

Value *V = Builder.CreateGEP(		Value *V = Builder.CreateGEP(
cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);		cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

[... 22 lines not shown ...]	case Instruction::Call: {
OpVecs.push_back(CEI->getArgOperand(j));		OpVecs.push_back(CEI->getArgOperand(j));
continue;		continue;
}		}
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
CallInst *CEI = cast<CallInst>(V);		CallInst *CEI = cast<CallInst>(V);
OpVL.push_back(CEI->getArgOperand(j));		OpVL.push_back(CEI->getArgOperand(j));
}		}

Value *OpVec = vectorizeTree(OpVL);		Value *OpVec = vectorizeTree(OpVL, j, CurrIndx);
DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");		DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");
OpVecs.push_back(OpVec);		OpVecs.push_back(OpVec);
}		}

Module *M = F->getParent();		Module *M = F->getParent();
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
Type *Tys[] = { VectorType::get(CI->getType(), E->Scalars.size()) };		Type *Tys[] = { VectorType::get(CI->getType(), E->Scalars.size()) };
Function *CF = Intrinsic::getDeclaration(M, ID, Tys);		Function *CF = Intrinsic::getDeclaration(M, ID, Tys);
[... 14 lines not shown ...]	switch (ShuffleOrOp) {
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
ValueList LHSVL, RHSVL;		ValueList LHSVL, RHSVL;
assert(Instruction::isBinaryOp(S.Opcode) &&		assert(Instruction::isBinaryOp(S.Opcode) &&
"Invalid Shuffle Vector Operand");		"Invalid Shuffle Vector Operand");
reorderAltShuffleOperands(S.Opcode, E->Scalars, LHSVL, RHSVL);		reorderAltShuffleOperands(S.Opcode, E->Scalars, LHSVL, RHSVL);
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

Value *LHS = vectorizeTree(LHSVL);		Value *LHS = vectorizeTree(LHSVL, 0, CurrIndx);
Value *RHS = vectorizeTree(RHSVL);		Value *RHS = vectorizeTree(RHSVL, 1, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

// Create a vector of LHS op1 RHS		// Create a vector of LHS op1 RHS
Value *V0 = Builder.CreateBinOp(		Value *V0 = Builder.CreateBinOp(
static_cast<Instruction::BinaryOps>(S.Opcode), LHS, RHS);		static_cast<Instruction::BinaryOps>(S.Opcode), LHS, RHS);

[... 83 lines not shown ...]	for (const auto &ExternalUse : ExternalUses) {
llvm::User *User = ExternalUse.User;		llvm::User *User = ExternalUse.User;

// Skip users that we already RAUW. This happens when one instruction		// Skip users that we already RAUW. This happens when one instruction
// has multiple uses of the same value.		// has multiple uses of the same value.
if (User && !is_contained(Scalar->users(), User))		if (User && !is_contained(Scalar->users(), User))
continue;		continue;
TreeEntry *E = getTreeEntry(Scalar);		TreeEntry *E = getTreeEntry(Scalar);
assert(E && "Invalid scalar");		assert(E && "Invalid scalar");
assert(!E->NeedToGather && "Extracting from a gather list");		assert((!E->NeedToGather) && "Extracting from a gather list");

Value *Vec = E->VectorizedValue;		Value *Vec = dyn_cast<ShuffleVectorInst>(E->VectorizedValue);
		if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {
		Vec = cast<Instruction>(E->VectorizedValue)->getOperand(0);
		} else {
		Vec = E->VectorizedValue;
		}
assert(Vec && "Can't find vectorizable value");		assert(Vec && "Can't find vectorizable value");

Value *Lane = Builder.getInt32(ExternalUse.Lane);		Value *Lane = Builder.getInt32(ExternalUse.Lane);
// If User == nullptr, the Scalar is used as extra arg. Generate		// If User == nullptr, the Scalar is used as extra arg. Generate
// ExtractElement instruction and update the record for this scalar in		// ExtractElement instruction and update the record for this scalar in
// ExternallyUsedValues.		// ExternallyUsedValues.
if (!User) {		if (!User) {
assert(ExternallyUsedValues.count(Scalar) &&		assert(ExternallyUsedValues.count(Scalar) &&
[... 2,790 lines not shown ...]
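
Illustrative note on the Load case above: when the user tree entry carries a non-empty ShuffleMask for the operand, the patch emits one wide load and then a shufflevector whose mask restores the original use order. The standalone sketch below shows one way such a mask could be derived from element offsets. It is not code from this patch: `computeJumbledMask` and `isIdentityMask` are hypothetical helpers, and it assumes the offsets are already known as integers relative to a common base pointer (in the patch that information comes from the pointer analysis, not from plain integers).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of D36130). Offsets[i] is the element offset
// read by the i-th scalar load, in use order. On success, Mask[i] is the lane
// of a single wide load (starting at the smallest offset) that holds the value
// needed in lane i, i.e. the shufflevector mask. Returns false if the offsets
// do not cover a contiguous range exactly once.
bool computeJumbledMask(const std::vector<int64_t> &Offsets,
                        std::vector<unsigned> &Mask) {
  Mask.clear();
  if (Offsets.empty())
    return false;

  const int64_t Base = *std::min_element(Offsets.begin(), Offsets.end());
  const std::size_t N = Offsets.size();
  std::vector<bool> Seen(N, false);

  for (int64_t Off : Offsets) {
    // Lane is non-negative because Base is the minimum offset.
    const std::size_t Lane = static_cast<std::size_t>(Off - Base);
    if (Lane >= N || Seen[Lane]) {
      Mask.clear();
      return false; // Gap or duplicate: not a permutation of one wide load.
    }
    Seen[Lane] = true;
    Mask.push_back(static_cast<unsigned>(Lane));
  }
  return true;
}

// Hypothetical helper: an identity mask means the loads were already in
// consecutive order, so no shuffle would be needed.
bool isIdentityMask(const std::vector<unsigned> &Mask) {
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != I)
      return false;
  return true;
}
```

For a use order with offsets 1, 2, 3, 0 this sketch yields the mask 1, 2, 3, 0, which has the same form as the `<i32 1, i32 2, i32 3, i32 0>` mask expected by the updated CHECK lines in jumbled-load-multiuse.ll.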

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 2), align 4			; CHECK-NEXT: [[TMP2:%.*]] = icmp sgt <4 x i32> [[TMP1]], zeroinitializer
	; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 3), align 4			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP2]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP3]], i32 2			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP0]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 8, i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = icmp sgt <4 x i32> [[TMP7]], zeroinitializer			; CHECK-NEXT: [[TMP8:%.*]] = select <4 x i1> [[TMP2]], <4 x i32> [[TMP7]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1			; CHECK-NEXT: store <4 x i32> [[TMP8]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 8, i32 3
	; CHECK-NEXT: [[TMP12:%.*]] = select <4 x i1> [[TMP8]], <4 x i32> [[TMP11]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4
	Show All 13 Lines

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s


				;void jumble (int * restrict A, int * restrict B) {
				; int tmp0 = A[10]*A[0];
				; int tmp1 = A[11]*A[1];
				; int tmp2 = A[12]*A[3];
				; int tmp3 = A[13]*A[2];
				; B[0] = tmp0;
				; B[1] = tmp1;
				; B[2] = tmp2;
				; B[3] = tmp3;
				;}


				; Function Attrs: norecurse nounwind uwtable
				define void @jumble1(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
				; CHECK-LABEL: @jumble1(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
				; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
				; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
				; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
				; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
				; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
				; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
				; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP1]], [[TMP4]]
				; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
				; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
				; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
				; CHECK-NEXT: ret void
				;
				entry:
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
				%0 = load i32, i32* %arrayidx, align 4
				%1 = load i32, i32* %A, align 4
				%mul = mul nsw i32 %0, %1
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
				%2 = load i32, i32* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds i32, i32* %A, i64 1
				%3 = load i32, i32* %arrayidx3, align 4
				%mul4 = mul nsw i32 %2, %3
				%arrayidx5 = getelementptr inbounds i32, i32* %A, i64 12
				%4 = load i32, i32* %arrayidx5, align 4
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i64 3
				%5 = load i32, i32* %arrayidx6, align 4
				%mul7 = mul nsw i32 %4, %5
				%arrayidx8 = getelementptr inbounds i32, i32* %A, i64 13
				%6 = load i32, i32* %arrayidx8, align 4
				%arrayidx9 = getelementptr inbounds i32, i32* %A, i64 2
				%7 = load i32, i32* %arrayidx9, align 4
				%mul10 = mul nsw i32 %6, %7
				store i32 %mul, i32* %B, align 4
				%arrayidx12 = getelementptr inbounds i32, i32* %B, i64 1
				store i32 %mul4, i32* %arrayidx12, align 4
				%arrayidx13 = getelementptr inbounds i32, i32* %B, i64 2
				store i32 %mul7, i32* %arrayidx13, align 4
				%arrayidx14 = getelementptr inbounds i32, i32* %B, i64 3
				store i32 %mul10, i32* %arrayidx14, align 4
				ret void
				}

				;Reversing the operand of MUL
				; Function Attrs: norecurse nounwind uwtable
				define void @jumble2(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
				; CHECK-LABEL: @jumble2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
				; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
				; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
				; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
				; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
				; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
				; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
				; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP1]]
				; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
				; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
				; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
				; CHECK-NEXT: ret void
				;
				entry:
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
				%0 = load i32, i32* %arrayidx, align 4
				%1 = load i32, i32* %A, align 4
				%mul = mul nsw i32 %1, %0
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
				%2 = load i32, i32* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds i32, i32* %A, i64 1
				%3 = load i32, i32* %arrayidx3, align 4
				%mul4 = mul nsw i32 %3, %2
				%arrayidx5 = getelementptr inbounds i32, i32* %A, i64 12
				%4 = load i32, i32* %arrayidx5, align 4
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i64 3
				%5 = load i32, i32* %arrayidx6, align 4
				%mul7 = mul nsw i32 %5, %4
				%arrayidx8 = getelementptr inbounds i32, i32* %A, i64 13
				%6 = load i32, i32* %arrayidx8, align 4
				%arrayidx9 = getelementptr inbounds i32, i32* %A, i64 2
				%7 = load i32, i32* %arrayidx9, align 4
				%mul10 = mul nsw i32 %7, %6
				store i32 %mul, i32* %B, align 4
				%arrayidx12 = getelementptr inbounds i32, i32* %B, i64 1
				store i32 %mul4, i32* %arrayidx12, align 4
				%arrayidx13 = getelementptr inbounds i32, i32* %B, i64 2
				store i32 %mul7, i32* %arrayidx13, align 4
				%arrayidx14 = getelementptr inbounds i32, i32* %B, i64 3
				store i32 %mul10, i32* %arrayidx14, align 4
				ret void
				}
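
Both functions above load `A[0]`, `A[1]`, `A[3]`, `A[2]` in that order, and the CHECK lines expect one contiguous `<4 x i32>` load followed by a `shufflevector` with mask `<0, 1, 3, 2>`. The sketch below shows one way such a mask can be derived from the element offsets of the scalar loads. `jumbledLoadMask` is a hypothetical helper written for illustration and is not the code added by this patch.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: Offsets holds the element offsets of the scalar loads
// in the order they are used. If the offsets cover one consecutive range with
// no duplicates, return the mask M such that lane I of the shuffled vector is
// lane M[I] of a contiguous load starting at the smallest offset.
std::optional<std::vector<int>>
jumbledLoadMask(const std::vector<std::int64_t> &Offsets) {
  if (Offsets.empty())
    return std::nullopt;
  const std::int64_t Base = *std::min_element(Offsets.begin(), Offsets.end());
  const std::int64_t N = static_cast<std::int64_t>(Offsets.size());
  std::vector<bool> Seen(Offsets.size(), false);
  std::vector<int> Mask;
  Mask.reserve(Offsets.size());
  for (std::int64_t Off : Offsets) {
    const std::int64_t Lane = Off - Base; // non-negative, since Base is the minimum
    if (Lane >= N || Seen[static_cast<std::size_t>(Lane)])
      return std::nullopt; // gap or duplicate: not a jumbled consecutive group
    Seen[static_cast<std::size_t>(Lane)] = true;
    Mask.push_back(static_cast<int>(Lane));
  }
  return Mask;
}
```

Under this model the second operand of each multiply, offsets `{0, 1, 3, 2}`, yields `<0, 1, 3, 2>`, while the first operand, offsets `{10, 11, 12, 13}`, yields the identity mask, which is consistent with the expected output containing no shuffle for that load.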

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s

				;void phiUsingLoads(int *restrict A, int *restrict B) {
				; int tmp0, tmp1, tmp2, tmp3;
				; for (int i = 0; i < 100; i++) {
				; if (A[0] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 2];
				; tmp3 = A[i + 3];
				; } else if (A[25] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 2];
				; tmp3 = A[i + 3];
				; } else if (A[50] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 2];
				; tmp3 = A[i + 3];
				; } else if (A[75] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 3];
				; tmp3 = A[i + 2];
				; }
				; }
				; B[0] = tmp0;
				; B[1] = tmp1;
				; B[2] = tmp2;
				; B[3] = tmp3;
				;}


				; Function Attrs: norecurse nounwind uwtable
				define void @phiUsingLoads(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) local_unnamed_addr #0 {
				; CHECK-LABEL: @phiUsingLoads(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[A:%.*]], align 4
				; CHECK-NEXT: [[CMP1:%.*]] = icmp eq i32 [[TMP0]], 0
				; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 25
				; CHECK-NEXT: [[ARRAYIDX28:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 50
				; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 75
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
				; CHECK-NEXT: [[ARRAYIDX65:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
				; CHECK-NEXT: [[ARRAYIDX66:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[B]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP27:%.*]], <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: ret void
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]
				; CHECK-NEXT: [[TMP2:%.*]] = phi <4 x i32> [ undef, [[ENTRY]] ], [ [[TMP27]], [[FOR_INC]] ]
				; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP3]]
				; CHECK-NEXT: [[TMP4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP4]]
				; CHECK-NEXT: [[TMP5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX11:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP5]]
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[ARRAYIDX2]] to <4 x i32>*
				; CHECK-NEXT: [[TMP7:%.*]] = load <4 x i32>, <4 x i32>* [[TMP6]], align 4
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: if.else:
				; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX12]], align 4
				; CHECK-NEXT: [[CMP13:%.*]] = icmp eq i32 [[TMP8]], 0
				; CHECK-NEXT: br i1 [[CMP13]], label [[IF_THEN14:%.*]], label [[IF_ELSE27:%.*]]
				; CHECK: if.then14:
				; CHECK-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP9:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP9]]
				; CHECK-NEXT: [[TMP10:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX23:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP10]]
				; CHECK-NEXT: [[TMP11:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX26:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP11]]
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[ARRAYIDX17]] to <4 x i32>*
				; CHECK-NEXT: [[TMP13:%.*]] = load <4 x i32>, <4 x i32>* [[TMP12]], align 4
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: if.else27:
				; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX28]], align 4
				; CHECK-NEXT: [[CMP29:%.*]] = icmp eq i32 [[TMP14]], 0
				; CHECK-NEXT: br i1 [[CMP29]], label [[IF_THEN30:%.*]], label [[IF_ELSE43:%.*]]
				; CHECK: if.then30:
				; CHECK-NEXT: [[ARRAYIDX33:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP15:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX36:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP15]]
				; CHECK-NEXT: [[TMP16:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX39:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX42:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP17]]
				; CHECK-NEXT: [[TMP18:%.*]] = bitcast i32* [[ARRAYIDX33]] to <4 x i32>*
				; CHECK-NEXT: [[TMP19:%.*]] = load <4 x i32>, <4 x i32>* [[TMP18]], align 4
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: if.else43:
				; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX44]], align 4
				; CHECK-NEXT: [[CMP45:%.*]] = icmp eq i32 [[TMP20]], 0
				; CHECK-NEXT: br i1 [[CMP45]], label [[IF_THEN46:%.*]], label [[FOR_INC]]
				; CHECK: if.then46:
				; CHECK-NEXT: [[ARRAYIDX49:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX52:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP21]]
				; CHECK-NEXT: [[TMP22:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX55:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP22]]
				; CHECK-NEXT: [[TMP23:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX58:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP23]]
				; CHECK-NEXT: [[TMP24:%.*]] = bitcast i32* [[ARRAYIDX49]] to <4 x i32>*
				; CHECK-NEXT: [[TMP25:%.*]] = load <4 x i32>, <4 x i32>* [[TMP24]], align 4
				; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <4 x i32> [[TMP25]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: for.inc:
				; CHECK-NEXT: [[TMP27]] = phi <4 x i32> [ [[TMP7]], [[IF_THEN]] ], [ [[TMP13]], [[IF_THEN14]] ], [ [[TMP19]], [[IF_THEN30]] ], [ [[TMP26]], [[IF_THEN46]] ], [ [[TMP2]], [[IF_ELSE43]] ]
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
				;
				entry:
				%0 = load i32, i32* %A, align 4
				%cmp1 = icmp eq i32 %0, 0
				%arrayidx12 = getelementptr inbounds i32, i32* %A, i64 25
				%arrayidx28 = getelementptr inbounds i32, i32* %A, i64 50
				%arrayidx44 = getelementptr inbounds i32, i32* %A, i64 75
				br label %for.body

				for.cond.cleanup: ; preds = %for.inc
				store i32 %tmp0.1, i32* %B, align 4
				%arrayidx64 = getelementptr inbounds i32, i32* %B, i64 1
				store i32 %tmp1.1, i32* %arrayidx64, align 4
				%arrayidx65 = getelementptr inbounds i32, i32* %B, i64 2
				store i32 %tmp2.1, i32* %arrayidx65, align 4
				%arrayidx66 = getelementptr inbounds i32, i32* %B, i64 3
				store i32 %tmp3.1, i32* %arrayidx66, align 4
				ret void

				for.body: ; preds = %for.inc, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
				%tmp3.0111 = phi i32 [ undef, %entry ], [ %tmp3.1, %for.inc ]
				%tmp2.0110 = phi i32 [ undef, %entry ], [ %tmp2.1, %for.inc ]
				%tmp1.0109 = phi i32 [ undef, %entry ], [ %tmp1.1, %for.inc ]
				%tmp0.0108 = phi i32 [ undef, %entry ], [ %tmp0.1, %for.inc ]
				br i1 %cmp1, label %if.then, label %if.else

				if.then: ; preds = %for.body
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%2 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx5 = getelementptr inbounds i32, i32* %A, i64 %2
				%3 = load i32, i32* %arrayidx5, align 4
				%4 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx8 = getelementptr inbounds i32, i32* %A, i64 %4
				%5 = load i32, i32* %arrayidx8, align 4
				%6 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx11 = getelementptr inbounds i32, i32* %A, i64 %6
				%7 = load i32, i32* %arrayidx11, align 4
				br label %for.inc

				if.else: ; preds = %for.body
				%8 = load i32, i32* %arrayidx12, align 4
				%cmp13 = icmp eq i32 %8, 0
				br i1 %cmp13, label %if.then14, label %if.else27

				if.then14: ; preds = %if.else
				%arrayidx17 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%9 = load i32, i32* %arrayidx17, align 4
				%10 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx20 = getelementptr inbounds i32, i32* %A, i64 %10
				%11 = load i32, i32* %arrayidx20, align 4
				%12 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx23 = getelementptr inbounds i32, i32* %A, i64 %12
				%13 = load i32, i32* %arrayidx23, align 4
				%14 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx26 = getelementptr inbounds i32, i32* %A, i64 %14
				%15 = load i32, i32* %arrayidx26, align 4
				br label %for.inc

				if.else27: ; preds = %if.else
				%16 = load i32, i32* %arrayidx28, align 4
				%cmp29 = icmp eq i32 %16, 0
				br i1 %cmp29, label %if.then30, label %if.else43

				if.then30: ; preds = %if.else27
				%arrayidx33 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%17 = load i32, i32* %arrayidx33, align 4
				%18 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx36 = getelementptr inbounds i32, i32* %A, i64 %18
				%19 = load i32, i32* %arrayidx36, align 4
				%20 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx39 = getelementptr inbounds i32, i32* %A, i64 %20
				%21 = load i32, i32* %arrayidx39, align 4
				%22 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx42 = getelementptr inbounds i32, i32* %A, i64 %22
				%23 = load i32, i32* %arrayidx42, align 4
				br label %for.inc

				if.else43: ; preds = %if.else27
				%24 = load i32, i32* %arrayidx44, align 4
				%cmp45 = icmp eq i32 %24, 0
				br i1 %cmp45, label %if.then46, label %for.inc

				if.then46: ; preds = %if.else43
				%arrayidx49 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%25 = load i32, i32* %arrayidx49, align 4
				%26 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx52 = getelementptr inbounds i32, i32* %A, i64 %26
				%27 = load i32, i32* %arrayidx52, align 4
				%28 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx55 = getelementptr inbounds i32, i32* %A, i64 %28
				%29 = load i32, i32* %arrayidx55, align 4
				%30 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx58 = getelementptr inbounds i32, i32* %A, i64 %30
				%31 = load i32, i32* %arrayidx58, align 4
				br label %for.inc

				for.inc: ; preds = %if.then, %if.then30, %if.else43, %if.then46, %if.then14
				%tmp0.1 = phi i32 [ %1, %if.then ], [ %9, %if.then14 ], [ %17, %if.then30 ], [ %25, %if.then46 ], [ %tmp0.0108, %if.else43 ]
				%tmp1.1 = phi i32 [ %3, %if.then ], [ %11, %if.then14 ], [ %19, %if.then30 ], [ %27, %if.then46 ], [ %tmp1.0109, %if.else43 ]
				%tmp2.1 = phi i32 [ %5, %if.then ], [ %13, %if.then14 ], [ %21, %if.then30 ], [ %29, %if.then46 ], [ %tmp2.0110, %if.else43 ]
				%tmp3.1 = phi i32 [ %7, %if.then ], [ %15, %if.then14 ], [ %23, %if.then30 ], [ %31, %if.then46 ], [ %tmp3.0111, %if.else43 ]
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 100
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}
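
In this test only the last branch (`if.then46`) reads `A[i + 3]` before `A[i + 2]`, and the expected output places the `<0, 1, 3, 2>` shuffle in that block, ahead of the vector phi in `for.inc`. The sketch below models the `shufflevector` semantics on plain arrays and checks that a contiguous load plus shuffle produces the same four values as the scalar branch. The names `Vec4`, `applyShuffle`, `jumbledBranchScalar`, and `jumbledBranchVector` are assumptions made for this illustration.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<int, 4>;

// Models a single-source shufflevector: Result[I] = Src[Mask[I]].
Vec4 applyShuffle(const Vec4 &Src, const std::array<std::size_t, 4> &Mask) {
  Vec4 Result{};
  for (std::size_t I = 0; I < 4; ++I)
    Result[I] = Src[Mask[I]];
  return Result;
}

// Scalar form of the last branch in the C comment: tmp2 and tmp3 are swapped.
Vec4 jumbledBranchScalar(const int *A, std::size_t I) {
  return {A[I + 0], A[I + 1], A[I + 3], A[I + 2]};
}

// Vector form: one contiguous 4-wide load, then the <0, 1, 3, 2> shuffle.
Vec4 jumbledBranchVector(const int *A, std::size_t I) {
  const Vec4 Loaded = {A[I + 0], A[I + 1], A[I + 2], A[I + 3]};
  return applyShuffle(Loaded, {0, 1, 3, 2});
}
```

The other three branches read the elements in memory order, so in this model they correspond to the identity mask and need no shuffle before reaching the phi.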

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4			; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
				; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2			; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	Show All 25 Lines

llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_1]], [[LOAD_5]]			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_6]]			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_3]], [[LOAD_7]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_4]], [[LOAD_8]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_9]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_7]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_10]], align 4
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	Show All 25 Lines
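
In store-jumbled.ll the loads are in memory order but the four scalar stores on the left-hand side go to `out[2]`, `out[0]`, `out[3]`, `out[1]`. The right-hand side expects both loaded vectors to be shuffled with `<1, 3, 0, 2>` before a single vector store. One way to read that mask is as the inverse of the store-destination permutation, as in the sketch below. `invertPermutation` is a hypothetical helper for illustration, and it assumes its input is a valid permutation of `0..N-1`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: Dest[I] is the memory lane that the I-th scalar result
// is stored to. The returned mask M satisfies M[Dest[I]] = I, so lane J of the
// vector to be stored takes the scalar result whose destination is J.
// Assumes Dest is a valid permutation of 0..Dest.size()-1.
std::vector<int> invertPermutation(const std::vector<int> &Dest) {
  std::vector<int> Mask(Dest.size(), -1);
  for (std::size_t I = 0; I < Dest.size(); ++I)
    Mask[static_cast<std::size_t>(Dest[I])] = static_cast<int>(I);
  return Mask;
}
```

For the destinations `{2, 0, 3, 1}` this gives `{1, 3, 0, 2}`, the same lane order that appears in the expected `shufflevector` masks.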


Revision Contents

Diff 223515
