This is an archive of the discontinued LLVM Phabricator instance.

Differential D36130

[SLP] Vectorize jumbled memory loads.
AcceptedPublic

Authored by • ashahid on Aug 1 2017, 12:31 AM.

Details

Reviewers

mkuper
loladiro
Ayal
zvi
danielcdh
ABataev

Commits

rGdbd30edb7ff8: [SLP] Vectorize jumbled memory loads.
rG1d5422f27f60: [SLP] Vectorize jumbled memory loads.
rG2b281de5769e: [SLP] Vectorize jumbled memory loads.
rGf8db9bd85791: [SLP] Vectorize jumbled memory loads.
rL320548: [SLP] Vectorize jumbled memory loads.
rL314806: [SLP] Vectorize jumbled memory loads.
rL313771: [SLP] Vectorize jumbled memory loads.
rL313736: [SLP] Vectorize jumbled memory loads.

Summary

This patch tries to vectorize loads from consecutive memory locations that are
accessed in a non-consecutive, or jumbled, order. An earlier attempt, patch D26905,
was reverted because of a basic issue with representing the 'use mask' of
jumbled accesses.

This patch fixes the mask representation by recording the 'use mask' in the user tree entry.

Change-Id: I9fe7f5045f065d84c126fa307ef6ebe0787296df

Diff Detail

Repository

rL LLVM

Build Status

Buildable 15472
Build 15472: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden.

Patch update for fixing build bot failure:

This fix changes the placeholder for the shuffle mask from a fixed array of 3 elements
to a std::map. The need arises because a PHI node can have
any number of operands as incoming values.

Test performed:
LLVM lit test, 3 stage bootstrap build and LNT (Thanks to Hans and Daniel)

Harbormaster completed remote builds in B12741: Diff 125470.Dec 4 2017, 9:17 PM

In D36130#944703, @ashahid wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Good catch. Add a LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
724	The fixed array SmallVector<unsigned, 4> ShuffleMask[3]; of the previous version indeed cannot account for all operands. How about holding a SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask; instead of holding a map from 0,1,2,..,numOperands ?
749	Are both conditions really needed, or suffice say to check for -1 and assert positive indices are not too large?
3258	May be simpler to check instead ShuffleMask.count(OpdNum)
3290	clang-format

In D36130#945306, @hans wrote:

In D36130#944703, @ashahid wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Thanks Hans for triage.

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in few of LNT Multisource bench mark. How to extract it for LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
724	I think this can be done. I will try.
749	Sure I will check. I am thinking 30000 as large indices threshold, do you have any number in mind?
3258	Quite right.

• ashahid added inline comments.Dec 8 2017, 8:12 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
749	I tried, but it seems both conditions are needed, as I am getting the assertion "Idx < size()" for SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask.

Updated the review comments.

Herald added a subscriber: mgrang. Dec 9 2017, 12:50 AM

Minor commented code clean up done.

In D36130#946236, @ashahid wrote:

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in few of LNT Multisource bench mark. How to extract it for LIT test?

Suffice to have a phi with 4 predecessors, where (at-least) the 4th needs a shuffle-mask.

lib/Transforms/Vectorize/SLPVectorizer.cpp
749	UserTreeIdx is the index of the User entry as we build the tree bottom-up, so it should always be between 0 and VectorizableTree.size()-1, except for -1 when creating the new entry for the root, which is User-less. So it should suffice to check if Idx is -1, and otherwise assert that Idx < size(), if desired, right?
750–751	Code below still uses emplace_back contrary to the discussion above. May need to call UserTreeEntry->ShuffleMask.resize() if OpdNum is larger than its initial/current size, before setting UserTreeEntry->ShuffleMask[OpdNum] = tempMask. (Otherwise the original "LNT Multisource bench mark" asserts should trigger again?) Suggest to add a test where the first operand does not need a shuffle but the second one does.
2903	See above discussion about replacing second condition with an assert.
3251	ditto
test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗	(On Diff #126266)	Why add -debug?
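
The `emplace_back` concern raised above can be illustrated with a minimal sketch. It uses `std::vector` as a stand-in for LLVM's `SmallVector`, and `UserEntry`, `setShuffleMask` and `hasMaskForOperand` are hypothetical names chosen for illustration, not the patch's actual code. Growing the outer vector and then indexing by operand number keeps each mask in its operand's slot, whereas appending would put the mask of operand 1 into slot 0 whenever operand 0 needs no shuffle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tree entry that stores one shuffle mask per
// operand of the user instruction (std::vector instead of llvm::SmallVector).
struct UserEntry {
  std::vector<std::vector<unsigned>> ShuffleMask;
};

// Store Mask in the slot for operand OpdNum, growing the outer vector first
// so that the index is always in range.
inline void setShuffleMask(UserEntry &E, std::size_t OpdNum,
                           const std::vector<unsigned> &Mask) {
  if (OpdNum >= E.ShuffleMask.size())
    E.ShuffleMask.resize(OpdNum + 1);
  assert(E.ShuffleMask[OpdNum].empty() && "mask already recorded");
  E.ShuffleMask[OpdNum] = Mask;
}

// True if operand OpdNum has a recorded (non-empty) shuffle mask.
inline bool hasMaskForOperand(const UserEntry &E, std::size_t OpdNum) {
  return OpdNum < E.ShuffleMask.size() && !E.ShuffleMask[OpdNum].empty();
}
```

With this layout, recording a mask only for the second operand leaves the first operand's slot present but empty, which is the situation the suggested LIT test is meant to exercise.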

Review comments updated and added lit tests.

Harbormaster completed remote builds in B12974: Diff 126374.Dec 11 2017, 8:34 AM

• ashahid added inline comments.Dec 11 2017, 8:41 AM

test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗	(On Diff #126266)	My bad, not intended.

This looks good to me, with a couple of last minor fixes.

Hope it stays in this time...

lib/Transforms/Vectorize/SLPVectorizer.cpp
756	alrea[d]y
3295	Can simply do `for (unsigned Entry : ShuffleMask[OpdNum])` instead of iterating explicitly over all lanes and retrieving each `UserTreeEntry->ShuffleMask[OpdNum][Lane]`.
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
32	Suggested to also have a test where the 2nd operand is a shuffle but the 1st one isn't, which will fail if shuffles are added using emplace_back().

Updated test and review comment.

Bootstrap and LNT test underway.

• ashahid closed this revision.Dec 12 2017, 7:09 PM

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

In D36130#955158, @eastig wrote:

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

A problem report: https://bugs.llvm.org/show_bug.cgi?id=35673

eastig mentioned this in D41324: [SLPVectorizer] Add shuffle instruction cost for jumbled load.Dec 18 2017, 4:11 AM

sanjoy added a subscriber: sanjoy.Dec 19 2017, 4:03 PM

sanjoy added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1125	This should be a `cast<>`.
1153	LLVM style is to avoid using curly braces on single-line for loops. Using `std::iota` would be even better.
lib/Transforms/Vectorize/SLPVectorizer.cpp
757	I think you should be able to do: auto &OperandMask = UserTreeEntry->ShuffleMask[OpdNum]; assert(OperandMask.empty()); OperandMask.insert(OperandMask.end(), ShuffleMask.begin(), ShuffleMask.end());
1649	Not sure why you need `NewVL` here -- doesn't just using `Sorted` work?
3257	Might be cleaner to abstract `(unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() && !UserTreeEntry->ShuffleMask[OpdNum].empty()` into a `UserTreeEntry->hasShuffleMaskForOp(Index)` helper.
3296	The cast to `Value *` should not be necessary.
3529	`dyn_cast<XXX>(f)->g()` should never be necessary. Either the `dyn_cast` can return null in which case you should check for that, or it can't and you should use `cast<>`. Also the cast of `Vec` to `Instruction` seems unnecessary: `ShuffleVectorInst` is an `Instruction`.
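
The `std::iota` suggestion above can be sketched as follows. The loop at line 1153 is not shown in this archive, so this only assumes it fills a container with the sequence 0, 1, 2, ...; `identityMask` is a hypothetical helper name and `std::vector` stands in for `SmallVector`.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Build the identity order <0, 1, ..., N-1> with std::iota instead of a
// hand-written counting loop.
inline std::vector<unsigned> identityMask(unsigned N) {
  std::vector<unsigned> Mask(N);
  std::iota(Mask.begin(), Mask.end(), 0u);
  return Mask;
}
```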

Ayal added inline comments.Dec 21 2017, 3:25 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
757	While we're at it, this should move under the `if (UserTreeIdx != -1)` to avoid checking if `&VectorizableTree[UserTreeIdx]` is null, as commented in https://reviews.llvm.org/D41324#inline-361435
1658	Should probably also check here that UserTreeIdx is not -1, to avoid creating a mask for the root with no place to hang it, as @sanjoy observed.

• ashahid added inline comments.Dec 22 2017, 6:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
757	If we check for if (UserTreeIdx != -1 && ShuffledLoad) before the call of newTreeEntry(), we can avoid "UserTreeIdx != -1" check completely inside newTreeEntry().
1658	Yes, I had planned to do exactly this.

• ashahid reopened this revision.Dec 28 2017, 11:04 PM

• ashahid marked 8 inline comments as done.

• ashahid added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3529	Here I am trying to ensure that the instructions are "ShuffleVectorInst" and "LoadInst" respectively. The cast of Vec to Instruction is there so that getOperand() can be called; without it the compiler reports an error.

This revision is now accepted and ready to land.Dec 28 2017, 11:04 PM

Updates review comments.

Regression test and LNT passes, 3 stage bootstrap test underway.

Ayal added inline comments.Dec 29 2017, 7:31 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp

3529

Use isa instead of dyn_cast here:
if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {

or alternatively do something like:

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updated Ayal's comment accordingly

• ashahid marked an inline comment as done.Jan 1 2018, 8:01 AM

Ping!

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

In D36130#973399, @ashahid wrote:

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

Ah, right, sorry, missed it.

This looks good to me, with only minor comments about the testcase.

Please see that @sanjoy approves too, as this mostly addresses issues he raised.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
28	"SINK" is defined redundantly, as it is not used. Could this be simplified by removing the float-to-int casts? In general, it may suffice to check that there's no load of <4 x i32>, which would be jumbled. Checking that two of the lanes have been vectorized may be fragile, in case a modified cost model will decide it ain't worth it.

sanjoy accepted this revision.Jan 13 2018, 2:40 PM

sanjoy added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1113	The indent looks off here; can you please run clang-format?
lib/Transforms/Vectorize/SLPVectorizer.cpp
721	Optional: you can write `return X;` instead of `if (X) return true; return false;`.
1656	Nit: s/usefull/useful/
3528–3533	I think you can rewrite this more cleanly using an immediately-invoked function expression: Value *Vec = [&]() { if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) if (auto *LI = dyn_cast<LoadInst>(SVI->getOperand(0))) return LI->getOperand(0); return E->VectorizedValue; }();

• ashahid marked an inline comment as not done.Jan 15 2018, 9:01 PM

• ashahid added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3528–3533	I tried this IIFE, however I am getting an assertion "Tried to create extractelement operation on non-vector type!" for jumbled-load-multiuse.ll test. Do you see any issue in this code?

sanjoy added inline comments.Jan 15 2018, 10:54 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
3528–3533	Yes, I think I should have written: Value *Vec = [&]() { if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) if (isa<LoadInst>(SVI->getOperand(0))) return SVI->getOperand(0); return E->VectorizedValue; }();
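
The difference between the two lambdas can be shown with a small self-contained sketch. The types below are hypothetical stand-ins for `llvm::Value`, `LoadInst` and `ShuffleVectorInst`, and `dynamic_cast` is used in place of LLVM's `dyn_cast`/`isa`, so this illustrates the pattern only. The corrected form returns the load itself (operand 0 of the shuffle) rather than the load's own operand 0, which is a pointer and not a vector.

```cpp
#include <cassert>

// Hypothetical stand-ins for llvm::Value, LoadInst and ShuffleVectorInst.
struct Value {
  virtual ~Value() = default;
};
struct Load : Value {
  Value *Pointer = nullptr; // operand 0 of a load is its address, not a vector
};
struct Shuffle : Value {
  Value *Source = nullptr; // operand 0 of the shuffle
};

// If V is a shuffle fed directly by a load, return that load; otherwise
// return V unchanged. The lambda is invoked immediately to initialize Vec.
inline Value *underlyingVector(Value *V) {
  Value *Vec = [&]() -> Value * {
    if (auto *S = dynamic_cast<Shuffle *>(V))
      if (dynamic_cast<Load *>(S->Source))
        return S->Source; // the load itself, not the load's pointer operand
    return V;
  }();
  return Vec;
}
```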

Ayal added inline comments.Jan 15 2018, 11:49 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp

3528–3533

Yes, this simplifies the below "alternatively do something like:"

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updates test case and stylistic review comments

Herald added a subscriber: llvm-commits. Jan 16 2018, 8:51 AM

Ping!

Hi Ayal, Sanjoy,

The last update's review was pending for long. Off late, SLP has lots of changes so I will have to rebase but before rebasing please see if any more changes required in its current form.

Thanks in advance.

RKSimon added a reviewer: ABataev.Feb 10 2018, 8:56 AM

In D36130#1004306, @ashahid wrote:

Hi Ayal, Sanjoy,

The last update's review was pending for long. Off late, SLP has lots of changes so I will have to rebase but before rebasing please see if any more changes required in its current form.

Thanks in advance.

This looks good to me, as commented earlier, but please see that @sanjoy approves too, as this mostly addresses issues he raised.

I don't have any more coding style comments. I've not reviewed the actual semantic changes.

lib/Analysis/LoopAccessAnalysis.cpp
1166	Can you use `std::iota` here?

ABataev added inline comments.Feb 12 2018, 8:04 AM

lib/Analysis/LoopAccessAnalysis.cpp
1112	This function can be used for stores also, it is better to make it universal for stores/loads.
1156	It is better to use `stable_sort` rather than `sort`
1169	`stable_sort`
lib/Transforms/Vectorize/SLPVectorizer.cpp
1644	Is it possible at all that `VL` has less than 4 elements here?
1649	`i`->`I`, `e`->`E`. Variables must have Camel-like names.
2279–2285	You don't need so many shuffles, it is enough just to have just one.

ABataev added inline comments.Feb 12 2018, 8:04 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
718–724	Why you can't have just one shuffle here for all external uses?
1660–1661	Bad decision. It is better to use original `VL` here, rather than `Sorted` and add an additional array of sorted indices. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.
3528–3533	I think you can have default capture by value here rather than by reference.

Hi Alexey,

As I was trying to rebase this patch, it seems this overlaps with your "reverse load" patch. Could you take a look in this patch?

courbet added a subscriber: courbet.Feb 13 2018, 2:12 AM

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

lib/Analysis/LoopAccessAnalysis.cpp
1112	I plan to do such improvement in separate patches.
lib/Transforms/Vectorize/SLPVectorizer.cpp
718–724	This is for in-tree multi uses of a single vector load where the uses has different masks/permutation. This section of comment https://reviews.llvm.org/D36130#inline-326711 discussed it earlier. Also there is figure attached.
1644	I think yes, for example a couple of i64 loads considering minimum register width as 128-bit. However, this check here was basically meant to indicate jumbled loads of size 2 is essentially a reversed load.
1660–1661	In fact earlier design in patch (https://reviews.llvm.org/D26905) was to use original VL, however there was counter argument to that which I don't remember exactly.
2279–2285	This is basically for multiple in-tree uses with different masks/permutation.

In D36130#1006202, @ashahid wrote:

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

lib/Analysis/LoopAccessAnalysis.cpp
1112	I just suggest to make universal at the very beginning, that's it
lib/Transforms/Vectorize/SLPVectorizer.cpp
718–724	I still don't understand what's the problem here. You need to perform the loads in some order. You sort the loads to be in the sequentially direct order and perform the vector load starting from the lowest address. You reshuffle the loaded vector value to the original order. That's it, you have your loads in the required order. Just one shuffle is required. Why do you need some more? Also, I don't understand why do you need so many changes, why do you need additional indices etc.
1644	It is going to be handled by the reverse loads patch
1660–1661	It is better to use original `VL` here, otherwise it will end with a lot of troubles and will require the whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes across the whole module to handle jumbled loads.
3258	Is this correct? `E->Scalars[0]` is exactly `VL0`
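
The scheme described in the comment on lines 718–724 (sort the loads, issue one vector load from the lowest address, then reshuffle to the original order) can be emulated on plain arrays. This is only a sketch of the idea under the assumption of a 4-wide vector of `int`; `vectorLoad` and `shuffle` are hypothetical helpers, not LLVM APIs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Emulate a 4-wide vector load of consecutive elements starting at Base.
inline std::array<int, 4> vectorLoad(const int *Base) {
  return {Base[0], Base[1], Base[2], Base[3]};
}

// Emulate a single-source shuffle: lane I of the result takes lane Mask[I]
// of Vec.
inline std::array<int, 4> shuffle(const std::array<int, 4> &Vec,
                                  const std::array<unsigned, 4> &Mask) {
  std::array<int, 4> Result{};
  for (std::size_t I = 0; I < Mask.size(); ++I)
    Result[I] = Vec[Mask[I]];
  return Result;
}
```

For scalar loads used in the order a[2], a[0], a[1], a[3], one `vectorLoad` from `&a[0]` followed by one `shuffle` with mask <2,0,1,3> yields the values in use order.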

Updates review comments and a test case.

Harbormaster completed remote builds in B14963: Diff 134170.Feb 14 2018, 1:38 AM

Minor clean up.

In D36130#1006238, @ABataev wrote:

In D36130#1006202, @ashahid wrote:

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

No worry it was a merge issue, its fixed.

lib/Transforms/Vectorize/SLPVectorizer.cpp
718–724	Updated jumbled-load.ll captures this case where instead of gathering the second operand of MUL we can have required shuffle of the same loaded vector
1644	Yes, this check no more required.
1660–1661	In the context where we can have multiple user of loaded vector with different shuffle mask, the design is to represent these different shuffle mask for each user corresponding to the user's operand number. Having single sorted indices will not be sufficient for this. Given the objective of handling multiple out of order uses changes are not that big I feel.
3258	Ah, both are same.

ABataev added inline comments.Feb 14 2018, 6:50 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660–1661	Now I see what do you want to do. But I don't think that this the correct way to implement it. It complicates the whole vectorization process. I'd suggest to create different tree entries for each particular order of the loads and exclude loads from the check that the same instruction is used several times in different tree entries. If you worry about several different loads of the same values, I think they will be optimized by instruction combiner.

• ashahid added inline comments.Feb 16 2018, 9:46 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660–1661	Of course this could have been a better solution, but I was not sure of the impact it may have by breaking the single tree entry assumption. One problem I see is the TreeEntry lookup if multiple nodes with the same scalar values are present. I can use an isSame() check to make sure the correct tree entry is found; however, it may become costly in the case of a PHI instruction fed by the same vector load.

ABataev added inline comments.Feb 16 2018, 10:29 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660–1661	I think it is better to start with handling of single tree entry rather than trying to handle all possible situations in a single patch. I suggest to split this patch into 2 parts at least: 1. handling of tree entry with jumbled loads. 2. further improvements.
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
8–11	These checks are not autogenerated, fix it. Moreover, it is recommended to commit these tests separately with the checks for the original version of the compiler and the update checks with the fixed version to demonstrate improvements.

Updated the patch to accommodate the review comments.

Harbormaster completed remote builds in B15472: Diff 136070.Feb 27 2018, 6:29 AM

As suggested, now the reordering mask will be part of each tree entry. Also this update does not consider to optimize the reordered load for multiple operand for now.

By the way, take a look at my D43776 that does the same but in more general way

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644	Why you can do this only if `ReuseShuffleIndicies.empty()`?
1649–1654	It is enough just to compare `VL` and `Sorted`. If they are the same, the loads are not shuffled
1657	Why you can't do to add vectorized tree entry if `UserTreeIdx == -1`?
1660	Each `true` or `false` argument must be preceded by a comment with the name of the function parameter that the argument corresponds to.
2279	You can remove the last argument here
2899	Why do you need this condition?
3251	Restore the original code here
3287	Remove this empty line
3528–3533	I rather doubt you need all that stuff. You can use original code
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
1	You need to add this test separately and show changes in it.
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
1	You need to add this test separately and show changes in it
test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll
1	You need to add this test separately and show changes in it
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
64	You need to add this test separately and show changes in it

Will commit the tests as NFC.

Seems like I am not getting the mails from phabricator, what shall I do to get the mails?

Checked the patch D43776, seems it will make this patch redundant.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644	This is to avoid the overlapping the UniqueValues reuse logic of your changes.
1649–1654	Sure it is, but this avoids the compare. So I thought having a boolean is preferable.
1657	My bad, this is not required.
1660	Ok
2279	Sure
2899	In the 2nd test of jumbled-load.ll the two operands of MUL is fed from the same loaded vector. The 1st operand is SHUFFLE of LOAD and the 2nd operand is the gather of the same scalar loads. Query to getTreeEntry() will always return the node with the same vectorized value and hence both the operand of MUL will be fed the shuffled load. This check is to avoid this scenario.
3251	Thanks
3528–3533	This is required otherwise multiuse.ll test as well as PR32086.ll will fail because the lanes were recorded according to the order of scalar loads.

Updated further review comments.

Harbormaster completed remote builds in B15525: Diff 136311.Feb 28 2018, 9:30 AM

Hope this is fine.

ABataev added inline comments.Feb 28 2018, 9:49 AM

lib/Analysis/LoopAccessAnalysis.cpp
1112	What about this comment? Do you really need Sorted argument?
1125	`PointerType `->`auto `
1129–1131	I think there must be an assertion instead of this check.
1141	`const SCEVConstant `->`const auto `
1146–1148	This check better to move to SLPVectorizer.cpp, because the function can be used for masked load/store.
1161	`for (unsigned I = 0, E = VL.size(); I < E; ++I)`
1166	Actually `Mask` is a full copy of `UseOrder`, you don't need all that complex stuff here
lib/Transforms/Vectorize/SLPVectorizer.cpp
1644	Why you can't handle it? What's the problem?
1649–1654	Why do we need the compare?
2899	This scenario should not happen in your patch: the instruction is either vectorized or gathered, but not both.
3528–3533	Again, it just may not happen in this patch

ABataev added inline comments.Feb 28 2018, 11:07 AM

lib/Analysis/LoopAccessAnalysis.cpp
1166	Oops, no, `Mask` is not a copy of `UseOrder` But you can create it much simpler: for (unsigned I = 0, E = VL.size(); I < E; ++I) Mask[UseOrder[I]] = I;
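
The loop above inverts the sorted order into a shuffle mask. A minimal standalone sketch follows, assuming each access is reduced to an integer offset from a common base and using `std::vector` in place of LLVM containers; `computeUseMask` is a hypothetical name. It follows the header comment's example, where program order a[i+2], a[i+0], a[i+1], a[i+3] gives the mask <2,0,1,3>, and it uses `std::stable_sort` as suggested earlier in the review.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Given the address offset of each access in program (use) order, compute
// the shuffle mask that restores program order from the sorted order.
inline std::vector<unsigned>
computeUseMask(const std::vector<std::int64_t> &Offsets) {
  // After sorting, UseOrder[K] is the program-order index of the access
  // with the K-th lowest offset.
  std::vector<unsigned> UseOrder(Offsets.size());
  std::iota(UseOrder.begin(), UseOrder.end(), 0u);
  std::stable_sort(
      UseOrder.begin(), UseOrder.end(),
      [&](unsigned L, unsigned R) { return Offsets[L] < Offsets[R]; });

  // Invert the permutation: program-order lane UseOrder[I] reads sorted
  // lane I.
  std::vector<unsigned> Mask(Offsets.size());
  for (unsigned I = 0, E = static_cast<unsigned>(Offsets.size()); I < E; ++I)
    Mask[UseOrder[I]] = I;
  return Mask;
}
```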

sanjoy removed a reviewer: sanjoy.Feb 28 2018, 11:34 AM

sanjoy removed a subscriber: sanjoy.

• ashahid added inline comments.Feb 28 2018, 11:30 PM

lib/Analysis/LoopAccessAnalysis.cpp
1112	Yes, otherwise my test fails. Seems it breaks some assumption.
1166	Thanks
lib/Transforms/Vectorize/SLPVectorizer.cpp
1644	It was a thought,I have not checked yet. I will check.
1649–1654	I meant, if we dont use ShuffledLoad flag we have to compare VL vs Sorted instead.
2899	This check is to avoid feeding the generated SHUFFLE to both operand of MUL which is not the intention of the test case.
3528–3533	It does happen and this test fails.

ABataev added inline comments.Mar 2 2018, 10:59 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660	No, use original `VL` here, do not use `Sorted`. In this case you won't need an additional argument in `sortLoadAccesses` and you don't need all that complex stuff with the lambda on line 3528

• ashahid added inline comments.Mar 5 2018, 10:39 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660	If I am not wrong, for LOADs, VL0 must be the 1st element of the buffer whose base address will be used for vector load. So using VL will break this assumption.

ABataev added inline comments.Mar 6 2018, 6:18 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660	Why? And why you can't choose the right VL0 during vectorization?

• ashahid added inline comments.Mar 6 2018, 8:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660	For example, if we have two arrays A[4] and B[1] laid out one after another in memory, and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use, the generated vector load will load the elements A[1], A[2], A[3], B[0], which is not desired. Of course we can choose the right VL0 during vectorization, but we would have to compute it again here using the mask, which can be avoided if we use the Sorted VL. Am I missing something?

ABataev added inline comments.Mar 6 2018, 8:42 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660	You already store the mask in the tree entry and you can choose the right VL0 using this mask. Using Sorted VL complicates the whole vectorization process and, thus, adds some extra points for the incorrect vectorization. That's why I insist to use original VL and choose the correct VL0 during codegen.
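
Choosing VL0 from the stored mask can be sketched as below. It assumes the mask convention of the header comment's example, where `Mask[I]` is the position of the I-th access (in use order) within the address-sorted sequence; `findBaseLane` is a hypothetical helper, not part of the patch. The lane whose mask entry is 0 is then the lowest-address load, whose pointer would serve as the base of the vector load.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Return the position in the use-ordered list of the lowest-address access,
// i.e. the lane whose mask entry is 0. Assumes Mask is a permutation of
// 0..N-1 with Mask[I] being the sorted position of the I-th access.
inline std::size_t findBaseLane(const std::vector<unsigned> &Mask) {
  auto It = std::find(Mask.begin(), Mask.end(), 0u);
  assert(It != Mask.end() && "mask must contain lane 0");
  return static_cast<std::size_t>(It - Mask.begin());
}
```

For the loads A[1], A[2], A[0], A[3] in order of use, the mask under this convention is <1,2,0,3>, so the base lane is 2, which is the load of A[0].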

• ashahid added inline comments.Mar 6 2018, 9:08 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1660	Got it. Since you already have these improvements in this patch https://reviews.llvm.org/D43776 , I think it is better to get that through.

fhahn mentioned this in D37738: [SLPVectorizer] Generalize vectorizeStores to support loads as well NFC. .Mar 22 2018, 10:50 AM

fhahn mentioned this in D37737: [SLPVectorizer] Merge subsequent gather loads..

@ashahid What's happening to this patch?

Closed by commit rGdbd30edb7ff8: [SLP] Vectorize jumbled memory loads. (authored by • ashahid). Oct 7 2019, 5:02 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Oct 7 2019, 5:02 AM

Herald added a subscriber: hiraditya.

RKSimon reopened this revision.Oct 7 2019, 6:08 AM

This revision is now accepted and ready to land.Oct 7 2019, 6:08 AM

Revision Contents

Path	Size
include/llvm/Analysis/LoopAccessAnalysis.h	15 lines
lib/Analysis/LoopAccessAnalysis.cpp	69 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp	58 lines
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll	27 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll	23 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll	125 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll	225 lines
test/Transforms/SLPVectorizer/X86/jumbled-load.ll	88 lines
test/Transforms/SLPVectorizer/X86/store-jumbled.ll	25 lines

Diff 136070

include/llvm/Analysis/LoopAccessAnalysis.h

	/// If necessary this method will version the stride of the pointer according
	/// to \p PtrToStride and therefore add further predicates to \p PSE.
	/// The \p Assume parameter indicates if we are allowed to make additional
	/// run-time assumptions.
	int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value *Ptr, const Loop *Lp,
	                     const ValueToValueMap &StridesMap = ValueToValueMap(),
	                     bool Assume = false, bool ShouldCheckWrap = true);

				/// \brief Attempt to sort the 'loads' in \p VL and return the sorted values in
				Ayal: Document what the method does, including its boolean return value, before indicating what happens when \p Mask is not null. An example of VL coming in and Sorted plus Mask coming out would be useful.
				ashahid: Sure.
				/// \p Sorted.
				///
				/// Returns 'false' if sorting is not legal or feasible, otherwise returns
				/// 'true'. If \p Mask is not null, it also returns the \p Mask which is the
				/// shuffle mask for actual memory access order.
				///
				/// For example, for a given VL of memory accesses in program order, a[i+2],
				/// a[i+0], a[i+1] and a[i+3], this function will sort the VL and save the
				/// sorted value in 'Sorted' as a[i+0], a[i+1], a[i+2], a[i+3] and saves the
				/// mask for actual memory accesses in program order in 'Mask' as <2,0,1,3>
				bool sortLoadAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
				ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted,
				SmallVectorImpl<unsigned> *Mask = nullptr);

	/// \brief Returns true if the memory operations \p A and \p B are consecutive.
	/// This is a simple API that does not depend on the analysis pass.
	bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
	                         ScalarEvolution &SE, bool CheckType = true);

	/// \brief This analysis provides dependence information for the memory accesses
	/// of a loop.
	///

lib/Analysis/LoopAccessAnalysis.cpp

	static unsigned getAddressSpaceOperand(Value *I) {
	  if (LoadInst *L = dyn_cast<LoadInst>(I))
	    return L->getPointerAddressSpace();
	  if (StoreInst *S = dyn_cast<StoreInst>(I))
	    return S->getPointerAddressSpace();
	  return -1;
	}

				// TODO:This API can be improved by using the permutation of given width as the
				// accesses are entered into the map.
				bool llvm::sortLoadAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
				ABataev: This function can be used for stores also; it is better to make it universal for stores/loads.
				ashahid: I plan to make such improvements in separate patches.
				ABataev: I just suggest making it universal from the very beginning, that's it.
				ABataev: What about this comment? Do you really need the Sorted argument?
				ashahid: Yes, otherwise my test fails. It seems to break some assumption.
				ScalarEvolution &SE,
				sanjoy: The indent looks off here; can you please run clang-format?
				SmallVectorImpl<Value *> &Sorted,
				SmallVectorImpl<unsigned> *Mask) {
				SmallVector<std::pair<int64_t, Value *>, 4> OffValPairs;
				OffValPairs.reserve(VL.size());
				Sorted.reserve(VL.size());

				// Walk over the pointers, and map each of them to an offset relative to
				// first pointer in the array.
				Value *Ptr0 = getPointerOperand(VL[0]);
				const SCEV *Scev0 = SE.getSCEV(Ptr0);
				Value *Obj0 = GetUnderlyingObject(Ptr0, DL);
				PointerType *PtrTy = cast<PointerType>(Ptr0->getType());
				Ayal: More accurate name for the method may be sortLoadAccesses()?
				ashahid: Thought to keep it generic but as of now sortLoadAccesses() seems more appropriate.
				sanjoy (Done): This should be a `cast<>`.
				ABataev: `PointerType *`->`auto *`
				uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType());

				for (auto *Val : VL) {
				// The only kind of access we care about here is load.
				if (!isa<LoadInst>(Val))
				return false;
				ABataev: I think there must be an assertion instead of this check.

				Value *Ptr = getPointerOperand(Val);
				Ayal: LoopVectorizer's analogous "analyzeInterleaving()" also guards this getMinusSCEV() by: `// Ignore A if the memory object of A and B don't belong to the same address space` `if (getMemInstAddressSpace(A) != getMemInstAddressSpace(B)) continue;`
				ashahid: Thinking of using this API for now.
				assert(Ptr && "Expected value to have a pointer operand.");
				// If a pointer refers to a different underlying object, bail - the
				// pointers are by definition incomparable.
				Value *CurrObj = GetUnderlyingObject(Ptr, DL);
				if (CurrObj != Obj0)
				return false;

				const SCEVConstant *Diff =
				ABataev: `const SCEVConstant *`->`const auto *`
				dyn_cast<SCEVConstant>(SE.getMinusSCEV(SE.getSCEV(Ptr), Scev0));
				// The pointers may not have a constant offset from each other, or SCEV
				// may just not be smart enough to figure out they do. Regardless,
				// there's nothing we can do.
				Ayal: We can bail out here if |Diff| >= VL, right?
				ashahid: Sure.
				if (!Diff || static_cast<unsigned>(Diff->getAPInt().abs().getSExtValue()) >
				(VL.size() - 1) * Size)
				return false;
				ABataev: It would be better to move this check to SLPVectorizer.cpp, because the function can be used for masked load/store.

				OffValPairs.emplace_back(Diff->getAPInt().getSExtValue(), Val);
				}
				SmallVector<unsigned, 4> UseOrder(VL.size());
				std::iota(UseOrder.begin(), UseOrder.end(), 0);
				sanjoy (Done): LLVM style is to avoid using curly braces on single-line for loops. Using `std::iota` would be even better.

				// Sort the memory accesses and keep the order of their uses in UseOrder.
				std::stable_sort(UseOrder.begin(), UseOrder.end(),
				ABataev: It is better to use `stable_sort` rather than `sort`
				[&OffValPairs](unsigned Left, unsigned Right) {
				return OffValPairs[Left].first < OffValPairs[Right].first;
				});

				for (unsigned i = 0; i < VL.size(); i++)
				ABataev: `for (unsigned I = 0, E = VL.size(); I < E; ++I)`
				Sorted.emplace_back(OffValPairs[UseOrder[i]].second);

				// Sort UseOrder to compute the Mask.
				if (Mask) {
				Mask->reserve(VL.size());
				sanjoy: Can you use `std::iota` here?
				ABataev: Actually `Mask` is a full copy of `UseOrder`, you don't need all that complex stuff here
				ABataev: Oops, no, `Mask` is not a copy of `UseOrder`. But you can create it much more simply: `for (unsigned I = 0, E = VL.size(); I < E; ++I) Mask[UseOrder[I]] = I;`
				ashahid: Thanks
				for (unsigned i = 0; i < VL.size(); i++)
				Mask->emplace_back(i);
				std::stable_sort(Mask->begin(), Mask->end(),
				ABataev: `stable_sort`
				[&UseOrder](unsigned Left, unsigned Right) {
				return UseOrder[Left] < UseOrder[Right];
				});
				}

				return true;
				}
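
The two ways of deriving `Mask` from `UseOrder` discussed in the comments above, the second `stable_sort` used in this revision and the inverse-permutation loop suggested in review, can be compared in a small standalone sketch. It assumes `UseOrder` is a permutation of `0..N-1`; `maskBySort` and `maskByInverse` are hypothetical names. Note that the suggested loop writes by index, so in the patch `Mask` would have to be resized rather than only reserved before it runs.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Mask as computed in this revision: sort lane indices by their UseOrder
// value.
std::vector<unsigned> maskBySort(const std::vector<unsigned> &UseOrder) {
  std::vector<unsigned> Mask(UseOrder.size());
  std::iota(Mask.begin(), Mask.end(), 0u);
  std::stable_sort(Mask.begin(), Mask.end(),
                   [&UseOrder](unsigned L, unsigned R) {
                     return UseOrder[L] < UseOrder[R];
                   });
  return Mask;
}

// Mask as suggested in review: scatter each position I to the lane that
// UseOrder[I] names, i.e. build the inverse permutation directly.
std::vector<unsigned> maskByInverse(const std::vector<unsigned> &UseOrder) {
  std::vector<unsigned> Mask(UseOrder.size());
  for (unsigned I = 0, E = UseOrder.size(); I < E; ++I) {
    assert(UseOrder[I] < E && "UseOrder must be a permutation of 0..E-1");
    Mask[UseOrder[I]] = I;
  }
  return Mask;
}
```

Both functions return the inverse permutation of `UseOrder`; the loop does it in linear time without a comparator.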


	/// Returns true if the memory operations \p A and \p B are consecutive.			/// Returns true if the memory operations \p A and \p B are consecutive.
	bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,			bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
	ScalarEvolution &SE, bool CheckType) {			ScalarEvolution &SE, bool CheckType) {
	Value *PtrA = getPointerOperand(A);			Value *PtrA = getPointerOperand(A);
	Value *PtrB = getPointerOperand(B);			Value *PtrB = getPointerOperand(B);
	unsigned ASA = getAddressSpaceOperand(A);			unsigned ASA = getAddressSpaceOperand(A);
	unsigned ASB = getAddressSpaceOperand(B);			unsigned ASB = getAddressSpaceOperand(B);

	[... 1,187 lines hidden ...]

lib/Transforms/Vectorize/SLPVectorizer.cpp

[... 644 lines hidden ...]	private:

/// Checks if all users of \p I are the part of the vectorization tree.		/// Checks if all users of \p I are the part of the vectorization tree.
bool areAllUsersVectorized(Instruction *I) const;		bool areAllUsersVectorized(Instruction *I) const;

/// \returns the cost of the vectorizable entry.		/// \returns the cost of the vectorizable entry.
int getEntryCost(TreeEntry *E);		int getEntryCost(TreeEntry *E);

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);
		Ayal: While you're at it, please add a variable name for that other int argument as well (UserTreeIdx), for completeness.
		ashahid: Sure

/// \returns True if the ExtractElement/ExtractValue instructions in VL can		/// \returns True if the ExtractElement/ExtractValue instructions in VL can
/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).		/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;		bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree.
Value *vectorizeTree(TreeEntry *E);		Value *vectorizeTree(TreeEntry *E);
		Ayal: Please document additional parameters.
		ashahid: Sure

/// Vectorize a single entry in the tree, starting in \p VL.		/// Vectorize a single entry in the tree, starting in \p VL.
		Ayal: In other words, loosely speaking, `E == TreeEntry[UserIndx].getOperand(OpdNum)`, right?
		ashahid: Yes, that's right.
		Ayal: Could be used to help clarify the explanation.
Value *vectorizeTree(ArrayRef<Value *> VL);		Value *vectorizeTree(ArrayRef<Value *> VL);
		Ayal: ditto.

/// \returns the scalarization cost for this type. Scalarization in this		/// \returns the scalarization cost for this type. Scalarization in this
/// context means the creation of vectors from a group of scalars.		/// context means the creation of vectors from a group of scalars.
int getGatherCost(Type *Ty, const DenseSet<unsigned> &ShuffledIndices);		int getGatherCost(Type *Ty, const DenseSet<unsigned> &ShuffledIndices);

/// \returns the scalarization cost for this list of values. Assuming that		/// \returns the scalarization cost for this list of values. Assuming that
/// this subtree gets vectorized, we may need to extract the values from the		/// this subtree gets vectorized, we may need to extract the values from the
/// roots. This method calculates the cost of extracting the values.		/// roots. This method calculates the cost of extracting the values.
[... 38 lines hidden ...]	struct TreeEntry {
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
bool NeedToGather = false;		bool NeedToGather = false;

		/// Records optional shuffle mask for the uses of jumbled memory accesses.
		SmallVector<unsigned, 4> JumbleShuffleIndices;

/// Does this sequence require some shuffling?		/// Does this sequence require some shuffling?
		sanjoy: Optional: you can write `return X;` instead of `if (X) return true; return false;`.
SmallVector<unsigned, 4> ReuseShuffleIndices;		SmallVector<unsigned, 4> ReuseShuffleIndices;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
		Ayal: s[h]uffle. An example would help clarify that, say, a non-empty ShuffleMask[1] represents the permutation of lanes that operand #1 should undergo before feeding this vectorized instruction, whereas an empty ShuffleMask[0] indicates that the lanes of operand #0 need not be permuted at all.
		ashahid: Sure
		Ayal: The fixed array `SmallVector<unsigned, 4> ShuffleMask[3];` of the previous version indeed cannot account for all operands. How about holding a `SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask;` instead of holding a map from 0,1,2,..,numOperands?
		ashahid: I think this can be done. I will try.
		ABataev: Why can't you have just one shuffle here for all external uses?
		ashahid: This is for in-tree multiple uses of a single vector load where the uses have different masks/permutations. This section of comments, https://reviews.llvm.org/D36130#inline-326711, discussed it earlier. There is also a figure attached.
		ABataev: I still don't understand what's the problem here. 1. You need to perform the loads in some order. 2. You sort the loads to be in sequentially direct order and perform the vector load starting from the lowest address. 3. You reshuffle the loaded vector value to the original order. That's it, you have your loads in the required order. Just one shuffle is required. Why do you need more? Also, I don't understand why you need so many changes, why you need additional indices, etc.
		ashahid: The updated jumbled-load.ll captures this case: instead of gathering the second operand of the MUL, we can use the required shuffle of the same loaded vector.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
std::vector<TreeEntry> &Container;		std::vector<TreeEntry> &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<int, 1> UserTreeIndices;		SmallVector<int, 1> UserTreeIndices;
};		};

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
void newTreeEntry(ArrayRef<Value *> VL, bool Vectorized, int &UserTreeIdx,		void newTreeEntry(ArrayRef<Value *> VL, bool Vectorized, int &UserTreeIdx,
ArrayRef<unsigned> ReuseShuffleIndices = None) {		ArrayRef<unsigned> ReuseShuffleIndices = None,
		ArrayRef<unsigned> ShuffleMask = None) {
VectorizableTree.emplace_back(VectorizableTree);		VectorizableTree.emplace_back(VectorizableTree);

int idx = VectorizableTree.size() - 1;		int idx = VectorizableTree.size() - 1;
TreeEntry *Last = &VectorizableTree[idx];		TreeEntry *Last = &VectorizableTree[idx];
		Ayal: Why change the original "TreeEntry *Last =" here?
		ashahid: Nothing specific, will make it as earlier.
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->NeedToGather = !Vectorized;		Last->NeedToGather = !Vectorized;
Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),		Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
ReuseShuffleIndices.end());		ReuseShuffleIndices.end());
		Last->JumbleShuffleIndices.append(ShuffleMask.begin(), ShuffleMask.end());
		Ayal: Are both conditions really needed, or does it suffice, say, to check for -1 and assert that positive indices are not too large?
		ashahid: Sure, I will check. I am thinking of 30000 as the threshold for large indices; do you have any number in mind?
		ashahid: I tried, but it seems both conditions are needed, as I am getting the assertion "Idx < size()" for `SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask`.
		Ayal (Done): UserTreeIdx is the index of the User entry as we build the tree bottom-up, so it should always be between 0 and VectorizableTree.size()-1, except for -1 when creating the new entry for the root, which is User-less. So it should suffice to check if Idx is -1, and otherwise assert that Idx < size(), if desired, right?

if (Vectorized) {		if (Vectorized) {
		Ayal: Should ShuffleMask be inserted into UserEntry's ShuffleMask in position OpdNum? (Possibly asserting no other mask is already there?) Otherwise, where is OpdNum used?
		ashahid: Good catch! The intention is exactly that, and the order of building the tree ensures it. Do you want it to be explicit here? Anyway, it does need an assertion for the emptiness of the mask before insertion.
		Ayal: Either be explicit, or assert that `emplace_back` inserts at position `OpdNum`, based on the assumption that the order of building the tree ensures that, which should be documented (e.g., in the form of the assert message).
		Ayal: So if the first operand does not need a shuffle but the second one does, will ShuffleMask.emplace_back() place the shuffle in the right position, namely that of OpdNum=1 rather than OpdNum=0?
		ashahid: You are right, in this case this assumption will break. So OpdNum needs to be explicitly used while inserting the shuffle mask.
		Ayal (Done): Code below still uses emplace_back contrary to the discussion above. May need to call UserTreeEntry->ShuffleMask.resize() if OpdNum is larger than its initial/current size, before setting UserTreeEntry->ShuffleMask[OpdNum] = tempMask. (Otherwise the original "LNT Multisource benchmark" asserts should trigger again?) Suggest adding a test where the first operand does not need a shuffle but the second one does.
for (int i = 0, e = VL.size(); i != e; ++i) {		for (int i = 0, e = VL.size(); i != e; ++i) {
assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");		assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");
ScalarToTreeEntry[VL[i]] = idx;		ScalarToTreeEntry[VL[i]] = idx;
}		}
} else {		} else {
		Ayal (Done): alrea[d]y
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
		sanjoy (Done): I think you should be able to do: `auto &OperandMask = UserTreeEntry->ShuffleMask[OpdNum]; assert(OperandMask.empty()); OperandMask.insert(OperandMask.end(), ShuffleMask.begin(), ShuffleMask.end());`
		Ayal (Done): While we're at it, this should move under the `if (UserTreeIdx != -1)` to avoid checking if `&VectorizableTree[UserTreeIdx]` is null, as commented in https://reviews.llvm.org/D41324#inline-361435
		ashahid: If we check for `if (UserTreeIdx != -1 && ShuffledLoad)` before the call of newTreeEntry(), we can avoid the "UserTreeIdx != -1" check completely inside newTreeEntry().
}		}

if (UserTreeIdx >= 0)		if (UserTreeIdx >= 0)
Last->UserTreeIndices.push_back(UserTreeIdx);		Last->UserTreeIndices.push_back(UserTreeIdx);
UserTreeIdx = idx;		UserTreeIdx = idx;
}		}
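
The `emplace_back` versus `OpdNum` issue raised in the inline comments concerns an earlier revision that kept one mask per operand on the user entry. A minimal standalone sketch of that concern follows, assuming `std::vector` in place of `SmallVector`; `UserEntry` and `setOperandMask` are hypothetical names, not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<unsigned>;

// Hypothetical per-operand mask table; an empty entry means "no shuffle".
struct UserEntry {
  std::vector<Mask> ShuffleMask;
};

// Stores M at position OpdNum, growing the table first if it is too short.
// Appending with emplace_back instead would put the mask at the next free
// slot, which is only correct if every earlier operand also had a mask.
void setOperandMask(UserEntry &User, std::size_t OpdNum, const Mask &M) {
  if (User.ShuffleMask.size() <= OpdNum)
    User.ShuffleMask.resize(OpdNum + 1);
  assert(User.ShuffleMask[OpdNum].empty() && "operand already has a mask");
  User.ShuffleMask[OpdNum] = M;
}
```

With a mask only for operand 1, `setOperandMask` leaves slot 0 empty and fills slot 1, whereas a bare `emplace_back` would have placed the same mask in slot 0.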

/// -- Vectorization State --		/// -- Vectorization State --
[... 522 lines hidden ...]	template <> struct GraphTraits<BoUpSLP *> {
static unsigned size(BoUpSLP *R) { return R->VectorizableTree.size(); }		static unsigned size(BoUpSLP *R) { return R->VectorizableTree.size(); }
};		};

template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {		template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {
using TreeEntry = BoUpSLP::TreeEntry;		using TreeEntry = BoUpSLP::TreeEntry;

DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}		DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}

std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {		std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {
		Ayal: Would be good to include (non-empty) ShuffleMasks when dumping the tree, for debugging?
		ashahid: Sure.
std::string Str;		std::string Str;
raw_string_ostream OS(Str);		raw_string_ostream OS(Str);
if (isSplat(Entry->Scalars)) {		if (isSplat(Entry->Scalars)) {
OS << "<splat> " << *Entry->Scalars[0];		OS << "<splat> " << *Entry->Scalars[0];
return Str;		return Str;
}		}
for (auto V : Entry->Scalars) {		for (auto V : Entry->Scalars) {
OS << *V;		OS << *V;
[... 296 lines hidden ...]	case Instruction::Load: {
if (!L->isSimple()) {		if (!L->isSimple()) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
}		}

// Check if the loads are consecutive, reversed, or neither.		// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.
bool Consecutive = true;		bool Consecutive = true;
		Ayal: Remove this TODO :-)
bool ReverseConsecutive = true;		bool ReverseConsecutive = true;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
Consecutive = false;		Consecutive = false;
break;		break;
} else {		} else {
ReverseConsecutive = false;		ReverseConsecutive = false;
}		}
[... 9 lines hidden ...]	case Instruction::Load: {
// If none of the load pairs were consecutive when checked in order,		// If none of the load pairs were consecutive when checked in order,
// check the reverse order.		// check the reverse order.
if (ReverseConsecutive)		if (ReverseConsecutive)
for (unsigned i = VL.size() - 1; i > 0; --i)		for (unsigned i = VL.size() - 1; i > 0; --i)
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;		ReverseConsecutive = false;
break;		break;
}		}

		Ayal: Consider checking `if (ReverseConsecutive)` here and exit early.
		ashahid: Sure, will consider.
if (ReverseConsecutive) {		if (ReverseConsecutive) {
--NumOpsWantToKeepOrder[S.Opcode];		--NumOpsWantToKeepOrder[S.Opcode];
newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, true, UserTreeIdx, ReuseShuffleIndicies);
DEBUG(dbgs() << "SLP: added a vector of reversed loads.\n");		DEBUG(dbgs() << "SLP: added a vector of reversed loads.\n");
		Ayal: `ReverseConsecutive` is a special case of `ShuffledLoads`; so should the above treatment of a reverse load be the same as that of a shuffled load below? I.e., generate a true TreeEntry here with a reverse mask, and avoid cancel scheduling?
		ashahid: Ah, yes, you are correct. In fact I gave it a try initially, but I faced some issue that I am unable to recall now. I will try again and see what the problem is; it may have something to do with rebuilding the tree with reversed scalar inputs.
		ashahid: I tried to incorporate it; however, there were regressions, as I mentioned earlier. I think it is better if we take it in a separate patch.
return;		return;
		Ayal: It would have been good to also record how many loads want to have an arbitrary shuffled order, and shuffle according to the majority; but it's admittedly harder than recording how many want the reversed order. Maybe worth a comment.
		ashahid: I could not follow "shuffle according to the majority"; would you please elaborate?
		Ayal: `NumLoadsWantToChangeOrder` is used to decide if the entire tree `shouldReorder()`, based on how many want to keep the order vs. how many want to change=reverse it (majority). My comment was that this would ideally extend to pick the most frequent order from among more possible orders than {original, reverse}. `AllowReorder`, however, restricts reordering to 2-element vectors only, where only these two orders exist. This relates to the existing `// TODO: check if we can allow reordering for more cases.`
}		}

		if (ReuseShuffleIndicies.empty()) {
		ABataev: Is it possible at all that `VL` has less than 4 elements here?
		ashahid: I think yes, for example a couple of i64 loads considering minimum register width as 128-bit. However, this check here was basically meant to indicate that a jumbled load of size 2 is essentially a reversed load.
		ABataev: It is going to be handled by the reverse loads patch
		ashahid: Yes, this check is no longer required.
		ABataev: Why can you do this only if `ReuseShuffleIndicies.empty()`?
		ashahid: This is to avoid overlapping with the UniqueValues reuse logic of your changes.
		ABataev: Why can't you handle it? What's the problem?
		ashahid: It was a thought; I have not checked yet. I will check.
		bool ShuffledLoads = true;
		SmallVector<Value *, 8> Sorted;
		SmallVector<unsigned, 4> Mask;
		if (sortLoadAccesses(VL, DL, SE, Sorted, &Mask)) {
		for (unsigned I = 0, E = Sorted.size() - 1; I < E; ++I) {
		sanjoy (Done): Not sure why you need `NewVL` here -- doesn't just using `Sorted` work?
		ABataev: `i`->`I`, `e`->`E`. Variables must have Camel-like names.
		if (!isConsecutiveAccess(Sorted[I], Sorted[I + 1], DL, SE)) {
		ShuffledLoads = false;
		Ayal: Worthy of a `DEBUG(dbgs() << "...")` message here.
		ashahid: Sure
		break;
		}
		}
		ABataev: It is enough just to compare `VL` and `Sorted`. If they are the same, the loads are not shuffled
		ashahid: Sure it is, but this avoids the compare. So I thought having a boolean is preferable.
		ABataev: Why do we need the compare?
		ashahid: I meant, if we don't use the ShuffledLoad flag we have to compare VL vs Sorted instead.
		// TODO: Tracking how many load wants to have arbitrary shuffled order
		// would be useful.
		sanjoy: Nit: s/usefull/useful/
		if (ShuffledLoads && UserTreeIdx != -1) {
		ABataev: Why can't you add a vectorized tree entry if `UserTreeIdx == -1`?
		ashahid: My bad, this is not required.
		DEBUG(dbgs() << "SLP: added a vector of loads which needs "
		Ayal: Should probably also check here that `UserTreeIdx` is not -1, to avoid creating a mask for the root with no place to hang it, as @sanjoy observed.
		ashahid: Yes, I had planned to do exactly this.
		"permutation of loaded lanes.\n");
		newTreeEntry(Sorted, true, UserTreeIdx, ReuseShuffleIndicies, Mask);
		ABataev: Each `true` or `false` argument must be preceded by a comment naming the function parameter it corresponds to.
		ashahid: Ok.
		ABataev: No, use the original `VL` here, not `Sorted`. Then you won't need an additional argument in `sortLoadAccesses`, and you won't need all that complex stuff with the lambda on line 3528.
		ashahid: If I am not wrong, for loads, `VL0` must be the first element of the buffer, whose base address will be used for the vector load. Using `VL` would break this assumption.
		ABataev: Why? And why can't you choose the right `VL0` during vectorization?
		ashahid: For example, suppose two arrays A[4] and B[1] lie one after another in memory, and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use. The generated vector load will then load the elements A[1], A[2], A[3], B[1], which is not desired. Of course we can choose the right `VL0` during vectorization, but we would have to compute it again here using the mask, which can be avoided if we use the sorted VL. Am I missing something?
		ABataev: You already store the mask in the tree entry, and you can choose the right `VL0` using this mask. Using the sorted VL complicates the whole vectorization process and thus adds more opportunities for incorrect vectorization. That's why I insist on using the original VL and choosing the correct `VL0` during codegen.
		ashahid: Got it. Since you already have these improvements in https://reviews.llvm.org/D43776, I think it is better to get that through.
		return;
		ABataev: Bad decision. It is better to use the original `VL` here rather than `Sorted`, and to add an additional array of sorted indices. Then you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.
		ashahid: In fact, the earlier design in patch https://reviews.llvm.org/D26905 was to use the original VL; however, there was a counterargument to that which I don't remember exactly.
		ABataev: It is better to use the original `VL` here; otherwise it will cause a lot of trouble and will require a whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes across the whole module to handle jumbled loads.
		ashahid: Where a loaded vector can have multiple users with different shuffle masks, the design is to represent a separate shuffle mask for each user, keyed by the user's operand number. A single array of sorted indices will not be sufficient for this. Given the objective of handling multiple out-of-order uses, I feel the changes are not that big.
		ABataev: Now I see what you want to do, but I don't think this is the correct way to implement it. It complicates the whole vectorization process. I'd suggest creating different tree entries for each particular order of the loads and excluding loads from the check that the same instruction is used several times in different tree entries. If you are worried about several different loads of the same values, I think they will be optimized by the instruction combiner.
		ashahid: Of course this could have been a better solution, but I was not sure of the impact of breaking the single-tree-entry assumption. One problem I see is the TreeEntry lookup when multiple nodes with the same scalar values are present. I can use an `isSame()` check to make sure the correct tree entry is found; however, it may become costly in the case of a PHI instruction fed by the same vector load.
		ABataev: I think it is better to start with handling a single tree entry rather than trying to handle all possible situations in a single patch. I suggest splitting this patch into at least two parts: 1. handling of a tree entry with jumbled loads; 2. further improvements.
		}
		}
		}
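
The added block above sorts the scalar loads by address, checks that the sorted loads are consecutive, and records a mask that restores the order of use. A minimal standalone sketch of that idea follows. It is not the patch's code: `ScalarLoad`, `jumbledMask`, and `applyMask` are hypothetical names, and loads are modelled only by an element offset from an assumed common base pointer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical stand-in for a scalar load: only its element offset from a
// common base pointer matters for this sketch.
struct ScalarLoad {
  long Offset;
};

// Returns Mask such that Mask[I] is the lane of the address-sorted wide load
// that holds VL[I]. Returns std::nullopt if the sorted offsets are not
// consecutive, in which case the loads would have to be gathered.
std::optional<std::vector<unsigned>>
jumbledMask(const std::vector<ScalarLoad> &VL) {
  if (VL.empty())
    return std::nullopt;

  // Order[K] is the index in VL of the load with the K-th lowest offset.
  std::vector<unsigned> Order(VL.size());
  std::iota(Order.begin(), Order.end(), 0u);
  std::stable_sort(Order.begin(), Order.end(), [&](unsigned A, unsigned B) {
    return VL[A].Offset < VL[B].Offset;
  });

  // Mirror of the isConsecutiveAccess loop: every neighbour pair must be
  // exactly one element apart.
  for (std::size_t I = 0, E = Order.size() - 1; I < E; ++I)
    if (VL[Order[I + 1]].Offset != VL[Order[I]].Offset + 1)
      return std::nullopt;

  // Invert the sort order to get the permutation back to order of use.
  std::vector<unsigned> Mask(VL.size());
  for (unsigned Lane = 0; Lane < Order.size(); ++Lane)
    Mask[Order[Lane]] = Lane;
  return Mask;
}

// Applies a single-source permutation, as a shufflevector would.
std::vector<int> applyMask(const std::vector<int> &Wide,
                           const std::vector<unsigned> &Mask) {
  assert(Wide.size() == Mask.size() && "mask must cover every lane");
  std::vector<int> Out;
  Out.reserve(Mask.size());
  for (unsigned Lane : Mask)
    Out.push_back(Wide[Lane]);
  return Out;
}
```

For the order of use A[1], A[2], A[0], A[3] discussed in the thread, this sketch yields the mask 1, 2, 0, 3: the wide load starts at A[0], and the shuffle puts the lanes back in the order the users expect.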

DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);		newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
[594 lines not shown]	case Instruction::Load: {
ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *		ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy,		TTI->getMemoryOpCost(Instruction::Load, ScalarTy,
alignment, 0, VL0);		alignment, 0, VL0);
}		}
int ScalarLdCost = VecTy->getNumElements() *		int ScalarLdCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);
int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,		int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,
VecTy, alignment, 0, VL0);		VecTy, alignment, 0, VL0);
		// Add the cost of shuffle for jumbled loads
		if (!E->JumbleShuffleIndices.empty()) {
		VecLdCost += TTI->getShuffleCost(
		TargetTransformInfo::SK_PermuteSingleSrc, VecTy, 0);
		ABataev: You can remove the last argument here.
		ashahid: Sure.
		}
if (!isConsecutiveAccess(VL[0], VL[1], DL, SE)) {		if (!isConsecutiveAccess(VL[0], VL[1], DL, SE)) {
VecLdCost += TTI->getShuffleCost(		VecLdCost += TTI->getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc, VecTy);		TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
}		}
return ReuseShuffleCost + VecLdCost - ScalarLdCost;		return ReuseShuffleCost + VecLdCost - ScalarLdCost;
		ABataev: You don't need so many shuffles; just one is enough.
		ashahid: This is basically for multiple in-tree uses with different masks/permutations.
}		}
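
The load case above compares one wide load, plus one permute shuffle when the loads are jumbled, against the sum of the scalar loads. A rough sketch of that arithmetic follows, assuming made-up unit costs: `LoadCosts` and `jumbledLoadCostDelta` are hypothetical names, real costs come from `TargetTransformInfo`, and the `ReuseShuffleCost` term is left out for brevity.

```cpp
#include <cassert>

// Hypothetical per-operation costs; in the patch these values are queried
// from TargetTransformInfo instead of being passed in.
struct LoadCosts {
  int ScalarLoad; // cost of one scalar load
  int VectorLoad; // cost of one wide load
  int Permute;    // cost of one single-source permute shuffle
};

// Returns vector cost minus scalar cost. A negative value means the
// vectorized form is estimated to be cheaper.
int jumbledLoadCostDelta(const LoadCosts &C, int NumLanes, bool IsJumbled) {
  assert(NumLanes > 0 && "expected at least one lane");
  int ScalarLdCost = NumLanes * C.ScalarLoad;
  int VecLdCost = C.VectorLoad;
  // Jumbled loads need one extra shuffle to restore the order of use.
  if (IsJumbled)
    VecLdCost += C.Permute;
  return VecLdCost - ScalarLdCost;
}
```

With unit costs of 1 and four lanes, the jumbled case comes out at 1 + 1 - 4 = -2, so the extra shuffle only makes vectorization unprofitable when the target's permute is expensive.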
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *		ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) *
TTI->getMemoryOpCost(Instruction::Store, ScalarTy,		TTI->getMemoryOpCost(Instruction::Store, ScalarTy,
alignment, 0, VL0);		alignment, 0, VL0);
[597 lines not shown]	Value *BoUpSLP::Gather(ArrayRef<Value *> VL, VectorType *Ty) {

return Vec;		return Vec;
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {
InstructionsState S = getSameOpcode(VL);		InstructionsState S = getSameOpcode(VL);
if (S.Opcode) {		if (S.Opcode) {
if (TreeEntry *E = getTreeEntry(S.OpValue)) {		if (TreeEntry *E = getTreeEntry(S.OpValue)) {
		if (!E->VectorizedValue && !E->JumbleShuffleIndices.empty())
		ABataev: Why do you need this condition?
		ashahid: In the second test of jumbled-load.ll, the two operands of the MUL are fed from the same loaded vector. The first operand is a SHUFFLE of the LOAD and the second operand is a gather of the same scalar loads. A query to `getTreeEntry()` will always return the node with the same vectorized value, and hence both operands of the MUL would be fed the shuffled load. This check is to avoid this scenario.
		ABataev: This scenario should not happen in your patch; the instruction is either vectorized or gathered, but not both.
		ashahid: This check is to avoid feeding the generated SHUFFLE to both operands of the MUL, which is not the intention of the test case.
		return vectorizeTree(E);

if (E->isSame(VL)) {		if (E->isSame(VL)) {
Value *V = vectorizeTree(E);		Value *V = vectorizeTree(E);
		Ayal: See above discussion about replacing the second condition with an assert.
if (VL.size() == E->Scalars.size() && !E->ReuseShuffleIndices.empty()) {		if (VL.size() == E->Scalars.size() && !E->ReuseShuffleIndices.empty()) {
// We need to get the vectorized value but without shuffle.		// We need to get the vectorized value but without shuffle.
if (auto *SV = dyn_cast<ShuffleVectorInst>(V)) {		if (auto *SV = dyn_cast<ShuffleVectorInst>(V)) {
V = SV->getOperand(0);		V = SV->getOperand(0);
} else {		} else {
// Reshuffle to get only unique values.		// Reshuffle to get only unique values.
SmallVector<unsigned, 4> UniqueIdxs;		SmallVector<unsigned, 4> UniqueIdxs;
SmallSet<unsigned, 4> UsedIdxs;		SmallSet<unsigned, 4> UsedIdxs;
[73 lines not shown]	if (NeedToShuffleReuses) {
GatherSeq.insert(I);		GatherSeq.insert(I);
CSEBlocks.insert(I->getParent());		CSEBlocks.insert(I->getParent());
}		}
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
return V;		return V;
}		}

		assert(ScalarToTreeEntry.count(E->Scalars[0]) &&
		"Expected user tree entry, missing!");

unsigned ShuffleOrOp = S.IsAltShuffle ?		unsigned ShuffleOrOp = S.IsAltShuffle ?
(unsigned) Instruction::ShuffleVector : S.Opcode;		(unsigned) Instruction::ShuffleVector : S.Opcode;
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI: {		case Instruction::PHI: {
PHINode *PH = dyn_cast<PHINode>(VL0);		PHINode *PH = dyn_cast<PHINode>(VL0);
Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());		Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());
Builder.SetCurrentDebugLocation(PH->getDebugLoc());		Builder.SetCurrentDebugLocation(PH->getDebugLoc());
PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());		PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());
[239 lines not shown]	case Instruction::Load: {
// Loads are inserted at the head of the tree because we don't want to		// Loads are inserted at the head of the tree because we don't want to
// sink them all the way down past store instructions.		// sink them all the way down past store instructions.
bool IsReversed =		bool IsReversed =
!isConsecutiveAccess(E->Scalars[0], E->Scalars[1], DL, SE);		!isConsecutiveAccess(E->Scalars[0], E->Scalars[1], DL, SE);
if (IsReversed)		if (IsReversed)
VL0 = cast<Instruction>(E->Scalars.back());		VL0 = cast<Instruction>(E->Scalars.back());
setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

LoadInst *LI = cast<LoadInst>(VL0);		LoadInst *LI = cast<LoadInst>(VL0);;
		Ayal: ditto
		ABataev: Restore the original code here.
		ashahid: Thanks.
Type *ScalarLoadTy = LI->getType();		Type *ScalarLoadTy = LI->getType();
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();

Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));

		sanjoy: Might be cleaner to abstract `(unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() && !UserTreeEntry->ShuffleMask[OpdNum].empty()` into a `UserTreeEntry->hasShuffleMaskForOp(Index)` helper.
// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
		Ayal: May be simpler to check `ShuffleMask.count(OpdNum)` instead.
		ashahid: Quite right.
		ABataev: Is this correct? `E->Scalars[0]` is exactly `VL0`.
		ashahid: Ah, both are the same.
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));

unsigned Alignment = LI->getAlignment();		unsigned Alignment = LI->getAlignment();
LI = Builder.CreateLoad(VecPtr);		LI = Builder.CreateLoad(VecPtr);
if (!Alignment) {		if (!Alignment) {
Alignment = DL->getABITypeAlignment(ScalarLoadTy);		Alignment = DL->getABITypeAlignment(ScalarLoadTy);
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
Value *V = propagateMetadata(LI, E->Scalars);		Value *V = propagateMetadata(LI, E->Scalars);
if (IsReversed) {		if (IsReversed) {
SmallVector<uint32_t, 4> Mask(E->Scalars.size());		SmallVector<uint32_t, 4> Mask(E->Scalars.size());
std::iota(Mask.rbegin(), Mask.rend(), 0);		std::iota(Mask.rbegin(), Mask.rend(), 0);
V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()), Mask);		V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()), Mask);
}		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
		if (!E->JumbleShuffleIndices.empty()) {
		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
		E->JumbleShuffleIndices, "shuffle");
		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

		ABataev: Remove this empty line.
return V;		return V;
}		}
case Instruction::Store: {		case Instruction::Store: {
		Ayal: clang-format
StoreInst *SI = cast<StoreInst>(VL0);		StoreInst *SI = cast<StoreInst>(VL0);
		Ayal: So can a load have more than one user in need of permuting the loaded lanes; are diamonds ok? OTOH, the branch seems redundant: can we assert that at least one user exists? (Missing space: `if[ ](UserIndx != -1)`)
		ashahid: Yes, there can be more than one user requiring a permutation of the loaded lanes. Not sure, but it seems diamonds are not enough.
		Ayal: So will each such user get its desired permutation of the loaded lanes? Only a single user is handled here.
		ashahid: Yes, each user will get its desired permutation of the loaded lanes, because the tree here is a DAG and distinct user tree entries will have different user indexes. OTOH, a specific user whose uses (operands) are all different permutations of the loaded lanes will be distinguished by `OpdNum`.
		Ayal: In `buildTree_rec` above, we're still looking for perfect diamonds w/o considering shuffled loads. So if a second user wants to shuffle a load similarly to what a first user wanted (and got, being first), they will not share the shuffle, right? The second user will gather its loads instead. In other words, a shuffled load will have only a single user, right?
		ashahid: Yes.
		Ayal: If a shuffled load will have only a single user, then a single optional ShuffleMask could be held at each load (def), instead of holding an array of ShuffleMasks per operand at the user. This could be done w/o introducing OpdNum. One way of generating the code, at least conceptually, could be to first generate it w/o the ShuffleMask, and then RAUW where the single user of the load is replaced by the shuffle. You may want to introduce OpdNum for future use, i.e., where a single ShuffleMask handles a non-trivial subset of a load's users. In any case, add a TODO in `buildTree_rec` to consider shuffled loads when looking for perfect diamonds, thereby reusing a ShuffleMask for multiple users in the future?
		ashahid: It seems I misunderstood your question. Actually, every shuffle of a load is used by a different user, which would be captured by OpdNum. In this sense a shuffled load can have multiple users, and this is the real issue I am trying to resolve with this patch, which was lacking in my earlier attempt.
		Ayal: In the jumbled-load-multiuse.ll testcase there are two users: the first (cmp) gets to use the shuffle whereas the second (select) ends up gathering its loads instead. OpdNum captures having a distinct shuffle per operand, iiuc, rather than a distinct shuffle per user, or support for having multiple users share a common shuffle.
		ashahid: For a distinct user, OpdNum=0 will capture the shuffle mask. For users that need distinct shuffle masks of a loaded value, OpdNum=0,1 (for a binary operation) will capture the required shuffle masks.
		Ayal: Can you show an example where two distinct users of the same load get to use the same shuffle, and an example where two such users get to use two distinct shuffles? Each user can have one or more operands; their OpdNum shouldn't matter. I suspect such examples may not exist: in the revised jumbled-load-multiuse.ll testcase below, two users of the same load want to use the same shuffle, but don't get to.
		Ayal:
		> So in this sense a shuffled load can have multiple user and this is the real issue I am trying to resolve with this patch which was lacking in my earlier attempt.
		The real issue I think you're trying to resolve with this patch, which was lacking in your earlier attempt, is to support loads that have multiple users and need shuffle(s), by allowing only a single (the first) user to feed from the single shuffle, and all other users to extract and gather their elements from the original unshuffled load, as shown in jumbled-load-multiuse.ll. This can be achieved by marking each user if it gets to use a shuffle or not, as done in this patch; or could alternatively be achieved by holding a single optional mask per load (as in the previous attempt), along with an indication of which user it should feed. In any case, a load can end up having at most a single shuffle, which in turn feeds a single user, right?
		ashahid: I have tried to depict the case I have in mind in the attached file. In the given figures, shuffle mask edges are captured by the OpdNum of the 'U's.
unsigned Alignment = SI->getAlignment();		unsigned Alignment = SI->getAlignment();
unsigned AS = SI->getPointerAddressSpace();		unsigned AS = SI->getPointerAddressSpace();

ValueList ScalarStoreValues;		ValueList ScalarStoreValues;
		Ayal: Can simply do `for (unsigned Entry : ShuffleMask[OpdNum])` instead of iterating explicitly over all lanes and retrieving each `UserTreeEntry->ShuffleMask[OpdNum][Lane]`.
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
		sanjoy: The cast to `Value *` should not be necessary.
ScalarStoreValues.push_back(cast<StoreInst>(V)->getValueOperand());		ScalarStoreValues.push_back(cast<StoreInst>(V)->getValueOperand());

setInsertPointAfterBundle(E->Scalars, VL0);		setInsertPointAfterBundle(E->Scalars, VL0);

Value *VecValue = vectorizeTree(ScalarStoreValues);		Value *VecValue = vectorizeTree(ScalarStoreValues);
Value *ScalarPtr = SI->getPointerOperand();		Value *ScalarPtr = SI->getPointerOperand();
Value *VecPtr = Builder.CreateBitCast(ScalarPtr, VecTy->getPointerTo(AS));		Value *VecPtr = Builder.CreateBitCast(ScalarPtr, VecTy->getPointerTo(AS));
StoreInst *S = Builder.CreateStore(VecValue, VecPtr);		StoreInst *S = Builder.CreateStore(VecValue, VecPtr);
[215 lines not shown]	for (const auto &ExternalUse : ExternalUses) {
// Skip users that we already RAUW. This happens when one instruction		// Skip users that we already RAUW. This happens when one instruction
// has multiple uses of the same value.		// has multiple uses of the same value.
if (User && !is_contained(Scalar->users(), User))		if (User && !is_contained(Scalar->users(), User))
continue;		continue;
TreeEntry *E = getTreeEntry(Scalar);		TreeEntry *E = getTreeEntry(Scalar);
assert(E && "Invalid scalar");		assert(E && "Invalid scalar");
assert(!E->NeedToGather && "Extracting from a gather list");		assert(!E->NeedToGather && "Extracting from a gather list");

Value *Vec = E->VectorizedValue;		Value *Vec = [&]() {
		if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
		sanjoy: `dyn_cast<XXX>(f)->g()` should never be necessary. Either the `dyn_cast` can return null, in which case you should check for that, or it can't and you should use `cast<>`. Also the cast of `Vec` to `Instruction` seems unnecessary: `ShuffleVectorInst` is an `Instruction`.
		ashahid: Here I am trying to ensure that the instructions are `ShuffleVectorInst` and `LoadInst` respectively. The cast of `Vec` to `Instruction` is to satisfy the membership of `getOperand()`, which the compiler otherwise reports as an error.
		Ayal: Use `isa` instead of `dyn_cast` here: `if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {` or alternatively do something like:
			Value *Vec = E->VectorizedValue;
			assert(Vec && "Can't find vectorizable value");
			if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
			  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
			    Vec = Load;
		if (!E->JumbleShuffleIndices.empty() && isa<LoadInst>(SVI->getOperand(0)))
		return SVI->getOperand(0);
		return E->VectorizedValue;
		}();
		sanjoy: I think you can rewrite this more cleanly using an immediately-invoked function expression:
			Value *Vec = [&]() {
			  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
			    if (auto *LI = dyn_cast<LoadInst>(SVI->getOperand(0)))
			      return LI->getOperand(0);
			  return E->VectorizedValue;
			}();
		ashahid: I tried this IIFE; however, I am getting an assertion "Tried to create extractelement operation on non-vector type!" for the jumbled-load-multiuse.ll test. Do you see any issue in this code?
		sanjoy: Yes, I think I should have written:
			Value *Vec = [&]() {
			  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
			    if (isa<LoadInst>(SVI->getOperand(0)))
			      return SVI->getOperand(0);
			  return E->VectorizedValue;
			}();
		Ayal: Yes, this simplifies the below "alternatively do something like:"
			Value *Vec = E->VectorizedValue;
			assert(Vec && "Can't find vectorizable value");
			if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
			  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
			    Vec = Load;
		ABataev: I think you can have default capture by value here rather than by reference.
		ABataev: I rather doubt you need all that stuff. You can use the original code.
		ashahid: This is required; otherwise the multiuse.ll test as well as PR32086.ll will fail, because the lanes were recorded according to the order of the scalar loads.
		ABataev: Again, it just may not happen in this patch.
		ashahid: It does happen and this test fails.
assert(Vec && "Can't find vectorizable value");		assert(Vec && "Can't find vectorizable value");

Value *Lane = Builder.getInt32(ExternalUse.Lane);		Value *Lane = Builder.getInt32(ExternalUse.Lane);
// If User == nullptr, the Scalar is used as extra arg. Generate		// If User == nullptr, the Scalar is used as extra arg. Generate
// ExtractElement instruction and update the record for this scalar in		// ExtractElement instruction and update the record for this scalar in
// ExternallyUsedValues.		// ExternallyUsedValues.
if (!User) {		if (!User) {
assert(ExternallyUsedValues.count(Scalar) &&		assert(ExternallyUsedValues.count(Scalar) &&
[2,792 lines not shown]
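
The `ExternalUses` thread above settles on an immediately-invoked lambda that looks through a shuffle to the wide load underneath, because external-use lanes were recorded in the order of the sorted loads. The sketch below reproduces that pattern outside LLVM, assuming a toy class hierarchy: `Value`, `Load`, `Shuffle`, and `stripJumbleShuffle` are hypothetical stand-ins, and `dynamic_cast` plays the role of `dyn_cast`/`isa`. It also shows why the first suggested version failed: returning the load's own operand hands back a pointer, not a vector.

```cpp
#include <cassert>

// Toy stand-ins for the LLVM classes involved; all names are hypothetical.
struct Value {
  virtual ~Value() = default;
};
struct Load : Value {};
struct Shuffle : Value {
  Value *Src; // first operand of the shuffle
  explicit Shuffle(Value *S) : Src(S) {}
};

// Returns the wide load feeding a shuffle if there is one, otherwise the
// vectorized value itself.
Value *stripJumbleShuffle(Value *Vectorized) {
  assert(Vectorized && "Can't find vectorizable value");
  Value *Vec = [&]() -> Value * {
    // Check the cast result before using it, rather than chaining a call
    // onto a cast that may yield null.
    if (auto *SV = dynamic_cast<Shuffle *>(Vectorized))
      if (dynamic_cast<Load *>(SV->Src))
        return SV->Src; // the load itself, not the load's pointer operand
    return Vectorized;
  }();
  return Vec;
}
```

A shuffle whose source is not a load is returned unchanged, which matches the fall-through to `E->VectorizedValue` in the reviewed code.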

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				ABataev: You need to add this test separately and show changes in it.
				; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s

				@array = external global [20 x [13 x i32]]

				define void @hoge(i64 %idx, <4 x i32>* %sink) {
				; CHECK-LABEL: @hoge(
				; CHECK-NEXT: bb:
				; CHECK-NOT: load <4 x i32>
				; CHECK-NOT: shufflevector <4 x i32>
				bb:
				ABataev: These checks are not autogenerated, fix it. Moreover, it is recommended to commit these tests separately with the checks for the original version of the compiler, and then update the checks with the fixed version to demonstrate the improvements.
				%0 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 5
				%1 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 6
				%2 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 7
				%3 = getelementptr inbounds [20 x [13 x i32]], [20 x [13 x i32]]* @array, i64 0, i64 %idx, i64 8
				%4 = load i32, i32* %1, align 4
				%5 = insertelement <4 x i32> undef, i32 %4, i32 0
				%6 = load i32, i32* %2, align 4
				%7 = insertelement <4 x i32> %5, i32 %6, i32 1
				%8 = load i32, i32* %3, align 4
				%9 = insertelement <4 x i32> %7, i32 %8, i32 2
				%10 = load i32, i32* %0, align 4
				%11 = insertelement <4 x i32> %9, i32 %10, i32 3
				store <4 x i32> %11, <4 x i32>* %sink
				ret void
				}

				Ayal: "SINK" is defined redundantly, as it is not used. Could this be simplified by removing the float-to-int casts? In general, it may suffice to check that there's no load of <4 x i32>, which would be jumbled. Checking that two of the lanes have been vectorized may be fragile, in case a modified cost model will decide it ain't worth it.

test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* bitcast (i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 1) to <2 x i32>*), align 4			; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 3), align 4			; CHECK-NEXT: [[TMP2:%.*]] = icmp sgt <4 x i32> [[TMP1]], zeroinitializer
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP2]], i32 2			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 8, i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP0]], i32 3			; CHECK-NEXT: [[TMP8:%.*]] = select <4 x i1> [[TMP2]], <4 x i32> [[TMP7]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: [[TMP9:%.*]] = icmp sgt <4 x i32> [[TMP8]], zeroinitializer			; CHECK-NEXT: store <4 x i32> [[TMP8]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 8, i32 3
	; CHECK-NEXT: [[TMP13:%.*]] = select <4 x i1> [[TMP9]], <4 x i32> [[TMP12]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4
	[13 lines not shown]

test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				ABataev: You need to add this test separately and show changes in it.
				; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s


				;void jumble (int * restrict A, int * restrict B) {
				; int tmp0 = A[10]*A[0];
				; int tmp1 = A[11]*A[1];
				; int tmp2 = A[12]*A[3];
				; int tmp3 = A[13]*A[2];
				; B[0] = tmp0;
				; B[1] = tmp1;
				; B[2] = tmp2;
				; B[3] = tmp3;
				;}


				; Function Attrs: norecurse nounwind uwtable
				define void @jumble1(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
				; CHECK-LABEL: @jumble1(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
				; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
				; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
				; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
				; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
				; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
				; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
				Ayal: Suggested to also have a test where the 2nd operand is a shuffle but the 1st one isn't, which will fail if shuffles are added using emplace_back().
				; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP1]], [[TMP4]]
				; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
				; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
				; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
				; CHECK-NEXT: ret void
				;
				entry:
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
				%0 = load i32, i32* %arrayidx, align 4
				%1 = load i32, i32* %A, align 4
				%mul = mul nsw i32 %0, %1
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
				%2 = load i32, i32* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds i32, i32* %A, i64 1
				%3 = load i32, i32* %arrayidx3, align 4
				%mul4 = mul nsw i32 %2, %3
				%arrayidx5 = getelementptr inbounds i32, i32* %A, i64 12
				%4 = load i32, i32* %arrayidx5, align 4
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i64 3
				%5 = load i32, i32* %arrayidx6, align 4
				%mul7 = mul nsw i32 %4, %5
				%arrayidx8 = getelementptr inbounds i32, i32* %A, i64 13
				%6 = load i32, i32* %arrayidx8, align 4
				%arrayidx9 = getelementptr inbounds i32, i32* %A, i64 2
				%7 = load i32, i32* %arrayidx9, align 4
				%mul10 = mul nsw i32 %6, %7
				store i32 %mul, i32* %B, align 4
				%arrayidx12 = getelementptr inbounds i32, i32* %B, i64 1
				store i32 %mul4, i32* %arrayidx12, align 4
				%arrayidx13 = getelementptr inbounds i32, i32* %B, i64 2
				store i32 %mul7, i32* %arrayidx13, align 4
				%arrayidx14 = getelementptr inbounds i32, i32* %B, i64 3
				store i32 %mul10, i32* %arrayidx14, align 4
				ret void
				}

				;Reversing the operand of MUL
				; Function Attrs: norecurse nounwind uwtable
				define void @jumble2(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
				; CHECK-LABEL: @jumble2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 10
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 11
				; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 12
				; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
				; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 13
				; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARRAYIDX]] to <4 x i32>*
				; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
				; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
				; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
				; CHECK-NEXT: [[TMP5:%.*]] = mul nsw <4 x i32> [[TMP4]], [[TMP1]]
				; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
				; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
				; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[B]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
				; CHECK-NEXT: ret void
				;
				entry:
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 10
				%0 = load i32, i32* %arrayidx, align 4
				%1 = load i32, i32* %A, align 4
				%mul = mul nsw i32 %1, %0
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 11
				%2 = load i32, i32* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds i32, i32* %A, i64 1
				%3 = load i32, i32* %arrayidx3, align 4
				%mul4 = mul nsw i32 %3, %2
				%arrayidx5 = getelementptr inbounds i32, i32* %A, i64 12
				%4 = load i32, i32* %arrayidx5, align 4
				%arrayidx6 = getelementptr inbounds i32, i32* %A, i64 3
				%5 = load i32, i32* %arrayidx6, align 4
				%mul7 = mul nsw i32 %5, %4
				%arrayidx8 = getelementptr inbounds i32, i32* %A, i64 13
				%6 = load i32, i32* %arrayidx8, align 4
				%arrayidx9 = getelementptr inbounds i32, i32* %A, i64 2
				%7 = load i32, i32* %arrayidx9, align 4
				%mul10 = mul nsw i32 %7, %6
				store i32 %mul, i32* %B, align 4
				%arrayidx12 = getelementptr inbounds i32, i32* %B, i64 1
				store i32 %mul4, i32* %arrayidx12, align 4
				%arrayidx13 = getelementptr inbounds i32, i32* %B, i64 2
				store i32 %mul7, i32* %arrayidx13, align 4
				%arrayidx14 = getelementptr inbounds i32, i32* %B, i64 3
				store i32 %mul10, i32* %arrayidx14, align 4
				ret void
				}
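
The two tests above expect one operand of the multiply to be a plain consecutive vector load and the other to be a consecutive load followed by a `shufflevector` with mask `<0, 1, 3, 2>`. The sketch below is a scalar C++ model of that equivalence, not code from the patch; the helper names `shuffle`, `jumbleScalar` and `jumbleVector` are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Models `shufflevector <4 x i32> %v, <4 x i32> undef, <mask>`:
// lane i of the result takes lane mask[i] of the input.
std::array<int, 4> shuffle(const std::array<int, 4> &v,
                           const std::array<int, 4> &mask) {
  std::array<int, 4> out = {0, 0, 0, 0};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = v[static_cast<std::size_t>(mask[i])];
  return out;
}

// Scalar form, following the C comment above @jumble1.
std::array<int, 4> jumbleScalar(const int *A) {
  std::array<int, 4> out = {A[10] * A[0], A[11] * A[1],
                            A[12] * A[3], A[13] * A[2]};
  return out;
}

// Vector form: two consecutive 4-wide loads, one shuffle, one multiply.
std::array<int, 4> jumbleVector(const int *A) {
  const std::array<int, 4> hi = {A[10], A[11], A[12], A[13]}; // already in lane order
  const std::array<int, 4> lo = {A[0], A[1], A[2], A[3]};     // memory order
  const std::array<int, 4> mask = {0, 1, 3, 2};               // lane order of A[0..3]
  const std::array<int, 4> loShuffled = shuffle(lo, mask);
  std::array<int, 4> out = {0, 0, 0, 0};
  for (std::size_t i = 0; i < 4; ++i)
    out[i] = hi[i] * loShuffled[i];
  return out;
}
```

Because integer multiplication is commutative, the same model covers `@jumble2`, where only the operand order of the `mul` differs.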

test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				ABataev: You need to add this test separately and show changes in it.
				; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s

				;void phiUsingLoads(int *restrict A, int *restrict B) {
				; int tmp0, tmp1, tmp2, tmp3;
				; for (int i = 0; i < 100; i++) {
				; if (A[0] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 2];
				; tmp3 = A[i + 3];
				; } else if (A[25] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 2];
				; tmp3 = A[i + 3];
				; } else if (A[50] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 2];
				; tmp3 = A[i + 3];
				; } else if (A[75] == 0) {
				; tmp0 = A[i + 0];
				; tmp1 = A[i + 1];
				; tmp2 = A[i + 3];
				; tmp3 = A[i + 2];
				; }
				; }
				; B[0] = tmp0;
				; B[1] = tmp1;
				; B[2] = tmp2;
				; B[3] = tmp3;
				;}


				; Function Attrs: norecurse nounwind uwtable
				define void @phiUsingLoads(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) local_unnamed_addr #0 {
				; CHECK-LABEL: @phiUsingLoads(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[A:%.*]], align 4
				; CHECK-NEXT: [[CMP1:%.*]] = icmp eq i32 [[TMP0]], 0
				; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 25
				; CHECK-NEXT: [[ARRAYIDX28:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 50
				; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 75
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: [[ARRAYIDX64:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
				; CHECK-NEXT: [[ARRAYIDX65:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
				; CHECK-NEXT: [[ARRAYIDX66:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[B]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP27:%.*]], <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: ret void
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_INC:%.*]] ]
				; CHECK-NEXT: [[TMP2:%.*]] = phi <4 x i32> [ undef, [[ENTRY]] ], [ [[TMP27]], [[FOR_INC]] ]
				; CHECK-NEXT: br i1 [[CMP1]], label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
				; CHECK: if.then:
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP3:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP3]]
				; CHECK-NEXT: [[TMP4:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP4]]
				; CHECK-NEXT: [[TMP5:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX11:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP5]]
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[ARRAYIDX2]] to <4 x i32>*
				; CHECK-NEXT: [[TMP7:%.*]] = load <4 x i32>, <4 x i32>* [[TMP6]], align 4
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: if.else:
				; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX12]], align 4
				; CHECK-NEXT: [[CMP13:%.*]] = icmp eq i32 [[TMP8]], 0
				; CHECK-NEXT: br i1 [[CMP13]], label [[IF_THEN14:%.*]], label [[IF_ELSE27:%.*]]
				; CHECK: if.then14:
				; CHECK-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP9:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX20:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP9]]
				; CHECK-NEXT: [[TMP10:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX23:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP10]]
				; CHECK-NEXT: [[TMP11:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX26:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP11]]
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[ARRAYIDX17]] to <4 x i32>*
				; CHECK-NEXT: [[TMP13:%.*]] = load <4 x i32>, <4 x i32>* [[TMP12]], align 4
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: if.else27:
				; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX28]], align 4
				; CHECK-NEXT: [[CMP29:%.*]] = icmp eq i32 [[TMP14]], 0
				; CHECK-NEXT: br i1 [[CMP29]], label [[IF_THEN30:%.*]], label [[IF_ELSE43:%.*]]
				; CHECK: if.then30:
				; CHECK-NEXT: [[ARRAYIDX33:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP15:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX36:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP15]]
				; CHECK-NEXT: [[TMP16:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX39:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX42:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP17]]
				; CHECK-NEXT: [[TMP18:%.*]] = bitcast i32* [[ARRAYIDX33]] to <4 x i32>*
				; CHECK-NEXT: [[TMP19:%.*]] = load <4 x i32>, <4 x i32>* [[TMP18]], align 4
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: if.else43:
				; CHECK-NEXT: [[TMP20:%.*]] = load i32, i32* [[ARRAYIDX44]], align 4
				; CHECK-NEXT: [[CMP45:%.*]] = icmp eq i32 [[TMP20]], 0
				; CHECK-NEXT: br i1 [[CMP45]], label [[IF_THEN46:%.*]], label [[FOR_INC]]
				; CHECK: if.then46:
				; CHECK-NEXT: [[ARRAYIDX49:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP21:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX52:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP21]]
				; CHECK-NEXT: [[TMP22:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 3
				; CHECK-NEXT: [[ARRAYIDX55:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP22]]
				; CHECK-NEXT: [[TMP23:%.*]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[ARRAYIDX58:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[TMP23]]
				; CHECK-NEXT: [[TMP24:%.*]] = bitcast i32* [[ARRAYIDX49]] to <4 x i32>*
				; CHECK-NEXT: [[TMP25:%.*]] = load <4 x i32>, <4 x i32>* [[TMP24]], align 4
				; CHECK-NEXT: [[TMP26:%.*]] = shufflevector <4 x i32> [[TMP25]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
				; CHECK-NEXT: br label [[FOR_INC]]
				; CHECK: for.inc:
				; CHECK-NEXT: [[TMP27]] = phi <4 x i32> [ [[TMP7]], [[IF_THEN]] ], [ [[TMP13]], [[IF_THEN14]] ], [ [[TMP19]], [[IF_THEN30]] ], [ [[TMP26]], [[IF_THEN46]] ], [ [[TMP2]], [[IF_ELSE43]] ]
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 100
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY]]
				;
				entry:
				%0 = load i32, i32* %A, align 4
				%cmp1 = icmp eq i32 %0, 0
				%arrayidx12 = getelementptr inbounds i32, i32* %A, i64 25
				%arrayidx28 = getelementptr inbounds i32, i32* %A, i64 50
				%arrayidx44 = getelementptr inbounds i32, i32* %A, i64 75
				br label %for.body

				for.cond.cleanup: ; preds = %for.inc
				store i32 %tmp0.1, i32* %B, align 4
				%arrayidx64 = getelementptr inbounds i32, i32* %B, i64 1
				store i32 %tmp1.1, i32* %arrayidx64, align 4
				%arrayidx65 = getelementptr inbounds i32, i32* %B, i64 2
				store i32 %tmp2.1, i32* %arrayidx65, align 4
				%arrayidx66 = getelementptr inbounds i32, i32* %B, i64 3
				store i32 %tmp3.1, i32* %arrayidx66, align 4
				ret void

				for.body: ; preds = %for.inc, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.inc ]
				%tmp3.0111 = phi i32 [ undef, %entry ], [ %tmp3.1, %for.inc ]
				%tmp2.0110 = phi i32 [ undef, %entry ], [ %tmp2.1, %for.inc ]
				%tmp1.0109 = phi i32 [ undef, %entry ], [ %tmp1.1, %for.inc ]
				%tmp0.0108 = phi i32 [ undef, %entry ], [ %tmp0.1, %for.inc ]
				br i1 %cmp1, label %if.then, label %if.else

				if.then: ; preds = %for.body
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%2 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx5 = getelementptr inbounds i32, i32* %A, i64 %2
				%3 = load i32, i32* %arrayidx5, align 4
				%4 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx8 = getelementptr inbounds i32, i32* %A, i64 %4
				%5 = load i32, i32* %arrayidx8, align 4
				%6 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx11 = getelementptr inbounds i32, i32* %A, i64 %6
				%7 = load i32, i32* %arrayidx11, align 4
				br label %for.inc

				if.else: ; preds = %for.body
				%8 = load i32, i32* %arrayidx12, align 4
				%cmp13 = icmp eq i32 %8, 0
				br i1 %cmp13, label %if.then14, label %if.else27

				if.then14: ; preds = %if.else
				%arrayidx17 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%9 = load i32, i32* %arrayidx17, align 4
				%10 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx20 = getelementptr inbounds i32, i32* %A, i64 %10
				%11 = load i32, i32* %arrayidx20, align 4
				%12 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx23 = getelementptr inbounds i32, i32* %A, i64 %12
				%13 = load i32, i32* %arrayidx23, align 4
				%14 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx26 = getelementptr inbounds i32, i32* %A, i64 %14
				%15 = load i32, i32* %arrayidx26, align 4
				br label %for.inc

				if.else27: ; preds = %if.else
				%16 = load i32, i32* %arrayidx28, align 4
				%cmp29 = icmp eq i32 %16, 0
				br i1 %cmp29, label %if.then30, label %if.else43

				if.then30: ; preds = %if.else27
				%arrayidx33 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%17 = load i32, i32* %arrayidx33, align 4
				%18 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx36 = getelementptr inbounds i32, i32* %A, i64 %18
				%19 = load i32, i32* %arrayidx36, align 4
				%20 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx39 = getelementptr inbounds i32, i32* %A, i64 %20
				%21 = load i32, i32* %arrayidx39, align 4
				%22 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx42 = getelementptr inbounds i32, i32* %A, i64 %22
				%23 = load i32, i32* %arrayidx42, align 4
				br label %for.inc

				if.else43: ; preds = %if.else27
				%24 = load i32, i32* %arrayidx44, align 4
				%cmp45 = icmp eq i32 %24, 0
				br i1 %cmp45, label %if.then46, label %for.inc

				if.then46: ; preds = %if.else43
				%arrayidx49 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%25 = load i32, i32* %arrayidx49, align 4
				%26 = add nuw nsw i64 %indvars.iv, 1
				%arrayidx52 = getelementptr inbounds i32, i32* %A, i64 %26
				%27 = load i32, i32* %arrayidx52, align 4
				%28 = add nuw nsw i64 %indvars.iv, 3
				%arrayidx55 = getelementptr inbounds i32, i32* %A, i64 %28
				%29 = load i32, i32* %arrayidx55, align 4
				%30 = add nuw nsw i64 %indvars.iv, 2
				%arrayidx58 = getelementptr inbounds i32, i32* %A, i64 %30
				%31 = load i32, i32* %arrayidx58, align 4
				br label %for.inc

				for.inc: ; preds = %if.then, %if.then30, %if.else43, %if.then46, %if.then14
				%tmp0.1 = phi i32 [ %1, %if.then ], [ %9, %if.then14 ], [ %17, %if.then30 ], [ %25, %if.then46 ], [ %tmp0.0108, %if.else43 ]
				%tmp1.1 = phi i32 [ %3, %if.then ], [ %11, %if.then14 ], [ %19, %if.then30 ], [ %27, %if.then46 ], [ %tmp1.0109, %if.else43 ]
				%tmp2.1 = phi i32 [ %5, %if.then ], [ %13, %if.then14 ], [ %21, %if.then30 ], [ %29, %if.then46 ], [ %tmp2.0110, %if.else43 ]
				%tmp3.1 = phi i32 [ %7, %if.then ], [ %15, %if.then14 ], [ %23, %if.then30 ], [ %31, %if.then46 ], [ %tmp3.0111, %if.else43 ]
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 100
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

test/Transforms/SLPVectorizer/X86/jumbled-load.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s		; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {		define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
; CHECK-LABEL: @jumbled-load(		; CHECK-LABEL: @jumbled-load(
; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0		; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
		; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]		; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]		; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]		; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1		; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4		; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2		; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3
; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4
; CHECK-NEXT: ret i32 undef		; CHECK-NEXT: ret i32 undef
;		;
%in.addr = getelementptr inbounds i32, i32* %in, i64 0		%in.addr = getelementptr inbounds i32, i32* %in, i64 0
%load.1 = load i32, i32* %in.addr, align 4		%load.1 = load i32, i32* %in.addr, align 4
%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3		%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
%load.2 = load i32, i32* %gep.1, align 4		%load.2 = load i32, i32* %gep.1, align 4
%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1		%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
%load.3 = load i32, i32* %gep.2, align 4		%load.3 = load i32, i32* %gep.2, align 4
[17 lines not shown]
store i32 %mul.2, i32* %gep.8, align 4		store i32 %mul.2, i32* %gep.8, align 4
%gep.9 = getelementptr inbounds i32, i32* %out, i64 2		%gep.9 = getelementptr inbounds i32, i32* %out, i64 2
store i32 %mul.3, i32* %gep.9, align 4		store i32 %mul.3, i32* %gep.9, align 4
%gep.10 = getelementptr inbounds i32, i32* %out, i64 3		%gep.10 = getelementptr inbounds i32, i32* %out, i64 3
store i32 %mul.4, i32* %gep.10, align 4		store i32 %mul.4, i32* %gep.10, align 4

ret i32 undef		ret i32 undef
}		}


		define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {
		ABataev: You need to add this test separately and show changes in it.
		; CHECK-LABEL: @jumbled-load-multiuses(
		; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
		; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
		; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
		; CHECK-NEXT: [[TMP4:%.*]] = extractelement <4 x i32> [[TMP2]], i32 2
		; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> undef, i32 [[TMP4]], i32 0
		; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x i32> [[TMP2]], i32 3
		; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP6]], i32 1
		; CHECK-NEXT: [[TMP8:%.*]] = extractelement <4 x i32> [[TMP2]], i32 0
		; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP7]], i32 [[TMP8]], i32 2
		; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x i32> [[TMP2]], i32 1
		; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[TMP10]], i32 3
		; CHECK-NEXT: [[TMP12:%.*]] = mul <4 x i32> [[TMP3]], [[TMP11]]
		; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
		; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
		; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* [[TMP13]], align 4
		; CHECK-NEXT: ret i32 undef
		;
		%in.addr = getelementptr inbounds i32, i32* %in, i64 0
		%load.1 = load i32, i32* %in.addr, align 4
		%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
		%load.2 = load i32, i32* %gep.1, align 4
		%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
		%load.3 = load i32, i32* %gep.2, align 4
		%gep.3 = getelementptr inbounds i32, i32* %in.addr, i64 2
		%load.4 = load i32, i32* %gep.3, align 4
		%mul.1 = mul i32 %load.3, %load.4
		%mul.2 = mul i32 %load.2, %load.2
		%mul.3 = mul i32 %load.4, %load.1
		%mul.4 = mul i32 %load.1, %load.3
		%gep.7 = getelementptr inbounds i32, i32* %out, i64 0
		store i32 %mul.1, i32* %gep.7, align 4
		%gep.8 = getelementptr inbounds i32, i32* %out, i64 1
		store i32 %mul.2, i32* %gep.8, align 4
		%gep.9 = getelementptr inbounds i32, i32* %out, i64 2
		store i32 %mul.3, i32* %gep.9, align 4
		%gep.10 = getelementptr inbounds i32, i32* %out, i64 3
		store i32 %mul.4, i32* %gep.10, align 4

		ret i32 undef
		}

test/Transforms/SLPVectorizer/X86/store-jumbled.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_1]], [[LOAD_5]]			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_6]]			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_3]], [[LOAD_7]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_4]], [[LOAD_8]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_9]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_7]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_10]], align 4
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4


Diff 136070
