This is an archive of the discontinued LLVM Phabricator instance.

Differential D36130

[SLP] Vectorize jumbled memory loads.
Accepted, Public

Authored by • ashahid on Aug 1 2017, 12:31 AM.

Details

Reviewers

mkuper
loladiro
Ayal
zvi
danielcdh
ABataev

Commits

rGdbd30edb7ff8: [SLP] Vectorize jumbled memory loads.
rG1d5422f27f60: [SLP] Vectorize jumbled memory loads.
rG2b281de5769e: [SLP] Vectorize jumbled memory loads.
rGf8db9bd85791: [SLP] Vectorize jumbled memory loads.
rL320548: [SLP] Vectorize jumbled memory loads.
rL314806: [SLP] Vectorize jumbled memory loads.
rL313771: [SLP] Vectorize jumbled memory loads.
rL313736: [SLP] Vectorize jumbled memory loads.

Summary

This patch tries to vectorize loads from consecutive memory locations that are
accessed in a non-consecutive, or jumbled, order. An earlier attempt, patch D26905,
was reverted because of a basic issue with representing the 'use mask' of
jumbled accesses.

This patch fixes the mask representation by recording the 'use mask' in the user tree entry.

Change-Id: I9fe7f5045f065d84c126fa307ef6ebe0787296df

Diff Detail

Build Status

Buildable 8804
Build 8804: arc lint + arc unit

Event Timeline

Patch update for fixing build bot failure:

This fix changes the placeholder for the shuffle mask from a fixed array of 3 elements
to a std::map. The need arises because a PHI node can have any number of
operands as incoming values.

Test performed:
LLVM lit test, 3 stage bootstrap build and LNT (Thanks to Hans and Daniel)
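
The container change described in this update can be sketched in isolation. The snippet below is a minimal, standalone illustration that uses standard-library types rather than LLVM's; `TreeEntrySketch`, `setMask`, `hasMask`, and `recordMaskForFourthOperand` are hypothetical names and are not taken from the patch. It shows a map keyed by operand number accepting a mask for the 4th incoming value of a PHI-like user, which a fixed array of three slots could not hold.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tree entry; not the patch's actual TreeEntry.
struct TreeEntrySketch {
  // One shuffle mask per operand number. A PHI node may have any number of
  // incoming values, so the key range is not bounded by a small constant.
  std::map<unsigned, std::vector<unsigned>> ShuffleMask;

  void setMask(unsigned OpdNum, std::vector<unsigned> Mask) {
    ShuffleMask[OpdNum] = std::move(Mask);
  }

  bool hasMask(unsigned OpdNum) const {
    return ShuffleMask.count(OpdNum) != 0;
  }
};

// Records a mask for the 4th incoming value (operand 3) of a PHI-like user.
inline TreeEntrySketch recordMaskForFourthOperand() {
  TreeEntrySketch Entry;
  Entry.setMask(3, {1, 2, 0, 3});
  return Entry;
}
```

Operands without a recorded mask simply have no key, so a lookup with `count` distinguishes "no shuffle needed" from "shuffle with this mask".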

Harbormaster completed remote builds in B12741: Diff 125470. Dec 4 2017, 9:17 PM

In D36130#944703, @ashahid wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Good catch. Add a LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
510	The fixed array SmallVector<unsigned, 4> ShuffleMask[3]; of the previous version indeed cannot account for all operands. How about holding a SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask; instead of holding a map from 0,1,2,..,numOperands ?
536	Are both conditions really needed, or would it suffice to check for -1 and assert that positive indices are not too large?
2730	May be simpler to check instead ShuffleMask.count(OpdNum)
2747	clang-format

In D36130#945306, @hans wrote:

In D36130#944703, @ashahid wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Thanks Hans for triage.

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in a few of the LNT MultiSource benchmarks. How can it be extracted into a LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
510	I think this can be done. I will try.
536	Sure, I will check. I am thinking of 30000 as the threshold for large indices; do you have a number in mind?
2730	Quite right.

• ashahid added inline comments. Dec 8 2017, 8:12 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
536	I tried, but it seems both conditions are needed, as I am getting the assertion "Idx < size()" for SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask.

Updated the review comments.

Herald added a subscriber: mgrang. Dec 9 2017, 12:50 AM

Minor commented code clean up done.

In D36130#946236, @ashahid wrote:

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in a few of the LNT MultiSource benchmarks. How can it be extracted into a LIT test?

It suffices to have a phi with 4 predecessors, where (at least) the 4th needs a shuffle mask.

lib/Transforms/Vectorize/SLPVectorizer.cpp
536	UserTreeIdx is the index of the User entry as we build the tree bottom-up, so it should always be between 0 and VectorizableTree.size()-1, except for -1 when creating the new entry for the root, which is User-less. So it should suffice to check if Idx is -1, and otherwise assert that Idx < size(), if desired, right?
537–538	Code below still uses emplace_back contrary to the discussion above. May need to call UserTreeEntry->ShuffleMask.resize() if OpdNum is larger than its initial/current size, before setting UserTreeEntry->ShuffleMask[OpdNum] = tempMask. (Otherwise the original "LNT Multisource bench mark" asserts should trigger again?) Suggest to add a test where the first operand does not need a shuffle but the second one does.
2486	See above discussion about replacing second condition with an assert.
2723	ditto
test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗	(On Diff #126266)	Why add -debug?
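
The difference between appending with `emplace_back` and storing at the operand number can be shown with a small standalone sketch. It assumes plain `std::vector` in place of `SmallVector`, and the function names `storeWithEmplaceBack` and `storeAtOperand` are hypothetical. When the first operand needs no shuffle and the second one does, appending puts the mask in slot 0, whereas resizing and indexing puts it in slot 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<unsigned>;

// Appends the mask at the next free slot, ignoring the operand number.
// If operand 0 needs no mask, the mask meant for operand 1 lands in slot 0.
inline void storeWithEmplaceBack(std::vector<Mask> &Masks, const Mask &M) {
  Masks.emplace_back(M);
}

// Grows the container on demand and stores the mask at its operand number.
inline void storeAtOperand(std::vector<Mask> &Masks, std::size_t OpdNum,
                           const Mask &M) {
  if (OpdNum >= Masks.size())
    Masks.resize(OpdNum + 1);
  Masks[OpdNum] = M;
}
```

With the second form, slots for operands that need no shuffle stay empty, so an emptiness check on `Masks[OpdNum]` remains meaningful.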

Review comments updated and added lit tests.

Harbormaster completed remote builds in B12974: Diff 126374. Dec 11 2017, 8:34 AM

• ashahid added inline comments. Dec 11 2017, 8:41 AM

test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗	(On Diff #126266)	My bad, not intended.

This looks good to me, with a couple of last minor fixes.

Hope it stays in this time...

lib/Transforms/Vectorize/SLPVectorizer.cpp
543	alrea[d]y
2752	Can simply do `for (unsigned Entry : ShuffleMask[OpdNum])` instead of iterating explicitly over all lanes and retrieving each `UserTreeEntry->ShuffleMask[OpdNum][Lane]`.
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
31 ↗	(On Diff #126374)	Suggested to also have a test where the 2nd operand is a shuffle but the 1st one isn't, which will fail if shuffles are added using emplace_back().

Updated test and review comment.

Bootstrap and LNT test underway.

• ashahid closed this revision. Dec 12 2017, 7:09 PM

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

In D36130#955158, @eastig wrote:

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

A problem report: https://bugs.llvm.org/show_bug.cgi?id=35673

eastig mentioned this in D41324: [SLPVectorizer] Add shuffle instruction cost for jumbled load. Dec 18 2017, 4:11 AM

sanjoy added a subscriber: sanjoy. Dec 19 2017, 4:03 PM

sanjoy added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1056	This should be a `cast<>`.
1084	LLVM style is to avoid using curly braces on single-line for loops. Using `std::iota` would be even better.
lib/Transforms/Vectorize/SLPVectorizer.cpp
544	I think you should be able to do: auto &OperandMask = UserTreeEntry->ShuffleMask[OpdNum]; assert(OperandMask.empty()); OperandMask.insert(OperandMask.end(), ShuffleMask.begin(), ShuffleMask.end());
1418	Not sure why you need `NewVL` here -- doesn't just using `Sorted` work?
2729	Might be cleaner to abstract `(unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() && !UserTreeEntry->ShuffleMask[OpdNum].empty()` into a `UserTreeEntry->hasShuffleMaskForOp(Index)` helper.
2753	The cast to `Value *` should not be necessary.
2989	`dyn_cast<XXX>(f)->g()` should never be necessary. Either the `dyn_cast` can return null in which case you should check for that, or it can't and you should use `cast<>`. Also the cast of `Vec` to `Instruction` seems unnecessary: `ShuffleVectorInst` is an `Instruction`.
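
Two of the suggestions above, the `hasShuffleMaskForOp` helper and `std::iota`, can be sketched without LLVM. This is a hedged illustration only: `UserEntrySketch` and `identityMask` are hypothetical, `std::vector` stands in for `SmallVector`, and the explicit `OpdNum >= 0` test is an assumption added here rather than part of the condition quoted in the review.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the user tree entry discussed in the review.
struct UserEntrySketch {
  std::vector<std::vector<unsigned>> ShuffleMask;

  // True if a non-empty mask has been recorded for operand OpdNum.
  bool hasShuffleMaskForOp(int OpdNum) const {
    return OpdNum >= 0 &&
           static_cast<std::size_t>(OpdNum) < ShuffleMask.size() &&
           !ShuffleMask[OpdNum].empty();
  }
};

// Fills a mask with 0, 1, ..., N-1 using std::iota instead of a manual loop.
inline std::vector<unsigned> identityMask(std::size_t N) {
  std::vector<unsigned> Mask(N);
  std::iota(Mask.begin(), Mask.end(), 0u);
  return Mask;
}
```

Wrapping the bounds check and the emptiness check in one named predicate keeps each call site to a single readable condition.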

Ayal added inline comments. Dec 21 2017, 3:25 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
544	While we're at it, this should move under the `if (UserTreeIdx != -1)` to avoid checking if `&VectorizableTree[UserTreeIdx]` is null, as commented in https://reviews.llvm.org/D41324#inline-361435
1427	Should probably also check here that UserTreeIdx is not -1, to avoid creating a mask for the root with no place to hang it, as @sanjoy observed.

• ashahid added inline comments. Dec 22 2017, 6:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
544	If we check `if (UserTreeIdx != -1 && ShuffledLoad)` before the call to newTreeEntry(), we can avoid the "UserTreeIdx != -1" check inside newTreeEntry() completely.
1427	Yes, I had planned to do exactly this.

• ashahid reopened this revision. Dec 28 2017, 11:04 PM

• ashahid marked 8 inline comments as done.

• ashahid added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2989	Here I am trying to ensure that the instructions are "ShuffleVectorInst" and "LoadInst" respectively. The cast of Vec to Instruction is there to give access to the getOperand() member, which the compiler otherwise reports as an error.

This revision is now accepted and ready to land. Dec 28 2017, 11:04 PM

Updates review comments.

Regression test and LNT passes, 3 stage bootstrap test underway.

Ayal added inline comments. Dec 29 2017, 7:31 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp

2989

Use isa instead of dyn_cast here:
if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {

or alternatively do something like:

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updated Ayal's comment accordingly

• ashahid marked an inline comment as done. Jan 1 2018, 8:01 AM

Ping!

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

In D36130#973399, @ashahid wrote:

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

Ah, right, sorry, missed it.

This looks good to me, with only minor comments about the testcase.

Please see that @sanjoy approves too, as this mostly addresses issues he raised.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
27 ↗	(On Diff #128388)	"SINK" is defined redundantly, as it is not used. Could this be simplified by removing the float-to-int casts? In general, it may suffice to check that there's no load of <4 x i32>, which would be jumbled. Checking that two of the lanes have been vectorized may be fragile, in case a modified cost model will decide it ain't worth it.

sanjoy accepted this revision. Jan 13 2018, 2:40 PM

sanjoy added inline comments.

lib/Analysis/LoopAccessAnalysis.cpp
1044	The indent looks off here; can you please run clang-format?
lib/Transforms/Vectorize/SLPVectorizer.cpp
497	Optional: you can write `return X;` instead of `if (X) return true; return false;`.
1422	Nit: s/usefull/useful/
2988–2994	I think you can rewrite this more cleanly using an immediately-invoked function expression: Value *Vec = [&]() { if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) if (auto *LI = dyn_cast<LoadInst>(SVI->getOperand(0))) return LI->getOperand(0); return E->VectorizedValue; }();

• ashahid marked an inline comment as not done. Jan 15 2018, 9:01 PM

• ashahid added inline comments.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2988–2994	I tried this IIFE, however I am getting an assertion "Tried to create extractelement operation on non-vector type!" for jumbled-load-multiuse.ll test. Do you see any issue in this code?

sanjoy added inline comments. Jan 15 2018, 10:54 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
2988–2994	Yes, I think I should have written: Value *Vec = [&]() { if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) if (isa<LoadInst>(SVI->getOperand(0))) return SVI->getOperand(0); return E->VectorizedValue; }();
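
The immediately-invoked lambda suggested in this thread can be tried outside LLVM. The sketch below is a hedged, standalone illustration: `ValueSketch`, `LoadSketch`, `ShuffleSketch`, and `lookThroughShuffle` are hypothetical stand-ins, and `dynamic_cast` replaces LLVM's `dyn_cast<>` and `isa<>`. It returns the shuffle's first operand when that operand is a load, as in the corrected form of the suggestion, and otherwise returns the value unchanged.

```cpp
#include <cassert>

// Hypothetical miniature class hierarchy standing in for llvm::Value,
// llvm::LoadInst and llvm::ShuffleVectorInst.
struct ValueSketch {
  virtual ~ValueSketch() = default;
};
struct LoadSketch : ValueSketch {};
struct ShuffleSketch : ValueSketch {
  ValueSketch *Operand0 = nullptr;
};

// Returns the load feeding a shuffle if there is one, else the value itself.
inline ValueSketch *lookThroughShuffle(ValueSketch *VectorizedValue) {
  // Immediately-invoked lambda: the result is computed once and initializes
  // a variable that is never reassigned afterwards.
  ValueSketch *const Vec = [&]() -> ValueSketch * {
    if (auto *Shuffle = dynamic_cast<ShuffleSketch *>(VectorizedValue))
      if (dynamic_cast<LoadSketch *>(Shuffle->Operand0))
        return Shuffle->Operand0;
    return VectorizedValue;
  }();
  return Vec;
}
```

Returning the load itself, rather than the load's own operand (its pointer), keeps the result a vector-typed value, which is consistent with the "non-vector type" assertion reported for the first version of the suggestion.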

Ayal added inline comments. Jan 15 2018, 11:49 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp

2988–2994

Yes, this simplifies the below "alternatively do something like:"

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updates test case and stylistic review comments

Herald added a subscriber: llvm-commits. Jan 16 2018, 8:51 AM

Ping!

Hi Ayal, Sanjoy,

The review of the last update has been pending for a long time. Of late, SLP has had lots of changes, so I will have to rebase; before I do, please check whether any more changes are required in its current form.

Thanks in advance.

RKSimon added a reviewer: ABataev. Feb 10 2018, 8:56 AM

In D36130#1004306, @ashahid wrote:

Hi Ayal, Sanjoy,

The review of the last update has been pending for a long time. Of late, SLP has had lots of changes, so I will have to rebase; before I do, please check whether any more changes are required in its current form.

Thanks in advance.

This looks good to me, as commented earlier, but please see that @sanjoy approves too, as this mostly addresses issues he raised.

I don't have any more coding style comments. I've not reviewed the actual semantic changes.

lib/Analysis/LoopAccessAnalysis.cpp
1097	Can you use `std::iota` here?

ABataev added inline comments. Feb 12 2018, 8:04 AM

lib/Analysis/LoopAccessAnalysis.cpp
1043	This function can be used for stores also, it is better to make it universal for stores/loads.
1087	It is better to use `stable_sort` rather than `sort`
1100	`stable_sort`
lib/Transforms/Vectorize/SLPVectorizer.cpp
1410	Is it possible at all that `VL` has less than 4 elements here?
1415	`i`->`I`, `e`->`E`. Variables must have Camel-like names.
1929–1935	You don't need so many shuffles; it is enough to have just one.

ABataev added inline comments. Feb 12 2018, 8:04 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
504–510	Why you can't have just one shuffle here for all external uses?
1426–1427	Bad decision. It is better to use the original `VL` here, rather than `Sorted`, and add an additional array of sorted indices. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.
2988–2994	I think you can have default capture by value here rather than by reference.

Hi Alexey,

As I was trying to rebase this patch, it seems that it overlaps with your "reverse load" patch. Could you take a look at this patch?

courbet added a subscriber: courbet. Feb 13 2018, 2:12 AM

Hi Alexey,

Thanks for looking into it. I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

lib/Analysis/LoopAccessAnalysis.cpp
1043	I plan to do such improvement in separate patches.
lib/Transforms/Vectorize/SLPVectorizer.cpp
504–510	This is for in-tree multiple uses of a single vector load where the uses have different masks/permutations. This section of comments, https://reviews.llvm.org/D36130#inline-326711, discussed it earlier. There is also a figure attached.
1410	I think yes, for example a couple of i64 loads when the minimum register width is 128 bits. However, this check was basically meant to indicate that a jumbled load of size 2 is essentially a reversed load.
1426–1427	In fact, the earlier design in patch (https://reviews.llvm.org/D26905) was to use the original VL; however, there was a counter-argument to that which I don't remember exactly.
1929–1935	This is basically for multiple in-tree uses with different masks/permutation.

In D36130#1006202, @ashahid wrote:

Hi Alexey,

Thanks for looking into it. I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

lib/Analysis/LoopAccessAnalysis.cpp
1043	I just suggest to make universal at the very beginning, that's it
lib/Transforms/Vectorize/SLPVectorizer.cpp
504–510	I still don't understand what the problem is here. You need to perform the loads in some order. You sort the loads into sequential order and perform the vector load starting from the lowest address. You reshuffle the loaded vector value into the original order. That's it, you have your loads in the required order. Just one shuffle is required. Why do you need more? Also, I don't understand why you need so many changes, why you need additional indices, etc.
1410	It is going to be handled by the reverse loads patch
1426–1427	It is better to use the original `VL` here, otherwise it will end with a lot of trouble and will require a whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes across the whole module to handle jumbled loads.
2730	Is this correct? `E->Scalars[0]` is exactly `VL0`
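
The "sort, load from the lowest address, reshuffle once" scheme described in this comment can be modeled on plain arrays. The following is a simplified sketch, not the vectorizer's code: `shuffle` and `loadJumbled` are hypothetical helpers, memory is a `std::vector<int>`, and the mask convention (result lane J takes loaded lane `Mask[J]`) is an assumption chosen to mirror a single-source `shufflevector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Models a shufflevector with a single source: lane J of the result takes
// lane Mask[J] of the input.
inline std::vector<int> shuffle(const std::vector<int> &Loaded,
                                const std::vector<unsigned> &Mask) {
  std::vector<int> Result(Mask.size());
  for (std::size_t J = 0; J < Mask.size(); ++J)
    Result[J] = Loaded[Mask[J]];
  return Result;
}

// Loads Mask.size() consecutive elements starting at the lowest address,
// then applies one shuffle to restore the order in which the scalars were
// used.
inline std::vector<int> loadJumbled(const std::vector<int> &Memory,
                                    std::size_t Base,
                                    const std::vector<unsigned> &Mask) {
  const auto First = Memory.begin() + static_cast<std::ptrdiff_t>(Base);
  std::vector<int> Loaded(First,
                          First + static_cast<std::ptrdiff_t>(Mask.size()));
  return shuffle(Loaded, Mask);
}
```

For scalar uses in the order A[1], A[2], A[0], A[3], the mask {1, 2, 0, 3} applied to one wide load of A[0..3] reproduces that order with a single shuffle.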

Updates review comments and a test case.

Harbormaster completed remote builds in B14963: Diff 134170. Feb 14 2018, 1:38 AM

Minor clean up.

Hi Alexey,

Thanks for looking into it. I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

In D36130#1006238, @ABataev wrote:

In D36130#1006202, @ashahid wrote:

Hi Alexey,

Thanks for looking into it. I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

No worries, it was a merge issue; it's fixed.

lib/Transforms/Vectorize/SLPVectorizer.cpp
504–510	Updated jumbled-load.ll captures this case where instead of gathering the second operand of MUL we can have required shuffle of the same loaded vector
1410	Yes, this check is no longer required.
1426–1427	In the context where we can have multiple users of a loaded vector with different shuffle masks, the design is to represent these different shuffle masks for each user, keyed by the user's operand number. Having a single set of sorted indices will not be sufficient for this. Given the objective of handling multiple out-of-order uses, I feel the changes are not that big.
2730	Ah, both are the same.

ABataev added inline comments. Feb 14 2018, 6:50 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1426–1427	Now I see what you want to do. But I don't think that this is the correct way to implement it. It complicates the whole vectorization process. I'd suggest creating different tree entries for each particular order of the loads and excluding loads from the check that the same instruction is used several times in different tree entries. If you are worried about several different loads of the same values, I think they will be optimized by the instruction combiner.

• ashahid added inline comments. Feb 16 2018, 9:46 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1426–1427	Of course this could have been a better solution, but I was not sure of the impact it may have by breaking the single tree entry assumption. One problem I see is the TreeEntry lookup if multiple nodes with the same scalar values are present. I can use an isSame() check to make sure the correct tree entry is found; however, it may become costly in the case of a PHI instruction fed by the same vector load.

ABataev added inline comments. Feb 16 2018, 10:29 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1426–1427	I think it is better to start with handling of single tree entry rather than trying to handle all possible situations in a single patch. I suggest to split this patch into 2 parts at least: 1. handling of tree entry with jumbled loads. 2. further improvements.
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
7–10 ↗	(On Diff #134178)	These checks are not autogenerated, fix it. Moreover, it is recommended to commit these tests separately with the checks for the original version of the compiler and the update checks with the fixed version to demonstrate improvements.

Updated the patch to accommodate the review comments.

Harbormaster completed remote builds in B15472: Diff 136070. Feb 27 2018, 6:29 AM

As suggested, the reordering mask is now part of each tree entry. Also, this update does not yet try to optimize the reordered load for multiple operands.

By the way, take a look at my D43776 that does the same but in more general way

lib/Transforms/Vectorize/SLPVectorizer.cpp
1409	Why you can do this only if `ReuseShuffleIndicies.empty()`?
1414–1419	It is enough just to compare `VL` and `Sorted`. If they are the same, the loads are not shuffled
1422	Why you can't do to add vectorized tree entry if `UserTreeIdx == -1`?
1425	Each `true` or `false` argument must be preceded by a comment with the name of the function parameter it corresponds to.
1929	You can remove the last argument here
2478	Why do you need this condition?
2723	Restore the original code here
2745	Remove this empty line
2988–2995	I rather doubt you need all that stuff. You can use original code
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
1 ↗	(On Diff #136070)	You need to add this test separately and show changes in it.
test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
1 ↗	(On Diff #136070)	You need to add this test separately and show changes in it
test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll
1 ↗	(On Diff #136070)	You need to add this test separately and show changes in it
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
64	You need to add this test separately and show changes in it

Will commit the tests as NFC.

Seems like I am not getting the mails from phabricator, what shall I do to get the mails?

Checked the patch D43776, seems it will make this patch redundant.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1409	This is to avoid overlapping with the UniqueValues reuse logic of your changes.
1414–1419	Sure it is, but this avoids the compare. So I thought having a boolean is preferable.
1422	My bad, this is not required.
1425	Ok
1929	Sure
2478	In the 2nd test of jumbled-load.ll, the two operands of MUL are fed from the same loaded vector. The 1st operand is a SHUFFLE of the LOAD and the 2nd operand is a gather of the same scalar loads. A query to getTreeEntry() will always return the node with the same vectorized value, and hence both operands of MUL will be fed the shuffled load. This check is to avoid this scenario.
2723	Thanks
2988–2995	This is required otherwise multiuse.ll test as well as PR32086.ll will fail because the lanes were recorded according to the order of scalar loads.

Updated further review comments.

Harbormaster completed remote builds in B15525: Diff 136311. Feb 28 2018, 9:30 AM

Hope this is fine.

ABataev added inline comments. Feb 28 2018, 9:49 AM

lib/Analysis/LoopAccessAnalysis.cpp
1043	What about this comment? Do you really need Sorted argument?
1056	`PointerType *` -> `auto *`
1060–1062	I think there must be an assertion instead of this check.
1072	`const SCEVConstant *` -> `const auto *`
1077–1079	It is better to move this check to SLPVectorizer.cpp, because the function can be used for masked load/store.
1092	`for (unsigned I = 0, E = VL.size(); I < E; ++I)`
1097	Actually `Mask` is a full copy of `UseOrder`, you don't need all that complex stuff here
lib/Transforms/Vectorize/SLPVectorizer.cpp
1409	Why you can't handle it? What's the problem?
1414–1419	Why do we need the compare?
2478	This scenario should not happen in your patch: the instruction is either vectorized or gathered, but not both.
2988–2995	Again, it just may not happen in this patch

ABataev added inline comments. Feb 28 2018, 11:07 AM

lib/Analysis/LoopAccessAnalysis.cpp
1097	Oops, no, `Mask` is not a copy of `UseOrder` But you can create it much simpler: for (unsigned I = 0, E = VL.size(); I < E; ++I) Mask[UseOrder[I]] = I;
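
The one-line mask construction suggested here is a permutation inverse. The standalone sketch below works under stated assumptions: `computeShuffleMask` is a hypothetical helper, offsets are given per access in use order, and `UseOrder[I]` is taken to be the use-order index of the I-th access in address order, obtained with `std::stable_sort` as recommended earlier in the review. The actual layout of `UseOrder` in the patch is not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Given the offset of each access in use order, returns the shuffle mask that
// maps the address-sorted vector back to use order.
inline std::vector<unsigned>
computeShuffleMask(const std::vector<std::int64_t> &Offsets) {
  const std::size_t N = Offsets.size();

  // UseOrder[I] is the use-order index of the I-th access in address order.
  std::vector<unsigned> UseOrder(N);
  std::iota(UseOrder.begin(), UseOrder.end(), 0u);
  std::stable_sort(UseOrder.begin(), UseOrder.end(),
                   [&Offsets](unsigned L, unsigned R) {
                     return Offsets[L] < Offsets[R];
                   });

  // Invert the permutation: lane UseOrder[I] of the result reads lane I of
  // the sorted vector.
  std::vector<unsigned> Mask(N);
  for (unsigned I = 0, E = static_cast<unsigned>(N); I < E; ++I)
    Mask[UseOrder[I]] = I;
  return Mask;
}
```

For offsets {5, 4, 7, 6} the sorted order is lanes 1, 0, 3, 2, and inverting that permutation gives the mask {1, 0, 3, 2}.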

sanjoy removed a reviewer: sanjoy. Feb 28 2018, 11:34 AM

sanjoy removed a subscriber: sanjoy.

• ashahid added inline comments. Feb 28 2018, 11:30 PM

lib/Analysis/LoopAccessAnalysis.cpp
1043	Yes, otherwise my test fails. Seems it breaks some assumption.
1097	Thanks
lib/Transforms/Vectorize/SLPVectorizer.cpp
1409	It was a thought; I have not checked yet. I will check.
1414–1419	I meant, if we don't use the ShuffledLoad flag we have to compare VL vs Sorted instead.
2478	This check is to avoid feeding the generated SHUFFLE to both operand of MUL which is not the intention of the test case.
2988–2995	It does happen and this test fails.

ABataev added inline comments. Mar 2 2018, 10:59 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1425	No, use original `VL` here, do not use `Sorted`. In this case you won't need an additional argument in `sortLoadAccesses` and you don't need all that complex stuff with the lambda on line 3528

• ashahid added inline comments. Mar 5 2018, 10:39 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1425	If I am not wrong, for LOADs, VL0 must be the 1st element of the buffer whose base address will be used for vector load. So using VL will break this assumption.

ABataev added inline comments. Mar 6 2018, 6:18 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1425	Why? And why you can't choose the right VL0 during vectorization?

• ashahid added inline comments. Mar 6 2018, 8:20 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1425	For example, if we have two arrays A[4] and B[1] lying one after another in memory, and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use, the generated vector load will load the elements A[1], A[2], A[3], B[0], which is not desired. Of course we can choose the right VL0 during vectorization, but we have to compute it again here using the mask, which can be avoided if we use the Sorted VL. Am I missing something?

ABataev added inline comments. Mar 6 2018, 8:42 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1425	You already store the mask in the tree entry and you can choose the right VL0 using this mask. Using Sorted VL complicates the whole vectorization process and, thus, adds some extra points for the incorrect vectorization. That's why I insist to use original VL and choose the correct VL0 during codegen.
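
Choosing the base load from a stored mask, as proposed in this comment, can be sketched as a simple search. This is an illustration under assumptions rather than the patch's code: `lowestAddressLane` is a hypothetical helper, and the mask is assumed to hold, for each lane in use order, its position in address order, so the lane whose mask value is 0 is the lowest-address load.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the use-order lane whose load has the lowest address, assuming
// Mask[J] is the address-order position of use-order lane J. Returns
// Mask.size() if no lane maps to position 0.
inline std::size_t lowestAddressLane(const std::vector<unsigned> &Mask) {
  auto It = std::find(Mask.begin(), Mask.end(), 0u);
  return static_cast<std::size_t>(It - Mask.begin());
}
```

For the use order A[1], A[2], A[0], A[3] with mask {1, 2, 0, 3}, the function returns lane 2, which is the load of A[0], so the wide load can start there while the tree entry keeps the original `VL`.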

• ashahid added inline comments. Mar 6 2018, 9:08 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
1425	Got it. Since you already have these improvements in this patch https://reviews.llvm.org/D43776 , I think it is better to get that through.

fhahn mentioned this in D37738: [SLPVectorizer] Generalize vectorizeStores to support loads as well NFC. Mar 22 2018, 10:50 AM

fhahn mentioned this in D37737: [SLPVectorizer] Merge subsequent gather loads..

@ashahid What's happening to this patch?

Closed by commit rGdbd30edb7ff8: [SLP] Vectorize jumbled memory loads. (authored by ashahid). Oct 7 2019, 5:02 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Oct 7 2019, 5:02 AM

Herald added a subscriber: hiraditya.

RKSimon reopened this revision. Oct 7 2019, 6:08 AM

This revision is now accepted and ready to land. Oct 7 2019, 6:08 AM

Revision Contents

Path	Size
include/llvm/Analysis/LoopAccessAnalysis.h	6 lines
lib/Analysis/LoopAccessAnalysis.cpp	66 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp	157 lines
test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll	24 lines
test/Transforms/SLPVectorizer/X86/jumbled-load.ll	37 lines
test/Transforms/SLPVectorizer/X86/store-jumbled.ll	25 lines

Diff 109054

include/llvm/Analysis/LoopAccessAnalysis.h

	/// If necessary this method will version the stride of the pointer according
	/// to \p PtrToStride and therefore add further predicates to \p PSE.
	/// The \p Assume parameter indicates if we are allowed to make additional
	/// run-time assumptions.
	int64_t getPtrStride(PredicatedScalarEvolution &PSE, Value *Ptr, const Loop *Lp,
	const ValueToValueMap &StridesMap = ValueToValueMap(),
	bool Assume = false, bool ShouldCheckWrap = true);

				// If \p Mask is not null, it also returns the \p Mask which is the shuffle
				Ayal: Document what the method does, including its boolean return value, before indicating what happens when \p Mask is not null. An example of VL coming in and Sorted plus Mask coming out would be useful.
				ashahid: Sure.
				// mask for actual memory access order.
				bool sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
				ScalarEvolution &SE, SmallVectorImpl<Value *> &Sorted,
				SmallVectorImpl<unsigned> *Mask = nullptr);

	/// \brief Returns true if the memory operations \p A and \p B are consecutive.
	/// This is a simple API that does not depend on the analysis pass.
	bool isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
	ScalarEvolution &SE, bool CheckType = true);

	/// \brief This analysis provides dependence information for the memory accesses
	/// of a loop.
	///

lib/Analysis/LoopAccessAnalysis.cpp

	Show First 20 Lines • Show All 1,032 Lines • ▼ Show 20 Lines
	static unsigned getAddressSpaceOperand(Value *I) {			static unsigned getAddressSpaceOperand(Value *I) {
	if (LoadInst *L = dyn_cast<LoadInst>(I))			if (LoadInst *L = dyn_cast<LoadInst>(I))
	return L->getPointerAddressSpace();			return L->getPointerAddressSpace();
	if (StoreInst *S = dyn_cast<StoreInst>(I))			if (StoreInst *S = dyn_cast<StoreInst>(I))
	return S->getPointerAddressSpace();			return S->getPointerAddressSpace();
	return -1;			return -1;
	}			}

				bool llvm::sortMemAccesses(ArrayRef<Value *> VL, const DataLayout &DL,
				ScalarEvolution &SE,
				SmallVectorImpl<Value *> &Sorted,
				ABataev: This function can be used for stores also, it is better to make it universal for stores/loads.
				ashahid: I plan to do such improvement in separate patches.
				ABataev: I just suggest to make it universal at the very beginning, that's it.
				ABataev: What about this comment? Do you really need the Sorted argument?
				ashahid: Yes, otherwise my test fails. It seems to break some assumption.
				SmallVectorImpl<unsigned> *Mask) {
				sanjoy: The indent looks off here; can you please run clang-format?
				SmallVector<std::pair<int64_t, Value *>, 4> OffValPairs;
				OffValPairs.reserve(VL.size());
				Sorted.reserve(VL.size());

				// Walk over the pointers, and map each of them to an offset relative to
				// first pointer in the array.
				Value *Ptr0 = getPointerOperand(VL[0]);
				const SCEV *Scev0 = SE.getSCEV(Ptr0);
				Value *Obj0 = GetUnderlyingObject(Ptr0, DL);

				for (auto *Val : VL) {
				// The only kind of access we care about here is load.
				Ayal: More accurate name for the method may be sortLoadAccesses()?
				ashahid: Thought to keep it generic but as of now sortLoadAccesses() seems more appropriate.
				sanjoy (Done): This should be a `cast<>`.
				ABataev: `PointerType *` -> `auto *`
				if (!isa<LoadInst>(Val))
				return false;

				Value *Ptr = getPointerOperand(Val);
				assert(Ptr && "Expected value to have a pointer operand.");
				// If a pointer refers to a different underlying object, bail - the
				ABataev: I think there must be an assertion instead of this check.
				// pointers are by definition incomparable.
				Value *CurrObj = GetUnderlyingObject(Ptr, DL);
				Ayal: LoopVectorizer's analogous "analyzeInterleaving()" also guards this getMinusSCEV() by: `// Ignore A if the memory object of A and B don't belong to the same address space` `if (getMemInstAddressSpace(A) != getMemInstAddressSpace(B)) continue;`
				ashahid: Thinking of using this API for now.
				if (CurrObj != Obj0)
				return false;

				const SCEVConstant *Diff =
				dyn_cast<SCEVConstant>(SE.getMinusSCEV(SE.getSCEV(Ptr), Scev0));
				// The pointers may not have a constant offset from each other, or SCEV
				// may just not be smart enough to figure out they do. Regardless,
				// there's nothing we can do.
				ABataev: `const SCEVConstant *` -> `const auto *`
				if (!Diff)
				return false;

				OffValPairs.emplace_back(Diff->getAPInt().getSExtValue(), Val);
				Ayal: We can bail out here if |Diff| >= VL, right?
				ashahid: Sure.
				}
				SmallVector<unsigned, 4> UseOrder(VL.size());
				for (unsigned i = 0; i < VL.size(); i++) {
				ABataev: It is better to move this check to SLPVectorizer.cpp, because the function can be used for masked load/store.
				UseOrder[i] = i;
				}

				// Sort the memory accesses and keep the order of their uses in UseOrder.
				std::sort(UseOrder.begin(), UseOrder.end(),
				sanjoy (Done): LLVM style is to avoid using curly braces on single-line for loops. Using `std::iota` would be even better.
				[&OffValPairs](unsigned Left, unsigned Right) {
				return OffValPairs[Left].first < OffValPairs[Right].first;
				});
				ABataev: It is better to use `stable_sort` rather than `sort`.

				for (unsigned i = 0; i < VL.size(); i++)
				Sorted.emplace_back(OffValPairs[UseOrder[i]].second);

				// Sort UseOrder to compute the Mask.
				ABataev: `for (unsigned I = 0, E = VL.size(); I < E; ++I)`
				if (Mask) {
				Mask->reserve(VL.size());
				for (unsigned i = 0; i < VL.size(); i++)
				Mask->emplace_back(i);
				std::sort(Mask->begin(), Mask->end(),
				sanjoy: Can you use `std::iota` here?
				ABataev: Actually `Mask` is a full copy of `UseOrder`, you don't need all that complex stuff here.
				ABataev: Oops, no, `Mask` is not a copy of `UseOrder`. But you can create it much simpler: `for (unsigned I = 0, E = VL.size(); I < E; ++I) Mask[UseOrder[I]] = I;`
				ashahid: Thanks
				[&UseOrder](unsigned Left, unsigned Right) {
				return UseOrder[Left] < UseOrder[Right];
				});
				ABataev: `stable_sort`
				}

				return true;
				}
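
The review suggestions on this function (`std::iota`, `stable_sort`, and building the mask as `Mask[UseOrder[I]] = I`) can be sketched together as follows. This is only an illustration under simplifying assumptions: it works on plain byte offsets rather than `Value *` and SCEV differences, and `SortedAccesses` and `sortOffsets` are hypothetical names, not part of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the outputs of sortMemAccesses().
struct SortedAccesses {
  // UseOrder[K] is the original index of the access with the K-th lowest offset.
  std::vector<unsigned> UseOrder;
  // Mask[J] is the position that original access J takes in the sorted order.
  std::vector<unsigned> Mask;
};

// Assumption: Offsets[I] is the constant byte offset of access I relative to
// the first pointer, which the patch derives from SCEV.
SortedAccesses sortOffsets(const std::vector<int64_t> &Offsets) {
  SortedAccesses R;
  R.UseOrder.resize(Offsets.size());
  // Fill with 0, 1, 2, ... instead of a hand-written loop.
  std::iota(R.UseOrder.begin(), R.UseOrder.end(), 0u);
  // Stable sort keeps the original relative order of equal offsets.
  std::stable_sort(R.UseOrder.begin(), R.UseOrder.end(),
                   [&Offsets](unsigned Left, unsigned Right) {
                     return Offsets[Left] < Offsets[Right];
                   });
  // The mask is the inverse permutation of UseOrder; no second sort is needed.
  R.Mask.resize(Offsets.size());
  for (unsigned I = 0, E = R.UseOrder.size(); I < E; ++I) {
    assert(R.UseOrder[I] < R.Mask.size() && "UseOrder must be a permutation");
    R.Mask[R.UseOrder[I]] = I;
  }
  return R;
}
```

For offsets `{8, 0, 12, 4}`, the sorted order visits original indices 1, 3, 0, 2, so the inverse permutation places original element 0 at sorted position 2, element 1 at position 0, and so on.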


	/// Returns true if the memory operations \p A and \p B are consecutive.			/// Returns true if the memory operations \p A and \p B are consecutive.
	bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,			bool llvm::isConsecutiveAccess(Value *A, Value *B, const DataLayout &DL,
	ScalarEvolution &SE, bool CheckType) {			ScalarEvolution &SE, bool CheckType) {
	Value *PtrA = getPointerOperand(A);			Value *PtrA = getPointerOperand(A);
	Value *PtrB = getPointerOperand(B);			Value *PtrB = getPointerOperand(B);
	unsigned ASA = getAddressSpaceOperand(A);			unsigned ASA = getAddressSpaceOperand(A);
	unsigned ASB = getAddressSpaceOperand(B);			unsigned ASB = getAddressSpaceOperand(B);


lib/Transforms/Vectorize/SLPVectorizer.cpp

private:

/// Checks if all users of \p I are the part of the vectorization tree.		/// Checks if all users of \p I are the part of the vectorization tree.
bool areAllUsersVectorized(Instruction *I) const;		bool areAllUsersVectorized(Instruction *I) const;

/// \returns the cost of the vectorizable entry.		/// \returns the cost of the vectorizable entry.
int getEntryCost(TreeEntry *E);		int getEntryCost(TreeEntry *E);

/// This is the recursive part of buildTree.		/// This is the recursive part of buildTree.
void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int);		void buildTree_rec(ArrayRef<Value *> Roots, unsigned Depth, int, int OpdNum = 0);
		Ayal: While you're at it, please add a variable name for that other int argument as well (UserTreeIdx), for completeness.
		ashahid: Sure

/// \returns True if the ExtractElement/ExtractValue instructions in VL can		/// \returns True if the ExtractElement/ExtractValue instructions in VL can
/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).		/// be vectorized to use the original vector (or aggregate "bitcast" to a vector).
bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;		bool canReuseExtract(ArrayRef<Value *> VL, Value *OpValue) const;

/// Vectorize a single entry in the tree.		/// Vectorize a single entry in the tree.
Value *vectorizeTree(TreeEntry *E);		Value *vectorizeTree(TreeEntry *E, int OpdNum = 0, int UserIndx = -1);
		Ayal: Please document additional parameters.
		ashahid: Sure

/// Vectorize a single entry in the tree, starting in \p VL.		/// Vectorize a single entry in the tree, starting in \p VL.
		Ayal: In other words, loosely speaking, `E == TreeEntry[UserIndx].getOperand(OpdNum)`, right?
		ashahid: Yes, that's right.
		Ayal: Could be used to help clarify the explanation.
Value *vectorizeTree(ArrayRef<Value *> VL);		Value *vectorizeTree(ArrayRef<Value *> VL, int OpdNum = 0, int UserIndx = -1);
		Ayal: ditto.

/// \returns the pointer to the vectorized value if \p VL is already		/// \returns the pointer to the vectorized value if \p VL is already
/// vectorized, or NULL. They may happen in cycles.		/// vectorized, or NULL. They may happen in cycles.
Value *alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const;		Value *alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const;

/// \returns the scalarization cost for this type. Scalarization in this		/// \returns the scalarization cost for this type. Scalarization in this
/// context means the creation of vectors from a group of scalars.		/// context means the creation of vectors from a group of scalars.
int getGatherCost(Type *Ty);		int getGatherCost(Type *Ty);
private:
/// \reorder commutative operands to get better probability of		/// \reorder commutative operands to get better probability of
/// generating vectorized code.		/// generating vectorized code.
void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,		void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
SmallVectorImpl<Value *> &Left,		SmallVectorImpl<Value *> &Left,
SmallVectorImpl<Value *> &Right);		SmallVectorImpl<Value *> &Right);
struct TreeEntry {		struct TreeEntry {
TreeEntry(std::vector<TreeEntry> &Container)		TreeEntry(std::vector<TreeEntry> &Container)
: Scalars(), VectorizedValue(nullptr), NeedToGather(0),		: Scalars(), VectorizedValue(nullptr), NeedToGather(0),
Container(Container) {}		ShuffleMask(), Container(Container) {}

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
assert(VL.size() == Scalars.size() && "Invalid size");		assert(VL.size() == Scalars.size() && "Invalid size");
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
}		}

		/// \returns true if the scalars in VL are found in this tree entry.
		bool isFoundJumbled(ArrayRef<Value *> VL, const DataLayout &DL,
		ScalarEvolution &SE) const {
		assert(VL.size() == Scalars.size() && "Invalid size");
		SmallVector<Value *, 8> List;
		if (!sortMemAccesses(VL, DL, SE, List))
		return false;
		return std::equal(List.begin(), List.end(), Scalars.begin());
		}

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

		sanjoy: Optional: you can write `return X;` instead of `if (X) return true; return false;`.
/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue;		Value *VectorizedValue;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
bool NeedToGather;		bool NeedToGather;

		/// Records optional suffle mask for the uses of jumbled memory accesses.
		Ayal: s[h]uffle. An example would help clarify that, say, a non-empty ShuffleMask[1] represents the permutation of lanes that operand #1 should undergo before feeding this vectorized instruction, whereas an empty ShuffleMask[0] indicates that the lanes of operand #0 need not be permuted at all.
		ashahid: Sure
		std::vector<SmallVector<unsigned, 4>> ShuffleMask;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
		Ayal: The fixed array `SmallVector<unsigned, 4> ShuffleMask[3];` of the previous version indeed cannot account for all operands. How about holding a `SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask;` instead of holding a map from 0,1,2,..,numOperands?
		ashahid: I think this can be done. I will try.
		ABataev: Why can't you have just one shuffle here for all external uses?
		ashahid: This is for in-tree multi uses of a single vector load where the uses have different masks/permutations. This section of comments https://reviews.llvm.org/D36130#inline-326711 discussed it earlier. There is also a figure attached.
		ABataev: I still don't understand what's the problem here. 1. You need to perform the loads in some order. 2. You sort the loads to be in the sequentially direct order and perform the vector load starting from the lowest address. 3. You reshuffle the loaded vector value to the original order. That's it, you have your loads in the required order. Just one shuffle is required. Why do you need some more? Also, I don't understand why you need so many changes, why you need additional indices etc.
		ashahid: Updated jumbled-load.ll captures this case where, instead of gathering the second operand of MUL, we can have the required shuffle of the same loaded vector.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
std::vector<TreeEntry> &Container;		std::vector<TreeEntry> &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<int, 1> UserTreeIndices;		SmallVector<int, 1> UserTreeIndices;
};		};
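
Ayal's description of the per-operand `ShuffleMask` (a non-empty mask permutes the lanes of that operand, an empty one leaves them alone) can be illustrated with a small sketch. It assumes the usual shufflevector convention that result lane `I` takes source lane `Mask[I]`; `applyOperandMask` is a hypothetical helper on plain integers, not code from the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: permute the lanes of one operand according to Mask.
// An empty Mask means the operand is used as loaded, with no permutation.
std::vector<int> applyOperandMask(const std::vector<int> &Lanes,
                                  const std::vector<unsigned> &Mask) {
  if (Mask.empty())
    return Lanes;
  assert(Mask.size() == Lanes.size() && "Mask must cover every lane");
  std::vector<int> Result;
  Result.reserve(Mask.size());
  for (unsigned Lane : Mask) {
    assert(Lane < Lanes.size() && "Mask index out of range");
    // Result lane I is taken from source lane Mask[I].
    Result.push_back(Lanes[Lane]);
  }
  return Result;
}
```

With a sorted load of `{10, 20, 30, 40}` and mask `{2, 0, 3, 1}`, the user sees `{30, 10, 40, 20}`, while a second user of the same load with an empty mask sees the lanes unchanged. That is the multi-use situation ashahid refers to above.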

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized,		TreeEntry *newTreeEntry(ArrayRef<Value *> VL, bool Vectorized,
int &UserTreeIdx) {		int &UserTreeIdx,
		ArrayRef<unsigned> ShuffleMask = None,
		int OpdNum = 0) {

VectorizableTree.emplace_back(VectorizableTree);		VectorizableTree.emplace_back(VectorizableTree);
		TreeEntry *UserEntry = &VectorizableTree[UserTreeIdx];

		TreeEntry *Last = NULL;
int idx = VectorizableTree.size() - 1;		int idx = VectorizableTree.size() - 1;
TreeEntry *Last = &VectorizableTree[idx];		Last = &VectorizableTree[idx];
		Ayal: Why change the original "TreeEntry *Last =" here?
		ashahid: Nothing specific, will make it as earlier.
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->NeedToGather = !Vectorized;		Last->NeedToGather = !Vectorized;
		if (!ShuffleMask.empty()) {
		UserEntry->ShuffleMask.emplace_back(ShuffleMask.begin(),
		ShuffleMask.end());
		Ayal: Are both conditions really needed, or suffice say to check for -1 and assert positive indices are not too large?
		ashahid: Sure I will check. I am thinking 30000 as large indices threshold, do you have any number in mind?
		ashahid: I tried but seems both conditions are needed as I am getting assertion "Idx < size()" for SmallVector<<SmallVector, 4> 2> ShuffleMask.
		Ayal (Done): UserTreeIdx is the index of the User entry as we build the tree bottom-up, so it should always be between 0 and VectorizableTree.size()-1, except for -1 when creating the new entry for the root, which is User-less. So it should suffice to check if Idx is -1, and otherwise assert that Idx < size(), if desired, right?
		}
if (Vectorized) {		if (Vectorized) {
		Ayal: Should ShuffleMask be inserted into UserEntry's ShuffleMask in position OpdNum? (Possibly asserting no other mask is already there?) Otherwise, where is OpdNum used?
		ashahid: Good catch! The intention is exactly that and the order of building tree ensures that. Do you want it to be explicit here? Any way it does need assertion for the emptiness of the mask before insertion.
		Ayal: Either be explicit, or assert that `emplace_back` inserts at position `OpdNum`, based on the assumption that the order of building tree ensures that which should be documented (e.g., in the form of the assert message).
		Ayal: So if the first operand does not need a shuffle but the second one does, will ShuffleMask.emplace_back() place the shuffle in the right position, namely that of OpdNum=1 rather than OpdNum=0?
		ashahid: You are right, in this case this assumption will break. So OpdNum needs to be explicitly used while inserting the shuffle mask.
		Ayal (Done): Code below still uses emplace_back contrary to the discussion above. May need to call UserTreeEntry->ShuffleMask.resize() if OpdNum is larger than its initial/current size, before setting UserTreeEntry->ShuffleMask[OpdNum] = tempMask. (Otherwise the original "LNT Multisource bench mark" asserts should trigger again?) Suggest to add a test where the first operand does not need a shuffle but the second one does.
for (int i = 0, e = VL.size(); i != e; ++i) {		for (int i = 0, e = VL.size(); i != e; ++i) {
assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");		assert(!ScalarToTreeEntry.count(VL[i]) && "Scalar already in tree!");
ScalarToTreeEntry[VL[i]] = idx;		ScalarToTreeEntry[VL[i]] = idx;
}		}
} else {		} else {
		Ayal (Done): alrea[d]y
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
		sanjoy (Done): I think you should be able to do: `auto &OperandMask = UserTreeEntry->ShuffleMask[OpdNum]; assert(OperandMask.empty()); OperandMask.insert(OperandMask.end(), ShuffleMask.begin(), ShuffleMask.end());`
		Ayal (Done): While we're at it, this should move under the `if (UserTreeIdx != -1)` to avoid checking if `&VectorizableTree[UserTreeIdx]` is null, as commented in https://reviews.llvm.org/D41324#inline-361435
		ashahid: If we check for if (UserTreeIdx != -1 && ShuffledLoad) before the call of newTreeEntry(), we can avoid "UserTreeIdx != -1" check completely inside newTreeEntry().
}		}

if (UserTreeIdx >= 0)		if (UserTreeIdx >= 0)
Last->UserTreeIndices.push_back(UserTreeIdx);		Last->UserTreeIndices.push_back(UserTreeIdx);
UserTreeIdx = idx;		UserTreeIdx = idx;
return Last;		return Last;
}		}
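
The discussion above about storing the mask at position `OpdNum` rather than appending it with `emplace_back` can be sketched as follows. This is a minimal illustration combining Ayal's `resize()` suggestion with sanjoy's emptiness assert; `UserEntrySketch` and `setOperandMask` are hypothetical names, and standard containers stand in for `SmallVector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using LaneMask = std::vector<unsigned>;

// Hypothetical stand-in for the ShuffleMask member of a user TreeEntry.
struct UserEntrySketch {
  std::vector<LaneMask> ShuffleMask; // indexed by operand number
};

// Record Mask for operand OpdNum of User, growing the table when needed.
void setOperandMask(UserEntrySketch &User, std::size_t OpdNum,
                    const LaneMask &Mask) {
  // Earlier operands that need no shuffle keep an empty mask in their slot.
  if (User.ShuffleMask.size() <= OpdNum)
    User.ShuffleMask.resize(OpdNum + 1);
  LaneMask &OperandMask = User.ShuffleMask[OpdNum];
  assert(OperandMask.empty() && "Mask already recorded for this operand");
  OperandMask.assign(Mask.begin(), Mask.end());
}
```

This covers the case Ayal raises: if operand 0 needs no shuffle and operand 1 does, the mask lands in slot 1 and slot 0 stays empty, whereas a plain `emplace_back` would have put it in slot 0.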

	template <> struct GraphTraits<BoUpSLP *> {
static unsigned size(BoUpSLP *R) { return R->VectorizableTree.size(); }		static unsigned size(BoUpSLP *R) { return R->VectorizableTree.size(); }
};		};

template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {		template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {
typedef BoUpSLP::TreeEntry TreeEntry;		typedef BoUpSLP::TreeEntry TreeEntry;

DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}		DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}

std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {		std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {
		Ayal: Would be good to include (non-empty) ShuffleMasks when dumping the tree, for debugging?
		ashahid: Sure.
std::string Str;		std::string Str;
raw_string_ostream OS(Str);		raw_string_ostream OS(Str);
if (isSplat(Entry->Scalars)) {		if (isSplat(Entry->Scalars)) {
OS << "<splat> " << *Entry->Scalars[0];		OS << "<splat> " << *Entry->Scalars[0];
return Str;		return Str;
}		}
for (auto V : Entry->Scalars) {		for (auto V : Entry->Scalars) {
OS << *V;		OS << *V;
	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Lane << " from " << *Scalar << ".\n");		Lane << " from " << *Scalar << ".\n");
ExternalUses.push_back(ExternalUser(Scalar, U, Lane));		ExternalUses.push_back(ExternalUser(Scalar, U, Lane));
}		}
}		}
}		}
}		}

void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
int UserTreeIdx) {		int UserTreeIdx, int OpdNum) {
bool isAltShuffle = false;		bool isAltShuffle = false;
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx);
return;		return;
}		}
	case Instruction::PHI: {

for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(		Operands.push_back(cast<PHINode>(j)->getIncomingValueForBlock(
PH->getIncomingBlock(i)));		PH->getIncomingBlock(i)));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
bool Reuse = canReuseExtract(VL, VL0);		bool Reuse = canReuseExtract(VL, VL0);
if (Reuse) {		if (Reuse) {
DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");		DEBUG(dbgs() << "SLP: Reusing extract sequence.\n");
	case Instruction::Load: {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx);
DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;		return;
}		}
}		}

// Check if the loads are consecutive, reversed, or neither.		// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check		// TODO: What we really want is to sort the loads, but for now, check
		Ayal: Remove this TODO :-)
// the two likely directions.		// the two likely directions.
bool Consecutive = true;		bool Consecutive = true;
bool ReverseConsecutive = true;		bool ReverseConsecutive = true;
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i) {
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
Consecutive = false;		Consecutive = false;
break;		break;
} else {		} else {
	case Instruction::Load: {
// If none of the load pairs were consecutive when checked in order,		// If none of the load pairs were consecutive when checked in order,
// check the reverse order.		// check the reverse order.
if (ReverseConsecutive)		if (ReverseConsecutive)
for (unsigned i = VL.size() - 1; i > 0; --i)		for (unsigned i = VL.size() - 1; i > 0; --i)
if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i - 1], DL, SE)) {
ReverseConsecutive = false;		ReverseConsecutive = false;
break;		break;
}		}

		Ayal: Consider checking `if (ReverseConsecutive)` here and exit early.
		ashahid: Sure, will consider.
		if (VL.size() > 2 && !ReverseConsecutive) {
		bool ShuffledLoads = true;
		SmallVector<Value *, 8> Sorted;
		SmallVector<unsigned, 4> Mask;
		if (sortMemAccesses(VL, DL, SE, Sorted, &Mask)) {
		auto NewVL = makeArrayRef(Sorted.begin(), Sorted.end());
		for (unsigned i = 0, e = NewVL.size() - 1; i < e; ++i) {
		if (!isConsecutiveAccess(NewVL[i], NewVL[i + 1], DL, SE)) {
		ShuffledLoads = false;
		break;
		}
		}
		if (ShuffledLoads) {
		newTreeEntry(NewVL, true, UserTreeIdx,
		makeArrayRef(Mask.begin(), Mask.end()), OpdNum);
		Ayal: Worthy of a `DEBUG(dbgs() << "...")` message here.
		ashahid: Sure
		return;
		}
		}
		}
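
The decision made by this part of `buildTree_rec` (consecutive, reversed, jumbled, or gather) can be summarized in a small sketch. It is an approximation under stated assumptions: constant byte offsets replace the `isConsecutiveAccess()` and `sortMemAccesses()` queries, all loads are taken to have the same element size, and `LoadOrder`, `isStride` and `classifyLoads` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class LoadOrder { Consecutive, Reversed, Jumbled, Gather };

// True if every adjacent pair of offsets differs by exactly Step bytes.
static bool isStride(const std::vector<int64_t> &Off, int64_t Step) {
  for (std::size_t I = 0; I + 1 < Off.size(); ++I)
    if (Off[I + 1] - Off[I] != Step)
      return false;
  return true;
}

// Assumption: Offsets holds the byte offset of each load from a common base,
// in use order, and ElemSize is the store size of the loaded type.
LoadOrder classifyLoads(const std::vector<int64_t> &Offsets, int64_t ElemSize) {
  assert(ElemSize > 0 && "Element size must be positive");
  if (isStride(Offsets, ElemSize))
    return LoadOrder::Consecutive;
  if (isStride(Offsets, -ElemSize))
    return LoadOrder::Reversed;
  // Jumbled: not in order as written, but consecutive once sorted by address.
  // As in the patch, only bundles of more than two loads are considered.
  std::vector<int64_t> Sorted(Offsets);
  std::sort(Sorted.begin(), Sorted.end());
  if (Offsets.size() > 2 && isStride(Sorted, ElemSize))
    return LoadOrder::Jumbled;
  return LoadOrder::Gather;
}
```

In the patch, only the jumbled case creates a vectorized tree entry with a shuffle mask; reversed loads are still gathered and counted in `NumLoadsWantToChangeOrder`.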

BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx);

if (ReverseConsecutive) {		if (ReverseConsecutive) {
++NumLoadsWantToChangeOrder;		++NumLoadsWantToChangeOrder;
DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");		DEBUG(dbgs() << "SLP: Gathering reversed loads.\n");
} else {		} else {
DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
		Ayal: It would have been good to also record how many loads want to have an arbitrary shuffled order, and shuffle according to the majority; but it's admittedly harder than recording how many want the reversed order. Maybe worth a comment.
		ashahid: Could not get "shuffle according to the majority", would you please elaborate.
		Ayal: `NumLoadsWantToChangeOrder` is used to decide if the entire tree `shouldReorder()`, based on how many want to keep the order vs. how many want to change=reverse it (majority). My comment was that this would ideally extend to pick the most frequent order from among more possible orders than {original, reverse}. `AllowReorder` however restricts reordering the 2 element vectors only, where only these two orders exist. This relates to the existing `// TODO: check if we can allow reordering for more cases.`
}		}
return;		return;
		Ayal: `ReverseConsecutive` is a special case of `ShuffledLoads`; so should the above treatment of a reverse load be the same as that of a shuffled load below? I.e., generate a true TreeEntry here with a reverse mask, and avoid cancel scheduling?
		ashahid: Ah, yes, you are correct. In fact I initially gave it a try but faced some issue I am unable to recall now. I will try again and see what the problem is; it may have something to do with rebuilding the tree with reversed scalar inputs.
		ashahid: I tried to incorporate it; however, there were regressions as I mentioned earlier. I think it is better if we take it in a separate patch.
		ABataev: Why can you do this only if `ReuseShuffleIndicies.empty()`?
		ashahid: This is to avoid overlapping with the UniqueValues reuse logic of your changes.
		ABataev: Why can't you handle it? What's the problem?
		ashahid: It was a thought; I have not checked yet. I will check.
}		}
		ABataev: Is it possible at all that `VL` has fewer than 4 elements here?
		ashahid: I think yes, for example a couple of i64 loads, considering a minimum register width of 128 bits. However, this check was basically meant to indicate that a jumbled load of size 2 is essentially a reversed load.
		ABataev: It is going to be handled by the reverse loads patch.
		ashahid: Yes, this check is no longer required.
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
		ABataev: `i`->`I`, `e`->`E`. Variables must have camel-case names.
case Instruction::PtrToInt:		case Instruction::PtrToInt:
case Instruction::IntToPtr:		case Instruction::IntToPtr:
case Instruction::SIToFP:		case Instruction::SIToFP:
		sanjoy (Done): Not sure why you need `NewVL` here -- doesn't just using `Sorted` work?
case Instruction::UIToFP:		case Instruction::UIToFP:
		ABataev: It is enough just to compare `VL` and `Sorted`. If they are the same, the loads are not shuffled.
		ashahid: Sure it is, but this avoids the compare, so I thought having a boolean is preferable.
		ABataev: Why do we need the compare?
		ashahid: I meant that if we don't use the ShuffledLoad flag, we have to compare VL vs. Sorted instead.
case Instruction::Trunc:		case Instruction::Trunc:
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
		sanjoy: Nit: s/usefull/useful/
		ABataev: Why can't you add a vectorized tree entry if `UserTreeIdx == -1`?
		ashahid (Done): My bad, this is not required.
Type *SrcTy = VL0->getOperand(0)->getType();		Type *SrcTy = VL0->getOperand(0)->getType();
for (unsigned i = 0; i < VL.size(); ++i) {		for (unsigned i = 0; i < VL.size(); ++i) {
Type *Ty = cast<Instruction>(VL[i])->getOperand(0)->getType();		Type *Ty = cast<Instruction>(VL[i])->getOperand(0)->getType();
		ABataev: Each `true` or `false` argument must be preceded by a comment naming the function parameter that the argument corresponds to.
		ashahid (Done): Ok
		ABataev: No, use the original `VL` here, do not use `Sorted`. In this case you won't need an additional argument in `sortLoadAccesses` and you don't need all that complex stuff with the lambda on line 3528.
		ashahid: If I am not wrong, for loads, VL0 must be the first element of the buffer, whose base address will be used for the vector load. So using VL will break this assumption.
		ABataev: Why? And why can't you choose the right VL0 during vectorization?
		ashahid: For example, if we have two arrays A[4] and B[1] laid out one after another in memory, and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use, the generated vector load will load the elements A[1], A[2], A[3], B[1], which is not desired. Of course we can choose the right VL0 during vectorization, but we would have to compute it again here using the mask, which can be avoided if we use the Sorted VL. Am I missing something?
		ABataev: You already store the mask in the tree entry and you can choose the right VL0 using this mask. Using the Sorted VL complicates the whole vectorization process and thus adds extra opportunities for incorrect vectorization. That's why I insist on using the original VL and choosing the correct VL0 during codegen.
		ashahid: Got it. Since you already have these improvements in this patch, https://reviews.llvm.org/D43776, I think it is better to get that through.
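To make the A[1], A[2], A[0], A[3] example in this thread concrete, here is a minimal standalone sketch of deriving both the shuffle mask and the base lane from the original (use-order) list, without storing a sorted copy. `JumbledInfo` and `analyzeJumbled` are hypothetical names, the loads are modelled as plain element offsets from a common base pointer, and the sketch does not check that the offsets are actually consecutive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type; not part of the patch.
struct JumbledInfo {
  std::vector<unsigned> Mask; // Mask[i] = lane of the memory-order vector used at position i.
  std::size_t BaseIdx;        // Index, in use order, of the lowest-address load.
  bool IsShuffled;            // False when the use order already matches memory order.
};

// Offsets are element offsets from a common base pointer, listed in order of use.
JumbledInfo analyzeJumbled(const std::vector<int> &Offsets) {
  assert(!Offsets.empty() && "expected at least one load");
  const std::size_t N = Offsets.size();
  // SortedIdx[K] = index (in use order) of the load at memory position K.
  std::vector<std::size_t> SortedIdx(N);
  std::iota(SortedIdx.begin(), SortedIdx.end(), std::size_t{0});
  std::stable_sort(SortedIdx.begin(), SortedIdx.end(),
                   [&](std::size_t A, std::size_t B) {
                     return Offsets[A] < Offsets[B];
                   });
  JumbledInfo Info;
  Info.Mask.resize(N);
  for (std::size_t K = 0; K < N; ++K)
    Info.Mask[SortedIdx[K]] = static_cast<unsigned>(K);
  Info.BaseIdx = SortedIdx[0];
  Info.IsShuffled = !std::is_sorted(Offsets.begin(), Offsets.end());
  return Info;
}
```

For offsets {1, 2, 0, 3} the base lane is the third load in use order (the one reading A[0]), so a vector load from that address stays within A, and the mask restores the use order afterwards.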
if (Ty != SrcTy || !isValidElementType(Ty)) {		if (Ty != SrcTy || !isValidElementType(Ty)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
		Ayal (Done): Should probably also check here that UserTreeIdx is not -1, to avoid creating a mask for the root with no place to hang it, as @sanjoy observed.
		ashahid: Yes, I had planned to do exactly this.
		ABataev: Bad decision. It is better to use the original `VL` here rather than `Sorted`, and add an additional array of sorted indices. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.
		ashahid: In fact, the earlier design in patch https://reviews.llvm.org/D26905 was to use the original VL; however, there was a counterargument to that which I don't remember exactly.
		ABataev: It is better to use the original `VL` here; otherwise it will end in a lot of trouble and will require a whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes across the whole module to handle jumbled loads.
		ashahid: In the context where we can have multiple users of a loaded vector with different shuffle masks, the design is to represent a different shuffle mask for each user, keyed by the user's operand number. Having a single set of sorted indices will not be sufficient for this. Given the objective of handling multiple out-of-order uses, I feel the changes are not that big.
		ABataev: Now I see what you want to do. But I don't think that this is the correct way to implement it. It complicates the whole vectorization process. I'd suggest creating different tree entries for each particular order of the loads and excluding loads from the check that the same instruction is used several times in different tree entries. If you worry about several different loads of the same values, I think they will be optimized by the instruction combiner.
		ashahid: Of course this could have been a better solution, but I was not sure of the impact it may have by breaking the single tree entry assumption. One problem I see is the TreeEntry lookup if multiple nodes with the same scalar values are present. I can use an isSame() check to make sure the correct tree entry is found; however, it may become costly in the case of a PHI instruction fed by the same vector load.
		ABataev: I think it is better to start with handling of a single tree entry rather than trying to handle all possible situations in a single patch. I suggest splitting this patch into at least 2 parts: 1. handling of a tree entry with jumbled loads; 2. further improvements.
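The design the author describes in this thread, one shuffle mask per user operand number applied to a single loaded vector, can be sketched in isolation as follows. This is an illustration only: `UserEntrySketch`, `applyMask`, and `operandValue` are hypothetical names, and vectors of `int` stand in for IR vector values.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a user tree entry: at most one mask per operand number.
struct UserEntrySketch {
  std::map<int, std::vector<unsigned>> ShuffleMask;
};

// Applies a single-source shuffle: Result[i] = Loaded[Mask[i]].
std::vector<int> applyMask(const std::vector<int> &Loaded,
                           const std::vector<unsigned> &Mask) {
  std::vector<int> Result;
  Result.reserve(Mask.size());
  for (unsigned Lane : Mask) {
    assert(Lane < Loaded.size() && "mask lane out of range");
    Result.push_back(Loaded[Lane]);
  }
  return Result;
}

// Returns the value seen by operand OpdNum: the shuffled vector if a mask was
// recorded for that operand number, otherwise the loaded vector unchanged.
std::vector<int> operandValue(const UserEntrySketch &User, int OpdNum,
                              const std::vector<int> &Loaded) {
  auto It = User.ShuffleMask.find(OpdNum);
  if (It == User.ShuffleMask.end())
    return Loaded;
  return applyMask(Loaded, It->second);
}
```

The alternative suggested by the reviewer, a separate tree entry per load order, would instead attach the permutation to the load's own entry rather than to each user.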
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx);
DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");		DEBUG(dbgs() << "SLP: Gathering casts with different src types.\n");
return;		return;
}		}
}		}
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx);
DEBUG(dbgs() << "SLP: added a vector of casts.\n");		DEBUG(dbgs() << "SLP: added a vector of casts.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::FCmp: {		case Instruction::FCmp: {
// Check that all of the compares have the same predicate.		// Check that all of the compares have the same predicate.
CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Type *ComparedTy = VL0->getOperand(0)->getType();		Type *ComparedTy = VL0->getOperand(0)->getType();
Show All 12 Lines	case Instruction::FCmp: {
DEBUG(dbgs() << "SLP: added a vector of compares.\n");		DEBUG(dbgs() << "SLP: added a vector of compares.\n");

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::Select:		case Instruction::Select:
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
Show All 15 Lines	case Instruction::Xor: {
DEBUG(dbgs() << "SLP: added a vector of bin op.\n");		DEBUG(dbgs() << "SLP: added a vector of bin op.\n");

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(VL, Left, Right);		reorderInputsAccordingToOpcode(VL, Left, Right);
buildTree_rec(Left, Depth + 1, UserTreeIdx);		buildTree_rec(Left, Depth + 1, UserTreeIdx);
buildTree_rec(Right, Depth + 1, UserTreeIdx);		buildTree_rec(Right, Depth + 1, UserTreeIdx, 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
// We don't combine GEPs with complicated (nested) indexing.		// We don't combine GEPs with complicated (nested) indexing.
for (unsigned j = 0; j < VL.size(); ++j) {		for (unsigned j = 0; j < VL.size(); ++j) {
if (cast<Instruction>(VL[j])->getNumOperands() != 2) {		if (cast<Instruction>(VL[j])->getNumOperands() != 2) {
DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");		DEBUG(dbgs() << "SLP: not-vectorizable GEP (nested indexes).\n");
Show All 31 Lines	case Instruction::GetElementPtr: {
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx);
DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");		DEBUG(dbgs() << "SLP: added a vector of GEPs.\n");
for (unsigned i = 0, e = 2; i < e; ++i) {		for (unsigned i = 0, e = 2; i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::Store: {		case Instruction::Store: {
// Check if the stores are consecutive or of we need to swizzle them.		// Check if the stores are consecutive or of we need to swizzle them.
for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)		for (unsigned i = 0, e = VL.size() - 1; i < e; ++i)
if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {		if (!isConsecutiveAccess(VL[i], VL[i + 1], DL, SE)) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
▲ Show 20 Lines • Show All 68 Lines • ▼ Show 20 Lines	case Instruction::Call: {
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx);
for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {		for (unsigned i = 0, e = CI->getNumArgOperands(); i != e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL) {		for (Value *j : VL) {
CallInst *CI2 = dyn_cast<CallInst>(j);		CallInst *CI2 = dyn_cast<CallInst>(j);
Operands.push_back(CI2->getArgOperand(i));		Operands.push_back(CI2->getArgOperand(i));
}		}
buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
// If this is not an alternate sequence of opcode like add-sub		// If this is not an alternate sequence of opcode like add-sub
// then do not vectorize this instruction.		// then do not vectorize this instruction.
if (!isAltShuffle) {		if (!isAltShuffle) {
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx);
DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");		DEBUG(dbgs() << "SLP: ShuffleVector are not vectorized.\n");
return;		return;
}		}
newTreeEntry(VL, true, UserTreeIdx);		newTreeEntry(VL, true, UserTreeIdx);
DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");		DEBUG(dbgs() << "SLP: added a ShuffleVector op.\n");

// Reorder operands if reordering would enable vectorization.		// Reorder operands if reordering would enable vectorization.
if (isa<BinaryOperator>(VL0)) {		if (isa<BinaryOperator>(VL0)) {
ValueList Left, Right;		ValueList Left, Right;
reorderAltShuffleOperands(VL, Left, Right);		reorderAltShuffleOperands(VL, Left, Right);
buildTree_rec(Left, Depth + 1, UserTreeIdx);		buildTree_rec(Left, Depth + 1, UserTreeIdx);
buildTree_rec(Right, Depth + 1, UserTreeIdx);		buildTree_rec(Right, Depth + 1, UserTreeIdx, 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
for (Value *j : VL)		for (Value *j : VL)
Operands.push_back(cast<Instruction>(j)->getOperand(i));		Operands.push_back(cast<Instruction>(j)->getOperand(i));

buildTree_rec(Operands, Depth + 1, UserTreeIdx);		buildTree_rec(Operands, Depth + 1, UserTreeIdx, i);
}		}
return;		return;
}		}
default:		default:
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, false, UserTreeIdx);		newTreeEntry(VL, false, UserTreeIdx);
DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
▲ Show 20 Lines • Show All 235 Lines • ▼ Show 20 Lines	case Instruction::Load: {
unsigned alignment = dyn_cast<LoadInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<LoadInst>(VL0)->getAlignment();
int ScalarLdCost = VecTy->getNumElements() *		int ScalarLdCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0, VL0);
int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,		int VecLdCost = TTI->getMemoryOpCost(Instruction::Load,
VecTy, alignment, 0, VL0);		VecTy, alignment, 0, VL0);
return VecLdCost - ScalarLdCost;		return VecLdCost - ScalarLdCost;
}		}
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
		ABataev: You can remove the last argument here.
		ashahid (Done): Sure
unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();		unsigned alignment = dyn_cast<StoreInst>(VL0)->getAlignment();
int ScalarStCost = VecTy->getNumElements() *		int ScalarStCost = VecTy->getNumElements() *
TTI->getMemoryOpCost(Instruction::Store, ScalarTy, alignment, 0, VL0);		TTI->getMemoryOpCost(Instruction::Store, ScalarTy, alignment, 0, VL0);
int VecStCost = TTI->getMemoryOpCost(Instruction::Store,		int VecStCost = TTI->getMemoryOpCost(Instruction::Store,
VecTy, alignment, 0, VL0);		VecTy, alignment, 0, VL0);
return VecStCost - ScalarStCost;		return VecStCost - ScalarStCost;
		ABataev: You don't need so many shuffles; just one is enough.
		ashahid: This is basically for multiple in-tree uses with different masks/permutations.
}		}
case Instruction::Call: {		case Instruction::Call: {
CallInst *CI = cast<CallInst>(VL0);		CallInst *CI = cast<CallInst>(VL0);
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);

// Calculate the cost of the scalar and vector calls.		// Calculate the cost of the scalar and vector calls.
SmallVector<Type*, 4> ScalarTys;		SmallVector<Type*, 4> ScalarTys;
for (unsigned op = 0, opc = CI->getNumArgOperands(); op!= opc; ++op)		for (unsigned op = 0, opc = CI->getNumArgOperands(); op!= opc; ++op)
▲ Show 20 Lines • Show All 526 Lines • ▼ Show 20 Lines	if (Instruction *Insrt = dyn_cast<Instruction>(Vec)) {
ExternalUses.push_back(ExternalUser(VL[i], Insrt, FoundLane));		ExternalUses.push_back(ExternalUser(VL[i], Insrt, FoundLane));
}		}
}		}
}		}

return Vec;		return Vec;
}		}

Value *BoUpSLP::alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const {		Value *BoUpSLP::alreadyVectorized(ArrayRef<Value *> VL, Value *OpValue) const {
		ABataev: Why do you need this condition?
		ashahid: In the 2nd test of jumbled-load.ll, the two operands of the MUL are fed from the same loaded vector. The 1st operand is a SHUFFLE of the LOAD and the 2nd operand is a gather of the same scalar loads. A query to getTreeEntry() will always return the node with the same vectorized value, and hence both operands of the MUL will be fed the shuffled load. This check is to avoid this scenario.
		ABataev: This scenario should happen in your patch: the instruction is either vectorized or gathered, but not both.
		ashahid: This check is to avoid feeding the generated SHUFFLE to both operands of the MUL, which is not the intention of the test case.
if (const TreeEntry *En = getTreeEntry(OpValue)) {		if (const TreeEntry *En = getTreeEntry(OpValue)) {
if (En->isSame(VL) && En->VectorizedValue)		if (En->isSame(VL) && En->VectorizedValue)
return En->VectorizedValue;		return En->VectorizedValue;
}		}
return nullptr;		return nullptr;
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL, int OpdNum, int UserIndx) {
		Ayal (Done): See above discussion about replacing second condition with an assert.
if (TreeEntry *E = getTreeEntry(VL[0]))		if (ScalarToTreeEntry.count(VL[0])) {
if (E->isSame(VL))		int Idx = ScalarToTreeEntry[VL[0]];
return vectorizeTree(E);		TreeEntry *E = &VectorizableTree[Idx];
		TreeEntry *UserTreeEntry = &VectorizableTree[UserIndx];
		if (E->isSame(VL) ||
		(UserTreeEntry && !UserTreeEntry->ShuffleMask.empty() &&
		!UserTreeEntry->ShuffleMask[OpdNum].empty() &&
		E->isFoundJumbled(VL, DL, SE)))
		return vectorizeTree(E, OpdNum, UserIndx);
		}

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, VL.size());		VectorType *VecTy = VectorType::get(ScalarTy, VL.size());

return Gather(VL, VecTy);		return Gather(VL, VecTy);
}		}

Value *BoUpSLP::vectorizeTree(TreeEntry *E) {		Value *BoUpSLP::vectorizeTree(TreeEntry *E, int OpdNum, int UserIndx) {
IRBuilder<>::InsertPointGuard Guard(Builder);		IRBuilder<>::InsertPointGuard Guard(Builder);

		int CurrIndx = ScalarToTreeEntry[E->Scalars[0]];
		TreeEntry *UserTreeEntry = nullptr;
if (E->VectorizedValue) {		if (E->VectorizedValue) {
DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");		DEBUG(dbgs() << "SLP: Diamond merged for " << *E->Scalars[0] << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Instruction *VL0 = cast<Instruction>(E->Scalars[0]);		Instruction *VL0 = cast<Instruction>(E->Scalars[0]);
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL0))		if (StoreInst *SI = dyn_cast<StoreInst>(VL0))
Show All 31 Lines	case Instruction::PHI: {
}		}

// Prepare the operand vector.		// Prepare the operand vector.
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
Operands.push_back(cast<PHINode>(V)->getIncomingValueForBlock(IBB));		Operands.push_back(cast<PHINode>(V)->getIncomingValueForBlock(IBB));

Builder.SetInsertPoint(IBB->getTerminator());		Builder.SetInsertPoint(IBB->getTerminator());
Builder.SetCurrentDebugLocation(PH->getDebugLoc());		Builder.SetCurrentDebugLocation(PH->getDebugLoc());
Value *Vec = vectorizeTree(Operands);		Value *Vec = vectorizeTree(Operands, i, CurrIndx);
NewPhi->addIncoming(Vec, IBB);		NewPhi->addIncoming(Vec, IBB);
}		}

assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&		assert(NewPhi->getNumIncomingValues() == PH->getNumIncomingValues() &&
"Invalid number of incoming values");		"Invalid number of incoming values");
return NewPhi;		return NewPhi;
}		}

Show All 36 Lines	switch (Opcode) {
case Instruction::FPTrunc:		case Instruction::FPTrunc:
case Instruction::BitCast: {		case Instruction::BitCast: {
ValueList INVL;		ValueList INVL;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
INVL.push_back(cast<Instruction>(V)->getOperand(0));		INVL.push_back(cast<Instruction>(V)->getOperand(0));

setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

Value *InVec = vectorizeTree(INVL);		Value *InVec = vectorizeTree(INVL, 0, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

CastInst *CI = dyn_cast<CastInst>(VL0);		CastInst *CI = dyn_cast<CastInst>(VL0);
Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);		Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::FCmp:		case Instruction::FCmp:
case Instruction::ICmp: {		case Instruction::ICmp: {
ValueList LHSV, RHSV;		ValueList LHSV, RHSV;
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
LHSV.push_back(cast<Instruction>(V)->getOperand(0));		LHSV.push_back(cast<Instruction>(V)->getOperand(0));
RHSV.push_back(cast<Instruction>(V)->getOperand(1));		RHSV.push_back(cast<Instruction>(V)->getOperand(1));
}		}

setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

Value *L = vectorizeTree(LHSV);		Value *L = vectorizeTree(LHSV, 0, CurrIndx);
Value *R = vectorizeTree(RHSV);		Value *R = vectorizeTree(RHSV, 1, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Value *V;		Value *V;
if (Opcode == Instruction::FCmp)		if (Opcode == Instruction::FCmp)
V = Builder.CreateFCmp(P0, L, R);		V = Builder.CreateFCmp(P0, L, R);
Show All 10 Lines	case Instruction::Select: {
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
CondVec.push_back(cast<Instruction>(V)->getOperand(0));		CondVec.push_back(cast<Instruction>(V)->getOperand(0));
TrueVec.push_back(cast<Instruction>(V)->getOperand(1));		TrueVec.push_back(cast<Instruction>(V)->getOperand(1));
FalseVec.push_back(cast<Instruction>(V)->getOperand(2));		FalseVec.push_back(cast<Instruction>(V)->getOperand(2));
}		}

setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

Value *Cond = vectorizeTree(CondVec);		Value *Cond = vectorizeTree(CondVec, 0, CurrIndx);
Value *True = vectorizeTree(TrueVec);		Value *True = vectorizeTree(TrueVec, 1, CurrIndx);
Value *False = vectorizeTree(FalseVec);		Value *False = vectorizeTree(FalseVec, 2, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

Value *V = Builder.CreateSelect(Cond, True, False);		Value *V = Builder.CreateSelect(Cond, True, False);
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
Show All 22 Lines	case Instruction::Xor: {
else		else
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
LHSVL.push_back(cast<Instruction>(V)->getOperand(0));		LHSVL.push_back(cast<Instruction>(V)->getOperand(0));
RHSVL.push_back(cast<Instruction>(V)->getOperand(1));		RHSVL.push_back(cast<Instruction>(V)->getOperand(1));
}		}

setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

Value *LHS = vectorizeTree(LHSVL);		Value *LHS = vectorizeTree(LHSVL, 0, CurrIndx);
Value *RHS = vectorizeTree(RHSVL);		Value *RHS = vectorizeTree(RHSVL, 1, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

BinaryOperator *BinOp = cast<BinaryOperator>(VL0);		BinaryOperator *BinOp = cast<BinaryOperator>(VL0);
Value *V = Builder.CreateBinOp(BinOp->getOpcode(), LHS, RHS);		Value *V = Builder.CreateBinOp(BinOp->getOpcode(), LHS, RHS);
E->VectorizedValue = V;		E->VectorizedValue = V;
propagateIRFlags(E->VectorizedValue, E->Scalars);		propagateIRFlags(E->VectorizedValue, E->Scalars);
++NumVectorInstructions;		++NumVectorInstructions;

if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
return propagateMetadata(I, E->Scalars);		return propagateMetadata(I, E->Scalars);

return V;		return V;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Loads are inserted at the head of the tree because we don't want to		// Loads are inserted at the head of the tree because we don't want to
// sink them all the way down past store instructions.		// sink them all the way down past store instructions.
setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

LoadInst *LI = cast<LoadInst>(VL0);		LoadInst *LI = cast<LoadInst>(VL0);
		Ayal: ditto
		ABataev (Done): Restore the original code here.
		ashahid: Thanks
Type *ScalarLoadTy = LI->getType();		Type *ScalarLoadTy = LI->getType();
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();

Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));

		sanjoy (Done): Might be cleaner to abstract `(unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() && !UserTreeEntry->ShuffleMask[OpdNum].empty()` into a `UserTreeEntry->hasShuffleMaskForOp(Index)` helper.
// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
		Ayal: It may be simpler to check `ShuffleMask.count(OpdNum)` instead.
		ashahid: Quite right.
		ABataev: Is this correct? `E->Scalars[0]` is exactly `VL0`.
		ashahid: Ah, both are the same.
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));

unsigned Alignment = LI->getAlignment();		unsigned Alignment = LI->getAlignment();
LI = Builder.CreateLoad(VecPtr);		LI = Builder.CreateLoad(VecPtr);
if (!Alignment) {		if (!Alignment) {
Alignment = DL->getABITypeAlignment(ScalarLoadTy);		Alignment = DL->getABITypeAlignment(ScalarLoadTy);
}		}
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
E->VectorizedValue = LI;		E->VectorizedValue = LI;
++NumVectorInstructions;		++NumVectorInstructions;
return propagateMetadata(LI, E->Scalars);		propagateMetadata(LI, E->Scalars);
		ABataev: Remove this empty line

		if(UserIndx != -1) {
		Ayal: clang-format
		UserTreeEntry = &VectorizableTree[UserIndx];
		Ayal: So can a load have more than one user in need of permuting the loaded lanes; are diamonds ok? OTOH, the branch seems redundant - can we assert that at least one user exists? (Missing space: `if[ ](UserIndx != -1)`)
		ashahid: Yes, there can be more than one user requiring a permutation of the loaded lanes. Not sure, but it seems diamonds are not enough.
		Ayal: So will each such user get its desired permutation of the loaded lanes? Only a single user is handled here.
		ashahid: Yes, each user will get its desired permutation of the loaded lanes, because the tree here is a DAG and distinct user tree entries have different user indexes. OTOH, a specific user whose operands are all different permutations of the same loaded lanes will be distinguished by 'OpdNum'.
		Ayal: In `buildTree_rec` above, we're still looking for perfect diamonds w/o considering shuffled loads. So if a second user wants to shuffle a load similar to what a first user wanted (and got, being first), they will not share the shuffle, right? The second user will gather its loads instead. In other words, a shuffled load will have only a single user, right?
		ashahid: Yes.
		Ayal: If a shuffled load will have only a single user, then a single optional ShuffleMask could be held at each load (def), instead of holding an array of ShuffleMasks per operand at the user. This could be done w/o introducing OpdNum. One way of generating the code, at least conceptually, could be to first generate it w/o the ShuffleMask, and then RAUW where the single user of the load is replaced by the shuffle. You may want to introduce OpdNum for future use, i.e., where a single ShuffleMask handles a non-trivial subset of a load's users. In any case, add a TODO in `buildTree_rec` to consider shuffled loads when looking for perfect diamonds, thereby reusing a ShuffleMask for multiple users in the future?
		ashahid: It seems I misunderstood your question. Actually, every shuffle of a load is used by a different user, which would be captured by OpdNum. So in this sense a shuffled load can have multiple users, and this is the real issue I am trying to resolve with this patch, which was lacking in my earlier attempt.
		Ayal: In the jumbled-load-multiuse.ll testcase there are two users: the first (cmp) gets to use the shuffle whereas the second (select) ends up gathering its loads instead. OpdNum captures having a distinct shuffle per operand, iiuc, rather than a distinct shuffle per user, or support for having multiple users share a common shuffle.
		ashahid: For a distinct user, OpdNum=0 will capture the shuffle mask. For users that need distinct shuffle masks of a loaded value, OpdNum=0,1 (for a binary operation) will capture the required shuffle masks.
		Ayal: Can you show an example where two distinct users of the same load get to use the same shuffle, and an example where two such users get to use two distinct shuffles? Each user can have one or more operands; their OpdNum shouldn't matter. Suspect such examples may not exist - in the revised jumbled-load-multiuse.ll testcase below, two users of the same load want to use the same shuffle, but don't get to.
		Ayal: (Quoting ashahid: "So in this sense a shuffled load can have multiple user and this is the real issue I am trying to resolve with this patch which was lacking in my earlier attempt.") The real issue I think you're trying to resolve with this patch, which was lacking in your earlier attempt, is to support loads that have multiple users and need shuffle(s), by allowing only a single (the first) user to feed from the single shuffle, and all other users to extract and gather their elements from the original unshuffled load, as shown in jumbled-load-multiuse.ll. This can be achieved by marking each user if it gets to use a shuffle or not, as done in this patch; or could alternatively be achieved by holding a single optional mask per load (as in the previous attempt), along with an indication of which user it should feed. In any case, a load can end up having at most a single shuffle, which in turn feeds a single user, right?
		ashahid: I have tried to depict the case I have in mind in the attached file. In the given figures, shuffle mask edges are captured by the OpdNum of the 'U's.
		}
		if (UserTreeEntry && !UserTreeEntry->ShuffleMask.empty() &&
		!UserTreeEntry->ShuffleMask[OpdNum].empty()) {
		SmallVector<Constant *, 8> Mask;
		Ayal: Can simply do `for (unsigned Entry : ShuffleMask[OpdNum])` instead of iterating explicitly over all lanes and retrieving each `UserTreeEntry->ShuffleMask[OpdNum][Lane]`.
		for (unsigned Lane = 0, LE = UserTreeEntry->ShuffleMask[OpdNum].size();
		sanjoy: The cast to `Value *` should not be necessary.
		Lane != LE; ++Lane) {
		Mask.push_back(
		Builder.getInt32(UserTreeEntry->ShuffleMask[OpdNum][Lane]));
		}
		// Generate shuffle for jumbled memory access
		Value *Undef = UndefValue::get(VecTy);
		Value *Shuf = Builder.CreateShuffleVector((Value *)LI, Undef,
		ConstantVector::get(Mask));
		E->VectorizedValue = Shuf;
		++NumVectorInstructions;
		return Shuf;
		}
		return LI;
}		}
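
The per-operand mask lookup discussed in the comments above can be illustrated outside LLVM. The following is only a minimal sketch under stated assumptions: `UserEntryModel` and `applyShuffle` are hypothetical names that do not appear in the patch, the `std::map` storage follows the patch update note, and `hasShuffleMaskForOp` follows the helper suggested by sanjoy. `applyShuffle` models the single-source `shufflevector` semantics (`Result[i] = Loaded[Mask[i]]`) using the range-based loop suggested by Ayal.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a user tree entry; NOT the real SLP TreeEntry.
struct UserEntryModel {
  // Operand number -> lane permutation requested for that operand.
  std::map<unsigned, std::vector<unsigned>> ShuffleMask;

  // True only if a non-empty mask was recorded for this operand.
  // Uses find() so that a lookup never inserts an empty entry.
  bool hasShuffleMaskForOp(unsigned OpdNum) const {
    auto It = ShuffleMask.find(OpdNum);
    return It != ShuffleMask.end() && !It->second.empty();
  }
};

// Models a single-source shufflevector: Result[i] = Loaded[Mask[i]].
std::vector<int> applyShuffle(const std::vector<int> &Loaded,
                              const std::vector<unsigned> &Mask) {
  std::vector<int> Result;
  Result.reserve(Mask.size());
  for (unsigned Entry : Mask) {
    assert(Entry < Loaded.size() && "mask index out of range");
    Result.push_back(Loaded[Entry]);
  }
  return Result;
}
```

One reason to prefer `find()` (or `count()`) over `operator[]` in such a helper is that `operator[]` on a `std::map` silently creates an empty entry for every operand that is queried.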
case Instruction::Store: {		case Instruction::Store: {
StoreInst *SI = cast<StoreInst>(VL0);		StoreInst *SI = cast<StoreInst>(VL0);
unsigned Alignment = SI->getAlignment();		unsigned Alignment = SI->getAlignment();
unsigned AS = SI->getPointerAddressSpace();		unsigned AS = SI->getPointerAddressSpace();

ValueList ValueOp;		ValueList ValueOp;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
ValueOp.push_back(cast<StoreInst>(V)->getValueOperand());		ValueOp.push_back(cast<StoreInst>(V)->getValueOperand());

setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

Value *VecValue = vectorizeTree(ValueOp);		Value *VecValue = vectorizeTree(ValueOp, 0, CurrIndx);
Value *VecPtr = Builder.CreateBitCast(SI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(SI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));
StoreInst *S = Builder.CreateStore(VecValue, VecPtr);		StoreInst *S = Builder.CreateStore(VecValue, VecPtr);

// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = SI->getPointerOperand();		Value *PO = SI->getPointerOperand();
Show All 10 Lines	switch (Opcode) {
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

ValueList Op0VL;		ValueList Op0VL;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
Op0VL.push_back(cast<GetElementPtrInst>(V)->getOperand(0));		Op0VL.push_back(cast<GetElementPtrInst>(V)->getOperand(0));

Value *Op0 = vectorizeTree(Op0VL);		Value *Op0 = vectorizeTree(Op0VL, 0, CurrIndx);

std::vector<Value *> OpVecs;		std::vector<Value *> OpVecs;
for (int j = 1, e = cast<GetElementPtrInst>(VL0)->getNumOperands(); j < e;		for (int j = 1, e = cast<GetElementPtrInst>(VL0)->getNumOperands(); j < e;
++j) {		++j) {
ValueList OpVL;		ValueList OpVL;
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
OpVL.push_back(cast<GetElementPtrInst>(V)->getOperand(j));		OpVL.push_back(cast<GetElementPtrInst>(V)->getOperand(j));

Value *OpVec = vectorizeTree(OpVL);		Value *OpVec = vectorizeTree(OpVL, j, CurrIndx);
OpVecs.push_back(OpVec);		OpVecs.push_back(OpVec);
}		}

Value *V = Builder.CreateGEP(		Value *V = Builder.CreateGEP(
cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);		cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

Show All 22 Lines	case Instruction::Call: {
OpVecs.push_back(CEI->getArgOperand(j));		OpVecs.push_back(CEI->getArgOperand(j));
continue;		continue;
}		}
for (Value *V : E->Scalars) {		for (Value *V : E->Scalars) {
CallInst *CEI = cast<CallInst>(V);		CallInst *CEI = cast<CallInst>(V);
OpVL.push_back(CEI->getArgOperand(j));		OpVL.push_back(CEI->getArgOperand(j));
}		}

Value *OpVec = vectorizeTree(OpVL);		Value *OpVec = vectorizeTree(OpVL, j, CurrIndx);
DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");		DEBUG(dbgs() << "SLP: OpVec[" << j << "]: " << *OpVec << "\n");
OpVecs.push_back(OpVec);		OpVecs.push_back(OpVec);
}		}

Module *M = F->getParent();		Module *M = F->getParent();
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);
Type *Tys[] = { VectorType::get(CI->getType(), E->Scalars.size()) };		Type *Tys[] = { VectorType::get(CI->getType(), E->Scalars.size()) };
Function *CF = Intrinsic::getDeclaration(M, ID, Tys);		Function *CF = Intrinsic::getDeclaration(M, ID, Tys);
Show All 13 Lines	case Instruction::Call: {
return V;		return V;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
ValueList LHSVL, RHSVL;		ValueList LHSVL, RHSVL;
assert(isa<BinaryOperator>(VL0) && "Invalid Shuffle Vector Operand");		assert(isa<BinaryOperator>(VL0) && "Invalid Shuffle Vector Operand");
reorderAltShuffleOperands(E->Scalars, LHSVL, RHSVL);		reorderAltShuffleOperands(E->Scalars, LHSVL, RHSVL);
setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

Value *LHS = vectorizeTree(LHSVL);		Value *LHS = vectorizeTree(LHSVL, 0, CurrIndx);
Value *RHS = vectorizeTree(RHSVL);		Value *RHS = vectorizeTree(RHSVL, 1, CurrIndx);

if (Value *V = alreadyVectorized(E->Scalars, VL0))		if (Value *V = alreadyVectorized(E->Scalars, VL0))
return V;		return V;

// Create a vector of LHS op1 RHS		// Create a vector of LHS op1 RHS
BinaryOperator *BinOp0 = cast<BinaryOperator>(VL0);		BinaryOperator *BinOp0 = cast<BinaryOperator>(VL0);
Value *V0 = Builder.CreateBinOp(BinOp0->getOpcode(), LHS, RHS);		Value *V0 = Builder.CreateBinOp(BinOp0->getOpcode(), LHS, RHS);

▲ Show 20 Lines • Show All 86 Lines • ▼ Show 20 Lines	for (const auto &ExternalUse : ExternalUses) {
// Skip users that we already RAUW. This happens when one instruction		// Skip users that we already RAUW. This happens when one instruction
// has multiple uses of the same value.		// has multiple uses of the same value.
if (User && !is_contained(Scalar->users(), User))		if (User && !is_contained(Scalar->users(), User))
continue;		continue;
TreeEntry *E = getTreeEntry(Scalar);		TreeEntry *E = getTreeEntry(Scalar);
assert(E && "Invalid scalar");		assert(E && "Invalid scalar");
assert(!E->NeedToGather && "Extracting from a gather list");		assert(!E->NeedToGather && "Extracting from a gather list");

Value *Vec = E->VectorizedValue;		Value *Vec = nullptr;
		if ((Vec = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) &&
		sanjoy: `dyn_cast<XXX>(f)->g()` should never be necessary. Either the `dyn_cast` can return null, in which case you should check for that, or it can't and you should use `cast<>`. Also the cast of `Vec` to `Instruction` seems unnecessary: `ShuffleVectorInst` is an `Instruction`.
		ashahid: Here I am trying to ensure that the instructions are "ShuffleVectorInst" and "LoadInst" respectively. The cast of Vec to Instruction is needed to access getOperand(), which the compiler otherwise reports as an error.
		Ayal: Use `isa` instead of `dyn_cast` here: `if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {` or alternatively do something like: `Value *Vec = E->VectorizedValue; assert(Vec && "Can't find vectorizable value"); if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec)) if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0))) Vec = Load;`
		dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {
		Vec = cast<Instruction>(E->VectorizedValue)->getOperand(0);
		} else {
		Vec = E->VectorizedValue;
		}
		sanjoy: I think you can rewrite this more cleanly using an immediately-invoked function expression: `Value *Vec = [&]() { if (auto SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) if (auto LI = dyn_cast<LoadInst>(SVI->getOperand(0))) return LI->getOperand(0); return E->VectorizedValue; }();`
		ashahid: I tried this IIFE, however I am getting an assertion "Tried to create extractelement operation on non-vector type!" for the jumbled-load-multiuse.ll test. Do you see any issue in this code?
		sanjoy: Yes, I think I should have written: `Value *Vec = [&]() { if (auto SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue)) if (isa<LoadInst>(SVI->getOperand(0))) return SVI->getOperand(0); return E->VectorizedValue; }();`
		Ayal: Yes, this simplifies the below "alternatively do something like:" `Value *Vec = E->VectorizedValue; assert(Vec && "Can't find vectorizable value"); if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec)) if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0))) Vec = Load;`
		ABataev: I think you can have default capture by value here rather than by reference.
assert(Vec && "Can't find vectorizable value");		assert(Vec && "Can't find vectorizable value");
		ABataev: I rather doubt you need all that stuff. You can use the original code.
		ashahid: This is required; otherwise the multiuse.ll test as well as PR32086.ll will fail, because the lanes were recorded according to the order of the scalar loads.
		ABataev: Again, it just may not happen in this patch.
		ashahid: It does happen and this test fails.

Value *Lane = Builder.getInt32(ExternalUse.Lane);		Value *Lane = Builder.getInt32(ExternalUse.Lane);
// If User == nullptr, the Scalar is used as extra arg. Generate		// If User == nullptr, the Scalar is used as extra arg. Generate
// ExtractElement instruction and update the record for this scalar in		// ExtractElement instruction and update the record for this scalar in
// ExternallyUsedValues.		// ExternallyUsedValues.
if (!User) {		if (!User) {
assert(ExternallyUsedValues.count(Scalar) &&		assert(ExternallyUsedValues.count(Scalar) &&
"Scalar with nullptr as an external user must be registered in "		"Scalar with nullptr as an external user must be registered in "
▲ Show 20 Lines • Show All 2,353 Lines • Show Last 20 Lines
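
The review thread above converges on one rule for external uses: if the vectorized value is a shuffle whose first operand is a load, extract from that load rather than from the shuffle, since (per ashahid) the lanes were recorded in the order of the scalar loads. The sketch below models only that selection logic in standalone C++. It is an illustration under assumptions, not the patch's code: `ModelValue`, `ModelLoad`, `ModelShuffle`, and `extractSource` are hypothetical stand-ins, and `dynamic_cast` plays the role of LLVM's `dyn_cast`/`isa`. The lambda captures by value, following ABataev's remark.

```cpp
#include <cassert>

// Hypothetical stand-ins for llvm::Value and two of its subclasses.
struct ModelValue {
  virtual ~ModelValue() = default;
};
struct ModelLoad : ModelValue {};
struct ModelShuffle : ModelValue {
  const ModelValue *Op0; // first shuffle operand
  explicit ModelShuffle(const ModelValue *Src) : Op0(Src) {}
};

// Returns the vector that external-use extracts should read from:
// the underlying load when Vectorized is a shuffle of a load,
// otherwise Vectorized itself.
const ModelValue *extractSource(const ModelValue *Vectorized) {
  assert(Vectorized && "Can't find vectorizable value");
  // Immediately-invoked lambda, capturing by value.
  return [=]() -> const ModelValue * {
    if (auto *Shuffle = dynamic_cast<const ModelShuffle *>(Vectorized))
      if (dynamic_cast<const ModelLoad *>(Shuffle->Op0))
        return Shuffle->Op0;
    return Vectorized;
  }();
}
```

Note that the lambda returns the shuffle's operand (the load itself), which matches sanjoy's corrected version; returning the load's own operand would yield a pointer rather than a vector.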

test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 \| FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 \| FileCheck %s

	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 2), align 4			; CHECK-NEXT: [[TMP2:%.*]] = icmp sgt <4 x i32> [[TMP1]], zeroinitializer
	; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @b, i64 0, i32 3), align 4			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> undef, i32 [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP2]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 [[TMP3]], i32 2			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP5]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP0]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x i32> [[TMP6]], i32 8, i32 3
	; CHECK-NEXT: [[TMP8:%.*]] = icmp sgt <4 x i32> [[TMP7]], zeroinitializer			; CHECK-NEXT: [[TMP8:%.*]] = select <4 x i1> [[TMP2]], <4 x i32> [[TMP7]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> [[TMP4]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 1			; CHECK-NEXT: store <4 x i32> [[TMP8]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 ptrtoint (i32 ()* @fn1 to i32), i32 2
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 8, i32 3
	; CHECK-NEXT: [[TMP12:%.*]] = select <4 x i1> [[TMP8]], <4 x i32> [[TMP11]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4
	Show All 13 Lines

test/Transforms/SLPVectorizer/X86/jumbled-load.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer \| FileCheck %s		; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer \| FileCheck %s



define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {		define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
; CHECK-LABEL: @jumbled-load(		; CHECK-LABEL: @jumbled-load(
; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* %in, i64 0		; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* %inn, i64 0		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
		; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2		; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3		; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1		; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_3]], [[LOAD_5]]		; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_8]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 3, i32 2>
; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_4]], [[LOAD_7]]		; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_1]], [[LOAD_6]]		; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* %out, i64 0		; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_7]], align 4		; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* %out, i64 1		; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_8]], align 4		; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* %out, i64 2		; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_9]], align 4
; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* %out, i64 3
; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_10]], align 4
; CHECK-NEXT: ret i32 undef		; CHECK-NEXT: ret i32 undef
;		;
%in.addr = getelementptr inbounds i32, i32* %in, i64 0		%in.addr = getelementptr inbounds i32, i32* %in, i64 0
%load.1 = load i32, i32* %in.addr, align 4		%load.1 = load i32, i32* %in.addr, align 4
%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3		%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
%load.2 = load i32, i32* %gep.1, align 4		%load.2 = load i32, i32* %gep.1, align 4
%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1		%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
%load.3 = load i32, i32* %gep.2, align 4		%load.3 = load i32, i32* %gep.2, align 4
Show All 16 Lines	;
%gep.8 = getelementptr inbounds i32, i32* %out, i64 1		%gep.8 = getelementptr inbounds i32, i32* %out, i64 1
store i32 %mul.2, i32* %gep.8, align 4		store i32 %mul.2, i32* %gep.8, align 4
%gep.9 = getelementptr inbounds i32, i32* %out, i64 2		%gep.9 = getelementptr inbounds i32, i32* %out, i64 2
store i32 %mul.3, i32* %gep.9, align 4		store i32 %mul.3, i32* %gep.9, align 4
%gep.10 = getelementptr inbounds i32, i32* %out, i64 3		%gep.10 = getelementptr inbounds i32, i32* %out, i64 3
store i32 %mul.4, i32* %gep.10, align 4		store i32 %mul.4, i32* %gep.10, align 4

ret i32 undef		ret i32 undef
}		}
		ABataev: You need to add this test separately and show changes in it
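
The shuffle masks expected in jumbled-load.ll can be read off the scalar code: the first `mul` operands are `LOAD_3`, `LOAD_2`, `LOAD_4`, `LOAD_1`, which read element offsets 1, 3, 2, 0 from `%in`, and the expected mask is `<1, 3, 2, 0>`. The second operands read offsets 0, 1, 3, 2 from `%inn`, matching `<0, 1, 3, 2>`. The sketch below illustrates that relationship only. It is an assumption-laden model, not the analysis the patch adds to LoopAccessAnalysis: `computeJumbledMask` is a hypothetical name, and it takes already-resolved integer element offsets instead of pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Offsets[i] is the element offset (from a common base pointer) read by
// lane i of a bundle. Returns true and fills Mask when the offsets cover
// exactly one consecutive run, so that one wide load plus one shuffle
// reproduces the bundle: Bundle[i] = WideLoad[Mask[i]].
bool computeJumbledMask(const std::vector<long> &Offsets,
                        std::vector<unsigned> &Mask) {
  Mask.clear();
  if (Offsets.empty())
    return false;
  std::vector<long> Sorted(Offsets);
  std::sort(Sorted.begin(), Sorted.end());
  for (std::size_t I = 1; I < Sorted.size(); ++I)
    if (Sorted[I] != Sorted[I - 1] + 1)
      return false; // gap or duplicate: not one contiguous block
  const long Base = Sorted.front(); // the wide load starts here
  for (long Off : Offsets)
    Mask.push_back(static_cast<unsigned>(Off - Base));
  return true;
}
```

In this model an already-consecutive bundle yields the identity mask, in which case no shuffle would be needed at all.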

test/Transforms/SLPVectorizer/X86/store-jumbled.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer \| FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer \| FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_1:%.*]] = load i32, i32* [[IN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_2:%.*]] = load i32, i32* [[GEP_1]], align 4
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_3:%.*]] = load i32, i32* [[GEP_2]], align 4
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_4:%.*]] = load i32, i32* [[GEP_3]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[LOAD_5:%.*]] = load i32, i32* [[INN_ADDR]], align 4
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[LOAD_6:%.*]] = load i32, i32* [[GEP_4]], align 4
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[LOAD_7:%.*]] = load i32, i32* [[GEP_5]], align 4
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[LOAD_8:%.*]] = load i32, i32* [[GEP_6]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[LOAD_1]], [[LOAD_5]]			; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4
	; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[LOAD_2]], [[LOAD_6]]			; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> undef, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[LOAD_3]], [[LOAD_7]]			; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[MUL_4:%.*]] = mul i32 [[LOAD_4]], [[LOAD_8]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: store i32 [[MUL_1]], i32* [[GEP_9]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store i32 [[MUL_2]], i32* [[GEP_7]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4
	; CHECK-NEXT: store i32 [[MUL_3]], i32* [[GEP_10]], align 4
	; CHECK-NEXT: store i32 [[MUL_4]], i32* [[GEP_8]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 2
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
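The CHECK lines above can be read as a claim that the scalar and vector forms compute the same result. The following is a minimal C++ sketch of that equivalence, not code from the patch: the function names `scalarJumbled`, `shuffle`, and `vectorizedJumbled` are hypothetical, the store order is taken from the left-hand (scalar) CHECK lines, and the mask `<1, 3, 0, 2>` is taken from the right-hand `shufflevector` lines. `std::uint32_t` is used so that multiplication wraps, as `mul i32` does.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<std::uint32_t, 4>;

// Scalar form, following the left-hand CHECK lines:
// MUL_1..MUL_4 are in[i] * inn[i] for i = 0..3, stored to
// out[2], out[0], out[3], out[1] respectively.
Vec4 scalarJumbled(const Vec4 &in, const Vec4 &inn) {
  Vec4 out{};
  out[2] = in[0] * inn[0];
  out[0] = in[1] * inn[1];
  out[3] = in[2] * inn[2];
  out[1] = in[3] * inn[3];
  return out;
}

// Models a single-source shufflevector: result[i] = v[mask[i]].
Vec4 shuffle(const Vec4 &v, const std::array<std::size_t, 4> &mask) {
  Vec4 result{};
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = v[mask[i]];
  return result;
}

// Vector form, following the right-hand CHECK lines: one wide load per
// input, a shuffle with mask <1, 3, 0, 2>, one wide multiply, and one
// wide store to consecutive output slots.
Vec4 vectorizedJumbled(const Vec4 &in, const Vec4 &inn) {
  const std::array<std::size_t, 4> mask{1, 3, 0, 2};
  const Vec4 a = shuffle(in, mask);
  const Vec4 b = shuffle(inn, mask);
  Vec4 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = a[i] * b[i];
  return out;
}
```

Calling both functions on the same inputs should give identical arrays, which is the property the shuffle mask has to preserve when the loads are consecutive in memory but consumed in a different order.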
