This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize jumbled memory loads.
AcceptedPublic

Authored by ashahid on Aug 1 2017, 12:31 AM.

Details

Summary

This patch tries to vectorize loads of consecutive memory accesses, accessed
in non-consecutive or jumbled way. An earlier attempt was made with patch D26905
which was reverted back due to some basic issue with representing the 'use mask' of
jumbled accesses.

This patch fixes the mask representation by recording the 'use mask' in the usertree entry.

Change-Id: I9fe7f5045f065d84c126fa307ef6ebe0787296df

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Patch update for fixing build bot failure:

This fix makes the place holder for Shuffle Mask from fixed array of 3 element
to an std::map. This need arises from the fact that a PHI node can have
any number of operand as incoming value.

Test performed:
LLVM lit test, 3 stage bootstrap build and LNT (Thanks to Hans and Daniel)

hans added a comment.Dec 5 2017, 10:18 AM

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Ayal added a comment.Dec 5 2017, 2:40 PM

Good catch. Add a LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
736 ↗(On Diff #125470)

The fixed array

SmallVector<unsigned, 4> ShuffleMask[3];

of the previous version indeed cannot account for all operands. How about holding a

SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask;

instead of holding a map from 0,1,2,..,numOperands ?

766 ↗(On Diff #125470)

Are both conditions really needed, or suffice say to check for -1 and assert positive indices are not too large?

3054 ↗(On Diff #125470)

May be simpler to check instead ShuffleMask.count(OpdNum)

3085 ↗(On Diff #125470)

clang-format

In D36130#945306, @hans wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Thanks Hans for triage.

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in few of LNT Multisource bench mark. How to extract it for LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
736 ↗(On Diff #125470)

I think this can be done. I will try.

766 ↗(On Diff #125470)

Sure I will check. I am thinking 30000 as large indices threshold, do you have any number in mind?

3054 ↗(On Diff #125470)

Quite right.

ashahid added inline comments.Dec 8 2017, 8:12 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
766 ↗(On Diff #125470)

I tried but seems both conditions are needed as I am getting assertion "Idx < size()" for SmallVector<<SmallVector, 4> 2> ShuffleMask.

Updated the review comments.

Minor commented code clean up done.

Ayal added a comment.Dec 9 2017, 1:30 PM
In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in few of LNT Multisource bench mark. How to extract it for LIT test?

Suffice to have a phi with 4 predecessors, where (at-least) the 4th needs a shuffle-mask.

lib/Transforms/Vectorize/SLPVectorizer.cpp
678–679 ↗(On Diff #112854)

Code below still uses emplace_back contrary to the discussion above.

May need to call UserTreeEntry->ShuffleMask.resize() if OpdNum is larger than its initial/current size, before setting UserTreeEntry->ShuffleMask[OpdNum] = tempMask. (Otherwise the original "LNT Multisource bench mark" asserts should trigger again?)

Suggest to add a test where the first operand does not need a shuffle but the second one does.

766 ↗(On Diff #125470)

UserTreeIdx is the index of the User entry as we build the tree bottom-up, so it should always be between 0 and VectorizableTree.size()-1, except for -1 when creating the new entry for the root, which is User-less. So it should suffice to check if Idx is -1, and otherwise assert that Idx < size(), if desired, right?

2801 ↗(On Diff #126266)

See above discussion about replacing second condition with an assert.

3044 ↗(On Diff #126266)

ditto

test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗(On Diff #126266)

Why add -debug?

Review comments updated and added lit tests.

ashahid marked 3 inline comments as done.

Review comments updated and added lit tests.

ashahid added inline comments.Dec 11 2017, 8:41 AM
test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗(On Diff #126266)

My bad, not intended.

Ayal accepted this revision.Dec 11 2017, 2:51 PM

This looks good to me, with a couple of last minor fixes.

Hope it stays in this time...

lib/Transforms/Vectorize/SLPVectorizer.cpp
773 ↗(On Diff #126374)

alrea[d]y

3090 ↗(On Diff #126374)

Can simply do for (unsigned Entry : ShuffleMask[OpdNum]) instead of iterating explicitly over all lanes and retrieving each UserTreeEntry->ShuffleMask[OpdNum][Lane].

test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
31 ↗(On Diff #126374)

Suggested to also have a test where the 2nd operand is a shuffle but the 1st one isn't, which will fail if shuffles are added using emplace_back().

ashahid marked 3 inline comments as done.

Updated test and review comment.

Bootstrap and LNT test underway.

ashahid closed this revision.Dec 12 2017, 7:09 PM
eastig added a subscriber: eastig.Dec 14 2017, 5:32 AM

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

A problem report: https://bugs.llvm.org/show_bug.cgi?id=35673

sanjoy added a subscriber: sanjoy.Dec 19 2017, 4:03 PM
sanjoy added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1125 ↗(On Diff #126571)

This should be a cast<>.

1153 ↗(On Diff #126571)

LLVM style is to avoid using curly braces on single like for loops. Using std::iota would be even better.

lib/Transforms/Vectorize/SLPVectorizer.cpp
774 ↗(On Diff #126571)

I think you should be able to do:

auto &OperandMask = UserTreeEntry->ShuffleMask[OpdNum];
assert(OperandMask.empty());
OperandMask.insert(OperandMask.end(), ShuffleMask.begin(), ShuffleMask.end());
1666 ↗(On Diff #126571)

Not sure why you need NewVL here -- doesn't just using Sorted work?

3054 ↗(On Diff #126571)

Might be cleaner to abstract (unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() && !UserTreeEntry->ShuffleMask[OpdNum].empty() into a UserTreeEntry->hasShuffleMaskForOp(Index) helper.

3091 ↗(On Diff #126571)

The cast to Value * should not be necessary.

3319 ↗(On Diff #126571)

dyn_cast<XXX>(f)->g() should never be necessary. Either the dyn_cast can return null in which case you should check for that, or it can't and you should use cast<>. Also the cast of Vec to Instruction seems unnecessary: ShuffleVectorInst is an Instruction.

Ayal added inline comments.Dec 21 2017, 3:25 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
774 ↗(On Diff #126571)

While we're at it, this should move under the if (UserTreeIdx != -1) to avoid checking if &VectorizableTree[UserTreeIdx] is null, as commented in https://reviews.llvm.org/D41324#inline-361435

1675 ↗(On Diff #126571)

Should probably also check here that UserTreeIdx is not -1, to avoid creating a mask for the root with no place to hang it, as @sanjoy observed.

ashahid added inline comments.Dec 22 2017, 6:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
774 ↗(On Diff #126571)

If we check for if (UserTreeIdx != -1 && ShuffledLoad) before the call of newTreeEntry(), we can avoid "UserTreeIdx != -1" check completely inside newTreeEntry().

1675 ↗(On Diff #126571)

Yes, I had planned to do exactly this.

ashahid reopened this revision.Dec 28 2017, 11:04 PM
ashahid marked 8 inline comments as done.
ashahid added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3319 ↗(On Diff #126571)

Here I am trying to ensure that the instructions are "ShuffleVectorInst" and "LoadInst" respectively.

Casting of Vec to Instruction, is to satisfy the membership of getOperand() which compiler otherwise report as error.

This revision is now accepted and ready to land.Dec 28 2017, 11:04 PM

Updates review comments.

Regression test and LNT passes, 3 stage bootstrap test underway.

Ayal added inline comments.Dec 29 2017, 7:31 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3327 ↗(On Diff #128320)

Use isa instead of dyn_cast here:
if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {

or alternatively do something like:

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updated Ayal's comment accordingly

ashahid marked an inline comment as done.Jan 1 2018, 8:01 AM
Ayal added a comment.Jan 9 2018, 8:40 AM

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

Ayal added a comment.Jan 11 2018, 1:13 PM
In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

Ah, right, sorry, missed it.

This looks good to me, with only minor comments about the testcase.

Please see that @sanjoy approves too, as this mostly addresses issues he raised.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
27 ↗(On Diff #128388)

"SINK" is defined redundantly, as it is not used.

Could this be simplified by removing the float-to-int casts?

In general, it may suffice to check that there's no load of <4 x i32>, which would be jumbled. Checking that two of the lanes have been vectorized may be fragile, in case a modified cost model will decide it ain't worth it.

sanjoy accepted this revision.Jan 13 2018, 2:40 PM
sanjoy added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #128388)

The indent looks off here; can you please run clang-format?

lib/Transforms/Vectorize/SLPVectorizer.cpp
723 ↗(On Diff #128388)

Optional: you can write return X; instead of if (X) return true; return false;.

1675 ↗(On Diff #128388)

Nit: s/usefull/useful/

3326 ↗(On Diff #128388)

I think you can rewrite this more cleanly using an immediately-invoked function expression:

Value *Vec = [&]() {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
    if (auto *LI = dyn_cast<LoadInst>(SVI->getOperand(0)))
      return LI->getOperand(0);
  return E->VectorizedValue;
}();
ashahid marked an inline comment as not done.Jan 15 2018, 9:01 PM
ashahid added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3326 ↗(On Diff #128388)

I tried this IIFE, however I am getting an assertion "Tried to create extractelement operation on non-vector type!" for jumbled-load-multiuse.ll test. Do you see any issue in this code?

sanjoy added inline comments.Jan 15 2018, 10:54 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3326 ↗(On Diff #128388)

Yes, I think I should have written:

Value *Vec = [&]() {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
    if (isa<LoadInst>(SVI->getOperand(0)))
      return SVI->getOperand(0);
  return E->VectorizedValue;
}();
Ayal added inline comments.Jan 15 2018, 11:49 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3326 ↗(On Diff #128388)

Yes, this simplifies the below "alternatively do something like:"

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updates test case and stylistic review comments

Hi Ayal, Sanjoy,

The last update's review was pending for long. Off late, SLP has lots of changes so I will have to rebase but before rebasing please see if any more changes required in its current form.

Thanks in advance.

Ayal added a comment.Feb 10 2018, 11:48 PM

Hi Ayal, Sanjoy,

The last update's review was pending for long. Off late, SLP has lots of changes so I will have to rebase but before rebasing please see if any more changes required in its current form.

Thanks in advance.

This looks good to me, as commented earlier, but please see that @sanjoy approves too, as this mostly addresses issues he raised.

sanjoy accepted this revision.Feb 11 2018, 12:18 PM

I don't have any more coding style comments. I've not reviewed the actual semantic changes.

lib/Analysis/LoopAccessAnalysis.cpp
1166 ↗(On Diff #129968)

Can you use std::iota here?

ABataev added inline comments.Feb 12 2018, 8:04 AM
lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗(On Diff #129968)

This function can be used for stores also, it is better to make it universal for stores/loads.

1156 ↗(On Diff #129968)

It is better to use stable_sort rather than sort

1169 ↗(On Diff #129968)

stable_sort

lib/Transforms/Vectorize/SLPVectorizer.cpp
1661 ↗(On Diff #129968)

Is it possible at all that VL has less than 4 elements here?

1666 ↗(On Diff #129968)

i->I, e->E. Variables must have Camel-like names.

2229–2235 ↗(On Diff #129968)

You don't need so many shuffles, it is enough just to have just one.

ABataev added inline comments.Feb 12 2018, 8:04 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗(On Diff #129968)

Why you can't have just one shuffle here for all external uses?

1677–1678 ↗(On Diff #129968)

Bad decision. It is better to use original VL here, rather than Sorted and add an additional array of sorted indieces. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.

3324 ↗(On Diff #129968)

I think you can have default capture by value here rather than by reference.

Hi Alexey,

As I was trying to rebase this patch, it seems this overlaps with your "reverse load" patch. Could you take a look in this patch?

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗(On Diff #129968)

I plan to do such improvement in separate patches.

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗(On Diff #129968)

This is for in-tree multi uses of a single vector load where the uses has different masks/permutation.
This section of comment https://reviews.llvm.org/D36130#inline-326711
discussed it earlier. Also there is figure attached.

1661 ↗(On Diff #129968)

I think yes, for example a couple of i64 loads considering minimum register width as 128-bit.

However, this check here was basically meant to indicate jumbled loads of size 2 is essentially a reversed load.

1677–1678 ↗(On Diff #129968)

In fact earlier design in patch (https://reviews.llvm.org/D26905) was to use original VL, however there was counter argument to that which I don't remember exactly.

2229–2235 ↗(On Diff #129968)

This is basically for multiple in-tree uses with different masks/permutation.

ABataev added a comment.EditedFeb 13 2018, 7:33 AM

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗(On Diff #129968)

I just suggest to make universal at the very beginning, that's it

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗(On Diff #129968)

I still don't understand what's the problem here.

  1. You need to perform the loads in some order.
  2. You sort the loads to be in the sequntially direct order and perform the vector load starting from the lowest address.
  3. You reshuffle the loaded vector value to the original order.

That's it, you have your loads in the required order. Just one shuffle is required. Why do you need some more? Also, I don't understand why do you need so many changes, why do you need additional indicies etc.

1661 ↗(On Diff #129968)

It is going to be handled by the reverse loads patch

1677–1678 ↗(On Diff #129968)

It is better to use original VL here, otherwise it will end with a lot of troubles and will require the whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes accross the whole module to handle jumbled loads.

3063 ↗(On Diff #129968)

Is this correct? E->Scalars[0] is exactly VL0

Updates review comments and a test case.

Minor clean up.

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

No worry it was a merge issue, its fixed.

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742 ↗(On Diff #129968)

Updated jumbled-load.ll captures this case where instead of gathering the second operand of MUL we can have required shuffle of the same loaded vector

1661 ↗(On Diff #129968)

Yes, this check no more required.

1677–1678 ↗(On Diff #129968)

In the context where we can have multiple user of loaded vector with different shuffle mask, the design is to represent these different shuffle mask for each user corresponding to the user's operand number. Having single sorted indices will not be sufficient for this.
Given the objective of handling multiple out of order uses changes are not that big I feel.

3063 ↗(On Diff #129968)

Ah, both are same.

ABataev added inline comments.Feb 14 2018, 6:50 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678 ↗(On Diff #129968)

Now I see what do you want to do. But I don't think that this the correct way to implement it. It complicates the whole vectorization process. I'd suggest to create different tree entries for each particular order of the loads and exclude loads from the check that the same instruction is used several times in different tree entries.
If you worry about several different loads of the same values, I think they will be optimized by instruction combiner.

ashahid added inline comments.Feb 16 2018, 9:46 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678 ↗(On Diff #129968)

Off course this could have been a better solution but I was not sure of the impact it may have by breaking the single tree entry assumption. One problem I see is the TreeEntry lookup if multiple node with same scalar values are present.
I can use isSame() check to make sure correct tree entry is found, however it may become costly in case of PHI instruction fed by same vector Load.

ABataev added inline comments.Feb 16 2018, 10:29 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678 ↗(On Diff #129968)

I think it is better to start with handling of single tree entry rather than trying to handle all possible situations in a single patch. I suggest to split this patch into 2 parts at least: 1. handling of tree entry with jumbled loads. 2. further improvements.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
7–10 ↗(On Diff #134178)

These checks are not autogenerated, fix it. Moreover, it is recommended to commit these tests separately with the checks for the original version of the compiler and the update checks with the fixed version to demonstrate improvements.

Updated the patch to accomodate the review comments.

As suggested, now the reordering mask will be part of each tree entry. Also this update does not consider to optimize the reordered load for multiple operand for now.

By the way, take a look at my D43776 that does the same but in more general way

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗(On Diff #136070)

Why you can do this only if ReuseShuffleIndicies.empty()?

1649–1654 ↗(On Diff #136070)

It is enough just to compare VL and Sorted. If they are the same, the loads are not shuffled

1657 ↗(On Diff #136070)

Why you can't do to add vectorized tree entry if UserTreeIdx == -1?

1660 ↗(On Diff #136070)

Each true or false argument must have to prepend comment with the name of the function parameter, related to this argument

2279 ↗(On Diff #136070)

You can remove the last argument here

2899 ↗(On Diff #136070)

Why do you need this condition?

3251 ↗(On Diff #136070)

Restore the original code here

3287 ↗(On Diff #136070)

Remove this empty line

3528–3533 ↗(On Diff #136070)

I rather doubt you need all that stuff. You can use original code

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
1 ↗(On Diff #136070)

You need to add this test separately and show changes in it.

test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
1 ↗(On Diff #136070)

You need to add this test separately and show changes in it

test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll
1 ↗(On Diff #136070)

You need to add this test separately and show changes in it

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
64 ↗(On Diff #136070)

You need to add this test separately and show changes in it

Will commit the tests as NFC.

Seems like I am not getting the mails from phabricator, what shall I do to get the mails?

Checked the patch D43776, seems it will make this patch redundant.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗(On Diff #136070)

This is to avoid the overlapping the UniqueValues reuse logic of your changes.

1649–1654 ↗(On Diff #136070)

Sure it is, but this avoids the compare. So I thought having a boolean is preferable.

1657 ↗(On Diff #136070)

My bad, this is not required.

1660 ↗(On Diff #136070)

Ok

2279 ↗(On Diff #136070)

Sure

2899 ↗(On Diff #136070)

In the 2nd test of jumbled-load.ll the two operands of MUL is fed from the same loaded vector. The 1st operand is SHUFFLE of LOAD and the 2nd operand is the gather of the same scalar loads.
Query to getTreeEntry() will always return the node with the same vectorized value and hence both the operand of MUL will be fed the shuffled load.
This check is to avoid this scenario.

3251 ↗(On Diff #136070)

Thanks

3528–3533 ↗(On Diff #136070)

This is required otherwise *multiuse*.ll test as well as PR32086.ll will fail because the lanes were recorded according to the order of scalar loads.

ashahid marked 9 inline comments as done.

Updated further review comments.

Hope this is fine.

ABataev added inline comments.Feb 28 2018, 9:49 AM
lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗(On Diff #129968)

What about this comment? Do you really need Sorted argument?

1125 ↗(On Diff #136311)

PointerType *->auto *

1129–1131 ↗(On Diff #136311)

I think there must be an assertion instead of this check.

1141 ↗(On Diff #136311)

const SCEVConstant *->const auto *

1146–1148 ↗(On Diff #136311)

This check better to move to SLPVectorizer.cpp, because the function can be used for masked load/store.

1161 ↗(On Diff #136311)

for (unsigned I = 0, E = VL.size(); I < E; ++I)

1166 ↗(On Diff #136311)

Actually Mask is a full copy of UseOrder, you don't need all that complex stuff here

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗(On Diff #136070)

Why you can't handle it? What's the problem?

1649–1654 ↗(On Diff #136070)

Why do we need the compare?

2899 ↗(On Diff #136070)

This scenario should happen in your patch, the instruction either vectorized, or gathered, but not both.

3528–3533 ↗(On Diff #136070)

Again, it just may not happen in this patch

ABataev added inline comments.Feb 28 2018, 11:07 AM
lib/Analysis/LoopAccessAnalysis.cpp
1166 ↗(On Diff #136311)

Oops, no, Mask is not a copy of UseOrder
But you can create it much simpler:

for (unsigned I = 0, E = VL.size(); I < E; ++I)
  Mask[UseOrder[I]] = I;
sanjoy removed a reviewer: sanjoy.Feb 28 2018, 11:34 AM
sanjoy removed a subscriber: sanjoy.
ashahid added inline comments.Feb 28 2018, 11:30 PM
lib/Analysis/LoopAccessAnalysis.cpp
1112 ↗(On Diff #129968)

Yes, otherwise my test fails. Seems it breaks some assumption.

1166 ↗(On Diff #136311)

Thanks

lib/Transforms/Vectorize/SLPVectorizer.cpp
1644 ↗(On Diff #136070)

It was a thought,I have not checked yet. I will check.

1649–1654 ↗(On Diff #136070)

I meant, if we dont use ShuffledLoad flag we have to compare VL vs Sorted instead.

2899 ↗(On Diff #136070)

This check is to avoid feeding the generated SHUFFLE to both operand of MUL which is not the intention of the test case.

3528–3533 ↗(On Diff #136070)

It does happen and this test fails.

ABataev added inline comments.Mar 2 2018, 10:59 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗(On Diff #136311)

No, use original VL here, do not use Sorted. In this case you won't need an additional argument in sortLoadAccesses and you don't need all that complex stuff with the lambda on line 3528

ashahid added inline comments.Mar 5 2018, 10:39 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗(On Diff #136311)

If I am not wrong, for LOADs, VL0 must be the 1st element of the buffer whose base address will be used for vector load.
So using VL will break this assumption.

ABataev added inline comments.Mar 6 2018, 6:18 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗(On Diff #136311)

Why? And why you can't choose the right VL0 during vectorization?

ashahid added inline comments.Mar 6 2018, 8:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗(On Diff #136311)

For example, if we have two arrays A[4] and B[1] laying one after another in memory and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use, the generated vector load will load the elements A[1], A[2], A[3], B[1] which is not desired.

Of-course we can choose the right VL0 during vectorization but we have to compute it again here using the mask which can be avoided if we use Sorted VL.

If I am missing something?

ABataev added inline comments.Mar 6 2018, 8:42 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗(On Diff #136311)

You already store the mask in the tree entry and you can choose the right VL0 using this mask. Using Sorted VL complicates the whole vectorization process and, thus, adds some extra points for the incorrect vectorization. That's why I insist to use original VL and choose the correct VL0 during codegen.

ashahid added inline comments.Mar 6 2018, 9:08 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1660 ↗(On Diff #136311)

Got it. Since you already have these improvements in this patch https://reviews.llvm.org/D43776 , I think it is better to get that through.

RKSimon added a subscriber: RKSimon.Apr 8 2018, 2:43 AM

@ashahid What's happening to this patch?

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2019, 5:02 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon reopened this revision.Oct 7 2019, 6:08 AM
This revision is now accepted and ready to land.Oct 7 2019, 6:08 AM