This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize jumbled memory loads.
AcceptedPublic

Authored by ashahid on Aug 1 2017, 12:31 AM.

Details

Summary

This patch tries to vectorize loads of consecutive memory accesses, accessed
in non-consecutive or jumbled way. An earlier attempt was made with patch D26905
which was reverted back due to some basic issue with representing the 'use mask' of
jumbled accesses.

This patch fixes the mask representation by recording the 'use mask' in the usertree entry.

Change-Id: I9fe7f5045f065d84c126fa307ef6ebe0787296df

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Patch update for fixing build bot failure:

This fix makes the place holder for Shuffle Mask from fixed array of 3 element
to an std::map. This need arises from the fact that a PHI node can have
any number of operand as incoming value.

Test performed:
LLVM lit test, 3 stage bootstrap build and LNT (Thanks to Hans and Daniel)

hans added a comment.Dec 5 2017, 10:18 AM

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Ayal added a comment.Dec 5 2017, 2:40 PM

Good catch. Add a LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
648

The fixed array

SmallVector<unsigned, 4> ShuffleMask[3];

of the previous version indeed cannot account for all operands. How about holding a

SmallVector<SmallVector<unsigned, 4>, 2> ShuffleMask;

instead of holding a map from 0,1,2,..,numOperands ?

677

Are both conditions really needed, or suffice say to check for -1 and assert positive indices are not too large?

2960

May be simpler to check instead ShuffleMask.count(OpdNum)

2977

clang-format

In D36130#945306, @hans wrote:

Patch update for fixing build bot failure:

I haven't looked at the patch at all, but I just tried it on a local Chrome build on Linux, and it seems to work for that.

Thanks Hans for triage.

In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in few of LNT Multisource bench mark. How to extract it for LIT test?

lib/Transforms/Vectorize/SLPVectorizer.cpp
648

I think this can be done. I will try.

677

Sure I will check. I am thinking 30000 as large indices threshold, do you have any number in mind?

2960

Quite right.

ashahid added inline comments.Dec 8 2017, 8:12 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
677

I tried but seems both conditions are needed as I am getting assertion "Idx < size()" for SmallVector<<SmallVector, 4> 2> ShuffleMask.

Updated the review comments.

Minor commented code clean up done.

Ayal added a comment.Dec 9 2017, 1:30 PM
In D36130#945728, @Ayal wrote:

Good catch. Add a LIT test?

It was asserting in few of LNT Multisource bench mark. How to extract it for LIT test?

Suffice to have a phi with 4 predecessors, where (at-least) the 4th needs a shuffle-mask.

lib/Transforms/Vectorize/SLPVectorizer.cpp
677

UserTreeIdx is the index of the User entry as we build the tree bottom-up, so it should always be between 0 and VectorizableTree.size()-1, except for -1 when creating the new entry for the root, which is User-less. So it should suffice to check if Idx is -1, and otherwise assert that Idx < size(), if desired, right?

678–679

Code below still uses emplace_back contrary to the discussion above.

May need to call UserTreeEntry->ShuffleMask.resize() if OpdNum is larger than its initial/current size, before setting UserTreeEntry->ShuffleMask[OpdNum] = tempMask. (Otherwise the original "LNT Multisource bench mark" asserts should trigger again?)

Suggest to add a test where the first operand does not need a shuffle but the second one does.

2714

See above discussion about replacing second condition with an assert.

2953

ditto

test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗(On Diff #126266)

Why add -debug?

Review comments updated and added lit tests.

ashahid marked 3 inline comments as done.

Review comments updated and added lit tests.

ashahid added inline comments.Dec 11 2017, 8:41 AM
test/Transforms/SLPVectorizer/X86/crash_cmpop.ll
1 ↗(On Diff #126266)

My bad, not intended.

Ayal accepted this revision.Dec 11 2017, 2:51 PM

This looks good to me, with a couple of last minor fixes.

Hope it stays in this time...

lib/Transforms/Vectorize/SLPVectorizer.cpp
684

alrea[d]y

2982

Can simply do for (unsigned Entry : ShuffleMask[OpdNum]) instead of iterating explicitly over all lanes and retrieving each UserTreeEntry->ShuffleMask[OpdNum][Lane].

test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
31 ↗(On Diff #126374)

Suggested to also have a test where the 2nd operand is a shuffle but the 1st one isn't, which will fail if shuffles are added using emplace_back().

ashahid marked 3 inline comments as done.

Updated test and review comment.

Bootstrap and LNT test underway.

ashahid closed this revision.Dec 12 2017, 7:09 PM
eastig added a subscriber: eastig.Dec 14 2017, 5:32 AM

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

Hi Shahid,

These changes caused 27.7% and 30.2% regressions on an AArch64 Juno board (http://lnt.llvm.org/db_default/v4/nts/83681):

MultiSource/Benchmarks/mediabench/gsm/toast/toast: 30.20%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm: 27.73%

We have the same benchmarks regressed on our AArch64 boards (Cortex-A53, Cortex-A57).

-Evgeny Astigeevich
The ARM Compiler Optimisation team

A problem report: https://bugs.llvm.org/show_bug.cgi?id=35673

sanjoy added a subscriber: sanjoy.Dec 19 2017, 4:03 PM
sanjoy added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1056

This should be a cast<>.

1084

LLVM style is to avoid using curly braces on single like for loops. Using std::iota would be even better.

lib/Transforms/Vectorize/SLPVectorizer.cpp
685

I think you should be able to do:

auto &OperandMask = UserTreeEntry->ShuffleMask[OpdNum];
assert(OperandMask.empty());
OperandMask.insert(OperandMask.end(), ShuffleMask.begin(), ShuffleMask.end());
1581

Not sure why you need NewVL here -- doesn't just using Sorted work?

2959

Might be cleaner to abstract (unsigned)OpdNum < UserTreeEntry->ShuffleMask.size() && !UserTreeEntry->ShuffleMask[OpdNum].empty() into a UserTreeEntry->hasShuffleMaskForOp(Index) helper.

2983

The cast to Value * should not be necessary.

3219

dyn_cast<XXX>(f)->g() should never be necessary. Either the dyn_cast can return null in which case you should check for that, or it can't and you should use cast<>. Also the cast of Vec to Instruction seems unnecessary: ShuffleVectorInst is an Instruction.

Ayal added inline comments.Dec 21 2017, 3:25 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
685

While we're at it, this should move under the if (UserTreeIdx != -1) to avoid checking if &VectorizableTree[UserTreeIdx] is null, as commented in https://reviews.llvm.org/D41324#inline-361435

1590

Should probably also check here that UserTreeIdx is not -1, to avoid creating a mask for the root with no place to hang it, as @sanjoy observed.

ashahid added inline comments.Dec 22 2017, 6:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
685

If we check for if (UserTreeIdx != -1 && ShuffledLoad) before the call of newTreeEntry(), we can avoid "UserTreeIdx != -1" check completely inside newTreeEntry().

1590

Yes, I had planned to do exactly this.

ashahid reopened this revision.Dec 28 2017, 11:04 PM
ashahid marked 8 inline comments as done.
ashahid added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3219

Here I am trying to ensure that the instructions are "ShuffleVectorInst" and "LoadInst" respectively.

Casting of Vec to Instruction, is to satisfy the membership of getOperand() which compiler otherwise report as error.

This revision is now accepted and ready to land.Dec 28 2017, 11:04 PM

Updates review comments.

Regression test and LNT passes, 3 stage bootstrap test underway.

Ayal added inline comments.Dec 29 2017, 7:31 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3219

Use isa instead of dyn_cast here:
if (Vec && dyn_cast<LoadInst>(cast<Instruction>(Vec)->getOperand(0))) {

or alternatively do something like:

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updated Ayal's comment accordingly

ashahid marked an inline comment as done.Jan 1 2018, 8:01 AM
Ayal added a comment.Jan 9 2018, 8:40 AM

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

Ayal added a comment.Jan 11 2018, 1:13 PM
In D36130#971181, @Ayal wrote:

This should fix the case observed by @sanjoy in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20171218/511721.html; please also include a testcase.

Test case, test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll, already included.

Ah, right, sorry, missed it.

This looks good to me, with only minor comments about the testcase.

Please see that @sanjoy approves too, as this mostly addresses issues he raised.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
27 ↗(On Diff #128388)

"SINK" is defined redundantly, as it is not used.

Could this be simplified by removing the float-to-int casts?

In general, it may suffice to check that there's no load of <4 x i32>, which would be jumbled. Checking that two of the lanes have been vectorized may be fragile, in case a modified cost model will decide it ain't worth it.

sanjoy accepted this revision.Jan 13 2018, 2:40 PM
sanjoy added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1044

The indent looks off here; can you please run clang-format?

lib/Transforms/Vectorize/SLPVectorizer.cpp
635

Optional: you can write return X; instead of if (X) return true; return false;.

1588

Nit: s/usefull/useful/

3218–3224

I think you can rewrite this more cleanly using an immediately-invoked function expression:

Value *Vec = [&]() {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
    if (auto *LI = dyn_cast<LoadInst>(SVI->getOperand(0)))
      return LI->getOperand(0);
  return E->VectorizedValue;
}();
ashahid marked an inline comment as not done.Jan 15 2018, 9:01 PM
ashahid added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3218–3224

I tried this IIFE, however I am getting an assertion "Tried to create extractelement operation on non-vector type!" for jumbled-load-multiuse.ll test. Do you see any issue in this code?

sanjoy added inline comments.Jan 15 2018, 10:54 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3218–3224

Yes, I think I should have written:

Value *Vec = [&]() {
  if (auto *SVI = dyn_cast<ShuffleVectorInst>(E->VectorizedValue))
    if (isa<LoadInst>(SVI->getOperand(0)))
      return SVI->getOperand(0);
  return E->VectorizedValue;
}();
Ayal added inline comments.Jan 15 2018, 11:49 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3218–3224

Yes, this simplifies the below "alternatively do something like:"

Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");
if (ShuffleVectorInst *Shuffle = dyn_cast<ShuffleVectorInst>(Vec))
  if (LoadInst *Load = dyn_cast<LoadInst>(Shuffle->getOperand(0)))
    Vec = Load;

Updates test case and stylistic review comments

Hi Ayal, Sanjoy,

The last update's review was pending for long. Off late, SLP has lots of changes so I will have to rebase but before rebasing please see if any more changes required in its current form.

Thanks in advance.

Ayal added a comment.Feb 10 2018, 11:48 PM

Hi Ayal, Sanjoy,

The last update's review was pending for long. Off late, SLP has lots of changes so I will have to rebase but before rebasing please see if any more changes required in its current form.

Thanks in advance.

This looks good to me, as commented earlier, but please see that @sanjoy approves too, as this mostly addresses issues he raised.

sanjoy accepted this revision.Feb 11 2018, 12:18 PM

I don't have any more coding style comments. I've not reviewed the actual semantic changes.

lib/Analysis/LoopAccessAnalysis.cpp
1097

Can you use std::iota here?

ABataev added inline comments.Feb 12 2018, 8:04 AM
lib/Analysis/LoopAccessAnalysis.cpp
1043

This function can be used for stores also, it is better to make it universal for stores/loads.

1087

It is better to use stable_sort rather than sort

1100

stable_sort

lib/Transforms/Vectorize/SLPVectorizer.cpp
1576

Is it possible at all that VL has less than 4 elements here?

1581

i->I, e->E. Variables must have Camel-like names.

2141–2147

You don't need so many shuffles, it is enough just to have just one.

ABataev added inline comments.Feb 12 2018, 8:04 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
642–648

Why you can't have just one shuffle here for all external uses?

1592–1593

Bad decision. It is better to use original VL here, rather than Sorted and add an additional array of sorted indieces. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.

3218–3224

I think you can have default capture by value here rather than by reference.

Hi Alexey,

As I was trying to rebase this patch, it seems this overlaps with your "reverse load" patch. Could you take a look in this patch?

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

lib/Analysis/LoopAccessAnalysis.cpp
1043

I plan to do such improvement in separate patches.

lib/Transforms/Vectorize/SLPVectorizer.cpp
642–648

This is for in-tree multi uses of a single vector load where the uses has different masks/permutation.
This section of comment https://reviews.llvm.org/D36130#inline-326711
discussed it earlier. Also there is figure attached.

1576

I think yes, for example a couple of i64 loads considering minimum register width as 128-bit.

However, this check here was basically meant to indicate jumbled loads of size 2 is essentially a reversed load.

1592–1593

In fact earlier design in patch (https://reviews.llvm.org/D26905) was to use original VL, however there was counter argument to that which I don't remember exactly.

2141–2147

This is basically for multiple in-tree uses with different masks/permutation.

ABataev added a comment.EditedFeb 13 2018, 7:33 AM

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

lib/Analysis/LoopAccessAnalysis.cpp
1043

I just suggest to make universal at the very beginning, that's it

lib/Transforms/Vectorize/SLPVectorizer.cpp
642–648

I still don't understand what's the problem here.

  1. You need to perform the loads in some order.
  2. You sort the loads to be in the sequntially direct order and perform the vector load starting from the lowest address.
  3. You reshuffle the loaded vector value to the original order.

That's it, you have your loads in the required order. Just one shuffle is required. Why do you need some more? Also, I don't understand why do you need so many changes, why do you need additional indicies etc.

1576

It is going to be handled by the reverse loads patch

1592–1593

It is better to use original VL here, otherwise it will end with a lot of troubles and will require the whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes accross the whole module to handle jumbled loads.

2960

Is this correct? E->Scalars[0] is exactly VL0

Updates review comments and a test case.

Minor clean up.

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

No worry it was a merge issue, its fixed.

lib/Transforms/Vectorize/SLPVectorizer.cpp
642–648

Updated jumbled-load.ll captures this case where instead of gathering the second operand of MUL we can have required shuffle of the same loaded vector

1576

Yes, this check no more required.

1592–1593

In the context where we can have multiple user of loaded vector with different shuffle mask, the design is to represent these different shuffle mask for each user corresponding to the user's operand number. Having single sorted indices will not be sufficient for this.
Given the objective of handling multiple out of order uses changes are not that big I feel.

2960

Ah, both are same.

ABataev added inline comments.Feb 14 2018, 6:50 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1592–1593

Now I see what do you want to do. But I don't think that this the correct way to implement it. It complicates the whole vectorization process. I'd suggest to create different tree entries for each particular order of the loads and exclude loads from the check that the same instruction is used several times in different tree entries.
If you worry about several different loads of the same values, I think they will be optimized by instruction combiner.

ashahid added inline comments.Feb 16 2018, 9:46 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1592–1593

Off course this could have been a better solution but I was not sure of the impact it may have by breaking the single tree entry assumption. One problem I see is the TreeEntry lookup if multiple node with same scalar values are present.
I can use isSame() check to make sure correct tree entry is found, however it may become costly in case of PHI instruction fed by same vector Load.

ABataev added inline comments.Feb 16 2018, 10:29 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1592–1593

I think it is better to start with handling of single tree entry rather than trying to handle all possible situations in a single patch. I suggest to split this patch into 2 parts at least: 1. handling of tree entry with jumbled loads. 2. further improvements.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
7–10 ↗(On Diff #134178)

These checks are not autogenerated, fix it. Moreover, it is recommended to commit these tests separately with the checks for the original version of the compiler and the update checks with the fixed version to demonstrate improvements.

Updated the patch to accomodate the review comments.

As suggested, now the reordering mask will be part of each tree entry. Also this update does not consider to optimize the reordered load for multiple operand for now.

By the way, take a look at my D43776 that does the same but in more general way

lib/Transforms/Vectorize/SLPVectorizer.cpp
1575

Why you can do this only if ReuseShuffleIndicies.empty()?

1580–1585

It is enough just to compare VL and Sorted. If they are the same, the loads are not shuffled

1588

Why you can't do to add vectorized tree entry if UserTreeIdx == -1?

1591

Each true or false argument must have to prepend comment with the name of the function parameter, related to this argument

2141

You can remove the last argument here

2706

Why do you need this condition?

2953

Restore the original code here

2975

Remove this empty line

3218–3225

I rather doubt you need all that stuff. You can use original code

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
1 ↗(On Diff #136070)

You need to add this test separately and show changes in it.

test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
1 ↗(On Diff #136070)

You need to add this test separately and show changes in it

test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll
1 ↗(On Diff #136070)

You need to add this test separately and show changes in it

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
64

You need to add this test separately and show changes in it

Will commit the tests as NFC.

Seems like I am not getting the mails from phabricator, what shall I do to get the mails?

Checked the patch D43776, seems it will make this patch redundant.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1575

This is to avoid the overlapping the UniqueValues reuse logic of your changes.

1580–1585

Sure it is, but this avoids the compare. So I thought having a boolean is preferable.

1588

My bad, this is not required.

1591

Ok

2141

Sure

2706

In the 2nd test of jumbled-load.ll the two operands of MUL is fed from the same loaded vector. The 1st operand is SHUFFLE of LOAD and the 2nd operand is the gather of the same scalar loads.
Query to getTreeEntry() will always return the node with the same vectorized value and hence both the operand of MUL will be fed the shuffled load.
This check is to avoid this scenario.

2953

Thanks

3218–3225

This is required otherwise *multiuse*.ll test as well as PR32086.ll will fail because the lanes were recorded according to the order of scalar loads.

ashahid marked 9 inline comments as done.

Updated further review comments.

Hope this is fine.

ABataev added inline comments.Feb 28 2018, 9:49 AM
lib/Analysis/LoopAccessAnalysis.cpp
1043

What about this comment? Do you really need Sorted argument?

1056

PointerType *->auto *

1060–1062

I think there must be an assertion instead of this check.

1072

const SCEVConstant *->const auto *

1077–1079

This check better to move to SLPVectorizer.cpp, because the function can be used for masked load/store.

1092

for (unsigned I = 0, E = VL.size(); I < E; ++I)

1097

Actually Mask is a full copy of UseOrder, you don't need all that complex stuff here

lib/Transforms/Vectorize/SLPVectorizer.cpp
1575

Why you can't handle it? What's the problem?

1580–1585

Why do we need the compare?

2706

This scenario should happen in your patch, the instruction either vectorized, or gathered, but not both.

3218–3225

Again, it just may not happen in this patch

ABataev added inline comments.Feb 28 2018, 11:07 AM
lib/Analysis/LoopAccessAnalysis.cpp
1097

Oops, no, Mask is not a copy of UseOrder
But you can create it much simpler:

for (unsigned I = 0, E = VL.size(); I < E; ++I)
  Mask[UseOrder[I]] = I;
sanjoy removed a reviewer: sanjoy.Feb 28 2018, 11:34 AM
sanjoy removed a subscriber: sanjoy.
ashahid added inline comments.Feb 28 2018, 11:30 PM
lib/Analysis/LoopAccessAnalysis.cpp
1043

Yes, otherwise my test fails. Seems it breaks some assumption.

1097

Thanks

lib/Transforms/Vectorize/SLPVectorizer.cpp
1575

It was a thought,I have not checked yet. I will check.

1580–1585

I meant, if we dont use ShuffledLoad flag we have to compare VL vs Sorted instead.

2706

This check is to avoid feeding the generated SHUFFLE to both operand of MUL which is not the intention of the test case.

3218–3225

It does happen and this test fails.

ABataev added inline comments.Mar 2 2018, 10:59 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1591

No, use original VL here, do not use Sorted. In this case you won't need an additional argument in sortLoadAccesses and you don't need all that complex stuff with the lambda on line 3528

ashahid added inline comments.Mar 5 2018, 10:39 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1591

If I am not wrong, for LOADs, VL0 must be the 1st element of the buffer whose base address will be used for vector load.
So using VL will break this assumption.

ABataev added inline comments.Mar 6 2018, 6:18 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1591

Why? And why you can't choose the right VL0 during vectorization?

ashahid added inline comments.Mar 6 2018, 8:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1591

For example, if we have two arrays A[4] and B[1] laying one after another in memory and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use, the generated vector load will load the elements A[1], A[2], A[3], B[1] which is not desired.

Of-course we can choose the right VL0 during vectorization but we have to compute it again here using the mask which can be avoided if we use Sorted VL.

If I am missing something?

ABataev added inline comments.Mar 6 2018, 8:42 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1591

You already store the mask in the tree entry and you can choose the right VL0 using this mask. Using Sorted VL complicates the whole vectorization process and, thus, adds some extra points for the incorrect vectorization. That's why I insist to use original VL and choose the correct VL0 during codegen.

ashahid added inline comments.Mar 6 2018, 9:08 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1591

Got it. Since you already have these improvements in this patch https://reviews.llvm.org/D43776 , I think it is better to get that through.

RKSimon added a subscriber: RKSimon.Apr 8 2018, 2:43 AM

@ashahid What's happening to this patch?

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2019, 5:02 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon reopened this revision.Oct 7 2019, 6:08 AM
This revision is now accepted and ready to land.Oct 7 2019, 6:08 AM