Page MenuHomePhabricator

[SLP] Vectorize jumbled memory loads.
AcceptedPublic

Authored by ashahid on Aug 1 2017, 12:31 AM.

Details

Summary

This patch tries to vectorize loads of consecutive memory accesses, accessed
in non-consecutive or jumbled way. An earlier attempt was made with patch D26905
which was reverted back due to some basic issue with representing the 'use mask' of
jumbled accesses.

This patch fixes the mask representation by recording the 'use mask' in the usertree entry.

Change-Id: I9fe7f5045f065d84c126fa307ef6ebe0787296df

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev added inline comments.Feb 12 2018, 8:04 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742

Why you can't have just one shuffle here for all external uses?

1677–1678

Bad decision. It is better to use original VL here, rather than Sorted and add an additional array of sorted indieces. In this case you don't need all these additional numbers and all that complex logic to find the correct tree entry for the list of values.

3324

I think you can have default capture by value here rather than by reference.

Hi Alexey,

As I was trying to rebase this patch, it seems this overlaps with your "reverse load" patch. Could you take a look in this patch?

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

lib/Analysis/LoopAccessAnalysis.cpp
1112

I plan to do such improvement in separate patches.

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742

This is for in-tree multi uses of a single vector load where the uses has different masks/permutation.
This section of comment https://reviews.llvm.org/D36130#inline-326711
discussed it earlier. Also there is figure attached.

1661

I think yes, for example a couple of i64 loads considering minimum register width as 128-bit.

However, this check here was basically meant to indicate jumbled loads of size 2 is essentially a reversed load.

1677–1678

In fact earlier design in patch (https://reviews.llvm.org/D26905) was to use original VL, however there was counter argument to that which I don't remember exactly.

2229–2235

This is basically for multiple in-tree uses with different masks/permutation.

ABataev added a comment.EditedFeb 13 2018, 7:33 AM

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

lib/Analysis/LoopAccessAnalysis.cpp
1112

I just suggest to make universal at the very beginning, that's it

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742

I still don't understand what's the problem here.

  1. You need to perform the loads in some order.
  2. You sort the loads to be in the sequntially direct order and perform the vector load starting from the lowest address.
  3. You reshuffle the loaded vector value to the original order.

That's it, you have your loads in the required order. Just one shuffle is required. Why do you need some more? Also, I don't understand why do you need so many changes, why do you need additional indicies etc.

1661

It is going to be handled by the reverse loads patch

1677–1678

It is better to use original VL here, otherwise it will end with a lot of troubles and will require the whole bunch of changes in the vectorization process to find the perfect match for the vector of vectorized values. I don't think it is a good idea to have a lot of changes accross the whole module to handle jumbled loads.

3063

Is this correct? E->Scalars[0] is exactly VL0

Updates review comments and a test case.

Minor clean up.

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Hi Alexey,

Thanks for looking into it.I will update it accordingly.
BTW this patch is failing with its tests after the re-base on top of your patch. Do you foresee any conflicting code?

Probably, it is hard to say exactly without looking at the result.

No worry it was a merge issue, its fixed.

lib/Transforms/Vectorize/SLPVectorizer.cpp
736–742

Updated jumbled-load.ll captures this case where instead of gathering the second operand of MUL we can have required shuffle of the same loaded vector

1661

Yes, this check no more required.

1677–1678

In the context where we can have multiple user of loaded vector with different shuffle mask, the design is to represent these different shuffle mask for each user corresponding to the user's operand number. Having single sorted indices will not be sufficient for this.
Given the objective of handling multiple out of order uses changes are not that big I feel.

3063

Ah, both are same.

ABataev added inline comments.Feb 14 2018, 6:50 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678

Now I see what do you want to do. But I don't think that this the correct way to implement it. It complicates the whole vectorization process. I'd suggest to create different tree entries for each particular order of the loads and exclude loads from the check that the same instruction is used several times in different tree entries.
If you worry about several different loads of the same values, I think they will be optimized by instruction combiner.

ashahid added inline comments.Feb 16 2018, 9:46 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678

Off course this could have been a better solution but I was not sure of the impact it may have by breaking the single tree entry assumption. One problem I see is the TreeEntry lookup if multiple node with same scalar values are present.
I can use isSame() check to make sure correct tree entry is found, however it may become costly in case of PHI instruction fed by same vector Load.

ABataev added inline comments.Feb 16 2018, 10:29 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1677–1678

I think it is better to start with handling of single tree entry rather than trying to handle all possible situations in a single patch. I suggest to split this patch into 2 parts at least: 1. handling of tree entry with jumbled loads. 2. further improvements.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
8–11

These checks are not autogenerated, fix it. Moreover, it is recommended to commit these tests separately with the checks for the original version of the compiler and the update checks with the fixed version to demonstrate improvements.

Updated the patch to accomodate the review comments.

As suggested, now the reordering mask will be part of each tree entry. Also this update does not consider to optimize the reordered load for multiple operand for now.

By the way, take a look at my D43776 that does the same but in more general way

lib/Transforms/Vectorize/SLPVectorizer.cpp
1653

Why you can do this only if ReuseShuffleIndicies.empty()?

1658–1663

It is enough just to compare VL and Sorted. If they are the same, the loads are not shuffled

1666

Why you can't do to add vectorized tree entry if UserTreeIdx == -1?

1669

Each true or false argument must have to prepend comment with the name of the function parameter, related to this argument

2229

You can remove the last argument here

2814

Why do you need this condition?

3056–3067

Restore the original code here

3089

Remove this empty line

3324–3331

I rather doubt you need all that stuff. You can use original code

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
2

You need to add this test separately and show changes in it.

test/Transforms/SLPVectorizer/X86/jumbled-load-shuffle-placement.ll
2

You need to add this test separately and show changes in it

test/Transforms/SLPVectorizer/X86/jumbled-load-used-in-phi.ll
2

You need to add this test separately and show changes in it

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
64

You need to add this test separately and show changes in it

Will commit the tests as NFC.

Seems like I am not getting the mails from phabricator, what shall I do to get the mails?

Checked the patch D43776, seems it will make this patch redundant.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1653

This is to avoid the overlapping the UniqueValues reuse logic of your changes.

1658–1663

Sure it is, but this avoids the compare. So I thought having a boolean is preferable.

1666

My bad, this is not required.

1669

Ok

2229

Sure

2814

In the 2nd test of jumbled-load.ll the two operands of MUL is fed from the same loaded vector. The 1st operand is SHUFFLE of LOAD and the 2nd operand is the gather of the same scalar loads.
Query to getTreeEntry() will always return the node with the same vectorized value and hence both the operand of MUL will be fed the shuffled load.
This check is to avoid this scenario.

3056–3067

Thanks

3324–3331

This is required otherwise *multiuse*.ll test as well as PR32086.ll will fail because the lanes were recorded according to the order of scalar loads.

ashahid marked 9 inline comments as done.

Updated further review comments.

Hope this is fine.

ABataev added inline comments.Feb 28 2018, 9:49 AM
lib/Analysis/LoopAccessAnalysis.cpp
1112

What about this comment? Do you really need Sorted argument?

1125

PointerType *->auto *

1129–1131

I think there must be an assertion instead of this check.

1141

const SCEVConstant *->const auto *

1146–1148

This check better to move to SLPVectorizer.cpp, because the function can be used for masked load/store.

1161

for (unsigned I = 0, E = VL.size(); I < E; ++I)

1166

Actually Mask is a full copy of UseOrder, you don't need all that complex stuff here

lib/Transforms/Vectorize/SLPVectorizer.cpp
1653

Why you can't handle it? What's the problem?

1658–1663

Why do we need the compare?

2814

This scenario should happen in your patch, the instruction either vectorized, or gathered, but not both.

3324–3331

Again, it just may not happen in this patch

ABataev added inline comments.Feb 28 2018, 11:07 AM
lib/Analysis/LoopAccessAnalysis.cpp
1166

Oops, no, Mask is not a copy of UseOrder
But you can create it much simpler:

for (unsigned I = 0, E = VL.size(); I < E; ++I)
  Mask[UseOrder[I]] = I;
sanjoy removed a reviewer: sanjoy.Feb 28 2018, 11:34 AM
sanjoy removed a subscriber: sanjoy.
ashahid added inline comments.Feb 28 2018, 11:30 PM
lib/Analysis/LoopAccessAnalysis.cpp
1112

Yes, otherwise my test fails. Seems it breaks some assumption.

1166

Thanks

lib/Transforms/Vectorize/SLPVectorizer.cpp
1653

It was a thought,I have not checked yet. I will check.

1658–1663

I meant, if we dont use ShuffledLoad flag we have to compare VL vs Sorted instead.

2814

This check is to avoid feeding the generated SHUFFLE to both operand of MUL which is not the intention of the test case.

3324–3331

It does happen and this test fails.

ABataev added inline comments.Mar 2 2018, 10:59 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1669

No, use original VL here, do not use Sorted. In this case you won't need an additional argument in sortLoadAccesses and you don't need all that complex stuff with the lambda on line 3528

ashahid added inline comments.Mar 5 2018, 10:39 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1669

If I am not wrong, for LOADs, VL0 must be the 1st element of the buffer whose base address will be used for vector load.
So using VL will break this assumption.

ABataev added inline comments.Mar 6 2018, 6:18 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1669

Why? And why you can't choose the right VL0 during vectorization?

ashahid added inline comments.Mar 6 2018, 8:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1669

For example, if we have two arrays A[4] and B[1] laying one after another in memory and the selected VF is 4 for the scalar loads of A[1], A[2], A[0], A[3] in order of use, the generated vector load will load the elements A[1], A[2], A[3], B[1] which is not desired.

Of-course we can choose the right VL0 during vectorization but we have to compute it again here using the mask which can be avoided if we use Sorted VL.

If I am missing something?

ABataev added inline comments.Mar 6 2018, 8:42 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1669

You already store the mask in the tree entry and you can choose the right VL0 using this mask. Using Sorted VL complicates the whole vectorization process and, thus, adds some extra points for the incorrect vectorization. That's why I insist to use original VL and choose the correct VL0 during codegen.

ashahid added inline comments.Mar 6 2018, 9:08 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1669

Got it. Since you already have these improvements in this patch https://reviews.llvm.org/D43776 , I think it is better to get that through.

RKSimon added a subscriber: RKSimon.Apr 8 2018, 2:43 AM

@ashahid What's happening to this patch?

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMon, Oct 7, 5:02 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon reopened this revision.Mon, Oct 7, 6:08 AM
This revision is now accepted and ready to land.Mon, Oct 7, 6:08 AM