This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.
ClosedPublic

Authored by ashahid on Nov 21 2016, 1:27 AM.

Details

Summary

This patch improves the capability of SLPVectorizer pass to vectorize the loads of memory accesses in jumbled manner by using "load + shufflevector" IR instructions. The jumbled scalar loads will be sorted while building the tree and these accesses will be marked to generate "shufflevector" after the vectorized load with proper mask.

Diff Detail

Event Timeline

ashahid retitled this revision from to [SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way..
ashahid updated this object.
ashahid added reviewers: mkuper, hfinkel, mssimpso.
ashahid added a subscriber: llvm-commits.

Some minor comments - mainly code style etc.

lib/Analysis/LoopAccessAnalysis.cpp
1064

clang-format this

1066

This can probably be replaced with a for range loop:

for (auto *Val : VL) {

and then replace the uses of VL[i] with Val

1075

use auto* for dyn_cast return.

1081

newVL is unused?

1083

Use auto? You can drop the braces as well.

lib/Transforms/Vectorize/SLPVectorizer.cpp
468

This might be simplified with a for range loop and use of llvm:none_of / any_of?

1226

Is it worth breaking here once we know that shuffledLoad is false?

Remove the braces if you can.

2580

for range loop?

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3

Possibly commit this test to trunk with the current output generated by utils/update_test_checks.py?

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4

It'd be better if there was more context to this additional shuffle - regenerate + commit the current output with utils/update_test_checks.py ? For an IR loop it's not that large an output.

mkuper edited edge metadata.Nov 21 2016, 6:11 PM

This basically fixes PR28474, right? Does it work correctly on the test-cases there?

lib/Analysis/LoopAccessAnalysis.cpp
1065

Why is this a multimap?

1067

Could you add some documentation to explain what exactly this does?

lib/Transforms/Vectorize/SLPVectorizer.cpp
466

I think you may want a different name - this doesn't actually check whether the scalars are jumbled, it checks whether they're all present. That is, it'll return true even if they're all in-order.

468

Would it make sense to pre-sort both arrays, and then check the two sorted arrays are equal? This would make it O(nlogn) instead of O(n^2)
(I'm not sure sort based on what, though - as well as the actual gain, since I guess VL.size() is small in practice.)

1196–1197

This TODO gets done.

1215

Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right?

1215

Why not for VL.size() == 2?

1217

This looks rather weird. Can you make it more idiomatic?

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
10

Please add a test that has several load packets (e.g. multiplies one load sequence by another load sequence).

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4

Yes, I'd be interested to know if we added a shuffle here, or just moved a shuffle from the store side to the load side (which makes sense).

Hi Simon, Michael

Thanks for the comments. Pls find the response inlined.

Thanks,
Shahid

lib/Analysis/LoopAccessAnalysis.cpp
1065

This is because the elements in the multimap follow a certain order, so using this will ensure that the values are sorted accordingly.

lib/Transforms/Vectorize/SLPVectorizer.cpp
466

What about isFoundJumbled()?

468

Pre-sorting would require two calls for sort and then compare, IMO, for the given small VL.size it would not make much difference. However I am open to other views.

1196–1197

Yes, that's right.

1215

A jumbled VL of VL.size() == 2 is essentially a case of reversed VL. Considering the tradeoff between compile time of extra buildTree() for VL.size==2 vs additional runtime for shufflevector, I opted for extra compile time over extra runtime.

1217

Sure

1226

Seems yes.

2580

Sure

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3

By "current output" do you mean output generated by utils/update_test_checks.py with this patch by ?

10

Sure

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4

By "current output" do you mean output generated by utils/update_test_checks.py with this patch by ? Pls explain.

RKSimon added inline comments.Nov 22 2016, 1:31 PM
test/Transforms/SLPVectorizer/X86/jumbled-load.ll
3

I mean commit the current (pre-patch) codegen so that this patch demonstrates the diff.

ashahid edited edge metadata.

Updates the review comments and the also updates the test case with more context.

Sorry for the delay, I was on vacation.

lib/Analysis/LoopAccessAnalysis.cpp
1076

Why are you turning a constant into a SCEV and back into a constant?

1077

If you know this is a SCEVConstant, this should be a cast<>. Otherwise, you need to check the dyn_cast<> actually succeeded.

lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416

Please add an explanation for what the VL parameters means here.

468

To be honest, I'm not sure - so I'd appreciate another opinion about pre-sorting.
Matt/Hal/Simon?

1215

The trade off here is more one of code complexity - is the gain in compile time worth having all the additional logic present for both the "fully unsorted" case and the "reversed" case.

mkuper added inline comments.Nov 30 2016, 1:02 PM
lib/Analysis/LoopAccessAnalysis.cpp
1062

The LLVM coding standard is that function names start with a non-capital, and variable names start with a capital.
(There are some exceptions for functions, but this is mostly in old code.)

1065

It doesn't seem right to use a multimap just for the sorting behavior.
I think you can find a more appropriate container. See http://llvm.org/docs/ProgrammersManual.html#picking-the-right-data-structure-for-a-task

1065

Also, capitalization.

RKSimon added inline comments.Nov 30 2016, 1:13 PM
lib/Analysis/LoopAccessAnalysis.cpp
1072

You are casting to PointerType and then only using it as a Type.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1263–1264

for (unsigned i = 0, e = VL.size(); i < e; ++i) {

1358–1359

for (unsigned j = 0, e = VL.size(); j < e; ++j) {

1370–1371

for (unsigned j = 0, e = VL.size(); j < e; ++j) {

1381–1382

for (unsigned j = 0, e = VL.size(); j < e; ++j) {

2576

for (unsigned i = 0, e = VecTy->getNumElements(); i < e; ++i) {

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20

What can be done to avoid this regression?

mkuper added inline comments.Nov 30 2016, 1:21 PM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20

Ohh, right, wanted to ask about this as well.
My guess is that this wasn't actually a regression, but we moved the shuffle from store side to the load side. Is that right?

RKSimon added inline comments.Dec 1 2016, 1:42 AM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20

If the update_test_checks script has done its job and generated checks for all the IR then this is an additional shuffle, I can't see an equivalent shuffle or set of extracts in the codegen on the left.

mssimpso added inline comments.Dec 1 2016, 10:33 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
468

I think we currently limit VL.size() to a maximum of 16? If so, the gain may not be that much, but I wouldn't expect presorting to be any worse.

1215

Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option.

2574

I probably missed this, but why are we checking the sizes? Does this mean there will be cases where E->NeedToShuffle is true but we don't generate the shuffle?

mkuper added inline comments.Dec 1 2016, 10:56 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1215

I don't think you're wrong, on the contrary - I was advocating removing the code I added (for reversing loads) and completely replacing it with something like this. But I didn't realize that'll introduce an extra shuffle.

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51

What happens if the stores are also out of order?
(IIRC, we should already have code to deal with that, I just want to make sure it meshes with the stores being out of order correctly)

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20

Argh, I didn't even look at the new version of the test, my assumption was from looking at the non-generated one (which is even more embarrassing, since I originally wrote that test, and didn't remember it doesn't have a shuffle...)
We really should not be regressing this.

ashahid added inline comments.Dec 2 2016, 2:58 AM
lib/Analysis/LoopAccessAnalysis.cpp
1062

ok

1065

I did refer to this manual but I could not find some thing similar.I am curious, what issue do you see with the usage of multimap? BTW, If you have any specific container in your mind, pls let me know.

1072

This is to resolve method membership error "class llvm::Type’ has no member named ‘getElementType" during compile time .

lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416

Sure

1263–1264

Do you want me to change the style of FOR statement to the above one?

2574

No, I want to ensure that resulting vector type is not differing due to the length of the vector value.

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51

I have not checked yet, but I will check.

RKSimon added inline comments.Dec 2 2016, 5:34 AM
lib/Analysis/LoopAccessAnalysis.cpp
1072

Sorry my mistake!

lib/Transforms/Vectorize/SLPVectorizer.cpp
1263–1264

Yes please - since you're touching this code, it might as well be dealt with. It can be done as a NFC pre-commit if you prefer to keep this patch cleaner.

mkuper added inline comments.Dec 2 2016, 10:44 AM
lib/Analysis/LoopAccessAnalysis.cpp
1065

Well, first, I don't believe you actually need a *mutli*map here, right?
We don't actually expect to get several elements with the same offset, we can fail immediately if that happens. So, you could replace the multimap with a regular map, and a check for the "multi" condition.

Assuming I'm not missing anything about that, the options are basically either a regular std::map, or a sorted vector ( http://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap )

ashahid updated this revision to Diff 80234.Dec 5 2016, 1:11 AM

Updated the review comments accordingly.

RKSimon added inline comments.Dec 7 2016, 4:26 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2576

clang-format?

Updated the comment for formatting and a test to incorporate this patch.

ping!

lib/Analysis/LoopAccessAnalysis.cpp
1065

Agreed. Updating the patch accordingly.

1076

My bad, refactored accordingly.

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51

It gels well with 'stores' being out-of-order by generating proper shufflemask for loads according to the out-of-order stores.

RKSimon added inline comments.Dec 15 2016, 8:26 AM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20

Any luck with working out what is causing this regression? Cross lane shuffles can be quite expensive.

mssimpso added inline comments.Dec 15 2016, 8:55 AM
include/llvm/Analysis/LoopAccessAnalysis.h
695

Should this be renamed to sortMemAccesses? If so, the comment above should also be updated: "jumbled memory accesses".

Also, should we be returning a SmallVector here? We could also pass a SmallVectorImpl<Value *> &Sorted to the function and place the sorted values there.

lib/Analysis/LoopAccessAnalysis.cpp
1068

Please update comment since you're no longer using a multimap.

lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416

Can we be a bit more explicit about VL. Are VL the scalar roots of the vectorizable tree?

2574

I don't think I fully understand this yet. Can you please make the comment more detailed. In particular, when does VL.size() not equal Scalars.size()? Is this the case when a bundle gets split up into smaller chunks? And then if this is true, what does it imply for the jumbled accesses. It looks like we will end up with a vector load still, but then when are they placed in the right order? Sorry if this should all be obvious!

ashahid added inline comments.Dec 21 2016, 3:06 AM
include/llvm/Analysis/LoopAccessAnalysis.h
695

Ok, I will do that.

lib/Analysis/LoopAccessAnalysis.cpp
1068

Oh, sure.

lib/Transforms/Vectorize/SLPVectorizer.cpp
413–416

No, it is not scalar roots of vectorizable tree. VL is all isomorphic scalars , for example ADD1, ADD2 and so on or LOAD1 , LOAD2 etc

2574

As such I don't expect VL.size() not equal to Scalars.size(), but if it is so, the compiler may throw assertion for incorrect vector types. I just wanted to avoid that. May be I am presuming it. I will check by avoiding this specific check.

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20

The regression is because here the order of scalar loads are reverse consecutive initially. I will update the patch to resolve it.

mssimpso added inline comments.Dec 21 2016, 7:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2574

Should we make the size check an assertion?

Updated the patch for the recent review comments which resolves the regressions in the given tests.

mssimpso added inline comments.Jan 3 2017, 8:32 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576

Hi Shahid,

I'm hitting the assertion here while testing this patch. Can you take a look?

mssimpso added inline comments.Jan 3 2017, 8:59 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2596

I also saw verifier failures where TBAA metadata had been applied to the shuffle, like:

TBAA is only for loads, stores and calls!
  %14 = shufflevector <4 x i32> %13, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>, !tbaa !66
ashahid added inline comments.Jan 3 2017, 7:11 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576

Sure. If possible can you share the asserting test?

2596

Ok, will fix it.

RKSimon added inline comments.Jan 4 2017, 8:14 AM
test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗(On Diff #82406)

The changes in this file are from the regeneration script and are just polluting this patch, I've commit this against trunk at rL290969 - please rebase.

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35

This looks suspicious - why the lonely change from TMP3 to TMP4?

mssimpso added inline comments.Jan 4 2017, 8:23 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576

Sure, I'll try and reduce something for you.

2596

I think you should probably just copy the metadata from the scalar load to the vector load, like:

propagateMetadata(LI, E->Scalars);
return Shuf;
ashahid added inline comments.Jan 4 2017, 8:51 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2596

Yes, TBAA metadata is not for shufflevector and recently verifier added this assert.

test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗(On Diff #82406)

Ok

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35

Oh good catch, I will see.

mssimpso added inline comments.Jan 4 2017, 9:08 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576

OK, you should be able to reproduce the assert with the bugpoint reduced test case at P7950. Thanks!

opt < D26905.ll -slp-vectorizer -S
ashahid added inline comments.Jan 6 2017, 1:45 AM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35

I was surprised initially but later realized that this is because the current patch resolves the regression you pointed out. So if you compare this patch i.e Diff5 with the previous patch i.e Diff4, you will see the expected difference

ashahid updated this revision to Diff 83352.Jan 6 2017, 2:08 AM

Updated the patch to fix the assertion observed by Simon (thanks for the reduced test) and other comments.

No other comments from me. Thanks.

mkuper added inline comments.Jan 6 2017, 2:46 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
468

I'm still not sure we want this to be quadratic.
I'd suggest one of two things:

  1. Change this to presort. For VL.size() == 4, it may be slower, but for VL.size() == 16, I'd expect it to be faster.
  2. If there's evidence that presorting is actually bad for small sizes, add a FIXME and bail out for VL.size() > 16. I'd prefer for us to fail to vectorize at larger VLs, than silently introduce a quadratic algorithm for larger Ns.

Updated the patch accordingly to address the comment

Thanks, Shahid.
The rest of my comments are cosmetic - except the one about the sort. I think your sort accidentally ended up quadratic.

lib/Analysis/LoopAccessAnalysis.cpp
1077

Shouldn't this sort call be outside the for loop?

lib/Transforms/Vectorize/SLPVectorizer.cpp
471

Can you just use std::equal directly? The only thing isSame() does, aside from that, is assert on the sizes - and you have that assert just a few lines above.

1228

newVL -> NewVL

1229

Could you please add braces to this for? It's a one-statement body, but not a one-line, so I think braces would be better.

mkuper added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1075

Thanks to @dberlin and @wmi - I now realize this is too simplistic.
It will handle cases where the offsets are constant, but not when all the offsets are variable, but the variables have constant differences from each other.

Anyway that can be handled separately. Could you add a FIXME here, please?

Updated the patch accordingly.

A few more cosmetic comments (sorry I didn't ask you to fix them all at once, but I keep noticing new ones every time I read the code).
Also, there's another thing I just realized is missing from this patch - we don't consider NeedToShuffle in getEntryCost().
You basically need to add the cost of a TTI::SK_PermuteSingleSrc shuffle to every NeedToShuffle load.

(Actually, it's a bit more complicated than that, since some of those shuffles may end up getting removed later, but it's probably better to be conservative here.)

lib/Analysis/LoopAccessAnalysis.cpp
1065

Also, this isn't a pair, it's a list of pairs.
OffValPairs can work.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2594

shuf -> Shuf

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51

I thought you added a test for the combination of out-of-order loads and out-of-order stores, but turns out I was imagining it. Could you please add one?
(We should have a regression test making sure we don't generate extra shuffles.)

Sorry for the delayed response. Updated the patch to include costing for extra shuffle and minor formatting.

mkuper accepted this revision.Jan 27 2017, 10:53 AM

LGTM, Thanks!

test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗(On Diff #86031)

Ok, so this is pretty much what I thought will happen. We shuffle both loads the same way, and then multiply, instead of multiplying and then shuffling. But this is probably fine - I hope InstCombine will pick up on this and combine it to a mul followed by a shuffle, if the masks match.

This revision is now accepted and ready to land.Jan 27 2017, 10:53 AM
This revision was automatically updated to reflect the committed changes.
ashahid added inline comments.Jan 31 2017, 1:35 AM
test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗(On Diff #86031)

Yes you are right. I verified, its happening exactly as you explained

Via Web