Page MenuHomePhabricator

[SLP] Vectorize loads of consecutive memory accesses, accessed in non-consecutive (jumbled) way.
ClosedPublic

Authored by ashahid on Nov 21 2016, 1:27 AM.

Details

Summary

This patch improves the capability of SLPVectorizer pass to vectorize the loads of memory accesses in jumbled manner by using "load + shufflevector" IR instructions. The jumbled scalar loads will be sorted while building the tree and these accesses will be marked to generate "shufflevector" after the vectorized load with proper mask.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mkuper added inline comments.Nov 30 2016, 1:02 PM
lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗(On Diff #78690)

It doesn't seem right to use a multimap just for the sorting behavior.
I think you can find a more appropriate container. See http://llvm.org/docs/ProgrammersManual.html#picking-the-right-data-structure-for-a-task

1019 ↗(On Diff #79376)

The LLVM coding standard is that function names start with a non-capital, and variable names start with a capital.
(There are some exceptions for functions, but this is mostly in old code.)

1022 ↗(On Diff #79376)

Also, capitalization.

RKSimon added inline comments.Nov 30 2016, 1:13 PM
lib/Analysis/LoopAccessAnalysis.cpp
1029 ↗(On Diff #79376)

You are casting to PointerType and then only using it as a Type.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1257 ↗(On Diff #79376)

for (unsigned i = 0, e = VL.size(); i < e; ++i) {

1352 ↗(On Diff #79376)

for (unsigned j = 0, e = VL.size(); j < e; ++j) {

1364 ↗(On Diff #79376)

for (unsigned j = 0, e = VL.size(); j < e; ++j) {

1375 ↗(On Diff #79376)

for (unsigned j = 0, e = VL.size(); j < e; ++j) {

2567 ↗(On Diff #79376)

for (unsigned i = 0, e = VecTy->getNumElements(); i < e; ++i) {

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗(On Diff #79376)

What can be done to avoid this regression?

mkuper added inline comments.Nov 30 2016, 1:21 PM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗(On Diff #79376)

Ohh, right, wanted to ask about this as well.
My guess is that this wasn't actually a regression, but we moved the shuffle from store side to the load side. Is that right?

RKSimon added inline comments.Dec 1 2016, 1:42 AM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗(On Diff #79376)

If the update_test_checks script has done its job and generated checks for all the IR then this is an additional shuffle, I can't see an equivalent shuffle or set of extracts in the codegen on the left.

mssimpso added inline comments.Dec 1 2016, 10:33 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
466 ↗(On Diff #78690)

I think we currently limit VL.size() to a maximum of 16? If so, the gain may not be that much, but I wouldn't expect presorting to be any worse.

1217 ↗(On Diff #78690)

Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option.

2565 ↗(On Diff #79376)

I probably missed this, but why are we checking the sizes? Does this mean there will be cases where E->NeedToShuffle is true but we don't generate the shuffle?

mkuper added inline comments.Dec 1 2016, 10:56 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1217 ↗(On Diff #78690)

I don't think you're wrong, on the contrary - I was advocating removing the code I added (for reversing loads) and completely replacing it with something like this. But I didn't realize that'll introduce an extra shuffle.

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗(On Diff #79376)

What happens if the stores are also out of order?
(IIRC, we should already have code to deal with that, I just want to make sure it meshes with the stores being out of order correctly)

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗(On Diff #79376)

Argh, I didn't even look at the new version of the test, my assumption was from looking at the non-generated one (which is even more embarrassing, since I originally wrote that test, and didn't remember it doesn't have a shuffle...)
We really should not be regressing this.

ashahid added inline comments.Dec 2 2016, 2:58 AM
lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗(On Diff #78690)

I did refer to this manual but I could not find some thing similar.I am curious, what issue do you see with the usage of multimap? BTW, If you have any specific container in your mind, pls let me know.

1019 ↗(On Diff #79376)

ok

1029 ↗(On Diff #79376)

This is to resolve method membership error "class llvm::Type’ has no member named ‘getElementType" during compile time .

lib/Transforms/Vectorize/SLPVectorizer.cpp
414 ↗(On Diff #79376)

Sure

1257 ↗(On Diff #79376)

Do you want me to change the style of FOR statement to the above one?

2565 ↗(On Diff #79376)

No, I want to ensure that resulting vector type is not differing due to the length of the vector value.

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗(On Diff #79376)

I have not checked yet, but I will check.

RKSimon added inline comments.Dec 2 2016, 5:34 AM
lib/Analysis/LoopAccessAnalysis.cpp
1029 ↗(On Diff #79376)

Sorry my mistake!

lib/Transforms/Vectorize/SLPVectorizer.cpp
1257 ↗(On Diff #79376)

Yes please - since you're touching this code, it might as well be dealt with. It can be done as a NFC pre-commit if you prefer to keep this patch cleaner.

mkuper added inline comments.Dec 2 2016, 10:44 AM
lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗(On Diff #78690)

Well, first, I don't believe you actually need a *mutli*map here, right?
We don't actually expect to get several elements with the same offset, we can fail immediately if that happens. So, you could replace the multimap with a regular map, and a check for the "multi" condition.

Assuming I'm not missing anything about that, the options are basically either a regular std::map, or a sorted vector ( http://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap )

ashahid updated this revision to Diff 80234.Dec 5 2016, 1:11 AM

Updated the review comments accordingly.

RKSimon added inline comments.Dec 7 2016, 4:26 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2568 ↗(On Diff #80234)

clang-format?

Updated the comment for formatting and a test to incorporate this patch.

ping!

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗(On Diff #78690)

Agreed. Updating the patch accordingly.

1033 ↗(On Diff #79376)

My bad, refactored accordingly.

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗(On Diff #79376)

It gels well with 'stores' being out-of-order by generating proper shufflemask for loads according to the out-of-order stores.

RKSimon added inline comments.Dec 15 2016, 8:26 AM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗(On Diff #79376)

Any luck with working out what is causing this regression? Cross lane shuffles can be quite expensive.

mssimpso added inline comments.Dec 15 2016, 8:55 AM
include/llvm/Analysis/LoopAccessAnalysis.h
695 ↗(On Diff #81051)

Should this be renamed to sortMemAccesses? If so, the comment above should also be updated: "jumbled memory accesses".

Also, should we be returning a SmallVector here? We could also pass a SmallVectorImpl<Value *> &Sorted to the function and place the sorted values there.

lib/Analysis/LoopAccessAnalysis.cpp
1068 ↗(On Diff #81051)

Please update comment since you're no longer using a multimap.

lib/Transforms/Vectorize/SLPVectorizer.cpp
413 ↗(On Diff #81051)

Can we be a bit more explicit about VL. Are VL the scalar roots of the vectorizable tree?

2565 ↗(On Diff #79376)

I don't think I fully understand this yet. Can you please make the comment more detailed. In particular, when does VL.size() not equal Scalars.size()? Is this the case when a bundle gets split up into smaller chunks? And then if this is true, what does it imply for the jumbled accesses. It looks like we will end up with a vector load still, but then when are they placed in the right order? Sorry if this should all be obvious!

ashahid added inline comments.Dec 21 2016, 3:06 AM
include/llvm/Analysis/LoopAccessAnalysis.h
695 ↗(On Diff #81051)

Ok, I will do that.

lib/Analysis/LoopAccessAnalysis.cpp
1068 ↗(On Diff #81051)

Oh, sure.

lib/Transforms/Vectorize/SLPVectorizer.cpp
413 ↗(On Diff #81051)

No, it is not scalar roots of vectorizable tree. VL is all isomorphic scalars , for example ADD1, ADD2 and so on or LOAD1 , LOAD2 etc

2565 ↗(On Diff #79376)

As such I don't expect VL.size() not equal to Scalars.size(), but if it is so, the compiler may throw assertion for incorrect vector types. I just wanted to avoid that. May be I am presuming it. I will check by avoiding this specific check.

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
20 ↗(On Diff #79376)

The regression is because here the order of scalar loads are reverse consecutive initially. I will update the patch to resolve it.

mssimpso added inline comments.Dec 21 2016, 7:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2565 ↗(On Diff #79376)

Should we make the size check an assertion?

Updated the patch for the recent review comments which resolves the regressions in the given tests.

mssimpso added inline comments.Jan 3 2017, 8:32 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗(On Diff #82406)

Hi Shahid,

I'm hitting the assertion here while testing this patch. Can you take a look?

mssimpso added inline comments.Jan 3 2017, 8:59 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2596 ↗(On Diff #82406)

I also saw verifier failures where TBAA metadata had been applied to the shuffle, like:

TBAA is only for loads, stores and calls!
  %14 = shufflevector <4 x i32> %13, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>, !tbaa !66
ashahid added inline comments.Jan 3 2017, 7:11 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗(On Diff #82406)

Sure. If possible can you share the asserting test?

2596 ↗(On Diff #82406)

Ok, will fix it.

RKSimon added inline comments.Jan 4 2017, 8:14 AM
test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗(On Diff #82406)

The changes in this file are from the regeneration script and are just polluting this patch, I've commit this against trunk at rL290969 - please rebase.

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35 ↗(On Diff #82406)

This looks suspicious - why the lonely change from TMP3 to TMP4?

mssimpso added inline comments.Jan 4 2017, 8:23 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗(On Diff #82406)

Sure, I'll try and reduce something for you.

2596 ↗(On Diff #82406)

I think you should probably just copy the metadata from the scalar load to the vector load, like:

propagateMetadata(LI, E->Scalars);
return Shuf;
ashahid added inline comments.Jan 4 2017, 8:51 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2596 ↗(On Diff #82406)

Yes, TBAA metadata is not for shufflevector and recently verifier added this assert.

test/Transforms/SLPVectorizer/X86/horizontal-list.ll
12 ↗(On Diff #82406)

Ok

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35 ↗(On Diff #82406)

Oh good catch, I will see.

mssimpso added inline comments.Jan 4 2017, 9:08 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2575–2576 ↗(On Diff #82406)

OK, you should be able to reproduce the assert with the bugpoint reduced test case at P7950. Thanks!

opt < D26905.ll -slp-vectorizer -S
ashahid added inline comments.Jan 6 2017, 1:45 AM
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
35 ↗(On Diff #82406)

I was surprised initially but later realized that this is because the current patch resolves the regression you pointed out. So if you compare this patch i.e Diff5 with the previous patch i.e Diff4, you will see the expected difference

ashahid updated this revision to Diff 83352.Jan 6 2017, 2:08 AM

Updated the patch to fix the assertion observed by Simon (thanks for the reduced test) and other comments.

No other comments from me. Thanks.

mkuper added inline comments.Jan 6 2017, 2:46 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
468 ↗(On Diff #83352)

I'm still not sure we want this to be quadratic.
I'd suggest one of two things:

  1. Change this to presort. For VL.size() == 4, it may be slower, but for VL.size() == 16, I'd expect it to be faster.
  2. If there's evidence that presorting is actually bad for small sizes, add a FIXME and bail out for VL.size() > 16. I'd prefer for us to fail to vectorize at larger VLs, than silently introduce a quadratic algorithm for larger Ns.

Updated the patch accordingly to address the comment

Thanks, Shahid.
The rest of my comments are cosmetic - except the one about the sort. I think your sort accidentally ended up quadratic.

lib/Analysis/LoopAccessAnalysis.cpp
1034 ↗(On Diff #79376)

Shouldn't this sort call be outside the for loop?

lib/Transforms/Vectorize/SLPVectorizer.cpp
471 ↗(On Diff #83923)

Can you just use std::equal directly? The only thing isSame() does, aside from that, is assert on the sizes - and you have that assert just a few lines above.

1228 ↗(On Diff #83923)

newVL -> NewVL

1229 ↗(On Diff #83923)

Could you please add braces to this for? It's a one-statement body, but not a one-line, so I think braces would be better.

mkuper added inline comments.
lib/Analysis/LoopAccessAnalysis.cpp
1075 ↗(On Diff #83923)

Thanks to @dberlin and @wmi - I now realize this is too simplistic.
It will handle cases where the offsets are constant, but not when all the offsets are variable, but the variables have constant differences from each other.

Anyway that can be handled separately. Could you add a FIXME here, please?

Updated the patch accordingly.

A few more cosmetic comments (sorry I didn't ask you to fix them all at once, but I keep noticing new ones every time I read the code).
Also, there's another thing I just realized is missing from this patch - we don't consider NeedToShuffle in getEntryCost().
You basically need to add the cost of a TTI::SK_PermuteSingleSrc shuffle to every NeedToShuffle load.

(Actually, it's a bit more complicated than that, since some of those shuffles may end up getting removed later, but it's probably better to be conservative here.)

lib/Analysis/LoopAccessAnalysis.cpp
1022 ↗(On Diff #79376)

Also, this isn't a pair, it's a list of pairs.
OffValPairs can work.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2595 ↗(On Diff #84262)

shuf -> Shuf

test/Transforms/SLPVectorizer/X86/jumbled-load.ll
51 ↗(On Diff #79376)

I thought you added a test for the combination of out-of-order loads and out-of-order stores, but turns out I was imagining it. Could you please add one?
(We should have a regression test making sure we don't generate extra shuffles.)

Sorry for the delayed response. Updated the patch to include costing for extra shuffle and minor formatting.

mkuper accepted this revision.Jan 27 2017, 10:53 AM

LGTM, Thanks!

test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗(On Diff #86031)

Ok, so this is pretty much what I thought will happen. We shuffle both loads the same way, and then multiply, instead of multiplying and then shuffling. But this is probably fine - I hope InstCombine will pick up on this and combine it to a mul followed by a shuffle, if the masks match.

This revision is now accepted and ready to land.Jan 27 2017, 10:53 AM
This revision was automatically updated to reflect the committed changes.
ashahid added inline comments.Jan 31 2017, 1:35 AM
test/Transforms/SLPVectorizer/X86/store-jumbled.ll
14 ↗(On Diff #86031)

Yes you are right. I verified, its happening exactly as you explained