This patch improves the capability of SLPVectorizer pass to vectorize the loads of memory accesses in jumbled manner by using "load + shufflevector" IR instructions. The jumbled scalar loads will be sorted while building the tree and these accesses will be marked to generate "shufflevector" after the vectorized load with proper mask.
Details
Diff Detail
Event Timeline
Some minor comments - mainly code style etc.
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1064 | clang-format this | |
1066 | This can probably be replaced with a for range loop: for (auto *Val : VL) { and then replace the uses of VL[i] with Val | |
1075 | use auto* for dyn_cast return. | |
1081 | newVL is unused? | |
1083 | Use auto? You can drop the braces as well. | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
468 | This might be simplified with a for range loop and use of llvm:none_of / any_of? | |
1226 | Is it worth breaking here once we know that shuffledLoad is false? Remove the braces if you can. | |
2580 | for range loop? | |
test/Transforms/SLPVectorizer/X86/jumbled-load.ll | ||
3 | Possibly commit this test to trunk with the current output generated by utils/update_test_checks.py? | |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
4 | It'd be better if there was more context to this additional shuffle - regenerate + commit the current output with utils/update_test_checks.py ? For an IR loop it's not that large an output. |
This basically fixes PR28474, right? Does it work correctly on the test-cases there?
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1065 | Why is this a multimap? | |
1067 | Could you add some documentation to explain what exactly this does? | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
466 | I think you may want a different name - this doesn't actually check whether the scalars are jumbled, it checks whether they're all present. That is, it'll return true even if they're all in-order. | |
468 | Would it make sense to pre-sort both arrays, and then check the two sorted arrays are equal? This would make it O(nlogn) instead of O(n^2) | |
1196–1197 | This TODO gets done. | |
1215 | Do we still need the ReverseConsecutive case at all? That was introduced in r276477 as a patch for the common case of PR28474, where we find the loads in reverse order. But this should completely supersede it, right? | |
1215 | Why not for VL.size() == 2? | |
1217 | This looks rather weird. Can you make it more idiomatic? | |
test/Transforms/SLPVectorizer/X86/jumbled-load.ll | ||
10 | Please add a test that has several load packets (e.g. multiplies one load sequence by another load sequence). | |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
4 | Yes, I'd be interested to know if we added a shuffle here, or just moved a shuffle from the store side to the load side (which makes sense). |
Hi Simon, Michael
Thanks for the comments. Pls find the response inlined.
Thanks,
Shahid
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1065 | This is because the elements in the multimap follow a certain order, so using this will ensure that the values are sorted accordingly. | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
466 | What about isFoundJumbled()? | |
468 | Pre-sorting would require two calls for sort and then compare, IMO, for the given small VL.size it would not make much difference. However I am open to other views. | |
1196–1197 | Yes, that's right. | |
1215 | A jumbled VL of VL.size() == 2 is essentially a case of reversed VL. Considering the tradeoff between compile time of extra buildTree() for VL.size==2 vs additional runtime for shufflevector, I opted for extra compile time over extra runtime. | |
1217 | Sure | |
1226 | Seems yes. | |
2580 | Sure | |
test/Transforms/SLPVectorizer/X86/jumbled-load.ll | ||
3 | By "current output" do you mean output generated by utils/update_test_checks.py with this patch by ? | |
10 | Sure | |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
4 | By "current output" do you mean output generated by utils/update_test_checks.py with this patch by ? Pls explain. |
test/Transforms/SLPVectorizer/X86/jumbled-load.ll | ||
---|---|---|
3 | I mean commit the current (pre-patch) codegen so that this patch demonstrates the diff. |
Sorry for the delay, I was on vacation.
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1076 | Why are you turning a constant into a SCEV and back into a constant? | |
1077 | If you know this is a SCEVConstant, this should be a cast<>. Otherwise, you need to check the dyn_cast<> actually succeeded. | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
413–416 | Please add an explanation for what the VL parameters means here. | |
468 | To be honest, I'm not sure - so I'd appreciate another opinion about pre-sorting. | |
1215 | The trade off here is more one of code complexity - is the gain in compile time worth having all the additional logic present for both the "fully unsorted" case and the "reversed" case. |
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1062 | The LLVM coding standard is that function names start with a non-capital, and variable names start with a capital. | |
1065 | It doesn't seem right to use a multimap just for the sorting behavior. | |
1065 | Also, capitalization. |
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1072 | You are casting to PointerType and then only using it as a Type. | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
1263–1264 | for (unsigned i = 0, e = VL.size(); i < e; ++i) { | |
1358–1359 | for (unsigned j = 0, e = VL.size(); j < e; ++j) { | |
1370–1371 | for (unsigned j = 0, e = VL.size(); j < e; ++j) { | |
1381–1382 | for (unsigned j = 0, e = VL.size(); j < e; ++j) { | |
2576 | for (unsigned i = 0, e = VecTy->getNumElements(); i < e; ++i) { | |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
20 | What can be done to avoid this regression? |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
---|---|---|
20 | Ohh, right, wanted to ask about this as well. |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
---|---|---|
20 | If the update_test_checks script has done its job and generated checks for all the IR then this is an additional shuffle, I can't see an equivalent shuffle or set of extracts in the codegen on the left. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
468 | I think we currently limit VL.size() to a maximum of 16? If so, the gain may not be that much, but I wouldn't expect presorting to be any worse. | |
1215 | Am I wrong in thinking we don't necessarily know if rebuilding the tree with reversed loads would be any better than having the shuffle? Previously we were going to bail, but now we have an option. | |
2574 | I probably missed this, but why are we checking the sizes? Does this mean there will be cases where E->NeedToShuffle is true but we don't generate the shuffle? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
1215 | I don't think you're wrong, on the contrary - I was advocating removing the code I added (for reversing loads) and completely replacing it with something like this. But I didn't realize that'll introduce an extra shuffle. | |
test/Transforms/SLPVectorizer/X86/jumbled-load.ll | ||
51 | What happens if the stores are also out of order? | |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
20 | Argh, I didn't even look at the new version of the test, my assumption was from looking at the non-generated one (which is even more embarrassing, since I originally wrote that test, and didn't remember it doesn't have a shuffle...) |
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1062 | ok | |
1065 | I did refer to this manual but I could not find some thing similar.I am curious, what issue do you see with the usage of multimap? BTW, If you have any specific container in your mind, pls let me know. | |
1072 | This is to resolve method membership error "class llvm::Type’ has no member named ‘getElementType" during compile time . | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
413–416 | Sure | |
1263–1264 | Do you want me to change the style of FOR statement to the above one? | |
2574 | No, I want to ensure that resulting vector type is not differing due to the length of the vector value. | |
test/Transforms/SLPVectorizer/X86/jumbled-load.ll | ||
51 | I have not checked yet, but I will check. |
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1065 | Well, first, I don't believe you actually need a *mutli*map here, right? Assuming I'm not missing anything about that, the options are basically either a regular std::map, or a sorted vector ( http://llvm.org/docs/ProgrammersManual.html#dss-sortedvectormap ) |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2576 | clang-format? |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
---|---|---|
20 | Any luck with working out what is causing this regression? Cross lane shuffles can be quite expensive. |
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
695 | Should this be renamed to sortMemAccesses? If so, the comment above should also be updated: "jumbled memory accesses". Also, should we be returning a SmallVector here? We could also pass a SmallVectorImpl<Value *> &Sorted to the function and place the sorted values there. | |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1068 | Please update comment since you're no longer using a multimap. | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
413–416 | Can we be a bit more explicit about VL. Are VL the scalar roots of the vectorizable tree? | |
2574 | I don't think I fully understand this yet. Can you please make the comment more detailed. In particular, when does VL.size() not equal Scalars.size()? Is this the case when a bundle gets split up into smaller chunks? And then if this is true, what does it imply for the jumbled accesses. It looks like we will end up with a vector load still, but then when are they placed in the right order? Sorry if this should all be obvious! |
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
695 | Ok, I will do that. | |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1068 | Oh, sure. | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
413–416 | No, it is not scalar roots of vectorizable tree. VL is all isomorphic scalars , for example ADD1, ADD2 and so on or LOAD1 , LOAD2 etc | |
2574 | As such I don't expect VL.size() not equal to Scalars.size(), but if it is so, the compiler may throw assertion for incorrect vector types. I just wanted to avoid that. May be I am presuming it. I will check by avoiding this specific check. | |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
20 | The regression is because here the order of scalar loads are reverse consecutive initially. I will update the patch to resolve it. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2574 | Should we make the size check an assertion? |
Updated the patch for the recent review comments which resolves the regressions in the given tests.
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2575–2576 | Hi Shahid, I'm hitting the assertion here while testing this patch. Can you take a look? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2596 | I also saw verifier failures where TBAA metadata had been applied to the shuffle, like: TBAA is only for loads, stores and calls! %14 = shufflevector <4 x i32> %13, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>, !tbaa !66 |
test/Transforms/SLPVectorizer/X86/horizontal-list.ll | ||
---|---|---|
12 ↗ | (On Diff #82406) | The changes in this file are from the regeneration script and are just polluting this patch, I've commit this against trunk at rL290969 - please rebase. |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
35 | This looks suspicious - why the lonely change from TMP3 to TMP4? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
2575–2576 | OK, you should be able to reproduce the assert with the bugpoint reduced test case at P7950. Thanks! opt < D26905.ll -slp-vectorizer -S |
test/Transforms/SLPVectorizer/X86/reduction_loads.ll | ||
---|---|---|
35 | I was surprised initially but later realized that this is because the current patch resolves the regression you pointed out. So if you compare this patch i.e Diff5 with the previous patch i.e Diff4, you will see the expected difference |
Updated the patch to fix the assertion observed by Simon (thanks for the reduced test) and other comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
468 | I'm still not sure we want this to be quadratic.
|
Thanks, Shahid.
The rest of my comments are cosmetic - except the one about the sort. I think your sort accidentally ended up quadratic.
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1077 | Shouldn't this sort call be outside the for loop? | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
471 | Can you just use std::equal directly? The only thing isSame() does, aside from that, is assert on the sizes - and you have that assert just a few lines above. | |
1228 | newVL -> NewVL | |
1229 | Could you please add braces to this for? It's a one-statement body, but not a one-line, so I think braces would be better. |
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1075 |
A few more cosmetic comments (sorry I didn't ask you to fix them all at once, but I keep noticing new ones every time I read the code).
Also, there's another thing I just realized is missing from this patch - we don't consider NeedToShuffle in getEntryCost().
You basically need to add the cost of a TTI::SK_PermuteSingleSrc shuffle to every NeedToShuffle load.
(Actually, it's a bit more complicated than that, since some of those shuffles may end up getting removed later, but it's probably better to be conservative here.)
lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1065 | Also, this isn't a pair, it's a list of pairs. | |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
2594 | shuf -> Shuf | |
test/Transforms/SLPVectorizer/X86/jumbled-load.ll | ||
51 | I thought you added a test for the combination of out-of-order loads and out-of-order stores, but turns out I was imagining it. Could you please add one? |
Sorry for the delayed response. Updated the patch to include costing for extra shuffle and minor formatting.
LGTM, Thanks!
test/Transforms/SLPVectorizer/X86/store-jumbled.ll | ||
---|---|---|
14 ↗ | (On Diff #86031) | Ok, so this is pretty much what I thought will happen. We shuffle both loads the same way, and then multiply, instead of multiplying and then shuffling. But this is probably fine - I hope InstCombine will pick up on this and combine it to a mul followed by a shuffle, if the masks match. |
test/Transforms/SLPVectorizer/X86/store-jumbled.ll | ||
---|---|---|
14 ↗ | (On Diff #86031) | Yes you are right. I verified, its happening exactly as you explained |
Should this be renamed to sortMemAccesses? If so, the comment above should also be updated: "jumbled memory accesses".
Also, should we be returning a SmallVector here? We could also pass a SmallVectorImpl<Value *> &Sorted to the function and place the sorted values there.