If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. Patch
allows reordering of such instructions.
Details
- Reviewers
RKSimon spatel hfinkel mkuper Ayal • ashahid - Commits
- rG428e9d9d8784: [SLP] Fix PR36481: vectorize reassociated instructions.
rG3decaf4275be: [SLP] Fix PR36481: vectorize reassociated instructions.
rL329085: [SLP] Fix PR36481: vectorize reassociated instructions.
rL328980: [SLP] Fix PR36481: vectorize reassociated instructions.
Diff Detail
- Repository
- rL LLVM
Event Timeline
This patch addresses the following TODO, plus handles extracts:
// Check if the loads are consecutive, reversed, or neither. // TODO: What we really want is to sort the loads, but for now, check // the two likely directions.
At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
679 ↗ | (On Diff #136345) | The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references. Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess():
Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO. Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO. |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1113 ↗ | (On Diff #136345) | Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
454 ↗ | (On Diff #136345) | Update above documentation accordingly. Instead of returning the index when it's not Idx, may as well have getExtractIndex() return it always, and have the caller compare it to Idx? |
608 ↗ | (On Diff #136345) | The order which bestOrder() provides is then used to form a vector of instructions. Suggest to have this method supply the desired vector, given the instructions to permute. |
667 ↗ | (On Diff #136345) | and returns the mask for reordering operations, if it allows should specify more accurately something like: |
1224 ↗ | (On Diff #136345) | Update above documentation. |
1228 ↗ | (On Diff #136345) | Can the permutations be kept inside NumOpsWantsToKeepOrder, using OrdersType as its key, instead of holding them in OpOrders? So that one could later simply do ++NumOpsWantsToKeepOrder[CurrentOrder]; See, e.g., UniquifierDenseMapInfo in LSR. |
1229 ↗ | (On Diff #136345) | Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., NumOpsWantToKeepOriginalOrder Can add that NumOpsWantToKeepOriginalOrder holds the count for the identity permutation, instead of holding this count inside NumOpsWantsToKeepOrder along with all other permutations. |
1581 ↗ | (On Diff #136345) | BestOrder >> CurrentOrder? |
1587 ↗ | (On Diff #136345) | Better early exit by returning here. |
1593 ↗ | (On Diff #136345) | This is pretty hard to follow, and deserves an explanation. Would be better to simply do something like ++NumOpsWantToKeepOrder[BestOrder]. |
1631 ↗ | (On Diff #136345) | Sink the emplace_back to after the handling of non-simple loads? |
1640 ↗ | (On Diff #136345) | "BestOrder" >> "CurrentOrder", or "VLOrder"? |
1644 ↗ | (On Diff #136345) | Reuse PointerOps and have Value *P0,1 = PointerOps.front,back() instead of LoadInst *L0,1 just to get their PointerOperand (and SCEV) later? |
1659 ↗ | (On Diff #136345) | Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min. |
2022 ↗ | (On Diff #136345) | Comment that BestOrder is initialize to invalid values. |
2029 ↗ | (On Diff #136345) | Can simplify by checking if BestOrder is the identify permutation at the end, as done at the end of sortPtrAccesses(); using getExtractIndex(Inst) which returns Idx even if it's equal to I. Better rename BestOrder here too. |
3077 ↗ | (On Diff #136345) | Capture this "inversePermutation" in a method, to be called again below? |
3086 ↗ | (On Diff #136345) | InsertPoint may be set twice to VL0? |
3298 ↗ | (On Diff #136345) | Should we first inverse the permutation and then take its front()? Would be good to have a testcase where this makes a difference and check it (one way or the other), if there isn't one already. |
4915 ↗ | (On Diff #136345) | If only two operations are still allowed, ReorderedOps may as well stay Ops[1], Ops[0]. |
4917 ↗ | (On Diff #136345) | Provide Ops.size() as operand to constructor of ReorderedOps (multiple similar occurrences). |
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll | ||
24 ↗ | (On Diff #136345) | This redundant sequence of extractions from REORDER_SHUFFLE and insertions into TMP13 is hopefully eliminated later. Is the cost model ignoring it, or can we avoid generating it? Would be good to have the test CHECK the cost. |
Yes, but this must be in a different patch, not this one.
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
679 ↗ | (On Diff #136345) | The documentation above does not say anything about consecutive access. It just states, that the pointers are sorted and that's it. I did it on purpose, so later we could reuse this function for masked loads|stores. |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1113 ↗ | (On Diff #136345) | Tried, it does not work in many cases. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
1229 ↗ | (On Diff #136345) | I don't want to add the new entry for operations, that do not require reordering. I'd better the code in another way. |
1581 ↗ | (On Diff #136345) | I think it does not matter because the current order is the best order. |
1593 ↗ | (On Diff #136345) | I need to use an iterator, will rework the code. |
1659 ↗ | (On Diff #136345) | Just like I said, I did it for future support of masked loads|stores. |
2029 ↗ | (On Diff #136345) | Check the boolean flag is faster than to perform N comparisons. That's why I'd prefer to leave it as is. |
3086 ↗ | (On Diff #136345) | Missed it, thanks. |
3298 ↗ | (On Diff #136345) | There are several tests already that test that this is correct code. |
test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll | ||
24 ↗ | (On Diff #136345) | Yes, InstCombiner will squash all these instructions into one shuffle. Yes, cost model is aware of these operations and ignores their cost (canReuseExtract() is intended to do this). |
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
679 ↗ | (On Diff #136345) | By "The documentation above" I meant the one example of a[i+2], a[i+0], a[i+1] and a[i+3] which is a permutation of consecutive addresses. Masked loads and stores (will) also require that all pointers be within range of a single vector, right? |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1113 ↗ | (On Diff #136345) | Then that should be fixed, worth opening a PR? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
1229 ↗ | (On Diff #136345) | Understood; suggested before to simply document that one could alternatively hold the count for the identity permutation inside NumOpsWantsToKeepOrder map, but we're instead keeping it outside using NumOpsWantToKeepOriginalOrder counter. |
1581 ↗ | (On Diff #136345) | The current order may not be the best order, the first time the tree is built, right? |
2029 ↗ | (On Diff #136345) | Right; it could simplify the code though, and N is usually very small. |
1621 ↗ | (On Diff #137784) | Very good, thanks. |
3027 ↗ | (On Diff #137784) | May as well use E when resizing Mask. |
3358 ↗ | (On Diff #137784) | Another inversePermutation? |
5693 ↗ | (On Diff #137784) | Fold by feeding the constructor with the size. |
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
679 ↗ | (On Diff #136345) |
|
lib/Analysis/LoopAccessAnalysis.cpp | ||
1113 ↗ | (On Diff #136345) | I'm not sure that this is a bug, it is a feature :) . computeConstantDifference() is used to get the difference between the pointer returned by the GetUnderlyingObject() and kind of a GEP %ptr, 0, n, where n is constant. But this function does not work if the first load element is from GEP %ptr, 0, m, not from %ptr, and others are from GEP %ptr, 0, m+1, GEP %ptr, 0, m+2 etc. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
1229 ↗ | (On Diff #136345) | Added a comment to the declaration of NumOpsWantToKeepOrder |
1581 ↗ | (On Diff #136345) | Yes, but this is the best order for this particular bundle. Anyway, I renamed it. |
1621 ↗ | (On Diff #137784) | No, we cannot use CurrentOrder. I tried to reduce the memory usage and instead of copying the CurrentOrder in the newTreEntry() function I keep the ArrayRef for this order. But CurrentOrder is the local variable and it will be destroyed when we exit out of its declaration scope. And, thus, we will keep the reference to incorrect memory. Instead, I need to use the reference that will exist until the end of the vectorization process (as the key of the map) |
3027 ↗ | (On Diff #137784) | Oh, yes, missed it, thanks. |
3358 ↗ | (On Diff #137784) | Yup, thanks. |
5693 ↗ | (On Diff #137784) | Reworked it, thanks. |
Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
679 ↗ | (On Diff #136345) | Check for zero gaps, i.e., replicated references? |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1113 ↗ | (On Diff #136345) | ok, right; computeConstantDifference() is lightweight. But it would be good to refactor a more time consuming variant which checks if getMinusSCEV returns a constant, and first compares UnderlyingObjects; and also AddressSpaces, following LoopVectorize. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
1659 ↗ | (On Diff #136345) | OK, so sortPtrAccesses() can serve more general cases where max - min > VF. It would still be helpful to wrap it, and refactor the above code/usage inside an isPermutedConsecutiveAccess() method which would call sortPtrAccesses(). |
1621 ↗ | (On Diff #137784) | OK, right; it's clear why each Tree Entry should not hold an OrdersType object, and that newTreeEntry() should be given the object stored inside NumOpsWantToKeepOrder rather than the equivalent temporary CurrentOrder. So ++NumOpsWantToKeepOrder[CurrentOrder] can work, but then newTreeEntry() will need to be given NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst()? |
458 ↗ | (On Diff #138035) | Add a message to the assert. |
1250 ↗ | (On Diff #138035) | "\a" >> "\p" |
1254 ↗ | (On Diff #138035) | "DirectOrderNum" >> "NumOpsWantToKeepOriginalOrder" instead of adding the comment? |
1617 ↗ | (On Diff #138035) | May be helpful to also dump CurrentOrder into dbgs(). |
1651 ↗ | (On Diff #138035) | Fold the size into the constructor. |
2041 ↗ | (On Diff #138035) |
// Assign to all items the initial value E + 1 so we can check if the |
2044 ↗ | (On Diff #138035) |
", by checking that no element of CurrentOrder still has value E + 1." |
2074 ↗ | (On Diff #138035) | It may be easier to read if instead of clearing CurrentOrder at each early exit, we break from the loop, and right after the loop check if it exited early and if (I < E) clear CurrentOrder and return false. |
We already have couple tests for extractvalue: ARM/sroa.ll and X86/insertvalue.ll. They already have tests with the different order of the extractvalue instructions.
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
679 ↗ | (On Diff #136345) | Agree, will add checks |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1113 ↗ | (On Diff #136345) | Added checks for addressspace, comparison for underlying objects already were there. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
458 ↗ | (On Diff #138035) | Ok |
2044 ↗ | (On Diff #138035) | This check is not required, it is checked automatically. We check that number of the extract elements is the same as the vector length at first. If later we try to write the index to element that does not equals E+1 it means that at least one element will still have E+1 as the value and we're have at least 2 elements with the same index. |
Looks good to me, thanks for addressing the issues, have only a few last minor suggestions.
Sure, please add a TODO. This patch makes such rebuilds more frequent.
include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
673 ↗ | (On Diff #140474) | Sounds better to specify "Returns 'true' if ..., otherwise returns 'false'" ? |
lib/Analysis/LoopAccessAnalysis.cpp | ||
1113 ↗ | (On Diff #136345) | Right, the comment was about refactoring it altogether into a separate, more time consuming variant of computeConstantDifference() that checks everything; can alternatively leave a TODO for now? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
1252 ↗ | (On Diff #140474) | /// DirectOrderNum. >> /// NumOpsWantToKeepOriginalOrder. |
1629 ↗ | (On Diff #140474) | Would the following work and be easier to read?
or alternatively rename I to something more meaningful like StoredCurrentOrderAndNum. |
2054 ↗ | (On Diff #140474) | May as well do CurrentOrder.assign(E, E+1); |
2074 ↗ | (On Diff #138035) | The suggestion for easier reading was for something like: for (unsigned I = 0; I < E; ++I) { auto *Inst = cast<Instruction>(VL[I]); if (Inst->getOperand(0) != Vec) break; Optional<unsigned> Idx = getExtractIndex(Inst); if (!Idx) break; const unsigned ExtIdx = *Idx; if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1) break; CurrentOrder[ExtIdx] = I; if (ExtIdx != I) ShouldKeepOrder = false; } if (I < E) { CurrentOrder.clear(); return false; } return ShouldKeepOrder; |