[SLP] Fix PR36481: vectorize reassociated instructions.
ClosedPublic

Authored by ABataev on Feb 26 2018, 12:20 PM.

Details

Summary

If the load/extractelement/extractvalue instructions are not originally
consecutive, the SLP vectorizer is unable to vectorize them. Patch
allows reordering of such instructions.

Diff Detail

Repository
rL LLVM
There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev updated this revision to Diff 136345.Feb 28 2018, 11:13 AM

Fixed generation of mask for shuffling of reordered instructions.

Ayal added a comment.Mar 7 2018, 11:38 PM

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗(On Diff #136345)

The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references.

Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess():

  1. Scan all references to find the one with minimal address. Bail out if any reference is incomparable with the running min.
  2. Scan all references and set SortedIndices[i] = difference(VL[i], Minimum). Bail out if this entry is to be set more than once.

Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO.

Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO.

lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #136345)

Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here.

lib/Transforms/Vectorize/SLPVectorizer.cpp
454 ↗(On Diff #136345)

Update above documentation accordingly.

Instead of returning the index when it's not Idx, may as well have getExtractIndex() return it always, and have the caller compare it to Idx?
While we're at it, may as well pass only E and have the callee get its Opcode.

608 ↗(On Diff #136345)

The order which bestOrder() provides is then used to form a vector of instructions. Suggest to have this method supply the desired vector, given the instructions to permute.

667 ↗(On Diff #136345)
and returns the mask for reordering operations, if it allows

should specify more accurately something like:
...and sets \p BestOrder to the identity permutation; otherwise returns False, setting \p BestOrder to either an empty vector or a non-identity permutation that allows...

1224 ↗(On Diff #136345)

Update above documentation.

1228 ↗(On Diff #136345)

Can the permutations be kept inside NumOpsWantsToKeepOrder, using OrdersType as its key, instead of holding them in OpOrders? So that one could later simply do

++NumOpsWantsToKeepOrder[CurrentOrder];

See, e.g., UniquifierDenseMapInfo in LSR.

1229 ↗(On Diff #136345)

Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., NumOpsWantToKeepOriginalOrder

Can add that NumOpsWantToKeepOriginalOrder holds the count for the identity permutation, instead of holding this count inside NumOpsWantsToKeepOrder along with all other permutations.

1581 ↗(On Diff #136345)

BestOrder >> CurrentOrder?

1587 ↗(On Diff #136345)

Better early exit by returning here.

1593 ↗(On Diff #136345)

This is pretty hard to follow, and deserves an explanation. Would be better to simply do something like ++NumOpsWantToKeepOrder[BestOrder].

1631 ↗(On Diff #136345)

Sink the emplace_back to after the handling of non-simple loads?

1640 ↗(On Diff #136345)

"BestOrder" >> "CurrentOrder", or "VLOrder"?

1644 ↗(On Diff #136345)

Reuse PointerOps and have Value *P0,1 = PointerOps.front,back() instead of LoadInst *L0,1 just to get their PointerOperand (and SCEV) later?

1659 ↗(On Diff #136345)

Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min.

2022 ↗(On Diff #136345)

Comment that BestOrder is initialize to invalid values.
Perhaps set E = VL.size() here and assign E + 1, to match the later checks for initialized/unset values.

2029 ↗(On Diff #136345)

Can simplify by checking if BestOrder is the identify permutation at the end, as done at the end of sortPtrAccesses(); using getExtractIndex(Inst) which returns Idx even if it's equal to I. Better rename BestOrder here too.

3077 ↗(On Diff #136345)

Capture this "inversePermutation" in a method, to be called again below?

3086 ↗(On Diff #136345)

InsertPoint may be set twice to VL0?

3298 ↗(On Diff #136345)

Should we first inverse the permutation and then take its front()?

Would be good to have a testcase where this makes a difference and check it (one way or the other), if there isn't one already.

4915 ↗(On Diff #136345)

If only two operations are still allowed, ReorderedOps may as well stay Ops[1], Ops[0].
Provide the generalization below to any permutation when the assert can be dropped, i.e., in a separate patch which handles this TODO?

4917 ↗(On Diff #136345)

Provide Ops.size() as operand to constructor of ReorderedOps (multiple similar occurrences).

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24 ↗(On Diff #136345)

This redundant sequence of extractions from REORDER_SHUFFLE and insertions into TMP13 is hopefully eliminated later. Is the cost model ignoring it, or can we avoid generating it? Would be good to have the test CHECK the cost.

ABataev marked 12 inline comments as done.Mar 9 2018, 9:52 AM

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

Yes, but this must be in a different patch, not this one.

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗(On Diff #136345)

The documentation above does not say anything about consecutive access. It just states, that the pointers are sorted and that's it. I did it on purpose, so later we could reuse this function for masked loads|stores.
Masked loads are not supported yet, that's why in the SLPVectorizer I added an additional check for the consecutive access.

lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #136345)

Tried, it does not work in many cases.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1229 ↗(On Diff #136345)

I don't want to add the new entry for operations, that do not require reordering. I'd better the code in another way.

1581 ↗(On Diff #136345)

I think it does not matter because the current order is the best order.

1593 ↗(On Diff #136345)

I need to use an iterator, will rework the code.

1659 ↗(On Diff #136345)

Just like I said, I did it for future support of masked loads|stores.

2029 ↗(On Diff #136345)

Check the boolean flag is faster than to perform N comparisons. That's why I'd prefer to leave it as is.

3086 ↗(On Diff #136345)

Missed it, thanks.

3298 ↗(On Diff #136345)

There are several tests already that test that this is correct code.
SLPVectorizer/X86/PR32086.ll, SLPVectorizer/X86/jumbled-load.ll, SLPVectorizer/X86/store-jumbled.ll etc.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24 ↗(On Diff #136345)

Yes, InstCombiner will squash all these instructions into one shuffle. Yes, cost model is aware of these operations and ignores their cost (canReuseExtract() is intended to do this).

ABataev updated this revision to Diff 137784.Mar 9 2018, 10:02 AM
ABataev marked 2 inline comments as done.

Updates after review

Ayal added inline comments.Mar 9 2018, 2:43 PM
include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗(On Diff #136345)

By "The documentation above" I meant the one example of a[i+2], a[i+0], a[i+1] and a[i+3] which is a permutation of consecutive addresses.

Masked loads and stores (will) also require that all pointers be within range of a single vector, right?

lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #136345)

Then that should be fixed, worth opening a PR?

lib/Transforms/Vectorize/SLPVectorizer.cpp
1621 ↗(On Diff #137784)

Very good, thanks.
Can CurrentOrder be used instead of I->getFirst()?
Can ++NumOpsWantToKeepOrder[CurrentOrder] work? After all the trouble...

3027 ↗(On Diff #137784)

May as well use E when resizing Mask.

3358 ↗(On Diff #137784)

Another inversePermutation?

5693 ↗(On Diff #137784)

Fold by feeding the constructor with the size.

1229 ↗(On Diff #136345)

Understood; suggested before to simply document that one could alternatively hold the count for the identity permutation inside NumOpsWantsToKeepOrder map, but we're instead keeping it outside using NumOpsWantToKeepOriginalOrder counter.

1581 ↗(On Diff #136345)

The current order may not be the best order, the first time the tree is built, right?

2029 ↗(On Diff #136345)

Right; it could simplify the code though, and N is usually very small.

ABataev added inline comments.Mar 12 2018, 9:35 AM
include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗(On Diff #136345)
  1. I see.
  2. Right
lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #136345)

I'm not sure that this is a bug, it is a feature :) . computeConstantDifference() is used to get the difference between the pointer returned by the GetUnderlyingObject() and kind of a GEP %ptr, 0, n, where n is constant. But this function does not work if the first load element is from GEP %ptr, 0, m, not from %ptr, and others are from GEP %ptr, 0, m+1, GEP %ptr, 0, m+2 etc.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1621 ↗(On Diff #137784)

No, we cannot use CurrentOrder. I tried to reduce the memory usage and instead of copying the CurrentOrder in the newTreEntry() function I keep the ArrayRef for this order. But CurrentOrder is the local variable and it will be destroyed when we exit out of its declaration scope. And, thus, we will keep the reference to incorrect memory. Instead, I need to use the reference that will exist until the end of the vectorization process (as the key of the map)

3027 ↗(On Diff #137784)

Oh, yes, missed it, thanks.

3358 ↗(On Diff #137784)

Yup, thanks.

5693 ↗(On Diff #137784)

Reworked it, thanks.

1229 ↗(On Diff #136345)

Added a comment to the declaration of NumOpsWantToKeepOrder

1581 ↗(On Diff #136345)

Yes, but this is the best order for this particular bundle. Anyway, I renamed it.

ABataev updated this revision to Diff 138035.Mar 12 2018, 9:35 AM

Update after review

Ayal added a comment.Fri, Mar 23, 4:55 PM

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗(On Diff #136345)

Check for zero gaps, i.e., replicated references?
These are currently not supported, when checking canReuseExtract().

lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #136345)

ok, right; computeConstantDifference() is lightweight. But it would be good to refactor a more time consuming variant which checks if getMinusSCEV returns a constant, and first compares UnderlyingObjects; and also AddressSpaces, following LoopVectorize.

lib/Transforms/Vectorize/SLPVectorizer.cpp
458 ↗(On Diff #138035)

Add a message to the assert.

1250 ↗(On Diff #138035)

"\a" >> "\p"

1254 ↗(On Diff #138035)

"DirectOrderNum" >> "NumOpsWantToKeepOriginalOrder" instead of adding the comment?

1617 ↗(On Diff #138035)

May be helpful to also dump CurrentOrder into dbgs().

1651 ↗(On Diff #138035)

Fold the size into the constructor.

2041 ↗(On Diff #138035)

// Assign initial value to of all items to E + 1 so we can check if the

// Assign to all items the initial value E + 1 so we can check if the

2044 ↗(On Diff #138035)

(if at least one element of ... ).

", by checking that no element of CurrentOrder still has value E + 1."
Note that there is no such check later.

2074 ↗(On Diff #138035)

It may be easier to read if instead of clearing CurrentOrder at each early exit, we break from the loop, and right after the loop check if it exited early and if (I < E) clear CurrentOrder and return false.

1621 ↗(On Diff #137784)

OK, right; it's clear why each Tree Entry should not hold an OrdersType object, and that newTreeEntry() should be given the object stored inside NumOpsWantToKeepOrder rather than the equivalent temporary CurrentOrder.

So ++NumOpsWantToKeepOrder[CurrentOrder] can work, but then newTreeEntry() will need to be given NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst()?

1659 ↗(On Diff #136345)

OK, so sortPtrAccesses() can serve more general cases where max - min > VF. It would still be helpful to wrap it, and refactor the above code/usage inside an isPermutedConsecutiveAccess() method which would call sortPtrAccesses().

ABataev marked 9 inline comments as done.Fri, Mar 30, 12:34 PM

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

We already have couple tests for extractvalue: ARM/sroa.ll and X86/insertvalue.ll. They already have tests with the different order of the extractvalue instructions.

include/llvm/Analysis/LoopAccessAnalysis.h
679 ↗(On Diff #136345)

Agree, will add checks

lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #136345)

Added checks for addressspace, comparison for underlying objects already were there.

lib/Transforms/Vectorize/SLPVectorizer.cpp
458 ↗(On Diff #138035)

Ok

2044 ↗(On Diff #138035)

This check is not required, it is checked automatically. We check that number of the extract elements is the same as the vector length at first. If later we try to write the index to element that does not equals E+1 it means that at least one element will still have E+1 as the value and we're have at least 2 elements with the same index.

ABataev updated this revision to Diff 140474.Fri, Mar 30, 12:34 PM
ABataev marked 3 inline comments as done.

Updated after review

Ayal accepted this revision.Sun, Apr 1, 12:29 AM

Looks good to me, thanks for addressing the issues, have only a few last minor suggestions.

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

Yes, but this must be in a different patch, not this one.

Sure, please add a TODO. This patch makes such rebuilds more frequent.

include/llvm/Analysis/LoopAccessAnalysis.h
673 ↗(On Diff #140474)

Sounds better to specify "Returns 'true' if ..., otherwise returns 'false'" ?

lib/Analysis/LoopAccessAnalysis.cpp
1113 ↗(On Diff #136345)

Right, the comment was about refactoring it altogether into a separate, more time consuming variant of computeConstantDifference() that checks everything; can alternatively leave a TODO for now?

lib/Transforms/Vectorize/SLPVectorizer.cpp
2074 ↗(On Diff #138035)

The suggestion for easier reading was for something like:

for (unsigned I = 0; I < E; ++I) {
  auto *Inst = cast<Instruction>(VL[I]);
  if (Inst->getOperand(0) != Vec)
    break;
  Optional<unsigned> Idx = getExtractIndex(Inst);
  if (!Idx)
    break;
  const unsigned ExtIdx = *Idx;
  if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1)
    break;
  CurrentOrder[ExtIdx] = I;
  if (ExtIdx != I)
    ShouldKeepOrder = false;
}

if (I < E) {
  CurrentOrder.clear();
  return false;
}
return ShouldKeepOrder;
1252 ↗(On Diff #140474)

/// DirectOrderNum. >> /// NumOpsWantToKeepOriginalOrder.

1629 ↗(On Diff #140474)

Would the following work and be easier to read?

++NumOpsWantToKeepOrder[CurrentOrder];
 
auto &StoredCurrentOrder = NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst();
newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies,
             StoredCurrentOrder);

or alternatively rename I to something more meaningful like StoredCurrentOrderAndNum.

2054 ↗(On Diff #140474)

May as well do CurrentOrder.assign(E, E+1);

This revision is now accepted and ready to land.Sun, Apr 1, 12:29 AM
This revision was automatically updated to reflect the committed changes.
ABataev marked 5 inline comments as done.