This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fix PR36481: vectorize reassociated instructions.
ClosedPublic

Authored by ABataev on Feb 26 2018, 12:20 PM.

Details

Diff Detail

Event Timeline

ABataev created this revision.Feb 26 2018, 12:20 PM
ABataev updated this revision to Diff 136325.Feb 28 2018, 10:00 AM

Updated to the latest version

ABataev updated this revision to Diff 136345.Feb 28 2018, 11:13 AM

Fixed generation of mask for shuffling of reordered instructions.

Ayal added a comment.Mar 7 2018, 11:38 PM

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

include/llvm/Analysis/LoopAccessAnalysis.h
679

The usage later and documentation above look for a set of references that are consecutive under some permutation. I.e., w/o gaps. The implementation below allows arbitrary gaps, and does not check for zero gaps, i.e., replicated references.

Would it be better to simply do a Pigeonhole sort, and perhaps call it isPermutedConsecutiveAccess():

  1. Scan all references to find the one with minimal address. Bail out if any reference is incomparable with the running min.
  2. Scan all references and set SortedIndices[i] = difference(VL[i], Minimum). Bail out if this entry is to be set more than once.

Note: it may be good to support replicated references, say when the gaps are internal to the vector to avoid loading out of bounds. Perhaps add a TODO.

Note: it may be good to have LAA provide some common support for both SLP and LV's InterleaveGroup construction, which share some aspects. Perhaps add a TODO.

lib/Analysis/LoopAccessAnalysis.cpp
1113

Use computeConstantDifference() instead of computing it explicitly? It should compare GetUnderlyingObject()'s, if worthwhile, rather than doing so here.

lib/Transforms/Vectorize/SLPVectorizer.cpp
454

Update above documentation accordingly.

Instead of returning the index when it's not Idx, may as well have getExtractIndex() return it always, and have the caller compare it to Idx?
While we're at it, may as well pass only E and have the callee get its Opcode.

607

The order which bestOrder() provides is then used to form a vector of instructions. Suggest to have this method supply the desired vector, given the instructions to permute.

667
and returns the mask for reordering operations, if it allows

should specify more accurately something like:
...and sets \p BestOrder to the identity permutation; otherwise returns False, setting \p BestOrder to either an empty vector or a non-identity permutation that allows...

1222–1223

Update above documentation.

1226

Can the permutations be kept inside NumOpsWantsToKeepOrder, using OrdersType as its key, instead of holding them in OpOrders? So that one could later simply do

++NumOpsWantsToKeepOrder[CurrentOrder];

See, e.g., UniquifierDenseMapInfo in LSR.

1227

Document what DirectOrderNum counts and/or have a more self-explanatory name similar to the original one, e.g., NumOpsWantToKeepOriginalOrder

Can add that NumOpsWantToKeepOriginalOrder holds the count for the identity permutation, instead of holding this count inside NumOpsWantsToKeepOrder along with all other permutations.

1604–1605

BestOrder >> CurrentOrder?

1609

Better early exit by returning here.

1612

This is pretty hard to follow, and deserves an explanation. Would be better to simply do something like ++NumOpsWantToKeepOrder[BestOrder].

1651

Sink the emplace_back to after the handling of non-simple loads?

1670–1671

"BestOrder" >> "CurrentOrder", or "VLOrder"?

1674

Reuse PointerOps and have Value *P0,1 = PointerOps.front,back() instead of LoadInst *L0,1 just to get their PointerOperand (and SCEV) later?

1683

Better have sortPtrAccess() set "BestOrder" only if the given pointers are indeed consecutive once permuted, instead of checking here the Diff of max - min.

2038

Comment that BestOrder is initialize to invalid values.
Perhaps set E = VL.size() here and assign E + 1, to match the later checks for initialized/unset values.

2057

Can simplify by checking if BestOrder is the identify permutation at the end, as done at the end of sortPtrAccesses(); using getExtractIndex(Inst) which returns Idx even if it's equal to I. Better rename BestOrder here too.

3113

Capture this "inversePermutation" in a method, to be called again below?

3113–3120

InsertPoint may be set twice to VL0?

3331

Should we first inverse the permutation and then take its front()?

Would be good to have a testcase where this makes a difference and check it (one way or the other), if there isn't one already.

4949

If only two operations are still allowed, ReorderedOps may as well stay Ops[1], Ops[0].
Provide the generalization below to any permutation when the assert can be dropped, i.e., in a separate patch which handles this TODO?

4951

Provide Ops.size() as operand to constructor of ReorderedOps (multiple similar occurrences).

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24

This redundant sequence of extractions from REORDER_SHUFFLE and insertions into TMP13 is hopefully eliminated later. Is the cost model ignoring it, or can we avoid generating it? Would be good to have the test CHECK the cost.

ABataev marked 12 inline comments as done.Mar 9 2018, 9:52 AM

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

Yes, but this must be in a different patch, not this one.

include/llvm/Analysis/LoopAccessAnalysis.h
679

The documentation above does not say anything about consecutive access. It just states, that the pointers are sorted and that's it. I did it on purpose, so later we could reuse this function for masked loads|stores.
Masked loads are not supported yet, that's why in the SLPVectorizer I added an additional check for the consecutive access.

lib/Analysis/LoopAccessAnalysis.cpp
1113

Tried, it does not work in many cases.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1227

I don't want to add the new entry for operations, that do not require reordering. I'd better the code in another way.

1604–1605

I think it does not matter because the current order is the best order.

1612

I need to use an iterator, will rework the code.

1683

Just like I said, I did it for future support of masked loads|stores.

2057

Check the boolean flag is faster than to perform N comparisons. That's why I'd prefer to leave it as is.

3113–3120

Missed it, thanks.

3331

There are several tests already that test that this is correct code.
SLPVectorizer/X86/PR32086.ll, SLPVectorizer/X86/jumbled-load.ll, SLPVectorizer/X86/store-jumbled.ll etc.

test/Transforms/SLPVectorizer/X86/external_user_jumbled_load.ll
24

Yes, InstCombiner will squash all these instructions into one shuffle. Yes, cost model is aware of these operations and ignores their cost (canReuseExtract() is intended to do this).

ABataev updated this revision to Diff 137784.Mar 9 2018, 10:02 AM
ABataev marked 2 inline comments as done.

Updates after review

Ayal added inline comments.Mar 9 2018, 2:43 PM
include/llvm/Analysis/LoopAccessAnalysis.h
679

By "The documentation above" I meant the one example of a[i+2], a[i+0], a[i+1] and a[i+3] which is a permutation of consecutive addresses.

Masked loads and stores (will) also require that all pointers be within range of a single vector, right?

lib/Analysis/LoopAccessAnalysis.cpp
1113

Then that should be fixed, worth opening a PR?

lib/Transforms/Vectorize/SLPVectorizer.cpp
1227

Understood; suggested before to simply document that one could alternatively hold the count for the identity permutation inside NumOpsWantsToKeepOrder map, but we're instead keeping it outside using NumOpsWantToKeepOriginalOrder counter.

1604–1605

The current order may not be the best order, the first time the tree is built, right?

1621

Very good, thanks.
Can CurrentOrder be used instead of I->getFirst()?
Can ++NumOpsWantToKeepOrder[CurrentOrder] work? After all the trouble...

2057

Right; it could simplify the code though, and N is usually very small.

3027

May as well use E when resizing Mask.

3358

Another inversePermutation?

5693

Fold by feeding the constructor with the size.

ABataev added inline comments.Mar 12 2018, 9:35 AM
include/llvm/Analysis/LoopAccessAnalysis.h
679
  1. I see.
  2. Right
lib/Analysis/LoopAccessAnalysis.cpp
1113

I'm not sure that this is a bug, it is a feature :) . computeConstantDifference() is used to get the difference between the pointer returned by the GetUnderlyingObject() and kind of a GEP %ptr, 0, n, where n is constant. But this function does not work if the first load element is from GEP %ptr, 0, m, not from %ptr, and others are from GEP %ptr, 0, m+1, GEP %ptr, 0, m+2 etc.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1227

Added a comment to the declaration of NumOpsWantToKeepOrder

1604–1605

Yes, but this is the best order for this particular bundle. Anyway, I renamed it.

1621

No, we cannot use CurrentOrder. I tried to reduce the memory usage and instead of copying the CurrentOrder in the newTreEntry() function I keep the ArrayRef for this order. But CurrentOrder is the local variable and it will be destroyed when we exit out of its declaration scope. And, thus, we will keep the reference to incorrect memory. Instead, I need to use the reference that will exist until the end of the vectorization process (as the key of the map)

3027

Oh, yes, missed it, thanks.

3358

Yup, thanks.

5693

Reworked it, thanks.

ABataev updated this revision to Diff 138035.Mar 12 2018, 9:35 AM

Update after review

Ayal added a comment.Mar 23 2018, 4:55 PM

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

include/llvm/Analysis/LoopAccessAnalysis.h
679

Check for zero gaps, i.e., replicated references?
These are currently not supported, when checking canReuseExtract().

lib/Analysis/LoopAccessAnalysis.cpp
1113

ok, right; computeConstantDifference() is lightweight. But it would be good to refactor a more time consuming variant which checks if getMinusSCEV returns a constant, and first compares UnderlyingObjects; and also AddressSpaces, following LoopVectorize.

lib/Transforms/Vectorize/SLPVectorizer.cpp
458

Add a message to the assert.

1250

"\a" >> "\p"

1254

"DirectOrderNum" >> "NumOpsWantToKeepOriginalOrder" instead of adding the comment?

1615

May be helpful to also dump CurrentOrder into dbgs().

1621

OK, right; it's clear why each Tree Entry should not hold an OrdersType object, and that newTreeEntry() should be given the object stored inside NumOpsWantToKeepOrder rather than the equivalent temporary CurrentOrder.

So ++NumOpsWantToKeepOrder[CurrentOrder] can work, but then newTreeEntry() will need to be given NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst()?

1649

Fold the size into the constructor.

1683

OK, so sortPtrAccesses() can serve more general cases where max - min > VF. It would still be helpful to wrap it, and refactor the above code/usage inside an isPermutedConsecutiveAccess() method which would call sortPtrAccesses().

2039

// Assign initial value to of all items to E + 1 so we can check if the

// Assign to all items the initial value E + 1 so we can check if the

2042

(if at least one element of ... ).

", by checking that no element of CurrentOrder still has value E + 1."
Note that there is no such check later.

2072

It may be easier to read if instead of clearing CurrentOrder at each early exit, we break from the loop, and right after the loop check if it exited early and if (I < E) clear CurrentOrder and return false.

ABataev marked 9 inline comments as done.Mar 30 2018, 12:34 PM

Have test(s) for extractvalue's, for completeness.
Make sure tests cover best-order selection: cases where original order is just as frequent as other orders (tie-break), less frequent, more frequent.

We already have couple tests for extractvalue: ARM/sroa.ll and X86/insertvalue.ll. They already have tests with the different order of the extractvalue instructions.

include/llvm/Analysis/LoopAccessAnalysis.h
679

Agree, will add checks

lib/Analysis/LoopAccessAnalysis.cpp
1113

Added checks for addressspace, comparison for underlying objects already were there.

lib/Transforms/Vectorize/SLPVectorizer.cpp
458

Ok

2042

This check is not required, it is checked automatically. We check that number of the extract elements is the same as the vector length at first. If later we try to write the index to element that does not equals E+1 it means that at least one element will still have E+1 as the value and we're have at least 2 elements with the same index.

ABataev updated this revision to Diff 140474.Mar 30 2018, 12:34 PM
ABataev marked 3 inline comments as done.

Updated after review

Ayal accepted this revision.Apr 1 2018, 12:29 AM

Looks good to me, thanks for addressing the issues, have only a few last minor suggestions.

This patch addresses the following TODO, plus handles extracts:

// Check if the loads are consecutive, reversed, or neither.
// TODO: What we really want is to sort the loads, but for now, check
// the two likely directions.

At some point it's worth documenting that the best order is set once, at the root of the tree, and then gets propagated to its leaves. Would be good to do so w/o having to rebuild the tree (introduced before this patch)?

Yes, but this must be in a different patch, not this one.

Sure, please add a TODO. This patch makes such rebuilds more frequent.

include/llvm/Analysis/LoopAccessAnalysis.h
673

Sounds better to specify "Returns 'true' if ..., otherwise returns 'false'" ?

lib/Analysis/LoopAccessAnalysis.cpp
1113

Right, the comment was about refactoring it altogether into a separate, more time consuming variant of computeConstantDifference() that checks everything; can alternatively leave a TODO for now?

lib/Transforms/Vectorize/SLPVectorizer.cpp
1251

/// DirectOrderNum. >> /// NumOpsWantToKeepOriginalOrder.

1626

Would the following work and be easier to read?

++NumOpsWantToKeepOrder[CurrentOrder];

auto &StoredCurrentOrder = NumOpsWantToKeepOrder.find(CurrentOrder)->getFirst();
newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx, ReuseShuffleIndicies,
             StoredCurrentOrder);

or alternatively rename I to something more meaningful like StoredCurrentOrderAndNum.

2063

May as well do CurrentOrder.assign(E, E+1);

2072

The suggestion for easier reading was for something like:

for (unsigned I = 0; I < E; ++I) {
  auto *Inst = cast<Instruction>(VL[I]);
  if (Inst->getOperand(0) != Vec)
    break;
  Optional<unsigned> Idx = getExtractIndex(Inst);
  if (!Idx)
    break;
  const unsigned ExtIdx = *Idx;
  if (ExtIdx >= E || CurrentOrder[ExtIdx] != E + 1)
    break;
  CurrentOrder[ExtIdx] = I;
  if (ExtIdx != I)
    ShouldKeepOrder = false;
}

if (I < E) {
  CurrentOrder.clear();
  return false;
}
return ShouldKeepOrder;
This revision is now accepted and ready to land.Apr 1 2018, 12:29 AM
This revision was automatically updated to reflect the committed changes.
ABataev marked 5 inline comments as done.