This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Make reordering aware of external vectorizable scalar stores.
ClosedPublic

Authored by vporpo on May 6 2022, 11:25 AM.

Details

Summary

The current reordering scheme only checks the ordering of in-tree operands.
There are some cases, however, where we need to adjust the ordering based on
the ordering of a future SLP-tree who's instructions are not part of the
current tree, but are external users.

This patch is a simple implementation of this. We keep track of scalar stores
that are users of TreeEntries and if they look profitable to vectorize, then
we keep track of their ordering. During the reordering step we take this new
index order into account. This can remove some shuffles in cases like in the
lit test.

Diff Detail

Event Timeline

vporpo created this revision.May 6 2022, 11:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2022, 11:25 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
vporpo requested review of this revision.May 6 2022, 11:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2022, 11:25 AM
jgorbe added a subscriber: jgorbe.May 6 2022, 11:31 AM
ABataev added inline comments.May 6 2022, 2:07 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

Maybe just add another analysis somewhere here instead of adding new fields etc.?

vporpo added inline comments.May 6 2022, 2:30 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

We could do the check here, but I think it is a bit more explicit if we have a field in TreeEntry. Also it is very similar in nature to the other reordering data, so I think they should be represented in a similar way. It also helps with debugging because you can actually see it with a dumpVectorizableTree() dump just like the other reordering data. Wdyt?

vdmitrie added inline comments.May 6 2022, 3:01 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4092

Hi Vasileios.

May I offer you to make some stylistic changes to this method to improve code readability?
I was bit lazy to place all comments inline - sorry for that :). It was much easier for me to just apply your patch and use editor locally. So I'm only putting summary here:

  • reduce abuse of "auto"
  • do not create thin lambdas just for minor reducing amount of code.
  • improve Cmp to not call getPointerDiff if we already failed once
  • added type check to Cmp before querying for diff (although I'm not sure whether we can have different store types here)
  • inline Cmp into the call for stable_sort
  • inline stable_sort into CanFormVector
  • outline CanFromVector outside the loop
  • I'm not sure we need to look at each store to find out whether they are all sequential. We probably can take diff between the first and the last ones.
  • use range loop for sequential index traversing

After applying all these suggestions here is how the method could possibly look:

SmallVector<BoUpSLP::OrdersType, 1>
BoUpSLP::findExternalStoreUsersReorderIndices(TreeEntry *TE) const {

unsigned NumLanes = TE->Scalars.size();

DenseMap<Value *, SmallVector<StoreInst *, 4>> PtrToStoresMap =
    collectUserStores(TE);

// Holds the reorder indices for each candidate store vector that is a user of
// the current TreeEntry.
SmallVector<OrdersType, 1> ExternalReorderIndices;

// \Returns true if the stores in `SortedStoresVec` are consecutive and can
// form a vector.
auto &&CanFormVector = [this](SmallVector<StoreInst *, 4> &StoresVecSorted) {
  Type *Ty = StoresVecSorted.front()->getValueOperand()->getType();
  bool FailedToSort = false;
  stable_sort(StoresVecSorted, [this, &FailedToSort, Ty](StoreInst *S1,
                                                         StoreInst *S2) {
    if (FailedToSort)
      return false;
    if (S1->getValueOperand()->getType() != Ty ||
        S2->getValueOperand()->getType() != Ty) {
      FailedToSort = true;
      return false;
    }
    Optional<int> Diff = getPointersDiff(Ty, S2->getPointerOperand(), Ty,
                                         S1->getPointerOperand(), *DL, *SE,
                                         /*StrictCheck=*/true);
    if (!Diff) {
      FailedToSort = true;
      return false;
    }
    return Diff < 0;
  });
  // If we failed to compare stores, then just abandon this stores vector.
  if (FailedToSort)
    return false;
  Optional<int> Range =
      getPointersDiff(Ty, StoresVecSorted.front()->getPointerOperand(), Ty,
                      StoresVecSorted.back()->getPointerOperand(), *DL, *SE,
                      /*StrictCheck=*/true);
  return *Range == (int)StoresVecSorted.size() - 1;
};

// Now inspect the stores collected per pointer and look for vectorization
// candidates. For each candidate calculate the reorder index vector and push
// it into `ExternalReorderIndices`
for (const auto &Pair : PtrToStoresMap) {
  const SmallVector<StoreInst *, 4> &StoresVec = Pair.second;

  // If we have fewer than NumLanes stores, then we can't form a vector.
  if (StoresVec.size() != NumLanes)
    continue;

  // Sort the vector based on the pointers. We create a copy because we may
  // need the original later for calculating the reorder (shuffle) indices.
  auto StoresVecSorted = StoresVec;

  // If the stores are not consecutive then abandon this sotres vector.
  if (!CanFormVector(StoresVecSorted))
    continue;

  // The scalars in StoresVec can form a vector instruction, so calculate the
  // shuffle indices.
  ExternalReorderIndices.resize(ExternalReorderIndices.size() + 1);
  OrdersType &ReorderIndices = ExternalReorderIndices.back();
  for (StoreInst *SI : StoresVec) {
    unsigned Idx = llvm::find(StoresVecSorted, SI) - StoresVecSorted.begin();
    ReorderIndices.push_back(Idx);
  }

  // Identity order (e.g., {0,1,2,3}) is modeled as an empty OrdersType in
  // reorderTopToBottom() and reorderBottomToTop(), so we are following the
  // same convention here.
  auto IsIdentityOrder = [](const OrdersType &Order) {
    for (unsigned Idx : seq<unsigned>(0, Order.size()))
      if (Idx != Order[Idx])
        return false;
    return true;
  };
  if (IsIdentityOrder(ReorderIndices))
    ReorderIndices.clear();
}
return ExternalReorderIndices;

}

ABataev added inline comments.May 6 2022, 3:07 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

I rather doubt that it is wise decision to waste some extra memory just to handle this corner case.

vporpo added a comment.May 6 2022, 3:16 PM

Thanks @vdmitrie for your comments. I will update the code.

vporpo added inline comments.May 6 2022, 3:26 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

I don't think that memory is a concern here since the VectorizableTree does not grow too large and we clear it before we build the next one. I think it is not worth making it less explicit just to save some memory. Reordering is already quite complex and without having this explicitly shown in the dump it would just make debugging harder.
@vdmitrie any thoughts on this?

ABataev added inline comments.May 6 2022, 3:38 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

The reordering data is not stored in the tree, except for some cases, where this data is also required for correct codegen/cost estimation. I do not like the idea to keep this data in the tree without actually being used for cost/codegen.

vdmitrie added inline comments.May 6 2022, 4:02 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

I do not have strong objections wrt to keeping it in the tree although Alexey's arguments sound very reasonable. If debug printing is the issue then may be it worth trying to tweak debug printing routine(s) instead? dumpVectorizableTree() for example could collect this data and print it alongside with each tree node.

ABataev added inline comments.May 6 2022, 4:11 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

Better to teach reordering functions about dumping, rather than put some service data to the tree structure.

vporpo added inline comments.May 6 2022, 4:16 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

It is not just about debugging, it is also about the design.

I agree with Alexey that, we should not keep data in the tree that is purely temporal. But I think in this case it is not temporal data. I believe it is a good design principle to separate phases, just because we can place the analysis in the reorder function, I don't think we should. Please bear in mind that we may decide in the future to extend the analysis to cover more cross-tree cases like this by doing a more extensive search, so the analysis could grow.

If we do decide to have this analysis as a separate phase, then the natural place for holding the ordering data is the TreeEntry. I don't think this should be restricted to data passed to the cost/codegen phase only. Any data passed from analysis to transformations should qualify, reordering included.

ABataev added inline comments.May 6 2022, 4:33 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3537

I would not store it in the tree unless we definitely will use it somewhere else except for reordering. If (!) we'll need it for something else (probably, cost estimation), we can make this data a data member. It should not be hard. But before that I'd keep it internal to reordering phase.

What's the impact on the tests/benchmarks?

vporpo added a comment.May 6 2022, 4:47 PM

This fixes a regression in eigen BM_Dot_ComplexComplex_Naive. I doubt it will have any large impact anywhere else, but we should know soon enough.

This fixes a regression in eigen BM_Dot_ComplexComplex_Naive. I doubt it will have any large impact anywhere else, but we should know soon enough.

The regressions would still appear unless final solution for non-power-of-2 is landed. Even after this there might be the issues with the cost model. What is the actual cause of the regression? The code was vectorized before but then it did not? There are couole regressions in reductiins, caused bya bit dufferent processing order, they should go away with the final code for reductions and non-power-of2.

vdmitrie added inline comments.May 6 2022, 5:02 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3556

not your code but this is unreachable because of check at line 3539.

3599

nit: name Order is already used by lambda above.

4042

nit: for (unsigned Lane : seq<unsigned>(0, TE->Scalars.size())) {
then NumLanes can be dropped

4056

nit: use dyn_cast and move it to line 4043 (presumably with isSimple and isValidElementType checks)

vporpo added a comment.May 6 2022, 5:05 PM

The regressions would still appear unless final solution for non-power-of-2 is landed. Even after this there might be the issues with the cost model. What is the actual cause of the regression? The code was vectorized before but then it did not? There are couole regressions in reductiins, caused bya bit dufferent processing order, they should go away with the final code for reductions and non-power-of2.

This is not related to the non-power-of-2. The cause is shown in the lit test: SLP vectorizes trees in isolation so it may generate extra shuffles. It was triggered by the load broadcast cost change.

I'm also observing a stability issue. I'll submit a test case once reduce it.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3599

nit: name Order is already used by lambda above.

just to correct myself: not lambda but by the result of the lambda call ( I first overlooked it is a call).
Anyway, Order in this loop hides Order from line 3582

4139

std::distance(StoresVecSorted.begin(), find(StoresVecSorted, SI));

vporpo updated this revision to Diff 427807.May 6 2022, 6:29 PM
vporpo marked 3 inline comments as done.

Addressed most of @vdmitrie's comments.

vporpo added inline comments.May 6 2022, 6:31 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4092

I think I addressed most of these points. Please take a look.

vporpo added a comment.May 6 2022, 6:36 PM

@vdmitrie please let me know if you find a stability issue, I will do more testing on my side too.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4139

Hmm why is it preferable over operator-()? StoresVecSorted is a SmallVector, it should implement a random access iterator.

to reproduce crash: opt -slp-vectorizer -disable-output reduced.ll
Fails on assertion:
... llvm-project/llvm/include/llvm/ADT/DenseMap.h:1244: llvm::DenseMapIterator<KeyT, ValueT, KeyInfoT, Bucket, IsConst>::value_type* llvm::DenseMapIterator<KeyT, ValueT, KeyInfoT, Bucket, IsConst>::operator->() const [with KeyT = const llvm::slpvectorizer::BoUpSLP::TreeEntry*; ValueT = llvm::SmallVector<unsigned int, 4>; KeyInfoT = llvm::DenseMapInfo<const llvm::slpvectorizer::BoUpSLP::TreeEntry*, void>; Bucket = llvm::detail::DenseMapPair<const llvm::slpvectorizer::BoUpSLP::TreeEntry*, llvm::SmallVector<unsigned int, 4> >; bool IsConst = false; llvm::DenseMapIterator<KeyT, ValueT, KeyInfoT, Bucket, IsConst>::pointer = llvm::detail::DenseMapPair<const llvm::slpvectorizer::BoUpSLP::TreeEntry*, llvm::SmallVector<unsigned int, 4> >*; llvm::DenseMapIterator<KeyT, ValueT, KeyInfoT, Bucket, IsConst>::value_type = llvm::detail::DenseMapPair<const llvm::slpvectorizer::BoUpSLP::TreeEntry*, llvm::SmallVector<unsigned int, 4> >]: Assertion `Ptr != End && "dereferencing end() iterator"' failed.

Thanks for the reproducer @vdmitrie , think I fixed the issue.

vporpo updated this revision to Diff 428132.May 9 2022, 10:41 AM

Fixed stability issue and also removed user tree indices out of the TreeEntry.

vdmitrie accepted this revision.May 10 2022, 7:59 AM

Looks good with a nit.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4119

wrap it with #ifndef NDEBUG (or delete).

This revision is now accepted and ready to land.May 10 2022, 7:59 AM
dmgreen added inline comments.May 10 2022, 8:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4094

Just a quick point - I would recommend against calling getPointersDiff in the sort compare function. I believe std::sorts require a strict weak ordering, and some compilers (MSVC) will complain if they do not. It's also probably calling getPointersDiff more times than necessary, being O(NlogN) as opposed to the N-1 calls needed.

I think it should be possible to precompute all the offsets from the first pointer initially, and sort the offsets?

vporpo updated this revision to Diff 428476.May 10 2022, 1:05 PM

Thanks for the comments. Updated sorting according to @dmgreen's comments.

This revision was landed with ongoing or failed builds.May 10 2022, 3:27 PM
This revision was automatically updated to reflect the committed changes.

this is causing crashes

$ cat /tmp/d.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @f() {
  store i32 undef, ptr undef, align 4
  %1 = getelementptr inbounds [4 x i32], ptr undef, i64 0, i64 3
  %2 = load i32, ptr null, align 4
  store i32 %2, ptr %1, align 4
  br label %3

3:                                                ; preds = %0
  br label %4

4:                                                ; preds = %9, %3
  %5 = phi i32 [ %2, %3 ], [ 0, %9 ]
  %6 = phi i32 [ undef, %3 ], [ 0, %9 ]
  %7 = phi i32 [ undef, %3 ], [ 0, %9 ]
  %8 = phi i32 [ %2, %3 ], [ 0, %9 ]
  br label %9

9:                                                ; preds = %4
  br label %4
}
$ ./build/rel/bin/opt -passes=slp-vectorizer -disable-output /tmp/d.ll
opt: ../../llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:5851: llvm::InstructionCost llvm::slpvectorizer::BoUpSLP::getEntryCost(const llvm::slpvectorizer::BoUpSLP::TreeEntry *, ArrayRef<llvm::Value *>): Assertion `E->getOpcode() && allSameType(VL) && allSameBlock(VL) && "Invalid VL"' failed.