Page MenuHomePhabricator

[SLP] Add support for throttling.
Needs ReviewPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev added inline comments.Jul 31 2020, 11:35 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3443–3446

I mean, that the check in the previous if must be Entry->State == TreeEntry::ProposedToGather, not != TreeEntry::Vectorize

3462–3465

Why?

dtemirbulatov added inline comments.Jul 31 2020, 11:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3462–3465

For example, in the first loops, we could change from Entry1 TreeEntry::ProposedToGather to TreeEntry::NeedToGather status, but we later could encounter another use of this Entry1 and from another Entry2()let's say) with TreeEntry::Vectorize status and we could tell difference with just canceled item and not considered to vectorize Entry. thus ExternalUses would not be properly populated.

ABataev added inline comments.Jul 31 2020, 11:45 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3462–3465

The first loop does not change the state of the tree entries.

dtemirbulatov added inline comments.Jul 31 2020, 11:53 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3462–3465

I mean if we merge them into one loop.

oh, sorry I misspelled:
For example, in the first loops, we could change from Entry1 TreeEntry::ProposedToGather to TreeEntry::NeedToGather status, but we later could encounter another use of this Entry1 and from another Entry2()let's say) with TreeEntry::Vectorize status and we could NOT tell difference with just canceled item and not considered to vectorize Entry. thus ExternalUses would not be properly populated.

Rebased, Ping

RKSimon added inline comments.Aug 10 2020, 6:30 AM
clang/tools/clang-tidy
2 ↗(On Diff #284313)

Remove this

llvm/tools/mlir
2 ↗(On Diff #284313)

remove this

RKSimon added inline comments.Aug 10 2020, 8:42 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7402

This looks like a NFC clang-format change now - either pre-commit or discard from the patch?

llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll
59

rebase - this was committed at rG90f721404ff8

Rebased, Fixed.

oh, I missed to fully remove from diff at 7269, Fixed

RKSimon added inline comments.Aug 11 2020, 2:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4402–4414

Maybe pull this NDEBUG change out into its own patch?

6656

This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?

dtemirbulatov added inline comments.Aug 11 2020, 3:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4402–4414

yes, it could go as NFC.

6656

No, there is another instance of UserCost at 6476, We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.

dtemirbulatov added inline comments.Aug 11 2020, 3:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6656

I mean the instance of usage.

@ABataev @anton-afanasyev Any more comments on this?

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll
2–3

Is it worth adding a second RUN with -slp-throttle=false ?

xbolva00 added inline comments.Aug 16 2020, 10:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
640

Please mention paper name:

“Throttling Automatic Vectorization: When Less Is More”

https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf

Slides are good, but paper is paper :)

Corrected paper citation, added -slp-throttle=false to llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll, rebased.

dtemirbulatov marked 2 inline comments as done.Aug 17 2020, 4:21 AM
ABataev added inline comments.Aug 21 2020, 6:51 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3450

What if the user does not have corresponding tree entry, i.e. it is initially scalar? What if the Scalar itself is going to remain scalar?

4294–4301

Just:

for (Value *V : Entry->Scalars) {
  auto *Inst = cast<Instruction>(V);
  if (llvm::any_of(Inst->users(), [this](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; }))
      return InsertCost + getEntryCost(Entry);
}

Also, check code formatting

dtemirbulatov added inline comments.Aug 21 2020, 7:38 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3450

What if the Scalar itself is going to remain scalar?

At this point, the decision to cut the tree was made and the Scalar could be only with intend to vectorize. Note about that 3295 we are ignoring any tree entries without State not equals TreeEntry::Vectorize.

What if the user does not have corresponding tree entry, i.e. it is initially scalar?

ah, yes. I have to check that !UserTE at 3305 and just continue if it is true.

dtemirbulatov added inline comments.Aug 21 2020, 3:23 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4294–4301

hmm, I think this is not a correct suggestion, there might be several tree entries with TreeEntry::ProposedToGather status and we have to calculate Insert cost for the whole tree here.

ABataev added inline comments.Aug 21 2020, 3:27 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4294–4301

Yeah, maybe. But you van do something similar, like

InsertCost += ...
break;

instead of setting flag and do a check after the loop.

Fixed remarks, rebased.

dtemirbulatov marked 3 inline comments as done.Aug 21 2020, 3:52 PM

Removed unnecessary check for "UserTE" at 3305.

Rebased. Ping.

Good enough for initial implementation?

Good enough for initial implementation?

yes, For me, it looks like ready.

Good enough for initial implementation?

yes, For me, it looks like ready.

Will be able to review it next week, after returning from vacation.

ABataev added inline comments.Sep 15 2020, 9:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3451–3452

Could you compare it with a similar code in BoUpSLP::buildTree? Looks like you still missed some cases for user ignoring.

Matt added a subscriber: Matt.Sep 16 2020, 8:53 AM
dtemirbulatov updated this revision to Diff 293457.EditedSep 22 2020, 8:09 AM
dtemirbulatov marked an inline comment as done.

Rebased. Moved InternalTreeUses population out of (UseScalar != U || !InTreeUserNeedToExtract(Scalar, UserInst, TLI)) limitation at line 2661 in BoUpSLP::buildTree(), since we have to consider every interal user for partial vectorization, while calculating cost.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3451–3452

I think those ignoring cases are related to the fact that we are doing full vectorization at BoUpSLP::buildTree and we can avoid extracting for in-tree users. And here we have to extract to each user of once proposed to vectorized value.

dtemirbulatov added inline comments.Sep 22 2020, 8:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3451–3452

And here we have to extract to each user of once proposed to vectorized value. I mean for the partial vectorization.

Rebased. PING

Rebased. Ping.

Rebased. Ping^2

ABataev added inline comments.Nov 23 2020, 9:50 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3455–3457

Either just cast without if or dyn_cast

4378

Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree.
What I would do here is just some kind of level ordering search (BFS) starting from the deepest level.

4385

I think you can also exclude entries with the number of operands <= 1.

dtemirbulatov added inline comments.Dec 3 2020, 5:55 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4378

Hmm, implemented, but I don't see any benefit from that, plus we have to do BFS search. And we are going to throw away any non-vectorizable nodes at 4295.

4385

But why? The only thing that matters here is the cost.

dtemirbulatov marked 2 inline comments as done.Dec 3 2020, 5:57 PM
ABataev added inline comments.Dec 4 2020, 5:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4378

It may trigger for targets like silvermont or in future for vectorized functions.

4385

Because the main idea is to drop gathers and drop one gather in favor of another one will not be profitable for sure. But it may improve compile time and the list of candidates, The only case you need to check for is the latest masked gather case, it may be profitable to convert it to gathers for some targets.

dtemirbulatov marked an inline comment as done.Dec 7 2020, 7:32 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4378

I measured the BFS approach vs this implementation. And with BFS, it is ~10% less efficient on SPEC2006 INT and ~20% less on compilable SPEC2006 FP. By efficiency, I mean the total number of reduced trees while the whole compilation.

4385

I think I can check here if scutter/gather is supported via TargetrInfo and if it is not then move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.

anton-afanasyev added inline comments.Dec 7 2020, 8:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4385

I believe it's wrong decision to check scatter/gather target support for the reason mentioned here https://reviews.llvm.org/D92701#2435573. Why could not we just rely on costs (node cost and total one)?

dtemirbulatov marked an inline comment as not done.Dec 7 2020, 9:04 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4385

I agree with @anton-afanasyev here. I am not sure what @ABataev wants here? If I exclude (operands <= 1) then we would lose have of all tests in SLP affected by throttling.

ABataev added inline comments.Dec 7 2020, 9:16 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4378

Could you post it anyway to check if it may be improved?

4385

I did not say anything about checking if scatter is supported here. I just said that we can improve the criterion here by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it) and we just need to check the nodes with only 1 operand if it is gather scatter node, because it may be better to represent it as simple gather.

dtemirbulatov added inline comments.Dec 7 2020, 9:19 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4378

ok, I might miss something. Thanks.

Here is the BFS version of the change. Rebased.

And I counted the total number of nodes vectorized with throttling, instead of just the number of successful tree reductions. So, the total number is higher ~25% for INT and FP CPU2006(AVX2 and AVX512F) with Cost sort compare to Distance.

Discussed with @ABataev further improvements offline and he suggested removing the throttle limiter ("slp-throttling-budget"), at least for basic blocks without calls. I am looking for new functionality.

Removed "slp-throttling-budget" limiter for trees without calls
Moved the main tree reduction loop to getTreeCost() function
deleted ProposedToGather node attribute out of EntryState