Page MenuHomePhabricator

[SLP] Add support for throttling.
Needs ReviewPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ABataev added inline comments.Jul 10 2020, 8:19 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4198–4202

llvm::any_of(Inst->users(), [Tree](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; }

4235

>=

6487

Do you really need this new var here? I don't see where it is used except as an argument of R.findSubTree(UserCost) call

dtemirbulatov marked 7 inline comments as done.Jul 13 2020, 7:08 PM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3381–3390

Thanks.

4222–4225

eh, no it is std::_Rb_tree_const_iterator<llvm::slpvectorizer::BoUpSLP::TreeEntry*>.

6487

I think we need to compensate the ExctractCost with that cost of the insert sequence as in case of full-vectorization.

dtemirbulatov marked an inline comment as done.

Addressed comments, rebased.

ABataev added inline comments.Jul 14 2020, 10:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3404

Looks like the Scalar should be extracted only if its user is vectorized and it remains to be scalar in the vectorized tree. Or it is not going to be vectorized.

4222–4225

Why is this a const iterator?

6546

Better to enclose this substatement in braces to make the code uniform

dtemirbulatov updated this revision to Diff 279070.EditedJul 19 2020, 4:14 AM
dtemirbulatov marked 5 inline comments as done.

Addressed remarks, rebased.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3404

Thanks, good catch. I think I need first to populate ExternalUser as TreeEntires are mark with ProposedToGather and then change them to a NeedToGather node in the separate loop.

4222–4225

std::set iterators are bidirectional, not random-access. we have to use std::advance

ABataev added inline comments.Jul 29 2020, 7:31 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
118–128

Tabs are added

2381–2388

Tabs

3385–3388

Should this loop be executed only for ProposedToGather Entrys?

3404–3407

I actually don't see propagation for ProposedTogather and these loops can be merged, no?

4275–4278

Tabs

4292

Tab

5201

Use llvm::erase_if

dtemirbulatov marked 2 inline comments as done.Jul 31 2020, 11:32 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2381–2388

This is now part of TreeState structure, this is LLVM's standard format(clang-format).

3385–3388

Yes, but we have to know the lane there.

3404–3407

No, No possible to merge those two loops.

4275–4278

this is LLVM's standard format(clang-format).

4292

this is LLVM's standard format(clang-format).

Rebased, addressed comments

ABataev added inline comments.Jul 31 2020, 11:35 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3385–3388

I mean, that the check in the previous if must be Entry->State == TreeEntry::ProposedToGather, not != TreeEntry::Vectorize

3404–3407

Why?

dtemirbulatov added inline comments.Jul 31 2020, 11:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3404–3407

For example, in the first loops, we could change from Entry1 TreeEntry::ProposedToGather to TreeEntry::NeedToGather status, but we later could encounter another use of this Entry1 and from another Entry2()let's say) with TreeEntry::Vectorize status and we could tell difference with just canceled item and not considered to vectorize Entry. thus ExternalUses would not be properly populated.

ABataev added inline comments.Jul 31 2020, 11:45 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3404–3407

The first loop does not change the state of the tree entries.

dtemirbulatov added inline comments.Jul 31 2020, 11:53 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3404–3407

I mean if we merge them into one loop.

oh, sorry I misspelled:
For example, in the first loops, we could change from Entry1 TreeEntry::ProposedToGather to TreeEntry::NeedToGather status, but we later could encounter another use of this Entry1 and from another Entry2()let's say) with TreeEntry::Vectorize status and we could NOT tell difference with just canceled item and not considered to vectorize Entry. thus ExternalUses would not be properly populated.

Rebased, Ping

RKSimon added inline comments.Aug 10 2020, 6:30 AM
clang/tools/clang-tidy
2 ↗(On Diff #284313)

Remove this

llvm/tools/mlir
2 ↗(On Diff #284313)

remove this

RKSimon added inline comments.Aug 10 2020, 8:42 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7329

This looks like a NFC clang-format change now - either pre-commit or discard from the patch?

llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll
59

rebase - this was committed at rG90f721404ff8

Rebased, Fixed.

oh, I missed to fully remove from diff at 7269, Fixed

RKSimon added inline comments.Aug 11 2020, 2:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4331–4339

Maybe pull this NDEBUG change out into its own patch?

6487

This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?

dtemirbulatov added inline comments.Aug 11 2020, 3:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4331–4339

yes, it could go as NFC.

6487

No, there is another instance of UserCost at 6476, We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.

dtemirbulatov added inline comments.Aug 11 2020, 3:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6487

I mean the instance of usage.

@ABataev @anton-afanasyev Any more comments on this?

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll
2–3

Is it worth adding a second RUN with -slp-throttle=false ?

xbolva00 added inline comments.Aug 16 2020, 10:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
634

Please mention paper name:

“Throttling Automatic Vectorization: When Less Is More”

https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf

Slides are good, but paper is paper :)

Corrected paper citation, added -slp-throttle=false to llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll, rebased.

dtemirbulatov marked 2 inline comments as done.Aug 17 2020, 4:21 AM
ABataev added inline comments.Aug 21 2020, 6:51 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3392

What if the user does not have corresponding tree entry, i.e. it is initially scalar? What if the Scalar itself is going to remain scalar?

4196–4203

Just:

for (Value *V : Entry->Scalars) {
  auto *Inst = cast<Instruction>(V);
  if (llvm::any_of(Inst->users(), [this](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; }))
      return InsertCost + getEntryCost(Entry);
}

Also, check code formatting

dtemirbulatov added inline comments.Aug 21 2020, 7:38 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3392

What if the Scalar itself is going to remain scalar?

At this point, the decision to cut the tree was made and the Scalar could be only with intend to vectorize. Note about that 3295 we are ignoring any tree entries without State not equals TreeEntry::Vectorize.

What if the user does not have corresponding tree entry, i.e. it is initially scalar?

ah, yes. I have to check that !UserTE at 3305 and just continue if it is true.

dtemirbulatov added inline comments.Aug 21 2020, 3:23 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4196–4203

hmm, I think this is not a correct suggestion, there might be several tree entries with TreeEntry::ProposedToGather status and we have to calculate Insert cost for the whole tree here.

ABataev added inline comments.Aug 21 2020, 3:27 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4196–4203

Yeah, maybe. But you van do something similar, like

InsertCost += ...
break;

instead of setting flag and do a check after the loop.

Fixed remarks, rebased.

dtemirbulatov marked 3 inline comments as done.Aug 21 2020, 3:52 PM

Removed unnecessary check for "UserTE" at 3305.

Rebased. Ping.

Good enough for initial implementation?

Good enough for initial implementation?

yes, For me, it looks like ready.

Good enough for initial implementation?

yes, For me, it looks like ready.

Will be able to review it next week, after returning from vacation.

ABataev added inline comments.Tue, Sep 15, 9:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3393–3394

Could you compare it with a similar code in BoUpSLP::buildTree? Looks like you still missed some cases for user ignoring.

Matt added a subscriber: Matt.Wed, Sep 16, 8:53 AM
dtemirbulatov updated this revision to Diff 293457.EditedTue, Sep 22, 8:09 AM
dtemirbulatov marked an inline comment as done.

Rebased. Moved InternalTreeUses population out of (UseScalar != U || !InTreeUserNeedToExtract(Scalar, UserInst, TLI)) limitation at line 2661 in BoUpSLP::buildTree(), since we have to consider every interal user for partial vectorization, while calculating cost.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3393–3394

I think those ignoring cases are related to the fact that we are doing full vectorization at BoUpSLP::buildTree and we can avoid extracting for in-tree users. And here we have to extract to each user of once proposed to vectorized value.

dtemirbulatov added inline comments.Tue, Sep 22, 8:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3393–3394

And here we have to extract to each user of once proposed to vectorized value. I mean for the partial vectorization.