Page MenuHomePhabricator

[SLP] Add support for throttling.
Needs ReviewPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Missed one change, removed braces at 3762.

Moved finding subtree logic out of getTreeCost() to tryPartialVectorization().

Rebased, Ping.

dtemirbulatov marked 2 inline comments as done.Oct 17 2019, 5:03 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
5523 ↗(On Diff #200259)

Done

This should be ready for trunk, no?

ABataev added inline comments.Oct 17 2019, 8:32 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1303 ↗(On Diff #225122)

TinyPtrVector<TreeEntry *>?

1416 ↗(On Diff #225122)

const member function.

1473–1474 ↗(On Diff #225122)

Maybe, it is better to merge these 2 flags into one and use enumeric instead of booleans? Seems to me, these flags are mutually exclusive.

2920 ↗(On Diff #225122)

auto->real type

2927–2933 ↗(On Diff #225122)

You can do this analysis inside the first loop, no?

2974 ↗(On Diff #225122)

Remove call of ExternalUser constructor, just ExternalUses.emplace_back(Scalar, U, Lane);

3004–3005 ↗(On Diff #225122)

I think it must be an assert, that Cost.first >= -SLPCostThreshold here.

3006 ↗(On Diff #225122)

auto->real type

3018–3024 ↗(On Diff #225122)

All this code must be compiled only in the debug mode.

3027 ↗(On Diff #225122)

Merge this if with the previous one.

3696 ↗(On Diff #225122)

Missed the check for CostsRecalculations >= MaxCostsRecalculations here

3710 ↗(On Diff #225122)

Maybe transform the function into return bool and return the cost in Cost parameter, passed by reference?

3718 ↗(On Diff #225122)

is_contained is used several times and not effective. Maybe you can use a set here instead of SmallVector?

645 ↗(On Diff #218869)

I agree with Vasileios here, we should follow the same approach as general SLP vectorization. Otherwise, the compile time increases significantly.

Re-ping

Working, I will update shortly.

Addressed remarks.

Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2019, 11:30 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
ABataev added inline comments.Dec 19 2019, 8:00 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
645 ↗(On Diff #218869)

Also, aggree with Vasileios here. Why do we need to record seeds and then rebuild the same tree for the second time before trying to apply partial vectorization? Why we can't reuse previously built tree and try cut the tree nodes one-by-one or something like this rather than repeat all the previous steps?

dtemirbulatov marked an inline comment as done.Dec 19 2019, 9:06 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
645 ↗(On Diff #218869)

Ok, I don't see any regression now without recordSeeds() and spec 2k int also is the same numbers. I will update shortly without recordSeeds(). Thanks.

Removed record seeds functionality.

ABataev added inline comments.Dec 19 2019, 1:39 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
670

DO you really need to return Optional here? Maybe, just return -SLPCostThreshold if not successful?

1513–1515

Could you split the patch and commit this part of the change (I mean, using of the enum instead of bool) as a separate NFC patch?

3281–3282

Can this occur at all?

3954–4074

The code in this function is very similar to the code in the getTreeCost(). Can it be reused somehow to avoid duplication?

3967

VectorizableTree.front() instead of VectorizableTree[0] just like in the first statement.

4067

is_contained() is O(n). Maybe use a set instead of it in the loop?

4156–4165

All this code must be active only when the debug mode on?

Rebased after committing the first part as NFC, it looks like resolved all remarks.

Rebased after committing the first part as NFC, it looks like resolved all remarks.

No context in the update, upload full diff, please.

Try to restore the context.

Try to restore the context.

Hmm, I see that the patch is updated. Could you try to update to the latest version?

dtemirbulatov marked an inline comment as done.Dec 23 2019, 11:29 AM

oh, sorry, it l

Try to restore the context.

Hmm, I see that the patch is updated. Could you try to update to the latest version?

oh, yes, sorry.

Update once again.

ABataev added inline comments.Dec 24 2019, 10:04 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
617 ↗(On Diff #235169)

Better to split this function, one to have the raw cost, and the original one, that uses the first for the raw cost, that returns the full cost.

1527 ↗(On Diff #235169)

Give a better name, this is not indices.

1639 ↗(On Diff #235169)
  1. It does not find anything, it checks for something. A better description is required.
  2. Use /// style of comment here.
1642 ↗(On Diff #235169)

Check for a cyclic dependency? What we have something like A->B->A?

4004 ↗(On Diff #235169)

Better to return bool here and insert last found entry into the path.

4024 ↗(On Diff #235169)

You don't insert last found non-gathering node to the path. Is this correct?

4027 ↗(On Diff #235169)

Did you think about recalculating the cost of the reduced tree instead of using a scheme with cost subtraction? This looks to be a lot more natural way. I mean, in the same loop, where we try to vectorize the original tree, try to reduce the tree (+, maybe, check that the cost for the removed nodes is going to be negative) and then just recalculate the cost of the reduced tree? The current scheme looks overcomplicated. This is just a suggestion.

6091–6096 ↗(On Diff #235169)

No need for else here, just if (SLPThrottling && ...

6348–6349 ↗(On Diff #235169)

else if

6352 ↗(On Diff #235169)

Missed moving to the next bundle, no?

dtemirbulatov marked an inline comment as done.Dec 29 2019, 4:05 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1642 ↗(On Diff #235169)

We don't have to check for cycles here, it is done at 4016. We don't have any paths here, it is done dynamically later. We mark to mark only branch nodes and no problem if the node has cycle like A->B->A.

dtemirbulatov marked an inline comment as done.Jan 12 2020, 11:42 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
4024 ↗(On Diff #235169)

I don't think this is correct, Any non-gathering entry on the way get inserted at 4019, including the last one.

Resolved previous remarks, except that we are still using "cost subtraction" and we don't need now to call getTreeCost() all the time we can now subtract all cost components, except spill cost, in the main path traversing loop. Fixed issue with incorrect cost calculation in getInsertCost().

dtemirbulatov marked an inline comment as done.Feb 25 2020, 4:00 AM
dtemirbulatov added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr35497.ll
71 ↗(On Diff #246090)

hmm, I missed that but it looks like one extra "extractelement" appeared, which is not correct.

Rebased, fix regression in pr35497.ll with extra extract instruction, it was caused by partial vectorization preventing the reduction by changing BB too early. Added functionality of doing throttling as the last one but without rebuilding tree again. It is done by saving tree states for profitable trees, this could allow us to estimate which vectorization to choose based on the cost in the future.

dtemirbulatov marked an inline comment as done.Tue, Mar 10, 3:57 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5077

hmm. I would like to use remove_if here, but we have to capture a unique_ptr here.

Replaced auto with "SmallVector<std::unique_ptr<TreeState>, 8>::iterator " at 5207

Replaced to remove_if at 5207

Did you think about recalculating the cost of the reduced tree instead of using a scheme with cost subtraction? This looks to be a lot more natural way. I mean, in the same loop, where we try to vectorize the original tree, try to reduce the tree (+, maybe, check that the cost for the removed nodes is going to be negative) and then just recalculate the cost of the reduced tree? The current scheme looks >overcomplicated. This is just a suggestion.

I am doing now the whole tree cost substruction in the main cutPath() loop, except spill cost and we usually don't need it unless CallInst occurs. And I am thinking it is better to follow the tree structure.

dtemirbulatov marked an inline comment as done.Wed, Mar 11, 10:58 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4124

eh, I found typo here, it should be Inst->Users, the same for 4101 line.

wwei added a subscriber: wwei.Mon, Mar 16, 10:10 AM

Do you have any plan to support throttling for horizontal reduction?

Do you have any plan to support throttling for horizontal reduction?

yes, I noticed some regression with my first implementation, but certainly, it worth adding after basic functionality.

Rebased, Removed tree traversal approach and implemented suggestion by Alexey Bataev : "Did you think about recalculating the cost of the reduced tree instead of using a scheme with cost subtraction?", This looks like more efficient than tree traversal.

wwei added inline comments.Wed, Mar 18, 11:09 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4121

I think the input for getGatherCost should be Entry->Scalars instead of V. The code in getGatherCost likes:

int BoUpSLP::getGatherCost(ArrayRef<Value *> VL) const {
  // Find the type of the operands in VL.
  Type *ScalarTy = VL[0]->getType();
  ... ...
  VectorType *VecTy = VectorType::get(ScalarTy, VL.size());
  ... ...
  return getGatherCost(VecTy, ShuffledElements);

So, if input is V, VectTy will always be equal to <1 x iN>, I think it's the same with iN type, and getGatherCost(VecTy, ShuffledElements) will return incorrect InsertCost value.

dtemirbulatov marked an inline comment as done.Thu, Mar 19, 5:54 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4121

Yes, correct. Thanks, I will fix that.

This comment was removed by dtemirbulatov.
This comment was removed by dtemirbulatov.

Changed getInsetCost() to correct insert cost calculation by allowing getEntryCost to handle entries that proposed to gather similarly as NeedToGather entries.

Rebased, added the following changes:

  • removed TreeState from the proposed change since after fixing getInsertCost I could not observe any regressions.
  • reduced default number of cost recalculations to 32 after noticing the new approach to find a subtree works efficiently.
  • added support for reductions.

So is this ready for review? If you are working on this and not ready yet, consider [WIP] tag in title.

dtemirbulatov retitled this revision from [SLP] Add support for throttling. to [SLP] Add support for throttling. [WIP].Sat, Apr 4, 12:20 AM

Rebased, removed isGoodSubTreeToVectorize() function because there is just one user to this function. Now, I think the change is ready to review, Please review this revision.

dtemirbulatov retitled this revision from [SLP] Add support for throttling. [WIP] to [SLP] Add support for throttling..Sat, Apr 4, 11:54 PM