Page MenuHomePhabricator

[SLP] Add support for throttling.
Needs ReviewPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Fixed regression issue on 401.bzip2 and others. This was due to deciding to vectorize the whole tree(not partially) while revisiting seeds for partial vectorization. Fixed at line 3789.
Here are Spec 2006 INT results:
before:
400.perlbench 9770 249 39.3 *
401.bzip2 9650 311 31.1 *
403.gcc 8050 191 42.1 *
429.mcf 9120 217 42.0 *
445.gobmk 10490 330 31.8 *
456.hmmer 9330 267 35.0 *
458.sjeng 12100 394 30.7 *
462.libquantum 20720 241 85.9 *
464.h264ref 22130 334 66.2 *
471.omnetpp 6250 249 25.1 *
473.astar 7020 284 24.7 *
483.xalancbmk 6900 170 40.7 *
Est. SPECint(R)_base2006 38.5
Est. SPECint2006 Not Run

after:
400.perlbench 9770 249 39.2 *
401.bzip2 9650 311 31.1 *
403.gcc 8050 190 42.3 *
429.mcf 9120 218 41.9 *
445.gobmk 10490 329 31.9 *
456.hmmer 9330 266 35.0 *
458.sjeng 12100 391 31.0 *
462.libquantum 20720 241 86.1 *
464.h264ref 22130 329 67.3 *
471.omnetpp 6250 249 25.1 *
473.astar 7020 284 24.7 *
483.xalancbmk 6900 169 40.7 *
Est. SPECint(R)_base2006 38.6
Est. SPECint2006 Not Run

Note the 429.mcf runtime is different by accident, the actual checksum of binary is the same. Sorry, @lebedev.ri I will use test-suite/utils/compare.py next time.

dtemirbulatov marked 2 inline comments as done.Aug 25 2019, 7:15 PM

This was due to deciding to vectorize the whole tree(not partially) while revisiting seeds for partial vectorization. Fixed at line 3789.

oh, the decision for the whole vectorization was due to other changes in the BB.

xbolva00 added a comment.EditedAug 26 2019, 12:23 AM

h264ref - nice improvement

Also, would be good to see some general description of this algorithm, without references to youtube.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2264

Just emplace_back(Scalar, U, FOundLane);.

2928–2937

Use llvm::find_if instead of this block of code.

2986

Just ArrayRef<Value *>?

3006

Some kind of logging here and in case of failed vectorization/too high cost?

3015

emplace_back(Ops.begin(), Ops.end())?

3638

auto *

3721–3722

Use for (TreeEntry *Next : llvm::reverse(C->UseTreeIndices)) if possible.

3746–3752

Use TE->IsBranch = llvm::count_if(TE->UseTreeIndices, ...) > 1;

3756

Better to return Optional<int> and return llvm::None instead of INT_MAX

3768–3770

This might lead to the infinite loop since Path is not modified

3784–3792

Use llvm::find_if(llvm::reverse(Node->UseTreeIndices), ...)

3827–3828

Use llvm::copy

3832–3839
int ExtractCost = getExtractCost();
int SpillCost = getSpillCost();
int Cost = CostSum + ExtractCost;
3839–3840
  1. Use local Str in all branches
  2. Switch to SmallString.
3845

Use range-based loop

4590

Not done. Please, review all previous comments.

Fixed remarks, rebased.

dtemirbulatov marked 14 inline comments as done.Aug 31 2019, 7:23 PM
dtemirbulatov marked 2 inline comments as done.Aug 31 2019, 7:30 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3015

Hmm, it does not work out of ArrayRef without specifying the constructor.

3827–3828

I could not use llvm::copy here since Scalars is SmallVector and ScalarsToVec declared as SmallPtrSet type.

Looks good.

Others?

lib/Transforms/Vectorize/SLPVectorizer.cpp
687

Please add link to a paper

“More details: ...”

Added link to the article.

LGTM - @ABataev any more comments?

LGTM - @ABataev any more comments?

It is holiday, will review it tomorrow.

Will it work if the same treeentry is used several times in the tree? For example, in diamond merge? If the treeentry, used several times is not profitable?

lib/Transforms/Vectorize/SLPVectorizer.cpp
2949–2950

Use llvm::remove_if

3015

And just emplace_back(Ops)?

3819–3823

Use llvm::any_of(llvm::drop_begin(VectorizableTree, TE.Idx + 1), ...)

3827–3828

Then just ScalarsToVec.insert(TE.Scalars.begin(), TE.Scalars.end());?

Will it work if the same treeentry is used several times in the tree? For example, in diamond merge? If the treeentry, used several times is not profitable?

I might be wrong, but I don't see any difference in terms of cost calculation for reused operations in partial or full vectorization. I mean, for example, if some group of operations has been vectorized and we plan to use those operations then we still need to calculate those operations costs, extract cost and etc. One another thing we don't want to calculate insert costs for the cost of gathering canceled elements(see getInsertCost()) for reused operations since those operations have been already vectorized anyway, I will fix that.

dtemirbulatov updated this revision to Diff 218869.EditedSep 5 2019, 2:34 AM

Fixed all remarks. Also, add a solution for the same treeentry is used several times changed cost calculation at line 3615. Thanks for finding it.

dtemirbulatov marked 4 inline comments as done.Sep 5 2019, 2:36 AM
ABataev added inline comments.Sep 11 2019, 8:55 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2994

What about this?

2997

What about this?

3770–3775

Maybe, it is better to make the second if the first one and vice versa? Just in case, if we found a path with good cost even if maximum number of recalculations is reached?

3785–3789

No need for braces here

3809

Could you use range-based loop here?

5701

What about this?

vporpo added inline comments.Sep 13 2019, 11:38 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
646

Why are we collecting the seeds specifically for partial vectorization? Is this really needed? Why don't you just call tryPartialVectorization(Seeds) within vectorizeStorChain() etc. ?

1287

Do we need a separate variable for this? Can't we replace it with a small isBranch() function that checks the operands?

1303

Hmm is this needed? Isn't this available in EdgeInfo of UserTreeIndices?

3647

Do we need separate getExtractCost() , getSpillCost() , getInsertCost() functions? Can't we update the getTreeCost() function instead to be aware of tree cutting?

3735

I think we could avoid this traversal completely by marking the nodes when the tree is being built in buildTree_rec(). For example, the TE->ProposedToGather could be set in newTreeEntry().

dtemirbulatov marked an inline comment as done.Sep 14 2019, 1:15 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
646

Ok, Imagine that we could do partial vectorization with vector size 4 for 30% of the tree, but for the same tree, we could have full vectorization with vector size 2. I think that it would be serious regression if we could do it with just 30%. Or for example, for the same tree, we could do reduction later for the whole tree.

vporpo added inline comments.Sep 14 2019, 11:19 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
646

Yes, but this problem is not specific to throttling. As it is now, SLP will greedily accept 4-wide vectorization if profitable, without comparing it against 2x 2-wide. I think this problem should be addressed separately.

dtemirbulatov marked an inline comment as done.Sep 19 2019, 12:44 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
646

Let's change that behavior separately or this change grows even further. BTW, I measured the SLP run-time change on SPEC2k6 compilation and it is about ~10% on average.

dtemirbulatov marked an inline comment as done.Sep 19 2019, 12:45 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
1303

yes, It is the opposite of UserTreeIndices and it requires in several places.

dtemirbulatov marked an inline comment as done.Sep 19 2019, 3:30 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
646

and It is not clear to me how to compare the benefit of one vectorization against another, the same example we achieved 50% of tree 4 wide vectorization with still profitable cost, let's say, -1. But, for the same tree, we have 90% of vectorization with vector size 2 with for example the same cost. We could not say let's pick one with the highest score, we are interesting to vectorize while it is profitable.

dtemirbulatov marked an inline comment as done.Sep 20 2019, 2:39 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
646

Maybe we could have a subjective score based upon a vectorized tree-hight and vector widening.

Rebased, addressed remarks.

dtemirbulatov marked 6 inline comments as done.Sep 30 2019, 1:54 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3647

Hmm, I think that we do a tree reduction with throttling in cutPath(), not in getTreeCost() and we need those functions outside.

dtemirbulatov marked 3 inline comments as done.Sep 30 2019, 2:00 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2994

Discussed at line #645

2997

Discussed at line #645.

5701

Discussed at line #645

Missed one change, removed braces at 3762.

Moved finding subtree logic out of getTreeCost() to tryPartialVectorization().

Rebased, Ping.

dtemirbulatov marked 2 inline comments as done.Oct 17 2019, 5:03 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
5701

Done

This should be ready for trunk, no?

ABataev added inline comments.Oct 17 2019, 8:32 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
646

I agree with Vasileios here, we should follow the same approach as general SLP vectorization. Otherwise, the compile time increases significantly.

1303

TinyPtrVector<TreeEntry *>?

1416

const member function.

1473–1474

Maybe, it is better to merge these 2 flags into one and use enumeric instead of booleans? Seems to me, these flags are mutually exclusive.

2920

auto->real type

2927–2933

You can do this analysis inside the first loop, no?

2974

Remove call of ExternalUser constructor, just ExternalUses.emplace_back(Scalar, U, Lane);

3004–3005

I think it must be an assert, that Cost.first >= -SLPCostThreshold here.

3006

auto->real type

3018–3024

All this code must be compiled only in the debug mode.

3027

Merge this if with the previous one.

3696

Missed the check for CostsRecalculations >= MaxCostsRecalculations here

3710

Maybe transform the function into return bool and return the cost in Cost parameter, passed by reference?

3718

is_contained is used several times and not effective. Maybe you can use a set here instead of SmallVector?

Re-ping

Working, I will update shortly.