Page MenuHomePhabricator

[SLP] Add support for throttling.
Needs ReviewPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
vporpo added inline comments.May 20 2019, 2:35 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
3488

This is calculating every possible cut in the path to the leaf. However, some paths will probably share the nodes close to the root, so we are recomputing the same thing multiple times.

dtemirbulatov marked an inline comment as done.May 20 2019, 5:08 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
5521

I have seen so many regressions where partial vectorization prevents more profitable full vectorization and I think it is unacceptable to allow any and I think it is a good solution to those issues and It is not time-consuming we already done dependancy analysis at this point for all those seeds. I will double check.

dtemirbulatov marked an inline comment as done.May 20 2019, 5:11 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3488

oh, I missed that in this version completely, Thanks.

I think the first throttling patch should implement a very simple and fast algorithm for finding the cut:

  1. Add new fields to TreeEntry for Cost, ExtractCost and PredecessorsCost.
  2. During getTreeCost() set the TE.Cost and TE.ExtractCost (as you did in an earlier version of the patch if I am not mistaken)
  3. Do a single top-down traversal of the tree in reverse postorder and set the TE.PredecessorsCost equal to the cost of all the predecessor's costs until TE. While doing so, you can compare the cost of cutting just below TE by comparing the gather cost of TE versus the Cost + PredecessorsCost. This is very fast as you only need to visit each TreeEntry node once, so the complexity is linear to the size of the tree.

    For example, in slp-throttle.ll the bundle that needs to be scalarized [%add19, %sub22] has costs of Cost=1, ExtractCost = 0, PredecessorsCost=1 (because of bundle [%mul18, undef]). Cutting below the bundle has a cost of +1, while keeping it vectorized has a cost of +2 (Cost=1 + PredecessorsCost=1).

    This should be good-enough for most simple cases. We can improve it later, if needed, with follow-up patches. What do you think?

yes, Looks good.

Addressing last remarks, rebased.

ABataev added inline comments.Jun 2 2019, 10:30 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
126

Why 64? Did you try to estimate effect of this change on the compilation time?

654

What is criteria of the estimation?

1444

\returns

2091

Use emplace_back instead.

2698

Function can be made const

2704

auto-> to real type.

2732

Explicitly specify list of captures

2744

I think, the outer loop must iterate through elements of VecNodes. In this case, you won't need need this check.

2749

Instruction *->auto *

2760

emplace_back

2770

Can you use range loop here?

2771

Convert auto to real type.

2773

Use llvm::any_of

2818

Restore original code

2841

Restore original

3321

Restore original

3370

Restore original

3382–3589

const function?

3395

auto->to real type

3399

auto->to real type

3408

const function?

3411

const ExternalUser &

3431

const function

3449

const function?

3451

Why std::vector?

3470

Use preincrement

3490

Use llvm::find

3519

Iterate through Leaves

3555

Use llvm::find

3570

while (!Node);?

3574

Are you sure you need to do this on each iteration?

3616–3622

auto to real type

3637–3659

Why std::string and why declared here?

3680

One group of braces can be removed

4543

Use empty()

test/Transforms/SLPVectorizer/AArch64/horizontal.ll
58

Does not seem to me we should vectorize this. 2 add ops have different attributes: one has just nsw, another one nuw nsw.
Plus, 2 inserts+vec add+2 extracts does not look more beneficial than 2 scalar adds

dtemirbulatov marked an inline comment as done.Jun 3 2019, 6:00 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
126

oh, That just from my synthetic examples. I have noticed double on SLP time with 100+ tree nodes examples.

Rebased, fixed all remarks.

dtemirbulatov marked 11 inline comments as done and an inline comment as not done.

fixed a typo.

dtemirbulatov marked 23 inline comments as done.Jun 11 2019, 11:11 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3637–3659

We have to use it with ViewSLPTree, for full and partial vectorization.

dtemirbulatov marked 2 inline comments as done.Jun 11 2019, 11:18 AM
dtemirbulatov added inline comments.
test/Transforms/SLPVectorizer/AArch64/horizontal.ll
58

This is correct, the vector add should simply take the smallest common subset of flags. And the test is used with "-slp-threshold=-6" flag that it why we vectorized those two add operations.

vporpo added inline comments.Jun 11 2019, 11:30 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2775

Keeping track of the budget here is a bit strange, because the budget tracking is internal state of throttling. Could you move this check somewhere better?

2778

buildTree() is an expensive operation because of scheduling. We should not build a new tree if we can reuse the existing one.

3542

Collecting all paths is expensive, and I think we should not be doing this here just for removing some nodes from the tree. Also, you will end up visiting the same nodes over and over again (the ones close to the root). As I had described in a previous comment, I think we could get a good-enough tree cutting with a single top-down traversal of the tree. Did you find any examples where that would not work?

dtemirbulatov marked 3 inline comments as done.Jun 11 2019, 12:20 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2778

I think it is the dependency analysis is expensive, when certain seeds were just ignored, I could get the numbers. But here, we have already correct seed from the dependency analysis point of view. hmm, We could turn off the dependancy check for our collected seeds.

3542

Also, you will end up visiting the same nodes over and over again (the ones close to the root).

Hmm, It should not visit, multi-times the same node. I will check again.

Did you find any examples where that would not work?

yes, Many, it is definitely better with tree traverse.

dtemirbulatov marked an inline comment as done.Jun 11 2019, 12:25 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3542

Did you find any examples where that would not work?

just look at the tests files for the SLP, with the single index it was just "slp-throttle.ll", definitely it is more robust.

dtemirbulatov marked an inline comment as done.Jun 11 2019, 12:34 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3542

Also, you will end up visiting the same nodes over and over again (the ones close to the root).

sure, we are not visiting the same node over and over again.

vporpo added inline comments.Jun 11 2019, 1:38 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1518

Since these are operand edges for each TreeEntry, I think they should be members of the TreeEntry struct, similarly to the UserTreeIndices.

3455

You don't seem to be using the order of the entries in branches apart from checking is_contained() a couple of times. Isn't it better to add a member variable in TreeEntry for this instead?

Also I don't understand what is considered a branch. Is it any non-leaf node in the graph? Could you add some comment about it.

3460

Why is this a BFS ? Can't you visit the entries in the order they are pushed in VectorizableTree and collect the Leaves ?

3537

This is linear time search in Branches. I think you should find a better way of keeping this data. For example you could add some member variables in TreeEntry.isBranch or something similar to speed things up.

3542

Could you point to some of the examples you found? I think the fast top-down approach should work well on tree-like graphs.

3616–3622

Connecting the TreeEntries with edges should be part of building the graph, so I think a better place for this is somewhere in buildTree_rec().

dtemirbulatov marked an inline comment as done.Jun 12 2019, 8:22 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3460

yes, It should work, we have VectorizableTree order, we don't have to build our own to process cycles. Thanks.

dtemirbulatov marked an inline comment as done.Jun 12 2019, 11:13 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3542

What do you mean here with "top-down approach"? We are already here moving from leaves to the root of the graph and erasing nodes from leaves to the root in order to find satisfiable tree cost.

dtemirbulatov marked an inline comment as done.Jun 12 2019, 11:23 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3455

Also I don't understand what is considered a branch. Is it any non-leaf node in the graph? Could you add some comment about it.

I think it is a node that uses more than one non-gathering node, or in other words, it has more than one non-gathering children.

vporpo added inline comments.Jun 12 2019, 7:58 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
557

A general comment: Now that we have reliable pointers to TreeEntry, I think it is better to use TreeEntry * instead of an index to VectorizableTree.

3542

What I am referring to is the simple approach that I had described where you don't follow paths, but instead you visit nodes one by one top-down. My point is that if we can avoid building paths, we should do so.

Rebased and addressed previous remarks.

Fixed incorrect algorithm implementation in findSubTree() fundction.

Fixed issue with visiting the same node twice also added the assertation that we are not visiting any node twice and the assertion that all proposed nodes were visited.

Rebased, updated incorrect comment in treeTraversal() function.

dtemirbulatov marked 7 inline comments as done.Jul 6 2019, 7:00 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3542

It is similar to what was before but without building paths.

Ping, if approved I am going to commit with "slp-throttling" disabled by default, it is enabled just to show the impact on the test.

RKSimon added inline comments.Mon, Jul 29, 7:31 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
123

"Enable partial tree vectorization with throttling"

Fixed typo at line 123.

dtemirbulatov marked an inline comment as done.Tue, Jul 30, 5:46 AM