Page MenuHomePhabricator

[SLP] Add support for throttling.
Needs ReviewPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Rebased. Simplified solution by removing leaves structure and any notion of branches. Left "slp-throttling" flag true by default just to show the impact on SLP tests.

Found typo in the previous upload.

vporpo added a comment.May 7 2019, 7:50 PM

Please try not to introduce a new Node structure or a new graph. You can use TreeEntry instead and you can add any necessary fields to it.

lib/Transforms/Vectorize/SLPVectorizer.cpp
548

Typo: throttling

551

Typo: partial

1440

Why not use the TreeEntry directly? It is already a graph node, with user edges and it already supports GraphTraits for traversals etc.

1449

What is the color used for? Please describe.

3624

Please use a more descriptive name here, as this function does more things than just a BFS. It is also connecting the nodes?
Also add a brief description comment to all functions.

Looks like fixed all remark, rebased.

dtemirbulatov marked 5 inline comments as done.May 16 2019, 12:16 PM

Again, the slp-throttling is on to just show the impact on SLP tests.

Added comments to cutBranch() and extendSubTree() functions. Removed dupicate variable "I" in extendSubTree() function.

vporpo added a comment.EditedMon, May 20, 2:35 PM

I think the first throttling patch should implement a very simple and fast algorithm for finding the cut:

  1. Add new fields to TreeEntry for Cost, ExtractCost and PredecessorsCost.
  2. During getTreeCost() set the TE.Cost and TE.ExtractCost (as you did in an earlier version of the patch if I am not mistaken)
  3. Do a single top-down traversal of the tree in reverse postorder and set the TE.PredecessorsCost equal to the cost of all the predecessor's costs until TE. While doing so, you can compare the cost of cutting just below TE by comparing the gather cost of TE versus the Cost + PredecessorsCost. This is very fast as you only need to visit each TreeEntry node once, so the complexity is linear to the size of the tree.

For example, in slp-throttle.ll the bundle that needs to be scalarized [%add19, %sub22] has costs of Cost=1, ExtractCost = 0, PredecessorsCost=1 (because of bundle [%mul18, undef]). Cutting below the bundle has a cost of +1, while keeping it vectorized has a cost of +2 (Cost=1 + PredecessorsCost=1).

This should be good-enough for most simple cases. We can improve it later, if needed, with follow-up patches.
What do you think?

lib/Transforms/Vectorize/SLPVectorizer.cpp
3476

This is calculating every possible cut in the path to the leaf. However, some paths will probably share the nodes close to the root, so we are recomputing the same thing multiple times.

5486

I don't think throttling should be visible at this level. It should be called after the call to getTreeCost().

dtemirbulatov marked an inline comment as done.Mon, May 20, 5:08 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
5486

I have seen so many regressions where partial vectorization prevents more profitable full vectorization and I think it is unacceptable to allow any and I think it is a good solution to those issues and It is not time-consuming we already done dependancy analysis at this point for all those seeds. I will double check.

dtemirbulatov marked an inline comment as done.Mon, May 20, 5:11 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3476

oh, I missed that in this version completely, Thanks.

I think the first throttling patch should implement a very simple and fast algorithm for finding the cut:

  1. Add new fields to TreeEntry for Cost, ExtractCost and PredecessorsCost.
  2. During getTreeCost() set the TE.Cost and TE.ExtractCost (as you did in an earlier version of the patch if I am not mistaken)
  3. Do a single top-down traversal of the tree in reverse postorder and set the TE.PredecessorsCost equal to the cost of all the predecessor's costs until TE. While doing so, you can compare the cost of cutting just below TE by comparing the gather cost of TE versus the Cost + PredecessorsCost. This is very fast as you only need to visit each TreeEntry node once, so the complexity is linear to the size of the tree.

    For example, in slp-throttle.ll the bundle that needs to be scalarized [%add19, %sub22] has costs of Cost=1, ExtractCost = 0, PredecessorsCost=1 (because of bundle [%mul18, undef]). Cutting below the bundle has a cost of +1, while keeping it vectorized has a cost of +2 (Cost=1 + PredecessorsCost=1).

    This should be good-enough for most simple cases. We can improve it later, if needed, with follow-up patches. What do you think?

yes, Looks good.

Addressing last remarks, rebased.

ABataev added inline comments.Sun, Jun 2, 10:30 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
126

Why 64? Did you try to estimate effect of this change on the compilation time?

653

What is criteria of the estimation?

1440

\returns

2090

Use emplace_back instead.

2696

Function can be made const

2702

auto-> to real type.

2730

Explicitly specify list of captures

2742

I think, the outer loop must iterate through elements of VecNodes. In this case, you won't need need this check.

2747

Instruction *->auto *

2758

emplace_back

2768

Can you use range loop here?

2769

Convert auto to real type.

2771

Use llvm::any_of

2812

Restore original code

2835

Restore original

3313

Restore original

3355

Restore original

3370–3573

const function?

3383

auto->to real type

3387

auto->to real type

3396

const function?

3399

const ExternalUser &

3419

const function

3437

const function?

3439

Why std::vector?

3458

Use preincrement

3478

Use llvm::find

3507

Iterate through Leaves

3543

Use llvm::find

3558

while (!Node);?

3562

Are you sure you need to do this on each iteration?

3600–3609

auto to real type

3624–3648

Why std::string and why declared here?

3669

One group of braces can be removed

4507

Use empty()

test/Transforms/SLPVectorizer/AArch64/horizontal.ll
58

Does not seem to me we should vectorize this. 2 add ops have different attributes: one has just nsw, another one nuw nsw.
Plus, 2 inserts+vec add+2 extracts does not look more beneficial than 2 scalar adds

dtemirbulatov marked an inline comment as done.Mon, Jun 3, 6:00 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
126

oh, That just from my synthetic examples. I have noticed double on SLP time with 100+ tree nodes examples.

Rebased, fixed all remarks.

dtemirbulatov marked 11 inline comments as done and an inline comment as not done.

fixed a typo.

dtemirbulatov marked 23 inline comments as done.Tue, Jun 11, 11:11 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3624–3648

We have to use it with ViewSLPTree, for full and partial vectorization.

dtemirbulatov marked 2 inline comments as done.Tue, Jun 11, 11:18 AM
dtemirbulatov added inline comments.
test/Transforms/SLPVectorizer/AArch64/horizontal.ll
58

This is correct, the vector add should simply take the smallest common subset of flags. And the test is used with "-slp-threshold=-6" flag that it why we vectorized those two add operations.

vporpo added inline comments.Tue, Jun 11, 11:30 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2773

Keeping track of the budget here is a bit strange, because the budget tracking is internal state of throttling. Could you move this check somewhere better?

2776

buildTree() is an expensive operation because of scheduling. We should not build a new tree if we can reuse the existing one.

3530

Collecting all paths is expensive, and I think we should not be doing this here just for removing some nodes from the tree. Also, you will end up visiting the same nodes over and over again (the ones close to the root). As I had described in a previous comment, I think we could get a good-enough tree cutting with a single top-down traversal of the tree. Did you find any examples where that would not work?

dtemirbulatov marked 3 inline comments as done.Tue, Jun 11, 12:20 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
2776

I think it is the dependency analysis is expensive, when certain seeds were just ignored, I could get the numbers. But here, we have already correct seed from the dependency analysis point of view. hmm, We could turn off the dependancy check for our collected seeds.

3530

Also, you will end up visiting the same nodes over and over again (the ones close to the root).

Hmm, It should not visit, multi-times the same node. I will check again.

Did you find any examples where that would not work?

yes, Many, it is definitely better with tree traverse.

dtemirbulatov marked an inline comment as done.Tue, Jun 11, 12:25 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3530

Did you find any examples where that would not work?

just look at the tests files for the SLP, with the single index it was just "slp-throttle.ll", definitely it is more robust.

dtemirbulatov marked an inline comment as done.Tue, Jun 11, 12:34 PM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3530

Also, you will end up visiting the same nodes over and over again (the ones close to the root).

sure, we are not visiting the same node over and over again.

vporpo added inline comments.Tue, Jun 11, 1:38 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
1514

Since these are operand edges for each TreeEntry, I think they should be members of the TreeEntry struct, similarly to the UserTreeIndices.

3443

You don't seem to be using the order of the entries in branches apart from checking is_contained() a couple of times. Isn't it better to add a member variable in TreeEntry for this instead?

Also I don't understand what is considered a branch. Is it any non-leaf node in the graph? Could you add some comment about it.

3448

Why is this a BFS ? Can't you visit the entries in the order they are pushed in VectorizableTree and collect the Leaves ?

3525

This is linear time search in Branches. I think you should find a better way of keeping this data. For example you could add some member variables in TreeEntry.isBranch or something similar to speed things up.

3530

Could you point to some of the examples you found? I think the fast top-down approach should work well on tree-like graphs.

3600

Connecting the TreeEntries with edges should be part of building the graph, so I think a better place for this is somewhere in buildTree_rec().

dtemirbulatov marked an inline comment as done.Wed, Jun 12, 8:22 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3448

yes, It should work, we have VectorizableTree order, we don't have to build our own to process cycles. Thanks.

dtemirbulatov marked an inline comment as done.Wed, Jun 12, 11:13 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3530

What do you mean here with "top-down approach"? We are already here moving from leaves to the root of the graph and erasing nodes from leaves to the root in order to find satisfiable tree cost.

dtemirbulatov marked an inline comment as done.Wed, Jun 12, 11:23 AM
dtemirbulatov added inline comments.
lib/Transforms/Vectorize/SLPVectorizer.cpp
3443

Also I don't understand what is considered a branch. Is it any non-leaf node in the graph? Could you add some comment about it.

I think it is a node that uses more than one non-gathering node, or in other words, it has more than one non-gathering children.

vporpo added inline comments.Wed, Jun 12, 7:58 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
557

A general comment: Now that we have reliable pointers to TreeEntry, I think it is better to use TreeEntry * instead of an index to VectorizableTree.

3530

What I am referring to is the simple approach that I had described where you don't follow paths, but instead you visit nodes one by one top-down. My point is that if we can avoid building paths, we should do so.