Here is support for SLP throttling, when cost is high to vectorize the whole tree we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s
Please try not to introduce a new Node structure or a new graph. You can use TreeEntry instead and you can add any necessary fields to it.
Why not use the TreeEntry directly? It is already a graph node, with user edges and it already supports GraphTraits for traversals etc.
What is the color used for? Please describe.
Please use a more descriptive name here, as this function does more things than just a BFS. It is also connecting the nodes?
I think the first throttling patch should implement a very simple and fast algorithm for finding the cut:
- Add new fields to TreeEntry for Cost, ExtractCost and PredecessorsCost.
- During getTreeCost() set the TE.Cost and TE.ExtractCost (as you did in an earlier version of the patch if I am not mistaken)
- Do a single top-down traversal of the tree in reverse postorder and set the TE.PredecessorsCost equal to the cost of all the predecessor's costs until TE. While doing so, you can compare the cost of cutting just below TE by comparing the gather cost of TE versus the Cost + PredecessorsCost. This is very fast as you only need to visit each TreeEntry node once, so the complexity is linear to the size of the tree.
For example, in slp-throttle.ll the bundle that needs to be scalarized [%add19, %sub22] has costs of Cost=1, ExtractCost = 0, PredecessorsCost=1 (because of bundle [%mul18, undef]). Cutting below the bundle has a cost of +1, while keeping it vectorized has a cost of +2 (Cost=1 + PredecessorsCost=1).
This should be good-enough for most simple cases. We can improve it later, if needed, with follow-up patches.
What do you think?
This is calculating every possible cut in the path to the leaf. However, some paths will probably share the nodes close to the root, so we are recomputing the same thing multiple times.
I don't think throttling should be visible at this level. It should be called after the call to getTreeCost().
I have seen so many regressions where partial vectorization prevents more profitable full vectorization and I think it is unacceptable to allow any and I think it is a good solution to those issues and It is not time-consuming we already done dependancy analysis at this point for all those seeds. I will double check.
Why 64? Did you try to estimate effect of this change on the compilation time?
What is criteria of the estimation?
Use emplace_back instead.
Function can be made const
auto-> to real type.
Explicitly specify list of captures
I think, the outer loop must iterate through elements of VecNodes. In this case, you won't need need this check.
Instruction *->auto *
Can you use range loop here?
Convert auto to real type.
Restore original code
auto->to real type
auto->to real type
const ExternalUser &
Iterate through Leaves
Are you sure you need to do this on each iteration?
auto to real type
Why std::string and why declared here?
One group of braces can be removed
Does not seem to me we should vectorize this. 2 add ops have different attributes: one has just nsw, another one nuw nsw.
This is correct, the vector add should simply take the smallest common subset of flags. And the test is used with "-slp-threshold=-6" flag that it why we vectorized those two add operations.
Keeping track of the budget here is a bit strange, because the budget tracking is internal state of throttling. Could you move this check somewhere better?
buildTree() is an expensive operation because of scheduling. We should not build a new tree if we can reuse the existing one.
Collecting all paths is expensive, and I think we should not be doing this here just for removing some nodes from the tree. Also, you will end up visiting the same nodes over and over again (the ones close to the root). As I had described in a previous comment, I think we could get a good-enough tree cutting with a single top-down traversal of the tree. Did you find any examples where that would not work?
I think it is the dependency analysis is expensive, when certain seeds were just ignored, I could get the numbers. But here, we have already correct seed from the dependency analysis point of view. hmm, We could turn off the dependancy check for our collected seeds.
Hmm, It should not visit, multi-times the same node. I will check again.
yes, Many, it is definitely better with tree traverse.
just look at the tests files for the SLP, with the single index it was just "slp-throttle.ll", definitely it is more robust.
sure, we are not visiting the same node over and over again.
Since these are operand edges for each TreeEntry, I think they should be members of the TreeEntry struct, similarly to the UserTreeIndices.
You don't seem to be using the order of the entries in branches apart from checking is_contained() a couple of times. Isn't it better to add a member variable in TreeEntry for this instead?
Also I don't understand what is considered a branch. Is it any non-leaf node in the graph? Could you add some comment about it.
Why is this a BFS ? Can't you visit the entries in the order they are pushed in VectorizableTree and collect the Leaves ?
This is linear time search in Branches. I think you should find a better way of keeping this data. For example you could add some member variables in TreeEntry.isBranch or something similar to speed things up.
Could you point to some of the examples you found? I think the fast top-down approach should work well on tree-like graphs.
Connecting the TreeEntries with edges should be part of building the graph, so I think a better place for this is somewhere in buildTree_rec().
What do you mean here with "top-down approach"? We are already here moving from leaves to the root of the graph and erasing nodes from leaves to the root in order to find satisfiable tree cost.
I think it is a node that uses more than one non-gathering node, or in other words, it has more than one non-gathering children.
A general comment: Now that we have reliable pointers to TreeEntry, I think it is better to use TreeEntry * instead of an index to VectorizableTree.
What I am referring to is the simple approach that I had described where you don't follow paths, but instead you visit nodes one by one top-down. My point is that if we can avoid building paths, we should do so.