Currently SLP vectorizer tries to vectorize a binary operation and dies immediately after unsuccessful the first unsuccessfull attempt. Patch tries to improve the situation, trying to vectorize all binary operations of all children nodes in the binop tree.
Details
Diff Detail
Event Timeline
Could you give some more comments about the suggested algorithm? Is it for reductions only?
Yes, it is for reductions. Previously, when the non-vectorizable node was found during horizontal reduction analysis, the whole vectorization process stopped without any further attempts for vectorization. Now I consider this node as a possible start for another possibly vectorizable instruction.
Also, patch does not require that the initial instruction is a binary operation. It scans all operands of the root instruction, trying to find all binops in the tree, which can be vectorized.
It allows to vectorize the туче code:
int foo(int *a) {
  int res = 0;
  for (int i = 0; i < 4; ++i)
    res += a[i];
  return res * 42;
}Currently SLP vectorizer is unable to vectorize it, but this patch allows to do it.
Have you checked the compile-time impact of this change? I wonder if you need some additional cutoff because it looks like you might be increasing the worst-case complexity of the algorithm here.
| lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 4459 | The comment here should explain that the child level corresponds to the operand index of the instruction. | |
| 4461 | /// Is this the instruction that should be vectorized, or are we now processing children (i.e. operands of this instruction) for potential vectorization? | |
| 4499 | The comment here is out of date. | |
| 4547 | dyn_cast_or_null -> dyn_cast (here and below). BI->getOperand(0) should not return null. | |
Yes, I checked it using test suite + several programs and did not find any significant changes in compile time. Actually, I thought already about some additional cutoff and I think it worth adding just in case.
Hi Alexey,
I've just come across this commit. I don't think it is up to LLVM standards in terms of understandably. Please improve; I've provided some questions inline.
Thanks,
Adam
| llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 4457 ↗ | (On Diff #79513) | Tracks for what? | 
| 4499–4505 ↗ | (On Diff #79513) | The name of the function and the comment mismatch. What is this function supposed to do? | 
| 4517–4573 ↗ | (On Diff #79513) | This needs a description of the algorithm. | 
| llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 4457 ↗ | (On Diff #79513) | Traks if the instruction is deleted (replaced by undef value) /replaced by the new instruction (the vector version, as the result of the whole vectorization process) + tracks the processing of the instruction operands | 
| 4499–4505 ↗ | (On Diff #79513) | Yes, probably. This function checks if it is possible to vectorize the tree + performs the vectorization of horizontal reduction or, if the instruction is not the top instruction of the horizontal reduction and this is a binary operation, vectorizes the operands of this binary instruction. | 
| 4517–4573 ↗ | (On Diff #79513) | The algorithm is the same as before, just exit criteria become a bit weaker. | 
Please write a new patch with all these improvements and add me as a reviewer. Thanks.
| llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
|---|---|---|
| 4457 ↗ | (On Diff #79513) | Improve the comment then, please. | 
| 4460 ↗ | (On Diff #79513) | OperandIndexAnalyzed? | 
| 4464 ↗ | (On Diff #79513) | IsAnalyzingOperands? Or just merge these into a single state variable, Optional<unsigned> which if None means we're processing the instruction itself. | 
| 4499–4505 ↗ | (On Diff #79513) | Then improve the comment please. | 
| 4517–4573 ↗ | (On Diff #79513) | There is a non-trivial loop that wasn't there before!? Please explain what it does and what state the various new data structures represent. | 
Just FYI: this patch was improved at here: https://reviews.llvm.org/D33320
Ongoing discussion about compile time impact is on that review thread. We do see huge compile time impact (internally) because of the change in this function (the processing has increased from a single node to a max of 2 ^ 12 nodes).
The comment here should explain that the child level corresponds to the operand index of the instruction.