Fixed some comments, added an additional description of the algorithms,
improved readability of the code.
Details
- Reviewers
anemet - Commits
- rGe4e5923ef17c: [SLP] Improve comments and naming of functions/variables/members, NFC.
rG2c08fde9e570: [SLP] Improve comments and naming of functions/variables/members, NFC.
rL304616: [SLP] Improve comments and naming of functions/variables/members, NFC.
rL304593: [SLP] Improve comments and naming of functions/variables/members, NFC.
Diff Detail
- Repository
- rL LLVM
Event Timeline
Thanks for improving this!
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4772 ↗ | (On Diff #99445) | First word needs to be a verb, how about analyzingInstruction. |
4774 ↗ | (On Diff #99445) | setAnalyzingOperands |
4776 ↗ | (On Diff #99445) | areOperandsAnalyzed |
4785 ↗ | (On Diff #99445) | getNextOperand |
4820–4827 ↗ | (On Diff #99445) | Please mention the actual traversal used. It seems this is preorder. If that is true why do you even need to track the operand index? Iterative pre-order is: while (!Stack.empty()) { Value *V = Stack.back(); Stack.pop_back(); visit(V); for (auto *O = V->rbegin(); O != V->rend(); ++O) Stack.push_back(*O); } |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | Again, which order? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | DFS |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | pre/in/post? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | pre |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | Okay, then please update the comment. Also please answer my question below why we chose to track the edges with an iterative pre-order traversal. If it's unnecessary, please fix or add a FIXME. Thank you. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | It simplifies limiting the tree traversal level of the function. We can use simple pre-order traversal, but still need to keep the tree node level within the Stack, for example, to limit the recursion level. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | I am not sure I see a major difference, do you have an example? |
4861 ↗ | (On Diff #100082) | It would be good to add a comment why we need to clear P beyond the first level. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | I'm just saying that there is no major difference. We won't win anything by using of pre-order traversal comparing with DFS preorder traversal. We will need to track the current tree depth to be able to cut off the vectorization process if the tree depth is more than RecursionMaxDepth (see line 4879). |
4861 ↗ | (On Diff #100082) | Ok |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) |
Now I am confused. pre-order traversal *is* DFS pre-order traversal. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | I'm sorry, I meant you're suggesting to change traversal to BFS. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | No it's still preorder. My point was that for preorder unlike postorder you don't need to keep track of level of processing that has been performed on a node (i.e. getOperandIndex()). See the code one comment down. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | Yes, I understand this. But how to track the tree depth in your code? We need to limit the depth by the RecusrsionMaxDepth. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | The depth could be stored with each stack element -- that would still be simpler than keeping track of the operand. It would also properly describe the intent; the depth is only maintained to control the recursion limit. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | I believe this is a matter of taste. Besides, the same approach is used in matchAssociativeReduction function and I believe it is better to use the same approach in all functions, instead of using the different algorithms for implementation of the same idea. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4824–4825 ↗ | (On Diff #100082) | But it is *not* the same idea. matchAssociativeReduction performs *post-order* traversal. So it's a wrong example to support your case! I did see that you were copying the code from there (already a bad idea). I think that both of these should be rewritten with df_iterator and po_iterator respectively. Let's just add a FIXME in the code and move on: "This performs pre-order traversal. FIXME: can we do this with df_iterator?" |
Thanks very much for rewriting the loop. This is way more intuitive now.
A few more nits/questions below but feel free to commit either way.
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4783 ↗ | (On Diff #101222) | Capitalize sentence. |
4784 ↗ | (On Diff #101222) | I would start the level from 0, that's more canonical and then later would check like this: if (++Level < MaxDepth) |
4801–4802 ↗ | (On Diff #101222) | Rather than saying "on the next iteration", isn't it the desire that we don't analyze the phi node unless this is the root node? If yes it's better to say so, i.e. "unless this is the root node". |
LGTM.
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4782–4783 ↗ | (On Diff #101265) | Something got messed up with upper/lowercase here. |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4782–4783 ↗ | (On Diff #101265) | I tried to keep the original names of the variables. Should I capitalize them too? |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4782–4783 ↗ | (On Diff #101265) | Oh sorry, no I only meant the capitalize the first letter of the sentence. Looking at it again, I think I misread this and thought that "subtrees" was the beginning of a sentence. It's not so just go back to the original version. Sorry about the confusion! |
lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4782–4783 ↗ | (On Diff #101265) | Ok, no problems. Then I'll restore it back before commit |
Hi Alexey, Adam,
From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4828 | Could you please explain how this is NFC wrt the previous code? Here we are unconditionally placing all operands on the stack. |
It does not limits the number of processed nodes, it limits the tree height just like it was before.
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4828 | Yes, missed these checks, will add them |
Yes, but limiting the tree height itself is not enough right? Now, in the worst case, 2^12 nodes being processed in the tryToVectorizeHorReductionOrInstOperands, when earlier it was just a single node (i.e. before this change: https://reviews.llvm.org/D25517).
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
4828 | Yeah, we saw huge compile time degradations in tryToVectorizeHorReductionOrInstOperands. |
@anna, I am confused whether you're complaining about the additional overhead in the original change (D25517) or the algorithmic change in this refinement (D33320). Your comparison above seems to suggest your baseline is *before* the original changes.
In this refinement change, we can have more nodes on the stack compared to the original change but the number of nodes processed should remain unchanged.
@anemet: We saw the degradations with respect to this change (D33320) itself, not before the original change. Here's the code change as I see it. Please correct if this is wrong:
- we were initially processing exactly one node in tryToVectorizeHorReductionOrInstOperands equivalent function, before any changes below.
- after D25517, we are processing at most 2 ^ 12 nodes in tryToVectorizeHorReductionOrInstOperands, but they are limited to a single basic block.
- in this refinement change (D33320), we are processing at most 2 ^ 12 nodes, but they are no longer limited to a single basic block.
@ABataev is fixing #3 by limiting it to the same basic block so that it's actually an NFC wrt #2. That might fix the compile time regressions. However, my concern is that aren't we still having the possibility of high compile time impact in #2 when we have a single large basic block with 2^10 binary instructions for example (because we are not limiting the number of nodes, but rather the depth of the tree)?
Should we perhaps have a threshold cutoff for the number of stack nodes? Or reduce the MaxDepthRecursion from 12 to 6?
I bailed out of this loop (return false in tryToVectorizeHorReductionOrInstOperands) when the stack size is too large, and that fixed our compile time regression temporarily.
I'm working on the patch that stops vectorization if the parent basic block of the instruction is not BB or the instruction was processed already, but it's quite hard to add a test for this change. I can publish it as NFC, because we just limiting the number of analyzed instructions if this is acceptable.
Thanks Alexey. That should fix the current regressions we are seeing. Could you please add me as a reviewer to the patch?
The second concern I have (as mentioned in above comment) is that with D25517, we are increasing the complexity of tryToVectorizeHorReductionOrInstOperands from a single node being processed to an exponential number of nodes (2^12). Is this a correct analysis?
It may not be an issue in practice once we limit to the current basic block (the fix you're working on). However, this function is called for every instruction in the IR as part of vectorizeChainsInBlock, so as the basic block size increases, we may see a compile time impact.
We just identified this commit as the cause of a 10x slowdown when compiling shared_sha256.c in the llvm test-suite/CTMark (x86, no special flags should get you the default -O3) showing up on our performance tracking.
Did any of the planned improvements make it to ToT yet?
Could you please explain how this is NFC wrt the previous code?
In the code on the LHS, we are checking that the operand is in the same basic block as the root before placing it on the stack.
Here we are unconditionally placing all operands on the stack.