This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Improve comments and naming of functions/variables/members, NFC.
ClosedPublic

Authored by ABataev on May 18 2017, 9:33 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

ABataev created this revision.May 18 2017, 9:33 AM
anemet edited edge metadata.May 22 2017, 8:55 AM

Thanks for improving this!

lib/Transforms/Vectorize/SLPVectorizer.cpp
4772 ↗(On Diff #99445)

First word needs to be a verb, how about analyzingInstruction.

4774 ↗(On Diff #99445)

setAnalyzingOperands

4776 ↗(On Diff #99445)

areOperandsAnalyzed

4785 ↗(On Diff #99445)

getNextOperand

4820–4827 ↗(On Diff #99445)

Please mention the actual traversal used. It seems this is preorder. If that is true why do you even need to track the operand index? Iterative pre-order is:

while (!Stack.empty()) {
  Value *V = Stack.back();
  Stack.pop_back();
  visit(V);
  for (auto *O = V->rbegin(); O != V->rend(); ++O)
    Stack.push_back(*O);
}
ABataev updated this revision to Diff 100082.May 24 2017, 7:13 AM

Address comments from Adam.

anemet added inline comments.May 24 2017, 9:53 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

Again, which order?

ABataev added inline comments.May 24 2017, 10:11 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

DFS

anemet added inline comments.May 24 2017, 10:12 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

pre/in/post?

ABataev added inline comments.May 24 2017, 10:15 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

pre

anemet added inline comments.May 24 2017, 10:20 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

Okay, then please update the comment. Also please answer my question below why we chose to track the edges with an iterative pre-order traversal. If it's unnecessary, please fix or add a FIXME. Thank you.

ABataev added inline comments.May 24 2017, 11:01 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

It simplifies limiting the tree traversal level of the function. We can use simple pre-order traversal, but still need to keep the tree node level within the Stack, for example, to limit the recursion level.

anemet added inline comments.May 25 2017, 2:26 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

I am not sure I see a major difference, do you have an example?

4861 ↗(On Diff #100082)

It would be good to add a comment why we need to clear P beyond the first level.

ABataev added inline comments.May 26 2017, 6:37 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

I'm just saying that there is no major difference. We won't win anything by using of pre-order traversal comparing with DFS preorder traversal. We will need to track the current tree depth to be able to cut off the vectorization process if the tree depth is more than RecursionMaxDepth (see line 4879).

4861 ↗(On Diff #100082)

Ok

ABataev updated this revision to Diff 100398.May 26 2017, 6:45 AM

Added a comment regarding assigning nullptr value to P

anemet added inline comments.May 26 2017, 10:44 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

I'm just saying that there is no major difference. We won't win anything by using of pre-order traversal comparing with DFS preorder traversal.

Now I am confused. pre-order traversal *is* DFS pre-order traversal.

ABataev added inline comments.May 26 2017, 10:53 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

I'm sorry, I meant you're suggesting to change traversal to BFS.

anemet added inline comments.May 26 2017, 11:00 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

No it's still preorder. My point was that for preorder unlike postorder you don't need to keep track of level of processing that has been performed on a node (i.e. getOperandIndex()). See the code one comment down.

ABataev added inline comments.May 26 2017, 12:03 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

Yes, I understand this. But how to track the tree depth in your code? We need to limit the depth by the RecusrsionMaxDepth.

anemet added inline comments.May 31 2017, 8:35 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

The depth could be stored with each stack element -- that would still be simpler than keeping track of the operand.

It would also properly describe the intent; the depth is only maintained to control the recursion limit.

ABataev added inline comments.Jun 1 2017, 6:07 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

I believe this is a matter of taste. Besides, the same approach is used in matchAssociativeReduction function and I believe it is better to use the same approach in all functions, instead of using the different algorithms for implementation of the same idea.

anemet added inline comments.Jun 1 2017, 11:45 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4824–4825 ↗(On Diff #100082)

But it is *not* the same idea. matchAssociativeReduction performs *post-order* traversal. So it's a wrong example to support your case!

I did see that you were copying the code from there (already a bad idea). I think that both of these should be rewritten with df_iterator and po_iterator respectively.

Let's just add a FIXME in the code and move on: "This performs pre-order traversal. FIXME: can we do this with df_iterator?"

ABataev updated this revision to Diff 101222.Jun 2 2017, 8:49 AM

Updates after comments.

anemet accepted this revision.Jun 2 2017, 9:48 AM

Thanks very much for rewriting the loop. This is way more intuitive now.

A few more nits/questions below but feel free to commit either way.

lib/Transforms/Vectorize/SLPVectorizer.cpp
4783 ↗(On Diff #101222)

Capitalize sentence.

4784 ↗(On Diff #101222)

I would start the level from 0, that's more canonical and then later would check like this:

if (++Level < MaxDepth)

4801–4802 ↗(On Diff #101222)

Rather than saying "on the next iteration", isn't it the desire that we don't analyze the phi node unless this is the root node? If yes it's better to say so, i.e. "unless this is the root node".

This revision is now accepted and ready to land.Jun 2 2017, 9:48 AM
ABataev updated this revision to Diff 101265.Jun 2 2017, 12:49 PM

Update after review

anemet accepted this revision.Jun 2 2017, 12:52 PM

LGTM.

lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗(On Diff #101265)

Something got messed up with upper/lowercase here.

ABataev added inline comments.Jun 2 2017, 1:08 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗(On Diff #101265)

I tried to keep the original names of the variables. Should I capitalize them too?

anemet added inline comments.Jun 2 2017, 1:15 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗(On Diff #101265)

Oh sorry, no I only meant the capitalize the first letter of the sentence.

Looking at it again, I think I misread this and thought that "subtrees" was the beginning of a sentence. It's not so just go back to the original version. Sorry about the confusion!

ABataev added inline comments.Jun 2 2017, 1:20 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
4782–4783 ↗(On Diff #101265)

Ok, no problems. Then I'll restore it back before commit

This revision was automatically updated to reflect the committed changes.
anna added a subscriber: anna.Jun 29 2017, 2:15 PM

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
4828

Could you please explain how this is NFC wrt the previous code?
In the code on the LHS, we are checking that the operand is in the same basic block as the root before placing it on the stack.

Here we are unconditionally placing all operands on the stack.

In D33320#795834, @anna wrote:

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).

It does not limits the number of processed nodes, it limits the tree height just like it was before.

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
4828

Yes, missed these checks, will add them

anna added a comment.Jun 30 2017, 7:05 AM
In D33320#795834, @anna wrote:

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).

It does not limits the number of processed nodes, it limits the tree height just like it was before.

Yes, but limiting the tree height itself is not enough right? Now, in the worst case, 2^12 nodes being processed in the tryToVectorizeHorReductionOrInstOperands, when earlier it was just a single node (i.e. before this change: https://reviews.llvm.org/D25517).

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
4828

Yeah, we saw huge compile time degradations in tryToVectorizeHorReductionOrInstOperands.

In D33320#796697, @anna wrote:
In D33320#795834, @anna wrote:

Hi Alexey, Adam,

From what I can see in this algorithm, there is no limit on the actual size of the stack in the loop. The level variable controls just the recursion limit. So, in effect, IIUC, the max total number of operands being processed by the while loop is 2 ^ RecursionLimit (it's to the base 2 because we avoid phi nodes).

It does not limits the number of processed nodes, it limits the tree height just like it was before.

Yes, but limiting the tree height itself is not enough right? Now, in the worst case, 2^12 nodes being processed in the tryToVectorizeHorReductionOrInstOperands, when earlier it was just a single node (i.e. before this change: https://reviews.llvm.org/D25517).

@anna, I am confused whether you're complaining about the additional overhead in the original change (D25517) or the algorithmic change in this refinement (D33320). Your comparison above seems to suggest your baseline is *before* the original changes.

In this refinement change, we can have more nodes on the stack compared to the original change but the number of nodes processed should remain unchanged.

anna added a comment.Jun 30 2017, 7:52 AM

@anemet: We saw the degradations with respect to this change (D33320) itself, not before the original change. Here's the code change as I see it. Please correct if this is wrong:

  1. we were initially processing exactly one node in tryToVectorizeHorReductionOrInstOperands equivalent function, before any changes below.
  2. after D25517, we are processing at most 2 ^ 12 nodes in tryToVectorizeHorReductionOrInstOperands, but they are limited to a single basic block.
  3. in this refinement change (D33320), we are processing at most 2 ^ 12 nodes, but they are no longer limited to a single basic block.

@ABataev is fixing #3 by limiting it to the same basic block so that it's actually an NFC wrt #2. That might fix the compile time regressions. However, my concern is that aren't we still having the possibility of high compile time impact in #2 when we have a single large basic block with 2^10 binary instructions for example (because we are not limiting the number of nodes, but rather the depth of the tree)?
Should we perhaps have a threshold cutoff for the number of stack nodes? Or reduce the MaxDepthRecursion from 12 to 6?

I bailed out of this loop (return false in tryToVectorizeHorReductionOrInstOperands) when the stack size is too large, and that fixed our compile time regression temporarily.

I'm working on the patch that stops vectorization if the parent basic block of the instruction is not BB or the instruction was processed already, but it's quite hard to add a test for this change. I can publish it as NFC, because we just limiting the number of analyzed instructions if this is acceptable.

anna added a comment.Jun 30 2017, 9:11 AM

I'm working on the patch that stops vectorization if the parent basic block of the instruction is not BB or the instruction was processed already, but it's quite hard to add a test for this change. I can publish it as NFC, because we just limiting the number of analyzed instructions if this is acceptable.

Thanks Alexey. That should fix the current regressions we are seeing. Could you please add me as a reviewer to the patch?

The second concern I have (as mentioned in above comment) is that with D25517, we are increasing the complexity of tryToVectorizeHorReductionOrInstOperands from a single node being processed to an exponential number of nodes (2^12). Is this a correct analysis?

It may not be an issue in practice once we limit to the current basic block (the fix you're working on). However, this function is called for every instruction in the IR as part of vectorizeChainsInBlock, so as the basic block size increases, we may see a compile time impact.

MatzeB added a subscriber: MatzeB.Jul 28 2017, 1:50 PM

We just identified this commit as the cause of a 10x slowdown when compiling shared_sha256.c in the llvm test-suite/CTMark (x86, no special flags should get you the default -O3) showing up on our performance tracking.

Did any of the planned improvements make it to ToT yet?

We just identified this commit as the cause of a 10x slowdown when compiling shared_sha256.c in the llvm test-suite/CTMark (x86, no special flags should get you the default -O3) showing up on our performance tracking.

Did any of the planned improvements make it to ToT yet?

Seems like this has recovered on ToT.

anna added a comment.Jul 28 2017, 6:57 PM

We just identified this commit as the cause of a 10x slowdown when compiling shared_sha256.c in the llvm test-suite/CTMark (x86, no special flags should get you the default -O3) showing up on our performance tracking.

Did any of the planned improvements make it to ToT yet?

Seems like this has recovered on ToT.

This was the review thread: https://reviews.llvm.org/D34881