This is an archive of the discontinued LLVM Phabricator instance.

[GVNHoist] Factor out reachability to search for anticipable instructions quickly
ClosedPublic

Authored by hiraditya on Jul 26 2017, 2:36 PM.

Details

Summary

Factor out the reachability such that multiple queries to find reachability of values are fast. This is based on finding the ANTIC points
in the CFG which do not change during hoisting. The ANTIC points are basically the dominance-frontiers in the inverse graph. So we introduce a data structure (CHI nodes)
to keep track of values flowing out of a basic block. We only do this for values with multiple occurrences in the function as they are the potential hoistable candidates.

This patch allows us to hoist instructions to a basic block with >2 successors, as well as deal with infinite loops in a trivial way.
Relevant test cases are added to show the functionality as well as regression fixes from PR32821.

Regression from previous GVNHoist:
We do not hoist fully redundant expressions because fully redundant expressions are already handled by NewGVN

Based on the suggestions from @dberlin and @sebpop

Diff Detail

Event Timeline

hiraditya created this revision.Jul 26 2017, 2:36 PM
dberlin edited edge metadata.Jul 26 2017, 3:05 PM

FYI: IDFCalculator can compute the PostDominanceFrontier in linear time.

FYI: IDFCalculator can compute the PostDominanceFrontier in linear time.

I was unable to find how to pass the root node of inverse graph, because there might be more than one root.
Thanks,

FYI: IDFCalculator can compute the PostDominanceFrontier in linear time.

I was unable to find how to pass the root node of inverse graph, because there might be more than one root.
Thanks,

No.
The graph always has one root, virtual or not.
Once kuba's latest patch is committed, it will *always* be virtual, actually.

But i'm not sure why that would stop you one way or the other?
The IDF calculator only asks for the defining blocks, not the root.
You can hand it all CFG blocks and it should work fine.

What do you believe you need to hand it the root node for?

For example,the dominance frontier of the root node is guaranteed to be empty.
"the dominance frontier of a node d is the set of all nodes n such that d dominates an immediate predecessor of n, but d does not strictly dominate n."

The root node strictly dominates everything but itself, and it has no immediate predecessors (successors on the reverse graph),.

FYI: IDFCalculator can compute the PostDominanceFrontier in linear time.

I was unable to find how to pass the root node of inverse graph, because there might be more than one root.
Thanks,

No.
The graph always has one root, virtual or not.
Once kuba's latest patch is committed, it will *always* be virtual, actually.

But i'm not sure why that would stop you one way or the other?
The IDF calculator only asks for the defining blocks, not the root.
You can hand it all CFG blocks and it should work fine.

What I understand is that the API of IDFCalculator::calculate, populates a vector with all the dominance frontiers of the defining blocks (AFAICT from the usage in ADCE.cpp).
How can I create a mapping of a basic block vs. its (post) dominance frontier. For this pass I would need to have such a mapping. I guess I'm having difficulty understanding the code. THanks for the help.

What do you believe you need to hand it the root node for?

For example,the dominance frontier of the root node is guaranteed to be empty.
"the dominance frontier of a node d is the set of all nodes n such that d dominates an immediate predecessor of n, but d does not strictly dominate n."

The root node strictly dominates everything but itself, and it has no immediate predecessors (successors on the reverse graph),.

FYI: IDFCalculator can compute the PostDominanceFrontier in linear time.

I was unable to find how to pass the root node of inverse graph, because there might be more than one root.
Thanks,

No.
The graph always has one root, virtual or not.
Once kuba's latest patch is committed, it will *always* be virtual, actually.

But i'm not sure why that would stop you one way or the other?
The IDF calculator only asks for the defining blocks, not the root.
You can hand it all CFG blocks and it should work fine.

What I understand is that the API of IDFCalculator::calculate, populates a vector with all the dominance frontiers of the defining blocks (AFAICT from the usage in ADCE.cpp).
How can I create a mapping of a basic block vs. its (post) dominance frontier. For this pass I would need to have such a mapping. I guess I'm having difficulty understanding the code. THanks for the help.

Such a mapping is by definition n^2 space, and i'm having trouble seeing why it is necessary.

Here is what you are doing:
For each instruction with the same VN:

find the post-dominance frontier of the block of the instruction
Insert a chi there, with certain arguments.

This is unnecessarily wasteful (you may compute the same pdf again and again).

Here is what SSUPRE (and you), should do:

Collect all the blocks of the instructions with the same VN into defining blocks.
Compute the PDF using IDFCalculator.
Place empty chis in the PDF.

At this point, you have two options:

Walk post-dominator tree top-down and use a stack to store the last value you see.
When you hit a chi from a given edge, the value to use as the argument is at the top of the stack.

This is O(Basic Blocks)

The O(instructions+chis) way to do it is:

Make a vector of instructions and chi argument uses. Each should be given the DFS in/out number from the post dominator tree node for the basic block they come from, and a local DFS number (IE order in block) in the case of instructions
A "chi argument use" is created for each incoming edge to the chi, but is empty/fake. These should assume the basic block from the other side of the edge (IE not the chi block, but the edge to the chi block).
Sort by dfs in/out, then local number

Walk vector with a stack.
At each element of vector:
  while( !top of stack is empty && DFS in/out  of current thing in vector is not inside of DFS number of top of stack)
   pop stack
If element you are staring at is a chi use:
  if stack is empty, chi has null operand
  if stack is not, set chi argument for the edge to top of stack
else: // must be an instruction
    if stack is empty, push onto stack
   If stack is not empty, the thing on the stack post-dominates you and you are redundant :)

The forwards version of this algorithm is used by predicateinfo to do SSA renaming.
Your algorithm is the same on the reverse graph, except the chi arguments are virtual :)

hiraditya added a subscriber: kuhar.Aug 1 2017, 8:51 AM

Such a mapping is by definition n^2 space, and i'm having trouble seeing why it is necessary.

Here is what you are doing:
For each instruction with the same VN:

find the post-dominance frontier of the block of the instruction
Insert a chi there, with certain arguments.

This is unnecessarily wasteful (you may compute the same pdf again and again).

Here is what SSUPRE (and you), should do:

Collect all the blocks of the instructions with the same VN into defining blocks.
Compute the PDF using IDFCalculator.
Place empty chis in the PDF.

At this point, you have two options:

Walk post-dominator tree top-down and use a stack to store the last value you see.
When you hit a chi from a given edge, the value to use as the argument is at the top of the stack.

This is O(Basic Blocks)

I tried this based on your suggestions, but post-dominator tree does not work well with infinite loops or CFG with multiple exits.
I can wait on @kuhar 's patch to be merged+stabilize and then I can work on this idea.

The O(instructions+chis) way to do it is:

Make a vector of instructions and chi argument uses. Each should be given the DFS in/out number from the post dominator tree node for the basic block they come from, and a local DFS number (IE order in block) in the case of instructions
A "chi argument use" is created for each incoming edge to the chi, but is empty/fake. These should assume the basic block from the other side of the edge (IE not the chi block, but the edge to the chi block).
Sort by dfs in/out, then local number

Walk vector with a stack.
At each element of vector:
  while( !top of stack is empty && DFS in/out  of current thing in vector is not inside of DFS number of top of stack)
   pop stack
If element you are staring at is a chi use:
  if stack is empty, chi has null operand
  if stack is not, set chi argument for the edge to top of stack
else: // must be an instruction
    if stack is empty, push onto stack
   If stack is not empty, the thing on the stack post-dominates you and you are redundant :)

The forwards version of this algorithm is used by predicateinfo to do SSA renaming.
Your algorithm is the same on the reverse graph, except the chi arguments are virtual :)

Because this patch precomputes ANTIC points based on your suggestions, it is already faster than the previous fix of iterating on dominator tree for each instruction to be hoisted.
Can we merge this patch if you think it is good to go, and then I'll work on this idea once the post-dominator patch by @kuhar is ready.
Thank you for writing the algorithm here, it helped me realize how close the algorithm is to PHI-insertion.

For some reason phabricator missed the following comments:

If you are willing to commit to making it better, i'm happy to review this version :)

Yes, I'll update gvn-hoist once post-dominator patch is merged.

Thanks,

-Aditya


From: Daniel Berlin <dberlin at dberlin.org>
Sent: Tuesday, August 1, 2017 11:11 AM
To: reviews+D35918+public+40ddab4ebe264bbb at reviews.llvm.org
Cc: Aditya Kumar; Sebastian Pop; Geoff Berry; Davide Italiano; Kuba Kuderski; Piotr Padlewski; llvm-commits
Subject: Re: [PATCH] D35918: [GVNHoist] Factor out reachability to search for anticipable instructions quickly

If you are willing to commit to making it better, i'm happy to review this version :)

hiraditya added a comment.EditedAug 3 2017, 1:22 PM

Can you illustrate how to perform a DFS walk on a post-dom tree with your patch. Also, how would I get the (virtual) root node of post-dom tree.

Thanks,

kuhar added a comment.Aug 3 2017, 1:30 PM

Can you illustrate how to perform a DFS walk on a post-dom tree with your patch. Also, how would I get the (virtual) root node of post-dom tree.

Thanks,

You can get the virtual root by calling PDT.getNode(nullptr). Then you can use something like llvm/ADT/DepthFirstIterator.h or llvm/ADT/PostOrderIterator.h to run DFS on it. Implementing a custom DFS should also be easy.

hiraditya added a comment.EditedAug 3 2017, 2:15 PM

Can you illustrate how to perform a DFS walk on a post-dom tree with your patch. Also, how would I get the (virtual) root node of post-dom tree.

Thanks,

You can get the virtual root by calling PDT.getNode(nullptr). Then you can use something like llvm/ADT/DepthFirstIterator.h or llvm/ADT/PostOrderIterator.h to run DFS on it. Implementing a custom DFS should also be easy.

I tried a simple df walk like this, it crashes the compiler for the test cases llvm/test/Transforms/GVNHoist/infinite-loop-indirect.ll, and llvm/test/Transforms/GVNHoist/infinite-loop-direct.ll:

auto PrevBB = PDT->getNode(nullptr);
for (auto it = df_begin(PrevBB); it != df_end(PrevBB);
     ++it) {
}

Because PrevBB is NULL when PrevBB = PDF->getNode(nullptr)

These test cases are added in this patch.

Thanks,

kuhar added a comment.EditedAug 3 2017, 2:26 PM

I tried a simple df walk like this, it crashes the compiler for the test cases llvm/test/Transforms/GVNHoist/infinite-loop-indirect.ll, and llvm/test/Transforms/GVNHoist/infinite-loop-direct.ll:

auto PrevBB = PDT->getNode(nullptr);
for (auto it = df_begin(PrevBB); it != df_end(PrevBB);
     ++it) {
}

These test cases are added in this patch.

Thanks,

I tried running loop like yours in a couple of my unittests (DominatorTreeTest.cpp) and it seems to work:

auto PrevBB = PDT->getNode(nullptr);
for (auto it = df_begin(PrevBB); it != df_end(PrevBB);
     ++it) {
  auto BB = it->getBlock();
  outs() << (BB ? BB->getName() : "virtual root") << "\n";
}

Could you prepare a reduced repro for the crash you saw?

Thanks,
Kuba

hiraditya added a comment.EditedAug 3 2017, 2:36 PM

I tried a simple df walk like this, it crashes the compiler for the test cases llvm/test/Transforms/GVNHoist/infinite-loop-indirect.ll, and llvm/test/Transforms/GVNHoist/infinite-loop-direct.ll:

auto PrevBB = PDT->getNode(nullptr);
for (auto it = df_begin(PrevBB); it != df_end(PrevBB);
     ++it) {
}

These test cases are added in this patch.

Thanks,

I tried running loop like yours in a couple of my unittests (DominatorTreeTest.cpp) and it seems to work:

auto PrevBB = PDT->getNode(nullptr);
for (auto it = df_begin(PrevBB); it != df_end(PrevBB);
     ++it) {
  auto BB = it->getBlock();
  outs() << (BB ? BB->getName() : "virtual root") << "\n";
}

Could you prepare a reduced repro for the crash you saw?

Thanks,
Kuba

Try ./bin/opt -S -adce < llvm/test/Transforms/GVNHoist/infinite-loop-direct.ll

with the following patch. I added the code here for easy reproducibility.

diff --git a/lib/Transforms/Scalar/ADCE.cpp b/lib/Transforms/Scalar/ADCE.cpp
index 5b467dc..eb711b9 100644
--- a/lib/Transforms/Scalar/ADCE.cpp
+++ b/lib/Transforms/Scalar/ADCE.cpp
@@ -164,6 +164,10 @@ public:
 }

 bool AggressiveDeadCodeElimination::performDeadCodeElimination() {
+  auto PrevBB = PDT.getNode(nullptr);
+  for (auto it = df_begin(PrevBB), E = df_end(PrevBB); it != E; ++it) {
+      dbgs() << "\nTest\n";
+  }

Here is the reduced test case:

$ cat a.ll
; ModuleID = 'bugpoint-reduced-simplified.bc'
source_filename = "bugpoint-output-f4ca947.bc"
target triple = "x86_64-unknown-linux-gnu"

define void @bazv1() local_unnamed_addr {
entry:
  br label %while.cond

while.cond:                                       ; preds = %while.cond, %entry
  br label %while.cond
}
kuhar added a comment.Aug 3 2017, 3:00 PM

Try ./bin/opt -S -adce < llvm/test/Transforms/GVNHoist/infinite-loop-direct.ll

with the following patch. I added the code here for easy reproducibility.

diff --git a/lib/Transforms/Scalar/ADCE.cpp b/lib/Transforms/Scalar/ADCE.cpp
index 5b467dc..eb711b9 100644
--- a/lib/Transforms/Scalar/ADCE.cpp
+++ b/lib/Transforms/Scalar/ADCE.cpp
@@ -164,6 +164,10 @@ public:
 }

 bool AggressiveDeadCodeElimination::performDeadCodeElimination() {
+  auto PrevBB = PDT.getNode(nullptr);
+  for (auto it = df_begin(PrevBB), E = df_end(PrevBB); it != E; ++it) {
+      dbgs() << "\nTest\n";
+  }

Here is the reduced test case:

$ cat a.ll
; ModuleID = 'bugpoint-reduced-simplified.bc'
source_filename = "bugpoint-output-f4ca947.bc"
target triple = "x86_64-unknown-linux-gnu"

define void @bazv1() local_unnamed_addr {
entry:
  br label %while.cond

while.cond:                                       ; preds = %while.cond, %entry
  br label %while.cond
}

Oh, yes, that happens because the entire function is reverse-unreachable, which causes the tree to be completely empty. D35851 puts reverse-unreachable CFG nodes in the tree and solves this problem.
In the meanwhile, you can check if PDT.getNode(nullptr) returns a nullptr and not iterate over it that situation.

hiraditya updated this revision to Diff 110647.Aug 10 2017, 4:24 PM

Based on suggestions from @dberlin, I have updated the patch:

  1. Compute iterated post-dominance frontiers
  2. Iterate on post-dominator tree to insert CHIargs more efficiently

The time complexity would now be: O(duplicate-values * N) where N = nodes in the CFG. For each duplicate values the algorithm finds their control-dependence by computing the iterated post-dominance frontier and then inserts CHI args to track anticipability.

hiraditya updated this revision to Diff 110743.Aug 11 2017, 9:25 AM

Added Comments

This looks reasonable at a glance, note it is likely to be subtly broken in some cases until D35381 goes in.
I've added a few coments, i'll take a deeper look in a little bit.

lib/Transforms/Scalar/GVNHoist.cpp
751

What are you really trying to do here? (IE why is the continue commented out)

954

I think since you wrote this we added moving functions you could use here.

Remove dead comments.

hiraditya added inline comments.Aug 11 2017, 12:06 PM
lib/Transforms/Scalar/GVNHoist.cpp
751

That was from the early stage when I only handled unique post-dom frontiers. I've removed those lines.

954

I'll look into it and update this code. Thanks for the feedback.

hiraditya updated this revision to Diff 112071.Aug 21 2017, 3:31 PM

Updated MemorySSA updates.
Outlined functions from codegen part.

I think this is mostly reasonable at this point.
If you clean up the naming so it matches the style guide (IE you have a lot of things ending in T for no reason, etc), i'll approve it.

lib/Transforms/Scalar/GVNHoist.cpp
87

This is unused

90

This is unused

305

This is unused

hiraditya updated this revision to Diff 112386.Aug 23 2017, 9:19 AM

Addressed @dberlin 's comments.

dberlin accepted this revision.Aug 24 2017, 12:28 PM

With these changes, i think it is a good start.

lib/Transforms/Scalar/GVNHoist.cpp
119–123

All of the uses of this type should just be auto, AFAICT

560

This keeps going after it finds anything, despite the fact that the boolean won't change.
I would instead use (you probably need to include STLExtras.h)

auto Found = any_of(TI->successors(), [](const BasicBlock *BB){ return BB == Dest; });

This revision is now accepted and ready to land.Aug 24 2017, 12:28 PM
hiraditya updated this revision to Diff 114071.Sep 6 2017, 2:27 PM

use any_of instead of iterating over the successors.

hiraditya marked an inline comment as done.Sep 6 2017, 2:30 PM
hiraditya added inline comments.
lib/Transforms/Scalar/GVNHoist.cpp
119–123

CHHIt is used in function argument at a couple of places.

Sure, it's a function argument in a few other places.
The other places, you should be using auto :)
I marked a bunch of them.

In general, the only reason not to use auto is if the type is important and non-obvious.

In most of the cases, you would be better off naming variables better than exposing the types.

IE instead of CHIArg C, use auto CHIArg
(auto C is also fine if you think the type is obvious or unimportant)

lib/Transforms/Scalar/GVNHoist.cpp
553

Pass in a range instead of two iterators.

556

auto C

573

Pass in a range instead?

575–581

range based for loop?

642

auto VCHI

643

auto It

667

depth_first instead of explicit iterator

675

auto &A

685

auto

687

auto

hiraditya updated this revision to Diff 114439.Sep 8 2017, 2:59 PM

added auto to relevant places.

hiraditya marked 4 inline comments as done.Sep 8 2017, 2:59 PM
hiraditya updated this revision to Diff 114446.Sep 8 2017, 3:44 PM

use depth_first

This revision was automatically updated to reflect the committed changes.