This is an archive of the discontinued LLVM Phabricator instance.

[Dominators] Include infinite loops in PostDominatorTree
ClosedPublic

Authored by kuhar on Jul 25 2017, 12:22 PM.

Details

Summary

This patch teaches PostDominatorTree about infinite loops. It is built on top of D29705 by @dberlin which includes a very detailed motivation for this change.

What's new is that the patch also teaches the incremental updater how to deal with reverse-unreachable regions and how to properly maintain and verify tree roots. Before that, the incremental algorithm sometimes ended up preserving reverse-unreachable regions after updates that wouldn't appear in the tree if it was constructed from scratch on the same CFG.

This patch makes the following assumptions:

  • A sequence of updates should produce the same tree as a recalculating it.
  • Any sequence of the same updates should lead to the same tree.
  • Siblings and roots are unordered.

The last two properties are essential to efficiently perform batch updates in the future.
When it comes to the first one, we can decide later that the consistency between freshly built tree and an updated one doesn't matter match, as there are many correct ways to pick roots in infinite loops, and to relax this assumption. That should enable us to recalculate postdominators less frequently.

This patch is pretty conservative when it comes to incremental updates on reverse-unreachable regions and ends up recalculating the whole tree in many cases. It should be possible to improve the performance in many cases, if we decide that it's important enough.
That being said, my experiments showed that reverse-unreachable are very rare in the IR emitted by clang when bootstrapping clang. Here are the statistics I collected by analyzing IR between passes and after each removePredecessor call:

# functions:  52283
# samples:  337609
# reverse unreachable BBs:  216022
# BBs:  247840796
Percent reverse-unreachable:  0.08716159869015269 %
Max(PercRevUnreachable) in a function:  87.58620689655172 %
# > 25 % samples:  471 ( 0.1395104988314885 % samples )
... in 145 ( 0.27733680163724345 % functions )

Most of the reverse-unreachable regions come from invalid IR where it wouldn't be possible to construct a PostDomTree anyway.

I would like to commit this patch in the next week in order to be able to complete the work that depends on it before the end of my internship, so please don't wait long to voice your concerns :).

Diff Detail

Repository
rL LLVM

Event Timeline

kuhar created this revision.Jul 25 2017, 12:22 PM
kuhar updated this revision to Diff 108145.Jul 25 2017, 1:15 PM

Add a comment explaining rebuilt after an unreachable deletion.

kuhar updated this revision to Diff 108151.Jul 25 2017, 1:43 PM

Don't recalculate the whole tree when a deletion makes a CFG node reverse-unreachable.

kuhar added inline comments.Jul 25 2017, 2:55 PM
include/llvm/Support/GenericDomTreeConstruction.h
611 ↗(On Diff #108151)

s/infinite root/infinite loop

kuhar edited the summary of this revision. (Show Details)
kuhar added inline comments.Jul 25 2017, 3:21 PM
lib/Transforms/Scalar/ADCE.cpp
265 ↗(On Diff #108151)

@dberlin: shouldn't this be:

post-dom root child *is* a return

?

dberlin added inline comments.Jul 25 2017, 3:34 PM
lib/Transforms/Scalar/ADCE.cpp
265 ↗(On Diff #108151)

yes, it should be.

kuhar updated this revision to Diff 108169.EditedJul 25 2017, 3:38 PM

Fix a comment typo and a debug message in ADCE.

kuhar marked 3 inline comments as done.Jul 25 2017, 3:39 PM
kuhar added inline comments.Jul 25 2017, 6:09 PM
lib/Transforms/Scalar/ADCE.cpp
265 ↗(On Diff #108151)

Followup: why do we only check for ReturnInst here? I mean, shouldn't we treat UnreachableInst the same way and not mark it as live?

kuhar added inline comments.Jul 25 2017, 6:13 PM
lib/Transforms/Scalar/ADCE.cpp
265 ↗(On Diff #108151)

I checked and continuing also on UnreachableInst doesn't seem to cause any regressions.

dberlin added inline comments.Jul 26 2017, 10:05 AM
include/llvm/Support/GenericDomTreeConstruction.h
401 ↗(On Diff #108169)

Can't you use the forward dfs in/outnumbers to tell if the roots were redundant?

A redundant root should be within the in/out of some other root, no?

lib/Transforms/Scalar/ADCE.cpp
265 ↗(On Diff #108151)

I'm actually curious if ADCE works without this loop at all now
In theory, it claims it was being done "to protect IDF calculator" (which is definitely not right, but ...).
It already knows to mark loops live if !removeloops. Everything else should be fair game to remove, at least by the current logic.

But i think we should worry about this in a followup, and stick with making the original loop do "the same thing"on the postdom tree it was trying to do before.

kuhar added inline comments.Jul 26 2017, 10:32 AM
include/llvm/Support/GenericDomTreeConstruction.h
401 ↗(On Diff #108169)

It can happen that a path from a redundant root to some other one goes through reverse-reachable regions, so I think that would mean that we'd have to run the forward DFS on the whole CFG, including the forward-unreachable parts. Then, can you always tell if a node is a part of the same SCC as some other one using only DFS in/out numbers?

lib/Transforms/Scalar/ADCE.cpp
265 ↗(On Diff #108151)

Yeah, I think that it'd be reasonable to revisit it later.
(BTW, I tried deleting the loop, but it resulted in multiple things crashing/failing.)

jlebar removed a reviewer: jlebar.Jul 26 2017, 10:41 AM
jlebar added a subscriber: jlebar.

I'm probably not the best person to review this.

dberlin accepted this revision.Jul 27 2017, 8:06 AM

I'm going to accept this.
But please give Tobias until early next week to comment, given his past concerns.
(IMHO, if he wants to suggest we do something different, at this point, I believe the onus is on him to implement such a thing and show it works)

include/llvm/Support/GenericDomTreeConstruction.h
401 ↗(On Diff #108169)

We could get away with running forward DFS only on the reverse-unreachable parts i believe.
But i think what you have is fine for now.

This revision is now accepted and ready to land.Jul 27 2017, 8:06 AM
grosser edited edge metadata.Jul 31 2017, 4:25 AM

Hi all,

sorry for coming late into the game. It seems the phabricator notifications again don't work for me. I started to look into this during the weekend (and today) and am planning to send an update.

Best,
Tobias

test/Transforms/StructurizeCFG/no-branch-to-entry.ll
1 ↗(On Diff #108169)

Why was this XFAILed? Could you potentially add a comment why this test cases does not work any more?

dberlin added inline comments.Jul 31 2017, 8:47 AM
test/Transforms/StructurizeCFG/no-branch-to-entry.ll
1 ↗(On Diff #108169)

This is probably copied from my original patch.

It used to generate a region that caused it to try to replace the entry block, but it does not anymore.
I had left it xfailed and pinged the structurizer maintainer to see if he could generate a case where it still tried to replace the entry block.
I believe it to not be possible anymore, but i couldn't be sure.

So the short version is: "This test will probably end up removed, but i wanted to give it a little time"

Hi @kuhar, @dberlin,

thanks for giving me time to look into this and especially thanks for adding dynamic update support to DT/PDT. This new interface is indeed very nice!

As you know I am concerned about some of the implications of connecting infinite loops to the virtual exit (see below). But assuming this is the way to go, the structure of this patch looks OK. I just found a couple of minor typos and would appreciate some more documentation.

Specifically, while there is precedence for extending the post-dominance tree to reverse unreachable nodes the way you propose, this is not well documented in the literature. The paper you based the dominator tree construction on ("An Experimental Study of Dynamic Dominators") seems to very clearly distinguish between unreachable and reachable CFG nodes. Whereas in the new code there won't be any unreachable nodes any more. I think it would be nice to document these differences, the reason why this choice has been taken, and the impact this has on guarantees the PDT interface gives to the user.

The changes this patch introduces to D36107 will already add some documentation.

Daniel Berlin wrote some motivation in his old commit and we afterwards did quite some research that brought up both arguments for and against this change. Especially as the goal is to make the use of the PDT a lot more common, I think it would be great to document the motivation and reasoning both in the commit message and in the patch itself. Could you e.g. briefly list the other compilers that take this approach (I already looked them up for you [https://docs.google.com/document/d/1lL8xcqfnqNArvQj3TpgEXOTybr8kucagjzqvChQCUGI/edit]), clarify that this is indeed not a common extension in the literature but that it is in practice desirable for some reason (which?). I also would really appreciate a brief comment in situations where you had to adapt the original dynamic dominators algorithm to support this use case. This will certainly help others who might work on PDT in the future.

I also have two questions:

1) Does connecting edges to the virtual exit break the parent property?

You define the parent property as: "If a node gets disconnected from the graph, then all of the nodes it dominated previously will now become unreachable." After this patch, nodes that become unreachable are still connected to the graph. This is not visible in verifyParentProperty, as it does not update Roots properly, but my understanding is that this is indeed the case and a breakage you are willing to accept to ensure that all nodes are part of the CFG. Is this correct?

2) Can removing an edge from a dominator tree weaken the dominance relation?

The example I am thinking of is the one in D36107.

Assume we start with a normal (complete) CFG:

and then drop the edge C -> B.

Today + Literature (leave C unreachable)

The output we get today (and which would be computed by "An Experimental Study of Dynamic Dominators") would be the following graph with C as now being reverse unreachable not part of the graph:

The relation D postdom B is unaffected. This is to my understanding in line with the definition of post-dominance.

Proposed Patch + GCC (Connecting C to virtual exit)

This patch will now instead create the following PDT (grey edges have been removed compared to original PDT and are not part of new PDT)

Interestingly, the property D postdom B does not hold any more. To my understanding, in a complete (post) dominator tree, removing an edge should never weaken a normal dominatore tree. Is this true? Do we loose this property by supporting reverse-unreachable CFG nodes? (I am aware there are certain trade-offs we need to take, but I would like to understand and at best document them)

I believe the issues above are worth documenting.

While thinking about this problem, I experimented with connecting unreachables to the virtual exit, but dropping their connection to the reachable part of the CFG when constructing the Post-Dominator tree (D36135). Similarly to the proposed patch, we change the definition of what post-dominance means to include unreachable nodes, but as the edges we ignore can anyhow never be reached on the reverse CFG (they only became reachable through our virtual edges) this seems like a reasonable extension.

The resulting tree matches the dominator tree of the reachable part, but also contains all unreachable nodes:

If you compare the changed test cases, a lot of the regressions we see with D35851 disappear while still all nodes are part of the PDT. I managed to port all existing code to this new approach, but maybe I missed something obvious? I wonder if we have some actual test cases that would break with the above implementation? Is there any transformation pass which would become incorrect with this PDT?

include/llvm/Support/GenericDomTreeConstruction.h
267 ↗(On Diff #108169)

terminators

1012 ↗(On Diff #108169)

PostDominatorTree

1202 ↗(On Diff #108169)

parent

test/Transforms/StructurizeCFG/no-branch-to-entry.ll
1 ↗(On Diff #108169)

Interesting. @kuhar: It probably makes sense to add a comment to the test cases, that it intended to be removed.

kuhar added a comment.Aug 1 2017, 7:37 PM

@grosser

Tobias,

Before documenting my code more, let me answer your questions and comment on your counterproposal.

this is not well documented in the literature. The paper you based the dominator tree construction on ("An Experimental Study of Dynamic Dominators") seems to very clearly distinguish between unreachable and reachable CFG nodes.

I think that it’s pretty convenient to assume that functions always have a single exit for academic purposes. If this was the case in LLVM, we could build postdominators on reverse CFG and literally be done at that point. But that’s not the case, and there’s a real life difference between forward-unreachable and reverse-unreachable code. There’s no way to end up in forward-unreachable code, because we always start executing code from the (single) entry node. Whereas when it comes to reverse-unreachable code, executing it happens in practice and we actually care about optimizing such code. Many embedded systems consist of a single large infinite loop that reads some memory and writes data somewhere else. And it is actually possible to exit the infinite loop there, just by calling a function with a call to exit inside. IMHO it is not much different from dynamically infinite loops, which are naturally modeled by postdominators. Because of that, it seems entirely reasonable to me to include reverse-unreachable in the PostDominatorTree.

that it is in practice desirable for some reason (which?)

The main motivation for modeling reverse-unreachable code and (statically) infinite loops in the PostDominatorTree is safely sinking code and optimizing code within infinite loops. To do that, you cannot pretend that reverse-reachable code dominates code that can branch to an infinite loop, which I find particularly problematic in your proposal.

Let’s consider one of your examples. In this case your implantation says that B immediately postdominates D, even though it is possible to branch to C and keep looping there until something exits the program. It would seem valid to sink instruction from B to D, which would not be the case if C had a dynamically unreachable exit. This example is more about profitability, but the real problem would be hoisting here, even assuming that there are no functions call in B, C, or D. Say there is instruction that can cause undefined bahavior in D, even something as simple like divide by zero. D immediately postdominates B and is its successor, so it would seem that hoisting that instruction to B is safe (assuming that we already proved that it is from other standpoints), and that instruction wouldn’t normally be executed if the control flow entered the infinite loop C. The same applies to hoisting loads and stores from D to B.

The other problem I see is that your definition of postdominance for reverse-unreachable code doesn’t lead to useful regions, if I understand it correctly. Regions in LLVM are defined as single-entry single-exit parts of CFG where the entry dominates every node in the region and the exit postdominates every node in it. For your example, it seems that there would be 2 regions: ABD and C, which doesn’t seem correct to me, as in practice you can jump from B, get stuck and then exit through C. Using the postdominance as defined D35851 doesn’t have this problem and results in 3 regions: AB, C, and D.

Saying that reverse-unreachable code never postdominates reverse-reachable code is like saying infinite loops, even with side effects, has undefined behavior and we assume to never execute it.

You define the parent property as: "If a node gets disconnected from the graph, then all of the nodes it dominated previously will now become unreachable." After this patch, nodes that become unreachable are still connected to the graph.

Interestingly, the property D postdom B does not hold any more. To my understanding, in a complete (post) dominator tree, removing an edge should never weaken a normal dominatore tree. Is this true? Do we loose this property by supporting reverse-unreachable CFG nodes?

This may be a little bit unintuitive at first, but here’s what happens when you call .deleteEdge with my patch: it first performs the edge deletion and can then immediately makes an insertion from the virtual exit to the reverse-unreachable code. From the end user’s perspective, those two steps happen atomically, but in reality it performs two different operations with an intermediate result not observable from outside. If you consider these two internal steps in terms of the parent and sibling property, the first one holds after deletion and the second one after insertion.

This is not visible in verifyParentProperty, as it does not update Roots properly

Verifying the parent property doesn’t really depend on some information internal to the DomTree -- you could place a .verifyParentProperty call in the middle of deletion and it would check it fine. The only thing that .verify would disagree with would be the new root not present in the Roots yet, but that’s kind of expected here, and the deletion and insertion happen atomically anyway. In this respect, neither deletion nor verification seems broken to me.

@grosser

Tobias,

Before documenting my code more, let me answer your questions and comment on your counterproposal.

Very good idea.

this is not well documented in the literature. The paper you based the dominator tree construction on ("An Experimental Study of Dynamic Dominators") seems to very clearly distinguish between unreachable and reachable CFG nodes.

I think that it’s pretty convenient to assume that functions always have a single exit for academic purposes. If this was the case in LLVM, we could build postdominators on reverse CFG and literally be done at that point. But that’s not the case, and there’s a real life difference between forward-unreachable and reverse-unreachable code. There’s no way to end up in forward-unreachable code, because we always start executing code from the (single) entry node. Whereas when it comes to reverse-unreachable code, executing it happens in practice and we actually care about optimizing such code. Many embedded systems consist of a single large infinite loop that reads some memory and writes data somewhere else. And it is actually possible to exit the infinite loop there, just by calling a function with a call to exit inside. IMHO it is not much different from dynamically infinite loops, which are naturally modeled by postdominators. Because of that, it seems entirely reasonable to me to include reverse-unreachable in the PostDominatorTree.

While I currently do not have any benchmark where optimizing infinite loops is useful (is there one I could have a look?), let's working under the assumption that it actually is. Chandler and Daniel raised the point that just the fact that the PDT contains all nodes makes user code simpler, which is another argument for this kind of modeling. In general, both points together make a good motivation to include all nodes in the PDT. I agree, assuming we understand and are happy with the resulting implications.

that it is in practice desirable for some reason (which?)

The main motivation for modeling reverse-unreachable code and (statically) infinite loops in the PostDominatorTree is safely sinking code and optimizing code within infinite loops. To do that, you cannot pretend that reverse-reachable code dominates code that can branch to an infinite loop, which I find particularly problematic in your proposal.

Let’s consider one of your examples. In this case your implantation says that B immediately postdominates D, even though it is possible to branch to C and keep looping there until something exits the program. It would seem valid to sink instruction from B to D, which would not be the case if C had a dynamically unreachable exit.

Thank you for coming up with an actual code transformation.

I assume you mean D postdom B (rather than B postdom D).

You say the fact that D postdom B is by itself sufficient to allow code sinking from B to D. Remember the original program:

which, as in D36107, may terminate in C with "br i1 true, label %C, label %B". To my understanding there is dynamically no difference between the original code and the code after dropping the edge C -> B. Both have the property that B postdom D. Would code sinking from B to D be in your opinion valid in the original program?

This example is more about profitability, but the real problem would be hoisting here, even assuming that there are no functions call in B, C, or D. Say there is instruction that can cause undefined behavior in D, even something as simple like divide by zero. D immediately postdominates B and is its successor, so it would seem that hoisting that instruction to B is safe (assuming that we already proved that it is from other standpoints), and that instruction wouldn’t normally be executed if the control flow entered the infinite loop C. The same applies to hoisting loads and stores from D to B.

The same question as above: what prevents you to hoist these very instructions in the original program with a dynamically infinite loop?

AFAIU X postdom Y does not guarantee that after Y is executed X is at some later point certainly executed.

The other problem I see is that your definition of postdominance for reverse-unreachable code doesn’t lead to useful regions, if I understand it correctly. Regions in LLVM are defined as single-entry single-exit parts of CFG where the entry dominates every node in the region and the exit postdominates every node in it. For your example, it seems that there would be 2 regions: ABD and C, which doesn’t seem correct to me, as in practice you can jump from B, get stuck and then exit through C. Using the postdominance as defined D35851 doesn’t have this problem and results in 3 regions: AB, C, and D.

Disclosure. I added the region test cases and I believe they are indeed useful the way they are. ;-)

I assume you mean "you can jump from B to C".

Regions for fully reverse reachable CFGs can be well defined through (post-)dominance. As we try to find a post-dominance definition that is consistent with what we expect, I prefer to define regions as a subgraph of the CFG that is connected to the CFG with a single incoming edge and a single outgoing edge.

Do we agree that the region for the original program

computed today makes sense? Similar to the example above, by dropping an edge that is known to not be executed, we just restrict the set of program executions the PDT can assume. No new incoming and outcoming edges are added:

So from this perspective I would assume the new region, to still be a region. Would you agree?

Today we achieve this result by checking that the entry dominates the exit and the exit post-dominates the entry. To check if a node is contained in the region, we check if it is dominated by entry and not dominated by exit. Post-dominance for reverse-unreachable blocks is unfortunately not very useful here for the reason you stated above. Another option was to put the virtual exit edge not to the actual exit, but to the node that connect reverse unreachable and reverse reachable parts of the CFG. This would address this issue in most common cases. If this is preferable, I could port this patch to your most recent code.

Saying that reverse-unreachable code never postdominates reverse-reachable code is like saying infinite loops, even with side effects, has undefined behavior and we assume to never execute it.

You define the parent property as: "If a node gets disconnected from the graph, then all of the nodes it dominated previously will now become unreachable." After this patch, nodes that become unreachable are still connected to the graph.

Interestingly, the property D postdom B does not hold any more. To my understanding, in a complete (post) dominator tree, removing an edge should never weaken a normal dominatore tree. Is this true? Do we loose this property by supporting reverse-unreachable CFG nodes?

This may be a little bit unintuitive at first

Then it certainly warrants some documentation. Thank you for explaining!

but here’s what happens when you call .deleteEdge with my patch: it first performs the edge deletion and can then immediately makes an insertion from the virtual exit to the reverse-unreachable code. From the end user’s perspective, those two steps happen atomically, but in reality it performs two different operations with an intermediate result not observable from outside. If you consider these two internal steps in terms of the parent and sibling property, the first one holds after deletion and the second one after insertion.

Interesting! Do I understand correctly that for a normal compete (post)dominator tree these two steps fall together. Hence, both parent and sibling property hold always?

This is not visible in verifyParentProperty, as it does not update Roots properly

Verifying the parent property doesn’t really depend on some information internal to the DomTree -- you could place a .verifyParentProperty call in the middle of deletion and it would check it fine. The only thing that .verify would disagree with would be the new root not present in the Roots yet, but that’s kind of expected here, and the deletion and insertion happen atomically anyway. In this respect, neither deletion nor verification seems broken to me.

Thank you for explaining the above. Yes, verify is verifying under the assumption that no virtual edges are added. Without virtual edges, the properties certainly hold. This difference is again something interesting to point out.

Best,
Tobias

While I currently do not have any benchmark where optimizing infinite loops is useful (is there one I could have a look?), let's working under the assumption that it actually is. Chandler and Daniel raised the point that just the fact that the PDT contains all nodes makes user code simpler, which is another argument for this kind of modeling. In general, both points together make a good motivation to include all nodes in the PDT. I agree, assuming we understand and are happy with the resulting implications.

FWIW, here is some more literature that assumes you have a single entry/exit node in postdom.
http://ssabook.gforge.inria.fr/latest/book.pdf

The PDG building algorithm (whihc is the standard one, AFAIK, but you are probably more familiar with PDG than i am): ".. Similar to the CFG, a PDG also has two nodes ENTRY and EXIT,
through which control flow enters and exits the program respectively. For this
purpose, it is assumed that every node of the CFG is reachable from the entry
node and can reach the exit node."

Most *optimizations* that use post-dominators have this assumption hidden in them, stated or not.

If you stare at 12.5.3 in that book, which presents a simplified explanation of SSUPRE.

If you don't include infinite loops and connect them to exit, the sink points will be wrong. We have an impl of SSUPRE posted for review, i think you can actually get it to happen for real with it (but i haven't had time to play)

IE given a trivial variation on their program, to try to make it easy to follow

regular loop {
  store a
  if (foo) {
    infinite loop:
      load a, do something with it
    exit program or go to infinite loop
  } 
}
load a

The goal of this program is that the blocks for infinite loop do not appear in the postdominator tree currently or are not connected to exit (hopefully it is trivial to see that it is possible to accomplish this given the current root finding, but again, i have not run it through clang)
Without infinite loops included or connected to exit, it will generate:

regular loop {
  if (foo) {
    infinite loop:
      load a, do something with it
    exit program or go to infinite loop
  } 
}
store a
load a

(Note that this load is deliberately here to cause it to store back. If it wasn't here, and instead you just had "use <value stored>", it would eliminate the store in favor of a temporary, registerizing the memory access)

This is, however, wrong, as now the load in the infinite loop does not see the store.
IE It will never consider the load in the infinite loop as a use that it must consider when placing factoring points (as it will not appear in the postdominance frontier). It will see this as a full redundancy.

With infinite loops present and connected, it will do this:

regular loop {
  if (foo) {
    infinite loop:
      store a, value
      load a, do something with it
    exit program or go to infinite loop
  } 
}
store a, value
load a

The same question as above: what prevents you to hoist these very instructions in the original program with a dynamically infinite loop?

post-dominance alone is definitely not sufficient to guarantee safety in general for either hoisting or sinking.
It's often a necessary but not sufficient condition for the hoisting sinking, and is usually used to ensure non-speculation of memory along new paths.
In particular: does this load post-dominate all other loads that are the same is the trivial version of this (if so, you are not speculating this). LLVM currently hacks around this using walking some expensive testing that could be replaced with O(1) answers based on post-dominance. That would be "is the later load in a cdeq set of an earlier load". If so, they must execute together or not at all.

However, as it is also used for elimination and elimination based sinking as well
As another example: In a function without any loads (to keep it simple), a later store that post-dominates all blocks containing other stores makes all the earlier stores redundant.

IE

if (foo)
store a
else
store a

store a

the later store post-dominating all the other stores is sufficient to eliminate them[1]

this also fails to be true in certain conditions if you either don't include the loops, or don't connect them to exit.
The important part is that the loop appears, and that it does not appear In the "wrong" part of the tree. As a trivial example why, if you add an infinite loop on a branch that, say, calls a global function that can see the store, you need to ensure we don't delete that store by seeing a wrong relationship. So the infinite loop can only be connected to blocks that will not cause us to believe that the later store post-dominates all uses.
It is certainly possible, depending on the location of stores/loads/etc, to connect the infinite loops somewhere else, and it may not matter for regions, but depending on the locations, it may affect store elimination.

[1] This notion is really is the basis of the SSUPRE transformation.
If you view the stores and aliased loads as "uses", it's basically "place a store where it postdominates all 'uses'. Eliminate any earlier 'uses' that are stores. Insert a store before any 'use' that is a load".
again, this is not a *perfect* description, because SSUPRE also considers lifetime optimality, and only tries to sink them as far as necessary to remove redundancy, instead of "as far down as it can".

Hi daniel,

thank you for adding some more context! Let me fire a quick reply asking some questions before looking deeper into your points!

While I currently do not have any benchmark where optimizing infinite loops is useful (is there one I could have a look?), let's working under the assumption that it actually is. Chandler and Daniel raised the point that just the fact that the PDT contains all nodes makes user code simpler, which is another argument for this kind of modeling. In general, both points together make a good motivation to include all nodes in the PDT. I agree, assuming we understand and are happy with the resulting implications.

FWIW, here is some more literature that assumes you have a single entry/exit node in postdom.
http://ssabook.gforge.inria.fr/latest/book.pdf

Fabrice's book, very cool indeed!

The PDG building algorithm (whihc is the standard one, AFAIK, but you are probably more familiar with PDG than i am): ".. Similar to the CFG, a PDG also has two nodes ENTRY and EXIT,
through which control flow enters and exits the program respectively. For this
purpose, it is assumed that every node of the CFG is reachable from the entry
node and can reach the exit node."

Most *optimizations* that use post-dominators have this assumption hidden in them, stated or not.

If you stare at 12.5.3 in that book, which presents a simplified explanation of SSUPRE.

If you don't include infinite loops and connect them to exit, the sink points will be wrong. We have an impl of SSUPRE posted for review, i think you can actually get it to happen for real with it (but i haven't had time to play)

Could you point me to the phabricator link. A quick search for SSUPRE did not bring it up.

IE given a trivial variation on their program, to try to make it easy to follow

regular loop {
  store a
  if (foo) {
    infinite loop:
      load a, do something with it
    exit program or go to infinite loop
  } 
}
load a

The goal of this program is that the blocks for infinite loop do not appear in the postdominator tree currently or are not connected to exit (hopefully it is trivial to see that it is possible to accomplish this given the current root finding, but again, i have not run it through clang)
Without infinite loops included or connected to exit, it will generate:

regular loop {
  if (foo) {
    infinite loop:
      load a, do something with it
    exit program or go to infinite loop
  } 
}
store a
load a

(Note that this load is deliberately here to cause it to store back. If it wasn't here, and instead you just had "use <value stored>", it would eliminate the store in favor of a temporary, registerizing the memory access)

This is, however, wrong, as now the load in the infinite loop does not see the store.
IE It will never consider the load in the infinite loop as a use that it must consider when placing factoring points (as it will not appear in the postdominance frontier). It will see this as a full redundancy.

With infinite loops present and connected, it will do this:

regular loop {
  if (foo) {
    infinite loop:
      store a, value
      load a, do something with it
    exit program or go to infinite loop
  } 
}
store a, value
load a

I need to think about these. Some LLVM-IR unit tests that explain the property you are looking for would make the discussion here more concrete, as I will need some time to translate your textual ideas to IR. Even without, I certainly will have a look at your pointers.

The same question as above: what prevents you to hoist these very instructions in the original program with a dynamically infinite loop?

post-dominance alone is definitely not sufficient to guarantee safety in general for either hoisting or sinking.

I fully agree that post-dominance alone is not sufficient. This seems contrary to the impression @kuhar's reply made to me. I need to think more about your answer below, but a more direct answer to the questions I asked will certainly help to clarify some of what has been discussed before. @kuhar can you help me understand if some of the previous examples make good test cases for the hoisting transformation you have in mind.

Overall, I am really not trying to shoot down this proposal. Even though I would like to have a PDT that works well with regions, if we can come up together with good test cases that show this is not possible, I am the first to commit and document them! Please help me to get your requirements written into nice understandable unit tests! If our discussions end up in useful test cases and documentation, then it was certainly of use.

It's often a necessary but not sufficient condition for the hoisting sinking, and is usually used to ensure non-speculation of memory along new paths.
In particular: does this load post-dominate all other loads that are the same is the trivial version of this (if so, you are not speculating this). LLVM currently hacks around this using walking some expensive testing that could be replaced with O(1) answers based on post-dominance. That would be "is the later load in a cdeq set of an earlier load". If so, they must execute together or not at all.

However, as it is also used for elimination and elimination based sinking as well
As another example: In a function without any loads (to keep it simple), a later store that post-dominates all blocks containing other stores makes all the earlier stores redundant.

IE

if (foo)
store a
else
store a

store a

the later store post-dominating all the other stores is sufficient to eliminate them[1]

this also fails to be true in certain conditions if you either don't include the loops, or don't connect them to exit.
The important part is that the loop appears, and that it does not appear In the "wrong" part of the tree. As a trivial example why, if you add an infinite loop on a branch that, say, calls a global function that can see the store, you need to ensure we don't delete that store by seeing a wrong relationship. So the infinite loop can only be connected to blocks that will not cause us to believe that the later store post-dominates all uses.
It is certainly possible, depending on the location of stores/loads/etc, to connect the infinite loops somewhere else, and it may not matter for regions, but depending on the locations, it may affect store elimination.

Again, I will need to think about the above and will try to come up with LLVM-IR test cases that illustrate the problem you describe. If you happen to have some available, this would be very much appreciated.

Best,
Tobias

[1] This notion is really is the basis of the SSUPRE transformation.
If you view the stores and aliased loads as "uses", it's basically "place a store where it postdominates all 'uses'. Eliminate any earlier 'uses' that are stores. Insert a store before any 'use' that is a load".
again, this is not a *perfect* description, because SSUPRE also considers lifetime optimality, and only tries to sink them as far as necessary to remove redundancy, instead of "as far down as it can".

kuhar added a comment.Aug 2 2017, 2:14 PM

Tobias,

To quickly address some of the remaining points:

but here’s what happens when you call .deleteEdge with my patch: it first performs the edge deletion and can then immediately makes an insertion from the virtual exit to the reverse-unreachable code. From the end user’s perspective, those two steps happen atomically, but in reality it performs two different operations with an intermediate result not observable from outside. If you consider these two internal steps in terms of the parent and sibling property, the first one holds after deletion and the second one after insertion.

Interesting! Do I understand correctly that for a normal compete (post)dominator tree these two steps fall together. Hence, both parent and sibling property hold always?

Yes, although you cannot assume that the .deleteEdge call doesn't affect the parent property, as internally may result in an insertion. Even then, both parent and sibling property hold the whole time -- between the internal deletion an insertion, and after the whole function ends.

Thank you for explaining the above. Yes, verify is verifying under the assumption that no virtual edges are added. Without virtual edges, the properties certainly hold. This difference is again something interesting to point out.

The properties hold all the time -- the thing is that the set of virtual exit's predecessor is stored only in the tree, and if you were to compute it just looking at the CFG you get an answer that assumes that the insertion already happened. If we made it a special case of root finding, then calling .verifyRootf before the internal insertion would also work, and thus the whole .veirfy.

I fully agree that post-dominance alone is not sufficient. This seems contrary to the impression @kuhar's reply made to me

I tried to point out that it's necessary but not sufficient by saying:

... (assuming that we already proved that it is (fine) from other standpoints) ...

kuhar added a comment.Aug 2 2017, 2:59 PM

If you stare at 12.5.3 in that book, which presents a simplified explanation of SSUPRE.

If you don't include infinite loops and connect them to exit, the sink points will be wrong. We have an impl of SSUPRE posted for review, i think you can actually get it to happen for real with it (but i haven't had time to play)

Could you point me to the phabricator link. A quick search for SSUPRE did not bring it up.

I believe that's the one: https://reviews.llvm.org/D29866
http://theory.stanford.edu/~robert/papers/rvi.ps

dberlin added a comment.EditedAug 7 2017, 10:15 AM

(Copying this comment because it ended up on the wrong review thread)
Note: There is also a rewritten DSE that uses PostDom as well: https://reviews.llvm.org/D29624

I'm not sure i will get a chance to produce LLVM IR for you (i'm sadly finding less and less time to work on LLVM these days, so i'm mostly having other people do it :P).

To give you a more described example the following completely simple algorithm (which GCC and a few other compilers use variants of, see tree-ssa-dse.c) *should* work, but will give wrong answers if you ignore infinite loops or do weird things to them.
again, i've made it not catch every case so i can do it in <50 lines. Normally, for example, there is more than one stack, instead of all possible stores on the same stack, which misses a ton of cases due to irrelevant aliasing, blah blah blah.

Stack StoreStack;  // In realer versions of this algorithm, we have multiple stacks or other structures. In fact, in this version, there will only ever be zero or one things on the stack.
void walkPostDomChildren(DomTreeNode *DTN)
{
  BasicBlock *BB = DTN->getBlock();
  if (!BB)
    return;
 // Ensure the that the store at the top of the stack is from something above us in our tree.  Normally would be done by checking if  dfs numbers of the current block is contained within the dfs in/out numbers of the top of stack.   See newgvn.cpp, predicateinfo.cpp, etc for ValueStack, which does this.


Pop Store Stack until block for instruction at  Stack.top() postdominates DTN.
 
  for (auto *I : reverse(BB))
    if (auto *SI = dyn_cast<StoreInst>(I)) {
      // If whatever is on the top of the stack is a must alias and post-dominates us, we're dead.
      if (!Stack.empty() &&  AA.isMustAlias(Stack.top()->getPointerOperand(), I->getPointerOperand()) )
         I->eraseFromParent();
      else if (Stack.empty())
        Stack.push(SI)
      else
        // This is really a may-def, so we must clear the stack. In this stupid version, this is true even if it noaliases top of stack.
        Stack.clear()
       continue;
   }
      if (!(getModRefInfo(I) & MRI_NoModRef))
        // A use of memory, so reset our stack
        Stack.clear();
  for (auto * Child : DTN)
       walkPostDomChildren(DTN)
}

walkPostDomChildren(PDT->getRoot())

Basically, if a store postdominates another earlier store without an intervening may-use/may-def, the earlier store is dead. There is nothing that could possibly see it happen before the later store.

Modulo transcription bugs, this algorithm is provably correct for our simplified assumptions. It will only eliminate stores that are dead.

Given the following program:

Block 1
store a, 5
if (foo)
{
Block 2:
     call foo(load a)
     goto Block 2
}
Block 3:
store a, 6

If infinite loops are not in the post-dominator tree, the tree will look like this:

3
|
1

The algorithm will visit 3, then push store a, 6 on the stack, visit 1, and use it to remove store a, 5
This is illegal. The store a, 5, is visible and used in block 2, and block 2 in fact, calls a function with value we stored.

Whoops!

If infinite loops are connected to virtual exit, the tree will look like this:

virtual exit
 /  |  \
3  2   1

We will no longer eliminate store a, 5, and we will get a correct answer because at each point, nothing will be on the stack.
We will visit 3, push store a, 6. We will visit 2, pop the stack because store a, 6 is no longer in scope. Stack will stay empty throughout 2.
We will visit 1, push store a, 5. We will exit the algorithm with this store on the stack, having eliminated nothing.
We will get the correct answers no matter how many things no longer exit. This is a safe and conservative post-dominator tree for our algorithm.

Is this perfect? For certain, no.

In this example, for *this* algorithm, you can connect the infinite loop anywhere you like as long as you do not generate the post-dominator tree:

3
|
1
|
2

or
3
|
2

or
3
|  \
2   1

If you generate that, you will eliminate the store a, 5 again, and you should not.

I believe the last one, sadly, is what your other patch will generate. It's not going to end up legal to simply cause the reverse-unreachable parts to have no effect on post-dom, because they are still forward-reachable.

As an example of what would theoretically work:

virtual exit
 |   \
 3    \
 |       1
 2

Would also give a correct answer for this algorithm, on this example. But would be not valid as a postdominator tree because there is no reverse path from 3 to 2, *and* there is a reverse path from 3 to 1, but 3 is not an ancestor of 1.

Sadly, there is no good-in-all-cases to be had

In fact, if the program was:

Block 1
store a, 5
if (foo)
{
Block 2:
     goto Block 2
}
Block 3:
store a, 6

we will still generate:

virtual exit
 /  |  \
3  2   1

but in practice, we could eliminate store a, 5 as it's not possible to see that it didn't happen in this example.

The greater problem with not connecting the infinite loops to exit is that "where it is legal to connect the loops to" (IE what doesn't break this algorithm) is now complex. Effectively, it must take into account where the memory instructions are (though the exact conditions that must be fulfilled do not have to be expressed like this)
Otherwise, i can generate an adversarial program that will give a wrong answer when you apply the above algorithm.
IE by placing loads and stores and calls such that you end up with the same problem as the 3->1->2 post dom tree above.

In particular, for the algorithm above, the places for where you connect the infinite loops must ensure either the stack gets cleared in the right place, or that the blocks are siblings instead of parent child, so the scope gets reset. (IE either we must always see the may-use in the infinite loop and clear the stack before going above it, or we must pop the stack so that we do not try to eliminate something "above" the infinite loop using something "below" the infinite loop)

Worse than that, that's just what works with *this* algorithm.

SSUPRE, which requires a conservative correct postdominance frontier, has worse requirements you'd have to prove are met.
So does Sreedhar and Gao's IDF algorithm (which is what IDF calculator uses), which requires the CFG and PostDom tree be in sync, and postdom have the parent and sibling property.
(This is solvable by making the real cfg have the set of fake edges you are creating, but they'd need to be walked when you call predecessors. If you connect them all to a fake exit, even if you have algorithms that must know about the edges, it's one block to special case instead of many)

Additionally, if you use post-dominance to prove speculation-safety, it's now not enough to know where the load/stores are, you'd have to know where every trapping instruction is to ensure the place you connect the infinite loop does not imply speculation safety where it should not.

Essentially: Even if you completely ignore the sibling and parent properties, connecting the infinite loops or changing the paths in general requires enough knowledge of the algorithms being used to be able to prove that you aren't going to generate post-dom trees that make them wrong.
In practice, since at least the IDF calculator relies on the sibling and parent properties, this ends up a practical requirement for the connection points...

All that seems ... really fragile and difficult :)

chandlerc edited edge metadata.Aug 8 2017, 4:04 AM

I need to think about these. Some LLVM-IR unit tests that explain the property you are looking for would make the discussion here more concrete, as I will need some time to translate your textual ideas to IR. Even without, I certainly will have a look at your pointers.

<snip>

Overall, I am really not trying to shoot down this proposal. Even though I would like to have a PDT that works well with regions, if we can come up together with good test cases that show this is not possible, I am the first to commit and document them! Please help me to get your requirements written into nice understandable unit tests! If our discussions end up in useful test cases and documentation, then it was certainly of use.

<snip>

Again, I will need to think about the above and will try to come up with LLVM-IR test cases that illustrate the problem you describe. If you happen to have some available, this would be very much appreciated.

Hey Tobias,

Jakub mentioned that he was essentially blocked here and there didn't seem to be a good path forward, and I have to agree that this doesn't seem like a very tenable end state...

I see two high level issues here:

  1. You've somewhat left this review in a bad place. "I need to think about this" without a real time bound doesn't seem like a tenable state. It's been almost a week without update. I think it is very reasonable for folks to want to make progress without getting blocked like this.
  1. You seem to be setting the bar *really* high. I know that regions are used in Polly, but they aren't used in any other part of LLVM and postdom has fairly immediate uses. While I'm also interested in figuring out the right way to integrate things like Polly's regions with postdom, I don't think it is really reasonable to hold up pretty significant improvements to the rest of LLVM on that working. I feel like we should be able to make more incremental progress here.

My suggestion would be something like the following:

  • If this actively breaks some Polly use case, we should get a test case for that *now*. And if so, it might make sense to have Polly use its own postdom code until the interactions here are sorted out. There doesn't seem to be any need for one singular implementation of this stuff.
  • Jakub and Danny make progress on postdoms here, specifically making them faster to update and more viable for some of the non-region use cases.
  • If/when folks have a better idea of how to integrate them with regions *or* an understanding of why they can't *then* we add the region test cases and support you're mentioning.

Essentially, 100% integration with regions, an analysis with a very specialized single user at the moment, doesn't seem like a great hard requirement for us to make progress. While I don't want to break regions and Polly, I think these things need to be able to move independently rather than being forced to boil the ocean.

I need to think about these. Some LLVM-IR unit tests that explain the property you are looking for would make the discussion here more concrete, as I will need some time to translate your textual ideas to IR. Even without, I certainly will have a look at your pointers.

<snip>

Overall, I am really not trying to shoot down this proposal. Even though I would like to have a PDT that works well with regions, if we can come up together with good test cases that show this is not possible, I am the first to commit and document them! Please help me to get your requirements written into nice understandable unit tests! If our discussions end up in useful test cases and documentation, then it was certainly of use.

<snip>

Again, I will need to think about the above and will try to come up with LLVM-IR test cases that illustrate the problem you describe. If you happen to have some available, this would be very much appreciated.

Hey Tobias,

Jakub mentioned that he was essentially blocked here and there didn't seem to be a good path forward, and I have to agree that this doesn't seem like a very tenable end state...

I fully agree with you. We should get this as quickly resolved as possible!

I see two high level issues here:

  1. You've somewhat left this review in a bad place. "I need to think about this" without a real time bound doesn't seem like a tenable state. It's been almost a week without update. I think it is very reasonable for folks to want to make progress without getting blocked like this.

I also would like to move on, especially as I would like Jakub to be able to close this last point (and also because it takes a lot of my time). I am doing my best to look into this. It would really help if we can boil this down to some LLVM-IR test cases and which we can use to better document the guarantees the proposed post-dominator implementation gives. I now need to find a time slot to translate Dany's examples to LLVM-IR. If I could get help with this, this would be amazing.

  1. You seem to be setting the bar *really* high. I know that regions are used in Polly, but they aren't used in any other part of LLVM and postdom has fairly immediate uses. While I'm also interested in figuring out the right way to integrate things like Polly's regions with postdom, I don't think it is really reasonable to hold up pretty significant improvements to the rest of LLVM on that working. I feel like we should be able to make more incremental progress here.

While this tangentially also affects Polly (for infinite loops), we can certainly work around this. Dany even suggested to have a Polly compatible mode.

Now, my main concerns are not regions or Polly, but the semantics of post-dominance in general. We are implementing an extension which is not documented in papers (even though other similar implementations exist). I would like to get the guarantees and implications documented before we move large portions of LLVM to use this.

My main concerns is the fact that the introduction of virtual edges weakens the post-dominance relationship. If we want to verify the correct placement of life-range metadata one could e.g. require that the life-range-end node post-dominates the live-range start. This invariant could be verified throughout the pass transformations to ensure no pass breaks this invariant. Unfortunately, when cfg-simplify drops edges and we re-connect the backwards unreachable nodes to the exit node as proposed here, the post-dominance relation is weakened and the verification surprisingly fails. I would really like to understand (and see documentation) if writing such checks
with the proposed post-dominance information would require us to think about such possible invalidations at each point where we potentially make parts of the CFG backwards unreachable.

This is not the case when working with a complete CFG (or a CFG where unreachable nodes are not modeled).

My suggestion would be something like the following:

  • If this actively breaks some Polly use case, we should get a test case for that *now*. And if so, it might make sense to have Polly use its own postdom code until the interactions here are sorted out. There doesn't seem to be any need for one singular implementation of this stuff.

There are test cases in tree since years, which I believe break. What else can I provide?

I clearly dislike two versions, but even if we indeed introduce two versions their guarantees and semantics differences should be made clear.

  • Jakub and Danny make progress on postdoms here, specifically making them faster to update and more viable for some of the non-region use cases.

I totally agree. This is great in general and I am very glad to see this going in. Jakub did an outstanding job here.

  • If/when folks have a better idea of how to integrate them with regions *or* an understanding of why they can't *then* we add the region test cases and support you're mentioning.

Essentially, 100% integration with regions, an analysis with a very specialized single user at the moment, doesn't seem like a great hard requirement for us to make progress. While I don't want to break regions and Polly, I think these things need to be able to move independently rather than being forced to boil the ocean.

Polly won't break with this change, but it will be very difficult to asses alternatives if the current requirements are not documented and tested.

Let me propose a way forward:

  • I discussed this very same issue with Hal at the LLVM summer school for over two hours. He has very good understanding of this subject. If he can review this patch faster then me and agrees with the direction, I am glad to accept to step back from this review.
  • I really would like to get at least one (or two) LLVM-IR test cases that clarify the corner-case properties of the proposed patch. If Jakub could help me to translate Dany's example into a self-contained test case, this would be outstanding. This would make it easier for me to reply. Even if not, I will try to look into this tonight or tomorrow night at the latest and will propose Dany's example as new test case. This will then hopefully give more insights.

Best,
Tobias

kuhar updated this revision to Diff 110304.Aug 8 2017, 4:42 PM
kuhar marked 5 inline comments as done.

Rebase to ToT and address remarks.

kuhar updated this revision to Diff 110305.Aug 8 2017, 4:45 PM
kuhar marked an inline comment as done.

Fix a typo.

(Copying this comment because it ended up on the wrong review thread)
Note: There is also a rewritten DSE that uses PostDom as well: https://reviews.llvm.org/D29624

Hi Daniel,

sorry for the delay. Let me get back into the discussion.

I'm not sure i will get a chance to produce LLVM IR for you (i'm sadly finding less and less time to work on LLVM these days, so i'm mostly having other people do it :P).

Especially because of this, thank you very much for taking the time to discuss this in so much detail. This is indeed very helpful.

To give you a more described example the following completely simple algorithm (which GCC and a few other compilers use variants of, see tree-ssa-dse.c) *should* work, but will give wrong answers if you ignore infinite loops or do weird things to them.
again, i've made it not catch every case so i can do it in <50 lines. Normally, for example, there is more than one stack, instead of all possible stores on the same stack, which misses a ton of cases due to irrelevant aliasing, blah blah blah.

Stack StoreStack;  // In realer versions of this algorithm, we have multiple stacks or other structures. In fact, in this version, there will only ever be zero or one things on the stack.
void walkPostDomChildren(DomTreeNode *DTN)
{
  BasicBlock *BB = DTN->getBlock();
  if (!BB)
    return;
 // Ensure the that the store at the top of the stack is from something above us in our tree.  Normally would be done by checking if  dfs numbers of the current block is contained within the dfs in/out numbers of the top of stack.   See newgvn.cpp, predicateinfo.cpp, etc for ValueStack, which does this.


Pop Store Stack until block for instruction at  Stack.top() postdominates DTN.
 
  for (auto *I : reverse(BB))
    if (auto *SI = dyn_cast<StoreInst>(I)) {
      // If whatever is on the top of the stack is a must alias and post-dominates us, we're dead.
      if (!Stack.empty() &&  AA.isMustAlias(Stack.top()->getPointerOperand(), I->getPointerOperand()) )
         I->eraseFromParent();
      else if (Stack.empty())
        Stack.push(SI)
      else
        // This is really a may-def, so we must clear the stack. In this stupid version, this is true even if it noaliases top of stack.
        Stack.clear()
       continue;
   }
      if (!(getModRefInfo(I) & MRI_NoModRef))
        // A use of memory, so reset our stack
        Stack.clear();
  for (auto * Child : DTN)
       walkPostDomChildren(DTN)
}

walkPostDomChildren(PDT->getRoot())

Basically, if a store postdominates another earlier store without an intervening may-use/may-def, the earlier store is dead. There is nothing that could possibly see it happen before the later store.

Modulo transcription bugs, this algorithm is provably correct for our simplified assumptions. It will only eliminate stores that are dead.

Given the following program:

Block 1
store a, 5
if (foo)
{
Block 2:
     call foo(load a)
     goto Block 2
}
Block 3:
store a, 6

This is a great example, which perfectly highlights the part I do not yet feel is clear to me.

AFAIU your program corresponds to the following piece of LLVM-IR (which happens to almost correspond to the visualizations I posted before):

define void @foo(float* %a) {
   ....
}

define void @foo(i1 %cond, float* %a) {
block1:
   store float 5.0, float* %a
   br i1 %cond, label %block2, label %block3

block2:
  call foo(float* %a)
  br label %block2

block3:
  store float 6.0, float* %a
}

which is very similar in nature to the one I posted above:

If infinite loops are not in the post-dominator tree, the tree will look like this:

3
|
1

The algorithm will visit 3, then push store a, 6 on the stack, visit 1, and use it to remove store a, 5
This is illegal. The store a, 5, is visible and used in block 2, and block 2 in fact, calls a function with value we stored.

Whoops!

If infinite loops are connected to virtual exit, the tree will look like this:

virtual exit
 /  |  \
3  2   1

We will no longer eliminate store a, 5, and we will get a correct answer because at each point, nothing will be on the stack.
We will visit 3, push store a, 6. We will visit 2, pop the stack because store a, 6 is no longer in scope. Stack will stay empty throughout 2.
We will visit 1, push store a, 5. We will exit the algorithm with this store on the stack, having eliminated nothing.
We will get the correct answers no matter how many things no longer exit. This is a safe and conservative post-dominator tree for our algorithm.

Is this perfect? For certain, no.

In this example, for *this* algorithm, you can connect the infinite loop anywhere you like as long as you do not generate the post-dominator tree:

3
|
1
|
2

or
3
|
2

or
3
|  \
2   1

If you generate that, you will eliminate the store a, 5 again, and you should not.

Interesting, but slightly confusing. Assume you change the previous program slightly:

define void @foo(float* %a) {
   ....
}

define void @foo(i1 %cond, float* %a) {
block1:
   store float 5.0, float* %a
   br i1 %cond, label %block2, label %block3

block2:
  call foo(float* %a)
  br i1 true, label %block2, label %block1   <---- Only this line has changed

block3:
  store float 6.0, float* %a
}

You get a CFG very similar to the one I posted earlier:

In your case the PDT is the first of the three you just mentioned:

3

1

2

This is the post-dominator tree which we compute today and which won't be changed by this proposal (as no reverse-unreachable blocks exist). Following your description the above post-dominator tree has the property that block3 post-dominates block1. What prevents the algorithm above to delete the store in block1 in the second slightly modified case? AFAIU the PDTs of both examples are identical and your algorithm does not look at the CFG edges. Hence I would expect it to incorrectly delete the original store also in the slightly modified example. Do I miss something obvious?

I will look into this again tomorrow to understand better where I am wrong.

One other slightly related question. Do you assume that A post-dominates B means that a program that reaches B eventually must reach A?

Hi Jakub,

thanks for the new documentation. Just a couple of comments, most minor. This is a clear improvement and will help the discussion (and future people developing the code). Now just trying to understand where I am wrong in Daniel's example.

Best,
Tobias

lib/Transforms/Scalar/ADCE.cpp
265 ↗(On Diff #108151)

Should this be:

post-dom root child *is* a return

It seems the old version slipped in again?

unittests/IR/DominatorTreeTest.cpp
410 ↗(On Diff #110305)

nice

451 ↗(On Diff #110305)

because

455 ↗(On Diff #110305)

Thank you for the nice documentation. I still don't like that we loose the dominance relation between B and D, but at least this behavioral change is very clear.

kuhar updated this revision to Diff 110316.Aug 8 2017, 5:58 PM
kuhar marked an inline comment as done.

Fix nits and add infinite loop testcases (based on Danny's examples).

kuhar marked an inline comment as done.Aug 8 2017, 5:59 PM

Add Dani's email reply to Phab:

If you generate that, you will eliminate the store a, 5 again, and you should not.

Interesting, but slightly confusing.

This is what i get for trying to reduce an algorithm quickly.
I'm sure i incorrectly removed some postdominance frontier checks or something.

It's probably easiest to take one of the working PDSE patches and look at that.

 

Assume you change the previous program slightly:

  define void @foo(float* %a) {
     ....
  }

  define void @foo(i1 %cond, float* %a) {
  block1:
     store float 5.0, float* %a
     br i1 %cond, label %block2, label %block3

  block2:
    call foo(float* %a)
    br i1 true, label %block2, label %block1   <---- Only this line has changed

  block3:
    store float 6.0, float* %a
  }

You get a CFG very similar to the one I posted earlier:

F3887544: Post Dominance.png https://reviews.llvm.org/F3887544 
In your case the PDT is the first of the three you just mentioned:

 

3

>
>  |
>
>
> 1
>
>  |
>
>
> 2

This is the post-dominator tree which we compute today and which won't be changed by this proposal (as no reverse-unreachable blocks exist).

This program should be illegal, and as far as i can tell, it is :)

-> % bin/opt -view-cfg foo2.ll
Entry block to function must not have predecessors!
label %block1
bin/opt: foo2.ll: error: input module is broken!

Let's assume that's trivial to fix though, because it is.

One other slightly related question. Do you assume that A post-dominates B means that a program that reaches B eventually must reach A?

You can't say anything block wise, only edge wise.
That is, C is dependent on edge A->B if C postdominates B and C does not strictly postdominate A.

Added Dani's email reply II to phab:

So, i definitely removed a postdominancefrontier check.
The PDF of X is equivalent to the set of branch blocks upon which X is control dependent.
Though, as mentioned this only means the block controls whether X executes, you need to label the edges.  

FWIW:I noticed that earlier, you mentioned literature does not talk about what we are doing, though there are practical implementations. It's worth pointing out that literature doesn't often talk about it as part of the postdominator construction itself, but as part of the CFG part that they build post-dom on top of.  They very often assume and require a unique exit node where everything has a path to it. 

Here are some example seminal papers that do this, and then build post-dom/post-domfrontiers on top of it:

Cytron's Control dependence paper:
http://polaris.cs.uiuc.edu/~padua/cs426/p451-cytron.pdf page 456. 
Note that they require everything have a path to exit:
"We assume that each
node is on a path from Entry and on a path to Exit"

Ferrante's program dependence graph paper
http://dl.acm.org/citation.cfm?id=24041
"Definition 1. A control flow graph is a directed graph G augmented with a
unique entry node START and a unique exit node STOP such that each node in
the graph has at most two successors. ...  We further assume that for any node N in G there exists a path
from START to N and a path from N to STOP. "

If you start with a CFG where you require every node go to exit, you don't need to do anything special just when computing postdom, because you just added the previously virtual edges to the actual CFG instead :)

On Tue, Aug 8, 2017 at 6:43 PM, Daniel Berlin <dberlin@dberlin.org> wrote:

If you generate that, you will eliminate the store a, 5 again, and you should not.

Interesting, but slightly confusing.

This is what i get for trying to reduce an algorithm quickly.
I'm sure i incorrectly removed some postdominance frontier checks or something.

It's probably easiest to take one of the working PDSE patches and look at that.

 

Assume you change the previous program slightly:

  define void @foo(float* %a) {
     ....
  }

  define void @foo(i1 %cond, float* %a) {
  block1:
     store float 5.0, float* %a
     br i1 %cond, label %block2, label %block3

  block2:
    call foo(float* %a)
    br i1 true, label %block2, label %block1   <---- Only this line has changed

  block3:
    store float 6.0, float* %a
  }

You get a CFG very similar to the one I posted earlier:

F3887544: Post Dominance.png https://reviews.llvm.org/F3887544 
In your case the PDT is the first of the three you just mentioned:

 

3

>
>  |
>
>
> 1
>
>  |
>
>
> 2

This is the post-dominator tree which we compute today and which won't be changed by this proposal (as no reverse-unreachable blocks exist).

This program should be illegal, and as far as i can tell, it is :)

-> % bin/opt -view-cfg foo2.ll
Entry block to function must not have predecessors!
label %block1
bin/opt: foo2.ll: error: input module is broken!

Let's assume that's trivial to fix though, because it is.

One other slightly related question. Do you assume that A post-dominates B means that a program that reaches B eventually must reach A?

You can't say anything block wise, only edge wise.
That is, C is dependent on edge A->B if C postdominates B and C does not strictly postdominate A.

In any case, i'm sure i screwed up my translation of the algorithm, sorry ;)

 

Add Dani's email reply to Phab:

If you generate that, you will eliminate the store a, 5 again, and you should not.

Interesting, but slightly confusing.

This is what i get for trying to reduce an algorithm quickly.

That's OK. It gives us an example we can work with. At least we are now on the same page. We still need to correct the example and identify the property you are interested in, but I feel we make progress. Thanks a lot again!

I'm sure i incorrectly removed some postdominance frontier checks or something.

I doubt a post-dominance frontier check can be the reason. The post-dominance frontier is based on the PDT, so it should be identical for the two examples.

It's probably easiest to take one of the working PDSE patches and look at that.

AFAIK, you have been working closely with the PDSE people. The patches seem still WIP, so I will have a harder time digging through them to find the part that matters. Any chance you happen to have one of the authors in the office, such that you can ask him/her over lunch?
 

Assume you change the previous program slightly:

  define void @foo(float* %a) {
     ....
  }

  define void @foo(i1 %cond, float* %a) {
  block1:
     store float 5.0, float* %a
     br i1 %cond, label %block2, label %block3

  block2:
    call foo(float* %a)
    br i1 true, label %block2, label %block1   <---- Only this line has changed

  block3:
    store float 6.0, float* %a
  }

You get a CFG very similar to the one I posted earlier:

F3887544: Post Dominance.png https://reviews.llvm.org/F3887544 
In your case the PDT is the first of the three you just mentioned:

 

3

>
>  |
>
>
> 1
>
>  |
>
>
> 2

This is the post-dominator tree which we compute today and which won't be changed by this proposal (as no reverse-unreachable blocks exist).

This program should be illegal, and as far as i can tell, it is :)

-> % bin/opt -view-cfg foo2.ll
Entry block to function must not have predecessors!
label %block1
bin/opt: foo2.ll: error: input module is broken!

Let's assume that's trivial to fix though, because it is.

LLVM does not allow jumps to the entry node. Jakub fixed it correctly by adding an auxiliary block that does nothing but jumping to block1.

One other slightly related question. Do you assume that A post-dominates B means that a program that reaches B eventually must reach A?

You can't say anything block wise, only edge wise.

I am confused. Post-dominance is defined on BBs and paths, so the above should make sense. To me it seems to be where the confusion comes from. To my understanding A post-dominates B does not guarantee that a program that reaches B eventually must reach A. In all your previous examples, you assumed otherwise. For PDSE to be correct, this cannot be assumed there.

That is, C is dependent on edge A->B if C postdominates B and C does not strictly postdominate A.

That seems to be the definition of control dependence. The properties we can take from this depend on what post-dominance guarantees.

Added Dani's email reply II to phab:

So, i definitely removed a postdominancefrontier check.
The PDF of X is equivalent to the set of branch blocks upon which X is control dependent.
Though, as mentioned this only means the block controls whether X executes, you need to label the edges.  

FWIW:I noticed that earlier, you mentioned literature does not talk about what we are doing, though there are practical implementations. It's worth pointing out that literature doesn't often talk about it as part of the postdominator construction itself, but as part of the CFG part that they build post-dom on top of.  They very often assume and require a unique exit node where everything has a path to it. 

Here are some example seminal papers that do this, and then build post-dom/post-domfrontiers on top of it:

Cytron's Control dependence paper:
http://polaris.cs.uiuc.edu/~padua/cs426/p451-cytron.pdf page 456. 
Note that they require everything have a path to exit:
"We assume that each
node is on a path from Entry and on a path to Exit"

Ferrante's program dependence graph paper
http://dl.acm.org/citation.cfm?id=24041
"Definition 1. A control flow graph is a directed graph G augmented with a
unique entry node START and a unique exit node STOP such that each node in
the graph has at most two successors. ...  We further assume that for any node N in G there exists a path
from START to N and a path from N to STOP. "

If you start with a CFG where you require every node go to exit, you don't need to do anything special just when computing postdom, because you just added the previously virtual edges to the actual CFG instead :)

Very right, this phrase is very common (and basically the indication that they did not think about incomplete CFGs at all). This is why none of the above papers directly applies. We always first need to fix the CFG, and then can use the papers. How we do this fix is something we need to show is correct and does not break earlier assumptions.

Following Chandler's suggestion, here an idea of a path forward:

  1. Let's nail the PDSE thingy. We have a test case we all understand and a transformation that does something useful. Let's use this to identify the PDT property you need (and others that may not be needed) and write this down in the documentation. We now had two examples which we first believed work and then it became clear that just checking the PDT is not enough. If we all get confused by this, how can a newcomer not need documentation here.
  1. Depending on the outcome of 1) we can either already clearly rule out my alternative suggestion or show that they indeed would generally not break the PDSE algorithm.

I hope to converge quickly on 1) and 2). Even if we agree that my alternatives would not break things immediately, we should still judge their usefulness. I believe that by (commonly) not breaking the PDT in the reverse-reachable graph, they have some benefit. However, you still have the earlier code complexity argument on your side. I would love to see to have you at least evaluate these options (considering the outcome of our discussion) as I believe the needed changes are small. However, I clearly admit that in this case Jakub and your thoughts about code complexity are very important, so should have a strong weight in this discussion.

dberlin added a comment.EditedAug 9 2017, 8:06 AM

(Editing because hit submit earliy)

Add Dani's email reply to Phab:

If you generate that, you will eliminate the store a, 5 again, and you should not.

Interesting, but slightly confusing.

This is what i get for trying to reduce an algorithm quickly.

That's OK. It gives us an example we can work with. At least we are now on the same page. We still need to correct the example and identify the property you are interested in, but I feel we make progress. Thanks a lot again!

I'm sure i incorrectly removed some postdominance frontier checks or something.

I doubt a post-dominance frontier check can be the reason. The post-dominance frontier is based on the PDT, so it should be identical for the two examples.

This reasoning is not quite correct:
FIrst, remember that with this patch, we generate a different PDT for your example and mine. That's in fact, part of the point :)

Also remember the post dominance frontier is a combination of CFG and postdom edges. It represents the immediately control dependent branches for the blocks (IE you can build the CDG from the RDF and reversing the mapping).
Otherwise, the DF and PDF would not need to walk the CFG.

But let's go through them:
Control dependence for these examples is definitely different.
Let's call this A:

declare void @foo2(float* %a)
define void @foo(i1 %cond, float* %a) {
entry:
  br label %block1
 block1:
     store float 5.0, float* %a
     br i1 %cond, label %block2, label %block3

  block2:
    call void @foo2(float* %a)
    br label %block2

  block3:
    store float 6.0, float* %a
    ret void
}

and this B:

declare void @foo2(float* %a)
define void @foo(i1 %cond, float* %a) {
entry:
  br label %block1
 block1:
     store float 5.0, float* %a
     br i1 %cond, label %block2, label %block3

  block2:
    call void @foo2(float* %a)
    br i1 true, label %block2, label %block1

  block3:
    store float 6.0, float* %a
    ret void
}

In A, we get:

Printing analysis 'Post-Dominator Tree Construction' for function 'foo':
=============================--------------------------------
Inorder PostDominator Tree: DFSNumbers invalid: 0 slow queries.
  [1]  <<exit node>> {4294967295,4294967295} [0]
    [2] %block3 {4294967295,4294967295} [1]
    [2] %block1 {4294967295,4294967295} [1]
      [3] %entry {4294967295,4294967295} [2]
    [2] %block2 {4294967295,4294967295} [1]

In B, we get:

Printing analysis 'Post-Dominator Tree Construction' for function 'foo':
=============================--------------------------------
Inorder PostDominator Tree: DFSNumbers invalid: 0 slow queries.
  [1]  <<exit node>> {4294967295,4294967295} [0]
    [2] %block3 {4294967295,4294967295} [1]
      [3] %block1 {4294967295,4294967295} [2]
        [4] %entry {4294967295,4294967295} [3]
        [4] %block2 {4294967295,4294967295} [3]

From Cytron's paper:

A CFG node Y is control dependent on a CFG node X if both of the
following hold:
(1) There is a nonnull path p: X ~ Y such that Y postdominates every node
after X on p.
(2) The node Y does not strictly postdominate the node X

LEMMA 11. Let X and Y be CFG nodes. Then Y postdominates a successor
of X if and only if (iff ) there is a nonnull path p: X ~ Y such that Y
postdominates every node after X on p.
COROLLARY 1. Let X and Y be nodes in CFG. Then Y is control dependent
on X in CFG iff X~ DF( Y) in RCFG.

So, it should be obvious that given these PDT's, the PDF is going to be different.
In one example, block 3 postdominates all the successors of block 1, in the other, it does not postdominate any, so they can't have the same PDF. Which they shouldn't because they don't have the same control dependence graph no matter how you build it :)

IDFCalculator agrees:

For A:

IDF for label %entry is: {}
IDF for label %block1 is: {}
IDF for label %block2 is: {label %block2 , label %block1 , }
IDF for label %block3 is: {label %block1 , }}

For B:

IDF for label %entry is: {}
IDF for label %block1 is: {label %block1 , }
IDF for label %block2 is: {label %block2 , label %block1 , }
IDF for label %block3 is: {}

However, despite these deficiencies, You should also be able to clearly see that even if the PDT's *were* the same, the definition in the cytron paper means that changing the successor edges of a block can cause the PDF to change, even if the PDT is identical.
This is because control dependence can change, and thus, the PDF must change ;)
If this wasn't true, control dependence would *only be based on the PDT*, and *not the CFG*. But we've already established that is not true. A postdominates B alone does not imply that A executes whenever B does.

Your example, in fact, proves this :)

Add Dani's email reply to Phab:

If you generate that, you will eliminate the store a, 5 again, and you should not.

Interesting, but slightly confusing.

This is what i get for trying to reduce an algorithm quickly.

That's OK. It gives us an example we can work with. At least we are now on the same page. We still need to correct the example and identify the property you are interested in, but I feel we make progress. Thanks a lot again!

I'm sure i incorrectly removed some postdominance frontier checks or something.

I doubt a post-dominance frontier check can be the reason. The post-dominance frontier is based on the PDT, so it should be identical for the two examples.

It's probably easiest to take one of the working PDSE patches and look at that.

AFAIK, you have been working closely with the PDSE people. The patches seem still WIP, so I will have a harder time digging through them to find the part that matters. Any chance you happen to have one of the authors in the office, such that you can ask him/her over lunch?

Sadly, no. They don't work for me or Google :)
The only other reference code for PDSE that i'm aware of is Open64, which is "hard to follow" to say the least.

 

Assume you change the previous program slightly:

  define void @foo(float* %a) {
     ....
  }

  define void @foo(i1 %cond, float* %a) {
  block1:
     store float 5.0, float* %a
     br i1 %cond, label %block2, label %block3

  block2:
    call foo(float* %a)
    br i1 true, label %block2, label %block1   <---- Only this line has changed

  block3:
    store float 6.0, float* %a
  }

You get a CFG very similar to the one I posted earlier:

F3887544: Post Dominance.png https://reviews.llvm.org/F3887544 
In your case the PDT is the first of the three you just mentioned:

 

3

>
>  |
>
>
> 1
>
>  |
>
>
> 2

This is the post-dominator tree which we compute today and which won't be changed by this proposal (as no reverse-unreachable blocks exist).

This program should be illegal, and as far as i can tell, it is :)

-> % bin/opt -view-cfg foo2.ll
Entry block to function must not have predecessors!
label %block1
bin/opt: foo2.ll: error: input module is broken!

Let's assume that's trivial to fix though, because it is.

LLVM does not allow jumps to the entry node. Jakub fixed it correctly by adding an auxiliary block that does nothing but jumping to block1.

One other slightly related question. Do you assume that A post-dominates B means that a program that reaches B eventually must reach A?

You can't say anything block wise, only edge wise.

I am confused. Post-dominance is defined on BBs and paths, so the above should make sense.

Sorry, let me be clear.
You can define it block wise, but it requires more than the PDT alone, as you must look at successors of the block.

To me it seems to be where the confusion comes from. To my understanding A post-dominates B does not guarantee that a program that reaches B eventually must reach A. In all your previous examples, you assumed otherwise.

I have not assumed otherwise :)

For PDSE to be correct, this cannot be assumed there.

On this we agree.

Following Chandler's suggestion, here an idea of a path forward:

  1. Let's nail the PDSE thingy. We have a test case we all understand and a transformation that does something useful. Let's use this to identify the PDT property you need (and others that may not be needed) and write this down in the documentation. We now had two examples which we first believed work and then it became clear that just checking the PDT is not enough.

If we all get confused by this, how can a newcomer not need documentation here.

I believe at this point, what you want is documented?
If not, can you point out what is not?

  1. Depending on the outcome of 1) we can either already clearly rule out my alternative suggestion or show that they indeed would generally not break the PDSE algorithm.

This is the part i disagree with.
Assume for a second your suggestion is an improvement, and in fact even an amazing one!.
We have already shown what we are doing here is conservatively correct.

If your suggestion is an improvement, and we can later prove it, awesome. Let's do that.

But given that the majority of this patch is unrelated to your suggestion, and this patch can be shown to work, why isn't the right answer:
Commit this patch, if you can improve it in a followup, improve it!

In fact, in pretty much every other review i've seen, we've explicitly asked people to separate followup improvements.

I think we should do the same here.
I get the distinct impression, given how hard you are pushing not to follow this path, that you are worried that if we commit this patch, that will not happen.
But i really don't understand why.

davide edited edge metadata.Aug 9 2017, 8:32 AM

I think this patch {c,sh}ould be considered for commit as is.
The reason is twofold:

  1. The current postdom behaviour has some known brokeness that caused several bugs. PDSE is just an example, but try to fuzz LLVM and you'll find all this sort of craziness. I've reviewed the original change in January to change this behaviour, which was reverted because you disagreed. I'm afraid that until somebody sits down and implements what you propose (probably you :) the discussion can go on and on indefinitely.
  2. I'd say this patch is an improvement on the current state, so I don't think we should hold on forever.

If you start with a CFG where you require every node go to exit, you don't need to do anything special just when computing postdom, because you just added the previously virtual edges to the actual CFG instead :)

Very right, this phrase is very common (and basically the indication that they did not think about incomplete CFGs at all).
This is why none of the above papers directly applies.

Just to note: the authors of both of these papers definitely thought of infinite loops that do not exit the program, and I don't believe they considered it an incomplete CFG.
Because what i was told (by a coauthor of at least one) was that what they wrote in the paper was their way of implying they just connected the non-exiting nodes to exit by way of fake edges, just like we are suggesting.
You are of course, welcome to ask as well, and see if they had any other tricks up their sleeve :)

BTW, if you wish to convince yourself what i said about two identical PDTs not generating the same PDF, here is an example:

define void @foo(i1 %cond, float* %a) {
entry:
    br label %block1
 block1:
    br label %block3
 block3:
    br i1 true, label %block4, label %block4
 block4:
    ret void
}

and

define void @foo(i1 %cond, float* %a) {
entry:
  br label %block1
 block1:
  br label %block3
 block3:
  br i1 true, label %block3, label %block4
 block4:
  ret void
}
Inorder PostDominator Tree: DFSNumbers invalid: 0 slow queries.
  [1]  <<exit node>> {4294967295,4294967295} [0]
    [2] %block4 {4294967295,4294967295} [1]
      [3] %block3 {4294967295,4294967295} [2]
        [4] %block1 {4294967295,4294967295} [3]
          [5] %entry {4294967295,4294967295} [4]
Roots: %block4
IDF for label %entry is: {}
IDF for label %block1 is: {}
IDF for label %block3 is: {}
IDF for label %block4 is: {}

vs

Inorder PostDominator Tree: DFSNumbers invalid: 0 slow queries.
  [1]  <<exit node>> {4294967295,4294967295} [0]
    [2] %block4 {4294967295,4294967295} [1]
      [3] %block3 {4294967295,4294967295} [2]
        [4] %block1 {4294967295,4294967295} [3]
          [5] %entry {4294967295,4294967295} [4]
Roots: %block4
IDF for label %entry is: {}
IDF for label %block1 is: {}
IDF for label %block3 is: {label %block3 , }
IDF for label %block4 is: {}

As you can see, i've changed the PostDominanceFrontier without changing the PDT by varying strict postdomination of successors.

Also someone pointed out to me today. If you want another literature reference for virtual edges for infinite loops, Bob Morgan's Building an Optimizing Compiler covers it, see pages 82-83 and figure 3.10.

I think this patch {c,sh}ould be considered for commit as is.

OK, another person is voting against me. I certainly won't single handedly block this this. As I said, I would like to have a brief opinion from Hal, as we discussed extensively. I will ask him. If he agrees with you guys, I am out.

The reason is twofold:

  1. The current postdom behaviour has some known brokeness that caused several bugs. PDSE is just an example, but try to fuzz LLVM and you'll find all this sort of craziness. I've reviewed the original change in January to change this behaviour, which was reverted because you disagreed. I'm afraid that until somebody sits down and implements what you propose (probably you :) the discussion can go on and on indefinitely.

I implemented an alternative (and proposed a second which I implemented earlier). I am still waiting for a test-case to break them.

  1. I'd say this patch is an improvement on the current state, so I don't think we should hold on forever.

So you are explicitly not concerned about the fact that these virtual edges weaken the post-dominance relation? If a pass inserts certain checks at post-dominance points and later wants to verify this invariant still holds, the invariant may be broken by deleting dead edges. I think this is a real concern.

This patch regresses code that was perfectly good over years.

One path to go quickly forward would be to commit my alternative (which does not regress much code), continue to develop code based on this alternative until we find test cases that actually break. It seems Daniel has some, I need to go over his mail now.

@hfinkel : Hi Hal, we discussed this topic for a while in Paris. I would very much appreciate your opinion on this patch.

hfinkel edited edge metadata.Aug 10 2017, 10:44 PM

@hfinkel : Hi Hal, we discussed this topic for a while in Paris. I would very much appreciate your opinion on this patch.

I'll look at this tomorrow.

I think this patch {c,sh}ould be considered for commit as is.

OK, another person is voting against me. I certainly won't single handedly block this this. As I said, I would like to have a brief opinion from Hal, as we discussed extensively. I will ask him. If he agrees with you guys, I am out.

The reason is twofold:

  1. The current postdom behaviour has some known brokeness that caused several bugs. PDSE is just an example, but try to fuzz LLVM and you'll find all this sort of craziness. I've reviewed the original change in January to change this behaviour, which was reverted because you disagreed. I'm afraid that until somebody sits down and implements what you propose (probably you :) the discussion can go on and on indefinitely.

I implemented an alternative (and proposed a second which I implemented earlier). I am still waiting for a test-case to break them.

There is nothing in llvm today that will break. It's been designed for the things we do today, which already completely ignore large classes of types of blocks. As Chandler said, i'm not sure that's a good use case to base the future on.
It's about what we want to implement in the future, and looking at what others have done to make that happen.
Your other approach is based on what you consider to be reasonable invariants: You are okay with, given one of the infinite loop examples

    A
 /    \
B     C

Where C is an infinite loop, saying that all paths to exit go through B.
That's definitely going to break optimizations we want to implement. You want me to go through, in excruciating detail, how. You want implementations of those optimizations to read.
Rather than come up with endless papers and examples here, and some point i think my suggestion is: Compile a GCC or Open64 where you've disabled connecting the infinite loops to exit before postdom, and see what breaks, and understand that. Because it does break!
In gcc, you'd have to hack up calc_dfs_tree a bit, since it's done inside post-dominators itself, besides some optimizations.
In Open64, you should be able to edit CFG::Process_multi_entryexit to not connect the infinite loops to exit (it does it all the time before computing postdom)
That's likely to lead to cfgs and understanding a lot easier than me trying to explain PDSE to you from first principles :)

  1. I'd say this patch is an improvement on the current state, so I don't think we should hold on forever.

So you are explicitly not concerned about the fact that these virtual edges weaken the post-dominance relation?

See, your view is they weaken it, while others have a perspective that it is making it conservatively correct.

If a pass inserts certain checks at post-dominance points and later wants to verify this invariant still holds, the invariant may be broken by deleting dead edges. I think this is a real concern.

They are not forward-dead, actually. That's part of the whole problem. As Jakub pointed out long ago, the fact that they are reverse unreachable does not mean they cannot be executed.

This patch regresses code that was perfectly good over years.

This code has known be broken for many years, and we have patches waiting for review that implement optimizations that even try to disable themselves in the presence of this these postdom bugs. There is external code that does the same. I know you disagree, so i'm going to leave this alone past that. I don't think we will agree here.

One path to go quickly forward would be to commit my alternative (which does not regress much code), continue to develop code based on this alternative until we find test cases that actually break. It seems Daniel has some, I need to go over his mail now.

Please don't.
Look.
One of my big problem with all of this is that you are suggesting we are treading on fresh snow. From your perspective, no literature deals with this, explicitly, and we are on our own, and can do what we want, so we should try to do something good. That's awesome. We should strive to do the best we can.
Except: we aren't treading on fresh snow.

They had infinite loops in the 1980's and 1990's too, and, for example, the authors of papers i cited will happily tell you their compilers either

  1. Inserted fake edges in the CFG
  2. Pretended those fake edges existed during postdominators.

They didn't just do nothing.

Your own research showed that other compilers take this approach, and *no compiler or literature takes your approach* (and FWIW, at least one of the ones you say does nothing actually carefully ensures the CFG represents infinite loops in a certain way).
Even "building a compiler" books say that you should insert fake edges from infinite loops to exit or you will break global optimizations.

At some point, i think it's unreasonable to say "all of this is inapplicable".
I think at some point, the onus is on you to prove that your approach won't break these things, even if we don't implement them yet, since the whole point of doing this is for us to be able to implement them.
There are patches outstanding for some of these optimizations if you want to try them (but i'd recommend using gcc or open64, see below)
In part simply disagree about what regions should look like, and it's unlikely we'll ever agree.
So i'm going to stick with "does it break optimizations".
We are suggesting doing a thing we *know* works with *all* of the algorithms, today's, and the future ones, because it's implemented.
(I again, am cognizant of the fact you consider the test case changes to be regressions). I don't believe they are, and they are the same as other compilers generate)

If you want to try your way in those things, and see what will break, that's awesome. I'm pretty sure the answer is "a bunch". I suggest those over applying existing opt patches out for review because those compilers have worked this way for years, so "most" bugs have been worked out :) In particular, Open64 has probably the most complete implementation of things using post-dominators.

I think that we should go ahead with this approach.

I agree with Tobias that this has some unfortunate properties. For example, if we have:

entry:
lifetime_begin(a);
br cond, blocka, blockb

blocka:
br blocka

blockb:
lifetime_end(a);
br blockc

blockc:
ret

Can we use post-dominance to verify that the lifetime_end(a) is correctly placed and sufficient (i.e. do the set of lifetime_end(a) post-dominate the lifetime_start(a)? If we connect infinite loops to the exit node, then the answer is no (because, as in this example, it isn't true that the set of lifetime_end(a) post-dominates lifetime_start(a), even though the set is correctly placed and sufficient). What's frustrating about this is that the definition of post-dominance seems perfect for this task: we want to ensure that all paths from the lifetime_start(a) to the exit pass through one of the lifetime_end(a)s.

There is also a small wrinkle in the above argument in that it also ignores functions that may throw (which don't induce any extra edges in the PDT to the exit). In the particular case of lifetime management of stack locations, this is okay (because the unwind edge implicitly includes a lifetime_end of everything on the stack). I'm not sure how well this generalizes, however, to other use cases. There's also no good way to change this (because functions that might unwind aren't terminators -- which I think that we should change, but that's perhaps another story).

Nevertheless, given that we have infinite loops, it is unclear that we'd ever be able to use post-dominance to verify the lifetime intrinsics. To do so would imply that, among other things, we could use post-dominance within infinite loops as well. Given that, within the loop there are no paths to the exit node, or if you add one you must do it arbitrarily, it is not obvious to me that you can construct a definition that makes sense there.

These issues are discussed in the literature. For example, see:

A new foundation for control-dependence and slicing for modern program structures
VP Ranganath, et al. - ESOP, 2005
http://link.springer.com/content/pdf/10.1007/b107380.pdf#page=89
http://www.dtic.mil/get-tr-doc/pdf?AD=ADA443736

There seems to be an active effort, to the present day, to understand how to analyze non-terminating programs, especially in the program-slicing literature - search for "non-termination sensitive control dependence" (on Google scholar or similar) and you'll find a significant amount of discussion. However, I don't see a clear answer in the literature for what to do here, and quickly scanning a few of the papers, it seems that the relevant algorithms are all less efficient than the ones based on traditional post-dominance (i.e. which connect everything to a virtual exit node). When Tobias and I discussed this in June, I don't recall settling on an obviously-better scheme. We discussed various ways for adding virtual edges to other places in the graph, instead of to the exit node, but no heuristic seemed ideal. I don't recall discussing removing edges as has been discussed here (and whereas with adding edges it still seems possible to generate a conservatively-correct graph, removing edges seems prone to doing otherwise - and that makes me a bit nervous). As a result, I'm in favor of moving forward with the current patch. I'd like to think that we can do better, but I'm convinced that we know how yet.

Another thing that Tobias and I discussed in June was a desire to handle unreachable and infinite loops in a consistent way. I agree with this, and while unfortunate, this does have that consistency property.

dberlin added a comment.EditedAug 11 2017, 8:32 PM

I think that we should go ahead with this approach.

Thanks for reviewing this.

I agree with Tobias that this has some unfortunate properties.

I agree too, fwiw!

Nevertheless, given that we have infinite loops, it is unclear that we'd ever be able to use post-dominance to verify the lifetime intrinsics. To do so would imply that, among other things, we could use post-dominance within infinite loops as well. Given that, within the loop there are no paths to the exit node, or if you add one you must do it arbitrarily, it is not obvious to me that you can construct a definition that makes sense there.

FWIW: I believe your first sentence is correct.
If you apply Tobias's patch, the following should work but will be broken:

declare void @foo2(float* %a)
define void @foo(i1 %cond, float* %a) {
entry:
  ; lifetime.start(a)
  br label %block1
block1:
   br i1 undef, label %block2, label %block5
block2:
   br i1 undef, label %block2, label %block3
block3:
   ;lifetime_end(a);
   br label %block4
block4:
   br label %block4
block5:
   ;lifetime_end(a);
   ret void
}

Equivalent CFG:

A
|  \
B   C
      |
      D
      |
      E

where C and E are also self-loops.

I believe this is legal.
Tobias's patch gives this:

[1]  <<exit node>> {4294967295,4294967295} [0]
   [2] %block5 {4294967295,4294967295} [1]
     [3] %block1 {4294967295,4294967295} [2]
       [4] %entry {4294967295,4294967295} [3]
   [2] %block4 {4294967295,4294967295} [1]
     [3] %block3 {4294967295,4294967295} [2]
       [4] %block2 {4294967295,4294967295} [3]

This implies the lifetime end is not legally placed in block 3 (I deliberately kept it out of the loops to avoid discussion of that).

It seems perfectly legal to me.
It's reachable from entry, and i can't see a reason it's not legal to say "memory ends here".

You can make block 4 whatever you want (a call that doesn't return, etc). It still seems legal to place a memory end in block 3.

These issues are discussed in the literature. For example, see:

A new foundation for control-dependence and slicing for modern program structures
VP Ranganath, et al. - ESOP, 2005
http://link.springer.com/content/pdf/10.1007/b107380.pdf#page=89
http://www.dtic.mil/get-tr-doc/pdf?AD=ADA443736

I think we can also get away with separating post-dominance and control dependence if we really want.
There is nothing that stops us from saying "Use a CDG for control dependence", and making that work however we want (or in multiple ways).

I am on a wedding today, so don't have a lot of time to discuss. As I said, I won't be the only to block this. I would appreciate us condensing the results of this discussion into documentation. I really believe it is important to understand and document this well. Now, let's do this post-commit. Then we can discuss the different implications in a relaxed way. Thanks for your time guys!

I think this patch {c,sh}ould be considered for commit as is.
The reason is twofold:

  1. The current postdom behaviour has some known brokeness that caused several bugs. PDSE is just an example, but try to fuzz LLVM and you'll find all this sort of craziness. I've reviewed the original change in January to change this behaviour, which was reverted because you disagreed. I'm afraid that until somebody sits down and implements what you propose (probably you :) the discussion can go on and on indefinitely.

Btw, could you provide a reference to any of these bugs or the test case that was committed to LLVM-IR? Have they been related to the treatment of infinite loops. I am trying throughout this discussion to get actual use cases that break to get to a more technical level of discussion. If there are already in-tree bugs that would be fixed by this change, this would make things a lot clearer for me.

I think this patch {c,sh}ould be considered for commit as is.
The reason is twofold:

  1. The current postdom behaviour has some known brokeness that caused several bugs. PDSE is just an example, but try to fuzz LLVM and you'll find all this sort of craziness. I've reviewed the original change in January to change this behaviour, which was reverted because you disagreed. I'm afraid that until somebody sits down and implements what you propose (probably you :) the discussion can go on and on indefinitely.

Btw, could you provide a reference to any of these bugs or the test case that was committed to LLVM-IR? Have they been related to the treatment of infinite loops. I am trying throughout this discussion to get actual use cases that break to get to a more technical level of discussion. If there are already in-tree bugs that would be fixed by this change, this would make things a lot clearer for me.

A few things, just so they end up here:
The NTSCD folks, unfortunately, seem to prove all of the algorithms for weak order dependence are equivalent to those that require with a unique end node. See, http://people.cs.ksu.edu/~tamtoft/Papers/Amt+al:WODvsCD/short.pdf . This means, i believe, outside of the NTSCD algorithms, we won't have a lot of luck with infinite loops.

The NTSCD paper mentions the augmentation we perform here in section 3.2 as being standard (see http://people.cs.ksu.edu/~tamtoft/Papers/Ran+Amt+Ban+Dwy+Hat:ESOP-2005/long.pdf).
They sadly point out, also, that "All definitions of control dependences that we are aware of require that CFGs satisfy the unique end node requirement"
(IE there is no magic we can use prior to this paper)

On the nice side, the NTSCD algorithms do not look that hard to implement.
On the not nice side, all the algorithms i can find are N^3 or N^4.

There was one paper author who believed significantly more efficient algorithms may exist (http://www.sciencedirect.com/science/article/pii/S0304397511007377).
He is, unfortunately, dead (not kidding).

I found one real world implementation of NTSCD (that doesn't use a published algorithm) (Joana, an information flow tool. Source on github)

However, now for some mixed news:
I'm positive you can prove any NTSCD algorithm based on all-pairs reachability of the CFG must be at least O(N*E) (IE N^2).
CFGs are a regular language (AFAIK!). The fastest known algorithms for all-pairs reachability on regular languages are (N*E)

Proof:
CFGs are trivially convertible to deterministic finite automata (uniquely label, generation transition table from successors, done)
All DFAs are trivially convertible to CFGs (again, AFAIK)
The language accepted by DFAs are exactly the set of regular languages.
Yannakakis gives a N*E algorithm for all-pairs reachability on regular languages (there is no faster known algorithm)

The good news is that it should be trivial to improve that part of the NTSCD algorithm (which is the N^3 part), *but* until they have algorithms that do not require all-pairs reachability, i do not believe you can make it faster than N^2.

I feel like you can also make this practically faster through various tricks that would not change complexity. IE you should be able to avoid all pairs reachability for the parts of the graph that don't go through loops (IE generate condensation graph with marked condensed regions to get the maximal part of the cfg that does not go through a loop, and handle these separately)

kuhar updated this revision to Diff 111088.Aug 14 2017, 3:57 PM
kuhar added subscribers: uabelho, kparzysz.

When running tests for all targets, I discovered that 3 codegen tests got affected by this patch.

I think it would be best if someone took a look at the changes -- I'm especially concerned about register spilling in test/CodeGen/ARM/struct-byval-frame-index.ll. I can see that the test was updates a few days ago by D34099.
@uabelho, @kparzysz, could you take a look?

kuhar updated this revision to Diff 111093.Aug 14 2017, 4:06 PM

Rebase the patch to the ToT.

uabelho added inline comments.Aug 15 2017, 4:53 AM
test/CodeGen/ARM/struct-byval-frame-index.ll
7–8 ↗(On Diff #111093)

Remove or update this comment depending on how the checks below end up.

10–12 ↗(On Diff #111093)

Is there any way to check that the spill/reload are placed at reasonable places instead of just checking
that they exist at all?

Before the change I made in D34099, the checks looked like

; CHECK: str r{{.*}}, [sp, [[SLOT:#[0-9]+]]] @ 4-byte Spill
; CHECK: bl RestoreMVBlock8x8
; CHECK: bl RestoreMVBlock8x8
; CHECK: bl RestoreMVBlock8x8
; CHECK: ldr r{{.*}}, [sp, [[SLOT]]] @ 4-byte Reload

which isn't very picky either, but at least it's something.

(I don't know the testcase at all, I just convinced myself that my change
didn't actually break anything, and then tried to update the testcase to
still check that the spill/reload are generated at some reasonable places)

kuhar added inline comments.Aug 15 2017, 9:15 AM
test/CodeGen/ARM/struct-byval-frame-index.ll
10–12 ↗(On Diff #111093)

After this patch the two slots are different and the reload uses lr instead of sp, so I don't think we can verify it better here.

kuhar updated this revision to Diff 111202.Aug 15 2017, 10:39 AM

Update the comment in struct-byval-frame-index.ll.

kuhar marked an inline comment as done.Aug 15 2017, 10:40 AM
This revision was automatically updated to reflect the committed changes.