
Write a new SSAUpdater
Needs ReviewPublic

Authored by dberlin on Jan 19 2017, 7:17 PM.

Details

Summary

This SSAUpdater has a simpler interface and even simpler
implementation than our current one. It is both faster, and IMHO,
simpler to understand. It also can be used for multiple variables at
once, and can be reused without destruction or worry.

It does not require dominance frontiers, dominators, etc, as our
current implementation does.

I originally wrote it for something else, but D28534 reminded me i had
it. It implements the marker algorithm from "Simple and Efficient
Construction of Static Single Assignment Form" at
http://compilers.cs.uni-saarland.de/papers/bbhlmz13cc.pdf. It does not
create completely minimal SSA (so in some cases leaves a few phi
nodes). I could make it do so by implementing the SCC algorithm in
the paper, at some implementation complexity cost.
The marker algorithm is superficially similar[1] to the current SSAUpdater,
but how we place phi nodes is very different, and the API + implementation as
described is, IMHO, easier to understand :)

I also didn't implement the trivial recursive phi simplification (i
would rather just use the SCC algorithm at that point, as it wouldn't
require recursive simplification).

I do not pretend this version is ready to go in by far (it has no
tests, to start :P). But before i spent any more time on it, i
figured i'd get general thoughts on whether people think it is worth
cleaning it up, or if we don't care enough about SSAUpdater as-is.

(Also, fair warning, it worked the last time i tried it, but i haven't
touched it in a while.)

[1] If one stares at the implementation, written in 2009, and then the
paper, written in 2012, one may wonder if some credit is not missing
somewhere.

Event Timeline

dberlin created this revision.Jan 19 2017, 7:17 PM

Note that regardless of whether we want this for regular SSA or not, this will serve as the basis for my generic MemorySSA updater.
Sadly, because MemorySSA is not really part of the IR, and because it deals with may-defs, and not must-defs, we won't be able to share much beyond the idea.
(On the plus side, MemorySSA's semantics actually make generic updating easier and faster)

sanjoy edited edge metadata.Jan 19 2017, 10:34 PM

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

chandlerc edited edge metadata.Jan 20 2017, 12:52 AM

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

When I worked on SROA and SROA had split logic between SSAUpdater and mem2reg, it was definitely confusing and a source of much sorrow... That said, we fixed it by getting rid of SSAUpdater...

But looking at this code and looking at SSAUpdater, I'm at least interested in the simplification. This code doesn't have comments, debug info handling, or the SCC-stuff Danny mentioned, but even if it grows by 3x, my guess is it will still be less than 1/3rd the code of the current SSAUpdater.

I guess the question is can we get the quality up w/o regressing the simplicity. From your description Danny, I'd be interested in the SCC formation bit to get minimal and other properties, and it would need to handle debug info a bit. Not sure if there are any other challenges really.

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

FWIW: It's been a source of sorrow before.
People kind of randomly try to figure out what order to call things in, because it's not clear what the terminology it uses means, nor is it clear what the right sequence necessarily is to get the right result.

As for speed, we wrote an entire separate linear-time phi placement pass (see PromoteMemToReg, and more particularly, IDFCalculator) just because the current SSAUpdater was too slow and not easily understandable.
It's also used for ADCE and MemorySSA.

This particular patch came about because i needed an SSAUpdater workalike for MemorySSA updating, and i figured, what the heck, might as well see if anyone wants me to replace the existing one while i'm at it.
I started with the above, and then modified it for what i needed.

If you stare at this vs the paper vs the existing SSAUpdater, you can clearly see the paper (and this impl) is a slightly more refined version of the renaming algorithm, with a completely different phi insertion scheme shoved on top.
(i'm not sure if they both share some common lineage, because otherwise, that's ... yeah)

Our current SSAUpdater tries to iteratively compute dominators, dominance frontiers, and then phi placement for those. It's actually N^3 in the number of variables, and in some cases, badly so.

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

Do you find it easy to use? Perhaps I just haven't studied it in sufficient detail, but I don't. Moreover, our informal project policy has tended toward, "use this only if there's no other way." I'll admit never having thought about rewriting it, or whether there is a better algorithm, but I think this is certainly an area in which potential improvements should be considered.

davide edited edge metadata.Jan 20 2017, 10:34 AM

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

FWIW: It's been a source of sorrow before.
People kind of randomly try to figure out what order to call things in, because it's not clear what the terminology it uses means, nor is it clear what the right sequence necessarily is to get the right result.

As for speed, we wrote an entire separate linear-time phi placement pass (see PromoteMemToReg, and more particularly, IDFCalculator) just because the current SSAUpdater was too slow and not easily understandable.
It's also used for ADCE and MemorySSA.

This particular patch came about because i needed an SSAUpdater workalike for MemorySSA updating, and i figured, what the heck, might as well see if anyone wants me to replace the existing one while i'm at it.
I started with the above, and then modified it for what i needed.

If you stare at this vs the paper vs the existing SSAUpdater, you can clearly see the paper (and this impl) is a slightly more refined version of the renaming algorithm, with a completely different phi insertion scheme shoved on top.
(i'm not sure if they both share some common lineage, because otherwise, that's ... yeah)

Our current SSAUpdater tries to iteratively compute dominators, dominance frontiers, and then phi placement for those. It's actually N^3 in the number of variables, and in some cases, badly so.

I don't have any strong comments on the API (yet); I would like to try porting a pass myself as an experiment and see how that feels. I personally think removing the dependency on dom info is already a win per se, and the code is definitely much smaller/easier to understand than what we currently have (I don't think it will grow enormously if we include SCC, but I may be wrong).
To sum up, I would like to see this going forward.

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

FWIW: It's been a source of sorrow before.
People kind of randomly try to figure out what order to call things in, because it's not clear what the terminology it uses means, nor is it clear what the right sequence necessarily is to get the right result.

As for speed, we wrote an entire separate linear-time phi placement pass (see PromoteMemToReg, and more particularly, IDFCalculator) just because the current SSAUpdater was too slow and not easily understandable.
It's also used for ADCE and MemorySSA.

This particular patch came about because i needed an SSAUpdater workalike for MemorySSA updating, and i figured, what the heck, might as well see if anyone wants me to replace the existing one while i'm at it.
I started with the above, and then modified it for what i needed.

If you stare at this vs the paper vs the existing SSAUpdater, you can clearly see the paper (and this impl) is a slightly more refined version of the renaming algorithm, with a completely different phi insertion scheme shoved on top.
(i'm not sure if they both share some common lineage, because otherwise, that's ... yeah)

Our current SSAUpdater tries to iteratively compute dominators, dominance frontiers, and then phi placement for those. It's actually N^3 in the number of variables, and in some cases, badly so.

While I don't think speed is the only factor we should consider, I don't think I ever found a case where compile time for SSAUpdater skyrockets, but this doesn't mean there aren't any. Did you find a case where this actually matters, out of curiosity?

I've never really heard people complain about our current SSAUpdater

Okay I guess I gotta get better at listening! :)

Its algorithm was definitely not usable, time-wise, for MemorySSA (i tried it); it was about 2-5x slower, just because of the amount of work it does per thing. MemorySSA is, as i said, a bit weird.

I don't remember the llvmdev history, but my understanding was it was really bad for SROA (I'm sure chandler can say more).

In any case, let me add the SCC post-pass to this (doing the SCC walking as we rename is more complex, and i'm not sure it's worth it speed-wise) so someone can play with it more. We have an SCC iterator, so i think i can make this happen very easily and it should be fast.

My understanding, BTW (not verified), is that with one additional recursive simplification function above, the only difference between this and minimal form should be in irreducible control flow.

But i'll check :)

Updating the patch in a second; i templated the trivial phi removal, fixed some bugs, and started on unit tests enough for others to play with it.

It should be a pretty direct conversion from the old SSAUpdater (AddAvailableValue->writeVariable, GetValueAtEndOfBlock->readVariableAfterDef, GetValueInMiddleOfBlock->readVariableBeforeDef)

A few notes from more testing:

  1. The old updater crashes if you call GetValueAtEndOfBlock for a currently empty block (you have to insert the instruction first). (I can make the new one work on incomplete CFGs, actually, if we want.)
  2. The only case where we generate spurious phis now is irreducible control flow. This will go away once i implement the SCC algorithm.

Something like this:

define i32 @F(i32, i32) {
if:
  br i1 true, label %loopmain, label %loopstart

loopstart:                                        ; preds = %loopmain, %if
  br label %loopmain

loopmain:                                         ; preds = %loopstart, %if
  br i1 true, label %loopstart, label %afterloop

afterloop:                                        ; preds = %loopmain
  ret i32 %0
}

Will become

define i32 @F(i32, i32) {
if:
  br i1 true, label %loopmain, label %loopstart

loopstart:                                        ; preds = %loopmain, %if
  %update1 = phi i32 [ %update, %loopmain ], [ %0, %if ]
  br label %loopmain

loopmain:                                         ; preds = %loopstart, %if
  %update = phi i32 [ %update1, %loopstart ], [ %0, %if ]
  br i1 true, label %loopstart, label %afterloop

afterloop:                                        ; preds = %loopmain
  ret i32 %0
}
  • Add basic unit tests, comment better, add SCC-based minimization.

This should now produce minimal SSA.

MatzeB added a subscriber: MatzeB.Jan 20 2017, 5:49 PM

I am obviously a fan of the algorithm and biased :)

  • Interesting tweak with the visited set (compared to the paper), guess that saves us RAUWs in many cases at the cost of extra visited set lookups.
  • I am surprised to find readVariableBeforeDef() as public API, in my experience outside users of the algorithms do not need it (but maybe you have a good use case). I probably would have named it readVariableAtBlockBegin() or readVariableLiveIn() and the "normal" variant simply readVariable().
lib/Transforms/Utils/SSAUpdaterNew.cpp
60

Shouldn't this be readVariableBeforeDef(Var, BB)?

86

Maybe PredVal->getParent() == BB && isa<PHINode>(PredVal) is slightly more efficient.

With respect to SCC handling: In my experience the benefits of a truly minimal SSA form are very small and may not justify the extra complexity and calculations. At least back then we mostly added that version to the paper to present a complete solution; the default libfirm code didn't use it (but had a pass in the pipeline that would remove extra phis once).

My understanding, BTW (not verified), is that with one additional recursive simplification function above, the only difference between this and minimal form should be in irreducible control flow.

Yep we proved that in the paper. (Of course proving it correct doesn't replace testing :)

As for what's public:
The paper comments that for incremental updates, you want readVariableBeforeDef. This is equivalent to GetValueInMiddleOfBlock, which we make public in SSAUpdater.

The paper's API is actually identical to SSAUpdater's old API circa 2010, with the functions renamed.
The only difference is the more advanced phi placement algorithms.
I think Bob even has most of the marker algorithm right at one point if you go back through the file's history.

I think if we keep going down this route i would change the API.
Users shouldn't generally have to care where in the block they are.
The real problem on this front is loops.

Given:
main:
def(a)
goto loop

loop:
use (a)
def (a)
goto loop or somewhere else

For loop, you need

  1. To call writeVariable of both defs so it's visible to the algorithm, otherwise, it will not get the value over the backedge.
  2. But ask about the version before the def, because that's where the use is.

I think this should all be hidden, and orderedbb or something used to locate the relative position of the use to the defs.

(this is what i'm doing for MemorySSA, but memory ssa is easier because we have a single variable and every may-def creates a new version. Thus, i don't even need the user to tell me the defs, i can just find the closest def, etc)

lib/Transforms/Utils/SSAUpdaterNew.cpp
60

I think you are commenting on an old version of the patch where it had that bug :)

As for what's public:
The paper comments that for incremental updates, you want readVariableBeforeDef. This is equivalent to GetValueInMiddleOfBlock, which we make public in SSAUpdater.

The paper's API is actually identical to SSAUpdater's old API circa 2010, with the functions renamed.
The only difference is the more advanced phi placement algorithms.
I think Bob even has most of the marker algorithm right at one point if you go back through the file's history.

I think if we keep going down this route i would change the API.
Users shouldn't generally have to care where in the block they are.
The real problem on this front is loops.

Given:
main:
def(a)
goto loop

loop:
use (a)
def (a)
goto loop or somewhere else

For loop, you need

  1. To call writeVariable of both defs so it's visible to the algorithm, otherwise, it will not get the value over the backedge.
  2. But ask about the version before the def, because that's where the use is.

Yep, however that breaks down as soon as you have more than one definition in a block, so there isn't a single before/after def anymore.
Just reviewing the libfirm SSA updater, we indeed first introduced all defs to the algorithm (picking the latest one if there are multiple) and then go over all defs/uses in a block from top to bottom.

I think this should all be hidden, and orderedbb or something used to locate the relative position of the use to the defs.

Yep, some notion of ordering is necessary to get multiple defs/uses in a block "right", finding a way to hide it would be cool.

(this is what i'm doing for MemorySSA, but memory ssa is easier because we have a single variable and every may-def creates a new version. Thus, i don't even need the user to tell me the defs, i can just find the closest def, etc)

As for what's public:
The paper comments that for incremental updates, you want readVariableBeforeDef. This is equivalent to GetValueInMiddleOfBlock, which we make public in SSAUpdater.

The paper's API is actually identical to SSAUpdater's old API circa 2010, with the functions renamed.
The only difference is the more advanced phi placement algorithms.
I think Bob even has most of the marker algorithm right at one point if you go back through the file's history.

I think if we keep going down this route i would change the API.
Users shouldn't generally have to care where in the block they are.
The real problem on this front is loops.

Given:
main:
def(a)
goto loop

loop:
use (a)
def (a)
goto loop or somewhere else

For loop, you need

  1. To call writeVariable of both defs so it's visible to the algorithm, otherwise, it will not get the value over the backedge.
  2. But ask about the version before the def, because that's where the use is.

Yep, however that breaks down as soon as you have more than one definition in a block, so there isn't a single before/after def anymore.
Just reviewing the libfirm SSA updater, we indeed first introduced all defs to the algorithm (picking the latest one if there are multiple) and then go over all defs/uses in a block from top to bottom.

There is a better way, IMHO, to do this, if you have the set of all defs/uses at once (which in LLVM you can get easily) and it's fixed: use a variant of traditional renaming.

The traditional SSA renaming algorithm is O(B+I) because it walks all instructions in dominator order by walking the blocks in the dom tree.
The only reason it does this is to push/pop the renaming stack in the right order.
But it's not actually necessary to walk the blocks at all to get the right ordering of the renaming stack.
You can get the dfs completion numbers of the dominator tree node for each def/use's block (and, if it isn't the only def/use in the block, use OrderedBasicBlock or something to get a local numbering; a lot of passes i work on have the local numbering lying around anyway).

Then:

  1. Sort the defs/uses by the dfs completion numbers, then by the local numbers.
  2. For each def/use (the rename stack holds <member, domnodedfsin, domnodedfsout> entries):
     • Sync the rename stack so its top dominates the current def/use, popping until the top's dfsin/dfsout interval contains the def/use's dom node.
     • Push onto the rename stack if you have a def.
     • If the stack is empty and you are staring at a use, it's undef.
     • Otherwise, any use is defined by the top of the stack.

(i'm going to blog this algorithm at some point, because i don't see it really described in any paper, but it's O(def+uses) if you have a local numbering around.)

This is if you've inserted phi nodes using some other algorithm, of course.
You can integrate phi node insertion into this algorithm (for starters, IIRC, you only ever have to insert a phi node if stack.size() > 1) by integrating the paper's algorithm into this one.

I use a variant of this algorithm to do value numbering elimination in O(number of things equal to each other) instead of O(I) like the standard algorithms.
It's also used to do renaming for copy insertion here: https://github.com/dberlin/llvm-gvn-rewrite/blob/newgvn-predicateinfo/lib/Transforms/Utils/PredicateInfo.cpp
(in that case, you can even play the marker algorithm style game and only insert a copy once you see a use that will use it. This avoids having to compute liveness just to avoid useless insertions)

I think this should all be hidden, and orderedbb or something used to locate the relative position of the use to the defs.

Yep, some notion of ordering is necessary to get multiple defs/uses in a block "right", finding a way to hide it would be cool.

I think we can do it.

I read the paper and reviewed the marker algorithm. Some comments inline.
Side note, for didactic purposes: I wasn't familiar with this algorithm before, but I think it's very nice, probably easier to understand than Morgan's book chapter/algorithms on SSA construction.

lib/Transforms/Utils/SSAUpdaterNew.cpp
40–44

I'll be happy if we can add stats about the number of PHIs trivialized, inserted, etc. (so that we can also compare the "basic" marker algorithm and the SCC approach).

78–85

You can point out here (or elsewhere) that the updater can be somehow trivially extended to handle incomplete CFGs.

93–96

As you're describing the single predecessor case above, I would add a comment here saying something like: Multiple predecessors case, collect all the definitions and construct a phi, if necessary

99

dot at the end of comment, here and everywhere else.

130

auto *, here and everywhere else you use dyn_cast<>, dyn_cast_or_null<>, cast<> etc..

148–149

Just fold OperRange inside tryRemoveTrivialPhi, maybe.

174–176

It would be good if we can add a test showing that we trivialize self-referencing phi, if possible.

175–176

Also, I understand that we don't need to call replaceAllUsesWith as there are no non-self uses, but shouldn't we call eraseFromParent here?

188–192

This is not the first time that I find SCCIterator being not-very-general.

bryant added a subscriber: bryant.Jan 24 2017, 1:29 PM
davide added inline comments.Jan 24 2017, 3:51 PM
include/llvm/Transforms/Utils/SSAUpdaterNew.h
1–2

Unformatted, I think.

37

The whole comment needs formatting. The explanation is very good, IMHO.

37–38

extra ///

42–43

s/doesn/does/

88–92

Comments on these member variables, maybe? (as we do in NewGVN).

lib/Transforms/Utils/SSAUpdaterNew.cpp
1–2

Unformatted.

48–56

These functions appear to be trivial, move to the header, maybe?

davide added inline comments.Jan 24 2017, 4:13 PM
lib/Transforms/Utils/SSAUpdaterNew.cpp
197

This seems to be more PhiToDFSNum than Root, unless I'm missing something.

205–207

Also, if we special case DFSNum == 0, probably adding a comment would help the reader (or using const int or a macro).

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

One data point: in https://llvm.org/bugs/show_bug.cgi?id=16756 we have a test case where JumpThreading is taking 30% of the total time of a clang -O3 compilation.
One large issue was that we were spending a lot of time in SSAUpdater (Duncan mentions in this PR that the bottleneck is SSAUpdater).

When I looked at it, it seemed to me that we were constantly rebuilding the DominatorTree inside the SSAUpdater over and over for each and every replacement, while it seemed possible to build one up front and reuse it.

anemet added a subscriber: anemet.Jan 25 2017, 9:48 AM

I've never really heard people complain about our current SSAUpdater (about it being too slow or difficult to use) -- well except in this review :). Is there a specific problem you're trying to address with this?

One data point: in https://llvm.org/bugs/show_bug.cgi?id=16756 we have a test case where JumpThreading is taking 30% of the total time of a clang -O3 compilation.
One large issue was that we were spending a lot of time in SSAUpdater (Duncan mentions in this PR that the bottleneck is SSAUpdater).

When I looked at it, it seemed to me that we were constantly rebuilding the DominatorTree inside the SSAUpdater over and over for each and every replacement, while it seemed possible to build one up front and reuse it.

It also rebuilds the dominance frontiers. The problem with this is that the answer changes depending on the definition blocks you choose. We have a minimal linear-time calculator (not used in the current updater, but used by PromoteMemToReg now), but it's still linear time per variable in the worst case. It does do tricks to minimize the work it does.
Before i started down the path of the current algorithm, i also read all the latest multi-variable phi placement algorithms.

In practice, this seemed the easiest starting point, as the "share work between frontier calculation" algorithms get fairly complicated.

It's great to see some new work in this area. It has been years since I rewrote the SSAUpdater and I've always been worried about the performance. It seems fast enough for the common cases but the worst-case behavior is terrible. Anyway, it has been a long time since I looked at this, but I remember concluding that it was hard to really fix it right without changing the API.

The basic part of the Braun et al. approach looks very similar to what we had prior to my rewrite (r100047 through r101612). Unfortunately it has been a long time and I'm having trouble remembering the details of the problem we ran into. The gist of it was that SSA construction is different from SSA updating because the latter needs to determine if the necessary phis are already present.

We ran into cases where the SSA updater was unable to determine that a loop already had a suitable phi, so it would add a new one, creating phi cycles that grew by one every time the SSA updater was invoked on the code (which can be a large number of times). Perhaps this only came up with irreducible control flow or something, but basically you've got to get minimal SSA all the time to avoid this situation.

At least right now, it should avoid extra phis in all but the irreducible case (unlike the previous attempt, which was, again, quite admirable :P).
There, it will not create them more than once (or shouldn't).

If those turn out to matter, we can use the SCC algorithm to minimize them (and i can build it on the fly, it's just uglier to do scc finding as we go instead of after).

lib/Transforms/Utils/SSAUpdaterNew.cpp
175–176

Yes, if phi is not null, we do need to erase it in that case.
Whoops.

Just to note: I'll get back to this at some point.

joerg just gave me a testcase where, out of the 3 minutes it takes, 2+ minutes are spent forming and verifying LCSSA, all due to the updater speed :(

hiraditya added inline comments.
unittests/Transforms/Utils/SSAUpdaterNew.cpp
2

File name...

@dberlin I just came back to this while I was looking at NewGVN PRE.
I decided to give it a try to see what the API looked like, and I personally didn't find any problems with it (I might have just used it wrong, FWIW).

https://reviews.llvm.org/P7997

In particular, the old API GetValueInMiddleOfBlock has been a source of confusion (at least to me).

With your patch and my patch on top of it, I still saw one test failing:

--- /home/davide/old.ll 2017-05-30 20:56:21.058869000 -0700
+++ /home/davide/new.ll 2017-05-30 20:56:30.450514797 -0700
@@ -316,7 +316,7 @@
   br label %block4

 block4:                                           ; preds = %block3, %block2
-  %D = phi i32 [ 87, %block2 ], [ 97, %block3 ]
+  %D = phi i32 [ 97, %block3 ], [ 87, %block2 ]
   %A = phi i32 [ -1, %block2 ], [ 42, %block3 ]
   br i1 %cmpxy, label %block5, label %exit

@@ -350,7 +350,7 @@
   br label %loop

 loop:                                             ; preds = %loop, %entry
-  %Y2 = phi i8 [ %Y, %entry ], [ 0, %loop ]
+  %Y2 = phi i8 [ 0, %loop ], [ %Y, %entry ]
   %i = phi i32 [ 4, %entry ], [ 192, %loop ]
   %X2 = getelementptr i8, i8* %p, i32 %i
   %cond = call i1 @cond2()
@@ -372,7 +372,7 @@
   br label %loop

 loop:                                             ; preds = %cont, %entry
-  %Y2 = phi i8 [ %Y, %entry ], [ %Y2.pre, %cont ]
+  %Y2 = phi i8 [ %Y2.pre, %cont ], [ %Y, %entry ]
   %i = phi i32 [ 4, %entry ], [ 3, %cont ]
   %X2 = getelementptr i8, i8* %p, i32 %i
   %cond = call i1 @cond2()

I guess this is because we don't have a consistent ordering of PHI args. If that's the only problem, I might consider looking at it next (and then keep converting the remaining users).

@dberlin I just came back to this while I was looking at NewGVN PRE.
I decided to give it a try to see what the API looked like, and I personally didn't find any problems with it (I might have just used it wrong, FWIW).

https://reviews.llvm.org/P7997

In particular, the old API GetValueInMiddleOfBlock has been a source of confusion (at least to me).

With your patch and my patch on top of it, I still saw one test failing:

--- /home/davide/old.ll 2017-05-30 20:56:21.058869000 -0700
+++ /home/davide/new.ll 2017-05-30 20:56:30.450514797 -0700
@@ -316,7 +316,7 @@
   br label %block4

 block4:                                           ; preds = %block3, %block2
-  %D = phi i32 [ 87, %block2 ], [ 97, %block3 ]
+  %D = phi i32 [ 97, %block3 ], [ 87, %block2 ]
   %A = phi i32 [ -1, %block2 ], [ 42, %block3 ]
   br i1 %cmpxy, label %block5, label %exit

@@ -350,7 +350,7 @@
   br label %loop

 loop:                                             ; preds = %loop, %entry
-  %Y2 = phi i8 [ %Y, %entry ], [ 0, %loop ]
+  %Y2 = phi i8 [ 0, %loop ], [ %Y, %entry ]
   %i = phi i32 [ 4, %entry ], [ 192, %loop ]
   %X2 = getelementptr i8, i8* %p, i32 %i
   %cond = call i1 @cond2()
@@ -372,7 +372,7 @@
   br label %loop

 loop:                                             ; preds = %cont, %entry
-  %Y2 = phi i8 [ %Y, %entry ], [ %Y2.pre, %cont ]
+  %Y2 = phi i8 [ %Y2.pre, %cont ], [ %Y, %entry ]
   %i = phi i32 [ 4, %entry ], [ 3, %cont ]
   %X2 = getelementptr i8, i8* %p, i32 %i
   %cond = call i1 @cond2()

I guess this is because we don't have a consistent ordering of PHI args. If that's the only problem, I might consider looking at it next (and then keep converting the remaining users).

ping on my last comment, in case it got lost in the churn :)

@dberlin I just came back to this while I was looking at NewGVN PRE.
I decided to give it a try to see what the API looked like, and I personally didn't find any problems with it (I might have just used it wrong, FWIW).

https://reviews.llvm.org/P7997

In particular, the old API GetValueInMiddleOfBlock has been a source of confusion (at least to me).

With your patch and my patch on top of it, I still saw one test failing:

--- /home/davide/old.ll 2017-05-30 20:56:21.058869000 -0700
+++ /home/davide/new.ll 2017-05-30 20:56:30.450514797 -0700
@@ -316,7 +316,7 @@
   br label %block4

 block4:                                           ; preds = %block3, %block2
-  %D = phi i32 [ 87, %block2 ], [ 97, %block3 ]
+  %D = phi i32 [ 97, %block3 ], [ 87, %block2 ]
   %A = phi i32 [ -1, %block2 ], [ 42, %block3 ]
   br i1 %cmpxy, label %block5, label %exit

@@ -350,7 +350,7 @@
   br label %loop

 loop:                                             ; preds = %loop, %entry
-  %Y2 = phi i8 [ %Y, %entry ], [ 0, %loop ]
+  %Y2 = phi i8 [ 0, %loop ], [ %Y, %entry ]
   %i = phi i32 [ 4, %entry ], [ 192, %loop ]
   %X2 = getelementptr i8, i8* %p, i32 %i
   %cond = call i1 @cond2()
@@ -372,7 +372,7 @@
   br label %loop

 loop:                                             ; preds = %cont, %entry
-  %Y2 = phi i8 [ %Y, %entry ], [ %Y2.pre, %cont ]
+  %Y2 = phi i8 [ %Y2.pre, %cont ], [ %Y, %entry ]
   %i = phi i32 [ 4, %entry ], [ 3, %cont ]
   %X2 = getelementptr i8, i8* %p, i32 %i
   %cond = call i1 @cond2()

I guess this is because we don't have a consistent ordering of PHI args. If that's the only problem, I might consider looking at it next (and then keep converting the remaining users).

Yes, it should hit the phi arguments in pred order, roughly, but it won't be perfect.
We should just sort them.

Note that we also will not go looking for existing phis to reuse.
IMHO, doing so (as the current SSAUpdater does) is not a good plan.
Either let CSE/instsimplify/any of the million things that will fix this clean it up,
or we need to extend the lifetime of SSAUpdater objects and hash the arguments of phi nodes in blocks we see.
Right now, what SSAUpdater does is badly N^2.

Anyway, if you want to work on this patch, feel free to submit diffs and take over this revision. I probably won't get back to it for a few months at least.

Thanks for your comment, I'll see what I can do given this shows up in some of our internal workloads.
One thing: I agree we should change the API at some point (and maybe use an O(defs + uses) renamer as we do in PredicateInfo) but, for starters, we should be able to have a drop-in replacement for RewriteUse, i.e. the new algorithm will always pick the correct value?

void SSAUpdater::RewriteUse(Use &U) {
  Instruction *User = cast<Instruction>(U.getUser());

  Value *V;
  if (PHINode *UserPN = dyn_cast<PHINode>(User))
    V = GetValueAtEndOfBlock(UserPN->getIncomingBlock(U));
  else
    V = GetValueInMiddleOfBlock(User->getParent());
[...]
dberlin added a comment.EditedJun 26 2017, 6:19 PM

Thanks for your comment, I'll see what I can do given this shows up in some of our internal workloads.
One thing: I agree we should change the API at some point (and maybe use an O(defs + uses) renamer as we do in PredicateInfo) but, for starters, we should be able to have a drop-in replacement for RewriteUse, i.e. the new algorithm will always pick the correct value?

void SSAUpdater::RewriteUse(Use &U) {
  Instruction *User = cast<Instruction>(U.getUser());

  Value *V;
  if (PHINode *UserPN = dyn_cast<PHINode>(User))
    V = GetValueAtEndOfBlock(UserPN->getIncomingBlock(U));
  else
    V = GetValueInMiddleOfBlock(User->getParent());
[...]

I believe this should be:

void SSAUpdater::RewriteUse(unsigned Var, Use &U) {
  Instruction *User = cast<Instruction>(U.getUser()); 
  BasicBlock *VarBB = User->getParent();
  if (auto *PN = dyn_cast<PHINode>(User))
    VarBB = PN->getIncomingBlock(U);

  U.set(readVariableAfterDef(Var, VarBB));
}

(Note all these APIs, both old and new, assume you are rewriting in program order. It's possible to make it not require that, by storing all the defs in a given block and using ordered instructions to see where the use occurs relative to them.)