This is an archive of the discontinued LLVM Phabricator instance.

[GVNSink] Initial GVNSink prototype
ClosedPublic

Authored by jmolloy on Sep 21 2016, 9:02 AM.

Details

Summary

This patch provides an initial prototype for a pass that sinks instructions based on GVN information, similar to GVNHoist. It is not yet ready for committing, but I've uploaded it to gather some initial thoughts.

This pass attempts to sink instructions into successors, reducing static instruction count and enabling if-conversion. We use a variant of global value numbering to decide what can be sunk. Consider:

[ %a1 = add i32 %b, 1  ]   [ %c1 = add i32 %d, 1  ]
[ %a2 = xor i32 %a1, 1 ]   [ %c2 = xor i32 %c1, 1 ]
                 \           /
           [ %e = phi i32 %a2, %c2 ]
           [ add i32 %e, 4         ]

GVN would number %a1 and %c1 differently because they compute different results - the VN of an instruction is a function of its opcode and the transitive closure of its operands. This is the key property for hoisting and CSE.

What we want when sinking however is for a numbering that is a function of the *uses* of an instruction, which allows us to answer the question "if I replace %a1 with %c1, will it contribute in an equivalent way to all successive instructions?". The (new) PostValueTable class in GVN provides this mapping.
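To make the idea concrete, here is a toy sketch in Python of a use-based numbering for the example above. This is my own reconstruction of the idea, not the patch's actual PostValueTable; the real implementation also has to account for constants, memory, and commutativity, which this toy ignores.

```python
def post_value_numbers(insts, uses):
    """Number instructions by opcode plus *how they are used*, rather
    than by their operands (classic GVN).  `insts` maps a name to its
    opcode; `uses` maps a name to a list of (user, operand_slot).  A
    user that is not itself in `insts` (here, the phi in the common
    successor) acts as a shared root that all candidates feed into."""
    table = {}   # numbering key -> value number
    vn = {}      # instruction name -> value number

    def number(name):
        if name not in insts:
            return -1                     # the shared sink point
        if name in vn:
            return vn[name]
        # An instruction's number depends on its opcode and on the
        # numbers of its users (plus which operand slot it fills).
        use_key = tuple(sorted((number(u), slot) for u, slot in uses[name]))
        vn[name] = table.setdefault((insts[name], use_key), len(table))
        return vn[name]

    for name in insts:
        number(name)
    return vn

# The example from the summary; phi operand slots are interchangeable,
# so both incoming values are recorded as slot 0 of the phi.
insts = {"a1": "add", "a2": "xor", "c1": "add", "c2": "xor"}
uses = {"a1": [("a2", 0)], "c1": [("c2", 0)],
        "a2": [("phi", 0)], "c2": [("phi", 0)]}

vn = post_value_numbers(insts, uses)
assert vn["a1"] == vn["c1"]   # equivalent uses => same post value number
assert vn["a2"] == vn["c2"]
```

Classic GVN would give %a1 and %c1 different numbers because their operands (%b vs %d) differ; the use-based numbering gives them the same number because replacing one with the other changes nothing downstream.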

There is a lot of bikeshedding to be done here. Other notable things to do:

  • Throw *much* more testing at it
  • Remove virtual function calls in ValueTable, using CRTP
  • Bikeshed PostValueTable
  • Tweak the sinking heuristic
  • Work out when it needs to run in the pass pipeline
  • Properly update MemorySSA instead of fully invalidating it
  • Run away from Danny when he sees what havoc I've wrought with GVN and MemorySSA

This pass has already shown some really impressive improvements on internal benchmarks, especially for code size, so I have high hopes it can replace all the sinking logic in SimplifyCFG.

Diff Detail

Repository
rL LLVM

Event Timeline

jmolloy updated this revision to Diff 72065.Sep 21 2016, 9:02 AM
jmolloy retitled this revision from to [GVNSink] Initial GVNSink prototype.
jmolloy updated this object.
jmolloy added reviewers: dberlin, sebpop, spop.
jmolloy set the repository for this revision to rL LLVM.
jmolloy added a subscriber: llvm-commits.
jmolloy updated this revision to Diff 72068.Sep 21 2016, 9:10 AM

Address some mangled formatting resulting from an over-eager comment reflow.

sebpop edited edge metadata.Sep 21 2016, 10:05 AM

Very nice!
I will try your patch and send more comments once I've given it a run through gdb.

lib/Transforms/Scalar/GVNSink.cpp
353

I think we do not need to iterate for sinking: the only reason we iterate in hoisting is because the data dependences are changing above the expressions, and so we need to get new VN for all the expressions in which an operand has changed.

If we do sink in the right order (from innermost to outer BB not contained in branches, and from beginning of the BB to the last instruction) we should be able to sink all expressions without iterating.

361

I think GVN is not invalidated by sinking. We only have this problem with hoisting.

665

I'll have a look.

dberlin added inline comments.Sep 21 2016, 10:06 AM
lib/Transforms/Utils/MemorySSA.cpp
1235 ↗(On Diff #72068)

You don't need this at all, AFAIK.

If you just want something that works, while it's being reviewed, we discuss the update semantics, etc, do what sebpop did in early gvnhoist:

Instead of using memoryssa as a pass, use it as a utility.

ie something like:

MemorySSA *MSSA = new MemorySSA(F, DT, AA);
for each sink candidate {
  try to sink a thing
  if we sunk something and we can't update properly {
    delete MSSA;
    MSSA = new MemorySSA(F, DT, AA);
  }
}

or whatever the scope it works safely on right now really is. This will work fine.

Then we can gradually push it to be rebuilt less and less.
Eventually, it's out of the loop and at the top level anyway, and we can just go back to the pass version.

gberry added a subscriber: gberry.Sep 21 2016, 12:06 PM
hiraditya added inline comments.Sep 21 2016, 1:25 PM
lib/Transforms/Scalar/GVNSink.cpp
611

This could be checked before sort and reverse.

hiraditya added inline comments.Sep 21 2016, 1:27 PM
include/llvm/Transforms/Scalar.h
359

?

include/llvm/Transforms/Scalar/GVN.h
241

?

Hi James,

after discussing with @hiraditya, we realized that the example you give
is not supposed to be sunk by a GVN-sink pass:

[ %a1 = add i32 %b, 1  ]   [ %c1 = add i32 %d, 1  ]
[ %a2 = xor i32 %a1, 1 ]   [ %c2 = xor i32 %c1, 1 ]
               \           /
         [ %e = phi i32 %a2, %c2 ]
         [ add i32 %e, 4         ]

Sinking this example would need the bisimulation algorithm implemented in SimplifyCFG:
that part is the one using the LockstepReverseIterator. I think there is value in separating
that algorithm from SimplifyCFG so it can run independently of the CFG transforms.
Let's submit a patch that outlines all the functions dealing with sinking in SimplifyCFG
into a separate code sinking pass. That pass should be able to catch the example you gave.

A GVN-sink pass will not catch the above example and it will sink other
cases that the bisimulation algorithm will not be able to catch: those are all the
expressions for which GVN will return the same number. For example:

[ store 1, %gep1 ]    [ store 1, %gep2 ]
           \             /
          [     ...        ]

if GVN decides that %gep1 and %gep2 are actually the computation of
the same memory address, and there are no side effects on the path
from the stores to the next BB, then we should be able to sink the two stores.

I think the two algorithms are complementing each other: I don't expect
GVN-sink to handle all the cases and I think we should keep both transforms
in two separate passes.

MatzeB added a subscriber: MatzeB. Edited Sep 21 2016, 1:43 PM

Without looking at anything in the patch. For the example given, wouldn't it be enough to create some DAGCombine style rules:

x = PHI(OP(y, c), OP(z, c))
=>
x = OP(PHI(y, z), c)

?
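The rewrite above can be made concrete with a toy symbolic version. This is my own illustration, not an actual DAGCombine rule: a phi whose incoming values are the same operation with one shared operand is folded into that operation applied to a phi of the differing operands.

```python
# Fold x = PHI(OP(y, c), OP(z, c)) into x = OP(PHI(y, z), c),
# representing values as ('op', lhs, rhs) triples.
def fold_phi_of_ops(lhs, rhs):
    """lhs/rhs: ('op', a, b) triples for the two incoming values."""
    (op1, y, c1), (op2, z, c2) = lhs, rhs
    if op1 == op2 and c1 == c2:
        return (op1, ("phi", y, z), c1)   # x = OP(PHI(y, z), c)
    return None                            # ops or constants differ

assert fold_phi_of_ops(("add", "y", 1), ("add", "z", 1)) == \
       ("add", ("phi", "y", "z"), 1)
assert fold_phi_of_ops(("add", "y", 1), ("xor", "z", 1)) is None
```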

hiraditya added inline comments.Sep 21 2016, 2:13 PM
lib/Transforms/Scalar/GVN.cpp
348 ↗(On Diff #72068)

For an instruction with no uses, e.g. a store, this function will return e with only type and opcode initialized. How will that instruction get sunk?

Hi Matthias,

Without looking at anything in the patch. For the example given, wouldn't it be enough to create some DAGCombine style rules:

x = PHI(OP(y, c), OP(z, c))
=>
x = OP(PHI(y, z), c)

?

Adding to what Danny said, the primary issue with taking a greedy approach to sinking is the cost model. We obviously don't want to produce hundreds of PHIs, so we need to only sink instructions when we think the operands are equivalent. Consider this:

bb1:
  %a = load i32* %b
  %c = add i32 %a, 4
  store i32 %c, i32* %p

bb2:
  %a2 = load i32* %b
  %c2 = add i32 %a2, 4
  store i32 %c2, i32* %q

If we attacked this in a greedy fashion, we'd compare the two stores and notice that both operands differ. We'd therefore require two extra PHIs to sink that store:

bb3:
  %p1 = phi i32 %c, %c2
  %p2 = phi i32* %p, %q
  store i32 %p1, i32* %p2

But if we can look more holistically at the entire sequence, we can see that in the three instructions only one operand differs, %p and %q. So we could sink everything with just one additional PHI:

bb3:
  %p1 = phi i32* %p, %q
  %a = load i32* %b
  %c = add i32 %a, 4
  store i32 %c, i32* %p1
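This counting argument can be sketched as a toy heuristic. This is my own construction, not the patch's actual cost model: sinking one instruction from each of N predecessors saves N-1 copies, while every operand position on which the candidates disagree costs one new PHI in the successor. In the real pass operands are compared by value number, so %c and %c2 compare equal once the instructions feeding them are merged; the toy uses identical tokens to stand for VN-equal operands.

```python
def sink_profit(candidates):
    """candidates: one (opcode, operands) tuple per predecessor, all of
    which we are considering merging into one sunk instruction."""
    opcodes = {op for op, _ in candidates}
    if len(opcodes) != 1:
        return None                      # different opcodes: not mergeable
    saved = len(candidates) - 1          # instruction copies removed
    operand_lists = [ops for _, ops in candidates]
    # One PHI per operand slot where the candidates disagree.
    phis = sum(1 for slot in zip(*operand_lists) if len(set(slot)) > 1)
    return saved - phis                  # positive => likely worthwhile

# Greedy view of the stores alone: both operands differ, net loss.
assert sink_profit([("store", ("c", "p")), ("store", ("c2", "q"))]) == -1

# Holistic view of the whole sequence, with VN-equal operands unified:
seq1 = [("load", ("b",)), ("add", ("a", 4)), ("store", ("c", "p"))]
seq2 = [("load", ("b",)), ("add", ("a", 4)), ("store", ("c", "q"))]
total = sum(sink_profit([i1, i2]) for i1, i2 in zip(seq1, seq2))
assert total == 2   # three copies saved, one PHI created
```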

This problem of determining equivalence is why the cost model and analysis are so complex, making up around 80% of the pass. We can also do things like inserting new blocks. This explanation is one I created earlier and is lifted out of SimplifyCFG.cpp - it applies here, and GVNSink should do an even better job without the restriction to one conditional edge:

// We support two situations:
//   (1) all incoming arcs are unconditional
//   (2) one incoming arc is conditional
//
// (2) is very common in switch defaults and
// else-if patterns;
//
//   if (a) f(1);
//   else if (b) f(2);
//
// produces:
//
//       [if]
//      /    \
//    [f(1)] [if]
//      |     | \
//      |     |  \
//      |  [f(2)]|
//       \    | /
//        [ end ]
//
// [end] has two unconditional predecessor arcs and one conditional. The
// conditional refers to the implicit empty 'else' arc. This conditional
// arc can also be caused by an empty default block in a switch.
//
// In this case, we attempt to sink code from all *unconditional* arcs.
// If we can sink instructions from these arcs (determined during the scan
// phase below) we insert a common successor for all unconditional arcs and
// connect that to [end], to enable sinking:
//
//       [if]
//      /    \
//    [x(1)] [if]
//      |     | \
//      |     |  \
//      |  [x(2)] |
//       \   /    |
//   [sink.split] |
//         \     /
//         [ end ]
//

The ability to sink from a subset of the predecessors is what allows this optimization to remove large amounts of code. We canonicalize to a single %cleanup block at the end of a function, so there is frequently one block with *many* predecessors from different parts of the function - some of these predecessors will have common code to sink and others won't. It's important to be able to determine when there is enough code sinkable from a subset of predecessors to make it worthwhile to create a new block.

James

Hi,

Thanks all for the comments. I'll be acting on these today.

Danny (in particular), do you have any thoughts about the naming of PostValueTable and the virtual function calls? I'm convinced this approach must have appeared in literature and thus been given a name, but I can't find anything. Do you know of where it might have been written about before, or is this actually novel?

Cheers,

James

lib/Transforms/Scalar/GVN.cpp
348 ↗(On Diff #72068)

Indeed. For a store, createExpr() will return an expression that is identical for all stores.

In lookupOrAddStore() we have two operating modes. If we can't ignore memory dependencies (the default, conservative mode), every store will be given a different value number (createExpr() isn't even called in this case).

If we can ignore memory dependencies, we will be aggressive and every store will be given the same value number. This can obviously lead to incorrect code in general.

In GVNSink.cpp we have a class GVNSink::ValueTable that subclasses PostValueTable. It puts PostValueTable in aggressive mode and imposes its *own* memory ordering, which is domain specific (we can make assumptions in GVNSink about how we compare and move memory operations that makes a simple memory ordering feasible).

Basically memory ordering is hard, so we don't do it by default. The aggressive mode allows us to punt that decision onto the user. An alternative approach I thought of (and could go back to) is to have a specific memory order resolver callback that a user of PostValueTable could optionally supply. Perhaps this might be a cleaner approach?

lib/Transforms/Scalar/GVNSink.cpp
353

I agree this is way too aggressive in iteration; however, we do need to iterate in some cases.

If we split an incoming edge and end up adding a new block (a valid place to sink to), we'll invalidate the value numbering, MemorySSA, and the post-dominator tree. So we need to iterate again in this case.

611

The empty() check could be done before sort and reverse, but the Cost check couldn't - that relies on the candidate list being in sorted order.

lib/Transforms/Utils/MemorySSA.cpp
1235 ↗(On Diff #72068)

Thanks - I'll do that. I'm keen to get the update code right but couldn't quite work out how to clear out MemorySSA.

jmolloy updated this revision to Diff 72149.Sep 22 2016, 4:58 AM
jmolloy edited edge metadata.
jmolloy removed rL LLVM as the repository for this revision.

Respond to current review feedback, add more comments and block comment for PostValueTable.

lib/Transforms/Utils/MemorySSA.cpp
1235 ↗(On Diff #72149)

This is now no longer used and will be destroyed.

Just some coding style nitpicking.

  • Still a bunch of FIXME comments around. I often feel that many of them could just as well be removed if they didn't turn out to be important enough to be addressed before the commit.
lib/Transforms/Scalar/GVNSink.cpp
11–33

This could be

/// \file
/// This pass attempts...

to make the comment available to doxygen.

69–80

/// for doxygen, do not repeat the type name in the comment.

94

A matter of taste, but I think BasicBlock *BB instead of auto *BB is only slightly longer and has a higher documentation value. Similar in many following for loops.

114

I'd move that into the for loop to make the scope obvious.

130

Using a reference seems unnecessary (I don't see any modification).

172

could be const

188

could be const PHINode &

199

doxygen ///. More cases of this follow.

251

Is there anything missing in LLVM's DenseSet, or any other reason for std::unordered_set?

524

Indentation

611

Why not SmallPtrSet?

625–626

A custom reversed comparison operator should save the reverse() operation.
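As a small illustration of the suggestion (in Python rather than the patch's C++): sorting with an inverted comparison produces the same order as sort-then-reverse without the extra pass, as long as there are no ties whose relative order stability would matter for.

```python
candidates = [("c1", 2), ("c2", 5), ("c3", 3)]   # (name, cost)

# Two steps: sort ascending by cost, then reverse.
by_two_steps = sorted(candidates, key=lambda c: c[1])
by_two_steps.reverse()

# One step: a reversed comparison (here via reverse=True on the key).
by_one_step = sorted(candidates, key=lambda c: c[1], reverse=True)

assert by_two_steps == by_one_step == [("c2", 5), ("c3", 3), ("c1", 2)]
```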

grandinj added inline comments.
include/llvm/Transforms/Scalar.h
359

"inverted" might be a better description

include/llvm/Transforms/Scalar/GVN.h
105

corollary

jmolloy updated this revision to Diff 73900.Oct 7 2016, 2:54 AM
jmolloy set the repository for this revision to rL LLVM.
jmolloy marked 12 inline comments as done.

Thanks all for the review comments. I've updated the patch following all your comments, thrown all the testing I have at it, and fixed all the bugs seen.

Cheers,

James

jmolloy updated this revision to Diff 99439.May 18 2017, 8:22 AM
  • Patch updated
  • Rebased onto NewGVN
  • No known defects - LNT, spec2k, spec2k6, usual suspects run, no problems.
  • All of the defects raised in SimplifyCFG's implementation have been added to the regression test suite for GVNSink.

The patch is now ready for final review! :)

jmolloy added inline comments.
lib/Transforms/Scalar/GVNSink.cpp
172

It's not a member function so can't be const.

251

Nope, I hadn't even considered DenseSet. I've swapped over.

Looks all good from my side. But I am not the best person to review IR transformations, so would be good to wait for someone else.

lib/Transforms/Scalar/GVNSink.cpp
172

I meant const SinkingInstructionCandidate &C (for operator<< which is a few lines below now)

efriedma edited reviewers, added: efriedma; removed: eli.friedman.May 18 2017, 11:22 AM
efriedma added a subscriber: efriedma.
efriedma added inline comments.
lib/Transforms/Scalar/GVNSink.cpp
509

Maybe move some of the implementations out of the class definition to reduce indentation?

578

This function should probably be somewhere outside GVNHoist, so other passes can use it.

684

Indentation.

872

Could you use post_order(F) instead of PDT here? If not, could you add a comment explaining what properties of the post-dominator tree you're depending on?

efriedma added inline comments.May 18 2017, 2:09 PM
lib/Transforms/Scalar/GVNSink.cpp
872

Err, post_order() isn't right; I guess you want a reverse post-order traversal on the inverted graph, or something like that?

dberlin added inline comments.May 18 2017, 2:13 PM
lib/Transforms/Scalar/GVNSink.cpp
870

This is just flat out broken :)

872

Yes, it's RPO on the inverse graph.

jmolloy updated this revision to Diff 99533.May 19 2017, 2:36 AM
jmolloy marked 3 inline comments as done.

Hi,

Thanks for the review comments:

  • Added const to operator<< as suggested by Matthias.
  • Moved some implementations out of the class as suggested by Eli.
  • Used post_order. I looked into using inverse RPO, but that is complicated by the lack of a single entry node in the inverted graph. Given that I'm iterating to a fixpoint anyway, I think post order traversal gives me the only guarantee I need (look at successors before predecessors).
  • Moved canReplaceOperandWithVariable into Utils/Local.h. Removed the SimplifyCFG version of it (good catch, I'd forgotten that existed!).
  • Added one forgotten test, and sorted out errant tab.

Cheers,

James

dberlin edited edge metadata.May 19 2017, 7:12 AM
  • Used post_order. I looked into using inverse RPO, but that is complicated by the lack of a single entry node in the inverted graph. Given that I'm iterating to a fixpoint anyway, I think post order traversal gives me the only guarantee I need (look at successors before predecessors).

No it won't, in the same way pre-order is not equivalent to predecessors before successors :)

What you want is either PDT order (and we should just fix the post-dom tree and move on. I have patches out to fix it) or inverse traversal RPO (and you can crib how to figure out which nodes should be exit nodes from that patch if you want)

PDT order should be equivalent to inverse traversal RPO if you sort the siblings.

post_order is going to cause this to iterate *way* too much on certain types of test-cases.

My only other comment is that it's not always clear what the tests are testing (IE what you expect to happen and why).
The comments that are there sometimes leave the reader confused.

IE i read "; We can't common an intrinsic!" and my first reaction was "sure we can".

lib/Transforms/Scalar/GVNSink.cpp
351

We probably should move this magic into the constructor or something by letting it take an opcode and predicate.

484

Yeah, you really want inverse MemorySSA here.
We could build it without too much trouble if you determine it matters enough, performance wise.

No it won't, in the same way pre-order is not equivalent to predecessors before successors :)

I certainly don't really want to keep PDT and DT around and updated just to get a good iteration order. I'd much prefer an IRPO iterator. I've looked and found your thread starting "[llvm] r296535 - Fix PR 24415 (at least), by making our post-dominator tree behavior sane." - is that the one you were referring to?

I'll take a read and try to pick out how you decide what the entry nodes to the inverse graph should be.

Alternatively, and at least in the meantime, I could restrict GVNSink to only run on canonical CFGs with a single return and no unreachables (or just treat all blocks containing a return as entry blocks for the inverse rpo walk; this doesn't have to cover all blocks for correctness reasons after all).

lib/Transforms/Scalar/GVNSink.cpp
351

Sure.

484

That would be really nice. I'd have to have a think about how many cases that would actually allow us to analyze, and how many still need a good solution to correlated but reordered stores (we previously talked about solving this one with global code motion).

sanjoy added a subscriber: sanjoy.May 21 2017, 10:21 PM
jmolloy updated this revision to Diff 99753.May 22 2017, 6:41 AM

Hi Danny,

Thanks for making me think more about the iteration order. Regardless of the difficulties of forming inverse RPO, I now think I was completely wrong about the preferred iteration order.

For a sinking pass, surely a *forward* walk (preds before succs) is more optimal than a backward walk. Pushing instructions down the graph exposes more opportunities for sinking later on. Also, a forward walk wouldn't have to worry about recalculating value numbers when changing the CFG as we're only changing parts of the CFG that we'll never inspect again.

So I think a single pass, RPO walk of the CFG is sufficient (and indeed passes all my testcases and all the internal testing I can throw at it).

Cheers,

James

Hi Danny,

Thanks for making me think more about the iteration order. Regardless of the difficulties of forming inverse RPO, I now think I was completely wrong about the preferred iteration order.

I don't believe you are.

Most sinking passes operate in IRPO or PDT order.

For a sinking pass, surely a *forward* walk (preds before succs) is more optimal than a backward walk.

Nope.

For most sinking passes, the blocker is the thing below the current instructions

Pushing instructions down the graph exposes more opportunities for sinking later on.

Yes, which is why things operate in reverse-local, succs before preds order.

given

A:
store 1
B:
store 2
C:
store 3

They want to do "sink store 3, sink store 2, sink store 1".

This is IRPO order, walking each basic block backwards.

If you do sink store 1, sink store 2, sink store 3, you may only be able to sink store 1 to store 2's location, and store 2 to store 3's location.
Then once you move store 3, you must repeat.

The above may take 3 iterations to be optimal in a forward RPO pass.
It takes 1 in an IRPO pass.
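The iteration-count argument can be checked with a toy simulation. This is my own construction, not code from the patch: a chain A -> B -> C feeding a common sink target, where each store blocks the one above it, so an instruction can only sink once everything below it has moved.

```python
def passes_needed(order):
    """Count full passes until everything is sunk, visiting blocks in
    the given order each pass."""
    chain = ["A", "B", "C"]                      # A -> B -> C -> sink
    contents = {"A": ["store 1"], "B": ["store 2"], "C": ["store 3"]}
    sunk = []                                    # the common sink target
    passes = 0
    while any(contents.values()):
        passes += 1
        for b in order:
            if not contents[b]:
                continue
            below = chain[chain.index(b) + 1:]   # blocks between b and sink
            if all(not contents[x] for x in below):
                sunk.append(contents[b].pop())   # nothing blocks: sink it
    return passes

assert passes_needed(["A", "B", "C"]) == 3       # forward (RPO-like) walk
assert passes_needed(["C", "B", "A"]) == 1       # bottom-up (IRPO-like) walk
```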

So I think a single pass, RPO walk of the CFG is sufficient (and indeed passes all my testcases and all the internal testing I can throw at it).

This is pretty surprising.

Given two must-aliased stores, for example, i can't see how this would sink them as far as possible in one iteration.

dberlin accepted this revision.May 23 2017, 8:45 PM

I'm fine with this code from an algorithm perspective. It may not be perfect, but it's a good start. I haven't gone through it with as strong of a style eye as others may.
It may be worth waiting for someone to do that, but i'm also fine if you want to commit it and do that in post-commit-review, since i have no trouble trusting you'll do that.

lib/Transforms/Scalar/GVNSink.cpp
75

I'm not sure how easy/possible it is to do, but isn't this just a zip_iterator of a bunch of reverse iterators?

198

I'm curious why you think you need stable sort here as opposed to regular

This revision is now accepted and ready to land.May 23 2017, 8:45 PM
Closed by commit rL303850: [GVNSink] GVNSink pass (authored by jamesm).May 25 2017, 5:51 AM
This revision was automatically updated to reflect the committed changes.
jmolloy marked an inline comment as done.May 25 2017, 6:20 AM

Thanks Danny,

I've committed the patch after changing from std::stable_sort to std::sort (I didn't need it, I was being paranoid debugging a heisenbug). This was good timing as today is my last day at ARM and in a couple of hours I will lose access to my development machine! :)

Thanks for your last reply; I now, finally, have got it! I spent yesterday prototyping and have something that conceptually works. It'll need a lot more work though. Watch this space ;)

Cheers,

James

lib/Transforms/Scalar/GVNSink.cpp
75

Similar; it'd be really cool to have something like zip_iterator in STLExtras. Although this does have some domain-specific utility functions, and doesn't have undefined behaviour when the input iterators aren't of equal length :)

198

I don't. Changed.

Thanks for working on this. I was waiting for the patch so that I can get some 'inspiration' for my current work.
Best wishes for your next job.