This is an archive of the discontinued LLVM Phabricator instance.

New code hoisting pass based on GVN
AbandonedPublic

Authored by sebpop on Apr 5 2016, 10:55 AM.

Details

Summary

This pass hoists common computations across branches sharing common immediate
dominator. Like early-cse, the primary goal of early-gvn is to reduce the size
of functions before inline heuristics to reduce the total cost of function
inlining. In some cases this pass also reduces the critical path by exposing
more ILP.

The pass addresses the comments from Daniel Berlin from the previous review at http://reviews.llvm.org/D18710:
in particular, the complexity of the pass is now O(N) with N the number of instructions in the blocks where code hoisting is applied.
The pass works on the existing GVN from trunk, and can be ported to Danny's NewGVN by updating the calls to the VN look-ups.

Passes llvm regression test and test-suite.

Pass written by:
Sebastian Pop
Aditya Kumar
Xiaoyu Hu
Brian Rzycki

Diff Detail

Event Timeline

sebpop updated this revision to Diff 52713.Apr 5 2016, 10:55 AM
sebpop retitled this revision from to New code hoisting pass based on GVN.
sebpop updated this object.
sebpop added reviewers: dberlin, chandlerc, mcrosier.
sebpop set the repository for this revision to rL LLVM.
chandlerc requested changes to this revision.Apr 7 2016, 1:26 AM
chandlerc edited edge metadata.

Some high level comments:

  • The code formatting seems bad. Please run clang-format at the least on the new code.
  • Thanks for addressing Danny's comments about the algorithmic complexity. Keeping this away from O(n^2) algorithms is really important.
  • You didn't address Danny's comment about providing motivating benchmark data showing the improvements this provides.
  • We also would likely need to see a reasonable analysis of the effect this has on compile time so we understand the tradeoff this is making. My suspicion is that it won't be the correct tradeoff for O2 if it is correct at any level. (See below.)
llvm/lib/Transforms/IPO/PassManagerBuilder.cpp
212–214

It makes very little sense IMO to do early-cse and then early-gvn... The only point of early-cse was to be a substantially lighter weight CSE than GVN. If you're going to run GVN anyways to do hoisting, just run GVN.

Either way, I strongly suspect this should be conditioned on O3...

This revision now requires changes to proceed.Apr 7 2016, 1:26 AM
dberlin edited edge metadata.Apr 7 2016, 8:07 AM
dberlin added a subscriber: dberlin.

I just want to echo Chandler's comments about complexity:
Thank you for taking the time to do the work to not add another N^2 pass to
the compiler, even if it took you a bit more time/effort :)
(I'll take another gander at the patch for code review, but you should
address chandler's comments)

I just want to echo Chandler's comments about complexity:
Thank you for taking the time to do the work to not add another N^2 pass to
the compiler, even if it took you a bit more time/effort :)
(I'll take another gander at the patch for code review, but you should
address chandler's comments)

Sure, we will get back with useful numbers and fix formatting issues.

sebpop updated this revision to Diff 53095.Apr 8 2016, 3:03 PM
sebpop edited edge metadata.
sebpop removed rL LLVM as the repository for this revision.
  • We also would likely need to see a reasonable analysis of the effect this

has on compile time so we understand the tradeoff this is making. My suspicion
is that it won't be the correct tradeoff for O2 if it is correct at any
level. (See below.) [...] It makes very little sense IMO to do early-cse and
then early-gvn... The only point of early-cse was to be a substantially lighter
weight CSE than GVN. If you're going to run GVN anyways to do hoisting, just
run GVN.

I think there is a misconception of "GVN == PRE": this is because GVN.cpp was
around for a while and the "GVN pass" is a "redundancy elimination" pass.

To clarify matters, GVN is an analysis pass, like SCEV or data dependence are
analysis passes. What GVN does is similar to what md5 does to a text file:
given an instruction, GVN returns a unique integer identifying what the compiler
safely estimates as identical computations.

Our pass uses GVN as an analysis pass to perform code hoisting. CSE removes
redundant computations (same expression computed twice over the same execution
path.) Code hoisting does not remove redundancies: it replaces several identical
expressions executed exactly once on all execution paths with a single
expression placed in a common dominator block. Code hoisting, like CSE, is good
for code size reduction: it improves inlining heuristics by reducing the size of
function bodies.

Over the c/c++ SPEC 2006 benchmarks the GVN-hoisting pass removes 16161
instructions, 4622 loads and 11539 scalar instructions. Overall spec scores
improve with the patch except for bzip2 and gcc:

CINT2006:
400.perlbench 2.4%
401.bzip2 -0.9%
403.gcc -1%
429.mcf 6%
445.gobmk 1%
456.hmmer 0%
458.sjeng 1%
462.libquantum 4.6%
464.h264ref 0.6%
471.omnetpp 1.5%
473.astar 1.9%
483.xalancbmk llvm trunk coredumps

CFP2006:
433.milc 0%
444.namd 0.8%
447.dealII 0%
450.soplex 2.8%
453.povray 1%
470.lbm 2.6%
482.sphinx3 3.9%

Compile time statistics for the c/c++ benchmarks of SPEC2006 show that overall
there are more functions inlined:

  • Without the patch:

Number of call sites deleted, not inlined: 20394
Number of functions deleted because all callers found: 70497
Number of functions inlined: 182119
Number of allocas merged together: 225
Number of caller-callers analyzed: 200361
Number of call sites analyzed: 445806

  • With the patch:

Number of call sites deleted, not inlined: 20419
Number of functions deleted because all callers found: 70502
Number of functions inlined: 182449
Number of allocas merged together: 227
Number of caller-callers analyzed: 200929
Number of call sites analyzed: 446858
Number of hoisted loads: 4622
Number of hoisted scalar instructions: 11539

Compilation time is not impacted compiling nightly test-suite with a clang
configured in release mode:

llvm-trunk $ /usr/bin/time ninja -j1
355.80user 16.30system 6:24.47elapsed 96%CPU (0avgtext+0avgdata 285796maxresident)k
64inputs+291368outputs (7major+10867361minor)pagefaults 0swaps

patch $ /usr/bin/time ninja -j1
355.34user 15.55system 6:24.26elapsed 96%CPU (0avgtext+0avgdata 286008maxresident)k
94568inputs+290792outputs (453major+10921164minor)pagefaults 0swaps

I have renamed the new pass GVN-hoist to avoid further confusion. I think it
would be good in the NewGVN that Danny is pushing to trunk to have a clear
distinction of the GVN analysis with a clear interface that can be used by
transforms other than GVN-PRE. IMO forcing the GVN.cpp file to only contain the
analysis and pushing the other transforms out in different files would achieve
more clarity in the use of GVN analysis.

mcrosier added inline comments.Apr 11 2016, 11:17 AM
llvm/lib/Transforms/Scalar/GVNHoist.cpp
265

Should this be 'return hoistExpressions()'?

dberlin added inline comments.Apr 11 2016, 11:38 AM
llvm/lib/Transforms/Scalar/GVNHoist.cpp
113

You may just want to use df_iterator over the dom tree instead of recursion. It's easier to follow, and won't blow out the stack :)
If you use recursion, you have to or together all these values.

173

This still value numbers bbs unnecessarily, and can try to do so multiple times depending on the branch structure.

The way i suggested only value numbers a given BB once :)

Let me suggest a simpler way than rewriting the algorithm entirely (as i suggested):

value number everything up front, store a sorted set bbtovalues, keeping a list of value numbers that exist in each bb.

Instead of walking instructions in this algorithm, intersect bbtovalues. For each VN in the intersection, see if the expressions for that VN that occur in each BB can be hoisted.

(this will only ever walk the expressions you might hoist, instead of try to value number every expression again)

dberlin added inline comments.Apr 11 2016, 11:40 AM
llvm/lib/Transforms/Scalar/GVNHoist.cpp
173

Note: I realize you have tried to limit the branch structure to cases where you can guarantee you will only value number a given BB once.
I'm saying "if anyone ever, in the future, wanted to extend this at all, they'd have to rewrite your entire algorithm right now" :)

hiraditya added inline comments.Apr 12 2016, 10:23 AM
llvm/lib/Transforms/Scalar/GVNHoist.cpp
173

Thanks for the feedback. Yes, in the current implementation we value number each instruction only once. When an expression is hoisted still the value numbers are not invalidated because one of the expressions in the child branch gets hoisted.

So, an extension of this algorithm could be to start hoisting equivalent (scalar) expressions across non-sibling branches. In that case, I think, we would still value number only once, if we hoist one of the expressions and delete the rest. For loads, or other expressions with side-effects, hoisting would require more complicated analysis across each edge dominated by the common dominator of the equivalent expressions, so maybe we could skip that to save compile time, I'm not very sure about this though. Could you give an example of a case when multiple value numbering of same expression will be required, that will be helpful.

hiraditya added inline comments.Apr 12 2016, 10:30 AM
llvm/lib/Transforms/Scalar/GVNHoist.cpp
265

Yes, thanks for pointing out.

Thanks for the feedback. Yes, in the current implementation we value
number each instruction only once. When an expression is hoisted still the
value numbers are not invalidated because one of the expressions in the
child branch gets hoisted.

Right. This is a very trivial form of hoisting using the dominator tree.

So, an extension of this algorithm could be to start hoisting equivalent
(scalar) expressions across non-sibling branches.

This will not work if the branches are ever back edges or blocks with
multiple predecessors in general unless all predecessors are part of the
same branch of the dominator tree.

In that case, I think, we would still value number only once,

Only if the conditions above are met (that's off the top of my head, i'm
pretty sure i could think of more).

if we hoist one of the expressions and delete the rest. For loads, or
other expressions with side-effects, hoisting would require more
complicated analysis across each edge dominated by the common dominator of
the equivalent expressions, so maybe we could skip that to save compile
time, I'm not very sure about this though.

It's actually a pretty simple analysis, FWIW:

With memory-ssa, all loads have a "def" of the nearest thing above them
that kills them. So if the loads reach the same def, you can hoist them to
that def. If they do not, you cannot.
You can do extra work to see if you can somehow go back past that def if
you like.

Could you give an example of a case when multiple value numbering of same
expression will be required, that will be helpful.

Anything whose value depends on a phi node, directly or indirectly. For
multiple predecessor blocks, if you are trying to determine constant values
of things and not just sameness, it will also take multiple iterations to
get correct answers.

See, in general, the description of the RPO algorithm in
https://www.cs.rice.edu/~keith/Promo/CRPC-TR95636.pdf.gz

Currently

http://reviews.llvm.org/D18798

The optimistic approach Danny asked for is implemented in http://reviews.llvm.org/D19338

llvm/lib/Transforms/Scalar/GVNHoist.cpp
173

"[...] rewrite your entire algorithm right now" :)

Danny knows how to persuade people on doing what he wants ;-)

mcrosier resigned from this revision.Jun 10 2016, 8:40 AM
mcrosier removed a reviewer: mcrosier.

Should this revision be abandoned?

sebpop abandoned this revision.Jun 10 2016, 11:06 AM

Abandoned: the newer version of the patch is in http://reviews.llvm.org/D19338