This is an archive of the discontinued LLVM Phabricator instance.

code hoisting using GVN
AbandonedPublic

Authored by sebpop on Apr 1 2016, 1:28 PM.

Details

Summary

This pass hoists common computations across branches sharing common immediate
dominator. Like early-cse, the primary goal of early-gvn is to reduce the size
of functions before inline heuristics to reduce the total cost of function
inlining. In some cases this pass also reduces the critical path by exposing
more ILP.

Passes llvm regression test and test-suite.

Pass written by:
Sebastian Pop
Aditya Kumar
Xiaoyu Hu
Brian Rzycki

Diff Detail

Event Timeline

sebpop updated this revision to Diff 52415.Apr 1 2016, 1:28 PM
sebpop retitled this revision from to code hoisting using GVN.
sebpop updated this object.
sebpop added reviewers: mcrosier, chandlerc, mehdi_amini.
sebpop set the repository for this revision to rL LLVM.
sebpop added a subscriber: hiraditya.
hxy9243 added a subscriber: hxy9243.Apr 1 2016, 2:19 PM
mcrosier added inline comments.Apr 4 2016, 6:42 AM
llvm/lib/Transforms/Scalar/GVN.cpp
2726

ZeroOrMore isn't generally used with cl::opt.

2730

Same as above.

sebpop added inline comments.
llvm/lib/Transforms/Scalar/GVN.cpp
2726

I copied the stmt from above: GVN.cpp line 76. Do you have an alternative option I should be using?

  1. This needs a ton more correctness tests. It also needs real performance numbers, both on the compile time and execution time side.
  1. For loads, i have trouble seeing how it gets the following case right:
  1
 / \
2   3
|   |
4   5

Two loads, one in 4, one in 5.
All pointer operands are defined in block 1 (so you are all good there).
Both 2 and 3 contain calls that kill memory.
It looks like you will try to hoist to 1 anyway, because nowhere do you use memory dependence or memoryssa to figure out *if the memory state* is the same. For scalars, it doesn't matter.

You do stop trying to hoist when you hit modifying expressions in 4 or 5, but that won't help you here.

  1. This seems like a really expensive way of doing this :)

This is pretty badly N^2.

Can i suggest a different method, closely based on what we do in GCC for speed reasons (You will discover the vast majority of things have only one expression for each VN, so it's pointless to walk and do lookups on them)

Make a multimap from VN to each expression with that VN (VNTable is not currently this) over the entire program.

For each VN in the table:
 if (size (expressions with a given VN) > 1):
   One of:
    A. Calculate VBE over the expressions with that VN.
    B. Do something similar to what you do now (now you have N^2 where N is the number of expressions with a given VN, instead of number of blocks)
    C. Something like this should work for completeness and sparseness:
        For each block in domtree in DFS order:
              For each expression, if DFSin/DFSOut(expression->parent) within range of DFSin/DFSOut(domtree), push(possiblehoistlist (a list), expression (current expression), block (insertion point))
              If you have 2 or more things on possiblehoistlist, calculate availability of operands for each expression in block (you can cache the highest point you can move them to for later to checking again and again. Since you are using dominance, you know they can only be hoisted to blocks dominated by the highest point you can hoist to).
              If 2 or more things are still available, hoist

Note you can also likely skip any domtree block that does not have two or more children, i just haven't proven it to myself yet.

Thanks Danny for your feedback.

  1. This needs a ton more correctness tests.

We will add more testcases.

It also needs real performance numbers, both on the compile time and execution time side.

We ran the llvm nightly test-suite on x86, and we have not seen any meaningful numbers we could report: it is too noisy because of the short execution times. We will run spec 2k and 2k6 on x86 where things should be more stable.

  1. For loads, i have trouble seeing how it gets the following case right:
  1
 / \
2   3
|   |
4   5

Two loads, one in 4, one in 5.
All pointer operands are defined in block 1 (so you are all good there).
Both 2 and 3 contain calls that kill memory.
It looks like you will try to hoist to 1 anyway, because nowhere do you use memory dependence or memoryssa to figure out *if the memory state* is the same. For scalars, it doesn't matter.

You do stop trying to hoist when you hit modifying expressions in 4 or 5, but that won't help you here.

The pattern we are matching does not match this example: please correct me if I'm wrong here.
The only case we are handling is without the intermediate blocks 2 and 3 on your example:

BasicBlock *BB = Dom->getBlock();
// Only handle two branches for now: it is possible to extend the hoisting
// to switch statements.
BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator());
if (!BI || BI->getNumSuccessors() != 2)
  return false;

BasicBlock *BB1 = BI->getSuccessor(0);
BasicBlock *BB2 = BI->getSuccessor(1);
assert(BB1 != BB2 && "invalid CFG");

if (!DT->properlyDominates(BB, BB1) ||
    !DT->properlyDominates(BB, BB2) ||
    BB1->isEHPad() || BB1->hasAddressTaken() ||
    BB2->isEHPad() || BB2->hasAddressTaken())
  return false;

BB1 and BB2 are the direct successors of BB: there is no other block in between.
We only hoist expressions from BB1 and BB2 into BB.

  1. This seems like a really expensive way of doing this :)

This is pretty badly N^2.

Can i suggest a different method, closely based on what we do in GCC for speed reasons

We will implement your suggestion: Aditya has also remarked that we could be more efficient on compile time by using the internal structures of the VN table, though I agree with you that we will need some more changes.

mcrosier added inline comments.Apr 4 2016, 10:59 AM
llvm/lib/Transforms/Scalar/GVN.cpp
2726

We should be using the default for cl::opt, which is cl::Optional (i.e., the option may be specified zero or once). My suggestion is to remove cl::ZeroOrMore from the command line option as this should only be used with cl::list. Hopefully, that makes sense.

The pattern we are matching does not match this example: please correct me
if I'm wrong here.
The only case we are handling is without the intermediate blocks 2 and 3
on your example:

BasicBlock *BB = Dom->getBlock();
// Only handle two branches for now: it is possible to extend the

hoisting

// to switch statements.
BranchInst *BI = dyn_cast<BranchInst>(BB->getTerminator());
if (!BI || BI->getNumSuccessors() != 2)
  return false;

BB1 and BB2 are the direct successors of BB: there is no other block in

between.

We only hoist expressions from BB1 and BB2 into BB.

Yes, i misread this part.

  1. This seems like a really expensive way of doing this :)
This is pretty badly N^2.
Can i suggest a different method, closely based on what we do in GCC

for speed reasons

We will implement your suggestion: Aditya has also remarked that we could
be more efficient on compile time by using the internal structures of the
VN table, though I agree with you that we will need some more changes.

(FWIW: New GVN already has such a mapping/structure. For each value, it can
tell you all the member expressions of that value)

(FWIW: New GVN already has such a mapping/structure. For each value, it can
tell you all the member expressions of that value)

Thanks for the feedback. You mention that 'new GVN', could you point where new GVN is.
Are you referring to the LeaderTable?

-Aditya

sebpop updated this revision to Diff 52632.Apr 4 2016, 3:45 PM
sebpop removed rL LLVM as the repository for this revision.

Addressed comments from Chad and Danny.
Added more testcases and a flag to limit the O(n^2) behavior.

sebpop marked 2 inline comments as done.Apr 4 2016, 3:46 PM
chandlerc edited edge metadata.Apr 4 2016, 7:57 PM
chandlerc added a subscriber: chandlerc.

Folks, you didn't follow the specific guidance about how to create a patch
with Phabricator. Specifically, when you created the revision, the
'llvm-commits' list was not CC'ed.

As a result, I suspect a large number of folks are getting inline comments
with *zero context* about what this patch even is. That includes myself. It
makes it super hard to even understand what is going on.

Please nuke this entry on Phab, and create a fresh one with the mailing
list attached. That way it will send a nice original email with the actual
patch file attached.

It would also be really nice if the summary were substantially more useful
than "code hoisting using GVN" which sounds like adding a feature to GVN.
The actual description in PHab seems *very* different:
"""
This pass hoists common computations across branches sharing common
immediate dominator. Like early-cse, the primary goal of early-gvn is to
reduce the size of functions before inline heuristics to reduce the total
cost of function inlining.
"""

So this is actually introducing a totally new pass?? Very confused. Waiting
for a fresh code review to make detailed comments.

We don't want to add flags to control time when we know of better
algorithms that can do it.
Please, let's not add more N^2 algorithms to the compiler just because it
was a bit faster to write. Let's just do it well the first time :)

https://github.com/dberlin/llvm-gvn-rewrite/tree/newgvn2

(and in particular,
https://github.com/dberlin/llvm-gvn-rewrite/blob/newgvn2/lib/Transforms/Scalar/NewGVN.cpp
)

I'm rewriting GVN.
MemorySSA, recently committed, was the first step along this route.

It's at the point where it works, it needs to be broken down again and
pieces sent for review.

in any case, you can see struct congruenceclass, and that tracks all the
expressions that belong to a given congruence class (the memberset).

It even tells you thinks that are equivalent to that congruence class in
subparts of the CFG, or can be coerced through type changes to be the same
as things in the class, etc.

Note: It catches 99% of the things GVN does, and plenty it doesn't. I have
no plans on porting every possible optimization to it (and that would
include code hoisting :P)