Page MenuHomePhabricator

[PR16756] JumpThreading: explicitly update SSA rather than use SSAUpdater.
ClosedPublic

Authored by mzolotukhin on Mar 8 2018, 5:16 PM.

Details

Summary

SSAUpdater is often a bottleneck in JumpThreading, and one of the reasons is
that it performs a lot of unnecessary computations (DT/IDF) over and over
again. This patch implements a classic algorithm for PHI-nodes placement and
uses JT-specific properties to optimize it (namely: we only have two blocks with
definitions and these blocks are the same for all instructions we're going to
rewrite).

With this patch the test from PR16756 speeds-up by ~2x, while the time spent in
JumpThreading goes down by ~4x.

Before the patch:

Total Execution Time: 26.6205 seconds (26.6232 wall clock)
 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
13.3016 ( 50.6%)   0.0167 (  4.7%)  13.3183 ( 50.0%)  13.3190 ( 50.0%)  Jump Threading
 5.3226 ( 20.3%)   0.0170 (  4.8%)   5.3397 ( 20.1%)   5.3408 ( 20.1%)  Jump Threading
 1.7753 (  6.8%)   0.0020 (  0.6%)   1.7772 (  6.7%)   1.7772 (  6.7%)  SLP Vectorizer
 1.1617 (  4.4%)   0.1579 ( 44.1%)   1.3195 (  5.0%)   1.3199 (  5.0%)  X86 DAG->DAG Instruction Selection

With the patch:

Total Execution Time: 12.8328 seconds (12.8335 wall clock)
 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
 4.5331 ( 36.3%)   0.0135 (  4.1%)   4.5466 ( 35.4%)   4.5471 ( 35.4%)  Jump Threading
 1.7649 ( 14.1%)   0.0015 (  0.5%)   1.7664 ( 13.8%)   1.7665 ( 13.8%)  SLP Vectorizer
 1.1914 (  9.5%)   0.1594 ( 47.9%)   1.3507 ( 10.5%)   1.3510 ( 10.5%)  X86 DAG->DAG Instruction Selection
 0.8300 (  6.6%)   0.0034 (  1.0%)   0.8333 (  6.5%)   0.8333 (  6.5%)  Jump Threading

Also, some archeology. Some time ago there was a patch for an alternative
SSAUpdater implementation (https://reviews.llvm.org/D28934). It was not
committed back then, but I decided to try it first. Initially, it also showed
huge speed-ups, but then I discovered a couple of bugs, fixes for which ate a
big chunk of the speedups (although the new implementation still was
significantly faster than the existing one). I can upload an updated version of
that patch if there is an interest, but in JumpThreading I decided to pursue
another path: the D28934 implementation, being much faster than the existing
one, still does not scale very well. The algorithm used there is very efficient
for small incremental updates, but doesn't scale well for bulk updates (e.g.
when we have to update big number of instructions), as there is not much to
share across the algorithm invocations. In contrast, the approach I propose here
aims at reusing as much computations as possible.

Compile time impact: no noticable changes on CTMark, a big improvement on the
test from PR16756.

Diff Detail

Repository
rL LLVM

Event Timeline

mzolotukhin created this revision.Mar 8 2018, 5:16 PM
davide added a comment.Mar 8 2018, 5:30 PM

I'm going to review this in detail, but I have a general comment about the design.
I'm a little torn about the approach you're using. I think it's a good experiment because you're leveraging the structure of the problem to find a fundamentally more efficient solution. This is, FWIW, not really surprising.
OTOH, this is bad because you're de-facto rewriting a specialized SSA updater. We might need another one for, e.g. LCSSA, and another one for GVN/PRE (until NewGVN is in or we rewrite PRE to not use the updater).
This results in basically a bunch of code duplicated. History shows (and you know this very well) that it turned out to be a long tail of bugs when we did something similar with the dominator (have specialized updates per-pass). If we really want to go this route, we really need to analyze carefully all the implications.

Yeah, I have such feelings regarding this patch too.

I've been thinking about how to make this more general and reusable: it should be possible to implement something like BulkSSAUpdater (it probably might use a better name), and it will at least won't recompute DT on every iteration. I don't think it would be worth it to expose in its interface a flag showing that all definitions are in the same blocks or something like this, so we'll have to recompute IDF for every instruction we process. It still will be a win compared to the existing implementation, but loss in term of compile time compared to this implementation.

When I submitted this patch, I expected that it would trigger discussions, and that was one of the reasons I wanted to put it out :) I'm eager to suggestions on how to do that better.

Thanks,
Michael

So, FWIW:
It's definitely possible to make the SSAUpdater rewrite faster in bulk insertion. I just never bothered because it wasn't used that way :)
You will never make it as fast as exploiting the problem structure, but it could be made a *lot* faster.

As for this one:
There are approaches to reuse a lot of the phi placement computation in bulk. It's also possible to close IDF recomputation under DT update (IE incrementally recompute IDF).

You can see an example of something like that in the TDMSC algorithms here:https://dl.acm.org/citation.cfm?id=1065890

It only relies on the DJ graph, and it's possible, at maximum, to invalidate only the merge sets as DT changes.
(We'd need to get info from DT about what happened during an update, but that seems easy enough)

All depends how far down this rabbit hole you want to go.

As for the pass-specific vs bulk updating, yeah, this is the eternal debate.
JT itself doesn't make too many types of transforms, and i wonder if you could just verify them all working with the ssa updater with all the types of cfgs or something.
(This would be fragile in other ways, but at least would give you confidence the thing works)

I think it also depensd on how much faster you can make the generic bulk updater. For example, if you get it to 0.9x the speed of the impl you have here, i suspect that's plenty good enough.
But i'm also not the one doing the work :)

I think so far the consensus was that I need to try implementing it in a more general way. I'll do that and come back!

Thanks,
Michael

If you want to update using DJ-Graphs and merge sets, I have implemented them in the global-scheduler which I thought could be reused: https://reviews.llvm.org/D32140
Please let me know if you find this useful.

Thanks! My current plan is to simply move the part that I wrote to a separate class and make it usable from other places. If I manage to keep most of the gains, then it would probably be the easiest way to resolve the issue.

Michael

  • Factor out SSA updating logic to a separate class SSAUpdaterBulk.

Hi,

I created a separate class for bulk SSA updates. With that implementation, we recompute IDF for every individual variable, but usually the subgraph we're working with is smaller. In the previous implementation we computed IDF once for the union of these subgraphs - that was faster, but also we might accidentally insert unneeded phi-nodes, which we later had to clean-up.

I tried to make the interface close to the existing one, and included an example of how it can be used in JumpThreading (if this change is approved, I'll commit these parts separately). With this change, the speedup on the original test is smaller, but still quite good:

===-------------------------------------------------------------------------===
  Total Execution Time: 15.0158 seconds (15.0163 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   4.5962 ( 31.5%)   0.0170 (  3.8%)   4.6132 ( 30.7%)   4.6142 ( 30.7%)  Jump Threading
   3.1789 ( 21.8%)   0.0049 (  1.1%)   3.1838 ( 21.2%)   3.1838 ( 21.2%)  Jump Threading
   1.2104 (  8.3%)   0.1836 ( 41.5%)   1.3940 (  9.3%)   1.3942 (  9.3%)  X86 DAG->DAG Instruction Selection
   1.3558 (  9.3%)   0.0020 (  0.5%)   1.3578 (  9.0%)   1.3577 (  9.0%)  SLP Vectorizer

What do you think?

Thanks,
Michael

PS: Ideas for a better class name are welcome!

I don't mind this approach, but as discussed offline we should consider also moving LCSSA to make sure the API makes sense.
In general, what's your plan for this? You want it to replace the SSA updater for all the instances in llvm? If so, we should carefully plan the transition costs.

@zhendongsu could do a round of testing before we decide whether this can go in to shake bugs.

Also, do you know why the second invocation of Jump threading is taking so long?

llvm/lib/Transforms/Utils/SSAUpdaterBulk.cpp
25–26 ↗(On Diff #138097)

should this be ssaupdaterbulk?

52–54 ↗(On Diff #138097)

ternary

111 ↗(On Diff #138097)

auto

113 ↗(On Diff #138097)

auto

davide added inline comments.Mar 25 2018, 6:45 PM
llvm/lib/Transforms/Scalar/JumpThreading.cpp
2013 ↗(On Diff #138097)

typo: s/ee/we/

2036 ↗(On Diff #138097)

Why do you need to flush the dominator here? please add a comment.

  • Rebase.
  • Address Davide's remarks.

I don't mind this approach, but as discussed offline we should consider also moving LCSSA to make sure the API makes sense.
In general, what's your plan for this? You want it to replace the SSA updater for all the instances in llvm? If so, we should carefully plan the transition costs.

I looked into LCSSA, and indeed it seems that it also can be improved. I tried direct replacement of the old SSAUpdater with the new one, but that didn't give any benefits. However, I think we can simplify the code in LCSSA by passing LoopInfo to SSAUpdaterBulk, which will then use it to insert a phi node whenever it crosses a loop boundary when rewriting a use. I don't know yet how much work it would take, but I don't think it would require much rewritings - it would probably be an addition to what we currently have in this patch.

Also, do you know why the second invocation of Jump threading is taking so long?

I assume by 'taking so long' you mean 'taking so long compared to the first version of this patch', because even this "slow" version is ~50% faster than what we have in trunk. It is most probably caused by the fact that we recompute IDFs for every rewriting while in the first version we computed it once for a united subgraph. Actually, both approaches have "good" and "bad" cases: I can imagine a set of subgraphs for which computing individual IDFs N times would be faster than computing a united IDF, but seemingly in this particular case we have an opposite example.

Thanks,
Michael

I have no other comments on this, which looks fine. I'd add some unittest(s), as we have in the SSAUpdater and I think we should be good to do.
@dberlin what do you think?

I'm fine with this, we can always improve it more later.

davide accepted this revision.Apr 9 2018, 12:35 PM
This revision is now accepted and ready to land.Apr 9 2018, 12:35 PM
This revision was automatically updated to reflect the committed changes.

Thanks! I committed the patch in two parts: r329643 (Add SSAUpdaterBulk) and r329644 (Use SSAUpdaterBulk in JumpThreading).

Michael

I've bisected a recent regression we discovered in Rust to this commit and cc'd some of y'all on the bug there, but if anyone else would like to help take a look at the performance regression there that'd be much appreciated!