This is an archive of the discontinued LLVM Phabricator instance.

[IPSCCP] Add function specialization ability
AbandonedPublic

Authored by mssimpso on Aug 7 2017, 3:11 PM.

Details

Summary

This patch adds function specialization to IPSCCP. After an initial IPSCCP data-flow is solved, we may find that a function argument can take on a particular constant value in some contexts. If this is the case, we can create a new version of the function in which the argument is replaced by the constant value and then recompute the data-flow. This allows constant propagation to cross function boundaries when an argument can take on more than one value.

Specialization is controlled by a goal-oriented heuristic that seeks to predict if replacing an argument with a particular constant value would result in any significant optimization opportunities. Currently, we limit this heuristic to exposing inlining opportunities via indirect call promotion, but the kinds of optimizations we look for can be extended in the future.

Specialization can be disabled by -ipsccp-enable-function-specialization=false.

Function cloning in constant propagation was mentioned on llvm-dev recently, so I thought I would throw this up for review. I've added a simple demonstration test, but we can add more tests to the patch if the approach looks good.

Long ago, there was a separate partial specialization pass. However, this pass was removed (in r123554 and r152759) apparently due to known bugs and lack of maintenance.

Please take a look.

Event Timeline

mssimpso created this revision.Aug 7 2017, 3:11 PM
fhahn added a subscriber: fhahn.Aug 7 2017, 3:46 PM
davide edited edge metadata.Aug 8 2017, 7:57 AM

Very neat work, thanks, this was in my todolist for a while :)
Some meta comments, I'll try to find some time to review this more carefully this week.

I wonder if we're at a point where it makes sense to split the lattice solver from the transforms, as with your partial specialization code, SCCP is growing a lot :)
This would also have a bunch of other advantages, e.g. we could more easily try to plug in arbitrary lattices (for example for variables/ranges/bits).
IIRC Chris Lattner started a propagation engine a while ago; it should still be in tree, but unused. What are your thoughts?

That said, the approach I had in mind (as mentioned on llvm-dev) was that of:

  1. implementing "real" jump functions for IPSCCP. To the best of my understanding, complicated jump functions never got traction (so, e.g., GCC implements only constant and passthrough).
  2. Propagate the information down the DAG of SCCs (of the callgraph). From what I can see your approach is instead iterative, so every time you specialize something, you do another round of solving. I'm not sure whether this converges more slowly, or if it matters in practice; probably worth taking some numbers.

I think your heuristic is on the right track (it's the same one I used in an early prototype), but it would be nice to collect numbers to tune it further.

I'm a little bit reluctant about having this enabled by default (in the beginning), as unfortunately cloning can have catastrophic consequences (I had a chat with the GCC folks where they reported substantial growth in the executable sizes for large programs without real speedups, and I also had an early prototype that confirmed this thesis).

Thanks again for working on this!

efriedma edited edge metadata.Aug 8 2017, 10:52 AM

The way this is written, there isn't much point to sticking the code into IPSCCP; you could just as easily make this a separate pass which runs afterwards. (The only thing you're getting is constant propagation inside the cloned function, which you could just as easily run separately.)

lib/Transforms/Scalar/SCCP.cpp
2279

This loop doesn't work the way you want it to. IPSCCP isn't actually complete until the "while (ResolvedUndefs)" loop finishes, and specializing based on a partially unsolved Solver probably won't give you the results you want.

Very neat work, thanks, this was in my todolist for a while :)
Some meta comments, I'll try to find some time to review this more carefully this week.

Thanks for taking a look!

I wonder if we're at a point where it makes sense to split the lattice solver from the transforms, as with your partial specialization code, SCCP is growing a lot :)
This would also have a bunch of other advantages, e.g. we could more easily try to plug in arbitrary lattices (for example for variables/ranges/bits).
IIRC Chris Lattner started a propagation engine a while ago; it should still be in tree, but unused. What are your thoughts?

Yes, I think I see what you're talking about. Analysis/SparsePropagation.h? I haven't looked at this closely yet, but I can see some advantages to using it. Any idea why SCCP hasn't already been switched over to use it?

That said, the approach I had in mind (as mentioned on llvm-dev) was that of:

  1. implementing "real" jump functions for IPSCCP. To the best of my understanding, complicated jump functions never got traction (so, e.g., GCC implements only constant and passthrough).
  2. Propagate the information down the DAG of SCCs (of the callgraph). From what I can see your approach is instead iterative, so every time you specialize something, you do another round of solving. I'm not sure whether this converges more slowly, or if it matters in practice; probably worth taking some numbers.

Right, I've incorporated specialization as an iterative solve. I'm not an expert in this area, but the iterative approach seemed to be the most straightforward extension of our current IPSCCP implementation (which seems to be more-or-less the Wegman, Zadeck approach). It makes us iterate, but we only need to visit the specializations on subsequent iterations.

(In the case of indirect call promotion, we could also revisit the called functions, but the current patch doesn't do this. For example, after call promotion, a function may no longer be address-taken, enabling us to then track its arguments. We would have to reset their lattice state, though, since they would have already made it to overdefined.)

Am I right in thinking the comment about jump functions is about our IPSCCP infrastructure in general, rather than specialization specifically? If I understand correctly, jump functions (like in Callahan et al.) are a trade-off between the precision of the analysis and the time/space needed to solve it. They summarize functions and then propagate the summaries through call sites. I think our current IPSCCP implementation should (at least in theory) be more precise than this. For example, if we find out via constant propagation that a call is not executed, this could enable more propagation/specialization, etc. But again, this is a trade-off since we may have to revisit a large portion of the module. I would definitely appreciate your thoughts.

I think your heuristic is on the right track (it's the same one I used in an early prototype), but it would be nice to collect numbers to tune it further.

Here are some numbers (AArch64, LTO). It looks like the current tuning is fairly conservative for the LLVM test suite and SPEC. We only hit a handful of the benchmarks, and of the ones we do, spec2017/mcf is the big winner (16% faster). MultiSource/Benchmarks/McCat/12-IOtest is also improved. I would expect us to hit more benchmarks if/when we consider other optimization opportunities beyond indirect call promotion. The table below shows the performance increase, the code size increase, the number of specialized functions, and the number of specializations of those functions. It also shows the number of extra iterations of the IPSCCP solve loop that were performed due to specialization. (I'm in Phabricator, so hopefully this table will look OK.)

| Benchmark | Perf % (+ faster) | Size % (+ larger) | # Added Solver Iters | # Specializations | # Specialized Funcs |
|---|---|---|---|---|---|
| llvm/MultiSource/Applications/SPASS | 0.30 | 0.03 | 2 | 18 | 6 |
| llvm/MultiSource/Applications/lua | 0.00 | 0.04 | 1 | 2 | 1 |
| llvm/MultiSource/Applications/siod | -0.17 | 0.21 | 0 | 2 | 1 |
| llvm/MultiSource/Benchmarks/McCat/12-IOtest | 7.04 | -7.20 | 2 | 18 | 1 |
| llvm/MultiSource/Benchmarks/Ptrdist/bc | 0.11 | 1.30 | 1 | 1 | 1 |
| spec2000/perlbmk | -0.25 | -0.00 | 1 | 3 | 1 |
| spec2006/gcc | 0.29 | 0.32 | 1 | 43 | 6 |
| spec2006/gobmk | 0.03 | 0.10 | 2 | 4 | 3 |
| spec2006/hmmer | 0.00 | 0.57 | 1 | 5 | 1 |
| spec2017/blender | 0.58 | 0.84 | 14 | 144 | 144 |
| spec2017/gcc | 0.08 | 1.54 | 8 | 141 | 22 |
| spec2017/imagick | -0.44 | -0.06 | 1 | 5 | 1 |
| spec2017/mcf | 16.01 | 19.26 | 1 | 2 | 1 |
| spec2017/parest | 0.09 | -0.02 | 0 | 1 | 1 |
| spec2017/perlbench | -0.44 | 0.15 | 2 | 5 | 2 |
| spec2017/povray | -0.28 | 2.19 | 1 | 5 | 1 |
| spec2017/xz | 0.31 | 0.91 | 2 | 5 | 3 |

I'm a little bit reluctant about having this enabled by default (in the beginning), as unfortunately cloning can have catastrophic consequences (I had a chat with the GCC folks where they reported substantial growth in the executable sizes for large programs without real speedups, and I also had an early prototype that confirmed this thesis).

Thanks again for working on this!

Makes sense. Hopefully folks could try out whatever we end up with on some larger programs.

The way this is written, there isn't much point to sticking the code into IPSCCP; you could just as easily make this a separate pass which runs afterwards. (The only thing you're getting is constant propagation inside the cloned function, which you could just as easily run separately.)

Thanks for the comments! The old specialization pass was indeed separate. I think (long term) it makes sense to incorporate this into IPSCCP. It could be the case that specialization enables further propagation, which then enables additional specialization, etc. I don't think we could get that kind of behavior out of a separate pass. But if we only want to care about indirect call promotion, we could probably achieve the same thing with a separate pass that runs before constant propagation.

lib/Transforms/Scalar/SCCP.cpp
2279

I'm not sure this is really a problem. Sure IPSCCP isn't yet complete, but ResolvedUndefsIn only moves values from unknown to constant. If an actual argument does become constant while resolving undefs, we should analyze the corresponding (formal argument, constant) pair in getSpecializationBonus the next time we visit the function when looking for specialization opportunities.

mcrosier added inline comments.Aug 9 2017, 11:35 AM
lib/Transforms/Scalar/SCCP.cpp
249

used -> use

2023

The inline cost model already takes into consideration cold call sites, but I was wondering if it would make sense to use this information directly during function specialization.

For example, you could filter out cold calls and avoid the overhead of computing the inline cost in cases when you think inlining likely won't happen or where specialization isn't likely to be beneficial.

Of course, the inline cost computation will bail sooner with a lower threshold (due to the call site being cold), so you might not be saving too much.

Just a thought.

It could be the case that specialization enables further propagation, which then enables additional specialization, etc.

This doesn't really happen at the moment. After you specialize a function, you aren't resetting the solver, so everything that's overdefined will stay overdefined.

lib/Transforms/Scalar/SCCP.cpp
2279

Moving a value from undef to constant can move dependent values from constant to overdefined. So you could end up cloning a function, and then fail to actually do any useful constant propagation into the cloned function.

It could be the case that specialization enables further propagation, which then enables additional specialization, etc.

This doesn't really happen at the moment. After you specialize a function, you aren't resetting the solver, so everything that's overdefined will stay overdefined.

That's right. I think I mentioned this in my response to Davide's comments. This is a current limitation, but is something I think we will want to work towards.

lib/Transforms/Scalar/SCCP.cpp
2279

Ah, right. I think I see what you mean now.

dberlin edited edge metadata.Aug 9 2017, 1:06 PM

Very neat work, thanks, this was in my todolist for a while :)
Some meta comments, I'll try to find some time to review this more carefully this week.

Thanks for taking a look!

I wonder if we're at a point where it makes sense to split the lattice solver from the transforms, as with your partial specialization code, SCCP is growing a lot :)
This would also have a bunch of other advantages, e.g. we could more easily try to plug in arbitrary lattices (for example for variables/ranges/bits).
IIRC Chris Lattner started a propagation engine a while ago; it should still be in tree, but unused. What are your thoughts?

Yes, I think I see what you're talking about. Analysis/SparsePropagation.h? I haven't looked at this closely yet, but I can see some advantages to using it. Any idea why SCCP hasn't already been switched over to use it?

Same as anything else - lack of time for someone ;)

We moved GCC's SCCP to a generic sparse propagation engine.
As for performance - i'm with Davide.
Cloning as the end transformation is nice, but it also tends to be a harder cost model to get right.

GCC was unable to get this right in practice. Most benefit from IPSCCP is producing range info for each context.

Am I right in thinking the comment about jump functions is about our IPSCCP infrastructure in general, rather than specialization specifically? If I understand correctly, jump functions (like in Callahan et al.) are a trade-off between the precision of the analysis and the time/space needed to solve it. They summarize functions and then propagate the summaries through call sites. I think our current IPSCCP implementation should (at least in theory) be more precise than this. For example, if we find out via constant propagation that a call is not executed, this could enable more propagation/specialization, etc. But again, this is a trade-off since we may have to revisit a large portion of the module. I would definitely appreciate your thoughts.

Almost right. Our modeling should be more precise than *most* jump functions. But not all, i believe. I think what we are doing is equivalent to passthrough in reality [1]. Essentially, we have extended the constant prop lattice, for call functions, to do passthrough of arguments and whatever their value is, and not just constants.
But you can imagine a more powerful jump function. For example, the polynomial one. It can be viewed as "extending the constant prop lattice for formal parameters to be polynomial functions", not just constants.

So even if foo(a) and foo(b) are foo(2) and foo(4), maybe in reality it could be expressed as foo(argument * 2) or something. We would not get this. The polynomial jump function would.

So, you are right that as implemented by most(all?) compilers, they use passthrough, and we use passthrough, so we're the same.

[1] I believe what we do can be proven equivalent to passthrough, because we merge the state of incoming arguments from the call sites' parameters, without changing the lattice value. If all arguments just pass through, we will pass through the state. IE without us doing anything else, given a call chain of arguments of foo(a) -> bar(a) -> bob(a), a will remain undefined in our algorithm.

Note: Wegman's SCCP paper covers a variant of IPSCCP.
https://www.cs.utexas.edu/users/lin/cs380c/wegman.pdf
They just link all the SSA procedures together, and run IPSCCP on it. What we do should be identical in practice, i believe.

It should be linear in the total size of the program (or we screwed up :P)
It also talks about integrating it with inlining.

Very neat work, thanks, this was in my todolist for a while :)
Some meta comments, I'll try to find some time to review this more carefully this week.

Thanks for taking a look!

I wonder if we're at a point where it makes sense to split the lattice solver from the transforms, as with your partial specialization code, SCCP is growing a lot :)
This would also have a bunch of other advantages, e.g. we could more easily try to plug in arbitrary lattices (for example for variables/ranges/bits).
IIRC Chris Lattner started a propagation engine a while ago; it should still be in tree, but unused. What are your thoughts?

Yes, I think I see what you're talking about. Analysis/SparsePropagation.h? I haven't looked at this closely yet, but I can see some advantages to using it. Any idea why SCCP hasn't already been switched over to use it?

Same as anything else - lack of time for someone ;)

We moved GCC's SCCP to a generic sparse propagation engine.
As for performance - i'm with Davide.
Cloning as the end transformation is nice, but it also tends to be a harder cost model to get right.

A (somehow) related bug came to my mind https://bugs.llvm.org/show_bug.cgi?id=33253

GCC was unable to get this right in practice. Most benefit from IPSCCP is producing range info for each context.

Am I right in thinking the comment about jump functions is about our IPSCCP infrastructure in general, rather than specialization specifically? If I understand correctly, jump functions (like in Callahan et al.) are a trade-off between the precision of the analysis and the time/space needed to solve it. They summarize functions and then propagate the summaries through call sites. I think our current IPSCCP implementation should (at least in theory) be more precise than this. For example, if we find out via constant propagation that a call is not executed, this could enable more propagation/specialization, etc. But again, this is a trade-off since we may have to revisit a large portion of the module. I would definitely appreciate your thoughts.

Almost right. Our modeling should be more precise than *most* jump functions. But not all, i believe. I think what we are doing is equivalent to passthrough in reality [1]. Essentially, we have extended the constant prop lattice, for call functions, to do passthrough of arguments and whatever their value is, and not just constants.
But you can imagine a more powerful jump function. For example, the polynomial one. It can be viewed as "extending the constant prop lattice for formal parameters to be polynomial functions", not just constants.

So even if foo(a) and foo(b) are foo(2) and foo(4), maybe in reality it could be expressed as foo(argument * 2) or something. We would not get this. The polynomial jump function would.

So, you are right that as implemented by most(all?) compilers, they use passthrough, and we use passthrough, so we're the same.

Yes. When I read "the" paper about jump functions, https://scholarship.rice.edu/handle/1911/13733, I was under the impression that more sophisticated jump functions don't actually buy you much (and in a brief chat with David Callahan [who was around at the time] I got confirmation). That said, this was a long time ago, and many things have changed, so maybe things can be re-evaluated at some point.

FWIW, GCC doesn't even implement return jump functions (see e.g. https://godbolt.org/g/eahtHR ) and relies on inlining/cloning to get things right, so in this respect llvm's constant propagation is actually stronger than GCC's (if I remember correctly, the reason why they don't implement this is that it's a little complicated to handle the case of mutually recursive functions while walking down the DAG of SCCs, but if we'll ever go for real jump functions we should consider implementing return(s) as well).

[1] I believe what we do can be proven equivalent to passthrough, because we merge the state of incoming arguments from the call sites' parameters, without changing the lattice value. If all arguments just pass through, we will pass through the state. IE without us doing anything else, given a call chain of arguments of foo(a) -> bar(a) -> bob(a), a will remain undefined in our algorithm.

Note: Wegman's SCCP paper covers a variant of IPSCCP.
https://www.cs.utexas.edu/users/lin/cs380c/wegman.pdf
They just link all the SSA procedures together, and run IPSCCP on it. What we do should be identical in practice, i believe.

It should be linear in the total size of the program (or we screwed up :P)
It also talks about integrating it with inlining.

I agree this is a good path forward.

Danny/Davide,

Thanks very much for the feedback. I have some replies to your comments below (abridged, so it's easier to read in Phab.), but I thought I would first summarize things so far. It sounds like your main points are that (1) the cost model for function specialization is difficult to get right in practice, and that (2) our current IPSCCP infrastructure could be improved to do better than pass-through for arguments and returns. I agree with both of these points. So do we think we should iterate on this patch and add function specialization to our current infrastructure?

Almost right. Our modeling should be more precise than *most* jump functions. But not all, i believe. I think what we are doing is equivalent to passthrough in reality [1]. Essentially, we have extended the constant prop lattice, for call functions, to do passthrough of arguments and whatever their value is, and not just constants.

Ah, that's right, we're doing the same thing as pass-through. We would probably need to significantly rework our analysis to do anything better (for ranges, etc.).

Note: Wegmans SCCP paper covers a variant of IPSCCP.
https://www.cs.utexas.edu/users/lin/cs380c/wegman.pdf
They just link all the SSA procedures together, and run IPSCCP on it. What we do should be identical in practice, i believe.

Yes, this is indeed the way we do it now, if I understand correctly.

A (somehow) related bug came to my mind https://bugs.llvm.org/show_bug.cgi?id=33253

Interesting. Thanks for the pointer!

I agree this is a good path forward.

I'm not sure if you're talking about the current patch or IPSCCP improvements in general :)

Danny/Davide,

Thanks very much for the feedback. I have some replies to your comments below (abridged, so it's easier to read in Phab.), but I thought I would first summarize things so far. It sounds like your main points are that (1) the cost model for function specialization is difficult to get right in practice, and that (2) our current IPSCCP infrastructure could be improved to do better than pass-through for arguments and returns. I agree with both of these points. So do we think we should iterate on this patch and add function specialization to our current infrastructure?

I'm not opposed if we have cases where it matters.
But if we can't find these cases, it's not gonna be turned on by default, and then i think our time would be better spent elsewhere.
IE i don't think it will be horribly useful to have function specialization but not have it good enough to be worth it by default.
At that point, i think we'd be better off exploring different kinds of improvements (IE store function argument range info) to IPSCCP or different mechanisms of using the existing info (IE inlining).

Danny/Davide,

Thanks very much for the feedback. I have some replies to your comments below (abridged, so it's easier to read in Phab.), but I thought I would first summarize things so far. It sounds like your main points are that (1) the cost model for function specialization is difficult to get right in practice, and that (2) our current IPSCCP infrastructure could be improved to do better than pass-through for arguments and returns. I agree with both of these points. So do we think we should iterate on this patch and add function specialization to our current infrastructure?

I'm not opposed if we have cases where it matters.
But if we can't find these cases, it's not gonna be turned on by default, and then i think our time would be better spent elsewhere.
IE i don't think it will be horribly useful to have function specialization but not have it good enough to be worth it by default.
At that point, i think we'd be better off exploring different kinds of improvements (IE store function argument range info) to IPSCCP or different mechanisms of using the existing info (IE inlining).

I agree. Looking at the biggest performance winners from the current patch (spec2017/mcf, in particular), most of the improvement is coming from inlining (enabled after the indirect call promotion that specialization allows). I wonder if there's a better way to achieve this benefit without specializing. For example, we could propagate constant sets indicating the functions that indirect call sites could possibly target. Although we would probably want to limit the size of the sets to something small, the pass could attach the sets via metadata to the calls so that this information could be consumed by later passes. Such metadata could be used for indirect call promotion, intersecting the function attributes of the possible targets (i.e., in CallSite::hasFnAttr), etc. We could perform this here in SCCP or in a separate pass using the generic solver. What do you think?

mssimpso abandoned this revision.Oct 25 2017, 1:21 PM

I'm abandoning this for now to close the review. Davide/Danny, please feel free to revive this patch if you want. To summarize the main points in the review, we need to work on the cost model more to enable something like this by default.

yaozhongxiao removed a subscriber: yaozhongxiao.