
[SCCP] Add Function Specialization pass
ClosedPublic

Authored by SjoerdMeijer on Dec 27 2020, 7:25 AM.

Details

Summary

This patch adds a function specialization pass to LLVM. The constant parameters like function pointers and constant globals are propagated to the callee by specializing the function.

Current limitations:

  • It does not handle specialization of recursive functions,
  • It does not yet handle constants and constant ranges,
  • Only 1 argument per function is specialised,
  • The cost-model could be further looked into,
  • We are not yet caching analysis results.

The pass is based on the Function specialization proposed here.

Diff Detail

Event Timeline

ChuanqiXu added a comment.EditedMay 17 2021, 3:51 AM

Friendly ping: I am looking for an LGTM for this so that we can continue with the next steps.

This patch looks good to me. But I think I am not the right person to accept it : )

BTW, it looks like we need to format this.

I am wondering if we can remove the original IPSCCP once we make this strong enough?

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
433–439

It looks like we left this work for the original IPSCCP to handle. Can we handle it in this pass?

davide removed a reviewer: davide.May 19 2021, 10:33 AM

For the motivating case from MCF, there is actually code in InlineCost Analysis to handle it. See InlineCostCallAnalyzer::onLoweredCall(..). If it is not already handled there, it is a bug to be fixed there.

It is hard to believe that inlining could handle the case in MCF. In MCF there is a function spec_qsort with signature (void*, size_t, size_t, int*(const void*, const void*)), and there are two calls to spec_qsort:

spec_qsort(perm + 1, basket_sizes[thread], sizeof(BASKET*),
            (int (*)(const void *, const void *))cost_compare);
// and
spec_qsort(arcs_pointer_sorted[thread], new_arcs_array[thread], sizeof(arc_p),
                (int (*)(const void *, const void *))arc_compare);

The benefit comes from function specialization specializing spec_qsort for cost_compare and arc_compare; finally, cost_compare and arc_compare are inlined into the specialized functions. In my mind, without function specialization the compiler couldn't determine the value of the function pointer at compile time, so the inlining couldn't happen.

When compiler analyzes the callsite to spec_qsort, it sees the constant function pointer passed to the callee. During the callee body analysis, when the indirect callsite is seen, it will be simplified/evaluated to be the callsite constant thus the compare call can be resolved. If compiler decides that the compare call can be further inlined, it will give the original callsite additional bonus for inlining. Once spec_qsort is inlined, the compare function call will be inlined further.

I can imagine a fix in the inliner cost analysis to give the call to spec_qsort a bigger bonus when it sees that the compare function is also locally hot (e.g used in a loop). Does it make sense?

It doesn't make sense to me. There are two reasons:

  • We can't inline everything. It may be possible to inline spec_qsort by giving it a bigger bonus, but we could easily construct an example where a function taking a function pointer parameter is too large to inline.
  • Consider the following example:
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(cost_compare);
spec_qsort(cost_compare);
spec_qsort(cost_compare);

Then we could construct two specializations of spec_qsort to solve the problem, whereas inlining would have to happen seven times. As spec_qsort gets bigger, the difference becomes clearer.

If the function pointer points to a function too large to be inlined, I expect the benefit of cloning the caller will also be minimal.

It is possible that the 'spec_sort' like function is very large so it won't be inlined no matter what. In such as case the cloning may be helpful.

What I am saying is that for the motivating example in MCF, it may be possible to fix the inliner.

If the function pointer points to a function too large to be inlined, I expect the benefit of cloning the caller will also be minimal.

It depends on the number of callsites. But for the example in mcf, I agree with you that cloning may not be more beneficial than inlining.

What I am saying is that for the motivating example in MCF, it may be possible to fix the inliner.

I agree that it may be possible to inline spec_qsort by adjusting the bonus. What I am saying is that function specialization is valuable and important.

If it is valuable and important, definitely go for it :)

My question is: other than MCF, do we have other real-world apps that can benefit from this optimization (in a way that can not be done by the inliner)? I saw omnetpp also improves; do you know where it helps? Also with PGO, do we see a similar improvement?

Asking these because there are compile-time and code-size consequences.

What about C-ray? Does function specialization help?

Anyway, it would be good to teach the inliner to inline that one important function in the c-ray benchmark. GCC has a heuristic along the lines of: “if we inline this function into a loop, does something become loop invariant? If yes, inline!”

I saw omnetpp also improves, do you know where it helps ?

In the past we measured omnetpp with 10~20 iterations; now the default iteration limit is 1. I didn't look into where it helps, but I will try to if possible.

Also with PGO, do we see similar improvement?

Good question, we need to experiment to answer it.

Asking these because there are compile time and code size consequences

When we limit the iterations to one, the impact looks modest, as shown by the data provided by @SjoerdMeijer.

My question is other than MCF, do we have other real world app that can benefit from this optimization (that can not be done by inliner)?

For this question, my answer is that, judging from the cost model I have looked at, the current implementation may have little effect on real-world applications. My thought is that tuning the cost model should be a long-term process; it is hard to imagine that we can make it good enough in a single commit.


BTW, from the discussion, we agree that cloning is beneficial for the cases that inlining can't handle. But in the current pass pipeline, function specialization runs before the inlining pass. I am not sure whether that is harmful, but I wonder if it would be better to run the inlining pass before function specialization, and, if function specialization made changes, run the inliner again.


In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner. Similar to the inliner, it may be better to run it after the callsite splitting pass to expose opportunities:

if (...) {
  a = 0;
} else {
  a = g;
}
foo(a);

->

if (...) {
  foo(0);   // specialization
} else {
  foo(g);
}

Thank you for further discussing this.

I have one more data point to add, related to compile-times which is probably the biggest concern for this work:

  • Timed compilation of a non-LTO LLVM release build increases compile time by 0.6% and binary size by 0.003%. The code-size increase is not the point here. The point is that function specialisation triggers, with a very modest compile-time increase.

Other compile-times discussed in this very lengthy thread show similar trends (for SPEC, MySQL).

Sorry for repeating myself, but I am still looking for someone to bless this work. Let me say a few things on this:

  • this will certainly need more tuning and experimentation,
  • the goal is to get this on by default, like GCC has it, but that depends on the previous point,
  • getting this all right and enabled by default with the first commit is probably not achievable, which is why I am looking for a first commit,
  • and given the interest in this work we can start sharing some of this when it is in tree.

@fhahn, as the kind of biggest supporter of "this must be on by default" early in the conversations, are you happy with this?

getting this all right and enabled by default with the first commit is probably not achievable

Maybe yes, well…

If we could pick some of the most obvious and profitable cases and enable this pass for them (simple heuristics, otherwise be conservative), plus, if we have PGO data, specialize very hot functions?

Yeah, I see concerns that another off by default pass is not ideal. Just look how many passes are in tree and disabled.

So +1 if we can run this pass even with conservative heuristics.

nikic added a comment.May 20 2021, 1:48 AM

I'm pretty sure @fhahn did not mean to suggest that the pass is actually default enabled when it lands, merely that there is some viable path towards enabling it in the future. Adding a pass and enabling it in the default pipeline are always two separate steps. And it does sound to me like there is a viable path forward here, so this is mainly waiting on someone to perform a technical review of the implementation (that someone would preferably be @fhahn, as the local IPSCCP expert :)

PGO is definitely interesting. Actually quite a few interesting ideas have been brought up during discussions:

  • An idea was expressed that, with a function specialisation attribute on arguments, we could have a direct way of specialising, avoiding the cost-model,
  • Currently we only support constants, not constant ranges,
  • and I see that PGO could definitely help.

But I think these things can be best added once we have an initial implementation.

Thanks @nikic for your thoughts:

I agree with this, and I think I have shown that there is a viable path to get this enabled (again, this is our goal, because we want to reach parity with GCC here).
We have performed quite a few rounds of code review. Although there's obviously more to do, my impression was that people were reluctant to approve this as a first version because of the "on by default" discussion, and not so much because of other things.

My question is other than MCF, do we have other real world app that can benefit from this optimization (that can not be done by inliner)?

An alternative perspective. An inliner does two things. It elides call setup/return code, and it specialises the function on the call site. These can be, and probably should be, separable concerns.

Today we inline very aggressively, which is suboptimal for platforms with code size (or cache) restrictions, but does give the call site specialisation effect. So this patch, today, needs a function large enough to escape the inliner in order to see a benefit. Examples will be things like qsort or large switch statements on a parameter.

With a specialisation pass in tree we can start backing off the inliner. Calling conventions do have overhead, but I suspect the majority of the performance win of inlining is from the specialisation. If that intuition is sound, then this plus a less aggressive inliner will beat the status quo through better icache utilisation. Performance tests at -Os may validate that expectation.

The benefit of inlining comes from many different areas:

  1. call overhead reduction (call, pro/epilogue)
  2. inline instance body cleanup with callsite contexts (this is what specialization can get)
  3. cross procedure boundary optimizations:
     3.1) PRE, jump threading, etc. between the caller body and inline instances;
     3.2) cross-function optimization between sibling calls (sibling inline instances);
     3.3) better code layout of the inline instance body with the enclosing call context.

This is why with PGO, we have very large inline threshold setup for hot callsites.

For function specialization with PGO, we can use profile data to selectively do function cloning, but then again, it is very likely better to be inlined given its hotness.

I agree function specialization has its place when size is the concern (with -Os), or instruction working set is too large (to hurt performance). We don't have a mechanism to analyze the latter yet.

fhahn added a comment.May 20 2021, 9:12 AM

I'm pretty sure @fhahn did not mean to suggest that the pass is actually default enabled when it lands, merely that there is some viable path towards enabling it in the future. Adding a pass and enabling it in the default pipeline are always two separate steps.

This was indeed how my earlier comments were intended. Some comments on the code itself.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
2

this needs updating.

16

this sounds a bit odd. The first sentence says it handles constant parameters. I guess you mean non-constant-int constants?

47

need test for the option?

58

needs test for the option?

66

Those were added to transition existing code in SCCP to ValueLatticeElement. Ideally new code would be explicit about what it expects (constant-range vs constant-int).

131

Instead of patching up the IR, can we just set the lattice values for the cloned function arguments accordingly until we are done with solving?

173

Could you explain why we need to remove ssa_copy in the clone function?

213

nit: no llvm:: should be needed.

638

Why do we need this transformation here? Is this required for specialization or to catch the MCF case?

767

Why do we need to modify the IR after each call to RunSCCPSolver rather than after all solving is done?

778

Are those declarations? Shouldn't we only track functions with definitions?

799

Is it possible to only replace values once we are completely done with solving?

llvm/lib/Transforms/Scalar/SCCP.cpp
30

All those includes are not needed in the file?

161

Why is this needed?

llvm/test/Transforms/FunctionSpecialization/function-specialization4.ll
8

do those tests depend on information from the AArch64 backend? If so, they should only be executed if the AArch64 backend is built.

Ah cheers, many thanks for the comments and review! Will start addressing them now.

ChuanqiXu added inline comments.May 21 2021, 12:08 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
316–318

Should this be != instead? Or perhaps we should use TTI.getUserCost for the initial value directly. As written, if getUserCost returns non-zero, the value for Cost becomes 0, which seems wrong.

SjoerdMeijer added inline comments.Mon, May 24, 3:50 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
131

I might not get your suggestion, but the solving is done much earlier. I.e., RunSCCPSolver is done in runFunctionSpecialization. This here is the final step that performs the actual transformation.

316–318

Yeah, am fixing and rewriting this to use InstructionCost.

638

It's also not yet covered by tests, so I will remove it for now. Even if it is needed for MCF (I can't remember), it looks like a nice candidate for a follow-up once we've got something in.

799

Will remove this (see also earlier comment); this is a bit of a different optimisation that we can look at later.

SjoerdMeijer added inline comments.Mon, May 24, 4:00 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
778

Yeah, these should be functions with definitions. There's an F.isDeclaration() check earlier in this function.

SjoerdMeijer added inline comments.Mon, May 24, 5:30 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
173

I can't. :-) Not yet at least, so will also remove this code (for now).

This addresses most remarks:

  • transitioned to using InstructionCosts,
  • added tests/testing,
  • removed unnecessary things.

It seems like the patch just needs to be formatted : )

Ran clang-format.

@fhahn: about running the solver first before making IR changes: I think that is happening already. There are some places where the solver is kept up to date after transformations. I think this is a remnant of running function specialisation iteratively, which I stripped out for now, but it's probably good to keep these updates so the solver stays consistent.

fhahn added a comment.Tue, Jun 1, 2:00 AM

I'm not sure I'm looking at the right thing, but it looks like the specialization is still run iteratively via the code below? RunSCCPSolver seems to modify the IR after solving. Maybe I am looking at the wrong thing?

while (FuncSpecializationMaxIters != I++ &&
       FS.specializeFunctions(FuncDecls, CurrentSpecializations)) {

  // Run solver for the specialized functions only.
  RunSCCPSolver(CurrentSpecializations);

  CurrentSpecializations.clear();
  Changed = true;
}

Do we have test cases where the specialization leads to the function result of the specialized function becoming another constant, enabling further simplification in the caller?

llvm/test/Transforms/FunctionSpecialization/function-specialization-recursive.ll
4

can you elaborate what is going wrong here and what needs fixing?

Ah okay, you're right: the solver runs, but only for the specialised functions, as the comment says. The main solving happens earlier in runFunctionSpecialization. As I also wrote earlier, I have been stripping a few things out of the initial version, like running 10 iterations and doing it recursively, to prepare a basic version for an initial commit and get a baseline for the extra compile-time costs. I will get rid of invoking the solver here, add a TODO, and also remove the disabled test for now. From memory, I think it needed more iterations to kick in, but I will look at that later in a follow-up.

Do we have test cases where the specialization leads to the function result of the specialized function becoming another constant, enabling further simplification in the caller?

Yeah, I think test/Transforms/FunctionSpecialization/function-specialization.ll is an example of that: the specialised functions completely disappear because of further simplifications.

Addressed @fhahn 's comments: don't run the solver for specialised functions, removed the recursive specialization test for now.

wenlei added a subscriber: wenlei.Fri, Jun 4, 9:49 AM

We also saw ipa-cp-clone being a very noticeable difference between gcc and llvm when we tried to move workloads from gcc to llvm. Thanks for working on this for llvm optimization parity with gcc.

I think specialization can be beneficial when we can't do very selective inlining - it can realize part of the benefit of inlining with a more moderate size increase. I suspect the benefit of ICP from specialization will be much lower when PGO is used though, because PGO's speculative ICP is quite effective. For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants, on top of PGO.

When this is ready, I'd be happy to give it a try on our larger workload to see if there's any interesting hit and how it compares against gcc's ipa-cp-clone for the same workload, all with PGO.

Also with PGO, do we see similar improvement?

Good question, we need to experiment to answer it.

@ChuanqiXu FWIW, PGO always does speculative ICP from spec_qsort to cost_compare and arc_compare in spec2017/mcf.

I agree function specialization has its place when size is the concern (with -Os), or instruction working set is too large (to hurt performance). We don't have a mechanism to analyze the latter yet.

@davidxl you're right that we can't model the latter, but the same working-set concern is true for inlining too, and yet we still need to control size bloat from inlining (through inline cost etc.) even though we can't model it accurately. I think specialization can be another layer that allows the compiler to fine-tune that balance when needed. I'm curious: have you tried turning off ipa-cp-clone for gcc for your workloads, and does that lead to a noticeable perf difference?

In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner.

Agreed. Though from a compile-time perspective, I'm wondering: if we run specialization before inlining, would we be paying the cost of specializing functions that will eventually be inlined, in which case the extra cost may yield nothing in perf? If we do specialization after inlining, we'll only be targeting those deemed not worth inlining.

llvm/lib/Passes/PassBuilder.cpp
1137

With LTO or ThinLTO, would it make sense to schedule the pass in post-link only?

buildModuleSimplificationPipeline is used by both prelink and postlink for ThinLTO.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
292

We could replace this AvgLoopIterationCount tuning knob with real frequency when PGO/BFI is available.

In order for specialization to be beneficial on top of PGO, I think leveraging counts in cost/benefit analysis is important. Doing that in later patch is fine.

We also saw ipa-cp-clone being a very noticeable difference between gcc and llvm when we tried to move workloads from gcc to llvm. Thanks for working on this for llvm optimization parity with gcc.

I think specialization can be beneficial when we can't do very selective inlining - it can realize part of the benefit of inlining with a more moderate size increase. I suspect the benefit of ICP from specialization will be much lower when PGO is used though, because PGO's speculative ICP is quite effective. For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants, on top of PGO.

Agree.

When this is ready, I'd be happy to give it a try on our larger workload to see if there's any interesting hit and how it compares against gcc's ipa-cp-clone for the same workload, all with PGO.

Also with PGO, do we see similar improvement?

Good question, we need to experiment to answer it.

@ChuanqiXu FWIW, PGO always does speculative ICP from spec_qsort to cost_compare and arc_compare in spec2017/mcf.

I agree function specialization has its place when size is the concern (with -Os), or instruction working set is too large (to hurt performance). We don't have a mechanism to analyze the latter yet.

@davidxl you're right that we can't model the latter, but the same working set concern is true for inlining too, and yet we still need to control size bloat from inlining (through inline cost etc) even if we can't model it accurately. I think specialization can be another layer that allows compiler to fine tune that balance when needed. I'm curious have you tried to turn off ipa-cp-clone for gcc for your workloads and whether that leads to noticeable perf difference?

My recollection is that it did not result in much perf difference with FDO with our internal workload -- otherwise it would have been further tuned.

In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner.

Agreed. Though from compile time perspective, I'm wondering if we have specialization before inlining, would we be paying the cost for specialization for functions that will eventually be inlined in which case the extra cost may yield nothing in perf. But if we do specialization after inlining, we'll be targeting those that deemed not worth inlining only.

Makes sense -- this is the reason for iterative optimizations. Partial inlining is in the same situation too.

For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants, on top of PGO

Definitely. That's one of the first things we could investigate once we've got a first version in. I have documented this under "current limitations".

Sorry for the early ping @fhahn, but as you were the last one with comments: are you happy with the latest changes?

ChuanqiXu accepted this revision.Tue, Jun 8, 3:02 AM

Thanks for adding me as a reviewer.
I am new to the LLVM community and the SCCP module, so I am not sure it is appropriate for me to accept it; just say so if anyone is not comfortable with that.
The overall patch looks good to me as a first version. Although some problems surely remain, I think we can iterate on top of this patch after merging. For example, I am experimenting with function specialization for ThinLTO locally, and from the comments @wenlei gave, PGO with function specialization is also interesting. So after merging we could work on specialization together from different aspects: ThinLTO, PGO, bug fixes, and cost-model refactoring. So I choose to accept it.
Please wait 2~3 days before committing, in case there are more comments.

This revision is now accepted and ready to land.Tue, Jun 8, 3:02 AM
fhahn added a comment.Wed, Jun 9, 2:56 AM

Addressed @fhahn 's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
118

Are those updates to the solver still needed, after not running the solver after specializeFunctions?

Addressed @fhahn 's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

Thanks for the comments. Since they are minor, I will fix them before committing. I.e., I will remove that update to the solver and add the test back.

fhahn added a comment.Wed, Jun 9, 4:29 AM

Addressed @fhahn 's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

Thanks for the comments. Since they are minor, I will fix them before committing. I.e., I will remove that update to the solver and add the test back.

There are at least a few more places that may modify the solver state after the solver is run. I think it would be good to audit them and update the patch. IIUC, in the current version there should be no need to update the solver state after RunSolver. It would also be good to check whether there are any places with updates I missed.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
147

This should also not be needed?

155

also not needed?

234

Is this needed if the solver does not run again?

244

Is this needed if the solver does not run again?

Removed those updates too, and added the test case back. I'd appreciate it if you could have one more quick look.

This revision was landed with ongoing or failed builds.Fri, Jun 11, 1:22 AM
This revision was automatically updated to reflect the committed changes.

Many thanks for all your help and reviews. I have committed this because it was lgtm'ed (officially and also unofficially in comments), no major changes have been made/suggested for a while, and this is a first version that is off by default. And this is also just the beginning of function specialisation. I will continue working on this and will be happy to receive post-commit comments for this version, and of course comments for the work that I am planning and starting to do. I.e., I will now start working on some of the limitations before I start thinking how to eventually get this enabled by default.

fhahn added inline comments.Fri, Jun 11, 1:34 AM
llvm/include/llvm/Transforms/Utils/SCCPSolver.h
146

unused?

148

unused?

149

unused?

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
216

Is this needed in the latest version? If it is not needed, please also remove it from the interface.

509

stray newline?

fhahn added a comment.Fri, Jun 11, 1:40 AM

I'm not quite sure why the implementation is in llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp rather than llvm/lib/Transforms/IPO/FunctionSpecialization.cpp? It's interprocedural only, right? The implementation in SCCP.cpp works on both a single function and a module, which is probably why it's in the Scalar sub-directory. But this is a separate pass from IPSCCP, so I am not sure the pass definition should be in llvm/lib/Transforms/IPO/SCCP.cpp.

fhahn added inline comments.Fri, Jun 11, 1:49 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
66

Looks like this was not addressed. If the function is kept, please at least update the comment, as it is misleading at the moment (it also returns false if LV is a constant range with more than a single element; see the comment for the same function in SCCP.cpp).

250

NumFuncSpecialized is defined as STATISTIC(NumFuncSpecialized, "Number of Functions Specialized"); how will this work when LLVM is built without statistics? (Also: redundant brackets.)

I'm not quite sure why the implementation is in llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp rather than llvm/lib/Transforms/IPO/FunctionSpecialization.cpp? It's interprocedural only, right? The implementation in SCCP.cpp works on both a single function and a module, which is probably why it's in the Scalar sub-directory. But this is a separate pass from IPSCCP, so I am not sure the pass definition should be in llvm/lib/Transforms/IPO/SCCP.cpp.

Thanks, I will follow up on this shortly (i.e., will prepare a diff). I can't remember if there was something with this pass definition, but will look.

fhahn added inline comments.Fri, Jun 11, 2:47 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
462

Looks like this is not covered by a test. Would be good to have one.