This is an archive of the discontinued LLVM Phabricator instance.

[FuncSpec] Don't specialize functions which are easy to inline
ClosedPublic

Authored by ChuanqiXu on Aug 11 2021, 5:40 AM.

Details

Summary

It would be a waste of time to specialize a function that will ultimately be inlined.
This patch does two things:

  • Don't specialize functions which are marked always-inline.
  • Don't specialize functions whose size (IR instruction count) is below a threshold (100 by default).

For spec2017int, this patch reduces the number of specialized functions by 33%, and compile time did not increase for any benchmark.

Test Plan: check-llvm, spec2017

Diff Detail

Event Timeline

ChuanqiXu created this revision.Aug 11 2021, 5:40 AM
ChuanqiXu requested review of this revision.Aug 11 2021, 5:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 11 2021, 5:40 AM

For spec2017int, this patch reduces the number of specialized functions by 33%, and compile time did not increase for any benchmark.

Do you mean that with this new default of 100, we see less function specialising for SPEC2017?

llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
70

Nit: don't think "small" adds anything as it is just the function threshold. Also the naming is inconsistent with other options (there is already some inconsistency, but no need to make it worse), so can suggest:

-func-specialization-function-size-threshold

but because that's a bit verbose, perhaps shorter and still descriptive is:

-func-specialization-size-threshold
72

Typo: sepcializations

About the description, I am not sure about "redundant". I don't think redundant is the right description; I would guess it's a profitability / compile-time trade-off, and we do not want to consider functions that are too small?

444

Typo: sepcialize

llvm/test/Transforms/FunctionSpecialization/function-specialization2.ll
5

Perhaps change one of these -func-spec-small-function-threshold=10 additions into -force-function-specialization to check that interaction and behaviour.

I think this patch raises the question whether the pass today is running at the right place in the pipeline with respect to the inliner. To me it seems that either the cost modeling should leverage the InlineCost heuristics (rather than arbitrary constants tuned on SPEC) or specialize post-inlining where such decision-making may no longer be necessary. This is particularly important with the presence of PGO information which significantly affects inlining decisions. It would be appreciated if we could spend some time studying the impact of timing of the pass before committing to this direction.

I fully agree @snehasish, this seems a very hackish way - better to rethink it and make it more “general” instead of implementing some arbitrary limits.

For spec2017int, this patch reduces the number of specialized functions by 33%, and compile time did not increase for any benchmark.

Do you mean that with this new default of 100, we see less function specialising for SPEC2017?

Yes. And the benchmarks that improved before still see the same improvement.

I think this patch raises the question whether the pass today is running at the right place in the pipeline with respect to the inliner. To me it seems that either the cost modeling should leverage the InlineCost heuristics (rather than arbitrary constants tuned on SPEC) or specialize post-inlining where such decision-making may no longer be necessary. This is particularly important with the presence of PGO information which significantly affects inlining decisions. It would be appreciated if we could spend some time studying the impact of timing of the pass before committing to this direction.

Yes, this is the key question. Let me elaborate. The reason we put this pass before the inliner is that function specialization can find opportunities to convert indirect calls into direct calls, from which the inliner then benefits greatly. If we put function specialization after the inliner, we would only get the benefit of constant propagation.
BTW, converting indirect calls to direct calls looks like a common opportunity to me, since many functions are passed as arguments in real code. (Although the current implementation of function specialization doesn't scale that far yet; that's our future direction.)
So here is the decision point:

  • Put function specialization before the inliner.
    • We have the chance to convert indirect calls to direct calls.
    • There may be useless specializations that would be covered by the inliner anyway. This is what this patch tries to mitigate.
  • Put function specialization after the inliner.
    • We would miss the opportunity to convert indirect calls to direct calls.
    • We could avoid useless specializations entirely.

From my perspective, it looks reasonable once we add heuristics (whether a simple constant threshold or inline cost) to avoid specializing functions that would eventually be inlined, since we keep the optimization opportunity while avoiding many useless specializations.

For this topic about pass order, I have a crazier idea:
inliner --> function specializer --> inliner (maybe a new inliner pass)
To avoid wasting time, we could add a special attribute in the function specializer and let the second inliner decide whether or not to inline these functions. The second inliner would also be responsible for removing the attributes.

But the reason I didn't implement this is the question: is it really much better than this simple patch? On the one hand, we would need to write much more code, which generally means more compile time. On the other hand, would the generated code really be so different?
The reason I prefer this patch is its simplicity and effectiveness, to a degree.

To me it seems that either the cost modeling should leverage the InlineCost heuristics (rather than arbitrary constants tuned on SPEC)

The InlineCost heuristic works on call sites rather than functions. If we decided to use InlineCost as a filter, we would need to run it for every call site with a constant parameter. That looks expensive to me, since computing InlineCost is actually not simple, whereas a per-function estimate is much cheaper.

So here is the decision point:

  • Put function specialization before the inliner.
    • We have the chance to convert indirect calls to direct calls.
    • There may be useless specializations that would be covered by the inliner anyway. This is what this patch tries to mitigate.
  • Put function specialization after the inliner.
    • We would miss the opportunity to convert indirect calls to direct calls.
    • We could avoid useless specializations entirely.

Thanks for elaborating on the motivation. At a high level it makes sense to me for plain builds which do not use PGO. With PGO, the situation is a little different - there is no longer any need for this pass to convert indirect calls to direct calls for subsequent inlining, since indirect call promotion is very effective with profile information. However, even for plain builds we should not underestimate the negative impact of useless specializations, particularly for large binaries. I think we may move towards different pass timing for plain vs PGO optimization pipelines.

From my perspective, it looks reasonable once we add heuristics (whether a simple constant threshold or inline cost) to avoid specializing functions that would eventually be inlined, since we keep the optimization opportunity while avoiding many useless specializations.

For this topic about pass order, I have a crazier idea:
inliner --> function specializer --> inliner (maybe a new inliner pass)
To avoid wasting time, we could add a special attribute in the function specializer and let the second inliner decide whether or not to inline these functions. The second inliner would also be responsible for removing the attributes.

This seems related to the approach you proposed in https://reviews.llvm.org/D105524 (incl. generalizing across to ThinLTO)?

But the reason I didn't implement this is the question: is it really much better than this simple patch? On the one hand, we would need to write much more code, which generally means more compile time. On the other hand, would the generated code really be so different?
The reason I prefer this patch is its simplicity and effectiveness, to a degree.

Can you share the compile time degradation without this patch on SPEC?
We would also be interested in any data you may have on code size increases and performance improvements on SPEC.

To me it seems that either the cost modeling should leverage the InlineCost heuristics (rather than arbitrary constants tuned on SPEC)

The InlineCost heuristic works on call sites rather than functions. If we decided to use InlineCost as a filter, we would need to run it for every call site with a constant parameter. That looks expensive to me, since computing InlineCost is actually not simple, whereas a per-function estimate is much cheaper.

That's true; however, again, with PGO information not all functions need to be considered. For large binaries with a highly skewed ratio of hot to cold functions, it's possible a call-site-oriented approach is less expensive while generalizing better? I am not fundamentally opposed to this patch for now, as it benefits plain builds. However, in the longer run, with a goal to enable this by default, we are interested in how this will generalize to PGO builds as well as leveraging ThinLTO.

So here is the decision point:

  • Put function specialization before the inliner.
    • We have the chance to convert indirect calls to direct calls.
    • There may be useless specializations that would be covered by the inliner anyway. This is what this patch tries to mitigate.
  • Put function specialization after the inliner.
    • We would miss the opportunity to convert indirect calls to direct calls.
    • We could avoid useless specializations entirely.

Thanks for elaborating on the motivation. At a high level it makes sense to me for plain builds which do not use PGO. With PGO, the situation is a little different - there is no longer any need for this pass to convert indirect calls to direct calls for subsequent inlining, since indirect call promotion is very effective with profile information. However, even for plain builds we should not underestimate the negative impact of useless specializations, particularly for large binaries. I think we may move towards different pass timing for plain vs PGO optimization pipelines.

Yes, it's a key and good question: with PGO, how much benefit can function specialization get? In fact, the main profit of function specialization may be covered by PGO, according to the discussion in D93838. Although it is easy to imagine cases where function specialization could still help after applying PGO, it is hard for me to give realistic examples and numbers now. But at least I think we can agree on the following:

  • Function specialization could be valuable without PGO.
  • With PGO, the effect of function specialization may be questionable, but we need to experiment more.

I think the topic of PGO and function specialization is well beyond the scope of this patch. It is a good question, and it may be better to discuss it on llvm-dev.

From my perspective, it looks reasonable once we add heuristics (whether a simple constant threshold or inline cost) to avoid specializing functions that would eventually be inlined, since we keep the optimization opportunity while avoiding many useless specializations.

For this topic about pass order, I have a crazier idea:
inliner --> function specializer --> inliner (maybe a new inliner pass)
To avoid wasting time, we could add a special attribute in the function specializer and let the second inliner decide whether or not to inline these functions. The second inliner would also be responsible for removing the attributes.

This seems related to the approach you proposed in https://reviews.llvm.org/D105524 (incl. generalizing across to ThinLTO)?

Not exactly. What I mean is to make function specialization run after the inliner and write another inliner pass to consider inlining the specialized functions. I think it could solve the useless-specialization problem completely. The reason I didn't implement it is that I think it would be much more complex than this patch.

But the reason I didn't implement this is the question: is it really much better than this simple patch? On the one hand, we would need to write much more code, which generally means more compile time. On the other hand, would the generated code really be so different?
The reason I prefer this patch is its simplicity and effectiveness, to a degree.

Can you share the compile time degradation without this patch on SPEC?
We would also be interested in any data you may have on code size increases and performance improvements on SPEC.

Let me make sure I understand you correctly. It sounds like you are asking: with function specialization, how do compile time, code size, and performance change?
There are results in D93838. In short, the compile-time and code-size changes are not dramatic, and the main performance change is a ~10% improvement on 500.mcf_r.

To me it seems that either the cost modeling should leverage the InlineCost heuristics (rather than arbitrary constants tuned on SPEC)

The InlineCost heuristic works on call sites rather than functions. If we decided to use InlineCost as a filter, we would need to run it for every call site with a constant parameter. That looks expensive to me, since computing InlineCost is actually not simple, whereas a per-function estimate is much cheaper.

That's true; however, again, with PGO information not all functions need to be considered. For large binaries with a highly skewed ratio of hot to cold functions, it's possible a call-site-oriented approach is less expensive while generalizing better? I am not fundamentally opposed to this patch for now, as it benefits plain builds. However, in the longer run, with a goal to enable this by default, we are interested in how this will generalize to PGO builds as well as leveraging ThinLTO.

Yeah, the decision in this patch should use profiling information; I think that must be the future direction. For example, we would give up specializing a function which is too hot.

Interesting discussion, and fair points have been raised. So there's definitely the question of phase ordering, and/or also PGO. But the way I look at this is that function specialisation is hot off the press, and there's a lot of unexplored territory, including the two topics mentioned earlier, and then we also have the ThinLTO question.

As explained, there's a reason for the current pipeline organisation, and this minor addition, this one-liner to check AlwaysInline, makes sense in this context. This allows us to get a better baseline, i.e. a function specialisation version for which we can measure compile-time increases compared to clang without function specialisation, and to GCC, which has this enabled by default. This is my short/mid term objective: see if we can get a first version enabled by default to reach parity with GCC.

Like I said, I feel PGO is indeed another totally unexplored topic, and I can also easily imagine that for PGO things look very different; it is a different use case. And if I understood this correctly, the pass ordering may look different:

I think we may go towards a different pass timing for plain vs pgo optimization pipelines.

  • Function specialization could be valuable without PGO.
  • With PGO, the effect of function specialization may be questionable, but we need to experiment more.

I think the topic of PGO and function specialization is well beyond the scope of this patch. It is a good question, and it may be better to discuss it on llvm-dev.

Agreed, since the focus is currently on plain builds we can continue this approach. In the future, we can share some information on impact with PGO.
@apostolakis @LilyZhong

Let me make sure I understand you correctly. It sounds like you are asking: with function specialization, how do compile time, code size, and performance change?
There are results in D93838. In short, the compile-time and code-size changes are not dramatic, and the main performance change is a ~10% improvement on 500.mcf_r.

I was asking if there is a broader characterization of this work, like the 2017 patch contained.

As explained, there's a reason for the current pipeline organisation, and this minor addition, this one-liner to check AlwaysInline, makes sense in this context. This allows us to get a better baseline, i.e. a function specialisation version for which we can measure compile-time increases compared to clang without function specialisation, and to GCC, which has this enabled by default. This is my short/mid term objective: see if we can get a first version enabled by default to reach parity with GCC.

Sounds good to me, we can refine the PGO aspects if we find some opportunities.
Additionally, we would be happy to evaluate the impact (compile time, code size) on some large Google workloads once this pass is more mature.

llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
70

+1 on consistent naming.

71

I think the number of instructions != lines of code, where I assume Metrics.NumInsts enumerates the IR instruction count. Perhaps rephrase to make it more accurate?

llvm/test/Transforms/FunctionSpecialization/function-specialization-always-inline.ll
3–4

Perhaps this can be collapsed into a single CHECK-NOT: foo.{{[0-9]+}}

ChuanqiXu updated this revision to Diff 366588.Aug 16 2021, 4:27 AM

Address the comments.

Let me make sure I understand you correctly. It sounds like you are asking: with function specialization, how do compile time, code size, and performance change?
There are results in D93838. In short, the compile-time and code-size changes are not dramatic, and the main performance change is a ~10% improvement on 500.mcf_r.

I was asking if there is a broader characterization for this work like the 2017 patch contained.

Here is the result:

benchmark          specialized number before this patch   specialized number after this patch
500.perlbench_r    7                                      5
502.gcc_r          45                                     28
505.mcf_r          2                                      2
525.x264_r         2                                      0
557.xz_r           4                                      4
Thanks! I'm ok with this patch and revisiting the phase ordering and inlining interaction in the future. @xbolva00 what do you think?

As this improves the current state, go for it. In the future this check can be reworked and made more sophisticated.

@SjoerdMeijer Do you feel good with this?

This revision is now accepted and ready to land.Aug 23 2021, 1:22 AM
This revision was landed with ongoing or failed builds.Aug 23 2021, 4:22 AM
This revision was automatically updated to reflect the committed changes.