This is an archive of the discontinued LLVM Phabricator instance.

[SCCP] Add Function Specialization pass
ClosedPublic

Authored by SjoerdMeijer on Dec 27 2020, 7:25 AM.

Details

Summary

This patch adds a function specialization pass to LLVM. Constant parameters, such as function pointers and constant globals, are propagated to the callee by specializing the function.
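As a quick illustration of the transformation (a hypothetical C++ example for this description, not code from the patch or its tests):

// Given a function that takes a function pointer, and a call site that
// passes a known global function...
int apply(int (*op)(int), int x) { return op(x) + 1; }
int square(int n) { return n * n; }
int caller(int x) { return apply(square, x); }

// ...the pass clones 'apply' with 'op' bound to 'square' (conceptually; the
// actual pass clones the function and replaces uses of the parameter with the
// constant). The indirect call becomes direct and can later be inlined:
int apply_square(int x) { return square(x) + 1; }
int caller2(int x) { return apply_square(x); }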

Current limitations:

  • It does not handle specialization of recursive functions,
  • It does not yet handle constants and constant ranges,
  • Only 1 argument per function is specialised,
  • The cost-model could be further looked into,
  • We are not yet caching analysis results.

The pass is based on the Function specialization proposed here.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
  • Rebased
  • This is now off by default in the pass manager; renamed/introduced the option EnableFunctionSpecialization,
  • Introduced options for some of the threshold values,
  • renamed function-specialize to function-specialization for consistency in the pass option name, test names, etc.,
  • restricted the number of iterations the solver runs, controllable with an option.

I think this addresses the (minor) comments so far.

Todo:

  • For the first version of this I want to solve the case of e.g. foo(1,2,3,4) as remarked in the comments; I am looking at this now.
  • And I am adding more test cases.
fhahn added a comment.Apr 15 2021, 4:08 AM

I just ran SPEC2017 intrate with full LTO and this patch, limiting the solver iterations to 10 and to 20.
Here are my results:

benchmark        speedup (limit 10 iterations)   speedup (limit 20 iterations)
505.mcf_r        8.4%                            8.4%
520.omnetpp_r    0.4%                            4%

I didn't run 548.exchange2_r since it requires a Fortran frontend to emit LLVM IR. The other benchmarks in SPEC2017 intrate don't show significant results.

It is interesting that the result for 520.omnetpp_r is different from the earlier result, which showed a large regression.

It is also interesting that the results differ with different iteration limits.

Next, the compile-time changes:

benchmark          compile-time change (limit 10 iterations)   compile-time change (limit 20 iterations)
500.perlbench_r    27%                                         27%
502.gcc_r          9%                                          9%
505.mcf_r          17%                                         17%
520.omnetpp_r      10%                                         14%
523.xalancbmk_r    3%                                          5%

The other benchmarks, including 525, 531, 541 and 557, don't show significant change.
The times listed here are for the whole compilation process, not just the link step.

Finally, the code size change:

benchmark        code size change (limit 10 iterations)   code size change (limit 20 iterations)
505.mcf_r        14%                                      14%
520.omnetpp_r    13%                                      13%

The code sizes for other benchmarks don't show significant change.

Thanks for sharing those sobering numbers. These seem to indicate that just adjusting the threshold may not be enough to control the excessive compile-time.

I lost one of the existing test file, added it back.

Thanks for sharing those sobering numbers. These seem to indicate that just adjusting the threshold may not be enough to control the excessive compile-time.

Yes, that's a bit high, and not in line with the additional compile-times we see with GCC. I haven't carefully looked into it yet, but those 10 iterations also look a bit arbitrary to me, and I am guessing we could do with fewer. But agreed that tuning this down may not solve it, so I will look into this.

I noticed that there are many traversals in the code; I wonder if we can mitigate that by caching some results. I am not sure whether the compile-time hot spot is in this code or in the subsequent optimizations.

Since the pass only runs during LTO in the LegacyPM, I wonder if the excessive compile-time is acceptable, because LTO is naturally time-consuming. It looks like it runs as a module pass in the NewPM though; I don't know whether that's by design or an oversight.

llvm/lib/Passes/PassBuilder.cpp
1112

The Function Specialization pass only runs in LTO mode under the legacy pass manager, while it seems like it would run as a module pass in the NewPM. Is this intended behavior?

fhahn added a comment.Apr 16 2021, 2:24 AM

It should run at the same place in both LegacyPM and NewPM. The fact that LTO is slow already is not a strong argument to make it substantially slower again :) In the end, it is always a trade-off between the benefits and the extra compile-time. I think one of the main concerns with the numbers you shared is that there appear to be a couple of substantial increases in compile-time without any noticeable benefit. This is only a small sample of course, and we probably need more data.

Agreed, and I'm now looking into the pass manager business; that needs to be consistent, and I can't imagine the current inconsistency is by design.
About numbers and goals, I shared initial findings in:

https://lists.llvm.org/pipermail/llvm-dev/2021-March/149380.html

My goal is that this eventually reaches parity with GCC: it gets enabled by default, and doesn't exceed compile-time increases in the range of +1.5%, as that is what I have seen so far from GCC.

I am now deep-diving into the algorithm to see where the time is spent and where we can improve things.

SjoerdMeijer added a comment.EditedApr 16 2021, 2:47 AM

I could have elaborated a bit more on my plans in my previous message, so let me add that here.

I saw three parts to this work:

  1. The first one was the SCCP refactoring that already went in,
  2. This patch that adds the function specialisation framework,
  3. The further fine-tuning and cost-modelling to get this enabled by default.

In my opinion, the GCC results would justify getting 2 committed first, and then working on 3. A few notes on this:

  • The requirement to get 2 committed I think is that we get compile-times under control,
  • We really would like to get this enabled by default. If we don't reach parity with GCC, this work has little value to us, and there is little point in doing it.

Like I said, that was my plan, but I am interested to hear whether you agree or see things differently @fhahn, @ChuanqiXu. Please let me know.

SjoerdMeijer added inline comments.Apr 16 2021, 5:32 AM
llvm/lib/Transforms/IPO/SCCP.cpp
117

@ChuanqiXu: please note the hard-coded true here, which corresponds to the IsAggressive boolean. So I think your timings were measuring the aggressive mode. But anyway, I will upload one more intermediate diff with all sorts of pass manager stuff fixed, and then will start doing some timing too.

Uploading one more intermediate result that is a better baseline; this has the pass manager things fixed:

  • under the new pass manager, it was always running in aggressive mode and wasn't respecting the specialisation level, so I have added that,
  • I have added the pass to the legacy pass manager, where it was indeed missing.

@ChuanqiXu: I have started looking at increased compile-times, and experimented with the biggest outlier you reported:

500.perlbench_r	27%

In my experiment I think I see less than 1% compile time increase with this patch.
Would you mind double checking this for me with the latest patch? Would be good to get a confirmation to make sure we are on the same page.

Using "sqlite amalgamation" as my favourite compile-time stress test I see that it specialises 4 functions and that:

  • on an older/smaller, noisy system, I see a ~2% compile-time increase,
  • on a newer/bigger system, I see a 1% compile-time increase.

With no optimisation (caching) done yet to improve compile-times, this isn't a disaster and is actually a good place to start from.

ChuanqiXu added a comment.EditedApr 19 2021, 7:19 PM

@ChuanqiXu: I have started looking at increased compile-times, and experimented with the biggest outlier you reported:

500.perlbench_r	27%

In my experiment I think I see less than 1% compile time increase with this patch.
Would you mind double checking this for me with the latest patch? Would be good to get a confirmation to make sure we are on the same page.

Yes, I will double-check this. Compile-time is a key factor for this patch.

Here is my context information:
Hardware:

  • Ampere
  • Architecture: aarch64
  • CPU(s): 80
  • Thread(s) per core: 1
  • Core(s) per socket: 80
  • Model: 1
  • CPU max MHz: 3000.0000
  • CPU min MHz: 1000.0000
  • L1d cache: 64K
  • L1i cache: 64K
  • L2 cache: 1024K
  • Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

Baseline option:

  • COPTIMIZE: -flto=full
  • CXXOPTIMIZE: -flto=full
  • LDOPTIMIZE: -flto=full -fuse-ld=lld

Function Specialization:

  • COPTIMIZE: -flto=full -mllvm -function-specialize-level=aggressive
  • CXXOPTIMIZE: -flto=full -mllvm -function-specialize-level=aggressive
  • LDOPTIMIZE: -flto=full -fuse-ld=lld -Wl,-mllvm -Wl,-function-specialize-level=aggressive

The compiler is a private version based on LLVM 11.

Yes, I prefer to turn this on by default if we can control the compile-time. The reason I am looking into this pass is that I saw it in the talk The present and future of Interprocedural Optimization.

It makes sense to me that function specialization has potential. I am also planning to enable function specialization in ThinLTO (by adjusting the logic for deciding which functions to import) and to do function specialization based on value ranges (so we could do interprocedural value range propagation).

llvm/lib/Transforms/IPO/SCCP.cpp
117

Yes, I was timing the aggressive mode only.

@ChuanqiXu: I have started looking at increased compile-times, and experimented with the biggest outlier you reported:

500.perlbench_r	27%

In my experiment I think I see less than 1% compile time increase with this patch.
Would you mind double checking this for me with the latest patch? Would be good to get a confirmation to make sure we are on the same page.

I just re-built 500.perlbench_r and found a 6.5% increase based on revision 330436. I think the subsequent revisions just didn't contribute to compile-time; I will do more experiments to confirm this.
BTW, the first time I tried the experiment I made a mistake and didn't actually pass the argument to the linker. I mean:

- LDOPTIMIZE: -flto=full -fuse-ld=lld -mllvm -function-specialize-level=aggressive
+ LDOPTIMIZE: -flto=full -fuse-ld=lld -Wl,-mllvm -Wl,-function-specialize-level=aggressive

Many thanks for looking into this again. And 6.5% is already a lot better than 20+%. It is still not great of course, and not really acceptable if that were the case, but I suspect there's still quite a bit of noise going on there (I didn't see it triggering on 500.perlbench_r, so 6.5% to walk over a few functions sounds excessive to me; plus you were testing the aggressive mode, for what that's worth). What I am going to do is park the compile-time analysis for now and focus on getting a first version of the framework in. For that, I am going to address a functional issue: your remark about how a case like foo(1, 2, 3, 4); is currently treated. After the initial commit of the framework, we can start looking into caching parts of the analysis. This makes sense in my opinion, because the current data doesn't show a disaster that would be difficult to fix. Quite the opposite, I think: the reason I like the sqlite amalgamation is that it's a single ginormous file with a lot of functions, function specialisation triggers, it is really stable as a compile-time test as it runs for more than a minute, and the 1 - 2% compile-time increase is acceptable.

I will still test compile-time to check this. It would be good if someone else interested in this patch could also run SPEC2017, though that is perhaps a lot to ask.

I still get a 6.5% compile-time increase for 500.perlbench_r across repeated runs. May I ask for your hardware context and compilation options?

Okay, thanks for confirming. I was using -Ofast on an A72-based system.
I am addressing some functional issues with this patch at the moment, but once I've done that I will get numbers for SPEC.

SjoerdMeijer retitled this revision from [LLVM] [SCCP] : Add Function Specialization pass to [SCCP] Add Function Specialization pass.
SjoerdMeijer edited the summary of this revision. (Show Details)

Last intermediate update with quite a few things fixed:

  • removed the aggressive mode as we don't need it right now and it didn't seem to entirely work,
  • introduced a force option to overrule the cost-model,
  • added more tests; the force option allowed some simpler tests with constant globals.
  • have added previous review comments as FIXME/TODOs.
  • have added debug messages.
SjoerdMeijer added inline comments.Apr 26 2021, 1:47 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
2392

Currently, it will only specialise this into:

foo_1(2, 3, 4);

because there is a limitation that it only specialises 1 argument; see my next comment below...

2442

.... here!

It stops after specialising one argument. This is very arbitrary and probably what we want to fix in the first version.
This requires a bit of a reshuffle, which is what I am addressing now.

ChuanqiXu added inline comments.Apr 26 2021, 2:27 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
96

I would like to reword this to 'Force function specialization for every call site with constant argument'.

98–101

It looks like we need to re-evaluate the performance, compile-time and code-size for different parameters now.

2442

I am OK with fixing this in subsequent patches.

2946–2948

Could we write:

unsigned MaxIters = FuncSpecializationMaxIters; // Since the default value for FuncSpecializationMaxIters is 1

Addressed minor comments.

SjoerdMeijer added inline comments.Apr 26 2021, 5:55 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
98–101

Agreed. This patch might be a candidate for a first version, and with a few things changed, it is time to measure things again to see where we are, so I am going to do that right now.

2442

That will work for me.

Does this also handle the bool propagation case, like

myfn(dst, src, flags, isEnabled);

to

if (isEnabled)
  myfn_true(dst, src, flags);
else
  myfn_false(dst, src, flags);

?

Nope, not yet I am afraid. But agreed that this would be nice!

ChuanqiXu added inline comments.Apr 28 2021, 1:13 AM
llvm/lib/Passes/PassBuilder.cpp
1112

It looks like it would run as a module simplify pass in NewPM?

SjoerdMeijer added inline comments.Apr 28 2021, 1:46 AM
llvm/lib/Passes/PassBuilder.cpp
1112

I will look into this.

I am actually trying to reproduce the numbers at the moment, but found that this patch doesn't trigger on SPEC anymore. Curious whether I had regressed things with my changes, I took the second revision of this patch, for which performance numbers were reported, but I don't see an uplift with that either, so I can't reproduce the numbers yet. I am comparing an LTO run without the patch applied as the baseline to an LTO run with the patch applied. Now, rereading previous review comments, it looks like you had the same problem reproducing numbers. But anyway, I will sort that out first.

ChuanqiXu added inline comments.Apr 28 2021, 1:51 AM
llvm/lib/Passes/PassBuilder.cpp
1112

Yeah, previously I found that this patch didn't work on SPEC because I made a mistake when running LTO.
I had set:

LDOPTIMIZE    = -flto=full -fuse-ld=lld -mllvm -function-specialize-level=aggressive

But it wouldn't work. In fact, we need to set:

LDOPTIMIZE    = -flto=full -fuse-ld=lld -Wl,-mllvm -Wl,-function-specialize-level=aggressive

Hope this helps.

I just ran the newest revision on SPEC2017 intrate (without 548.exchange2_r). The default iteration limit is one. Below are the results; overall they look good to me.

Performance:
I observed that 505.mcf_r gets a 10% improvement, which is consistent with the previous experiment.
I didn't find an improvement for 520.omnetpp_r, nor a regression; we need to explore that further.
There are no other observable changes for the other benchmarks.

Compile-time:

benchmark          compile-time change (limit 1 iteration)   Note
500.perlbench_r    2%
502.gcc_r          6%
505.mcf_r          19%                                       total compile time for 505.mcf_r is relatively short
520.omnetpp_r      3%
523.xalancbmk_r    3%

No observable changes for other benchmarks.

Code Sizes:

benchmark          code size change (limit 1 iteration)
505.mcf_r          14%
523.xalancbmk_r    2%

505.mcf_r

Interesting, did you analyze it more? What is the reason for such an improvement? Additional vectorization?

Note for others who want to reproduce the results: I tried to run 505.mcf_r on x86-64 and found no observable improvements. The experiment before was done on AArch64. Is it related to the hardware, or to the architecture?

I haven't looked into the details. I will do that if possible.

SjoerdMeijer added a comment.EditedApr 29 2021, 2:44 AM

Many thanks for the new numbers!
I was struggling with my setup yesterday (not building it, but running it and observing the uplift). I am going to give it another try today.

I would need to double check, but I think the motivating example is included as a regression test (llvm/test/Transforms/FunctionSpecialization/function-specialization.ll). It shows that specialisation eventually results in inlining and a lot of simplifications.

Small fix: the pass was not running in LTO mode because it wasn't added to the LTO pipeline... so this adds it to the LTO pipeline.

This now triggers on MCF in LTO mode, and my first measurements show a ~30% performance gain for an extra 30% compile-time, which is quite a steep compile-time increase. I am now going to look into this more.

@ChuanqiXu : did you measure results with LTO or non-LTO? Because if it was LTO, then it looks like we have only measured noise.

I tested both this patch and the baseline under LTO. Since I applied this patch to a downstream compiler based on 11.x, it runs in LTO mode under the old pass manager. Your results show a greater performance gain and compile-time increase than mine; I wonder if your baseline was measured without LTO?

The benefit for 505.mcf_r comes from the specialization of spec_qsort. Here is the signature of spec_qsort:

void
spec_qsort(void *a, size_t n, size_t es, cmp_t *cmp)

Here the cmp_t* is a function pointer, and there are lots of uses of cmp in spec_qsort. spec_qsort is called in two places in 505.mcf_r:

spec_qsort(arcs_pointer_sorted[thread], new_arcs_array[thread], sizeof(arc_p),
                (int (*)(const void *, const void *))arc_compare);
spec_qsort(perm + 1, basket_sizes[thread], sizeof(BASKET*),
            (int (*)(const void *, const void *))cost_compare);

Both arc_compare and cost_compare are global functions. So we can see why function specialization benefits 505.mcf_r: it converts the indirect calls into direct calls, and the direct calls can then be inlined.

This pattern seems common in our daily work code. However, I wonder what happens if there are multiple call sites for spec_qsort with many different global functions; it looks like the code can't handle that situation yet, and it is even more common in real projects. I am not asking for a cost-model change; I think we can make that in the future. This is just for sharing.
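To make the effect concrete, here is a rough, hand-written sketch of the result (conceptual only, not actual output of the pass):

// The pass clones spec_qsort once per known comparator and replaces all uses
// of 'cmp' in each clone with the respective constant, so calls through 'cmp'
// become direct calls:
void spec_qsort_1(void *a, size_t n, size_t es, cmp_t *cmp); // cmp == arc_compare
void spec_qsort_2(void *a, size_t n, size_t es, cmp_t *cmp); // cmp == cost_compare

// The two call sites are rewritten to target the clones, and the now-direct
// calls to arc_compare/cost_compare inside them become inlining candidates.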

Many thanks for sharing. With my infrastructure/workflow problems (mostly) sorted so I can run and evaluate things, I have seen exactly the same, so I can confirm this.

My baseline is trunk, without this patch applied, in LTO mode. That uses the new pass manager, and as this new pass wasn't added to its LTO pipeline, I didn't see it triggering. But with that fixed and this patch applied, I saw the 30% gain with 30% extra compile time. This was on an older and noisier AArch64 system, but the trend was clear, and especially the increased compile-times were very obvious and consistent. I will also run this on a newer system, but I am still setting that up.

PS. About LTO, I didn't see this triggering in non-LTO mode on MCF. That's why I am only looking at LTO at the moment.

Now that I have solid LTO performance numbers, I am going to look at compile-times.

ChuanqiXu added a comment.EditedMay 6 2021, 4:16 AM

My result for 505.mcf_r is a 10% improvement with 19% extra compile-time. At least our trends are the same: both of us see a significant improvement with significant extra compile time.

I will try to experiment on trunk.

ChuanqiXu added inline comments.May 6 2021, 8:23 PM
llvm/lib/Transforms/Scalar/SCCP.cpp
2593–2599

To my knowledge, if the lattice for A is constant, it can be handled by the current IPSCCP. But what if the lattice is a constant range?

2798–2820

I don't understand this: why don't we use the constant value directly?

After I applied this patch to trunk, I got a 10% performance increase on 505, which is consistent with my previous experiments. It looks like the hardware may be a key factor in this case? I will try to look into this.

SjoerdMeijer added a comment.EditedMay 8 2021, 12:15 AM

Yeah, I am on an old slow system where things behave slightly differently, but the trend is the same.

I have seen your previous comments, but just a sign of life that I am investigating compile times; I will come back to your comments. The way I am investigating compile times is just by looking at the final LTO compile/link step:

clang   -Wl,--sort-section=name -flto=full -fuse-ld=lld   -O3         -flto=full mcf.o mcfutil.o readmin.o implicit.o pstart.o output.o treeup.o pbla.o pflowup.o psimplex.o pbeampp.o spec_qsort/spec_qsort.o             -lm         -o mcf_r -Wl,-mllvm -Wl,-enable-function-specialization=1

For this compile/link step, compile time goes up from 13s to 17s. I found that -ftime-report unfortunately doesn't work in LTO, or I am doing something wrong, but I want to find out where I am losing 4 seconds. By manually adding time instrumentation at the beginning and end of the function specialization pass, it looks like I am only losing 0.2s in this pass, so now my question is where the other 3.8s are lost. I haven't found the answer to that yet. One suspicion is that more functions need to be compiled, and that's where the extra compile time is spent. I haven't checked whether the 2 specialised functions in MCF are large functions, which would explain that. But that's where I am at with this... to be continued next week.

In my environment, the compile/link time for 505.mcf_r is less than 1s in the original LTO mode (maybe because the compile concurrency is relatively high). After I apply this patch, the compile/link time is a little over 1s. So I am OK with the compile time now.

I actually found yesterday that I had made a mistake: I was testing a Debug build! With a release build, I see the same as you: the LTO compile/link step goes up from 0.85s to 1s. It is a similar trend to the Debug build; it's just that the absolute numbers there made it more visible. What is funny, though, is that function specialisation itself takes very little time, less than 0.01s; something after it takes up a bit more compile-time. In MCF, 2 functions get specialised. To test whether the extra compile time goes into compiling these extra functions, I restricted it to specialise only 1 function, but the compile time remained the same as with 2 functions, which is surprising! Perhaps it's some analysis pass after it that hits a corner case. I wanted to see where the compile time is spent with -ftime-report, but found that it doesn't work with LTO. I dropped a little message to the dev list yesterday: https://lists.llvm.org/pipermail/llvm-dev/2021-May/150487.html

I think what I will do is try to get that working with LTO, because it would be really good to know where this extra compile time goes, which we don't know at the moment. At the same time, I am now more confident that there's nothing fundamentally wrong with compile times in function specialisation, and thus I am thinking this could be a candidate for a first commit (with the pass still disabled by default, of course), so that we can iterate in-tree, which will be more convenient. But again, I will first see if I get anywhere with -ftime-report, and then hopefully we know more soon.
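For reference, one way that might work to get per-pass timings out of the lld full-LTO link (untested here) is to forward LLVM's -time-passes option to the LTO backend, using the same -Wl,-mllvm forwarding as for the function-specialization options above:

clang -flto=full -fuse-ld=lld -O3 *.o -o mcf_r -Wl,-mllvm -Wl,-time-passes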

It is great to hear that we get consistent results! It looks good to me to make this a candidate for the first commit, but I think you need approvals from the others : )

Re: compile-times, I finally managed to get the numbers out.

Timing report (in seconds)              LTO      LTO + func spec   Diff
Pass execution 1                        0.5054   0.7222            +43%
Pass execution 2                        0.4992   0.6274            +26%
DWARF Emission                          0.0399   0.0304            not interesting, probably noise
Register Allocation                     0.0381   0.0433            ''
Instruction Selection and Scheduling    0.1256   0.0982            ''

These percentages look bad, but the absolute numbers are small... Zooming in on the Pass Execution 1 regressions, the top 3 compile time consumers before were:

0.1590 ( 22.9%)   0.0000 (  0.1%)   0.1590 ( 22.0%)   0.1590 ( 22.0%)  InstCombinePass
0.0699 ( 10.1%)   0.0006 (  2.1%)   0.0705 (  9.8%)   0.0705 (  9.8%)  GVN
0.0670 (  9.6%)   0.0034 ( 12.2%)   0.0704 (  9.7%)   0.0703 (  9.7%)  IndVarSimplifyPass

And after, with this patch applied:

0.1116 ( 23.0%)   0.0000 (  0.0%)   0.1116 ( 22.1%)   0.1116 ( 22.1%)  InstCombinePass
0.0560 ( 11.5%)   0.0001 (  0.6%)   0.0561 ( 11.1%)   0.0560 ( 11.1%)  IndVarSimplifyPass
0.0380 (  7.8%)   0.0037 ( 19.1%)   0.0417 (  8.3%)   0.0417 (  8.3%)  GVN

Function specialisation itself is not expensive and takes 7th place with:

0.0346 (  5.0%)   0.0000 (  0.0%)   0.0346 (  4.8%)   0.0346 (  4.8%)  FunctionSpecializationPass

Looking at Pass Execution 2, the top 3 consumers before were:

0.1270 ( 29.4%)   0.0115 ( 17.3%)   0.1386 ( 27.8%)   0.1386 ( 27.8%)  AArch64 Instruction Selection
0.0241 (  5.6%)   0.0438 ( 65.8%)   0.0679 ( 13.6%)   0.0679 ( 13.6%)  AArch64 Assembly Printer
0.0579 ( 13.4%)   0.0001 (  0.1%)   0.0580 ( 11.6%)   0.0579 ( 11.6%)  Loop Strength Reduction

And after:

0.1432 ( 27.3%)   0.0360 ( 34.9%)   0.1792 ( 28.6%)   0.1792 ( 28.6%)  AArch64 Instruction Selection
0.0336 (  6.4%)   0.0556 ( 53.8%)   0.0891 ( 14.2%)   0.0891 ( 14.2%)  AArch64 Assembly Printer
0.0736 ( 14.0%)   0.0024 (  2.3%)   0.0760 ( 12.1%)   0.0759 ( 12.1%)  Loop Strength Reduction

My conclusion from this is that it's not just one thing: function specialisation itself is just a minor contributor, but extra compile time is spent in existing passes without anything really standing out.

Thus, how about this as an initial commit @sanwou01, @dmgreen, @fhahn? Iterating on the current limitations with this in-tree is more convenient.

SjoerdMeijer edited the summary of this revision. (Show Details)May 13 2021, 3:45 AM

Getting the heuristics right will be difficult here.

There are a bunch of places in the openmp runtime where function pointers are passed to big functions. Manually tagging those arguments with an attribute in exchange for guaranteed specialisation when they're constant would be high value, even without any heuristics to automate the decision.

What do you think about moving the function specialization pass into standalone files (e.g., FunctionSpecialization.h and FunctionSpecialization.cpp)? (We could do that in subsequent patches.)

fhahn added a comment.May 13 2021, 3:54 AM

My conclusion from this is that it's not just one thing: function specialisation itself is just a minor contributor, but extra compile time is spent in existing passes without anything really standing out.

Thanks for sharing the numbers!

I'm not sure of the conclusion though. While function specialization itself is quick, it is still the cause of the regression in the other passes, as it (presumably) adds a lot of extra code that the other passes now need to churn through. So this is still something that needs to be addressed in function specialization, right?

re absolute numbers being small: do you have any insight into whether the impact is expected to be less on larger benchmarks/binaries?

Manually tagging those arguments with an attribute in exchange for guaranteed specialisation when they're constant would be high value, even without any heuristics to automate the decision.

Seems like a really good idea. Or maybe the hot attribute is enough to ask for guaranteed specialisation?

Anyway, in the future this pass should respect a “noclone” attribute (which needs to be added).

absolute numbers being small: do you have any insight into whether the impact is expected to be less on larger benchmarks/binaries?

I am interested in how this affects real-world programs. Focusing only on a small set of benchmarks is not ideal.

Maybe you should try to compile clang or Firefox with and without this feature and get some numbers on how compile time and code size change?

Well, I am not entirely sure about that yet. Yes, the rest of the optimisation pipeline sees more functions/instructions, so it spends more time on them; from that point of view, it is what it is.
On the other hand, this could be controlled from function specialisation by specialising less. For this example, it's really beneficial to specialise two functions, but would we, for example, only want to specialise 1 for compile-time reasons? Perhaps we then need a way to control that with a "budget" for how much the code is allowed to grow, which would restrict compile-times; see the sketch below.
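A minimal sketch of what such a budget could look like (purely hypothetical code, not part of this patch; the option name and helper function are made up for illustration):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical sketch: cap specialisation once the module has grown by more
// than a configurable percentage of its original instruction count.
static cl::opt<unsigned> MaxCodeGrowthPercent(
    "func-spec-max-growth-percent", cl::init(30),
    cl::desc("Max code growth (percent) allowed by function specialization"));

static bool withinGrowthBudget(uint64_t OriginalInsts, uint64_t CurrentInsts) {
  // Keep specialising only while the accumulated growth stays under budget.
  return (CurrentInsts - OriginalInsts) * 100 <=
         OriginalInsts * MaxCodeGrowthPercent;
}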

re absolute numbers being small: do you have any insight into whether the impact is expected to be less on larger benchmarks/binaries?

No, not yet. I will see what I can do. It triggers for this case, but I will try to find another one with a larger compile time. Probably I will return to SQLite for that.

do you have any insight into whether the impact is expected to be less on larger benchmarks/binaries?

I am interested in how this affects real-world programs. Focusing only on a small set of benchmarks is not ideal.

From the code I have looked at, the impact on real-world programs depends on the heuristics. With the current heuristics, it is hard to imagine this patch affecting real-world programs much, since it won't specialize a function on a given argument when the number of possible constants is bigger than 3. I think it will take a long time to tune the heuristics, and the cost model may keep changing during that process.

xbolva00 added a comment.EditedMay 13 2021, 4:16 AM

I think it will take a long time to tune the heuristics, and the cost model may keep changing during that process.

Maybe don't reinvent the wheel, and take a look at which heuristics GCC uses?

“budget”

Yeah, definitely. Otherwise everything would be out of control.

One more data point.

MCF is at one end of the spectrum with an LTO compile/link step of ~1 second, and SQLite is somewhere at the other end with a compile time of 160 seconds (i.e., the amalgamation version: 184K lines of code in one file, over 6.4 megabytes in size). It specialises 4 functions, and thus triggers 4 times. The overhead of function specialisation is just 1.3% - 1.8%, and is mainly in running the pass itself, not any knock-on effect of the rest of the pipeline chewing on more instructions. I think this shows that:

  • the cost of function specialisation is amortised with larger compile times,
  • but of course also the ratio of the number of existing and new functions/instructions is important.

Similarly zooming in on the compile-times for SQLite:

Timing report (in seconds)              -O3        -O3 + func spec
Pass execution 1                        159.9010   164.3134
Register Allocation                     2.7357     2.7376
Instruction Selection and Scheduling    9.0910     9.0746
Pass execution 2                        61.8962    61.6610
DWARF Emission                          12.2326    12.2232

And again the function specialisation pass does not enter the top 5 of the most compile time consuming passes:

2.8416 (  1.8%)   0.3040 (  5.6%)   3.1456 (  1.9%)   3.1457 (  1.9%)  FunctionSpecializationPass

The binary size is 1,575,571 bytes vs. 1,575,299 bytes, so slightly reduced, which is interesting.

So, as it stands, and sort of repeating what I said earlier today: this looks like a good candidate for an initial commit. I.e., this looks like a good basis that performs the transformation, but we know it needs further cost-model tuning. For example, the first thing I would like to add is a budget to not let functions grow out of proportion. What do we think about that?

Moved it to its own file FunctionSpecialization.cpp.

Looking at the cost model, I think it is hard to inflate the code too much. The key components of the cost model are:

if (PossibleConstants.size() > MaxConstantsThreshold) {
  LLVM_DEBUG(dbgs() << "FnSpecialization: number of constants found exceed "
                    << "the maximum number of constants threshold.\n");
  return false;
}
// ...
/// Compute the cost of specializing function \p F.
  uint64_t getSpecializationCost(Function *F) {
    // Compute the code metrics for the function.
    SmallPtrSet<const Value *, 32> EphValues;
    CodeMetrics::collectEphemeralValues(F, &(GetAC)(*F), EphValues);
    CodeMetrics Metrics;
    for (BasicBlock &BB : *F)
      Metrics.analyzeBasicBlock(&BB, (GetTTI)(*F), EphValues);

    // If the code metrics reveal that we shouldn't duplicate the function, we
    // shouldn't specialize it. Set the specialization cost to the maximum.
    if (Metrics.notDuplicatable)
      return std::numeric_limits<unsigned>::max();

    // Otherwise, set the specialization cost to be the cost of all the
    // instructions in the function and penalty for specializing more functions.
    unsigned Penalty = (NumFuncSpecialized + 1);
    return Metrics.NumInsts * InlineConstants::InstrCost * Penalty;
  }

From the implementation of getSpecializationCost, we can see that as the number of specialized functions increases, the cost of specializing another function grows. So not too many functions in a large program should actually get specialized.
(This introduces another problem, about the order in which functions are specialized; I think we can handle that in subsequent patches.)
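As a concrete example of how the penalty scales: with InlineConstants::InstrCost being 5 (its value at the time of this patch), a 200-instruction function would be costed at 200 * 5 * 1 = 1000 for the first specialization in the module, 2000 for the second, and 3000 for the third, so the bar for each additional specialization rises linearly.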

Adding a budget could give users more control over this pass. I think we could limit the number of functions specialized.

I think we could limit the number of functions specialized.

Yes, and we should have some score that says how profitable a specialization could be, so we can prioritize specializations.

Let's discuss a way forward with this.

Data and analysis have shown the following regarding increased compile-times:

  • there is not one single pass to blame (if that were the case, there could perhaps have been an easy fix);
  • the obvious way to control this is to specialise less, based on a cost-model that also takes function size into account, which looks fairly straightforward to plumb in as a first step;
  • and compile-time is not a disaster at the moment! The overhead will perhaps be more visible with short compile times, and less visible with longer ones. Also, it is currently in line with GCC, which has this enabled by default.

I would like to propose this as an initial version so that we can get more experience with this, and also very importantly, others can contribute and help with this.

Then, as a follow-up, my plan is to:

  • look at code-size growth and control it,
  • and study and document GCC's cost-model.

Agreed @fhahn , @xbolva00 , @ChuanqiXu , @sanwou01 ?

I think we could limit the number of functions specialized.

Yes, and we should compute some score that says how profitable a specialization could be, so we can prioritize such specializations.

Yeah, it should be. It further requires us to split the current pass into an analysis pass and a transformation pass.


  • the obvious way to control this, is to specialise less, based on a cost-model that also takes function size into account, which looks fairly straightforward to plumb in to me as a first step,

I think the current cost model already considers the size of the function.

I would like to propose this as an initial version so that we can get more experience with this, and also very importantly, others can contribute and help with this.

Yeah, the current patch looks good to me, with some problems remaining. After this patch is accepted, I will try to enable it in ThinLTO, as I discussed in (https://groups.google.com/g/llvm-dev/c/rXb0Zl7OKB4). I think the function specialization pass has potential.
Of course, as a newcomer to this community, I am not familiar with the standard for accepting a new pass. Let's see the opinions of the more experienced reviewers : )

It further requires us to split the current pass into an analysis pass and a transformation pass.

+1, agree.

But otherwise, this is a good starting point.

Friendly ping: I am looking for an LGTM for this so the we can continue with the next steps.

ChuanqiXu added a comment.EditedMay 17 2021, 3:51 AM

Friendly ping: I am looking for an LGTM for this so the we can continue with the next steps.

This patch looks good to me. But I think I am not the right person to accept it : )

BTW, it looks like we need to format this.

I am wondering if we can remove the original IPSCCP once we make this strong enough?

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
432–438 ↗(On Diff #345249)

It looks like we left this work for the original IPSCCP to handle. Can we do it in this pass?

davide removed a reviewer: davide.May 19 2021, 10:33 AM

For the motivating case from MCF, there is actually code in InlineCost Analysis to handle it. See InlineCostCallAnalyzer::onLoweredCall(..). If it is not already handled there, it is a bug to be fixed there.

For the motivating case from MCF, there is actually code in InlineCost Analysis to handle it. See InlineCostCallAnalyzer::onLoweredCall(..). If it is not already handled there, it is a bug to be fixed there.

It is hard to believe that inlining could handle the case in MCF. The case in MCF is that there is a function spec_qsort with signature (void *, size_t, size_t, int (*)(const void *, const void *)), and there are two calls to spec_qsort:

spec_qsort(perm + 1, basket_sizes[thread], sizeof(BASKET *),
           (int (*)(const void *, const void *))cost_compare);
// and
spec_qsort(arcs_pointer_sorted[thread], new_arcs_array[thread], sizeof(arc_p),
           (int (*)(const void *, const void *))arc_compare);

The benefit comes from function specialization specializing spec_qsort on cost_compare and arc_compare; cost_compare and arc_compare are then inlined into the specialized functions. In my mind, without function specialization the compiler could not determine the value of the function pointer at compile time, so the inlining could not happen.
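Conceptually the transformation looks like this (an illustrative sketch; the clone's name is made up, and the pass of course works on IR, not C):

// Before: the comparator is an indirect call inside spec_qsort.
void spec_qsort(void *base, size_t n, size_t es,
                int (*cmp)(const void *, const void *));

// After specializing on cmp == arc_compare, the clone contains a direct
// call, which the inliner can then inline as usual.
void spec_qsort_arc_compare(void *base, size_t n, size_t es) {
  // ... same body, but every cmp(a, b) is now arc_compare(a, b) ...
}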

When the compiler analyzes the callsite to spec_qsort, it sees the constant function pointer passed to the callee. During the callee body analysis, when the indirect callsite is seen, it will be simplified/evaluated to the callsite constant, so the compare call can be resolved. If the compiler decides that the compare call can be further inlined, it will give the original callsite an additional bonus for inlining. Once spec_qsort is inlined, the compare function call will be inlined further.

I can imagine a fix in the inliner cost analysis to give the call to spec_qsort a bigger bonus when it sees that the compare function is also locally hot (e.g. used in a loop). Does that make sense?
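Roughly, the idea sketched as code (the helper and the bonus value below are hypothetical, not existing InlineCost code):

// Hypothetical sketch: when a callsite passes a constant function pointer
// and the callee invokes it from inside a loop, credit the callsite with
// an extra bonus so spec_qsort-like callees become attractive to inline.
if (isa<Function>(ConstantArg) && calleeCallsArgInLoop(Callee, ArgNo))
  Cost -= HotIndirectCallBonus; // hypothetical helper and bonus value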

When the compiler analyzes the callsite to spec_qsort, it sees the constant function pointer passed to the callee. During the callee body analysis, when the indirect callsite is seen, it will be simplified/evaluated to the callsite constant, so the compare call can be resolved. If the compiler decides that the compare call can be further inlined, it will give the original callsite an additional bonus for inlining. Once spec_qsort is inlined, the compare function call will be inlined further.

I can imagine a fix in the inliner cost analysis to give the call to spec_qsort a bigger bonus when it sees that the compare function is also locally hot (e.g. used in a loop). Does that make sense?

It doesn't make sense to me. There are two reasons:

  • We can't inline everything. Maybe it is possible to inline spec_qsort by giving it a bigger bonus. However, we could easily construct an example in which a function that takes a function pointer as a parameter is too large to be inlined.
  • Consider the following example:
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(cost_compare);
spec_qsort(cost_compare);
spec_qsort(cost_compare);

Then we could construct two specializations of spec_qsort to solve the problem, whereas inlining would have to inline spec_qsort seven times: roughly a 2x duplication of the body versus a 7x one. When spec_qsort becomes bigger, the difference becomes clearer.

If the function pointer points to a function too large to be inlined, I expect the benefit of cloning the caller will also be minimal.

It is possible that the spec_qsort-like function is very large, so it won't be inlined no matter what. In such a case the cloning may be helpful.

What I am saying is that for the motivating example in MCF, it may be possible to fix the inliner.

If the function pointer points to a function too large to be inlined, I expect the benefit of cloning the caller will also be minimal.

It depends on the number of callsites. But for the example in mcf, I agree with you that cloning may not be more beneficial than inlining.

What I am saying is that for the motivating example in MCF, it may be possible to fix the inliner.

I agree that it may be possible to inline spec_qsort by adjusting the bonus. What I am saying is that function specialization is valuable and important.

I agree that it may be possible to inline spec_qsort by adjusting the bonus. What I am saying is that function specialization is valuable and important.

If it is valuable and important, definitely go for it :)

My question is: other than MCF, do we have other real-world apps that can benefit from this optimization (in a way that cannot be achieved by the inliner)? I saw omnetpp also improves; do you know where it helps? Also, with PGO, do we see a similar improvement?

Asking because there are compile-time and code-size consequences.

What about C-ray? Does function specialization help?

Anyway, it would be good to teach the inliner to inline that one important function in the c-ray benchmark. GCC has a heuristic along the lines of: "if we inline this function into a loop, does something become loop invariant? If yes, inline!"

I saw omnetpp also improves; do you know where it helps?

We measured omnetpp's improvement with 10-20 iterations in the past, and now the default iteration limit is 1. I didn't look into where it helps; I will try to if possible.

Also, with PGO, do we see a similar improvement?

Good question, we need to experiment to answer it.

Asking because there are compile-time and code-size consequences.

When we limit the iterations to one, the cost looks acceptable, as the data provided by @SjoerdMeijer shows.

My question is: other than MCF, do we have other real-world apps that can benefit from this optimization (in a way that cannot be achieved by the inliner)?

For this question, my answer is that, judging from the cost model I have looked at, the current implementation may affect real-world applications only a little. My thought is that tuning the cost model should be a long-term process; it is hard to imagine that we can make something good enough in one commit.


BTW, from the discussion we are in consensus that cloning is beneficial for handling the cases that inlining can't. But in the current pass pipeline, function specialization runs before the inlining pass. I am not sure whether that is harmful, but I wonder if it would be good to run the inlining pass before the function specialization pass and, if the function specialization pass made changes, to run the inliner again.
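For illustration, that ordering could look roughly like this in the module pipeline (a sketch only, assuming the pass class names used here; whether a second inliner run is acceptable is exactly what would need discussing):

// Hypothetical ordering: inline first, specialize what is left, then give
// the inliner another chance at the specialized clones.
MPM.addPass(buildInlinerPipeline(Level, Phase));
MPM.addPass(FunctionSpecializationPass());
MPM.addPass(buildInlinerPipeline(Level, Phase));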

BTW, from the discussion we are in consensus that cloning is beneficial for handling the cases that inlining can't. But in the current pass pipeline, function specialization runs before the inlining pass. I am not sure whether that is harmful, but I wonder if it would be good to run the inlining pass before the function specialization pass and, if the function specialization pass made changes, to run the inliner again.

In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner. Similar to the inliner, it may be better to run it after the callsite-splitting pass to expose opportunities:

if (...) {
  a = 0;
} else {
  a = g;
}
foo(a);

->

if (...) {
  foo(0);     // specialization
} else {
  foo(g);
}

Thank you for further discussing this.

I have one more data point to add, related to compile-times which is probably the biggest concern for this work:

  • Timed compilation of a non-LTO LLVM release build shows a 0.6% compile-time increase and a 0.003% binary-size increase. The code-size increase is not the point here; the point is that function specialisation triggers with a very modest compile-time increase.

Other compile-times discussed in this very lengthy thread show similar trends (for SPEC, MySQL).

Sorry for repeating myself: I am looking for someone to bless this work, but let me say a few things on this first:

  • this will certainly need more tuning and experimentation,
  • the goal is to get this on by default, like GCC, but depends on the previous point,
  • getting this all right and enabled by default with the first commit is probably not achievable, which is why I am looking for an initial commit.
  • and given the interest in this work we can start sharing some of this when it is in tree.

@fhahn, as the kind of biggest supporter of "this must be on by default" early in the conversations, are you happy with this?

getting this all right and enabled by default with the first commit is probably not achievable

Maybe yes, well…

If we could pick some of the most obvious and profitable cases and enable this pass for them (simple heuristics, otherwise be conservative), and, if we have PGO data, specialize very hot functions?
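A minimal sketch of such PGO gating, assuming ProfileSummaryInfo were plumbed into the pass (it is not in the current patch):

// Hypothetical gate: with profile data, only consider hot functions;
// without it, fall back to the static cost model.
static bool worthSpecializing(Function &F, ProfileSummaryInfo *PSI) {
  if (PSI && PSI->hasProfileSummary())
    return PSI->isFunctionEntryHot(&F);
  return true;
}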

Yeah, I see the concern that yet another off-by-default pass is not ideal. Just look at how many passes are in tree and disabled.

So +1 if we can run this pass even with conservative heuristics.

nikic added a comment.May 20 2021, 1:48 AM

I'm pretty sure @fhahn did not mean to suggest that the pass is actually default enabled when it lands, merely that there is some viable path towards enabling it in the future. Adding a pass and enabling it in the default pipeline are always two separate steps. And it does sound to me like there is a viable path forward here, so this is mainly waiting on someone to perform a technical review of the implementation (that someone would preferably be @fhahn, as the local IPSCCP expert :)

If we could pick some of the most obvious and profitable cases and enable this pass for them (simple heuristics, otherwise be conservative), and, if we have PGO data, specialize very hot functions?

Yeah, I see the concern that yet another off-by-default pass is not ideal. Just look at how many passes are in tree and disabled.

So +1 if we can run this pass even with conservative heuristics.

PGO is definitely interesting. Actually quite a few interesting ideas have been brought up during discussions:

  • An idea was expressed that, with a function specialisation attribute on arguments, we could have a direct way of specialising, avoiding the cost-model,
  • Currently we only support constants, not constant ranges,
  • and I see that PGO could definitely help.

But I think these things can be best added once we have an initial implementation.

Thanks @nikic for your thoughts:

I'm pretty sure @fhahn did not mean to suggest that the pass is actually default enabled when it lands, merely that there is some viable path towards enabling it in the future. Adding a pass and enabling it in the default pipeline are always two separate steps. And it does sound to me like there is a viable path forward here, so this is mainly waiting on someone to perform a technical review of the implementation (that someone would preferably be @fhahn, as the local IPSCCP expert :)

I agree with this, and I think I have shown that there is a viable path to get this enabled (again, this is our goal, because we want to reach parity with GCC here).
We have performed quite a few rounds of code review. Although there's obviously more to do, my impression was that people were reluctant to approve this as a first version because of the "on by default" discussion, and not so much because of other things.

My question is: other than MCF, do we have other real-world apps that can benefit from this optimization (in a way that cannot be achieved by the inliner)?

An alternative perspective: an inliner does two things. It elides call setup/return code, and it specialises the function on the call site. These can be, and probably should be, separable concerns.

Today we inline very aggressively, which is suboptimal for platforms with code-size (or cache) restrictions, but it does give the call-site specialisation effect. So this patch, today, needs a function large enough not to be inlined before a benefit shows. Examples will be things like qsort or large switch statements on a parameter.

With a specialisation pass in tree we can start backing off the inliner. Calling conventions do have overhead, but I suspect the majority of the performance win of inlining comes from the specialisation. If that intuition is sound, then this plus a less aggressive inliner will beat the status quo through better icache utilisation. Performance tests at -Os may validate that expectation.

My question is: other than MCF, do we have other real-world apps that can benefit from this optimization (in a way that cannot be achieved by the inliner)?

An alternative perspective: an inliner does two things. It elides call setup/return code, and it specialises the function on the call site. These can be, and probably should be, separable concerns.

Today we inline very aggressively, which is suboptimal for platforms with code-size (or cache) restrictions, but it does give the call-site specialisation effect. So this patch, today, needs a function large enough not to be inlined before a benefit shows. Examples will be things like qsort or large switch statements on a parameter.

The benefit of inlining comes from many different areas:

  1. call overhead reduction (call, pro/epilogue)
  2. inline instance body cleanup with callsite contexts (this is what specialization can get)
  3. cross-procedure-boundary optimizations:
     3.1) PRE, jump threading, etc. between the caller body and inline instances
     3.2) cross-function optimization between sibling calls (sibling inline instances)
     3.3) better code layout of the inline instance body with the enclosing call context

This is why, with PGO, we have a very large inline threshold set for hot callsites.

For function specialization with PGO, we can use profile data to do function cloning selectively, but then again, a hot callsite is very likely better off being inlined.

I agree function specialization has its place when size is the concern (with -Os) or when the instruction working set is too large (hurting performance). We don't have a mechanism to analyze the latter yet.

With a specialisation pass in tree we can start backing off the inliner. Calling conventions do have overhead, but I suspect the majority of the performance win of inlining comes from the specialisation. If that intuition is sound, then this plus a less aggressive inliner will beat the status quo through better icache utilisation. Performance tests at -Os may validate that expectation.

fhahn added a comment.May 20 2021, 9:12 AM

I'm pretty sure @fhahn did not mean to suggest that the pass is actually default enabled when it lands, merely that there is some viable path towards enabling it in the future. Adding a pass and enabling it in the default pipeline are always two separate steps.

This was indeed how my earlier comments were intended. Some comments on the code itself.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
1 ↗(On Diff #345249)

this needs updating.

15 ↗(On Diff #345249)

this sounds a bit odd. The first sentence says it handles constant parameters. I guess you mean non-constant-int constants?

46 ↗(On Diff #345249)

needs a test for the option?

57 ↗(On Diff #345249)

needs a test for the option?

65 ↗(On Diff #345249)

Those were added to transition existing code in SCCP to ValueLatticeElement. Ideally new code would be explicit about what it expects (constant-range vs constant-int).

130 ↗(On Diff #345249)

Instead of patching up the IR, can we just set the lattice values for the cloned function arguments accordingly until we are done with solving?

172 ↗(On Diff #345249)

Could you explain why we need to remove ssa_copy in the cloned function?

212 ↗(On Diff #345249)

nit: no llvm:: should be needed.

637 ↗(On Diff #345249)

Why do we need this transformation here? Is this required for specialization or to catch the MCF case?

766 ↗(On Diff #345249)

Why do we need to modify the IR after each call to RunSCCPSolver rather than after all solving is done?

777 ↗(On Diff #345249)

Are those declarations? Shouldn't we only track functions with definitions?

798 ↗(On Diff #345249)

Is it possible to only replace values once we are completely done with solving?

llvm/lib/Transforms/Scalar/SCCP.cpp
30

All those includes are not needed in the file?

1730

Why is this needed?

llvm/test/Transforms/FunctionSpecialization/function-specialization4.ll
7 ↗(On Diff #345249)

do those tests depend on information from the AArch64 backend? If so, they should only be executed if the AArch64 backend is built.

Ah cheers, many thanks for the comments and review! Will start addressing them now.

ChuanqiXu added inline comments.May 21 2021, 12:08 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
315–317 ↗(On Diff #345249)

Should this be != instead? Or I think we should use TTI.getUserCost for the initial value directly. As it stands, if getUserCost returns non-zero, the value for Cost becomes 0! That is very odd.

SjoerdMeijer added inline comments.May 24 2021, 3:50 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
130 ↗(On Diff #345249)

I might not get your suggestion, but the solving is done much earlier. I.e., RunSCCPSolver is done in runFunctionSpecialization. This here is the final step that performs the actual transformation.

315–317 ↗(On Diff #345249)

Yeah, I am fixing and rewriting this to use InstructionCost.

637 ↗(On Diff #345249)

It's also not yet covered by tests, so I will remove it for now. Even if it is needed for MCF (I can't remember), it looks like a nice candidate for a follow-up once we've got something in.

798 ↗(On Diff #345249)

Will remove this (see also earlier comment); this is a bit of a different optimisation that we can look at later.

SjoerdMeijer added inline comments.May 24 2021, 4:00 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
777 ↗(On Diff #345249)

Yeah, these should be functions with definitions. There's an F.isDeclaration() check earlier in this function.

SjoerdMeijer added inline comments.May 24 2021, 5:30 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
172 ↗(On Diff #345249)

I can't. :-) Not yet at least, so will also remove this code (for now).

This addresses most remarks:

  • transitioned to using InstructionCosts,
  • added tests/testing,
  • removed unnecessary things.

It seems like the patch simply needs to be formatted : )

Ran clang-format.

@fhahn: about running the solver first before making IR changes: I think that is happening already. There are some places where the solver is kept up to date after transformations. I think this is a remnant of running function specialisation iteratively, which I stripped out for now, but I think it's good to keep these updates so the solver stays consistent.

fhahn added a comment.Jun 1 2021, 2:00 AM

Ran clang-format.

@fhahn: about running the solver first before making IR changes: I think that is happening already. There are some places where the solver is kept up to date after transformations. I think this is a remnant of running function specialisation iteratively, which I stripped out for now, but I think it's good to keep these updates so the solver stays consistent.

I'm not sure I'm looking at the right thing, but it looks like the specialization is still run iteratively via the code below? RunSCCPSolver seems to modify the IR after solving. Maybe I am looking at the wrong thing?

while (FuncSpecializationMaxIters != I++ &&
       FS.specializeFunctions(FuncDecls, CurrentSpecializations)) {

  // Run solver for the specialized functions only.
  RunSCCPSolver(CurrentSpecializations);

  CurrentSpecializations.clear();
  Changed = true;
}

Do we have test cases where the specialization leads to the result of the specialized function becoming another constant, enabling further simplification in the caller?

llvm/test/Transforms/FunctionSpecialization/function-specialization-recursive.ll
3 ↗(On Diff #348239)

can you elaborate on what is going wrong here and what needs fixing?

Ah okay, you're right: the solver runs, but only for the specialised functions, as the comment says. The main solving happens earlier in runFunctionSpecialization. As I also wrote earlier, I have been stripping a few things out of the initial version, like running 10 iterations and doing it recursively, to prepare a basic version for an initial commit and to get a baseline for the extra compile-time costs. I will get rid of invoking the solver here, add a TODO, and also get rid of the disabled test for now. From memory, I think that test needed more iterations to kick in, but I will look at that later in a follow-up.

Do we have test cases where the specialization leads to the result of the specialized function becoming another constant, enabling further simplification in the caller?

Yeah, I think the test test/Transforms/FunctionSpecialization/function-specialization.ll is an example of that: the specialised functions completely disappear because of further simplifications.

Addressed @fhahn's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

wenlei added a subscriber: wenlei.Jun 4 2021, 9:49 AM

We also saw ipa-cp-clone making a very noticeable difference between gcc and llvm when we tried to move workloads from gcc to llvm. Thanks for working on this for llvm optimization parity with gcc.

I think specialization can be beneficial when we can't do very selective inlining: it can realize part of the benefit of inlining with a more moderate size increase. I suspect the benefit of ICP from specialization will be much lower when PGO is used, though, because PGO's speculative ICP is quite effective. For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants on top of PGO.
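As an invented illustration of the non-function-pointer case (not taken from any benchmark):

// If profiling shows the hot callers all pass stride == 1, a clone
// specialized on that constant becomes a unit-stride loop that is
// straightforward to vectorize.
int sum(const int *a, int n, int stride) {
  int s = 0;
  for (int i = 0; i < n; i += stride)
    s += a[i];
  return s;
}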

When this is ready, I'd be happy to give it a try on our larger workload to see if there's any interesting hit and how it compares against gcc's ipa-cp-clone for the same workload, all with PGO.

Also, with PGO, do we see a similar improvement?

Good question, we need to experiment to answer it.

@ChuanqiXu FWIW, PGO always does speculative ICP from spec_qsort to cost_compare and arc_compare in spec2017/mcf.

I agree function specialization has its place when size is the concern (with -Os) or when the instruction working set is too large (hurting performance). We don't have a mechanism to analyze the latter yet.

@davidxl you're right that we can't model the latter, but the same working-set concern is true for inlining too, and yet we still need to control size bloat from inlining (through inline cost etc.) even if we can't model it accurately. I think specialization can be another layer that allows the compiler to fine-tune that balance when needed. I'm curious: have you tried turning off ipa-cp-clone in gcc for your workloads, and does that lead to a noticeable perf difference?

In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner.

Agreed. Though, from a compile-time perspective, I'm wondering: if we run specialization before inlining, would we be paying the cost of specializing functions that will eventually be inlined, in which case the extra cost may yield nothing in perf? If we run specialization after inlining instead, we'll only be targeting those functions deemed not worth inlining.

llvm/lib/Passes/PassBuilder.cpp
1112

With LTO or ThinLTO, would it make sense to schedule the pass in post-link only?

buildModuleSimplificationPipeline is used by both prelink and postlink for ThinLTO.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
291 ↗(On Diff #349228)

We could replace this AvgLoopIterationCount tuning knob with real frequency when PGO/BFI is available.

In order for specialization to be beneficial on top of PGO, I think leveraging counts in the cost/benefit analysis is important. Doing that in a later patch is fine.
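A minimal sketch of that replacement (Bonus and User are assumed names for the patch's per-use bonus and the using instruction, not verified against it):

// Weight the per-use bonus by how often the use actually executes,
// relative to the function entry, instead of a fixed average loop count.
BlockFrequencyInfo &BFI = GetBFI(*F);
uint64_t EntryFreq = std::max<uint64_t>(BFI.getEntryFreq(), 1);
uint64_t UseFreq = BFI.getBlockFreq(User->getParent()).getFrequency();
Bonus = Bonus * UseFreq / EntryFreq;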

We also saw ipa-cp-clone making a very noticeable difference between gcc and llvm when we tried to move workloads from gcc to llvm. Thanks for working on this for llvm optimization parity with gcc.

I think specialization can be beneficial when we can't do very selective inlining: it can realize part of the benefit of inlining with a more moderate size increase. I suspect the benefit of ICP from specialization will be much lower when PGO is used, though, because PGO's speculative ICP is quite effective. For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants on top of PGO.

Agree.

@davidxl you're right that we can't model the latter, but the same working-set concern is true for inlining too, and yet we still need to control size bloat from inlining (through inline cost etc.) even if we can't model it accurately. I think specialization can be another layer that allows the compiler to fine-tune that balance when needed. I'm curious: have you tried turning off ipa-cp-clone in gcc for your workloads, and does that lead to a noticeable perf difference?

My recollection is that it did not result in much perf difference with FDO on our internal workloads -- otherwise it would have been further tuned.

In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner.

Agreed. Though, from a compile-time perspective, I'm wondering: if we run specialization before inlining, would we be paying the cost of specializing functions that will eventually be inlined, in which case the extra cost may yield nothing in perf? If we run specialization after inlining instead, we'll only be targeting those functions deemed not worth inlining.

Makes sense -- this is the reason for iterative optimization. Partial inlining is in the same situation too.

For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants, on top of PGO

Definitely. That's one of the first things we could investigate once we've got a first version in. I have documented this under "current limitations".

Sorry for the early ping, @fhahn, but as you were the last one with comments: are you happy with the latest changes?

ChuanqiXu accepted this revision.Jun 8 2021, 3:02 AM

Thanks for adding me as a reviewer.
I am new to the LLVM community and the SCCP module, so I am not sure whether it is suitable for me to accept it; just say so if anyone is not comfortable with that.
The overall patch looks good to me as a first version. Although some problems certainly remain, I think we can iterate on top of this patch after merging. For example, I am playing with function specialization for ThinLTO locally, and from the comments @wenlei gave, PGO with function specialization is also interesting and promising. So after merging we could work on specialization together on different aspects, like ThinLTO, PGO, bug fixes, and cost-model refactoring. So I choose to accept it.
Please wait 2~3 days before committing, in case there are more comments.

This revision is now accepted and ready to land.Jun 8 2021, 3:02 AM
fhahn added a comment.Jun 9 2021, 2:56 AM

Addressed @fhahn's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
117 ↗(On Diff #349228)

Are those updates to the solver still needed, after not running the solver after specializeFunctions?

Addressed @fhahn's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

Thanks for the comments. Since they are minor, I will fix them before committing; i.e., I will remove that update to the solver and add the test back.

fhahn added a comment.Jun 9 2021, 4:29 AM

Thanks for the comments. Since they are minor, I will fix them before committing; i.e., I will remove that update to the solver and add the test back.

There are at least a few more places that may modify the solver state after the solver is run. I think it would be good to audit them and update the patch. IIUC, in the current version there should be no need to update the solver state after RunSolver. It would also be good to check whether there are any places with updates I missed.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
146 ↗(On Diff #349228)

This should also not be needed?

154 ↗(On Diff #349228)

also not needed?

233 ↗(On Diff #349228)

Is this needed if the solver does not run again?

243 ↗(On Diff #349228)

Is this needed if the solver does not run again?

Removed those updates too, added the test case back. Appreciate it if you can have one more quick look.

This revision was landed with ongoing or failed builds.Jun 11 2021, 1:22 AM
This revision was automatically updated to reflect the committed changes.

Many thanks for all your help and reviews. I have committed this because it was LGTM'ed (officially, and also unofficially in comments), no major changes have been made or suggested for a while, and this is a first version that is off by default. This is also just the beginning of function specialisation: I will continue working on it and will be happy to receive post-commit comments for this version, and of course comments on the work that I am planning and starting to do. I.e., I will now start working on some of the limitations before I start thinking about how to eventually get this enabled by default.

fhahn added inline comments.Jun 11 2021, 1:34 AM
llvm/include/llvm/Transforms/Utils/SCCPSolver.h
146 ↗(On Diff #351361)

unused?

148 ↗(On Diff #351361)

unused?

149 ↗(On Diff #351361)

unused?

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
216 ↗(On Diff #351361)

Is this needed in the latest version? If it is not needed, please also remove it from the interface.

509 ↗(On Diff #351361)

stray newline?

fhahn added a comment.Jun 11 2021, 1:40 AM

I'm not quite sure why the implementation is in llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp rather than llvm/lib/Transforms/IPO/FunctionSpecialization.cpp? It's interprocedural only, right? The implementation in SCCP.cpp works on both a single function and a module; that's probably why it is in the Scalar sub-directory. It's also a separate pass from IPSCCP, so I am not sure the pass definition should be in llvm/lib/Transforms/IPO/SCCP.cpp.

fhahn added inline comments.Jun 11 2021, 1:49 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
250 ↗(On Diff #351361)

NumFuncSpecialized is defined as STATISTIC(NumFuncSpecialized, "Number of Functions Specialized"); how will this work when LLVM is built without statistics? (Also: redundant brackets.)

65 ↗(On Diff #345249)

Looks like this was not addressed. If the function is kept, please at least update the comment, as it is misleading at the moment (it also returns false if LV is a constant range with more than a single element; see the comment for the same function in SCCP.cpp).

I'm not quite sure why the implementation is in llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp rather than llvm/lib/Transforms/IPO/FunctionSpecialization.cpp? It's interprocedural only, right? The implementation in SCCP.cpp works on both a single function and a module; that's probably why it is in the Scalar sub-directory. It's also a separate pass from IPSCCP, so I am not sure the pass definition should be in llvm/lib/Transforms/IPO/SCCP.cpp.

Thanks, I will follow up on this shortly (i.e., I will prepare a diff). I can't remember if there was something special about this pass definition, but I will look.

fhahn added inline comments.Jun 11 2021, 2:47 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
462 ↗(On Diff #351361)

Looks like this is not covered by a test. Would be good to have one.

Hi,

We stumbled upon a crash in this pass during fuzz testing so I wrote a PR about it:
https://bugs.llvm.org/show_bug.cgi?id=51600

Hi,

We stumbled upon a crash in this pass during fuzz testing so I wrote a PR about it:
https://bugs.llvm.org/show_bug.cgi?id=51600

Thanks for reporting. I will look at this next week.

dewen added a subscriber: dewen.Jun 15 2023, 8:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2023, 8:18 PM
nlopes added inline comments.Jun 16 2023, 12:20 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
135 ↗(On Diff #351361)

Can't this be poison?
If this is just a placeholder for unreachable code, then poison is sufficient.

We're trying to remove undef, so we would appreciate it if no more uses of undef were added to the codebase.
Thank you!
Thank you!