
[SCCP] Add Function Specialization pass
ClosedPublic

Authored by SjoerdMeijer on Dec 27 2020, 7:25 AM.

Details

Summary

This patch adds a function specialization pass to LLVM. The constant parameters like function pointers and constant globals are propagated to the callee by specializing the function.

Current limitations:

  • It does not handle specialization of recursive functions,
  • It does not yet handle constants and constant ranges,
  • Only 1 argument per function is specialised,
  • The cost-model could be further looked into,
  • We are not yet caching analysis results.

The pass is based on the Function specialization proposed here.

Diff Detail

Event Timeline

ChuanqiXu added a comment.EditedMay 17 2021, 3:51 AM

Friendly ping: I am looking for an LGTM for this so that we can continue with the next steps.

This patch looks good to me. But I think I am not the right person to accept it : )

BTW, it looks like we need to format this.

I am wondering if we can remove the original IPSCCP once we make this strong enough?

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
433–439

It looks like we left this work for the original IPSCCP to handle. Can we handle it in this pass?

davide removed a reviewer: davide.May 19 2021, 10:33 AM

For the motivating case from MCF, there is actually code in InlineCost Analysis to handle it. See InlineCostCallAnalyzer::onLoweredCall(..). If it is not already handled there, it is a bug to be fixed there.

It is hard to believe that inlining could handle the case in MCF. In MCF there is a function spec_qsort with signature (void*, size_t, size_t, int*(const void*, const void*)), and there are two calls to spec_qsort:

spec_qsort(perm + 1, basket_sizes[thread], sizeof(BASKET*),
            (int (*)(const void *, const void *))cost_compare);
// and
spec_qsort(arcs_pointer_sorted[thread], new_arcs_array[thread], sizeof(arc_p),
                (int (*)(const void *, const void *))arc_compare);

The benefit comes from function specialization specializing spec_qsort for cost_compare and arc_compare; finally, cost_compare and arc_compare are inlined into the specialized functions. In my mind, without function specialization the compiler couldn't determine the value of the function pointer at compile time, so the inlining couldn't happen.

When compiler analyzes the callsite to spec_qsort, it sees the constant function pointer passed to the callee. During the callee body analysis, when the indirect callsite is seen, it will be simplified/evaluated to be the callsite constant thus the compare call can be resolved. If compiler decides that the compare call can be further inlined, it will give the original callsite additional bonus for inlining. Once spec_qsort is inlined, the compare function call will be inlined further.

I can imagine a fix in the inliner cost analysis to give the call to spec_qsort a bigger bonus when it sees that the compare function is also locally hot (e.g used in a loop). Does it make sense?

It doesn't make sense to me. There are two reasons:

  • We can't inline everything. It may be possible to inline spec_qsort by giving it a bigger bonus, but we could easily construct an example where a function taking a function pointer parameter is too large to inline.
  • Consider the following example:
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(arc_compare);
spec_qsort(cost_compare);
spec_qsort(cost_compare);
spec_qsort(cost_compare);

Then we could construct two specializations of spec_qsort to solve the problem, whereas inlining would have to happen seven times. As spec_qsort gets bigger, the difference becomes clearer.

If the function pointer points to a function too large to be inlined, I expect the benefit of cloning the caller will also be minimal.

It is possible that the 'spec_sort' like function is very large so it won't be inlined no matter what. In such as case the cloning may be helpful.

What I am saying is that for the motivating example in MCF, it may be possible to fix the inliner.

If the function pointer points to a function too large to be inlined, I expect the benefit of cloning the caller will also be minimal.

It depends on the number of callsites. But for the example in mcf, I agree with you that cloning may not be more beneficial than inlining.

What I am saying is that for the motivating example in MCF, it may be possible to fix the inliner.

I agree that it may be possible to inline spec_qsort by adjusting the bonus. What I am saying is that function specialization is valuable and important.

If it is valuable and important, definitely go for it :)

My question is: other than MCF, do we have other real-world apps that can benefit from this optimization (in a way that can not be done by the inliner)? I saw omnetpp also improves; do you know where it helps? Also with PGO, do we see a similar improvement?

Asking these because there are compile-time and code-size consequences.

What about C-ray? Does function specialization help?

Anyway, it would be good to teach the inliner to inline that one important function in the c-ray benchmark. GCC has a heuristic along the lines of: “if we inline this function into a loop, does something become loop invariant? If yes, inline!”

I saw omnetpp also improves, do you know where it helps ?

In the past we measured omnetpp with 10~20 iterations; now the default iteration limit is 1. I didn't look into where it helps, but I will try to if possible.

Also with PGO, do we see similar improvement?

Good question, we need to experiment to answer it.

Asking these because there are compile time and code size consequences

When we limit the iterations to one, the impact looks modest, as shown by the data provided by @SjoerdMeijer.

My question is other than MCF, do we have other real world app that can benefit from this optimization (that can not be done by inliner)?

For this question, my answer is that, judging from the cost model I have looked at, the current implementation may have little effect on real-world applications. My thought is that tuning the cost model should be a long-term process; it is hard to imagine that we can make it good enough in a single commit.


BTW, from the discussion, we agree that cloning is beneficial for the cases that inlining can't handle. But in the current pass pipeline, function specialization runs before the inlining pass. I am not sure whether that is harmful, but I wonder if it would be better to run the inlining pass before function specialization, and, if function specialization made changes, run the inliner again.


In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner. Similar to the inliner, it may be better to run it after the callsite splitting pass to expose opportunities:

if (...) {
  a = 0;
} else {
  a = g;
}
foo(a);

->

if (...) {
  foo(0);   // specialization
} else {
  foo(g);
}

Thank you for further discussing this.

I have one more data point to add, related to compile-times which is probably the biggest concern for this work:

  • Timed compilation of a non-LTO LLVM release build increases compile time by 0.6% and binary size by 0.003%. The code-size increase is not the point here. The point is that function specialisation triggers, with a very modest compile-time increase.

Other compile-times discussed in this very lengthy thread show similar trends (for SPEC, MySQL).

Sorry for repeating myself, but I am still looking for someone to bless this work. Let me say a few things on this:

  • this will certainly need more tuning and experimentation,
  • the goal is to get this on by default, like GCC has it, but that depends on the previous point,
  • getting this all right and enabled by default with the first commit is probably not achievable, which is why I am looking for a first commit,
  • and given the interest in this work we can start sharing some of this when it is in tree.

@fhahn, as the kind of biggest supporter of "this must be on by default" early in the conversations, are you happy with this?

getting this all right and enabled by default with the first commit is probably not achievable

Maybe yes, well…

If we could pick some of the most obvious and profitable cases and enable this pass for them (simple heuristics, otherwise be conservative), plus, if we have PGO data, specialize very hot functions?

Yeah, I see concerns that another off by default pass is not ideal. Just look how many passes are in tree and disabled.

So +1 if we can run this pass even with conservative heuristics.

nikic added a comment.May 20 2021, 1:48 AM

I'm pretty sure @fhahn did not mean to suggest that the pass is actually default enabled when it lands, merely that there is some viable path towards enabling it in the future. Adding a pass and enabling it in the default pipeline are always two separate steps. And it does sound to me like there is a viable path forward here, so this is mainly waiting on someone to perform a technical review of the implementation (that someone would preferably be @fhahn, as the local IPSCCP expert :)

PGO is definitely interesting. Actually quite a few interesting ideas have been brought up during discussions:

  • An idea was expressed that, with a function specialisation attribute on arguments, we could have a direct way of specialising, avoiding the cost-model,
  • Currently we only support constants, not constant ranges,
  • and I see that PGO could definitely help.

But I think these things can be best added once we have an initial implementation.

Thanks @nikic for your thoughts:

I agree with this, and I think I have shown that there is a viable path to get this enabled (again, this is our goal, because we want to reach parity with GCC here).
We have performed quite a few rounds of code review. Although there's obviously more to do, my impression was that people were reluctant to approve this as a first version because of the "on by default" discussion, and not so much because of other things.

My question is other than MCF, do we have other real world app that can benefit from this optimization (that can not be done by inliner)?

An alternative perspective. An inliner does two things. It elides call setup/return code, and it specialises the function on the call site. These can be, and probably should be, separable concerns.

Today we inline very aggressively, which is suboptimal for platforms with code size (or cache) restrictions, but does give the call site specialisation effect. So this patch, today, needs a function large enough to escape the inliner in order to see a benefit. Examples will be things like qsort or large switch statements on a parameter.

With a specialisation pass in tree we can start backing off the inliner. Calling conventions do have overhead, but I suspect the majority of the performance win of inlining is from the specialisation. If that intuition is sound, then this plus a less aggressive inliner will beat the status quo through better icache utilisation. Performance tests at -Os may validate that expectation.

The benefit of inlining comes from many different areas:

  1. call overhead reduction (call, pro/epilogue)
  2. inline instance body cleanup with callsite contexts (this is what specialization can get)
  3. cross procedure boundary optimizations:
     3.1) PRE, jump threading, etc. between the caller body and inline instances;
     3.2) cross-function optimization between sibling calls (sibling inline instances);
     3.3) better code layout of the inline instance body with the enclosing call context.

This is why with PGO, we have very large inline threshold setup for hot callsites.

For function specialization with PGO, we can use profile data to selectively do function cloning, but then again, it is very likely better to be inlined given its hotness.

I agree function specialization has its place when size is the concern (with -Os), or instruction working set is too large (to hurt performance). We don't have a mechanism to analyze the latter yet.

fhahn added a comment.May 20 2021, 9:12 AM

I'm pretty sure @fhahn did not mean to suggest that the pass is actually default enabled when it lands, merely that there is some viable path towards enabling it in the future. Adding a pass and enabling it in the default pipeline are always two separate steps.

This was indeed how my earlier comments were intended. Some comments on the code itself.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
2

this needs updating.

16

this sounds a bit odd. The first sentence says it handles constant parameters. I guess you mean non-constant-int constants?

47

need test for the option?

58

needs test for the option?

66

Those were added to transition existing code in SCCP to ValueLatticeElement. Ideally new code would be explicit about what it expects (constant-range vs constant-int).

131

Instead of patching up the IR, can we just set the lattice values for the cloned function arguments accordingly until we are done with solving?

173

Could you explain why we need to remove ssa_copy in the clone function?

213

nit: no llvm:: should be needed.

638

Why do we need this transformation here? Is this required for specialization or to catch the MCF case?

767

Why do we need to modify the IR after each call to RunSCCPSolver rather than after all solving is done?

778

Are those declarations? Shouldn't we only track functions with definitions?

799

Is it possible to only replace values once we are completely done with solving?

llvm/lib/Transforms/Scalar/SCCP.cpp
30

All those includes are not needed in the file?

161

Why is this needed?

llvm/test/Transforms/FunctionSpecialization/function-specialization4.ll
8

do those tests depend on information from the AArch64 backend? If so, they should only be executed if the AArch64 backend is built.

Ah cheers, many thanks for the comments and review! Will start addressing them now.

ChuanqiXu added inline comments.May 21 2021, 12:08 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
316–318

Should this be != instead? Or perhaps we should use TTI.getUserCost for the initial value directly. As written, if getUserCost returns non-zero, the value for Cost becomes 0, which seems wrong.

SjoerdMeijer added inline comments.Mon, May 24, 3:50 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
131

I might not get your suggestion, but the solving is done much earlier. I.e., RunSCCPSolver is done in runFunctionSpecialization. This here is the final step that performs the actual transformation.

316–318

Yeah, am fixing and rewriting this to use InstructionCost.

638

It's also not yet covered by tests, so I will remove it for now. Even if it is needed for MCF (I can't remember), it looks like a nice candidate for a follow-up once we've got something in.

799

Will remove this (see also earlier comment); this is a bit of a different optimisation that we can look at later.

SjoerdMeijer added inline comments.Mon, May 24, 4:00 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
778

Yeah, these should be functions with definitions. There's an F.isDeclaration() check earlier in this function.

SjoerdMeijer added inline comments.Mon, May 24, 5:30 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
173

I can't. :-) Not yet at least, so will also remove this code (for now).

This addresses most remarks:

  • transitioned to using InstructionCosts,
  • added tests/testing,
  • removed unnecessary things.

It seems like the patch just needs to be formatted : )

Ran clang-format.

@fhahn: about running the solver first before making IR changes: I think that is happening already. There are some places where the solver is kept up to date after transformations. I think this is a remnant of running function specialisation iteratively, which I stripped out for now, but it's probably good to keep these updates so the solver stays consistent.

fhahn added a comment.Tue, Jun 1, 2:00 AM

I'm not sure I'm looking at the right thing, but it looks like the specialization is still run iteratively via the code below? RunSCCPSolver seems to modify the IR after solving. Maybe I am looking at the wrong thing?

while (FuncSpecializationMaxIters != I++ &&
       FS.specializeFunctions(FuncDecls, CurrentSpecializations)) {

  // Run solver for the specialized functions only.
  RunSCCPSolver(CurrentSpecializations);

  CurrentSpecializations.clear();
  Changed = true;
}

Do we have test cases where the specialization leads to the function result of the specialized function becoming another constant, enabling further simplification in the caller?

llvm/test/Transforms/FunctionSpecialization/function-specialization-recursive.ll
4

can you elaborate what is going wrong here and what needs fixing?

Ah okay, you're right: the solver runs, but only for the specialised functions, as the comment says. The main solving happens earlier in runFunctionSpecialization. As I also wrote earlier, I have been stripping a few things out of the initial version, like running 10 iterations and doing it recursively, to prepare a basic version for an initial commit and get a baseline for the extra compile-time costs. I will get rid of invoking the solver here, add a TODO, and also remove the disabled test for now. From memory, I think it needed more iterations to kick in, but I will look at that later in a follow-up.

Do we have test cases where the specialization leads to the function result of the specialized function becoming another constant, enabling further simplification in the caller?

Yeah, I think test/Transforms/FunctionSpecialization/function-specialization.ll is an example of that: the specialised functions completely disappear because of further simplifications.

Addressed @fhahn 's comments: don't run the solver for specialised functions, removed the recursive specialization test for now.

wenlei added a subscriber: wenlei.Fri, Jun 4, 9:49 AM

We also saw ipa-cp-clone being a very noticeable difference between gcc and llvm when we tried to move workloads from gcc to llvm. Thanks for working on this for llvm optimization parity with gcc.

I think specialization can be beneficial when we can't do very selective inlining - it can realize part of the benefit of inlining with a more moderate size increase. I suspect the benefit of ICP from specialization will be much lower when PGO is used though, because PGO's speculative ICP is quite effective. For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants, on top of PGO.

When this is ready, I'd be happy to give it a try on our larger workload to see if there's any interesting hit and how it compares against gcc's ipa-cp-clone for the same workload, all with PGO.

Also with PGO, do we see similar improvement?

Good question, we need to experiment to answer it.

@ChuanqiXu FWIW, PGO always does speculative ICP from spec_qsort to cost_compare and arc_compare in spec2017/mcf.

I agree function specialization has its place when size is the concern (with -Os), or instruction working set is too large (to hurt performance). We don't have a mechanism to analyze the latter yet.

@davidxl you're right that we can't model the latter, but the same working-set concern is true for inlining too, and yet we still need to control size bloat from inlining (through inline cost etc.) even though we can't model it accurately. I think specialization can be another layer that allows the compiler to fine-tune that balance when needed. I'm curious: have you tried turning off ipa-cp-clone for gcc for your workloads, and does that lead to a noticeable perf difference?

In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner.

Agreed. Though from a compile-time perspective, I'm wondering: if we run specialization before inlining, would we be paying the cost of specializing functions that will eventually be inlined, in which case the extra cost may yield nothing in perf? If we do specialization after inlining, we'll only be targeting those deemed not worth inlining.

llvm/lib/Passes/PassBuilder.cpp
1137

With LTO or ThinLTO, would it make sense to schedule the pass in post-link only?

buildModuleSimplificationPipeline is used by both prelink and postlink for ThinLTO.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
292

We could replace this AvgLoopIterationCount tuning knob with real frequency when PGO/BFI is available.

In order for specialization to be beneficial on top of PGO, I think leveraging counts in cost/benefit analysis is important. Doing that in later patch is fine.

We also saw ipa-cp-clone being a very noticeable difference between gcc and llvm when we tried to move workloads from gcc to llvm. Thanks for working on this for llvm optimization parity with gcc.

I think specialization can be beneficial when we can't do very selective inlining - it can realize part of the benefit of inlining with a more moderate size increase. I suspect the benefit of ICP from specialization will be much lower when PGO is used though, because PGO's speculative ICP is quite effective. For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants, on top of PGO.

Agree.

When this is ready, I'd be happy to give it a try on our larger workload to see if there's any interesting hit and how it compares against gcc's ipa-cp-clone for the same workload, all with PGO.

Also with PGO, do we see similar improvement?

Good question, we need to experiment to answer it.

@ChuanqiXu FWIW, PGO always does speculative ICP from spec_qsort to cost_compare and arc_compare in spec2017/mcf.

I agree function specialization has its place when size is the concern (with -Os), or instruction working set is too large (to hurt performance). We don't have a mechanism to analyze the latter yet.

@davidxl you're right that we can't model the latter, but the same working set concern is true for inlining too, and yet we still need to control size bloat from inlining (through inline cost etc) even if we can't model it accurately. I think specialization can be another layer that allows compiler to fine tune that balance when needed. I'm curious have you tried to turn off ipa-cp-clone for gcc for your workloads and whether that leads to noticeable perf difference?

My recollection is that it did not result in much perf difference with FDO with our internal workload -- otherwise it would have been further tuned.

In theory, function specialization based on interprocedural const/range/assertion propagation does not depend on inlining, so it should/can be done before the inliner.

Agreed. Though from compile time perspective, I'm wondering if we have specialization before inlining, would we be paying the cost for specialization for functions that will eventually be inlined in which case the extra cost may yield nothing in perf. But if we do specialization after inlining, we'll be targeting those that deemed not worth inlining only.

Makes sense -- this is the reason for iterative optimizations. Partial inlining is in the same situation too.

For our use cases, it's more interesting to see how much benefit we could get from non-function-pointer constants, on top of PGO

Definitely. That's one of the first things we could investigate once we've got a first version in. I have documented this under "current limitations".

Sorry for the early ping @fhahn, but as you were the last one with comments: are you happy with the latest changes?

ChuanqiXu accepted this revision.Tue, Jun 8, 3:02 AM

Thanks for adding me as a reviewer.
I am new to the LLVM community and the SCCP module, so I am not sure it is appropriate for me to accept it; just say so if anyone is not comfortable with that.
The overall patch looks good to me as a first version. Although some problems surely remain, I think we can iterate on top of this patch after merging. For example, I am experimenting with function specialization for ThinLTO locally, and from the comments @wenlei gave, PGO with function specialization is also interesting. So after merging we could work on specialization together from different aspects: ThinLTO, PGO, bug fixes, and cost-model refactoring. So I choose to accept it.
Please wait 2~3 days before committing, in case there are more comments.

This revision is now accepted and ready to land.Tue, Jun 8, 3:02 AM
fhahn added a comment.Wed, Jun 9, 2:56 AM

Addressed @fhahn 's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
118

Are those updates to the solver still needed, after not running the solver after specializeFunctions?

Addressed @fhahn 's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

Thanks for the comments. Since they are minor, I will fix them before committing. I.e., I will remove that update to the solver and add the test back.

fhahn added a comment.Wed, Jun 9, 4:29 AM

Addressed @fhahn 's comments: don't run the solver for specialised functions; removed the recursive specialization test for now.

I'm not sure if removing the recursive specialization test is the best thing to do without knowing what it is supposed to test. If it is a legitimate test, I think it would be good to keep it.

Thanks for the comments. Since they are minor, I will fix them before committing. I.e., I will remove that update to the solver and add the test back.

There are at least a few more places that may modify the solver state after the solver is run. I think it would be good to audit them and update the patch. IIUC, in the current version there should be no need to update the solver state after RunSolver. It would also be good to check whether there are any places with updates I missed.

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
147

This should also not be needed?

155

also not needed?

234

Is this needed if the solver does not run again?

244

Is this needed if the solver does not run again?

Removed those updates too, and added the test case back. I'd appreciate it if you could have one more quick look.

This revision was landed with ongoing or failed builds.Fri, Jun 11, 1:22 AM
This revision was automatically updated to reflect the committed changes.

Many thanks for all your help and reviews. I have committed this because it was lgtm'ed (officially and also unofficially in comments), no major changes have been made/suggested for a while, and this is a first version that is off by default. And this is also just the beginning of function specialisation. I will continue working on this and will be happy to receive post-commit comments for this version, and of course comments for the work that I am planning and starting to do. I.e., I will now start working on some of the limitations before I start thinking how to eventually get this enabled by default.

fhahn added inline comments.Fri, Jun 11, 1:34 AM
llvm/include/llvm/Transforms/Utils/SCCPSolver.h
146

unused?

148

unused?

149

unused?

llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
216

Is this needed in the latest version? If it is not needed, please also remove it from the interface.

509

stray newline?

fhahn added a comment.Fri, Jun 11, 1:40 AM

I'm not quite sure why the implementation is in llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp rather than llvm/lib/Transforms/IPO/FunctionSpecialization.cpp? It's interprocedural only, right? The implementation in SCCP.cpp works on both a single function and a module, which is probably why it's in the Scalar sub-directory. But this is a separate pass from IPSCCP, so I am not sure the pass definition should be in llvm/lib/Transforms/IPO/SCCP.cpp.

fhahn added inline comments.Fri, Jun 11, 1:49 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
66

Looks like this was not addressed. If the function is kept, please at least update the comment, as it is misleading at the moment (it also returns false if LV is a constant range with more than a single element; see the comment for the same function in SCCP.cpp).

250

NumFuncSpecialized is defined as STATISTIC(NumFuncSpecialized, "Number of Functions Specialized"); how will this work when LLVM is built without statistics? (Also: redundant brackets.)

I'm not quite sure why the implementation is in llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp rather than llvm/lib/Transforms/IPO/FunctionSpecialization.cpp? It's interprocedural only, right? The implementation in SCCP.cpp works on both a single function and a module, which is probably why it's in the Scalar sub-directory. But this is a separate pass from IPSCCP, so I am not sure the pass definition should be in llvm/lib/Transforms/IPO/SCCP.cpp.

Thanks, I will follow up on this shortly (i.e., will prepare a diff). I can't remember if there was something with this pass definition, but will look.

fhahn added inline comments.Fri, Jun 11, 2:47 AM
llvm/lib/Transforms/Scalar/FunctionSpecialization.cpp
462

Looks like this is not covered by a test. Would be good to have one.