This is an archive of the discontinued LLVM Phabricator instance.

[IPSCCP] Enable specialization of functions.
ClosedPublic

Authored by labrinea on Dec 16 2022, 5:22 AM.

Details

Summary

This patch enables Function Specialization by default.

Compilation Time Overhead:
--------------------------
Measured the Instruction Count increase (Geomean) for CTMark from
the llvm-testsuite as in https://llvm-compile-time-tracker.com.
 * {-O3, Non-LTO}: +0.136% Instruction Count
 * {-O3, LTO}: +0.346% Instruction Count

Performance Uplift:
-------------------
Measured +9.121% score increase for 505.mcf_r from SPEC Int 2017
(Tested on Neoverse N1 with -O3 + LTO)

Correctness Testing:
--------------------
 * Passes bootstrap Clang with ASAN + LTO + FuncSpec aggressive options:
   { MaxClonesThreshold=10,
     SmallFunctionThreshold=10,
     AvgLoopIterationCount=30,
     SpecializeOnAddresses=true,
     EnableSpecializationForLiteralConstant=true,
     FuncSpecializationMaxIters=10 }
 * Builds Chromium and passes its unittests with the above options + ThinLTO.

For more info please refer to
https://discourse.llvm.org/t/rfc-should-we-enable-function-specialization/61518

Diff Detail

Event Timeline

labrinea created this revision.Dec 16 2022, 5:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2022, 5:22 AM
labrinea requested review of this revision.Dec 16 2022, 5:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 16 2022, 5:22 AM

Great stuff, and impressive numbers, I think they look really good.

CTMark compile time numbers look really good. I was thinking if we could at least add one more data point: timed compilation of LLVM. I would be interested in both the instruction count increase, like you reported for CTMark, but also in absolute timings just to get an idea. And you built Chromium for correctness testing (nice!), did you get any compile time numbers while doing that exercise?

arsenm added a subscriber: arsenm.Dec 16 2022, 5:57 AM
arsenm added inline comments.
llvm/lib/Transforms/IPO/SCCP.cpp
45–46

Should this move to be a pass parameter

nikic added a subscriber: nikic.Dec 16 2022, 7:31 AM

Some more detailed compile-time data: https://llvm-compile-time-tracker.com/compare.php?from=e6676a1382ff4c8f6c520486323430745948481d&to=8d528eb8ebe0261d187fef57829f1db20f42752c&stat=instructions%3Au

I notice that there is a 50% increase in code size on terminator.c from SPASS. Did this large change get analyzed at some point?

Some more detailed compile-time data: https://llvm-compile-time-tracker.com/compare.php?from=e6676a1382ff4c8f6c520486323430745948481d&to=8d528eb8ebe0261d187fef57829f1db20f42752c&stat=instructions%3Au

I notice that there is a 50% increase in code size on terminator.c from SPASS. Did this large change get analyzed at some point?

This particular regression has not been examined, but codesize has been discussed here https://reviews.llvm.org/D139346?vs=on&id=481234#toc
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp:300

TODO: Currently "budget" is derived from max number of specializations per function. A more reasonable metric would be acceptable increase in code size.

Right now we are keeping up to NumCandidates x (MaxClonesThreshold=3) which could be too large for small modules.

fhahn added a comment.Dec 16 2022, 9:44 AM

Some more detailed compile-time data: https://llvm-compile-time-tracker.com/compare.php?from=e6676a1382ff4c8f6c520486323430745948481d&to=8d528eb8ebe0261d187fef57829f1db20f42752c&stat=instructions%3Au

I notice that there is a 50% increase in code size on terminator.c from SPASS. Did this large change get analyzed at some point?

This particular regression has not been examined, but codesize has been discussed here https://reviews.llvm.org/D139346?vs=on&id=481234#toc
llvm/lib/Transforms/IPO/FunctionSpecialization.cpp:300

TODO: Currently "budget" is derived from max number of specializations per function. A more reasonable metric would be acceptable increase in code size.

Right now we are keeping up to NumCandidates x (MaxClonesThreshold=3) which could be too large for small modules.

Should we estimate the size of the module and limit the total size increase?

I don’t know where to look for the 50% code size increase. The increase in percentages sounds bad, but how does it look in absolute numbers? If this is a tiny module, with 2 functions where 1 gets cloned, then that’s okay?

labrinea added a comment.EditedDec 19 2022, 6:49 AM

Great stuff, and impressive numbers, I think they look really good.

CTMark compile time numbers look really good. I was thinking if we could at least add one more data point: timed compilation of LLVM. I would be interested in both the instruction count increase, like you reported for CTMark, but also in absolute timings just to get an idea. And you built Chromium for correctness testing (nice!), did you get any compile time numbers while doing that exercise?

I have measured the time spent on the IPSCCP pass itself (user+system time) during linking using the -time-passes option on lld. On my x86 machine with O3 and Full LTO the Geomean of CTMark goes from ~152ms to ~174ms resulting +14.47% increase (average of five best runs). Regarding the Chromium build, no I haven't measured compilation times.

@nikic can you share the configuration where you see the 50% increase of codesize in terminator.c (which optimization pipeline)? Since it is an intermediate object file I presume the specialization happens during the compilation phase, not when linking. When I build CTMark with Full LTO and debug output I am seeing:

During compilation
------------------
FnSpecialization: Specialized 1 functions in module test-suite/CTMark/SPASS/terminator.c of size 117
FnSpecialization: Specialized 5 functions in module test-suite/CTMark/ClamAV/libclamav_message.c of size 113
FnSpecialization: Specialized 1 functions in module test-suite/CTMark/SPASS/rules-inf.c of size 274
FnSpecialization: Specialized 1 functions in module test-suite/CTMark/sqlite3/sqlite3.c of size 1101

During linking
--------------
FnSpecialization: Specialized 1 functions in module ld-temp.o of size 675 (ClamAV)
FnSpecialization: Specialized 1 functions in module ld-temp.o of size 603 (sqlite3)
FnSpecialization: Specialized 3 functions in module ld-temp.o of size 605 (lencod)

Size here means number of functions in the module.

I looked at the source code of terminator.c. It contains five function definitions and includes a bunch of inlined functions from the header files, that's why llvm::Module::size() reports hundreds. From five to six functions is roughly a ~20ish % codesize increase for the intermediate object file. The final object shouldn't be as much affected, so it doesn't seem worrying to me.

Great stuff, and impressive numbers, I think they look really good.

CTMark compile time numbers look really good. I was thinking if we could at least add one more data point: timed compilation of LLVM. I would be interested in both the instruction count increase, like you reported for CTMark, but also in absolute timings just to get an idea. And you built Chromium for correctness testing (nice!), did you get any compile time numbers while doing that exercise?

I have measured the time spent on the IPSCCP pass itself (user+system time) during linking using the -time-passes option on lld. On my x86 machine with O3 and Full LTO the Geomean of CTMark goes from ~152ms to ~174ms resulting +14.47% increase (average of five best runs). Regarding the Chromium build, no I haven't measured compilation times.

When I looked into this, I found that the increase in compile-times wasn't caused by the function specialization pass itself, the time spent in there was negligible, but the extra went into the the subsequent passes having to work more and harder. So I don't think measuring the time spent in IPSCCP is the right metric we want to be using, although I agree of course it is interesting to see if there's not something very expensive going on. The instruction count that you used for CTMark is a much better metric. In my first comment I was curious how that would look like for one more data point, timed LLVM compilation, and also how that translate to time just out of curiousity.

labrinea added inline comments.Dec 20 2022, 7:23 AM
llvm/lib/Transforms/IPO/SCCP.cpp
45–46

I think it's preferable to control from the command line for unit testing.

So I had a closer look to the code size increase in terminator.c from SPASS. This is the breakdown of LOC for the x86 disassembly:

functionbaselinepatch
red_Terminator104513
red_SearchTerminator792415
red_CreateTerminatorEmptyClause0253
red_TerminatorLitIsBetter2526
red_GetTerminatorPartnerLits0169
total9211376

The code size increase is close to what @nikic had reported (+49,40 %). The patched compiler creates one specialization for red_SearchTerminator which gets inlined inside red_Terminator. The baseline compiler inlines red_CreateTerminatorEmptyClause and red_GetTerminatorPartnerLits inside red_SearchTerminator. I suspect that the patched compiler fails to do so as it reaches the inlining threshold, therefore these functions remain.

Quickly measuring the size of the final image (after LTO) is not worrying me: 608KB vs 616KB (+1.316%).

Great stuff, and impressive numbers, I think they look really good.

CTMark compile time numbers look really good. I was thinking if we could at least add one more data point: timed compilation of LLVM. I would be interested in both the instruction count increase, like you reported for CTMark, but also in absolute timings just to get an idea. And you built Chromium for correctness testing (nice!), did you get any compile time numbers while doing that exercise?

I have measured the time spent on the IPSCCP pass itself (user+system time) during linking using the -time-passes option on lld. On my x86 machine with O3 and Full LTO the Geomean of CTMark goes from ~152ms to ~174ms resulting +14.47% increase (average of five best runs). Regarding the Chromium build, no I haven't measured compilation times.

When I looked into this, I found that the increase in compile-times wasn't caused by the function specialization pass itself, the time spent in there was negligible, but the extra went into the the subsequent passes having to work more and harder. So I don't think measuring the time spent in IPSCCP is the right metric we want to be using, although I agree of course it is interesting to see if there's not something very expensive going on. The instruction count that you used for CTMark is a much better metric. In my first comment I was curious how that would look like for one more data point, timed LLVM compilation, and also how that translate to time just out of curiousity.

I've measured user + system time (average of five runs) and the results are inconclusive: baseline 42663.2s vs patch 42657.1s (delta -0.014%). The experiment was to build LLVM using the patched compiler/linker with LTO as requested. It doesn't seem like a useful metric to me, but now I've checked it off my list :)

I've measured user + system time (average of five runs) and the results are inconclusive: baseline 42663.2s vs patch 42657.1s (delta -0.014%). The experiment was to build LLVM using the patched compiler/linker with LTO as requested. It doesn't seem like a useful metric to me, but now I've checked it off my list :)

I am not sure I understand, how can building llvm (with your patched compiler) and measure compile times (in time and/or in instruction counts) not be a useful exercise in this context?

labrinea added a comment.EditedDec 21 2022, 7:55 AM

I've measured user + system time (average of five runs) and the results are inconclusive: baseline 42663.2s vs patch 42657.1s (delta -0.014%). The experiment was to build LLVM using the patched compiler/linker with LTO as requested. It doesn't seem like a useful metric to me, but now I've checked it off my list :)

I am not sure I understand, how can building llvm (with your patched compiler) and measure compile times (in time and/or in instruction counts) not be a useful exercise in this context?

I meant measuring time is too noisy to draw any conlcusions. On the contrary, instruction count measurements were more consistent. The compile time tracker specifically says:
"Warning: The wall-time metric is very noisy and not meaningful for comparisons between specific revisions."

I tested this against rust: https://perf.rust-lang.org/compare.html?start=413335368bf860e701f6365b0f2f0fa4d7f85b9a&end=70212127e0974013f7e031861d421356e85b0e27&stat=instructions%3Au The results look unconcerning to me, there are no major changes in compile-time, memory usage or binary size. I guess the heuristics are sufficiently conservative to not cause problems. And it didn't crash on any of the benchmarks, so that's reassuring...

So this is fine to go ahead from my side, but I did not review the implementation at all.

SjoerdMeijer accepted this revision.Dec 21 2022, 12:02 PM

Ok, then I will comment on the other side of things.

This transformations is addressing an area of optimisations where we were behind and often comes up as a difference. As mentioned in the description, this will benefit a well known benchmark but it is a very general transformation so has the potential to trigger on a lot of cases (shown when e.g. specialisation is forced). I am not aware of any issues with the implementation, and the testing done looks pretty extensive. So, worth giving this a try I think: LGTM.

Please wait a day in case there are more comments. And optional: enabling this looks worthy a mention in the release notes.

This revision is now accepted and ready to land.Dec 21 2022, 12:02 PM
ChuanqiXu accepted this revision.Dec 21 2022, 6:11 PM

enabling this looks worthy a mention in the release notes.

Yeah, I think we need to mention this.

Many thanks to everyone who contributed in making this happen! I am so stoked! Regarding the release notes, shall I add a section named "Changes to Interprocedural Optimizations" and say something along the lines of "Function Specialization has been integrated in IPSCCP and it is now enabled by default"?

labrinea added a comment.EditedDec 22 2022, 4:04 AM

So I had a last minute thought as I was updating the Release Notes. Currently IPSCCP runs on all optimization levels. Consequently FuncSpec does too. Should we disable the latter for Os, Oz and perhaps O1?

fhahn added a comment.Dec 22 2022, 4:29 AM

Consequently FuncSpec does too. Should we disable the latter for Os, Oz

For -Os and -Oz it would be good to directly disable it (making it a pass parameter will make this possible), as the specializer bails out on functions with optsize and minsize attributes anyways

llvm/lib/Transforms/IPO/SCCP.cpp
45–46

I think @arsenm was referring to pass parameters for NewPM passes, e.g. like https://github.com/llvm/llvm-project/blob/main/llvm/lib/Passes/PassRegistry.def#L423

Those can be supplied at the command line with the pass in -passes. IMO that would be more in-line with new-pm only passes.

labrinea updated this revision to Diff 484917.Dec 22 2022, 11:46 AM
  • Rebased on D140564.
  • Added release notes.
This revision was landed with ongoing or failed builds.Dec 25 2022, 4:30 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Dec 26 2022, 8:12 AM

It looks like CINT2017rate/502.gcc_r gets mis-compiled with LTO + PGO on
AArch64 with function specialization. I reverted the patch for now.

It looks like CINT2017rate/502.gcc_r gets mis-compiled with LTO + PGO on
AArch64 with function specialization. I reverted the patch for now.

Thanks for looking after this. Can you share more details about your environment, or perhaps a reproducer? I haven't been able to reproduce the mis-compilation of CINT2017rate/502.gcc_r myself on AArch64 with LTO + PGO.

It looks like CINT2017rate/502.gcc_r gets mis-compiled with LTO + PGO on
AArch64 with function specialization. I reverted the patch for now.

Thanks for looking after this. Can you share more details about your environment, or perhaps a reproducer? I haven't been able to reproduce the mis-compilation of CINT2017rate/502.gcc_r myself on AArch64 with LTO + PGO.

Ok I can reproduce the failure on x86 if I build spec as an external benchmark for the llvm test suite.

fhahn added a comment.Jan 5 2023, 5:10 AM

It looks like CINT2017rate/502.gcc_r gets mis-compiled with LTO + PGO on
AArch64 with function specialization. I reverted the patch for now.

Thanks for looking after this. Can you share more details about your environment, or perhaps a reproducer? I haven't been able to reproduce the mis-compilation of CINT2017rate/502.gcc_r myself on AArch64 with LTO + PGO.

Ok I can reproduce the failure on x86 if I build spec as an external benchmark for the llvm test suite.

Great! Here's what I have been using on ARM64 macOS, just in case it helps:

lnt runtest test-suite \
    -C Release --pgo \
    --test-externals /path/to/external/spec/sources/ \
    --cmake-define TEST_SUITE_RUN_TYPE=ref --cmake-define 'TEST_SUITE_SUBDIRS=External/SPEC/CINT2017rate/502.gcc_r' \
     --build-threads=10 --sandbox "$(pwd)/tmp/" --no-timestamp --use-lit=lit \
    --cc=/path/to/bin/clang --cxx=/path/to/bin/clang++ \
    --test-suite=/path/to/llvm-test-suite/ \
    --cmake-define TEST_SUITE_BENCHMARKING_ONLY=On
labrinea added a comment.EditedJan 10 2023, 7:13 AM

After some experimentation I found that the specialization of spec_qsort with case_bit_test_cmp as the compare function causes the problem. Digging a little more in the SPEC documentation I found these portability issues which seem to manifest when FuncSpec is enabled:

spec_qsort and ANSI aliasing: The spec_qsort.c routine does not strictly obey the ANSI aliasing rules. See the detailed discussion in the documentation for 505.mcf_r, which is where the problem was reported.

Some users of GCC 6 (and later) have reported that 505.mcf_r gets wrong answers when compiled with both link-time optimization (LTO) and feedback-directed optimization (FDO), for example at GCC bugzilla 83201.

After I applied the patch (https://gcc.gnu.org/bugzilla/attachment.cgi?id=42919&action=diff) attached on that ticket (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83201) to the SPEC sources the failure went away. I am puzzled because when cloning a function we do preserve the TBAA medatata if any.

Oh wait, -fno-strict-aliasing is not being used in the first place! That explains the failure:

Known Portability Issues

  1. spec_qsort and ANSI aliasing: The spec_qsort.c routine does not strictly obey the ANSI aliasing rules. See the detailed discussion in the documentation for 505.mcf_r, which is where the problem was reported. If you are compiling with the GCC compiler, it is theoretically possible that you might need to add -fno-strict-aliasing to 502.gcc_r as well. The Example GCC config files as updated for SPEC CPU 2017 v1.1.5 demonstrate how to do so.

Good find. I am happy with that explanation, and I think we could try again. Does it mean that this flags needs to be added to the llvm test-suite?
In terms of reenabling this, Dave pointed out to me that the next release is not that far away, so the question is if we want to recommit now or shortly after the release?

Good find. I am happy with that explanation, and I think we could try again. Does it mean that this flags needs to be added to the llvm test-suite?

Yes, I've created D141474

In terms of reenabling this, Dave pointed out to me that the next release is not that far away, so the question is if we want to recommit now or shortly after the release?

I would like to re-enable it before the release. There haven't been other regressions reported so far, plus we've built Clang and Chromium successfully, which is quite reassuring.

Please reland

For people who compile SPEC outside llvm test suite, there should be a note in release notes about this issue..

labrinea added a comment.EditedJan 11 2023, 6:23 AM

For people who compile SPEC outside llvm test suite, there should be a note in release notes about this issue..

Should there be? It's explicitly stated on the SPEC documentation that spec_qsort does not strictly obey the ANSI aliasing rules, and that it should be compiled with -fno-strict-aliasing.

For people who compile SPEC outside llvm test suite, there should be a note in release notes about this issue..

Should there be? It's explicitly stated on the SPEC documentation that spec_qsort does not strictly obey the ANSI aliasing rules, and that it should be compiled with -fno-strict-aliasing.

One may wonder why SPEC with newly released clang/llvm fails. Or maybe a note that this powerful optimization may also expose many existing bugs.

Btw, there is probably no tool to check for violations of ANSI aliasing rules, right? Maybe Type Sanitizer by @fhahn could be able to do that.

labrinea reopened this revision.Jan 12 2023, 2:32 AM
This revision is now accepted and ready to land.Jan 12 2023, 2:32 AM
labrinea updated this revision to Diff 488558.Jan 12 2023, 2:33 AM

I've added a release note about D141474.

xbolva00 accepted this revision.Jan 12 2023, 9:06 AM
This revision was landed with ongoing or failed builds.Jan 13 2023, 6:04 AM
This revision was automatically updated to reflect the committed changes.
jdoerfert reopened this revision.Jan 21 2023, 6:35 PM
jdoerfert added a subscriber: jdoerfert.

In case you missed it, this breaks code: https://github.com/llvm/llvm-project/issues/60191

This revision is now accepted and ready to land.Jan 21 2023, 6:35 PM
labrinea closed this revision.Jan 24 2023, 4:21 PM

The above issue has been fixed.