This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Enable the cost benefit analysis on FDO
ClosedPublic

Authored by kazu on Mar 8 2021, 12:16 PM.

Details

Summary

This patch enables the cost-benefit-analysis-based inliner by default
when we have an instrumentation profile.

  • SPEC CPU 2017 shows a 0.4% improvement.
  • An internal large benchmark shows a 0.9% reduction in the cycle count along with a 14.6% reduction in the number of call instructions executed.

Diff Detail

Event Timeline

kazu created this revision.Mar 8 2021, 12:16 PM
kazu requested review of this revision.Mar 8 2021, 12:17 PM
Herald added a project: Restricted Project.Mar 8 2021, 12:17 PM
davidxl added inline comments.Mar 8 2021, 3:12 PM
llvm/lib/Analysis/InlineCost.cpp
683

Why not change the default of this option to true? Then there is no need to check for explicit occurrences.

wenlei added a comment.Mar 8 2021, 3:21 PM

What's the .text size change for your internal benchmark when you turn it on? What is the impact of this switch with sample profile?

kazu added a comment.Mar 11 2021, 11:02 AM

What's the .text size change for your internal benchmark when you turn it on?

For FDO, the size of .text.* increases by 0.03%. .text.hot and .text shrink by 0.67% and 0.24%, respectively. The reduction is mostly offset by a 1.25% increase in .text.split.

For CSFDO, the size of .text.* increases by 0.07%. .text.hot, .text, and .text.split grow by 1.10%, 0.40%, and 2.05%, respectively.

Under the hood, we are inlining very hot functions that were previously too big to be inlined while rejecting marginally hot call sites that were within the size threshold. The former primarily contributes to the execution performance. The latter contributes to size reduction with little impact on the execution performance.

Note that you can set -inline-savings-multiplier to a positive integer smaller than 8, say 4, to limit inlining of hot call sites and thus reduce the executable size. My experiments with our large internal benchmark show that the performance doesn't change much when I tweak -inline-savings-multiplier. If we assume that the ratio of cycle savings to size cost is a reasonable indicator of inlining desirability, then applying a threshold to that ratio means that we always inline the most desirable call sites, with the threshold controlling how much of the long tail to reject. I think that is why the parameter doesn't affect the performance very much.

What is the impact of this switch with sample profile?

About neutral without ThinLTO and a 0.27% improvement on our large internal benchmark with ThinLTO. This is why I am turning on the switch by default on FDO profile for now but not on sample profile. The sample profile loader has its own inliner. If we want to benefit from the idea of looking at the ratio of cycle savings to size cost, then I think we really have to modify the inliner in the sample profile loader.

kazu added inline comments.Mar 11 2021, 11:15 AM
llvm/lib/Analysis/InlineCost.cpp
683

I'd like to do two things here:

  • Turn on the cost benefit analysis automatically on FDO.
  • Let the user turn on/off the cost benefit analysis.

The latter allows us to tell what happens to the performance if I disable the cost benefit analysis when we are using FDO. It also allows us to turn on the cost benefit analysis on sample profile.

If I change the default to true, I would have to add logic to turn it off on sample profile unless the user explicitly requests the cost benefit analysis. In other words, as long as I want to honor an explicit user request, I must check for the occurrence.

davidxl accepted this revision.Mar 11 2021, 2:24 PM

lgtm

This revision is now accepted and ready to land.Mar 11 2021, 2:24 PM

What's the .text size change for your internal benchmark when you turn it on?

For FDO, the size of .text.* increases by 0.03%. .text.hot and .text shrink by 0.67% and 0.24%, respectively. The reduction is mostly offset by a 1.25% increase in .text.split.

For CSFDO, the size of .text.* increases by 0.07%. .text.hot, .text, and .text.split grow by 1.10%, 0.40%, and 2.05%, respectively.

Under the hood, we are inlining very hot functions that were previously too big to be inlined while rejecting marginally hot call sites that were within the size threshold. The former primarily contributes to the execution performance. The latter contributes to size reduction with little impact on the execution performance.

Note that you can set -inline-savings-multiplier to a positive integer smaller than 8, say 4, to limit inlining of hot call sites and thus reduce the executable size. My experiments with our large internal benchmark show that the performance doesn't change much when I tweak -inline-savings-multiplier. If we assume that the ratio of cycle savings to size cost is a reasonable indicator of inlining desirability, then applying a threshold to that ratio means that we always inline the most desirable call sites, with the threshold controlling how much of the long tail to reject. I think that is why the parameter doesn't affect the performance very much.

Thanks for the size data, it seems well justified by the perf boost. I think the cycle saving estimation using profile makes a lot of sense and makes LLVM's PGO closer to other compilers.

What is the impact of this switch with sample profile?

About neutral without ThinLTO and a 0.27% improvement on our large internal benchmark with ThinLTO. This is why I am turning on the switch by default on FDO profile for now but not on sample profile. The sample profile loader has its own inliner. If we want to benefit from the idea of looking at the ratio of cycle savings to size cost, then I think we really have to modify the inliner in the sample profile loader.

The much smaller perf boost from sample PGO is interesting. I'm curious how much of that is due to sample loader inliner vs just inferior profile quality. I changed sample loader's inliner to also look at the same InlineCost a while ago, except the threshold part is determined locally inside sample loader. I think we just need to honor the decision when it's from the benefit analysis there. I may give it a try and see what we get on our workloads.

It is probably related to MFS -- machine function splitting is not yet well tuned for AutoFDO, and the cost/benefit analysis needs to sync up with that for estimating cold size properly.

It is probably related to MFS -- machine function splitting is not yet well tuned for AutoFDO, and the cost/benefit analysis needs to sync up with that for estimating cold size properly.

Good point. Now I see that costBenefitAnalysis subtracts the cold size when estimating, and the cold size is simply the dead block sizes added together based on BFI, assuming they will all be moved out. Without turning on MFS for sample PGO, we would need to skip the cold-size subtraction too so they're coordinated. IIRC, MFS was either a small regression or neutral for sample PGO?

Right. We (snehasishk) plan to tune MFS for AutoFDO. Once that is done, the inline cost analysis for AFDO can be revisited.

Right. We (snehasishk) plan to tune MFS for AutoFDO. Once that is done, the inline cost analysis for AFDO can be revisited.

Good to know. I'm interested in learning what you find out from the tuning. Btw, the change as is lgtm.

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Mar 25 2021, 1:43 PM

Hi, this made clang crash for us in PGO builds. I reverted it in a60ffee3f4ef36f2211a149475cc2cb60164d4a8 for now. A standalone repro is at https://bugs.chromium.org/p/chromium/issues/detail?id=1192783#c6

Looks like a divide-by-zero error.

This patch, as committed in 9d375a40c3df90dd48edc0e1b1115c702c55d716, causes a segmentation fault with the following commands and this profdata.prof file (zstd compressed).

$ cmake \
-B build/stage1 \
-G Ninja \
-Wno-dev \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" \
-DLLVM_TARGETS_TO_BUILD=host \
-DLLVM_USE_LINKER=lld \
llvm &&
ninja -C build/stage1

$ cmake \
-B build/stage2 \
-G Ninja \
-Wno-dev \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
-DCMAKE_C_COMPILER=$PWD/build/stage1/bin/clang \
-DCMAKE_CXX_COMPILER=$PWD/build/stage1/bin/clang++ \
-DLLVM_ENABLE_LTO=Thin \
-DLLVM_ENABLE_PROJECTS="clang;lld;polly" \
-DLLVM_PROFDATA_FILE=$PWD/build/profdata.prof \
-DLLVM_TARGETS_TO_BUILD="ARM;AArch64" \
-DLLVM_USE_LINKER=lld \
llvm &&
ninja -C build/stage2
...
[3416/3562] Linking CXX executable bin/clang-ast-dump
FAILED: bin/clang-ast-dump
: && /home/nathan/cbl/github/tc-build/llvm-project/build/stage1/bin/clang++ -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -fdiagnostics-color -ffunction-sections -fdata-sections -fprofile-instr-use="/home/nathan/cbl/github/tc-build/llvm-project/build/profdata.prof" -flto=thin -fno-common -Woverloaded-virtual -Wno-nested-anon-types -O3 -DNDEBUG -fuse-ld=lld -Wl,--color-diagnostics -fprofile-instr-use="/home/nathan/cbl/github/tc-build/llvm-project/build/profdata.prof" -flto=thin -Wl,--thinlto-cache-dir=/home/nathan/cbl/github/tc-build/llvm-project/build/stage2/lto.cache    -Wl,-O3 -Wl,--gc-sections tools/clang/lib/Tooling/DumpTool/CMakeFiles/clang-ast-dump.dir/ASTSrcLocProcessor.cpp.o tools/clang/lib/Tooling/DumpTool/CMakeFiles/clang-ast-dump.dir/ClangSrcLocDump.cpp.o -o bin/clang-ast-dump  -Wl,-rpath,"\$ORIGIN/../lib"  lib/libLLVMOption.a  lib/libLLVMFrontendOpenMP.a  lib/libLLVMSupport.a  -lpthread  lib/libclangAST.a  lib/libclangASTMatchers.a  lib/libclangBasic.a  lib/libclangDriver.a  lib/libclangFrontend.a  lib/libclangSerialization.a  lib/libclangToolingCore.a  lib/libclangDriver.a  lib/libLLVMOption.a  lib/libclangParse.a  lib/libclangSema.a  lib/libclangEdit.a  lib/libclangAnalysis.a  lib/libclangASTMatchers.a  lib/libclangAST.a  lib/libLLVMFrontendOpenMP.a  lib/libLLVMTransformUtils.a  lib/libLLVMAnalysis.a  lib/libLLVMProfileData.a  lib/libLLVMObject.a  lib/libLLVMMCParser.a  lib/libLLVMTextAPI.a  lib/libLLVMBitReader.a  lib/libclangRewrite.a  lib/libclangLex.a  lib/libclangBasic.a  lib/libLLVMCore.a  lib/libLLVMRemarks.a  lib/libLLVMBitstreamReader.a  lib/libLLVMMC.a  lib/libLLVMBinaryFormat.a  lib/libLLVMDebugInfoCodeView.a  
lib/libLLVMSupport.a  -lrt  -ldl  -lpthread  -lm  /usr/lib/x86_64-linux-gnu/libz.so  /usr/lib/x86_64-linux-gnu/libtinfo.so  lib/libLLVMDemangle.a && :
clang-13: error: unable to execute command: Segmentation fault
clang-13: error: linker command failed due to signal (use -v to see invocation)
$ git bisect log
# bad: [ee6a17eb9fc46c417d3d9ec11a218681e03bf18e] [NFC][LoopIdiom] Regenerate left-shift-until-bittest.ll
# good: [8638c897f469dbd1d95b2e46b39ab72fb7b9d336] [WebAssembly] Remove unimplemented-simd target feature
git bisect start 'ee6a17eb9fc46c417d3d9ec11a218681e03bf18e' '8638c897f469dbd1d95b2e46b39ab72fb7b9d336'
# bad: [fd94cfeeb5d29340f46a46862720e53bd9106c1f] [RISCV] Move scheduling resources for B into a separate file (NFC)
git bisect bad fd94cfeeb5d29340f46a46862720e53bd9106c1f
# good: [1bc33eb6a32bdb193a8b838df823b4563450f6b3] [lld-macho][nfc] minor clean up, follow up to D98559
git bisect good 1bc33eb6a32bdb193a8b838df823b4563450f6b3
# good: [c7a39c833af173a7797692757b13b4b8eb801acd] [Coroutine][Clang] Force emit lifetime intrinsics for Coroutines
git bisect good c7a39c833af173a7797692757b13b4b8eb801acd
# bad: [9075864b7375b4eb8f2c0663caa575c0c667de7c] [BasicAA] Refactor linear expression decomposition
git bisect bad 9075864b7375b4eb8f2c0663caa575c0c667de7c
# bad: [203b072dd23b63174c9aa428b6f1c84859f43670] [CMake][gRPC] Fix a typo in protobuf version variable name
git bisect bad 203b072dd23b63174c9aa428b6f1c84859f43670
# good: [5f59f407f59f69c248be2452e5923e6735e7019a] [CSSPGO] Minor tweak for inline candidate priority tie breaker
git bisect good 5f59f407f59f69c248be2452e5923e6735e7019a
# bad: [0b1dc49ca38a8569b0aee40ea8d3054bc960e2ed] [NFC][OCaml] Resolve const and unsigned compilation warnings
git bisect bad 0b1dc49ca38a8569b0aee40ea8d3054bc960e2ed
# bad: [d92b4956d6db8904f6a21c0e037dd5161b843b87] [AMDGPU] Inline FSHRPattern into its only use. NFC.
git bisect bad d92b4956d6db8904f6a21c0e037dd5161b843b87
# bad: [9be8f8b34d9b150cd1811e3556fe9d0cd735ae29] [sanitizer] Simplify GetTls with dl_iterate_phdr
git bisect bad 9be8f8b34d9b150cd1811e3556fe9d0cd735ae29
# good: [3c775d93a1dda3cec3bb6c7617c4b4ab770f4af0] [InlineCost] Reject a zero entry count
git bisect good 3c775d93a1dda3cec3bb6c7617c4b4ab770f4af0
# bad: [9d375a40c3df90dd48edc0e1b1115c702c55d716] Reapply [InlineCost] Enable the cost benefit analysis on FDO
git bisect bad 9d375a40c3df90dd48edc0e1b1115c702c55d716
# first bad commit: [9d375a40c3df90dd48edc0e1b1115c702c55d716] Reapply [InlineCost] Enable the cost benefit analysis on FDO

This patch is unlikely to be the root cause of the failure (it does not do any
transformation). It just exposes some existing bugs in the compiler.

David

This patch is unlikely to be the root cause of the failure (it does not do any
transformation). It just exposes some existing bugs in the compiler.

David

Should I file a bug then?

I want to revisit this change that enabled these cost-benefit heuristics originally developed in https://reviews.llvm.org/D92780

There are a number of issues with the original change that I can see, and this has caused all manner of problems for us internally as we moved to the new pass manager (and therefore started using these heuristics).

The main one being that the heuristics seem to completely override all sensible thresholds for code size. The feature claims to compare the performance gains by measuring the cycles saved from a PGO run, and comparing against some threshold. This doesn't make sense since the length of the runs for PGO are arbitrary, and you shouldn't be using them in absolute terms. The only way it makes sense is if you have a very specific workload and the PGO profile run is exactly the same as the real world runs. The original change gave absolutely no justification, formal or intuitive, for the use of the equation to determine profitability. Giving positive runtime results is no justification for completely overriding any reasonable thresholds and potentially impacting code size & compile time by several orders of magnitude as we've seen internally at Apple.

Even worse: D92780 was accepted into mainline LLVM with absolutely no tests, and then enabled for all users in this change. I didn't even see any discussion about tests, or explanation for why they were impractical. We've disabled these heuristics downstream but I absolutely believe that this must be reverted upstream too until it can be re-worked and re-reviewed in detail.

Herald added a project: Restricted Project.Jan 20 2023, 2:21 PM
Herald added a subscriber: ChuanqiXu.

The feature claims to compare the performance gains by measuring the cycles saved from a PGO run, and comparing against some threshold. This doesn't make sense since the length of the runs for PGO are arbitrary, and you shouldn't be using them in absolute terms.

IIUC, hot threshold is being taken into account in the formula (at the very end of costBenefitAnalysis), so this is affected by relative hotness only. It's just like how pgo counts are used and supposed to be used anywhere else in the pipeline.

I want to revisit this change that enabled these cost-benefit heuristics originally developed in https://reviews.llvm.org/D92780

There are a number of issues with the original change that I can see, and this has caused all manner of problems for us internally as we moved to the new pass manager (and therefore started using these heuristics).

The main one being that the heuristics seem to completely override all sensible thresholds for code size. The feature claims to compare the performance gains by measuring the cycles saved from a PGO run, and comparing against some threshold. This doesn't make sense since the length of the runs for PGO are arbitrary, and you shouldn't be using them in absolute terms. The only way it makes sense is if you have a very specific workload and the PGO profile run is exactly the same as the real world runs. The original change gave absolutely no justification, formal or intuitive, for the use of the equation to determine profitability. Giving positive runtime results is no justification for completely overriding any reasonable thresholds and potentially impacting code size & compile time by several orders of magnitude as we've seen internally at Apple.

Even worse: D92780 was accepted into mainline LLVM with absolutely no tests, and then enabled for all users in this change. I didn't even see any discussion about tests, or explanation for why they were impractical. We've disabled these heuristics downstream but I absolutely believe that this must be reverted upstream too until it can be re-worked and re-reviewed in detail.

FYI: the cycle savings are normalized by dividing by the function entry count, so they are not used in an absolute sense.

I think it is worth adding some tests.

For the problems you have (I suppose they are performance related?), more details would be helpful (perhaps filing a bug to document them).

kazu added a comment.Jan 20 2023, 2:55 PM

The main one being that the heuristics seem to completely override all sensible thresholds for code size. The feature claims to compare the performance gains by measuring the cycles saved from a PGO run, and comparing against some threshold. This doesn't make sense since the length of the runs for PGO are arbitrary, and you shouldn't be using them in absolute terms. The only way it makes sense is if you have a very specific workload and the PGO profile run is exactly the same as the real world runs.

If you look at llvm/lib/Analysis/InlineCost.cpp:892, you'll see:

//  CycleSavings      PSI->getOrCompHotCountThreshold()                                                                                       
// -------------- >= -----------------------------------                                                                                      
//       Size              InlineSavingsMultiplier

If you rearrange this inequality, we have:

//            CycleSavings                          Size                                                                                      
// ----------------------------------- >= -------------------------                                                                           
//  PSI->getOrCompHotCountThreshold()      InlineSavingsMultiplier

Notice that the LHS divides CycleSavings by PSI->getOrCompHotCountThreshold(), so I am not using cycle savings in absolute terms. If I do a PGO run twice as long, both the numerator and denominator would double, yielding the same ratio.

The original change gave absolutely no justification, formal or intuitive, for the use of the equation to determine profitability. Giving positive runtime results is no justification for completely overriding any reasonable thresholds and potentially impacting code size & compile time by several orders of magnitude as we've seen internally at Apple.

Do you have any test case that you can share, possibly reduced and/or with sensitive code removed? I realize it may be difficult to come up with one from a large piece of software, but I am happy to take a look at a test case that demonstrates at least one problem -- code size or compile time.

I want to revisit this change that enabled these cost-benefit heuristics originally developed in https://reviews.llvm.org/D92780

There are a number of issues with the original change that I can see, and this has caused all manner of problems for us internally as we moved to the new pass manager (and therefore started using these heuristics).

The main one being that the heuristics seem to completely override all sensible thresholds for code size. The feature claims to compare the performance gains by measuring the cycles saved from a PGO run, and comparing against some threshold. This doesn't make sense since the length of the runs for PGO are arbitrary, and you shouldn't be using them in absolute terms. The only way it makes sense is if you have a very specific workload and the PGO profile run is exactly the same as the real world runs. The original change gave absolutely no justification, formal or intuitive, for the use of the equation to determine profitability. Giving positive runtime results is no justification for completely overriding any reasonable thresholds and potentially impacting code size & compile time by several orders of magnitude as we've seen internally at Apple.

Even worse: D92780 was accepted into mainline LLVM with absolutely no tests, and then enabled for all users in this change. I didn't even see any discussion about tests, or explanation for why they were impractical. We've disabled these heuristics downstream but I absolutely believe that this must be reverted upstream too until it can be re-worked and re-reviewed in detail.

FYI: the cycle savings are normalized by dividing by the function entry count, so they are not used in an absolute sense.

Right, that's fair enough. See the following comment from the code:

//  CycleSavings      PSI->getOrCompHotCountThreshold()
// -------------- >= -----------------------------------
//       Size              InlineSavingsMultiplier

Cycles are being compared against code size here. How is that metric supposed to be interpreted? What does (cycles/entry)/(bytes) even mean as a unit of measurement? The feature itself seems to be trying to optimize overly fine details of machine cycles in a part of the compiler that shouldn't be dealing with such low-level specifics.

I think it is worth adding some tests.

For the problems you have (I suppose they are performance related?), more details would be helpful (perhaps filing a bug to document them).

It's code size and compile time. These heuristics seem to be returning effectively an always-inline cost, which forces the inliner to inline huge functions. Clearly the heuristics should be much more conservative. Unfortunately I can't share the test case since it's internal code and massive.

The main one being that the heuristics seem to completely override all sensible thresholds for code size. The feature claims to compare the performance gains by measuring the cycles saved from a PGO run, and comparing against some threshold. This doesn't make sense since the length of the runs for PGO are arbitrary, and you shouldn't be using them in absolute terms. The only way it makes sense is if you have a very specific workload and the PGO profile run is exactly the same as the real world runs.

If you look at llvm/lib/Analysis/InlineCost.cpp:892, you'll see:

//  CycleSavings      PSI->getOrCompHotCountThreshold()                                                                                       
// -------------- >= -----------------------------------                                                                                      
//       Size              InlineSavingsMultiplier

If you rearrange this inequality, we have:

//            CycleSavings                          Size                                                                                      
// ----------------------------------- >= -------------------------                                                                           
//  PSI->getOrCompHotCountThreshold()      InlineSavingsMultiplier

Notice that the LHS divides CycleSavings by PSI->getOrCompHotCountThreshold(), so I am not using cycle savings in absolute terms. If I do a PGO run twice as long, both the numerator and denominator would double, yielding the same ratio.

The original change gave absolutely no justification, formal or intuitive, for the use of the equation to determine profitability. Giving positive runtime results is no justification for completely overriding any reasonable thresholds and potentially impacting code size & compile time by several orders of magnitude as we've seen internally at Apple.

Do you have any test case that you can share, possibly reduced and/or with sensitive code removed? I realize it may be difficult to come up with one from a large piece of software, but I am happy to take a look at a test case that demonstrates at least one problem -- code size or compile time.

Sorry I didn't see your comment at the time I was writing. Yes I see that now, it doesn't use absolute cycles. Ultimately the problem is that this heuristic basically overrides too much.

if (auto Result = costBenefitAnalysis()) {
  DecidedByCostBenefit = true;
  if (*Result)
    return InlineResult::success();
  else
    return InlineResult::failure("Cost over threshold.");
}

The InlineResult::success() being returned here is effectively alwaysinline. Even if we were to continue with this notion of cycle savings (which I don't think belongs in the inliner), a particularly bad inlining cost from the inliner can be completely overruled by this heuristic. Maybe we should only use this to inline the marginally bad cases (some max % of cost > threshold).

I want to revisit this change that enabled these cost-benefit heuristics originally developed in https://reviews.llvm.org/D92780

There are a number of issues with the original change that I can see, and this has caused all manner of problems for us internally as we moved to the new pass manager (and therefore started using these heuristics).

The main one being that the heuristics seem to completely override all sensible thresholds for code size. The feature claims to compare the performance gains by measuring the cycles saved from a PGO run, and comparing against some threshold. This doesn't make sense since the length of the runs for PGO are arbitrary, and you shouldn't be using them in absolute terms. The only way it makes sense is if you have a very specific workload and the PGO profile run is exactly the same as the real world runs. The original change gave absolutely no justification, formal or intuitive, for the use of the equation to determine profitability. Giving positive runtime results is no justification for completely overriding any reasonable thresholds and potentially impacting code size & compile time by several orders of magnitude as we've seen internally at Apple.

Even worse: D92780 was accepted into mainline LLVM with absolutely no tests, and then enabled for all users in this change. I didn't even see any discussion about tests, or explanation for why they were impractical. We've disabled these heuristics downstream but I absolutely believe that this must be reverted upstream too until it can be re-worked and re-reviewed in detail.

FYI: the cycle savings are normalized by dividing by the function entry count, so they are not used in an absolute sense.

Right, that's fair enough. See the following comment from the code:

//  CycleSavings      PSI->getOrCompHotCountThreshold()
// -------------- >= -----------------------------------
//       Size              InlineSavingsMultiplier

Cycles are being compared against code size here. How is that metric supposed to be interpreted? What does (cycles/entry)/(bytes) even mean as a unit of measurement? The feature itself seems to be trying to optimize overly fine details of machine cycles in a part of the compiler that shouldn't be dealing with such low-level specifics.

Note the size here is not static code size (it excludes cold code). It is actually a proxy to model the runtime cost due to increased instruction footprint (icache pressure). The multiplier's role is to make the savings and the 'size' cost comparable in terms of units. The cycles here are also counted at the IR level, so not at a low level.

Note that the multiplier should indeed be target-platform dependent -- and will be tuned in the future.

I think it is worth adding some tests.

For the problems you have (I suppose they are performance related?), more details would be helpful (perhaps filing a bug to document them).

It's code size and compile time. These heuristics seem to be returning effectively an always-inline cost, which forces the inliner to inline huge functions. Clearly the heuristics should be much more conservative. Unfortunately I can't share the test case since it's internal code and massive.

Note that FDO is intended to improve performance, and the compile-time budget is larger. This all seems WAI to me. If size is important, use -Os or -Oz, or even better, use MLGO.

kazu added a comment.Jan 20 2023, 3:42 PM

Cycles are being compared against code size here. How is that metric supposed to be interpreted? What does (cycles/entry)/(bytes) even mean as a unit of measurement? The feature itself seems to be trying to optimize too fine details in machine cycles in a part of the compiler that shouldn't be dealing with such low level specifics.

If you would like to learn more about the cost-benefit analysis, there are a couple of papers:

The basic idea here is that if we are increasing the executable size by inlining a call site, that has to come with a performance justification. The quantity "cycle savings / size" allows us to rank call sites in the order of profitability. We then inline those call sites exceeding a profitability threshold, namely PSI->getOrCompHotCountThreshold() / InlineSavingsMultiplier. The papers above go one step further in that they inline the most profitable call site first by putting the ratio into a priority queue instead of traversing the call graph bottom up (as in LLVM). I believe GCC inlines functions in a similar manner.

Sorry I didn't see your comment at the time I was writing. Yes I see that now, it doesn't use absolute cycles. Ultimately the problem is that this heuristic basically overrides too much.

if (auto Result = costBenefitAnalysis()) {
  DecidedByCostBenefit = true;
  if (*Result)
    return InlineResult::success();
  else
    return InlineResult::failure("Cost over threshold.");
}

The InlineResult::success() being returned here is effectively alwaysinline. Even if we were to continue with this notion of cycle savings (which I don't think belongs in the inliner), a particularly bad inlining cost from the inliner can be completely overruled by this heuristic. Maybe we should only use this to inline the marginally bad cases (some max % of cost > threshold).

The cost-benefit analysis is intended to override the threshold-only heuristic. That said, the traditional threshold in LLVM is compared against the cost, which is basically the callee size, and the cost-benefit analysis still takes size into account.

Note the size here is not static code size (it excludes cold code). It is actually a proxy to model the runtime cost due to increased instruction footprint (icache pressure). The multiplier's role is to make the savings and the 'size' cost comparable in terms of units. The cycles here are also counted at the IR level, so not at a low level.

I understand that, but it still doesn't make the comparison of different units any better. Introducing a scaling factor is irrelevant; it's just an arbitrary scalar.

Note that FDO is intended to improve performance, and the compile-time budget is larger. This all seems WAI to me. If size is important, use -Os or -Oz, or even better, use MLGO.

The cases we've seen involve orders of magnitude increase, e.g. 100x bigger size and compile time. Not working as intended.

// We use 128-bit APInt here to avoid potential overflow.  This variable
// should stay well below 10^^24 (or 2^^80) in practice.  This "worst" case
// assumes that we can avoid or fold a billion instructions, each with a
// profile count of 10^^15 -- roughly the number of cycles for a 24-hour
// period on a 4GHz machine.

If you potentially need a 128-bit integer to store your "cycle savings", you should not be using that value to compare against a size cost. A sufficiently good "saving" will absolutely override a huge size cost.

kazu added a comment.Jan 20 2023, 4:00 PM
// We use 128-bit APInt here to avoid potential overflow.  This variable
// should stay well below 10^^24 (or 2^^80) in practice.  This "worst" case
// assumes that we can avoid or fold a billion instructions, each with a
// profile count of 10^^15 -- roughly the number of cycles for a 24-hour
// period on a 4GHz machine.

If you potentially need a 128-bit integer to store your "cycle savings", you should not be using that value to compare against a size cost. A sufficiently good "saving" will absolutely override a huge size cost.

Keep in mind that the cycle savings are divided by PSI->getOrCompHotCountThreshold() (although the actual implementation avoids divisions by doing "A*D >= B*C" instead of "A/B >= C/D"). We are not directly comparing the cycle savings to the size cost, which would not make sense, as you have pointed out.

Note the size here is not static code size (it excludes cold code). It is actually a proxy to model the runtime cost due to increased instruction footprint (icache pressure). The multiplier's role is to make the savings and the 'size' cost comparable in terms of units. The cycles here are also counted at the IR level, so not at a low level.

I understand that, but it still doesn't make the comparison of different units any better. Introducing a scaling factor is irrelevant, it's just an arbitrary scalar.

Consider the multiplier to be "the average cycle cost due to frontend stalls per byte of instructions"; then you will see its meaning and why it is appropriate here.

Note that FDO is intended to be improve performance and the compile time budget is larger. This all seems WAI to me. If size is important, use -Os or -Oz, or even better use MLGO.

The cases we've seen involve orders of magnitude increase, e.g. 100x bigger size and compile time. Not working as intended.

Which section is affected the most, .text or .text.hot, in your case? FDO may increase .text, but the .text.unlikely section will be optimized for size, and it is not uncommon to see a reduced total size with FDO (cost-benefit analysis on). Spending some time to reduce the test case would make the discussion more meaningful.

// We use 128-bit APInt here to avoid potential overflow.  This variable
// should stay well below 10^^24 (or 2^^80) in practice.  This "worst" case
// assumes that we can avoid or fold a billion instructions, each with a
// profile count of 10^^15 -- roughly the number of cycles for a 24-hour
// period on a 4GHz machine.

If you potentially need a 128-bit integer to store your "cycle savings", you should not be using that value to compare against a size cost. A sufficiently good "saving" will absolutely override a huge size cost.

I guess in your case, there are functions with huge cold blocks that got inlined into many different call sites, resulting in bloat. You may want to try partial inlining to see if it helps.