Following the discussion in BOF session of LLVM dev meeting 2014, I did some experiments to enhance LLVM inliner and want to share my result at the moment. My major goal is to improve -O3 performance without profiling support, which should be the simplest scenario of using compiler optimization.
Inlining more code usually could increase performance at the cost of code size bloat, but overly inlining code could increase register pressure and hurt performance, e.g. some more loop invariants can be detected and hoisted out of loop, and finally register pressure increases a lot. In the meantime, inline is expensive because we have to analyze every function in terms of every call site with different arguments to remove dead code as possible as we could. Therefore, the biggest challenge of Inlining problem is how we can make trade-off among performance improvement, code size bloat and compiler slowdown in a smart manner.
- Design
My design to address the issues described above is listed as below.
(1) For performance, the main idea is enlarging inlining threshold heuristically for *hot* spots detected at compile-time. The codes with the following properties are usually *hot*,
(1.a) callee Inside a loop. If callee can be inlined into a loop, we could probably expose more optimization opportunities. E.g. loop invariant hoist. And this solution is particularly useful to small loops, like having less than 2~3 BasicBlocks, because such a simple loop structure would be less possible to trigger register pressure issue.
(1.b) callee with constant argument. For example, if the constant argument is used as a loop boundary, it could trigger completely different loop unrolling behavior, like full unroll or partial unroll.
Solution (1.a) requires loop info. With current pass manager behavior, CallGraphSCCPass doesn’t allow to use getAnalysisUsage to obtain loop info, but we can define a lightweight LoopAnalyzer pass inside module SimpleInliner, and this pass can be implemented simply by calling LoopInfoBase and DominatorTree.
(2) For code size, we have two solutions,
(2.a) It doesn't make sense to inine a lot of *cold* code. Since non-hot code can be treated as cold code, we can reduce the normal threshold. In the patch the default threshold for -O3 is changed from 275 to 240. This way, we could save code size a lot. The performance reduction caused by reducing default threshold could be compensated by increasing threshold for *hot* code inside loops.
(2.b) It would be quite abnormal if a function call the same callee many times, even if they use different arguments, because this kind of code can easily refactored by loop. So we can avoid inlining the same callee many times if we find this case.
(3) For compile time, it’s a big challenge, because loop info calculation is really expensive.
(3.a) Don’t re-compute loop info every time callee is inlined, but only do it once we start to check the new callees introduced by inlining a callee. For example, A->B->C, and A->D->E. When analyzing caller A, if we decide to inline B into A, C will be exposed to A, and at this moment, we don’t re-compute loop info until checking A->D is completed, because the loop info about D won’t be affected after inlining B.
(3.b) Solve A->B->C dilemma differently using early exit. For example, for call graph A->B->C, and A->B->D. When analyzing caller B, if A->B->C pass the ABC checking, i.e. C can be inlined into B, and (B+C) can be inlined into A as well, current algorithm will defer it until analyzing caller A. But if we get D inlined into B before checking caller A, the code size of B could increase, and finally fails to be inlined into A. (Hal has explained this problem previously using vector push_back case). It means A->B->C will be kept as it is eventually, although D is inlined into B. This is *not* a problem, but a heuristic choice, I think. For a lot of cpp program, there are a lot of small functions could trigger this ABC issue. But choosing B->D rather than A->B->C would hurt compile time, because it will check all of callees inside B, although ABC case is already detected. So we can early exit as soon as positive ABC case is detected, and then the new algorithm will inline B into A first, at the moment of analyzing caller A. And then C and D could both be inlined into A eventually.
In order to apply methods (2.b) and (3.a), we have to solve an inline analysis ordering issue. Current inliner analyzes call sites in an unstable order. For example, A->B1->C1, A->B2->C2, and A->B3->C3. The call site analysis order of analyzing caller A was B1, C1, B3, C3, B2, C2. Now I change the order to be B1, B2, B3, C1, C2, C3.
- Benchmark
Chandler previously mentioned SPEC benchmark is not a good candidate for measuring code size impact, so I use llvm bootstrap and chromium as the benchmarks for compile time and code size.
On llvm revision r232011 (March 12), I got the following benchmark data,
- Performance:
SPEC 2000 geomean for AArch64: +1.24%
SPEC 2006 geomean for AArch64: +0.3%
- Code size:
SPEC 2000+2006: +2.68%
clang/llvm: +2.88%
Chromium: +2%~3%
- Compile-time:
llvm bootstrap on x86: +1.8%
SPEC2006 build on x86: +2.7%
Thanks,
-Jiangning