This is an archive of the discontinued LLVM Phabricator instance.

Inliner Enhancement
Needs Revision · Public

Authored by Jiangning on Mar 17 2015, 10:46 PM.

Details

Summary

Following the discussion in the BOF session at the LLVM dev meeting 2014, I did some experiments to enhance the LLVM inliner and want to share my results so far. My major goal is to improve -O3 performance without profiling support, which is the simplest scenario for using compiler optimizations.

Inlining more code can usually increase performance at the cost of code-size bloat, but over-inlining can increase register pressure and hurt performance; for example, more loop invariants may be detected and hoisted out of a loop, and register pressure then increases a lot. Meanwhile, inlining is expensive because we have to analyze every function at every call site, with different arguments, to remove as much dead code as we can. Therefore, the biggest challenge of the inlining problem is how to trade off performance improvement, code-size bloat, and compiler slowdown in a smart manner.

  1. Design

My design to address the issues described above is as follows.

(1) For performance, the main idea is to heuristically raise the inlining threshold for *hot* spots detected at compile time. Code with the following properties is usually *hot*:
(1.a) Callee inside a loop. If a callee can be inlined into a loop, we can probably expose more optimization opportunities, e.g. loop-invariant hoisting. This solution is particularly useful for small loops, say with fewer than 2~3 BasicBlocks, because such a simple loop structure is less likely to trigger register-pressure issues.
(1.b) Callee with a constant argument. For example, if the constant argument is used as a loop boundary, it can trigger completely different loop-unrolling behavior, like full or partial unrolling (see the example below).
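To illustrate (1.b), here is a small hypothetical C++ example, not from the patch: once the callee is inlined at a call site that passes a constant bound, the trip count becomes known and the loop unroller can fully unroll it.

    static int sum(const int *a, int n) {
      int s = 0;
      for (int i = 0; i < n; ++i) // trip count unknown in isolation
        s += a[i];
      return s;
    }

    int caller(const int *a) {
      // After inlining, the bound is the constant 4, so the loop can be
      // fully unrolled into four loads and adds.
      return sum(a, 4);
    }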

Solution (1.a) requires loop info. With the current pass manager behavior, a CallGraphSCCPass cannot use getAnalysisUsage to obtain loop info, but we can define a lightweight LoopAnalyzer inside the SimpleInliner module, implemented simply by calling LoopInfoBase and DominatorTree.
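As a rough sketch of what such a lightweight analysis could look like (assuming a recent LLVM C++ API; the function name and the block-count cap are illustrative, not the actual patch):

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/IR/Dominators.h"
    #include "llvm/IR/Function.h"

    using namespace llvm;

    // Return true if CallBB sits inside a small loop of caller F, the case
    // where raising the inline threshold is assumed to be profitable.
    static bool isCallSiteInSmallLoop(Function &F, BasicBlock *CallBB) {
      DominatorTree DT;
      DT.recalculate(F);                  // per-function dominator tree
      LoopInfoBase<BasicBlock, Loop> LI;
      LI.analyze(DT);                     // no pass-manager dependency
      if (Loop *L = LI.getLoopFor(CallBB))
        return L->getNumBlocks() <= 3;    // the "2~3 BasicBlocks" heuristic
      return false;
    }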

(2) For code size, we have two solutions:
(2.a) It doesn't make sense to inline a lot of *cold* code. Since non-hot code can be treated as cold, we can reduce the normal threshold; in the patch the default threshold for -O3 is changed from 275 to 240. This saves a lot of code size, and the performance lost by reducing the default threshold can be compensated for by the increased threshold for *hot* code inside loops.
(2.b) It would be quite abnormal for a function to call the same callee many times, even with different arguments, because this kind of code can easily be refactored into a loop. So we can avoid inlining the same callee many times when we detect this case, as sketched below.
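A minimal sketch of how (2.b) could be checked, with a hypothetical cap value and helper name (modern LLVM API, for illustration only):

    #include "llvm/IR/Function.h"
    #include "llvm/IR/InstrTypes.h"

    using namespace llvm;

    static const unsigned MaxSameCalleeInlines = 3; // hypothetical cap

    // Count the call sites in Caller that target Callee; past the cap we
    // assume the code is better refactored into a loop than inlined.
    static bool tooManyCallsToSameCallee(Function &Caller, Function *Callee) {
      unsigned Count = 0;
      for (BasicBlock &BB : Caller)
        for (Instruction &I : BB)
          if (auto *CB = dyn_cast<CallBase>(&I))
            if (CB->getCalledFunction() == Callee &&
                ++Count > MaxSameCalleeInlines)
              return true;
      return false;
    }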

(3) For compile time, this is a big challenge, because loop-info calculation is really expensive.
(3.a) Don't re-compute loop info every time a callee is inlined; only do it once we start to check the new callees introduced by inlining a callee. For example, take A->B->C and A->D->E. When analyzing caller A, if we decide to inline B into A, C is exposed to A; at this moment, we don't re-compute loop info until checking A->D is complete, because the loop info relevant to D is not affected by inlining B.
(3.b) Solve the A->B->C dilemma differently, using an early exit. For example, take the call graph A->B->C and A->B->D. When analyzing caller B, if A->B->C passes the ABC check, i.e. C can be inlined into B and (B+C) can be inlined into A as well, the current algorithm defers the decision until analyzing caller A. But if we inline D into B before checking caller A, the code size of B can grow until B fails to be inlined into A (Hal explained this problem previously using the vector push_back case). That means A->B->C is eventually kept as it is, although D is inlined into B. This is *not* a problem but a heuristic choice, I think. In many C++ programs there are lots of small functions that can trigger this ABC issue, and choosing B->D rather than A->B->C also hurts compile time, because all of the callees inside B are checked even though the ABC case has already been detected. So we can exit early as soon as a positive ABC case is detected; the new algorithm then inlines B into A first, when analyzing caller A, and eventually both C and D can be inlined into A.
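The early exit itself can be a short fragment like the following, using the names that appear in the patch (Yin quotes this code in his review below): as soon as one positive ABC case is found while analyzing caller B, we stop checking B's remaining callees and let caller A drive the inlining later.

    for (unsigned i = 0; i < CallSitesForABC.size(); i++)
      if (!shouldInline(CallSitesForABC[i], ABC, false))
        if (ABC)
          return false; // early exit: defer B's call sites to caller A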

In order to apply methods (2.b) and (3.a), we have to solve an ordering issue in the inline analysis. The current inliner analyzes call sites in an unstable order. For example, given A->B1->C1, A->B2->C2, and A->B3->C3, the call-site analysis order when analyzing caller A was B1, C1, B3, C3, B2, C2. I change the order to B1, B2, B3, C1, C2, C3, as sketched below.
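A sketch of the reordering, with hypothetical helper names (collectCallSites, tryToInline) and the era's CallSite type: call sites exposed by inlining are appended to the back of the worklist instead of being visited immediately, so the original call sites of A are analyzed first.

    // Visit B1, B2, B3 first; C1, C2, C3 (exposed by inlining the Bs) are
    // appended and therefore visited afterwards, in a stable order.
    SmallVector<CallSite, 16> Worklist = collectCallSites(A);
    for (unsigned i = 0; i < Worklist.size(); ++i) {
      SmallVector<CallSite, 4> Exposed;
      if (tryToInline(Worklist[i], Exposed))
        Worklist.append(Exposed.begin(), Exposed.end());
    }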

  2. Benchmarks

Chandler previously mentioned that the SPEC benchmarks are not good candidates for measuring code-size impact, so I use the LLVM bootstrap and Chromium as the benchmarks for compile time and code size.

On LLVM revision r232011 (March 12), I got the following benchmark data:

(1) Performance:

SPEC 2000 geomean for AArch64: +1.24%
SPEC 2006 geomean for AArch64: +0.3%

(2) Code size:

SPEC 2000+2006: +2.68%
clang/LLVM: +2.88%
Chromium: +2%~3%

(3) Compile time:

LLVM bootstrap on x86: +1.8%
SPEC2006 build on x86: +2.7%

Thanks,
-Jiangning

Diff Detail

Repository
rL LLVM

Event Timeline

Jiangning updated this revision to Diff 22156.Mar 17 2015, 10:46 PM
Jiangning retitled this revision from to Inliner Enhancement.
Jiangning updated this object.
Jiangning edited the test plan for this revision. (Show Details)
Jiangning added reviewers: chandlerc, apazos, yinma, hfinkel.
Jiangning set the repository for this revision to rL LLVM.
Jiangning added a subscriber: Unknown Object (MLST).
Jiangning updated this object.Mar 17 2015, 10:47 PM

Hi Jiangning,

AFAICT, your design considerations are sound and relevant.

The "ABC problem" seems nasty. Without much thinking, could you use a lazy evaluation by annotating code with metadata stating the current threshold for that particular BB/Loop? If instead of passing through the code multiple times (for ABC and ABD, etc), you could memoize the info in a graph-like metadata, so that you could re-use partial calculations in future inlinings. It may not be worth it, but if reading from this metadata is quicker than re-calculating, that it'd at least be faster. Also, such metadata could (should?) be quickly discarded anyway, as after inlining, the costs will change.

About your benchmarks: those are very promising numbers. Taking 3% code size for 1.5% performance is not a bad trade. Have you tested/benchmarked on other arches? It would be good to know how this goes at least on x86_64 and ARM.

I'll see if I can run those same tests on ARM, at least.

cheers,
--renato

Hi Renato,

Yeah, I agree with you. This is a piece of logic we can improve for compile time. But essentially I also agree with David: using the new pass manager should be able to partially solve the issue.

I just checked on my x86 box: the performance impact on SPEC2000 is +0.85%.

Thanks,
-Jiangning

Looks good on ARM, too. Not many improvements, but no regressions either.

yinma edited edge metadata.Mar 19 2015, 4:04 PM

Hi Jiangning,

I reviewed your code. I like the loop analyzer. However, I have two concerns about your implementation for the A->B->Cx problem.

  1. When you visit an SCC, like B: if any Cx would be inlined into B, you do not inline Cx into B until visiting A:

         for (unsigned i = 0; i < CallSitesForABC.size(); i++)
           if (!shouldInline(CallSitesForABC[i], ABC, false))
             if (ABC)
               return false;

     If so, the logic is not complete, because if B is also an extern function, none of the Cx will ever be inlined into B, since each SCC is visited only once. In the end, only A gets B and C inlined, and B remains as an un-inlined version, which is a big problem. I couldn't find the code that makes sure all the C->B inlinings still happen correctly.
  2. For LLVM, when the inliner inlines A->B->C, the original order is bottom-up: C->B, then CB->A. This order matters because, for every step of inlining, the inliner calls a code-simplifier pass to optimize the inlined code, and the bottom-up order makes the C->B code smaller. In some cases, if you delay inlining C into B until A, the order changes to B->A and then all C->BA; the simplifier's input changes, and you will see performance regressions in some cases due to the order change. So you have to make sure that, after you come to A, you still use the bottom-up order to inline Cx -> B and then -> A.

Actually, after we met at the LLVM dev meeting, I implemented my own version of deferred inlining for solving A->B->C; I call it delayed inlining. I will upload it to Phabricator for you and everybody to review, and will post the link today or tomorrow.

Yin

yinma added a comment.Mar 19 2015, 4:31 PM

Hi Jiangning,

I have posted my implementation of deferred inlining here.
http://reviews.llvm.org/D8475

Please review and, if possible, give it a try with your other tuning.

Yin

Hi Yin,

Regarding your first concern, that an external B would never get the Cx inlined into it: this is great, and I didn't notice this problem. I will try to combine your patch with mine to see what happens.

Regarding the bottom-up inlining order: yeah, you are correct. But the change from bottom-up to top-down just stops when coming across a big function that cannot be inlined further. So essentially it is still bottom-up, and only for small functions close to the leaves do we change to top-down. Yes, there will be some regressions, but there will be some improvements as well, because we will be able to inline more callees into A, and they are all small functions. My experiment applying this change independently shows the performance changes are all fluctuations: some are good and some are bad. Overall, I didn't see an improvement or a regression. But using this method independently can reduce compile time a lot. You may give it a try on the LLVM bootstrap.

Hi Yin,

I checked your patch:

(1) Compile time has a huge regression: for the LLVM bootstrap on x86, I see a +100% compile-time increase, which is unacceptable.
(2) For performance, I can only see a +0.05% gain on SPEC2000, so I think it's just noise.

Thanks,
-Jiangning

yinma added a comment.Mar 20 2015, 2:02 PM

Hi Jiangning,

We have seen good performance improvements on SPEC from our internal build. I will take a look to see whether there is a porting issue.

For the compile-time increase, some of it is expected, because implementing deferred inlining in any way raises the effective threshold compared with the baseline inliner. To remedy this, the cost used in deferred inlining should, in my opinion, be tuned to be consistent with the inline threshold.

Yin

chandlerc requested changes to this revision.Apr 14 2015, 6:30 AM
chandlerc edited edge metadata.

I just want to officially comment here that I think we shouldn't invent mini analyses in the inliner. I think it is better to work on getting the infrastructure fixed so that this works directly.

This revision now requires changes to proceed.Apr 14 2015, 6:30 AM
apazos resigned from this revision.Sep 18 2018, 12:18 PM