This strategy can be enabled with -mllvm -enable-delayed-inline. It is disabled by default.
To solve the extern A, B, C problem, we implemented a subset of the deferred inlining concept, named delayed inlining. The algorithm is described below.
Given A->B->C as a simple example:
- Starting with the processing of SCC_B, if C can be inlined into B, B is only marked with the delayed flag and processing returns without doing the actual inlining.
- In SCC_A, if the delayed B can be inlined into A, B is inlined into A and C is then inlined into A as in the normal inlining process. B is also checked: if B has the delayed flag, the inlining of C into B is pushed onto the queue to make sure C is correctly inlined into B.
- If the delayed B cannot be inlined into A, C is inlined into B, and the updated B is tested again for the final decision.
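
The A->B->C walk-through above can be sketched as a toy model. This is only an illustration and not the code in this patch: Func, THRESHOLD, can_inline, inline, flush_delayed, visit_scc_b and visit_scc_a are hypothetical names, and a plain size threshold stands in for the real inline cost model.

```python
from dataclasses import dataclass, field

THRESHOLD = 12  # hypothetical size limit standing in for the real cost model


@dataclass(eq=False)  # compare by identity so list.remove() finds the exact object
class Func:
    name: str
    size: int
    callees: list = field(default_factory=list)  # Func objects called directly
    delayed: bool = False                        # stand-in for the delayed flag


def can_inline(callee):
    # Toy inlining decision: small enough bodies are inlinable.
    return callee.size <= THRESHOLD


def inline(callee, caller, log):
    # Toy inlining: the caller absorbs the callee's size and its call sites.
    caller.size += callee.size
    caller.callees.remove(callee)
    caller.callees.extend(callee.callees)
    log.append((callee.name, caller.name))


def flush_delayed(b, log):
    # Make-up step: perform the inlining into B that was postponed earlier.
    for c in list(b.callees):
        inline(c, b, log)
    b.delayed = False


def visit_scc_b(b):
    # If every callee could be inlined into B, only mark B as delayed.
    if b.callees and all(can_inline(c) for c in b.callees):
        b.delayed = True


def visit_scc_a(a, log):
    for b in list(a.callees):
        if can_inline(b):
            inline(b, a, log)              # B into A
            for c in list(b.callees):
                if can_inline(c):
                    inline(c, a, log)      # C into A, as in normal inlining
            if b.delayed:
                flush_delayed(b, log)      # queued C into B
        elif b.delayed:
            flush_delayed(b, log)          # C into B first
            if can_inline(b):              # final decision on the updated B
                inline(b, a, log)
```

In this model, a small B ends up inlined into A together with C, and the postponed C->B inlining is still performed so that the standalone B is complete. A B that is too large for A simply gets C inlined and is then retested.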
The general algorithm description:
- If all call sites in an SCC can be inlined, we check whether any function in the SCC will be visited again in later inlining steps. If so, all functions in the SCC are marked delayed and processing returns without doing any actual inlining.
- In later SCC processing, if B can be inlined into A and B is a delayed function, we recursively make sure that all call sites inside B are pushed onto the queue and correctly inlined into B, and that B is correctly inlined into A.
- If a delayed B cannot be inlined into A, we recursively inline all call sites inside B into B. The updated B is then tested again for the final decision.
- To improve speed, we cache the inline cost computed for the body of a function F. For a call instruction to F, the cost is currently set to min(cached cost, callpenalty(25)).
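
The cost caching in the last item can be illustrated with a small sketch. It is an assumption-laden model rather than the patch's implementation: BodyCostCache, body_cost, call_cost and the compute_body_cost callback are hypothetical names, and only the min(cached cost, 25) rule comes from the description above.

```python
CALL_PENALTY = 25  # the call penalty value quoted in the description


class BodyCostCache:
    """Caches a per-function body cost and derives the cost of a call to it."""

    def __init__(self, compute_body_cost):
        self._compute = compute_body_cost  # hypothetical, possibly expensive analysis
        self._cache = {}
        self.misses = 0                    # number of times the analysis actually ran

    def body_cost(self, name):
        # Compute the body cost once per function, then reuse it.
        if name not in self._cache:
            self.misses += 1
            self._cache[name] = self._compute(name)
        return self._cache[name]

    def call_cost(self, callee):
        # A call to F is charged min(cached body cost of F, call penalty).
        return min(self.body_cost(callee), CALL_PENALTY)
```

With this rule, a call to a function whose cached body cost is below 25 is charged that body cost, while a call to a larger function is charged the flat penalty of 25.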
The current inliner works at the call-site level and defers inlining when the caller can be inlined into all of the caller's callers. The new algorithm works at the function level and delays inlining when all callees can be inlined into their callers. This makes sure that all delayed functions can be correctly re-inlined later. Setting flags in a function attribute assists the re-inlining process. The new solution also has make-up steps.
This strategy has only been performance-tested with SPEC workloads on AArch64, but it has passed correctness testing on many other benchmarks.