Currently, the only information the inline cost analysis uses from the caller is the call's arguments and the attributes used on those arguments. It does not do any analysis on constants stored by the caller and later loaded by the callee.
This causes a problem in -Oz, in which certain functions are not inlined despite being a no-op at runtime, whereas with a larger inline threshold, the function might be inlined and completely eliminated by further optimization passes.
For context, Chrome compiles the Android binary with -Oz. However, this means that many no-op destructors remain un-inlined. Chief among them are ~unique_ptr and ~scoped_refptr (Chrome's refcounted pointer wrapper). This is even when an rvalue of type unique_ptr<T> is immediately moved and destructed. For example,
global_unique_ptr = std::make_unique<int>();
or
function_that_takes_unique_ptr(std::make_unique<int>());
Emits a destructor call on Android, despite the fact that the move constructor sets the unique_ptr's internal pointer to nullptr, and the destructor performs something similar to:
if (ptr_)
get_deleter()(ptr_);
Here's a relatively simple example: https://godbolt.org/z/YD3U7S
These extra no-op function calls are bad for binary size, and bad for performance. I tried to think of a way to fix this on a higher-level, such as a Clang plugin that fixes the largest offenders (~unique_ptr and ~scoped_refptr), but the AST is too high-level to fix this issue. And increasing the inline-threshold whatsoever is very bad for performance. (not to mention that fiddling with inline-threshold will either cause *all* ~unique_ptr destructors to be inlined or none of them, when it would be better for them to be inlined on a case-by-case basis).
So, add a flag called -inline-perform-caller-analysis. For now, this flag will cause analysis of stores to the caller's stack in the callsite's basic block to run during calculation of the inline cost. Then, any loads in the callee from the caller's stack will be simplified if possible (if the values could not have been clobbered already).
This change causes a 20KB decrease in Android binary size for Chrome. In addition, a planned Chrome change (increasing usage of std::make_unique and base::MakeRefCounted) that would increase Android binary size by ~100KB now causes no change in binary size. Chrome build time with and without this feature does not change (a trial compilation actually ran marginally faster with -inline-perform-caller-analysis).
The comments in InlineCost.cpp on CandidateCall indicate that we should not be performing analysis in the caller, so the results are more easily cacheable. However, inline analysis is already dependent on the call's arguments and parameter attributes, which probably often prevents caching from being useful. I also don't see anywhere in the LLVM codebase where results are being cached. And, in the future if caching will be added, we can return an "uncacheable" cost if this analysis affects the results. Even so, this feature is added behind a flag.
Ah, spotted a bug as soon as I posted this--a store to a stack location that overlaps (but is not equal to) an existing constant won't invalidate the original constant. I'll have to figure out how to fix that.