This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Perform caller store analysis
Needs Review · Public

Authored by mpdenton on Aug 29 2019, 5:51 PM.

Details

Summary

Currently, the only information the inline cost analysis uses from the caller is the call's arguments and the attributes used on those arguments. It does not do any analysis on constants stored by the caller and later loaded by the callee.

This causes a problem at -Oz: certain functions are not inlined despite being no-ops at runtime, whereas with a larger inline threshold the function might be inlined and completely eliminated by further optimization passes.

For context, Chrome compiles the Android binary with -Oz. However, this means that many no-op destructors remain un-inlined. Chief among them are ~unique_ptr and ~scoped_refptr (Chrome's refcounted pointer wrapper). This happens even when an rvalue of type unique_ptr<T> is immediately moved and destructed. For example,
global_unique_ptr = std::make_unique<int>();
or
function_that_takes_unique_ptr(std::make_unique<int>());
emits a destructor call on Android, despite the fact that the move constructor sets the unique_ptr's internal pointer to nullptr, and the destructor performs something similar to:

if (ptr_)
  get_deleter()(ptr_);

Here's a relatively simple example: https://godbolt.org/z/YD3U7S

These extra no-op function calls are bad for binary size, and bad for performance. I tried to think of a way to fix this at a higher level, such as a Clang plugin that fixes the largest offenders (~unique_ptr and ~scoped_refptr), but the AST is too high-level to fix this issue. And increasing the inline threshold at all is very bad for performance (not to mention that fiddling with the inline threshold will either cause *all* ~unique_ptr destructors to be inlined or none of them, when it would be better to inline them on a case-by-case basis).

So, add a flag called -inline-perform-caller-analysis. For now, this flag causes an analysis of stores to the caller's stack in the callsite's basic block to run during calculation of the inline cost. Then, any loads in the callee from the caller's stack are simplified if possible (i.e., if the stored values could not already have been clobbered).

This change causes a 20KB decrease in Android binary size for Chrome. In addition, a planned Chrome change (increasing usage of std::make_unique and base::MakeRefCounted) that would increase Android binary size by ~100KB now causes no change in binary size. Chrome build time with and without this feature does not change (a trial compilation actually ran marginally faster with -inline-perform-caller-analysis).

The comments in InlineCost.cpp on CandidateCall indicate that we should not be performing analysis in the caller, so the results are more easily cacheable. However, inline analysis is already dependent on the call's arguments and parameter attributes, which probably often prevents caching from being useful. I also don't see anywhere in the LLVM codebase where results are being cached. And, in the future if caching will be added, we can return an "uncacheable" cost if this analysis affects the results. Even so, this feature is added behind a flag.

Diff Detail

Event Timeline

mpdenton created this revision. Aug 29 2019, 5:51 PM
mpdenton marked an inline comment as done. Aug 29 2019, 6:00 PM

I mostly would like feedback on the idea--there are one or two TODOs in the code and I spotted a bug. But, the change has some useful consequences for Chrome nonetheless.

llvm/lib/Analysis/InlineCost.cpp
1268

Ah, spotted a bug as soon as I posted this--a store to a stack location that overlaps (but is not equal to) an existing constant won't invalidate the original constant. I'll have to figure out how to fix that.
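For what it's worth, one way to handle the overlap (a hypothetical sketch with invented names; the actual CallerStackConstants map in the patch is structured differently) is an interval test against each recorded store:

```cpp
// Hypothetical sketch: a new store to the caller's stack must invalidate
// every recorded constant whose byte range overlaps it, not just an entry
// at the exact same offset.
#include <cstdint>
#include <map>

struct StoredConstant {
  int64_t Value;  // the known constant
  uint64_t Size;  // width of the original store, in bytes
};

// Offset into the caller's alloca -> constant known to live there.
using ConstantMap = std::map<int64_t, StoredConstant>;

// Erase every entry whose range [Begin, End) overlaps [Off, Off + Size).
void invalidateOverlapping(ConstantMap &M, int64_t Off, uint64_t Size) {
  for (auto It = M.begin(); It != M.end();) {
    int64_t Begin = It->first;
    int64_t End = Begin + static_cast<int64_t>(It->second.Size);
    if (Begin < Off + static_cast<int64_t>(Size) && Off < End)
      It = M.erase(It);
    else
      ++It;
  }
}
```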

This idea isn't fundamentally flawed; it's a good idea and something we've discussed many times.

The tricky thing I think will be getting it right.

I think you're going to need a much more powerful form of analysis, specifically something like MemorySSA, to do this kind of thing well. Integrating MemorySSA into the inliner and using it to do powerful store->load forwarding would be *awesome*, but it is also going to be quite a lot of work, I fear.

Doing something simpler is likely possible along the lines of what you're starting to do here, but I fear we will have to chase a very long tail of subtle corner cases even with these kinds of simplifications (only look at stores in the caller, only look at constant offsets, etc.).

Not sure how far down you want to go on this rabbit hole? That'll somewhat dictate the direction I suggest...

Indeed, I saw the MemorySSA comment in the source. :) I'm interested in the rabbit hole.

As for this simple case, I tried to make the patch heavily integrated with the current inlining infrastructure, so that corner cases are primarily handled by the existing logic. For example, most of the information used comes from the ConstantOffsetPtrs map and the SROAArgValues map. The SimplifiedValues map is used for propagating constants, just like for the function arguments. And clearing the map of "CallerStackConstants" is handled by the existing disableLoadElimination(), which is called whenever future loads may be clobbered.

So, the only non-trivial things added here are:
(1) the CallerStackConstants map, which has the corner case of overlapping stores (not yet dealt with);
(2) the caller "store search" performed in searchForConstantAtOffset. It's quite simple and doesn't go past many calls, nor past the callsite's basic block. These simplifications eliminate most of the corner cases.

So (1) has a corner case that should be solvable, and (2) might have some corner cases--can anything modify memory or clobber stores other than CallBases and stores?
There are plenty of corner cases in the inliner altogether, but I think the additions here don't necessarily *add* many extra.

As for efficacy, (2) is the part that is sad to do without MemorySSA. However, it covers some fairly significant cases that come up all the time in C++, as with the no-op destructors.

It seems it would be cool to do something like this first, and then dive down the MemorySSA rabbit hole to see how much it can add.

Thoughts?

mpdenton added a comment. Edited Sep 6 2019, 3:31 PM

So, looking at MemorySSA, it looks like getClobberingMemoryAccess is the API. I assume the first argument would be the MemoryAccess that corresponds to the callsite, and the MemoryLocation would be the caller pointer returned from calleePtrToCallerAllocaOffset. Is there a more sophisticated use here, or do I still need all the restrictions from the current inlining code (e.g. constant offsets)?

Would I need MemorySSA from both the callee and the caller? Is MemorySSA intra-procedural or do I still need to manually perform the translation from caller ptr -> callee ptr with calleePtrToCallerAllocaOffset?

Also, getClobberingMemoryAccess, I believe, just skips no-alias Defs but can return may-alias. But don't I want something more definitive? I.e. I either want a definitive "yes, this writes exactly the memory location you're querying about" or no result at all. So maybe I'd use the walker and then do a manual check to see whether we've found a definitive overwrite. Which would mean I'd be dealing with the same edge cases as my current code (overlapping stores?).
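To make the intended query concrete, roughly this shape (a non-runnable sketch against LLVM's MemorySSA interface; findClobber and the surrounding plumbing are invented for illustration, and whether the MemoryLocation-taking overload is the right fit is exactly my question):

```cpp
// Sketch: assumes MemorySSA has been built for the *caller*, and that
// CallerLoc is the MemoryLocation for the pointer returned by
// calleePtrToCallerAllocaOffset.
static MemoryAccess *findClobber(MemorySSA &MSSA, CallBase &CandidateCall,
                                 const MemoryLocation &CallerLoc) {
  MemorySSAWalker *Walker = MSSA.getWalker();
  MemoryAccess *CallAccess = MSSA.getMemoryAccess(&CandidateCall);
  // Walks upward past accesses known not to alias CallerLoc; the result can
  // still be a may-alias clobber, so a manual must-overwrite check follows.
  return Walker->getClobberingMemoryAccess(CallAccess, CallerLoc);
}
```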

Also, if MemorySSA is not inter-procedural, the "search" for stores of constants wouldn't continue past calls anyway, just like my current code. (Calls to readonly functions won't show up as clobbers, but my current code handles that already.)

So I'd think MemorySSA would probably buy me "stores from other basic blocks in the function, not just the callsite's basic block", but I'd still probably be dealing with the same edge cases, and MemorySSA brings its own can of worms.

But I'm new to this MemorySSA thing, so I probably missed something above and there might be some additional benefit to using MemorySSA? (There's the "load elimination" thing from the current inliner code, but I'm considering that orthogonal to this change)

I'm not familiar with the inliner, but I'll try to answer some of the MemorySSA questions.

So, looking at MemorySSA, it looks like getClobberingMemoryAccess is the API. I assume the first argument would be the MemoryAccess that corresponds to the callsite, and the MemoryLocation would be the caller pointer returned from calleePtrToCallerAllocaOffset. Is there a more sophisticated use here, or do I still need all the restrictions from the current inlining code (e.g. constant offsets)?

Would I need MemorySSA from both the callee and the caller? Is MemorySSA intra-procedural or do I still need to manually perform the translation from caller ptr -> callee ptr with calleePtrToCallerAllocaOffset?

MemorySSA applies to a single function (it is intra-procedural, but not inter-procedural, I believe). I am guessing yes, you'll need to manually perform the translation.

Also, getClobberingMemoryAccess, I believe, just skips no-alias Defs but can return may-alias. But don't I want something more definitive? I.e. I either want a definitive "yes, this writes exactly the memory location you're querying about" or no result at all. So maybe I'd use the walker and then do a manual check to see whether we've found a definitive overwrite. Which would mean I'd be dealing with the same edge cases as my current code (overlapping stores?).

Yes. It will also set a "Must" bit for a known must-alias. If it is set to May, it's probably not a definitive overwrite.

Also, if MemorySSA is not inter-procedural, the "search" for stores of constants wouldn't continue past calls anyway, just like my current code. (Calls to readonly functions won't show up as clobbers, but my current code handles that already.)

That's right, it's not inter-procedural, single function only. Again, not familiar with this. I will guess you could skip more than just readonly calls, but it cannot continue past any call.

So I'd think MemorySSA would probably buy me "stores from other basic blocks in the function, not just the callsite's basic block", but I'd still probably be dealing with the same edge cases, and MemorySSA brings its own can of worms.

It will probably buy you some more than that but I will guess there will be remaining edge cases and potentially a big can of worms :).

But I'm new to this MemorySSA thing, so I probably missed something above and there might be some additional benefit to using MemorySSA? (There's the "load elimination" thing from the current inliner code, but I'm considering that orthogonal to this change)

I'll try to look over this in more detail next week to understand the use case. Happy to chat as well.

MemorySSA applies to a single function (it is intra-procedural, but not inter-procedural, I believe). I am guessing yes, you'll need to manually perform the translation.

Ah yes, of course I meant inter-procedural, thanks for the clarification.

Also, getClobberingMemoryAccess, I believe, just skips no-alias Defs but can return may-alias. But don't I want something more definitive? I.e. I either want a definitive "yes, this writes exactly the memory location you're querying about" or no result at all. So maybe I'd use the walker and then do a manual check to see whether we've found a definitive overwrite. Which would mean I'd be dealing with the same edge cases as my current code (overlapping stores?).

Yes. It will also set a "Must" bit for a known must-alias. If it is set to May, it's probably not a definitive overwrite.

Where does it set this "Must" bit? I took a look at the MemorySSA code, and from what I can tell, it doesn't do anything with the Alias result, at least in the "MemoryLoc" version?

So I'd think MemorySSA would probably buy me "stores from other basic blocks in the function, not just the callsite's basic block", but I'd still probably be dealing with the same edge cases, and MemorySSA brings its own can of worms.

It will probably buy you some more than that but I will guess there will be remaining edge cases and potentially a big can of worms :).

Well, hopefully it can buy us a lot without too much trouble.

I'll try to look over this in more detail next week to understand the use case. Happy to chat as well.

Thanks!