This is an archive of the discontinued LLVM Phabricator instance.

[MemCpyOpt] Add a stack-move optimization to opportunistically merge allocas together.
Needs Review · Public

Authored by pcwalton on Dec 15 2022, 2:01 AM.



This patch adds a new feature to the memcpy optimizer known as the stack-move
optimization, intended primarily for the Rust language. It detects the pattern
whereby memory is copied from one stack slot to another stack slot in such a
way that the destination and source are neither captured nor simultaneously
live. In such cases, it optimizes the pattern by merging the two allocas into
one and deleting the memcpy.
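To make the pattern concrete, here is a hypothetical Rust snippet (not taken from the patch or its tests) showing the kind of code that produces such a stack-to-stack copy in unoptimized IR:

```rust
// Hypothetical illustration. `Big` is large enough that moving it is
// lowered to a stack-to-stack memcpy rather than a register copy.
#[derive(Clone, Copy)]
struct Big {
    data: [u64; 64],
}

fn make() -> Big {
    Big { data: [7; 64] }
}

fn move_through_temp() -> u64 {
    // `tmp` occupies one stack slot (alloca)...
    let tmp = make();
    // ...and this move copies it into a second slot. `tmp` is dead
    // afterwards and neither slot's address escapes, so the two allocas
    // can be merged and the copy deleted.
    let owned = tmp;
    owned.data[0]
}

fn main() {
    assert_eq!(move_through_temp(), 7);
}
```

Because `tmp` and `owned` are never simultaneously live and neither address is captured, both locals can share one stack slot.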

Without any changes to the frontend, this optimization eliminates over 17% of
the memcpys between stack slots in the self-hosted Rust compiler. The number
rises to approximately 42% if the Rust frontend is patched to apply nocapture
to all pointer-valued arguments. That patch is of course an incorrect change,
but a Rust compiler compiled in this way works, which is evidence that further
improvements to the Rust frontend to deduce nocapture in more cases can soundly
push the number of memcpys that this optimization eliminates toward the higher
end of that range.

In order to determine that the optimization is safe to apply, this patch
performs a coarse-grained conservative liveness analysis of the two stack slots
on the basic block level. This may appear to be slow, but measurement has
demonstrated the compile-time overhead of this optimization to be approximately
0.07%, which seems well worth it for the large amount of stack traffic that it
eliminates in typical Rust code. Nevertheless, in order to reduce the
compile-time impact of this optimization, especially for other languages in
which it isn't expected to help as much, it has a frontend-configurable cap on
the number of basic blocks that it's allowed to examine.

The liveness analysis is built on top of the CaptureTracking framework so that
the pointer use analysis that capture tracking performs can be reused to
calculate live ranges. The actual analysis is the efficient
"variable-at-a-time" version of liveness analysis, which has a time complexity
of O(number of variables * number of instructions) in the worst case. Since we
only have two variables in this case, and we work on the basic block level
instead of on the instruction level (with the exception of the basic block
containing the memcpy itself), this reduces to O(number of basic blocks +
number of instructions in that one block). In practice, this seems to be
efficient enough to avoid a significant compile time regression.
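As a rough sketch of the idea (an illustration only, not the patch's actual code), block-level liveness for one variable is a single backward dataflow pass to a fixpoint; "variable-at-a-time" means running one such pass per variable, here just twice:

```rust
use std::collections::HashSet;

/// Toy block-level liveness for a single variable.
/// `succs[b]` lists the successors of block `b`; `uses[b]` / `defs[b]` say
/// whether block `b` reads, or fully overwrites, the variable of interest.
/// Returns the set of blocks at whose entry the variable is live.
fn live_in_blocks(succs: &[Vec<usize>], uses: &[bool], defs: &[bool]) -> HashSet<usize> {
    let n = succs.len();
    let mut live_in = vec![false; n];
    let mut changed = true;
    while changed {
        changed = false;
        for b in (0..n).rev() {
            // live-out = union of the successors' live-in
            let live_out = succs[b].iter().any(|&s| live_in[s]);
            // live-in = use ∪ (live-out ∖ def)
            let new_in = uses[b] || (live_out && !defs[b]);
            if new_in != live_in[b] {
                live_in[b] = new_in;
                changed = true;
            }
        }
    }
    (0..n).filter(|&b| live_in[b]).collect()
}

fn main() {
    // Diamond CFG: 0 -> {1, 2} -> 3; the variable is defined in block 0
    // and used in block 3, so it is live into blocks 1, 2, and 3.
    let succs = vec![vec![1, 2], vec![3], vec![3], vec![]];
    let uses = vec![false, false, false, true];
    let defs = vec![true, false, false, false];
    let live = live_in_blocks(&succs, &uses, &defs);
    let expected: HashSet<usize> = [1, 2, 3].into_iter().collect();
    assert_eq!(live, expected);
}
```

With only the two allocas as variables, the transform boils down to two such passes plus an instruction-level check inside the memcpy's own block; it is safe when the two resulting live ranges do not overlap.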

Additionally, this optimization "shrink wraps" any lifetime markers around the
allocas to the nearest common dominator and postdominator blocks. For this, it
needs the postdominator tree. This doesn't seem to cause a noticeable
difference in compile times, because the postdominator tree is needed by dead
store elimination anyway, which typically follows memcpy optimization.
Nonetheless, to be maximally cautious regarding compile time, we only require
the postdominator tree if the optimization is enabled.
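The shrink-wrapping step hinges on finding a nearest common dominator (and, dually, a nearest common postdominator). A minimal sketch of that query, assuming a precomputed immediate-dominator table and tree depths (the patch itself uses LLVM's DominatorTree, not this code):

```rust
/// `idom[n]` is the immediate dominator of node `n` (the root is its own
/// idom); `depth[n]` is the node's depth in the dominator tree.
fn nearest_common_dominator(idom: &[usize], depth: &[usize], mut a: usize, mut b: usize) -> usize {
    // Walk the deeper node up until both sit at the same depth...
    while depth[a] > depth[b] {
        a = idom[a];
    }
    while depth[b] > depth[a] {
        b = idom[b];
    }
    // ...then climb in lockstep until the two idom chains meet.
    while a != b {
        a = idom[a];
        b = idom[b];
    }
    a
}

fn main() {
    // Dominator tree: 0 is the root; 1 and 2 are children of 0; 3 is a
    // child of 1.
    let idom = vec![0, 0, 0, 1];
    let depth = vec![0usize, 1, 1, 2];
    assert_eq!(nearest_common_dominator(&idom, &depth, 3, 2), 0);
    assert_eq!(nearest_common_dominator(&idom, &depth, 3, 1), 1);
}
```

The same walk on the postdominator tree yields the block below which both allocas are dead, bounding where the lifetime markers can be shrunk to.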

Diff Detail

Event Timeline

pcwalton created this revision. · Dec 15 2022, 2:01 AM
Herald added a project: Restricted Project. · View Herald Transcript · Dec 15 2022, 2:01 AM
pcwalton requested review of this revision. · Dec 15 2022, 2:01 AM

I know this is a big patch. Please feel free to let me know if it's too much to review and I can hand it to someone else. Thanks for your time :)

xbolva00 added inline comments.

Let's avoid yet another off-by-default optimization.

nikic added inline comments. · Dec 15 2022, 3:40 AM

Setting this to 8, I see no compile-time impact and very little codegen impact on CTMark -- looks like this optimization just doesn't trigger on C/C++ code. So I don't think we need to be concerned about enabling it by default.


stripPointerCasts() is sufficient. You'll never get an alloca looking through aliases.

alex added a subscriber: alex. · Dec 15 2022, 5:04 AM
lkail added a subscriber: lkail. · Dec 15 2022, 6:47 AM
arsenm added a subscriber: arsenm. · Dec 15 2022, 6:49 AM
arsenm added inline comments.

Can reflow to early exit if either one is dynamic


you could adjust the alignments up


Can merge these into one LLVM_DEBUG block


You should query the pointer size for the pointer type


Swap condition? PostDomNode ? getBlock : null


More early returns


Is there any real reason for a strip here with opaque pointers? This only does anything for addrspacecast?


Add a test with a volatile memcpy. Also one that shows metadata that was attached to these

arsenm added inline comments. · Dec 15 2022, 7:01 AM

Should also check the address space of the allocas match

FWIW, if this is only about allocas, this is going to be covered by fixing SROA mem transfer intrin handling, and finishing the nocapture/"backing alloca" handling.

pcwalton updated this revision to Diff 484132. · Dec 19 2022, 4:31 PM

Updated per comments. The optimization is now enabled by default and tests have been updated accordingly.

pcwalton retitled this revision from [MemCpyOpt] Add a stack-move optimization to opportunistically merge allocas together, disabled by default. to [MemCpyOpt] Add a stack-move optimization to opportunistically merge allocas together. · Dec 19 2022, 4:32 PM
pcwalton edited the summary of this revision. (Show Details)
pcwalton marked 11 inline comments as done.

Should be ready for re-review now.


I'm not quite sure what you meant by metadata that was attached, but I added a test remove_alloca_metadata that ensures that we strip metadata properly.

pcwalton edited the summary of this revision. (Show Details) · Dec 19 2022, 4:36 PM
nikic added a comment. · Jan 6 2023, 8:17 AM

Some mostly stylistic notes, I didn't get to the core logic yet.

High level question: Do we need to do anything about debug info during this transform? I suspect "yes", but I have absolutely no clue about debug info.


This limit is pretty large. Do you have any data that shows that there is any substantial increase in optimization opportunities over a more conservative value, such as 8 or 16?


auto * with dyn_cast


This casts the TypeSize to int, which will assert for scalable vectors. Please add a test with scalable load/store.

(I think if you pass in TypeSize rather than uint64_t and use that, you can probably support scalable vectors without much effort, though just bailing out is also fine.)


The iterator is moved forward before the process functions are called, so it already points to the instruction after the store. I think what you want to do here is just ++BBI;, which will get undone by the step back that returning true from here performs. (Ugh, the iterator handling in memcpyopt is a mess...)


I've added a getAllocationSize() method to avoid the *8 back and forth here.


For stacksave/stackrestore it would be fine to check just isStaticAlloca() (in fact, you might want to do that anyway for inalloca allocas, though they might get excluded by other checks already). Even if the alloca is not at the start of the entry block, it won't get affected by stacksave/stackrestore.

For domination, as you skip past GEP instructions (which might be based on the allocas), I'm not sure this would actually avoid all possible domination issues. What I'd do is check isStaticAlloca here, and then in the transform move the retained alloca up if necessary.


nit: Omit braces.


or so


This inspects a potentially large number of instructions.


You want the index size of the alloca address space here, not the pointer size of the default address space.


This looks a bit odd; it might be better to use the instruction-level common dominator instead? I've landed a change to add a convenience API for this.


nit: auto * for dyn_cast

n-omer added a subscriber: n-omer. · Jan 11 2023, 8:03 AM

Drive-by comment after spotting this question:

High level question: Do we need to do anything about debug info during this transform? I suspect "yes", but I have absolutely no clue about debug info.

From a brief glance, I understand this optimisation RAUWs the dest alloca to the source alloca -- if that's all, then there are a couple of ways that variable locations will be presented that have the potential to be misleading. I'm assuming that the source and destination allocas each have their own source variable:

  • Stores to the surviving alloca can be interpreted as assignments to _both_ source variables. That means stores to the alloca before the memcpy would have happened may appear as assignments to the "destination" variable, and stores after the memcpy might appear as assignments to the "source" variable.
  • If no part of either alloca is promoted, both source and destination variables will have the same locations, and so will appear to be identical variables throughout the function.

Whether this is acceptable depends on the user expectations -- if the source and destination variables are never in scope at the same time, then there are no downsides. It might be tolerable for some of these assignments to "leak" between variables. Another alternative would be to try to terminate the "source" variable after the memcpy and prevent the "destination" variable appearing before it. (I know nothing of Rust, hence asking about preferences.) CC @Orlando as this could interact with assignment-tracking stuff.

I'm not aware of anything odd that could happen with source locations.

khei4 added a subscriber: khei4. · Mar 15 2023, 1:19 AM
khei4 added a comment. · Jun 6 2023, 8:09 AM

@pcwalton Hi! I'm happy to take ownership of fixing this unless someone else is already working on it. I now plan to start with basic block local optimizations!