This patch adds a new feature to the memcpy optimizer known as the stack-move
optimization, intended primarily for the Rust language. It detects the pattern
whereby memory is copied from one stack slot to another stack slot in such a
way that the destination and source are neither captured nor simultaneously
live. In such cases, it optimizes the pattern by merging the two allocas into
one and deleting the memcpy.
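As a rough illustration (a hypothetical example, not taken from the patch or
its tests), the kind of Rust source that tends to lower to a stack-to-stack
memcpy is a by-value move of a large local into another local:

    // Hypothetical Rust source; names are illustrative only.
    pub fn sum_bytes() -> u64 {
        // `buf` gets its own stack slot.
        let buf = [7u8; 1024];
        // Without optimization, this move can lower to a second stack slot
        // plus a memcpy from `buf` into `moved`. Neither slot is captured,
        // and their live ranges don't overlap after the move, so the two
        // allocas can be merged and the memcpy deleted.
        let moved = buf;
        moved.iter().map(|&b| u64::from(b)).sum()
    }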
Without any changes to the frontend, this optimization eliminates over 17% of
the memcpys between stack slots in the self-hosted Rust compiler. The number
rises to approximately 42% if the Rust frontend is patched to apply nocapture
to all pointer-valued arguments. That patch is of course not a correct change
in general, but a Rust compiler built this way does work, which is evidence
that further improvements to the Rust frontend's nocapture deduction can
soundly push the fraction of memcpys that this optimization eliminates toward
the higher end of that range.
In order to determine that the optimization is safe to apply, this patch
performs a coarse-grained conservative liveness analysis of the two stack slots
on the basic block level. This may appear to be slow, but measurement has
demonstrated the compile-time overhead of this optimization to be approximately
0.07%, which seems well worth it for the large amount of stack traffic that it
eliminates in typical Rust code. Nevertheless, to further limit the
compile-time impact of this optimization, especially for other languages in
which it isn't expected to help as much, it has a frontend-configurable cap on
the number of basic blocks that it's allowed to examine.
The liveness analysis is built on top of the CaptureTracking framework so that
the pointer use analysis that capture tracking performs can be reused to
calculate live ranges. The actual analysis is the efficient
"variable-at-a-time" version of liveness analysis, which has a time complexity
of O(number of variables * number of instructions) in the worst case. Since we
only have two variables in this case, and we work on the basic block level
instead of on the instruction level (with the exception of the basic block
containing the memcpy itself), this reduces to O(number of basic blocks +
number of instructions in that one block). In practice, this seems to be
efficient enough to avoid a significant compile-time regression.
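To make the shape of the analysis concrete, here is a minimal sketch in Rust
(not the actual C++ implementation, and with the instruction-level handling of
the memcpy's own block omitted) of variable-at-a-time, basic-block-granularity
liveness:

    use std::collections::{HashMap, HashSet};

    // Sketch only: block ids stand in for LLVM basic blocks. Starting from
    // every block that uses the slot, walk predecessor edges backward,
    // stopping at the block that defines (allocates) the slot; every block
    // reached this way is part of the slot's live range.
    fn live_blocks(
        preds: &HashMap<u32, Vec<u32>>, // block id -> predecessor block ids
        def_block: u32,                 // block containing the alloca
        use_blocks: &HashSet<u32>,      // blocks containing uses of the slot
    ) -> HashSet<u32> {
        let mut live = use_blocks.clone();
        let mut worklist: Vec<u32> = use_blocks.iter().copied().collect();
        while let Some(bb) = worklist.pop() {
            if bb == def_block {
                continue; // don't walk above the definition
            }
            for &pred in preds.get(&bb).into_iter().flatten() {
                if live.insert(pred) {
                    worklist.push(pred);
                }
            }
        }
        live
    }

    // The slots may only be merged when their block-level live ranges are
    // disjoint (the memcpy's own block would be checked at instruction
    // granularity, which this sketch leaves out).
    fn may_merge(live_src: &HashSet<u32>, live_dst: &HashSet<u32>) -> bool {
        live_src.is_disjoint(live_dst)
    }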
Additionally, this optimization "shrink-wraps" any lifetime markers around the
allocas to the nearest common dominator and postdominator blocks. For this, it
needs the postdominator tree. This doesn't seem to cause a noticeable
difference in compile times, because the postdominator tree is needed by dead
store elimination anyway, which typically follows memcpy optimization.
Nonetheless, to be maximally cautious regarding compile time, we only require
the postdominator tree if the optimization is enabled.
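For reference, the nearest-common-dominator part of that shrink-wrapping can
be sketched as the usual depth-based walk up the dominator tree (again a
simplified Rust sketch, not the patch's C++ code; the postdominator side is
symmetric):

    use std::collections::HashMap;

    // Sketch only: given each block's immediate dominator and its depth in
    // the dominator tree, walk the deeper block up until both blocks are at
    // the same depth, then walk both up in lockstep until they meet. The
    // lifetime.start marker would be hoisted to the nearest common dominator
    // of the merged slot's uses; lifetime.end placement would use the same
    // walk on the postdominator tree.
    fn nearest_common_dominator(
        idom: &HashMap<u32, u32>,  // block id -> immediate dominator
        depth: &HashMap<u32, u32>, // block id -> depth in the dominator tree
        mut a: u32,
        mut b: u32,
    ) -> u32 {
        while depth[&a] > depth[&b] {
            a = idom[&a];
        }
        while depth[&b] > depth[&a] {
            b = idom[&b];
        }
        while a != b {
            a = idom[&a];
            b = idom[&b];
        }
        a
    }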
Let's avoid yet another off-by-default optimization.