This is a straightforward port of MemCpyOpt to MemorySSA following the approach of D26739. MemDep queries are replaced with MSSA queries without changing the overall structure of the pass. Some care has to be taken to account for differences between these APIs (MemDep also returns reads, MSSA doesn't).
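For illustration, here is a minimal sketch (a hypothetical helper, not code from the patch) of what the replaced query shape looks like on the MemorySSA side, assuming the walker API as it exists at the time of this change:

```cpp
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/IntrinsicInst.h"
using namespace llvm;

// Find the instruction that last wrote the memory read by MCpy's source, if
// there is a single such write. Where MemDep could also hand back a read as
// the dependency, the MSSA walker only ever returns clobbering writes
// (MemoryDefs) or MemoryPhis, so no filtering of reads is needed here.
static Instruction *findSourceClobber(MemorySSA &MSSA, MemCpyInst *MCpy) {
  MemoryUseOrDef *Access = MSSA.getMemoryAccess(MCpy);
  MemoryAccess *Clobber = MSSA.getWalker()->getClobberingMemoryAccess(
      Access, MemoryLocation::getForSource(MCpy));
  if (auto *Def = dyn_cast<MemoryDef>(Clobber))
    if (!MSSA.isLiveOnEntryDef(Def))
      return Def->getMemoryInst();
  return nullptr; // Live-on-entry or a MemoryPhi: no single clobbering write.
}
```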
Details
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp:358: What would be a good way to do that?
Some preliminary compile-time numbers: https://llvm-compile-time-tracker.com/index.php?branch=nikic/perf/memcpy-mssa The last commits are, from bottom to top: enabling MSSA DSE, enabling MSSA MemCpyOpt, and moving one MemCpyOpt pass next to a DSE pass to avoid an additional computation of MSSA. The last two commits together actually end up being a mild compile-time improvement, so at least this looks viable.
CMakeFiles/7zip-benchmark.dir/CPP/7zip/Compress/ShrinkDecoder.cpp.o 4KiB 4KiB (+0.44%)
Small codesize regression?
Thank you for working on this; the compile times look very promising and the test results are great!
Since this is using the MSSA walker API with a MemoryLocation (which does not cache), could this become costly in a pathologically constructed example? If so, could we have a (reasonably large) upper bound on the number of getClobbering(Access, Location) calls to avoid such cases?
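For reference, a small sketch of the two walker overloads in question (the wrapper function and variable names are just for illustration); only the access-based one caches its result:

```cpp
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

void walkerOverloads(MemorySSA &MSSA, MemoryUseOrDef *MA,
                     const MemoryLocation &Loc) {
  MemorySSAWalker *Walker = MSSA.getWalker();
  // Access-based query: the result is remembered as the access's "optimized"
  // clobber, so repeating the query is cheap.
  MemoryAccess *Cached = Walker->getClobberingMemoryAccess(MA);
  // Location-based query: walks upwards from MA for an arbitrary location and
  // caches nothing, so every call redoes the alias queries.
  MemoryAccess *Uncached = Walker->getClobberingMemoryAccess(MA, Loc);
  (void)Cached;
  (void)Uncached;
}
```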
llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp:358: There isn't a way to do that currently.
I've started implementing this (by returning a DoNothingMemorySSAWalker when a limit is hit), but ended up wondering whether it really makes sense: I believe the number of MSSA walker queries we do in this pass is roughly proportional to the number of memsets/memcpys in the IR. For very big functions we may end up performing many queries, but the overall runtime should still be linear, since each individual query is bounded by the walker limit. Having a limit would make sense to me if something here had quadratic runtime (e.g. the AST limit in LICM), but I don't think that's the case here.
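For the record, a rough sketch of the kind of per-function budget that was considered here (the flag name and helper are hypothetical, not part of the patch); past the budget it simply returns the defining access, which is effectively what a DoNothingMemorySSAWalker does:

```cpp
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical cap on the number of location-based walker queries per function.
static cl::opt<unsigned> MemCpyOptMSSAQueryLimit(
    "memcpyopt-mssa-query-limit", cl::init(1000), cl::Hidden,
    cl::desc("Maximum number of MemorySSA walker queries per function"));

static MemoryAccess *queryClobber(MemorySSA &MSSA, MemoryUseOrDef *MA,
                                  const MemoryLocation &Loc,
                                  unsigned &QueriesUsed) {
  // Once the budget is exhausted, stop walking and return the defining access
  // unchanged (conservative: the caller sees an unanalyzed clobber and skips
  // the transform).
  if (QueriesUsed >= MemCpyOptMSSAQueryLimit)
    return MA->getDefiningAccess();
  ++QueriesUsed;
  return MSSA.getWalker()->getClobberingMemoryAccess(MA, Loc);
}
```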
Ok, let's leave this as is and revisit if we encounter a pathological case where the cost of N memset/memcpy queries is too high and we need to limit it.