This is an archive of the discontinued LLVM Phabricator instance.

[MemCpyOpt] Port to MemorySSA
ClosedPublic

Authored by nikic on Oct 11 2020, 8:22 AM.

Details

Summary

This is a straightforward port of MemCpyOpt to MemorySSA following the approach of D26739. MemDep queries are replaced with MSSA queries without changing the overall structure of the pass. Some care has to be taken to account for differences between these APIs (MemDep also returns reads, MSSA doesn't).
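
For illustration, a minimal sketch of the kind of substitution involved, assuming the MemorySSAWalker overload that takes an explicit MemoryLocation; this is simplified and not the patch's actual code:

```
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/IntrinsicInst.h"
using namespace llvm;

// Minimal sketch (assumed helper name, not from the patch). MemDep would
// answer MD->getDependency(M) with a result that may also be a read;
// MemorySSA instead walks upwards from the memcpy's own access and only
// ever reports MemoryDefs/MemoryPhis as clobbers of the source location.
static MemoryAccess *clobberOfSource(MemorySSA &MSSA, MemTransferInst *M) {
  MemoryUseOrDef *Access = MSSA.getMemoryAccess(M);
  MemoryLocation SrcLoc = MemoryLocation::getForSource(M);
  return MSSA.getWalker()->getClobberingMemoryAccess(Access, SrcLoc);
}
```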

Diff Detail

Event Timeline

nikic created this revision.Oct 11 2020, 8:22 AM
nikic requested review of this revision.Oct 11 2020, 8:22 AM
nikic added inline comments.Oct 11 2020, 8:26 AM
llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
358

What would be a good way to do that?

MSxDOS added a subscriber: MSxDOS.Oct 14 2020, 2:21 AM
nikic added a comment.Oct 14 2020, 1:32 PM

Some preliminary compile-time numbers: https://llvm-compile-time-tracker.com/index.php?branch=nikic/perf/memcpy-mssa

The last commits are, from bottom to top: enabling MSSA-based DSE, enabling MSSA-based MemCpyOpt, and moving one MemCpyOpt pass next to a DSE pass to avoid an additional computation of MSSA. The last two commits together actually end up being a mild compile-time improvement, so at least this looks viable.

CMakeFiles/7zip-benchmark.dir/CPP/7zip/Compress/ShrinkDecoder.cpp.o 4KiB 4KiB (+0.44%)

Small codesize regression?

Thank you for working on this, the compile times look very promising and the test results are great!

Since this is using the MSSA walker API with a MemoryLocation (which does not cache), could this become costly in a pathologically constructed example? If so, could we have a (reasonably large) upper bound on the number of getClobbering(Access, Location) calls to avoid such cases?

llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
358

There isn't a way to do that currently; it needs support added in the walker. Draft: D90660; it still lacks support in tryOptimizePhi, which means it may make sense to extend the query. This is tricky, though, because it depends on the path that the access Start is on: if the new API is used in a scenario where Start does not dominate End, the given "StopAt" (Start) access may be bypassed entirely.
So even if that's not the case in MemCpyOptimizer, the MSSA walker needs to account for such a case.
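
To make the dominance caveat concrete, here is a hedged sketch of the "is the location written between Start and End?" pattern such a query boils down to; the helper name is illustrative, not one of the patch's actual functions:

```
// Illustrative sketch only (assumed name, not the patch's code). Walking
// up from End and checking whether the clobber found lies at or above
// Start is only sound when Start dominates End; otherwise the clobber may
// sit on a path that bypasses Start (the "StopAt" access) entirely, which
// is exactly the case the walker would have to account for.
static bool writtenBetweenSketch(MemorySSA &MSSA, const MemoryLocation &Loc,
                                 const MemoryUseOrDef *Start,
                                 const MemoryUseOrDef *End) {
  MemoryAccess *Clobber = MSSA.getWalker()->getClobberingMemoryAccess(
      End->getDefiningAccess(), Loc);
  return !MSSA.dominates(Clobber, Start);
}
```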

nikic updated this revision to Diff 306776.Nov 20 2020, 1:49 PM

Rebase.

nikic added a comment.Nov 20 2020, 1:57 PM

Since this is using the MSSA walker API with a MemoryLocation (which does not cache), could this become costly in a pathologically constructed example? If so, could we have a (reasonably large) upper bound on the number of getClobbering(Access, Location) calls to avoid such cases?

I've started implementing this (by returning a DoNothingMemorySSAWalker when a limit is hit), but ended up wondering if it really makes sense: I believe that the number of MSSA walker queries we do in this pass is roughly proportional to the number of memsets/memcpys in the IR. For very big functions we may end up performing many queries, but the overall runtime should still be linear (each individual query is bounded by the walker limit). Having a limit would make sense to me if something here had quadratic runtime (e.g. the AST limit in LICM), but I don't think that's the case here.
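
As a rough illustration of the limit idea described above (count the location-based walker queries and degrade to a do-nothing lookup once a budget is exhausted); the class name and budget handling are assumptions, not the patch's code:

```
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

// Illustrative sketch; names and the budget are made up for this example.
class BudgetedClobberQuery {
  MemorySSA &MSSA;
  unsigned QueriesLeft;

public:
  BudgetedClobberQuery(MemorySSA &MSSA, unsigned Budget)
      : MSSA(MSSA), QueriesLeft(Budget) {}

  MemoryAccess *get(MemoryAccess *MA, const MemoryLocation &Loc) {
    if (QueriesLeft == 0) {
      // Roughly what falling back to a DoNothingMemorySSAWalker means:
      // report the defining access without any walking or alias queries.
      if (auto *UOD = dyn_cast<MemoryUseOrDef>(MA))
        return UOD->getDefiningAccess();
      return MA;
    }
    --QueriesLeft;
    return MSSA.getWalker()->getClobberingMemoryAccess(MA, Loc);
  }
};
```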

Ok, let's leave it as is and revisit if we encounter a pathological case where the cost of N memset/memcpy queries is too high and we need to limit it.

nikic added a comment.Nov 29 2020, 2:43 PM

@asbirlea Is this ready to land in the current form?

This revision is now accepted and ready to land.Nov 30 2020, 6:03 PM
This revision was automatically updated to reflect the committed changes.