This is an archive of the discontinued LLVM Phabricator instance.

[MemorySSA] Support lazy use optimization
ClosedPublic

Authored by nikic on Mar 10 2022, 8:21 AM.

Details

Summary

This changes MemorySSA to be constructed in unoptimized form. MemorySSA::ensureOptimized() can be called to optimize all uses (once). This should be done by passes where having optimized uses is beneficial, either because we're going to query all uses anyway, or because we're doing def-use walks.

This should help reduce the compile-time impact of MemorySSA for some use cases (the reason I started looking into this is D117926), which can avoid optimizing all uses upfront and instead optimize only those that are actually queried.

Actually, we have an existing use case for this: EarlyCSE. Disabling eager use optimization there gives a significant compile-time improvement: http://llvm-compile-time-tracker.com/compare.php?from=59191057243e34d85b644716ef2811bfea8efd1e&to=1dd3499045716e71bad61ccd70700e734a74d350&stat=instructions. This is because EarlyCSE will generally only query clobbers for a subset of all uses.

To be more conservative I haven't included the EarlyCSE change here, but I can add it.
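The deferred, once-only optimization described in the summary can be sketched as a standalone toy. This is not the actual LLVM API; the class name, members, and bool-flag bookkeeping are invented for illustration of the pattern (construct unoptimized, optimize all uses at most once, on demand):

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for MemorySSA's lazy use optimization: uses are
// left unoptimized at construction time, and ensureOptimizedUses()
// performs the optimization exactly once, on first demand.
class LazyMemorySSA {
public:
  explicit LazyMemorySSA(unsigned NumUses) : Uses(NumUses, false) {}

  // Passes that benefit from optimized uses (because they query all
  // uses anyway, or do def-use walks) call this before querying.
  void ensureOptimizedUses() {
    if (IsOptimized)
      return; // Already done; the walk happens at most once.
    for (auto &&U : Uses)
      U = true; // Stand-in for computing the optimized defining access.
    IsOptimized = true;
    ++OptimizationRuns;
  }

  bool isUseOptimized(unsigned I) const { return Uses[I]; }
  unsigned optimizationRuns() const { return OptimizationRuns; }

private:
  std::vector<bool> Uses;
  bool IsOptimized = false;
  unsigned OptimizationRuns = 0;
};
```

A pass that never calls `ensureOptimizedUses()` simply pays nothing for use optimization, which is the compile-time win the patch is after.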

Diff Detail

Event Timeline

nikic created this revision. Mar 10 2022, 8:21 AM
Herald added a project: Restricted Project. Mar 10 2022, 8:21 AM
Herald added a subscriber: hiraditya.
nikic requested review of this revision. Mar 10 2022, 8:21 AM

Thank you for sending this out!

Broadly the changes look good. The patch as is can also give some compile-time improvements for cases where MSSA is built and preserved in loop pipelines and needs to be optimized only before LICM, not before the first loop pass in the pipeline. This also means results may change based on the delayed optimization.
This should hold true for both Legacy and NPM, but I see in practice there's a difference only for LPM (http://llvm-compile-time-tracker.com/compare.php?from=067c035012fc061ad6378458774ac2df117283c6&to=59191057243e34d85b644716ef2811bfea8efd1e&stat=instructions)

Let me test this out as is first. Then, separately, with the EarlyCSE ensureOptimized() call removed, to check that the compile-time gains are not coupled with run-time regressions.

nikic added a comment. Mar 10 2022, 1:56 PM

> Thank you for sending this out!
>
> Broadly the changes look good. The patch as is can also give some compile-time improvements for cases where MSSA is built and preserved in loop pipelines and needs to be optimized only before LICM, not before the first loop pass in the pipeline. This also means results may change based on the delayed optimization.
> This should hold true for both Legacy and NPM, but I see in practice there's a difference only for LPM (http://llvm-compile-time-tracker.com/compare.php?from=067c035012fc061ad6378458774ac2df117283c6&to=59191057243e34d85b644716ef2811bfea8efd1e&stat=instructions)

I think this is because with the LegacyPM, loop analyses (including MSSA) will get scheduled unconditionally, even if there are no loops, so we sometimes compute MSSA unnecessarily. With the NewPM, analyses are only computed if there are loops, so we generally don't perform unnecessary MSSA constructions there.

> Let me test this out as is first. Then, separately, with the EarlyCSE ensureOptimized() call removed, to check that the compile-time gains are not coupled with run-time regressions.

I should mention here that this does impact EarlyCSE results due to this fallback: https://github.com/llvm/llvm-project/blob/54d7fde46e8a0e425245e18732c2a78e64fa7b35/llvm/lib/Transforms/Scalar/EarlyCSE.cpp#L1030-L1034. Without eagerly optimizing uses, the getDefiningAccess() fallback will return an unoptimized defining access. But I think we can compensate for this by increasing EarlyCSEMssaOptCap if necessary.
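The capped-fallback behavior referred to here can be illustrated with a standalone sketch (not the actual EarlyCSE code; the struct and member names are invented): a budget of expensive clobber walks per run, after which queries fall back to the plain defining access. With lazy use optimization, that fallback now yields an unoptimized access, which is why raising a cap like EarlyCSEMssaOptCap could compensate:

```cpp
#include <cassert>

// Illustrative sketch of a capped clobber query: up to Cap full walks
// are performed, then we fall back to the cheap defining-access answer.
struct CappedClobberWalker {
  unsigned Cap;           // Budget of expensive walks per run.
  unsigned WalksDone = 0; // Walks performed so far.

  // Returns true if a (simulated) full clobber walk was performed,
  // false if we fell back to the getDefiningAccess()-style answer.
  bool getClobber() {
    if (WalksDone >= Cap)
      return false; // Budget exhausted: cheap fallback.
    ++WalksDone;
    return true; // Full walk, independent of use optimization.
  }
};
```

Increasing `Cap` shifts more queries onto the precise walk, making the quality of the fallback answer (optimized vs. unoptimized uses) matter less.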

Conceptually, I think this makes a lot of sense.

The biggest thing I see in this patch is a strong need to update comments and APIs. This introduces the possibility of clients seeing unoptimized uses, which is new and not currently reflected in the API documentation. Please address this before landing.

llvm/include/llvm/Analysis/MemorySSA.h
805

Strong comment: ensureOptimizedLoads.

This does *NOT* optimize stores, and that's important to distinguish in comments and naming.

LGTM for this patch on the impact side, testing looks good.

+1 on improving documentation for future uses, to include that by default uses are no longer optimized. Also rename the new method to ensureOptimizedUses() per @reames's suggestion.
(docs can be landed separately as long as they come shortly after)

On the impact of the EarlyCSE follow-up, I'm seeing a couple of small regressions in an FDO configuration that I would not block on. I'd land the EarlyCSE change separately from this and evaluate whether increasing the cap makes sense based on whether other folks notice any performance impact from the change.

llvm/include/llvm/Analysis/MemorySSA.h
804

Please expand the doc for this API here.

nikic updated this revision to Diff 415859. Mar 16 2022, 9:10 AM

Rename to ensureOptimizedUses(), some comment updates.

fhahn added a comment. Mar 16 2022, 9:14 AM

Thanks, this looks like a nice improvement in combination with D121740!

asbirlea accepted this revision. Mar 17 2022, 3:02 PM

LGTM with a couple nits.

llvm/include/llvm/Analysis/MemorySSA.h
350

This is not necessarily true due to the cap set by MaxCheckLimit (memssa-check-limit).
Also not true if a MemoryDef was added that invalidated optimizations (OptimizedID will no longer match).
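The OptimizedID invalidation mentioned in this inline comment follows a version-stamp pattern, which can be sketched standalone (names invented for illustration; not the actual MemorySSA data structures): a use caches the generation at which it was optimized, and adding a MemoryDef bumps the generation so the cached result reads as stale:

```cpp
#include <cassert>

// Generation counter standing in for MemorySSA's internal ID source.
struct Walker {
  unsigned Generation = 0;
  void addMemoryDef() { ++Generation; } // Invalidates cached results.
};

// A use records the generation at which its optimized access was
// computed; a mismatch means the cached optimization is stale.
struct MemoryUseSketch {
  unsigned OptimizedID = ~0u; // Sentinel: never optimized.

  bool isOptimized(const Walker &W) const {
    return OptimizedID == W.Generation;
  }
  void setOptimized(const Walker &W) { OptimizedID = W.Generation; }
};
```

The point of the review comment is that documentation claiming uses "are optimized" must hedge for both the MaxCheckLimit cap and this kind of invalidation.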

806

Add: By default, during MemorySSA build, uses are *not* optimized

This revision is now accepted and ready to land. Mar 17 2022, 3:02 PM
This revision was landed with ongoing or failed builds. Mar 18 2022, 1:56 AM
This revision was automatically updated to reflect the committed changes.

I noticed by accident that MemCpyOpt does not call ensureOptimizedUses, and thus changes behavior after this change. Not sure if that's good or bad, but it doesn't seem to have been explicitly noted in the review. For call slot optimization, we do perform a lot of clobber walks from loads.

(Separately, we could rework the code to perform far fewer clobber walks; e.g., there's no point in searching for the clobber of any non-alloca location.)

> I noticed by accident that MemCpyOpt does not call ensureOptimizedUses, and thus changes behavior after this change. Not sure if that's good or bad, but it doesn't seem to have been explicitly noted in the review. For call slot optimization, we do perform a lot of clobber walks from loads.

I'm surprised that this would cause a change in behavior. MemCpyOpt always does clobber walks (without any limit that falls back to getDefiningAccess), so it really shouldn't care whether uses are optimized or not. Possibly our optimization cutoffs work slightly differently between the eager use optimization and the clobber walk, and that could cause differences?

>> I noticed by accident that MemCpyOpt does not call ensureOptimizedUses, and thus changes behavior after this change. Not sure if that's good or bad, but it doesn't seem to have been explicitly noted in the review. For call slot optimization, we do perform a lot of clobber walks from loads.
>
> I'm surprised that this would cause a change in behavior. MemCpyOpt always does clobber walks (without any limit that falls back to getDefiningAccess), so it really shouldn't care whether uses are optimized or not. Possibly our optimization cutoffs work slightly differently between the eager use optimization and the clobber walk, and that could cause differences?

Er, sorry, I was apparently sloppy in my wording.

I did *not* see a functional difference. I saw a compile-time difference. On at least one example, not having the eager use optimization caused a significant shift in where we spend time. I'm not even saying it slowed down (the results were too noisy to tell), just that it was different, and that doesn't seem obviously expected from the review.