This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Optionally preserve MemorySSA
ClosedPublic

Authored by reames on Jan 21 2022, 1:29 PM.

Details

Summary

This initial patch adds code to preserve MemorySSA through a run of SLP vectorizer. The eventual plan is to use MemorySSA to accelerate SLP's memory dependence checking, but we're a ways from that.
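As a rough illustration of that eventual plan (a sketch, not code from this patch): with MemorySSA available, "what is the nearest write this load may depend on?" becomes a single walker query instead of a scan over every intervening scalar instruction.

```cpp
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Ask MemorySSA's caching walker for the defining write of a load,
// skipping non-aliasing scalar instructions entirely.
MemoryAccess *nearestClobber(LoadInst *LI, MemorySSA &MSSA) {
  MemoryUseOrDef *Acc = MSSA.getMemoryAccess(LI);
  return MSSA.getWalker()->getClobberingMemoryAccess(Acc);
}
```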

In particular, this patch is correct, but really slow. I want to land this so that the slightly more delicate compile time optimization patches are individually reviewable.

Edit: Forgot to say, this is my first time using MemorySSA for anything, so skeptical review is very warranted.

Note that I'm intentionally not preserving MemorySSA even if available. The current update code is *so slow* that it's faster to simply rebuild after SLP is done.

Suggestions on how to make this reasonably fast are welcome, but I'd strongly prefer to work incrementally and address compile time concerns in follow-up patches.
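For readers unfamiliar with the mechanics, the following is a minimal sketch of what optional preservation looks like under the new pass manager; EnableMSSA and vectorizeWithMSSA are illustrative placeholders, not names from the diff.

```cpp
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Hypothetical stand-in for the pass body; when MSSA is non-null, every
// IR change it makes must also be mirrored into MemorySSA.
bool vectorizeWithMSSA(Function &F, MemorySSA *MSSA);

PreservedAnalyses runSketch(Function &F, FunctionAnalysisManager &AM,
                            bool EnableMSSA) {
  // Fetch MemorySSA only when requested; otherwise run exactly as before.
  MemorySSA *MSSA =
      EnableMSSA ? &AM.getResult<MemorySSAAnalysis>(F).getMSSA() : nullptr;
  if (!vectorizeWithMSSA(F, MSSA))
    return PreservedAnalyses::all();
  PreservedAnalyses PA;
  if (MSSA) {
    if (VerifyMemorySSA)
      MSSA->verifyMemorySSA(); // catch missed updates in asserts builds
    PA.preserve<MemorySSAAnalysis>();
  }
  return PA;
}
```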

Diff Detail

Event Timeline

reames created this revision.Jan 21 2022, 1:29 PM
reames requested review of this revision.Jan 21 2022, 1:29 PM
Herald added a project: Restricted Project.Jan 21 2022, 1:29 PM
reames edited the summary of this revision. (Show Details)Jan 21 2022, 1:31 PM
nikic added a comment.Jan 21 2022, 1:37 PM

Can you please explain what the larger context here is? What cases are you trying to solve with MemorySSA?

I'm not sure it will be the right tool for the job, so I think we should discuss this before making any changes. We don't have MSSA available at SLP's pipeline position, and computing it just for SLP will make this pass much more expensive.

reames added a comment.EditedJan 21 2022, 1:50 PM

Can you please explain what the larger context here is? What cases are you trying to solve with MemorySSA?

Sure, though I'm a bit limited in what I can say. The original example is not public.

Essentially, I have a case where we are spending a large fraction of total O2 time inside SLP - specifically, inside the code which is figuring out which memory dependencies exist while trying to schedule. (To prevent confusion, note that SLP scheduling subsumes several legality tests.)

Specifically, the case which is hurting this example - which is machine generated code - is a very long basic block with a vectorizable pair of loads at the beginning, and a vectorizable pair of stores (consuming the loaded values) at the end. There are multiple pairs, but the core detail is that the required scheduling window is basically the entire size of the huge basic block.

The time is spent figuring out dependencies for *scalar* instructions - not even the ones we're trying to vectorize. Since this is such a huge block, the current MSSA-like memory chain ends up being very expensive.

I'd explored options for limiting the scheduling window, but MSSA felt like a more general answer, so I started there.
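The shape being described, sketched by hand since the machine-generated original is not public (names and the instruction count are illustrative):

```cpp
// One huge basic block: a vectorizable load pair at the top, a long run
// of scalar arithmetic, and a vectorizable store pair at the bottom.
// SLP's scheduling window must span the whole block.
void pathological(double *a, double *b, double x) {
  double v0 = a[0]; // vectorizable load pair
  double v1 = a[1];
  // ... imagine tens of thousands of scalar operations here ...
  x = x * v0 + 1.0;
  x = x * v1 + 2.0;
  b[0] = v0 + x; // vectorizable store pair at the far end
  b[1] = v1 - x;
}
```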

I'm not sure it will be the right tool for the job, so I think we should discuss this before making any changes. We don't have MSSA available at SLP's pipeline position, and computing it just for SLP will make this pass much more expensive.

I'm really surprised to hear you say that. My understanding was that MemorySSA was rather cheap to construct if you don't need an optimized form, and that optimization was done lazily.

However, I see my memory of prior discussion on this topic is clearly wrong. The constructor for MemorySSA does appear to eagerly optimize.

Despite this, I don't see MemorySSAAnalysis showing up as expensive in the -time-passes-per-run output even with this change. I see SLP itself slow down a lot, but I had put that down to the generic renaming instead of using specialized knowledge from the callsite.

Edit: I confirmed the pass profiling result by nulling out MSSA immediately after the getResult call. The runtime drops to basically nothing over the non-MSSA version (i.e., measurement noise). So despite the optimization at construction, it really is the updates which are expensive in this case. It's possible my example is highly unrepresentative, but that seems questionable. Any theories?
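The experiment described, as a sketch (the actual measurement hack isn't shown in the thread; this just isolates construction cost from update cost):

```cpp
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Build MemorySSA, paying construction (including its eager use
// optimization), then discard it so the pass never queries or updates
// it. Any remaining slowdown versus the no-MSSA baseline is therefore
// attributable to construction alone.
MemorySSA *constructionOnlyMSSA(Function &F, FunctionAnalysisManager &AM) {
  (void)AM.getResult<MemorySSAAnalysis>(F).getMSSA();
  return nullptr; // behave as if MSSA were unavailable
}
```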

Ok, I was able to spot the additional construction time. It took about 15 ms.

For context, the original example spends about 3.23 seconds in SLP w/o MSSA, and the (horribly unoptimized) preservation currently takes an additional 5.25 seconds on top of that.

Quite literally, different orders of magnitude. If we can get MSSA preservation down to something reasonable - again, incrementalism please - I'd argue using it here is entirely reasonable.

nikic added a comment.Jan 21 2022, 2:57 PM

Can you please explain what the larger context here is? What cases are you trying to solve with MemorySSA?

Sure, though I'm a bit limited in what I can say. The original example is not public.

Essentially, I have a case where we are spending a large fraction of total O2 time inside SLP - specifically, inside the code which is figuring out which memory dependencies exist while trying to schedule. (To prevent confusion, note that SLP scheduling subsumes several legality tests.)

Specifically, the case which is hurting this example - which is machine generated code - is a very long basic block with a vectorizable pair of loads at the beginning, and a vectorizable pair of stores (consuming the loaded values) at the end. There are multiple pairs, but the core detail is that the required scheduling window is basically the entire size of the huge basic block.

Is it possible to construct an artificial test case that can be shared? Just the loads/stores at the beginning and end and dummy instructions in between?

The time is spent figuring out dependencies for *scalar* instructions - not even the ones we're trying to vectorize. Since this is such a huge block, the current MSSA-like memory chain ends up being very expensive.

I'd explored options for limiting the scheduling window, but MSSA felt like a more general answer, so I started there.

It sounds to me like a cutoff is what this mainly needs. We always run into degenerate cases when there is an unbounded instruction walk. (MSSA itself also limits instruction walks.)

Something you might want to try is using BatchAAResults. Assuming that all the alias checks happen without IR modifications in between, it would be safe to cache them.
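A minimal sketch of that suggestion, assuming the dependence scan's queries all happen before any IR mutation (the helper and loop shape are illustrative, not SLP's actual scheduler code):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Check whether any earlier instruction may write the store's location.
// BatchAAResults caches alias queries, which is only safe while the IR
// is not modified between calls.
bool mayAnyClobber(StoreInst &SI, ArrayRef<Instruction *> Prior,
                   AAResults &AA) {
  BatchAAResults BAA(AA);
  MemoryLocation Loc = MemoryLocation::get(&SI);
  for (Instruction *I : Prior)
    if (isModSet(BAA.getModRefInfo(I, Loc)))
      return true;
  return false;
}
```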

I'm not sure it will be the right tool for the job, so I think we should discuss this before making any changes. We don't have MSSA available at SLP's pipeline position, and computing it just for SLP will make this pass much more expensive.

I'm really surprised to hear you say that. My understanding was that MemorySSA was rather cheap to construct if you don't need an optimized form, and that optimization was done lazily.

However, I see my memory of prior discussion on this topic is clearly wrong. The constructor for MemorySSA does appear to eagerly optimize.

Yes. We discussed adding a mode that does not eagerly optimize in the past, but didn't do so (yet) for lack of a use-case.

Despite this, I don't see MemorySSAAnalysis showing up as expensive in the -time-passes-per-run output even with this change. I see SLP itself slow down a lot, but I had put that down to the generic renaming instead of using specialized knowledge from the callsite.

Edit: I confirmed the pass profiling result by nulling out MSSA immediately after the getResult call. The runtime drops to basically nothing over the non-MSSA version (i.e., measurement noise). So despite the optimization at construction, it really is the updates which are expensive in this case. It's possible my example is highly unrepresentative, but that seems questionable. Any theories?

MemorySSA updates can be quite expensive -- some update operations are unexpectedly O(n). I believe insertUse() and removeMemoryAccess() should be cheap, but insertDef() with RenameUses=true can be expensive, due to renaming. In some cases you can avoid the renaming, see for example D107702.
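For concreteness, the update pattern in question might look roughly like this when SLP replaces two scalar stores with one vector store; whether the renaming walk can be avoided is exactly the open cost question (a sketch against the MemorySSAUpdater API, not the diff's code):

```cpp
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/Analysis/MemorySSAUpdater.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Give the new vector store a MemoryDef, then drop the scalar stores'
// accesses. insertDef() with RenameUses=true is the step that can
// trigger an expensive renaming walk.
void replaceStorePair(StoreInst *S0, StoreInst *S1, StoreInst *VecSt,
                      MemorySSA &MSSA) {
  MemorySSAUpdater Updater(&MSSA);
  MemoryAccess *New = Updater.createMemoryAccessInBB(
      VecSt, /*Definition=*/nullptr, VecSt->getParent(),
      MemorySSA::BeforeTerminator);
  Updater.insertDef(cast<MemoryDef>(New), /*RenameUses=*/true);
  Updater.removeMemoryAccess(MSSA.getMemoryAccess(S0));
  Updater.removeMemoryAccess(MSSA.getMemoryAccess(S1));
}
```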

Ok, I was able to spot the additional construction time. It took about 15 ms.

For context, the original example spends about 3.23 seconds in SLP w/o MSSA, and the (horribly unoptimized) preservation currently takes an additional 5.25 seconds on top of that.

Quite literally, different orders of magnitude. If we can get MSSA preservation down to something reasonable - again, incrementalism please - I'd argue using it here is entirely reasonable.

Sure, if you're looking at a degenerate case, MSSA construction will not be a dominating factor. What I have in mind here is the average case, where SLP will be practically free (and usually just not do anything), while MSSA construction still needs to happen.

Any chance you could profile where most time is spent?

As Nikita mentioned, the insertDef and moveTo (which calls insertDef) methods are very expensive if they need to do renaming. Can you check if most time is spent in MSSA->renamePass?
I'm asking this just to get confirmation on where the issue lies (even the walk at line 7714 can be an issue if the block is that huge and there are MemoryAccesses only at the very beginning and end, though I sincerely doubt it).

There is no good way to cap these updates, as renaming is often required for a correct update, but if all the accesses are in a single block which already contains MemoryDefs, the changes shouldn't be so invasive (no new MemoryPhis should be inserted), so there's a chance to make this manageable.

Also echoing @nikic, an artificial test to replicate the issue would be great.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7741

Without knowing SLPVectorizer details, I think this should be:
s/MemorySSA::InsertionPlace::End/MemorySSA::InsertionPlace::BeforeTerminator/.

reames planned changes to this revision.Jan 27 2022, 1:52 PM

I am setting this aside for the moment. The consensus of review seems to be that landing a working but slow preservation and then speeding it up incrementally is not desired. I think that's a poor choice, but am willing to comply.

I will refresh this patch once I have a complete, working, and fast MSSA preservation.

This will likely be a while: during the delay caused by the review, I've found a non-MSSA-based fix for the original compile time issue and will be proceeding with that one instead. I hope to get back to the MSSA update because I think it is generally worthwhile, but it's no longer a priority for me, and I'm currently very short on time.

reames updated this revision to Diff 413553.Mar 7 2022, 10:37 AM

Refresh the patch.

Ok, I'm back to this, and want to strongly argue for this patch being accepted largely as is. I went ahead and filed an issue with a long form writeup explaining why I think this is the right approach (https://github.com/llvm/llvm-project/issues/54256).

I want to request that we defer *efficiency* of update to separate reviews. I have investigated the cause of the slowdowns, and have a local patch which is fast on my motivating example. My local patch is definitely not fully correct as it hacks around several issues, but I'm now reasonably convinced a correct and fast preservation of MSSA is entirely possible. However, the number of interlocking and subtle changes required is well beyond anything reasonable for a single review.

If we can't unblock incremental review on this, I don't think implementing this is practical at all. There are too many undocumented assumptions about MemorySSA form, and attempting to post the entire series without being able to do cleanups in between would be simply unmanageable and unreviewable.

Herald added a project: Restricted Project.Mar 7 2022, 10:37 AM

ping

Knowing whether I'm going to be able to land incremental progress here is blocking a lot of work. I'd really appreciate either an LGTM or being told this project is being rejected. I'll accept either answer, but the lack of response is the worst possibility.

Reading the details in your post and looking at the SLPVectorizer code, I think there's potential for the pass to benefit from using MemorySSA. However, I do not understand enough of the SLPVectorizer mechanics to guess whether this will work out.
In this context, I think it makes sense to have incremental changes while keeping building and updating MSSA off by default; as you mentioned in the post, all changes can (and should) be reverted if a later decision is to pursue something else.

@nikic Thoughts?

Regarding this patch, seeing MSSA used in a dependent patch would help.
Is there a benefit from the existence of an insertAccesses()/insertDefs()/insertUses() API in MSSA (i.e. bulk add updates, perhaps limited to within the same BB)? Does doing lazy updates during the vectorizeTree() call make sense, or will MSSA need to be queried?

Reading the details in your post and looking at the SLPVectorizer code, I think there's potential for the pass to benefit from using MemorySSA. However, I do not understand enough of the SLPVectorizer mechanics to guess whether this will work out.
In this context, I think it makes sense to have incremental changes while keeping building and updating MSSA off by default; as you mentioned in the post, all changes can (and should) be reverted if a later decision is to pursue something else.

@nikic Thoughts?

Was this meant to be an LGTM? It sort of sounds like it, but you never say that explicitly.

Regarding this patch, seeing MSSA used in a dependent patch would help.
Is there a benefit from the existence of an insertAccesses()/insertDefs()/insertUses() API in MSSA (i.e. bulk add updates, perhaps limited to within the same BB)? Does doing lazy updates during the vectorizeTree() call make sense, or will MSSA need to be queried?

I explicitly don't want to try to answer this until future patches. That's the whole point of being incremental; we can discuss each one in the appropriate context, and I can avoid repeating myself and speculating about hypothetical situations which don't match the code structure as it evolves.

nikic accepted this revision.Mar 14 2022, 2:05 PM

This looks technically correct to me, so I'll accept this in the interest of exploration.

However, I am still not convinced that MemorySSA is the solution to this problem for reasons already outlined, so my baseline expectation is that we will opt to not enable this code in the end -- let's hope I'm wrong on that ;) The lack of a publishable reproducer is a really big problem here, because this leaves no room to explore alternatives at all.

This revision is now accepted and ready to land.Mar 14 2022, 2:05 PM
This revision was landed with ongoing or failed builds.Mar 15 2022, 4:38 PM
This revision was automatically updated to reflect the committed changes.