Page MenuHomePhabricator

Update MergedLoadStoreMotion to use MemorySSA
AbandonedPublic

Authored by dberlin on Mar 28 2015, 5:43 PM.

Details

Summary

This updates MergedLoadStoreMotion to use and preserve MemorySSA.
It depends on D7864

Prior to this, the algorithm for loads was N^2, and for stores, it was N^2R, where N is the number of instructions in
the block, and R is the number of removed stores
(It restarted the reverse walk every time it removed a store due to iterator invalidation).

It is now O(M) (where M is the number of memory instructions in the two blocks) for loads,
and O(max(M,S^2)) for stores (because we have no downwards clobbering API yet, the hash table does not help
us determine memory dependence for our uses).

I have deliberately not changed behavior in terms of what loads/stores it will remove or the compile time
controls, in order to make the change minimal.

(Hopefully phabricator will not screwup a revision that is a branch of a branch,
arc diff --preview showed the right stuff)

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dberlin marked an inline comment as done.

Rebase for passmanager conversion

It looks like this change still contains all of the MemorySSA changes that were split out into D21463

Properly rebase on top of other patch

Should be fixed now.

Would it make sense to split out the DFS numbering code motion barrier code into a separate change?

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
975

Typo: "explosed".
Also, there is no 'End' parameter as referenced in the comment.

Rebase after dependent revision commited

Also run verification pass with new pass manager for tests.

Any other comments?
If not, because it's off by default (and it looked like there are no real objections), my plan was to commit it.

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
719

Oh, sorry. I copied this code from the non-memoryssa version.
(In truth, the code can stand a lot of cleaning up, improvement, and reducing of restrictions)

I'm still digesting this, but here are a couple more nits I noticed.
I don't want to hold up approval from anyone else though.

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
754

Isn't MemorySSA only preserved if 'UseMemorySSA' is true?

980

getDFSNumIn/getDFSNumOut are documented as: These are an internal implementation detail, do not call them.
which seems to conflict with this usage.

dberlin added inline comments.Jun 22 2016, 12:54 PM
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
754

Yes, fixed.

980

I will change the docs for them in a separate patch.

The change to make updateDFSNumbers public, etc, was done specifically to enable this usage.
See the thread titlte "[PATCH] Make updateDFSNumbers API public" from last year, and there was a review where i computed them separately, and it was suggested to just make it public.

(and happy to wait if you want to review it)

Drive-by comments about comments (insert "we need to go deeper" meme here)

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
789

Potentially unfinished thought?

795

Did you mean "only move simple (non-atomic, non-volatile)"?

  • Update for review comments on comments
dberlin added inline comments.Jun 22 2016, 1:40 PM
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
980

This is now fixed in the docs for those functions

Gerolf added a subscriber: Gerolf.Jun 22 2016, 7:07 PM
Gerolf added inline comments.
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
308

Loc0,1 should be moved below the conditional. Then they are definitely needed.

347

This is a MSSA update problem. How about
if (MSSA && MSSA->update(HoistCan, HoistedInst))

MSSA->removeMemoryAccess(ElseInst);
541

You could remove the conditional here and simply return from updateMemorySSA when MSSA is null.

615

How about: Visit blocks in DFS order and number instructions. For each block track first hoist barrier and last sink barrier.

743

I think this is the only place where you need UseMemorySSA. All other instances can be handled by if (MSSA).

1030

Shouldn't updates be owned by MSSA rather than each client?

dberlin marked 10 inline comments as done.Jun 22 2016, 11:54 PM
dberlin added inline comments.
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
347

As mentioned, if you look early enough in the thread on this, you can see at one point it was, in fact that API, and i was explicitly asked to make simpler apis instead of these kinds of composed ones :)

The way it is now is also consistent with how we do updates for everything else both SSA related (where you do exactly what you see below), and analysis related, in LLVM.
(IE memdep has you call invalidateCachedPointerInfo and such, dominators requires you set and redirect dominators on your own)

In fact, if memoryssa was part of the IR, there would be no difference between the way this code looks and the rest of the function (which is updating the IR) looks . Because it's a side datastructure the naming is a little different and we haven't built all the same utilities yet.

Additonally, the kind of API you suggest above would have wildly varying complexity and is overkill vs building utilities that handle the common cases, which is what we do elsewhere for ssa/dom/etc updates.

I'm more than happy to commit to building those utilities as we come across common cases.

1030

(replying here since phab is not good at processing email like this):

That would be inconsistent with what we do in the rest of LLVM :)
MemorySSA is an SSA form, and like SSA in LLVM, passes are responsible for keeping SSA up to date. This is also true of most analysis preservation.
You can see that splitCriticalEdge takes a memdep argument, for example, and will update memdep instead of memdep updating itself.
Same with dominators. Passes that want to preserve Dom are responsible for redirecting the dominators where they should go. They can screw it up, too!

Doing general updates (IE "I have no idea what happened, you go find out and fix it") is also expensive, and at least for every pass i've converted so far, completely unnecessary.

I also had more functional updaters (IE replaceAccessWithHoistedAccess) in memoryssa at earlier points, but if you look earlier in the review, you'll see others wanted it to be simple functionality instead.
In retrospect, i agree.
We can build a general updater if we need it, and we can build utilities to handle common updates if we need it. So far, it hasn't been necessary.

At some point, the better question is really "why is memoryssa a side data structure instead of just part of the normal IR". But that's a question for another day.

  • Update for review comments
dberlin added inline comments.Jun 23 2016, 10:19 AM
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
1030

(and, just to also point out, the update algorithm we use is specific to diamonds, and would not work for more general control flow. It also would not work without the set of legality testing, etc, we perform elsewhere in the code )

I'll take a more in depth look into the load and store merge routines also. At a first glance it seems one could just add a few MSSA hooks rather than copy-paste-modify the code base.
I also take a (late :-() look at the core MSSA design. Could you outline the high-level MSSA design in more detail than http://reviews.llvm.org/D7864?Eg. how does it compare/relate to http://www.airs.com/dnovillo/Papers/mem-ssa.pdf? Thanks!

-Gerolf

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
347

Ok, it would probably not be possible foresee all use cases. I certainly agree to the iterative approach you propose. I'm surprised though that a simple replacement of a load has not come up elsewhere.

1030

What got me here is that I expected the update to be simple. Why is here more involved than replacing two MD's with one, possibly removing a MP? From the clients perspective updates like this look too complicated to get right. Maybe there is verifier in place, but in the code I don't even see a single assertion but could help debugging issues should something go wrong. Perhaps we can come up with some utility routines that make simpler to grasp: when two memory or more memory operation get moved, this is the blueprint for updating MSSA and guarantee it stays consistent.

1097

There could be a lambda or function that takes care of the two loops.

Also, ust to give you some idea of general update complexity: If you don't
bound the problem with some common cases, it's simply not possible to make
it faster than current creation (Ie it would be simpler/easier to not
preserve memoryssa in those cases).

For the common cases, assuming you tell us what variables to remove and
what variables were added in what blocks:

  1. If you only removed and replaced variables (Ie literally no code motion

or insertion), it requires the renaming pass only. It could be further
optimized to not redo use chain optimization for memoryuses if you can
guarantee you only touched memoryuses. For memorydefs, if you can guarantee
(and you probably can't) that none of the aa relationships have changed,
you would not have to redo use chain opt there either. You can't guarantee
this because basicaa, et al have walkers with random hard limits in them,
and so you may get better results simply by reducing the length of a chain
of things it had to walk, even if there was in fact, no aliasing either way.

  1. if the cfg has not changed, and you have added/removed variables: if you

added memoryuses, if we stay with pruned ssa, we have to recompute the IDF.
We could minimize the size we calculate by tracking some things related to
where the last uses in the program are relative to the iterated dominance
frontier. It's complicated however.
After recomputing IDF, we also have to redo renaming. For inserting
memoryuses, if we tracked the first/last version of a block (and updated
it through the creation/removal apis), we only have to rename parts of the
dom tree where the first/last version changes.

(IE if we changed first version in a block due to phi node insertion, but
last version is the same, we do not have to rename children of that block.
if we changed last version but not first version, we only have to rename
children but not the current block)
For inserting stores, it's something close to this but a bit more
complicated.

  1. if the cfg has changed, without tracking a lot more under the covers

related to merge/join sets, it's not possible to make it faster than doing
it from scratch.

The TL;DR of all this is that a general update mechanism is very hard to
make faster than for scratch without giving it all kinds of specific info,
and usually figuring out that info is as hard as figuring out the update
algorithms, in my experience.

gberry added a comment.Jul 5 2016, 8:48 AM

Sorry if this isn't the right place, but I have a general question about what it means to preserve MemorySSA, specifically regarding the defining accesses of MemoryUse nodes. Is the idea here that we make a best effort to keep the MemoryUse defining access links optimized (i.e. never pointing to a no-alias def)? Because of the limits of basic-aa, it isn't possible to guarantee this property after any code transformation, even in the limited case here, since the alias results for completely unrelated load/stores may have been affected, right?

Gerolf added inline comments.Jul 5 2016, 8:42 PM
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
807

The code from here could be a function. I prefer a function to fit on my screen.

830

Why do you need this? When the first condition is false the while loop does not execute. But then SafeToLoadUnconditionally is not needed anyway

846

Within a block I thought the DFS numbers are
0
1
2
...
So I would expect loop up(load) < BBHoistBarrier. I'm probably wrong about the DFSs.

848

Load1 is loop invariant and the continue condition can be computed in the header.

860

That comment does not look right. Load0 and load1 just got hoisted and LookupIter->first could still be Load0.

Also, not hoisting above loads that had not been hoisted is too restrictive. With better/cheaper DA this optimization can be more aggressive. Perhaps this could be a FIXME.

dberlin added inline comments.Jul 5 2016, 10:45 PM
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
807

Fixed

830

It's moved here because it's loop-invariant. That's also why the check matches the loop check, to avoid calculating it if the loop won't execute :)
Otherwise, we would calculate it on every loop iteration, when it doesn't change.

846

You are thinking about it backwards i think.
The dfs numbers are indeed as you list them

If the block looks like this:

0
1 - load barrier
2 - load

We can't hoist.

If it's
0
1 - load
2 - load barrier

We can hoist.

So if DFS(load) > DFS(load barrier), that means the load barrier is *above* us in the block (and we come *after* the load barrier), and we can't hoist past it.

848

Fixed

860
  1. Yes, i'll fix.
  2. This code was copied from mergeLoads originally. It is indeed too restrictive (there are a lot of random restrictions we could relax). I'll add a FIXME.

Update for review comments

A few more comments/questions below. I still have a few parts I haven't fully grasped yet, but don't let me hold up approval.

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
93

vector doesn't appear to be used?

158

Can't you just compute these instruction indices for the diamond then/else blocks lazily? The same goes for LastSinkBarrier, which wouldn't even need a map in that case.

771

Maybe check Load1->isSimple() here? Maybe factor out all these checks into a function and use it for the Load0 hash table insertion as well?

799

Isn't this still O(M^2) if e.g. all of the loads have the same clobbering access and the same types? I guess the limit above controls how big M can get though.

997

Is there a reason you're using a queue for this work-list instead of the more common SmallVector?

dberlin marked an inline comment as done.Jul 6 2016, 5:33 PM

I'll be honest. I think we are going a bit overboard for something that is off by default and is meant to be an identical algorithm to an existing pass that we know can stand improvements.

While I am all for trying to get code to be as good as possible, i don't think we need to make it completely perfect on the first pass when it's easily subject to incremental improvements :)

IE i think there is some point that is "good enough" to start, particularly for things that we can improve before we flip them to be the default.

For all we know, changes we have to make to account for converting the other passes are going to also make us change code here anyway.

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
93

fixed

158

Yes, you can compute them once on-demand for the blocks, but it takes nearly zero time right now. It did not seem worth it for a first pass to build something to do it :)
Remember this is supposed to be an NFC conversion to start, and building the barrier maps already converts what was N^2 before to N.

771

Let us please not do that in this patch.
Doing that would change this algorithm to catch a different set of loads than the non-memoryssa version.
I would like to keep the algorithms identical in what they catch for the moment, so debugging is easy.

Eventually, when the mssa version is the default and the non-mssa version killed, we can improve it to be whatever we want.

799

If that was the case, all the loads in that block are equal (because they also passed isSameOperation, etc) and GVN should have removed them already :)

Note that neither the mssa version or the non-mssa version will handle the case of two identical loads in one side of the diamond, and one load in the other. It will only hoist/merge one of identical ones.

997

We push back and pull front because that exploration order guarantees we terminate at the earliest possible point.

If we popped off the back, it would be an order that would explore lower things before higher things.

Gerolf added inline comments.Jul 6 2016, 8:11 PM
lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
830

Ok, I assumed the while loop is on the hot path usually. Then the code checks the (part of the) while condition twice.

846

Hm, it looks alright today :-). Thanks.

889

Footer -> Tail in error message

907

There is similar code in mergeLoads. That could be a utility function that returns false when there is no memory access in at least one of the blocks, e.g.. static bool hasMemoryAccess(BasicBlock *B1, BasicBlock *B2).

929

Should be if (++NStores) to be consistent with your code in mergeLoads. Or change the code for NLoads there.

933

Could be a separate function:

950

Please add a comment about what the equal stores are.

954

Since it is sinking stores would it speed up the code reversing the loop traversal (from second to first)?

979

A comment would be nice why the accesses get cleared here. Or would that fit better in sinkStores?

988

Typo: explosed

1003

What is a good invariant for this loop? Like there shouldn't be more checks than memory references in the code that contains Start? A checked invariant would make me more comfortable with this code.

1019

That looks very wrong. At the minimum this is a convoluted equality test.

1052

I continue struggling with this function. For now I just need some help understanding the comment.
When you say "above" S0, S1 you refer to DFS numbering like Sx is above S0 in this example:

0 Sx
1
2 S0

Correct? This would be consistent with your DFS explanation in a previous explanation.

1054

What is the "top of the diamond"? The head?

1057

What is the "sink location block"? The Tail?

1060

What is "the block"? Perhaps "uses in the tail and/or below" describes it better.

1061

Would it make sense to clearly distinguish phi and memory phi (mphi) nodes?

dberlin abandoned this revision.Jul 6 2016, 9:21 PM
dberlin marked an inline comment as done.
gberry added a comment.Jul 8 2016, 2:00 PM

I'm personally fine with committing this as is since it's is disabled, though I don't feel I have the authority to approve it.

I do have a couple more comments/questions below, mainly to help me gain confidence that my understanding of MemorySSA is correct.

lib/Transforms/Scalar/MergedLoadStoreMotion.cpp
799

Is that right (the bit about all the loads being equal)?

I'm thinking of a case like the below where all of the loads pass isSameOperation and all have the call as their clobbering access. In this case you'll end up comparing all pairs of loads (assuming none of the loads must alias).

call foo
br %c, %then, %else

then:
load i32, %p1
load i32, %p2
load i32, %p3
...
br %end

else:
load i32, %p4
load i32, %p5
load i32, %p6
...
br %end

end:

1011

Do you even need the domtree info here to check for barriers in a diamond? There are references to hammocks, but as far as I can tell we never actually attempt to optimize hammocks. If that is the case, couldn't this just check for barriers in the same block as Start?

1052

I think there is a bug in the case 1 scenario if there is a pre-existing phi in the bottom of the diamond that references just one of the sunk stores.
For example:

; 1 = MemDef ...
call foo
br %c %then, %else

then:
; 2 = MemDef(1)
store @A
; 3 = MemDef(2)
store @B
br %end

else:
; 4 = MemDef(1)
store @A
br %end

end:
5 = MemPhi(3, 4)

I believe in this case, after updating the end block will look like:

end:
5 = MemPhi(3, 6)
; 6 = MemDef(1)
store @A

Which seems wrong, since the phi is before it's use, but also it seems like having 6's defining access skip over the phi 5 and def 3 could cause trouble, though I'm less sure about the latter.

Let me clarify my last comment, since I hit send too early. This was a bug
i already fixed. Case 1 should also be checking that there are no stores
below. If there are, it should be doing case 2. In essence, the check for
case two should happen before the check for case 1 (since it's not possible
to have stores below without a phi node existing)

This will cause your phi node to get changed to memphi(3, 1), and do the
right thing.

I have a testcase for this.

davide added a subscriber: davide.Dec 26 2016, 7:02 AM

hrrrm, are there any real showstoppers to get this one in? Maybe Dan has no more time time to work on it but this patch already constitutes a strong basis on which we can iterate and it's also disabled by default.
@Gerolf , any opinions?