This is an archive of the discontinued LLVM Phabricator instance.

[DSE] Implement dead store elimination using MemorySSA (disabled by default).
AbandonedPublic

Authored by mcrosier on Jul 20 2016, 7:00 AM.

Details

Summary

This patch implements dead store elimination using the MemorySSA framework. The implementation is intentionally written to be very similar to the non-MemorySSA algorithm and by no means does it exploit the full capabilities of MemorySSA. My rational is that I wanted to keep it simple for myself and the reviewer, since I'm very new to MemorySSA. I was also hoping this would allow an apples to apples comparison between the two implementations (mostly in terms of compile-time).

My longer-term goal is to implement global DSE using MemorySSA. A version of non-local DSE was attempted in the past (D13363), but was reverted due to compile-time regressions.

Please take a look,
Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 64680.Jul 20 2016, 7:00 AM
mcrosier retitled this revision from to [DSE] Implement dead store elimination using MemorySSA (disabled by default)..
mcrosier updated this object.
mcrosier added reviewers: gberry, dberlin.
mcrosier updated this object.
mcrosier added a subscriber: llvm-commits.
dberlin edited edge metadata.Jul 20 2016, 8:15 AM

Thanks for doing this!

Just as a meta-comment:

One of the advantages of MemorySSA is that it allows sparse analysis. Now, for stores, because it's SSA and not SSI, you do end up having to worklist uses more often (for loads, algorithms are as easy as scalar SSA algorithms).

If you use MemorySSA as a faster memdep, and try to write iterative algorithms, it may or may not be faster, because it's really built to be able to build ssa-like algorithms, and some of the "faster memdep" use cases don't overlap with this well.

That said, I understand you are trying to port this incrementally, so i restricted major algorithmic comments to where there are easy and significantly better ways to do a thing.

lib/Transforms/Scalar/DeadStoreElimination.cpp
1114

Interesting. MemorySSA looks at this the other way around

It guarantees that if you are *loading a pointer you just stored from*, MemoryUse->getDefiningAccess() will be the store you just loaded from..

(That is, it guarantees that load->getDefiningAccess() is the nearest dominating thing that *actually* aliases with the load)

There is one edge case where it would be a MemoryPhi that you could eliminate, but it's a real nonlocal edge case - when the MemoryPhi's operands are all really the same thing

if (a)
  1 = MemoryDef(liveOnEntry)
  store B, 5
else
  2 = MemoryDef(liveOnEntry)
  store B, 5
3 = MemoryPhi(1, 2)
MemoryUse(3)
load B

Not sure this case is worth optimizing. NewGVN acutally handles this already and would just replace the load uses with "5".

1117

Because of the above guarantee, the memory state is guaranteed to be the same as this store if and only if getMemoryAccess(DepLoad)->getDefiningAccess() == getMemoryAccess(SI).

This is a constant time check instead of memoryIsNotModifiedBetween (which also can be done faster with MemorySSA, but ...)

1127

Doesn't GVN do this already?

1134

This whole block could be

// For stores, getClobberingMemoryAccess will guarantee you get the nearest dominating def that actually aliases the store. It is not needed for loads (see the comments at the top of MemorySSA.h for why this is)
MemoryAccess *MA = MSSA->getClobberingMemoryAccess(SI);
if (MemoryDef *MD = dyn_cast<MemoryDef>(MA)) {
   if (isCallocLikeFn(MD->getMemoryInst()) && <pointers are the same>)

....

This is going to be a lot faster than the memoryisnotmodifiedcall.

1156

I'm a bit unclear what this is trying to do, so i can't definitely give advice, only guess.

If you are trying to see if the only use of a store is a call to free, the following applies:

Unlike loads, store uses/defs are not guaranteed to may-alias the store (doing so requires allowing multiple phi nodes in memoryssa).

If they were, this would simply be "Process instructions in reverse order, worklist the uses of the store, if all of them are in calls to free, eliminate the store".

However, because of this, you want to do:

Process instructions in reverse order:

<worklist the initial uses of the MemoryDef for the store>
while(worklist not empty) 
  Pull use off worklist.
  If use is free call, ignore it.
  If use is below free call (IE  MSSA->dominates(memoryaccess for free, memoryaccess for use)), ignore it.
  (there are partially dead cases this will ignore, but let's ignore that for now)
 Otherwise:
   If use is a MemoryUse, you cannot remove the store (we already eliminated all uses after the free call above, and your other so this must be between the store and the free call. Both your original code and this code presume that you have already eliminated loads of just stored pointers, etc, so they don't get in the way ).
   If use is a MemoryDef, and getClobberingMemoryAccess(use) == original store you have store over store to the same data (IE partial or full overwrite). Not sure what you want to do here.
   If use is a MemoryDef, and getClobberingMemoryAccess(use) != original store, put the uses of this MemoryDef on worklist and continue.
  If use is a MemoryPhi, put uses of this MemoryPhi on worklist and continue.

This seems complicated, but is still going to be faster than what you have :)

1242

Errr, If it's a memoryDef, it must write to memory or otherwise affect memory ;)

1257

FWIW: You don't need this loop at all.

Much like in the rewritten merged load store motion, you can figure out what it could possibly be eliminated in favor in by hashing, and then use local ordering to tell which is earlier/later.
This is true even where they are partial overwrites (that just gets accounted for in the hash).

Whether "memory is modified" or (you have a use in the way )is entirely contained in the def/use chains of memoryssa (and MSSA->dominates), you don't need to look at the inst stream.

Even if you don't take the hashing approach, i would approach this sparsely, by starting at the top of the block, looking at each MemoryAccess, and doing something depending on the uses/defs it has. Then move to the next MemoryAccess. You should not have to visit everything more than once, even if you do it this way.

That said, i understand you are trying to incrementally port, so i won't force you to do this.
Just sayin, if you want it to be faster, ...

dberlin added inline comments.Jul 20 2016, 8:40 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
1156

(Note that the above starts from the store, instead of starting from the free, and that worklist must be a queue to get ordering right)

You can easily extend the above to be global, as well,in the sense that it "does not care where the free call is" - in this block or not.

<worklist the initial uses of the MemoryDef for the store>
while(worklist not empty) 
  Pull use off front of worklist.
  If use is free call, put it on list of free calls.  
  If use is dominated by *any* free call in the list, ignore it (because it means we will hit a free before we hit that use, and thus the use is use-after-free)
  (there are partially dead cases this will ignore, but let's ignore that for now)
  Otherwise:
   If use is a MemoryUse, you cannot remove the store (we already eliminated all uses after the free call above, and your other so this must be between the store and the free call. Both your original code and this code presume that you have already eliminated loads of just stored pointers, etc, so they don't get in the way ).
   If use is a MemoryDef, and getClobberingMemoryAccess(use) == original store you have store over store to the same data (IE partial or full overwrite). Not sure what you want to do here.
   If use is a MemoryDef, and getClobberingMemoryAccess(use) != original store, put the uses of this MemoryDef on worklist and continue.
   If use is a MemoryPhi, put uses of this MemoryPhi on worklist and continue.

The above will handle:

IE

store A

if (B)
  free A
else
  free A

Because it's a queue, and defs dominate uses, you are guaranteed it will process it in the right order. (IE even if you add a use(A) at the end it will get the right answer)

As an optimization, you can drop free calls from the list of things to check in some cases. Probably not worth doing since normal code will likely have 1 or 2 frees to track.

mcrosier abandoned this revision.Oct 26 2016, 5:54 AM

Abandoning this for now.. Regardless, thanks for the feedback, Danny. Maybe I can convince Geoff to revive this someday.

davide added a subscriber: davide.Dec 26 2016, 7:37 AM
jfb added a subscriber: jfb.Jan 30 2019, 1:52 PM