This is an archive of the discontinued LLVM Phabricator instance.

[MemorySSA] Switch to a different walker
ClosedPublic

Authored by george.burgess.iv on Jun 27 2016, 5:24 PM.

Details

Summary

This patch switches MemorySSA to use a new walker. The motivating reasons for this are:

  • Accuracy: the old walker cached things *really* eagerly, so it had to be overly conservative in many cases.
  • Speed of phi optimization: in some cases, the old walker would walk *way* more than it had to in order to determine if a phi was optimizable or not. This walker takes a more incremental approach, so we do as little work as we reasonably can.
  • Flexibility: this walker can be extended without much effort to do things like getting all clobbers that block a phi optimization. This will supposedly be useful in the future. Also, phi optimization is split into its own little world, so we can easily switch it off entirely if we want to (e.g. for -O1).
  • Testability: we can turn the cache on/off, so it's trivial to check whether the cache causes us to give a different answer than manually walking.

The big parts of this patch (in terms of line count) are caching and phi optimization. Phi optimization is more or less an iteratively expanding DFS; given a phi P, we figure out what the nearest legal optimization of P, Q, would be, then we see if there are any clobbers on any path from P to Q. If so, we quit. If not, we walk from Q to its dominating phi, and repeat. Caching is done after we've done all walking. Caching is just difficult in general. :)

This patch also includes a stupidly simple verify-this-optimization-isn't-broken function.

This patch verifies with EXPENSIVE_CHECKS enabled for MSSA.cpp when bootstrapping all of clang/LLVM.

Other smaller (but still noteworthy) things:

  • With this patch, we shouldn't cache MemoryUses anymore. This is desirable, since they always point to their clobber, so having them in the cache was kind of redundant
  • When running this across LLVM/clang, I didn't notice a major speed difference between this and the old walker. A script tells me it's in the noise, but note that the test wasn't quite scientific (read: time ninja -j$((NCORES * 1.1)))
  • The explicit call caching was removed from the upwards query. If it needs to be there, it can be readded without much effort, but the goal was to keep this patch as simple as I could, while not dropping accuracy.
  • The ClobberWalker::reset() function exists because an idea that's been thrown around is a bulk-update MSSA API (or some kind of thing where we'll do N updates back-to-back). If we don't need to drop the walker's cache of BB -> nearest-optimizable-access, then it shouldn't be terrible to keep it

Finally, note that there were some refactors split out from this, so it's easier to review. If you want to run things locally, you'll need to apply this on top of D21776.

Diff Detail

Event Timeline

george.burgess.iv retitled this revision from to [MemorySSA] Switch to a different walker.
george.burgess.iv updated this object.
george.burgess.iv added a subscriber: llvm-commits.
dberlin edited edge metadata.Jun 27 2016, 6:00 PM
  • The explicit call caching was removed from the upwards query. If it

needs to be there, it can be readded without much effort, but the goal was
to keep this patch as simple as I could, while not dropping accuracy.

Sure. I added this because it makes newgvn faster, but i can make a walker
for newgvn that caches more, it doesn't have to be the default walker :)

  • The ClobberWalker::reset() function exists because an idea that's been

thrown around is a bulk-update MSSA API (or some kind of thing where we'll
do N updates back-to-back). If we don

FWIW, as mentioned on a different thread, you can't make bulk update faster
than "destroy everything" without a bunch of work.

gberry edited edge metadata.Jul 18 2016, 3:00 PM

Is anyone planning on looking at this soon? I can probably spend some time this week looking at it but I assumed @dberlin was planning on looking at it?
FWIW, I have some EarlyCSE & LICM w/ MemorySSA performance measurements that I believe are blocked on this change.

(FYI: rebasing this patch wasn't trivial when I tried to do so earlier today; I'll get out an updated version that applies cleanly to trunk in a bit.)

dberlin accepted this revision.Jul 18 2016, 3:41 PM
dberlin edited edge metadata.

Fix up the comments and LGTM

lib/Transforms/Utils/MemorySSA.cpp
198

This seems like a really expensive assert, no?

(maybe it should be in XDEBUG or whatever expensive checking is these days)

206

Please document what the parameters are supposed to be ;)

212

Nit: Move this right before the while loop start.

220

A comment on what, precisely, this is doing, would be helpful.

222

Why? (don't answer, just document :P)

229

Why not |= instead of HadDefinitiveClobber = HadDefinitiveClobber || :)

351

This is probably better named WalkTargetCache :)

357

I assume this is not expensive in practice, because otherwise, you can just track it when we build memoryssa, and always have an answer to "lastnonuse for a block".

(even under updates, it's trivial to update in O(1))

Right now this looks N^2 since the dom tree above us could contain every block, and it could be all loads and then one store at the top.

1626

remove the heh :)

This revision is now accepted and ready to land.Jul 18 2016, 3:41 PM
george.burgess.iv marked 9 inline comments as done.Jul 18 2016, 6:36 PM
george.burgess.iv added inline comments.
lib/Transforms/Utils/MemorySSA.cpp
198

Good point

229

I think |= wouldn't let us short-circuit, and given that instructionClobbersQuery isn't completely trivial, I think short-circuiting may be beneficial here. :)

If I'm wrong, I'm happy to swap to |= in a followup commit.

357

otherwise, you can just track it when we build memoryssa, and always have an answer to "lastnonuse for a block".

Given that I ended up un-killing findDominatingDef, (and that this basically does the same job as findDominatingDef AFAICT), it may be a good idea to swap to that in the future, yeah. Will add a FIXME. :)

Right now this looks N^2 since the dom tree above us could contain every block, and it could be all loads and then one store at the top.

FWIW, Because we have caching (PhiTargetCache), it's linear-time worst-case when we're initially optimizing uses. We kill that cache when updates happen, though, because it relies on the domtree/set of MemoryDefs not changing, so it may be n^2 there.

1626

Oops :)

This revision was automatically updated to reflect the committed changes.