This is an archive of the discontinued LLVM Phabricator instance.

[EarlyCSE] Optionally use MemorySSA. NFC.
ClosedPublic

Authored by gberry on May 2 2016, 12:18 PM.

Details

Summary

Use MemorySSA, if requested, to do less conservative memory dependency checking.

This change doesn't enable the MemorySSA enhanced EarlyCSE in the default pipelines, so should be NFC.

Diff Detail

Event Timeline

gberry updated this revision to Diff 55861.May 2 2016, 12:18 PM
gberry retitled this revision from to [EarlyCSE] Port to use MemorySSA (disabled by default). NFC..
gberry updated this object.
gberry added reviewers: dberlin, sanjoy, reames, majnemer.
gberry added a subscriber: llvm-commits.

A few nits in passing.

lib/Transforms/Scalar/EarlyCSE.cpp
306–307

Can this be committed as a separate change?

563

Please capitalize handle and add a period.

574

Please add a descriptive && "Error message!".

760–762

Separate commit.

776

Separate commit. Maybe typdef?

reames requested changes to this revision.May 5 2016, 6:31 PM
reames edited edge metadata.

At a meta level, I'm not convinced that updating EarlyCSE to work with MemorySSA is the right approach. EarlyCSE is focused on being really really fast at cleaning up stupidly redundant IR so that the rest of the pass pipeline doesn't need to worry about it. MemorySSA is relatively expensive to construct. Given that, I'm really not sure putting it at the very beginning of the pipeline is a good design choice.

Now, having said that, it might make sense to have a dominance order CSE pass based on MemorySSA for use later in the pass pipeline. Currently we use EarlyCSE for two distinct purposes, its possible that it might be time to split them.

Can you justify why this is the right approach?

lib/Transforms/Scalar/EarlyCSE.cpp
504

Huh? This should be handled entirely inside MemorySSA?

555

Having two sets of variables, one integers, one pointers with similar names is *highly* confusing. I'd suggest pulling out a MemorySSA specific impl function and calling it from here to wrap the desired asserts.

563

This comment doesn't make sense where placed?

674

Code like this strongly hints that MemorySSA should be using ValueHandles.

test/Transforms/EarlyCSE/memoryssa.ll
8

If we do go this way, you'll need far far more tests.

This revision now requires changes to proceed.May 5 2016, 6:31 PM
dberlin edited edge metadata.May 5 2016, 7:54 PM
dberlin added a subscriber: dberlin.

So, if it's not actually slower in practice, would that address your
objection

In particular, i'm trying to understand if your concerns are *mostly* "we
want to keep this fast", or broader than that.
If they are broader than that, i'd like to understand the objection.
Because the speed one is simply "either we can make it fast enough or we
can't" (and i agree if we can't we shouldn't do it :P)

gberry updated this revision to Diff 57482.May 17 2016, 8:07 AM
gberry edited edge metadata.

Update based on reames review feedback

@reames I've attempted to resolved most of your individual concerns (or at least made them explicit in the change). The bigger question of whether this is worth the compile time remains to be determined. Do you think more tests need to be added in addition to the already existing EarlyCSE tests? Adding additional run lines to those tests to enable -early-cse-use-memoryssa seems like overkill to me, but I don't feel to strongly about it. Or are you more concerned about adding new tests for cases that are only caught by MemorySSA (both positive and negative)?

@dberlin, @george.burgess.iv There are a couple of FIXME comments in this change that identify cases where MemorySSA is maybe being too conservative (e.g. when dealing with fence release instructions and load atomic instructions). Do you think it is reasonable to refine these cases in MemorySSA or is the conservatism restricted to EarlyCSE's usage, in which case we should deal with it in EarlyCSE? Similarly, what do you think of Phillip's suggestion to look at using ValueHandles in MemorySSA to make removal invalidating more automated?

@reames @dberlin Regarding the compile time impact, do you think it would be worth pursuing a change to make EarlyCSE's use of MemorySSA optional? That way we could avoid using it for early passes EarlyCSE and only use it for later ones, perhaps even influenced by optimization level? A related aspect of the plan for MemorySSA that I'd like to understand is how well we think we'll be able to amortize the cost of building it by preserving/maintaining it across passes. Daniel, can you share your thoughts on that?

So, i would like to see real numbers that say this is going to slow down
anything (or speed it up).
As I said, if the objection is speed, yes, we should look into that, and if
something needs to be done, we should do it.

We can amortize the cost quite well. It should essentially cost nearly
nothing past initial setup cost (it's not harder than the SSA updates we do
today, which are not expensive).

The entire plan is actually to amortize the cost.
Right now, the default pass schedules put things that use memdep mostly in
a row.

At the outset, with a little work, we should have to compute memoryssa
twice (once before MLSM/GVN/MemCpyOpt, once before DSE).
Getting all the way to DSE is harder in the sense that it's a longer way to
go to preserve passes.

But it's also interesting to note that none of these passes preserve memdep
today, and the cost of doing memdep queries on every store (as DSE would)
with no cache, should be more than the cost of memoryssa building + usage.

It definitely can be made to be so.
So that part doesn't worry me.

Shoving this in EarlyCSE, if it's fast enough, seems reasonable at a
glance. In a perfect world, it would be good to preserve it everywhere.
I'm not sure, at the beginning, it makes sense to try to preserve it across
tons and tons of passes that won't ever use it, but do touch memory
heavily. So i would expect EarlyCSE to end up as another computation point
for quite a while.

That needs to be traded off past how much better/easier/etc it makes
EarlyCSE.

gberry updated this revision to Diff 60768.Jun 14 2016, 3:01 PM
gberry edited edge metadata.

Update to use MemorySSA in EarlyCSE if it is available, and make it
available for -O3 or greater in first EarlyCSE pass added by
addFunctionSimplificationPasses().

gberry retitled this revision from [EarlyCSE] Port to use MemorySSA (disabled by default). NFC. to [EarlyCSE] Use MemorySSA if available..Jun 14 2016, 3:03 PM
gberry updated this object.

Compile time test for the llvm test suite on aarch64 (at -O3) were mostly a wash, some faster, some slower, no big outliers. The net change was slightly better compile times.

Notable performance improvements (no significant regressions):
MultiSource/Benchmarks/Trimaran/enc-md5/enc-md5 -6%
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -5%
MultiSource/Benchmarks/sim/sim -3%
SingleSource/Benchmarks/McGill/chomp -4%
MultiSource/Benchmarks/Ptrdist/anagram/anagram -2%
spec2000/bzip2 -10%

@reames I've added some additional lit test coverage, is there more lit test coverage you'd like to see?

bruno added a subscriber: bruno.Jun 15 2016, 10:47 AM

Hi Geoff, what are the numbers of the top slower ones?

I'm also interested.
We already know of some performance issues related to caching and use
optimization with weird testcases (many many nested blocks) that we are
fixing.
If we have significant perf regressions, it would be useful if for no other
reason than to inform the stuff george is taking a look at.

Here are the worst llvm test-suite compile time regressions. I've filtered out the very small test cases. The data shown are the percent diffs of compile times from 5 different runs with and without the above change.

llvm-test-suite/SingleSource/Benchmarks/Misc/ffbench:normal PASS +1.079%, +2.837%, +17.951%, +21.615%, +33.206%
llvm-test-suite/SingleSource/Benchmarks/Shootout-C++/shootout-cxx-moments:normal PASS +1.148%, +9.801%, +14.607%, +15.686%, +19.707%
llvm-test-suite/MultiSource/Applications/spiff/spiff:normal PASS -0.773%, +2.938%, +13.511%, +16.846%, +21.552%
llvm-test-suite/SingleSource/Benchmarks/Shootout-C++/shootout-cxx-nestedloop:normal PASS -1.623%, +1.635%, +12.996%, +13.561%, +14.354%
llvm-test-suite/MultiSource/Benchmarks/Prolangs-C++/garage/garage:normal PASS +1.092%, +8.865%, +12.830%, +14.206%, +15.722%
llvm-test-suite/MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl:normal PASS +4.676%, +5.196%, +10.725%, +12.248%, +12.967%
llvm-test-suite/MultiSource/Benchmarks/MallocBench/espresso/espresso:normal PASS -2.227%, +4.387%, +9.187%, +11.195%, +13.738%
llvm-test-suite/MultiSource/Benchmarks/BitBench/uuencode/uuencode:normal PASS -0.223%, +3.068%, +7.235%, +9.723%, +18.255%
llvm-test-suite/SingleSource/Benchmarks/Misc-C++/stepanov_container:normal PASS +1.185%, +2.054%, +4.369%, +8.627%, +11.474%
llvm-test-suite/MultiSource/Benchmarks/Prolangs-C++/family/family:normal PASS +1.386%, +1.767%, +4.300%, +10.017%, +19.309%
llvm-test-suite/MultiSource/Benchmarks/mediabench/adpcm/rawcaudio/rawcaudio:normal PASS +1.995%, +2.907%, +3.697%, +12.124%, +12.976%
llvm-test-suite/SingleSource/Benchmarks/Dhrystone/fldry:normal PASS +1.920%, +2.027%, +2.843%, +14.746%, +14.803%
llvm-test-suite/MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4:normal PASS +1.305%, +1.674%, +2.269%, +2.961%, +9.660%

This looks super noisy even at 5 runs :)

@dberlin Yeah, I'm in the process of double checking some of these to make sure my testing methodology was sound.

Updated llvm-test-suite compile time regressions using LNT methodology:
llvm-test-suite/SingleSource/Benchmarks/Misc/flops-6 +4.019%
llvm-test-suite/SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding +4.967%
llvm-test-suite/SingleSource/Benchmarks/Polybench/medley/reg_detect/reg_detect +5.268%
llvm-test-suite/SingleSource/UnitTests/Vectorizer/gcc-loops +5.538%

Looks like the affected benchmarks changed in the new measurements, does that hold for performance improvements as well?

George, can you stare at these quickly and see if any of your caching
changes/etc will help?

(It's fine if not, just trying to avoid duplicating work)

Looks like the affected benchmarks changed in the new measurements, does that hold for performance improvements as well?

I don't have that data yet, I'll update when I do.

George, can you stare at these quickly and see if any of your caching changes/etc will help?

That depends on what exactly is slowing the benchmarks down so much. If our usage pattern is query -> remove -> query -> remove, then our cache may become useless, since (worst case) we drop the entire thing on each removal. If we primarily query defs, then this pattern gives us the same effectively-n^2 behavior of MemDep. One of the big goals of the new cache/walker is to allow us to drop as little as possible.

In terms of pure walker/cache speed, the current walker is happy to do a lot of potentially useless work walking phis we can't optimize; the one I'm working on will do as little work as possible in that case. Also, the current walker potentially does a lot of domtree queries when caching results, whereas the one I'm working on does none (except in asserts). Glancing at some of the benchmarks, I'm not sure if any of that is what's slowing us down here, though.

If you'd like, I'm happy to profile/poke around and give you a more definitive answer.

lib/Transforms/Scalar/EarlyCSE.cpp
500

Nit: Please use ///

571

Nit: return EarlierHeapGen == LaterHeapGen?

Looks like the affected benchmarks changed in the new measurements, does that hold for performance improvements as well?

I don't have that data yet, I'll update when I do.

The performance changes are mostly the same. Updated perf data (this is on an A57-like OoO aarch64 core, time deltas (negative is better)):
llvm-test-suite/SingleSource/Benchmarks/CoyoteBench/fftbench -21.811%, -21.633%, -21.488%, -21.074%, -20.747%, -20.168%, -20.084%, -18.987%, -18.615%, -18.487%
llvm-test-suite/MultiSource/Benchmarks/Trimaran/enc-md5/enc-md5 -6.557%, -6.557%, -6.557%, -6.077%, -6.077%, -5.525%, -5.525%, -5.525%, -5.525%, -5.000%
llvm-test-suite/MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -3.306%, -3.306%, -3.306%, -2.479%, -2.479%, -2.479%, -2.479%, -2.479%, -1.681%, -1.667%
llvm-test-suite/MultiSource/Benchmarks/Prolangs-C/gnugo/gnugo -25.000%, -12.500%, -12.500%, -12.500%, -12.500%, -12.500%, +0.000%, +0.000%, +0.000%, +0.000%
llvm-test-suite/MultiSource/Benchmarks/sim/sim -12.815%, -2.667%, -2.667%, -2.400%, -1.872%, -1.114%, -1.111%, -1.070%, -1.067%, -1.067%
llvm-test-suite/MultiSource/Benchmarks/MallocBench/cfrac/cfrac -2.426%, -2.162%, -1.729%, -1.445%, -1.124%, -1.124%, -1.124%, -1.124%, -1.124%, -0.845%

llvm-test-suite/MultiSource/Benchmarks/mafft/pairlocalalign +0.399%, +0.430%, +0.458%, +0.507%, +0.593%, +0.626%, +0.778%, +1.097%, +1.195%, +1.309%

gberry updated this revision to Diff 61579.Jun 22 2016, 11:20 AM
gberry marked 2 inline comments as done.
gberry updated this object.

Update to address George's comments

I re-ran the llvm test-suite compile-time numbers with more samples and found no significant changes (improvements or regressions) in compile time.

I've put this on hold until I can re-run compile-time numbers after George's changes to the MemorySSA caching code go in (http://reviews.llvm.org/D21777)

Sorry for not responding to this for so long.

My objection is primarily from a compile time concern. Right now, EarlyCSE is a *very* cheap pass to run. If you can keep it fast (even when we have to reconstruct MemorySSA) I don't object to having EarlyCSE MemorySSA based. I think that is a very hard bar to pass in practice. In particular, the bar is not total O3 time. It's EarlyCSE time. I fully expect that the more precise analysis may speed up other passes, but we can't assume that happens for all inputs. (As I write this, I'm recognizing that this might be too high a bar to set. If you think I'm being unreasonable, argue why and what a better line should be.)

Given I'm not going to have time to be active involved in this thread, I'm going to defer to other reviewers. If they think this is a good idea, I will not actively block the thread.

p.s. The newer structure of using the original fast path check with memory ssa as a backup which is optional if available is much cleaner than the original code. I'm okay with something like this landing (once other reviewers have signed off) even if the compile time question isn't fully resolved provided that the force memory SSA pass is under an off by default option. As structured, this doesn't complicate the existing code much at all.

gberry updated this revision to Diff 68911.Aug 22 2016, 2:17 PM

New version that adds a pass parameter to control whether MemorySSA is used.

Also changed the memory generation check to do a simpler MemorySSA
dominance check.

gberry retitled this revision from [EarlyCSE] Use MemorySSA if available. to [EarlyCSE] Optionally use MemorySSA. NFC..Aug 22 2016, 2:18 PM
gberry updated this object.
gberry edited edge metadata.

I've collected some compile time stats when enabling MemorySSA EarlyCSE just for the EarlyCSE pass added at the beginning of addFunctionSimplificationPasses at O2 and higher.
There were 8 benchmarks in the llvm test-suite whose compile time increased by more than 1%. The biggest increase was in consumer-typeset. Drilling down a bit, the MemorySSA construction time for compiling the z44.c input to this benchmark is reported as 2% of runtime.

dberlin added inline comments.Aug 22 2016, 2:27 PM
lib/Transforms/Scalar/EarlyCSE.cpp
532
  1. For loads, you don't have to ask for the clobbering access. It's already optimized such that getDefiningAccess == the clobbering access
  1. For stores, not sure if you realize this, but

given
store q (lets's call this a)
x = load p
store q (let's call this b)

if you call getClobberingMemoryAccess on b, it will return a.

gberry added inline comments.Aug 22 2016, 2:57 PM
lib/Transforms/Scalar/EarlyCSE.cpp
532

For 1., I was not clear on whether this holds true after store removal.

For 2., yeah I get this, I'm not sure what you're getting at though. The removal of this second store by EarlyCSE doesn't use MemorySSA to check for intervening loads in this change. It uses the 'LastStore' tracking to know when a store made redundant by a second store can be removed.

dberlin added inline comments.Aug 22 2016, 3:04 PM
lib/Transforms/Scalar/EarlyCSE.cpp
532
  1. Updates have to make it hold after store removal :)

The problem is that if we don't keep this invariant up to date, it means everyone uses getClobberingAccess, which does a bunch of work to discover the load already points to the same thing.

Everyone doing that is much higher than the cost of one person updating the dominating def.

(there is one case were getClobberingAccess will give you a better answer, and that is on cases where we gave up during use optimization. I only have one testcase this occurs on. We only give up on optimizing a load if it's going to be super expensive, and you probably do *not* want to try to get better answers in that case).

As for updating when you remove stores, you should simply be able to replace any loads the store uses with getClobberingAccess(store) using RAUW.

Under the covers, removeMemoryAccess calls RAUW with the DefiningAccess.
We could change it to use getClobberingMemoryAccess for loads, and DefiningAccess for stores.

  1. ah, okay.
gberry added inline comments.Aug 23 2016, 10:17 AM
lib/Transforms/Scalar/EarlyCSE.cpp
532

Okay, I get why just checking the defining access for loads is better (we get to skip the AA check).
For stores, we may be able to do something faster than calling getClobberingAccess(store). We could instead walk up the store defining access chain and stop if we get to a point that dominates the earlier load or a clobbering write. I'll have to time this to see if it makes a difference. I guess it will depend on what percentage of the time the clobber cache has been thrown away.

As for updating when removing stores: it seems like doing RAUW getClobberingAccess(store) is not optimal in some cases. For example:

store @G1 ; 1 = MD(entry)
store @G2 ; 2 = MD(1)
store %p ; 3 = MD(2)
load @G1 ; MU(3)
load @G2  ; MU(3)

If we remove 3 and RUAW getClobberingAccess(3) (=2) we get:

store @G1 ; 1 = MD(entry)
store @G2 ; 2 = MD(1)
load @G1 ; MU(2)
load @G2  ; MU(2)

but the load @G1 would be more precise if it was MU(1) (and the invariant that defining access == clobbering access would be broken). Is this just a compile-time/precision trade-off? Maybe for that reason it makes more sense to let the client decide if they want to do the simple RAUW getClobberingAccess(Store) or optimize each use separately?

gberry accepted this revision.Aug 31 2016, 11:44 AM
gberry added a reviewer: gberry.

Approved by @dberlin over email

This revision was automatically updated to reflect the committed changes.