
MemorySSA backed Dead Store Elimination.
Needs ReviewPublic

Authored by rnk on Nov 27 2017, 2:02 AM.

Details

Reviewers
dmgreen
Summary

This is an upgrade of DSE to use MemorySSA instead of MemDep, which allows it to work across basic blocks in a sparser manner.

Halfway into making this I found D29624 by bryant, which is an attempt at the same thing, so I stole all their good ideas. There is also D29866, which is a PDSE pass by the same author. As far as I understand, that would be superior but harder to write. Unfortunately both seem to have been abandoned.

I believe this version should handle everything that the old MemDep version does (it passes all the tests). This includes complete overwrites (so long as the later store post-dominates), noop stores, partial overwrites, stores before frees/lifetime_ends and PartialEarlierWithFullLater.

The only exception that I know of is the coroutine tests, which rely on removing stores to soon-to-be-freed data, even across function calls that may throw. See test37 in simple.ll and ex3.ll in the coroutine tests. It should be possible to get that working, but it might involve looking through the llvm.coro.begin.

Putting this up for early review; I still need to do some extra testing/benchmarking/compile-time measurement etc. I added as subscribers anyone who looked interested in D29624. This is a fairly big chunk of code, so let me know if I can do anything to make it easier to review.

Event Timeline

dmgreen created this revision.Nov 27 2017, 2:02 AM

Just a suggestion: To ease review, could this be split into smaller patches along the lines you mention - noop stores, partial overwrites, stores before frees/lifetime_ends ?

Yep, that may be an option, although the various things it does can be interrelated. There are probably some semi-sensible ways to split this up, though, and having the whole thing up can hopefully make things clearer. It may be easier to do in one go rather than spend time on multiple reviews, or maybe not.

I will leave that as an option for whoever is willing to take a look at the review. Let me know.

Does this support any transforms which aren't supported by the memdep-based DSE?

lib/Transforms/Scalar/DeadStoreElimination.cpp
1425 ↗(On Diff #124334)

It should be very easy to implement memoryIsNotModifiedBetween using MSSA, I think? But not a big deal to leave it for now.

1469 ↗(On Diff #124334)

This is a confusing name, given what the function checks.

1651 ↗(On Diff #124334)

Is it possible for any DSE-related transform to invalidate this?

I'm not sure I really understand how you're using MayThrowsPostOrders here... more comments would be helpful.

test/Transforms/DeadStoreElimination/simple.ll
555 ↗(On Diff #124334)

Yes, you're right; I guess I missed that case when I fixed the other noalias issues in DSE.

dberlin added inline comments.Nov 29 2017, 1:11 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
1324 ↗(On Diff #124334)

Thinking about this not too hard, my initial thought is that this should be unnecessary, IMHO. I'm curious what's happening if it's not.

Assuming we did not hit the optimization limits during MemorySSA use building, any use points to a store that may- or must-aliases it.
(If we hit the opt limits, you probably want to respect them anyway :P).

So this check should just be flat-out unnecessary when current == original access. If that has a memory use, it's not dead on the path to the MemoryUse (ignoring PDSE, that means it's not dead overall).

When walking from def to def (i.e. when current != original access), that leaves two cases:
The store it's a use for may-aliases your original access's siloc - your store is not dead.
The store it's a use for must-aliases your original access's siloc - your stores are the same (and so maybe one is dead).
Note that in either case, what the use is for or aliases is irrelevant. The only thing relevant to the next access is the store aliasing. The uses force your store alive or not based on the store they are linked to and its equivalence to your original one.

Put another way: I can't see a reason to check explicitly whether the use aliases you. You only have to check whether use->getDefiningAccess->memloc may-aliases siloc.
MemorySSA has already checked whether the use may-aliases its def for you.

You can cache that, and now your walk is much faster because you only have to know things about the defs and not the uses.

Again, i could be completely wrong, but i'd like to understand why :)

It's true that aliasing is not transitive in theory, but in LLVM the vast majority of cases where it isn't have metadata attached, and you would only need to check here in those cases if you are worried about catching every case.
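The walk Daniel describes can be sketched roughly as follows. This is a hypothetical, heavily simplified Python model, not the patch's C++ and not the MemorySSA API: defs form a chain, `cached_may_alias` stands in for a cached AA query, and the walk only ever inspects defs, never the individual uses hanging off them.

```python
from functools import lru_cache

class MemoryDef:
    """Toy stand-in for a MemorySSA MemoryDef: a stored-to location plus
    a link to its defining access (the next def up the chain)."""
    def __init__(self, name, loc, defining=None):
        self.name, self.loc, self.defining = name, loc, defining

@lru_cache(maxsize=None)
def cached_may_alias(loc_a, loc_b):
    # Stand-in for an AA query; in this toy model two locations alias
    # iff they name the same object. Real answers would be cached the
    # same way, keyed on the location pair.
    return loc_a == loc_b

def first_aliasing_def(start, si_loc):
    """Walk def -> def upward from `start` and return the first def
    whose location may-aliases si_loc; uses of intermediate defs are
    never re-queried, only the defs themselves."""
    cur = start.defining
    while cur is not None:
        if cached_may_alias(cur.loc, si_loc):
            return cur
        cur = cur.defining
    return None
```

On a chain like store Q; store P; store Q, walking up from the last store skips the non-aliasing P def and lands on the first Q def, with one cached query per def.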

1689 ↗(On Diff #124334)

FWIW: You can do better than this, but it's complicated.

We could build a form with factored stores and multiple phis/sigmas if it is really worth it.

You are also going to end up rediscovering the same data here again and again (i.e. for each store, because you are working backwards and they are all linked, you will eventually hit the same use chain you just saw; for a given memloc, or anything that is a must-alias of that memloc, the answers must be the same). There are various hashing/pair/etc. schemes you can use to cache answers. Not sure it is worth it.

Especially vs building a real form for stores.

Hello all. Thanks for taking a look, much appreciated. The main thing this does over the old version is work across basic blocks, which is why we here are interested. In something like this:

for (int i = ..) {
    X[i] = 0;
    for (int j = ..)
        X[i] += Y[j];
}

The inner X[i] store will currently be pulled out of the loop by LICM, but the original X[i] = 0 will remain as a dead store.

lib/Transforms/Scalar/DeadStoreElimination.cpp
1324 ↗(On Diff #124334)

This is one of those things where I remember changing the "if" here to an assert, seeing it assert a lot and figuring it was then needed. A lot of those cases will be benign though. I took another look, and after wading through a few csmith cases where extra stores are removed with this check, something like the following is where this comes up:

define void @test(i32* %P, i32* noalias %Q, i32* %R) {
; 1 = MemoryDef(liveOnEntry)
  store i32 1, i32* %Q
; 2 = MemoryDef(1)
  store i32 2, i32* %P
; 3 = MemoryDef(2)
  store i32 3, i32* %Q
; MemoryUse(2)
  %1 = load i32, i32* %R
  ret void
}

store 3 to Q can be removed, but as MemAccess "2" has 2 operands, the second Use would cause us to fail as we walk from 2 to 3.

I'm not sure if this is worth the cost though. I don't have a great grasp of what will end up being expensive. I will try some benchmarks and try to get some compile time numbers to see, and if there are not any useful cases where this comes up, I'll remove it. My first go at getting compile time numbers was too noisy to be useful.

One thing I never looked into very much while creating this was the internals of MemorySSA. MemSSA seemed to just work well enough not to need it. The only thing I remember coming up was cases like this:

define void @test16(i32* noalias %P) {
  %P2 = bitcast i32* %P to i8*
  store i32 1, i32* %P
  br i1 true, label %bb1, label %bb3
bb1:
  store i32 1, i32* %P
  br label %bb3
bb3:
  call void @free(i8* %P2)
  ret void
}

We remove the store 1 in the middle, leaving this:

define void @test16(i32* noalias %P) {
  %P2 = bitcast i32* %P to i8*
; 1 = MemoryDef(liveOnEntry)
  store i32 1, i32* %P
  br i1 true, label %bb1, label %bb3
bb1:                                              ; preds = %0
  br label %bb3
bb3:                                              ; preds = %bb1, %0
; 4 = MemoryPhi({%0,1},{bb1,1})
; 3 = MemoryDef(4)
  call void @free(i8* %P2)
  ret void
}

The memory phi remains with two operands, both coming from "1". I presume this is OK, and it's easy enough to work around, but it is different from a freshly constructed MemorySSA graph.

1469 ↗(On Diff #124334)

This is very sparse on comments, isn't it. My bad; I'll try and add some more in here. The important part of this function is to check that there are no throwing instructions between the SI and NI, except where SI is an Alloca/AllocLikeFn that does not escape.

1651 ↗(On Diff #124334)

More comments, got it, will do. This is one of the good ideas that came from D29624 (any bugs are still on me, of course :)

Nothing should invalidate the post orders. It's the relative order that we care about, and DSE only removes instructions, never re-orders them. So if there is a throw between two PO numbers, the throw will remain in its position when we remove one of the stores.

I was just thinking more about this, and it may end up giving odd results at times. This linearising of the PO numbers is dependent on the order the blocks are visited. The store of 1 here is removed:

define void @test(i32* noalias %P) {
    br i1 true, label %bb1, label %bb2
bb1:
    store i32 1, i32* %P
    br label %bb3
bb2:
    call void @unknown_func()
    br label %bb3
bb3:
    store i32 0, i32* %P
    ret void
}

But changing it to "br i1 true, label %bb2, label %bb1", the store won't be removed. That doesn't sound like something we want. Maybe this should keep the may-throw instructions around and check whether they are really in the way. I'll look into this, but may need to read a book.
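The traversal-order dependence can be seen with a small sketch. This is illustrative Python, none of it is the patch's code: a DFS post-order is computed twice over the same diamond CFG, visiting successors in different orders, and the interval check "is the may-throw block numbered between the two store blocks?" flips.

```python
def post_order(succs, entry, visit_order):
    """DFS post-order numbering, visiting successors in the given
    preference order."""
    seen, order = set(), []
    def dfs(b):
        seen.add(b)
        for s in sorted(succs.get(b, []), key=visit_order.index):
            if s not in seen:
                dfs(s)
        order.append(b)
    dfs(entry)
    return {b: i for i, b in enumerate(order)}

# Diamond CFG matching the IR above: store in bb1, may-throw call in
# bb2, overwriting store in bb3.
succs = {"entry": ["bb1", "bb2"], "bb1": ["bb3"], "bb2": ["bb3"]}

def throw_numbered_between(po):
    # The linearised check: does the may-throw block's PO number fall
    # between the two stores' PO numbers?
    lo, hi = sorted((po["bb1"], po["bb3"]))
    return lo < po["bb2"] < hi

po_bb1_first = post_order(succs, "entry", ["entry", "bb1", "bb2", "bb3"])
po_bb2_first = post_order(succs, "entry", ["entry", "bb2", "bb1", "bb3"])
# Same CFG, different successor order: the answer flips, so whether the
# store in bb1 looks removable depends on the traversal.
```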

1689 ↗(On Diff #124334)

OK. As I said above, I never looked deeply into the internals of MemorySSA. Doing the optimisation of stores in the MemorySSA graph is what GCC does, right? So it only needs to look at a store's immediate uses to find dead stores, and look through phis.

My understanding is that currently only DSE would need this; the other users of MemorySSA (GVN, LICM, etc.) get by without optimised stores. So if we did it that way, it should be an optional thing that DSE can then use. Do you think that way would be better? We will still have to walk through phis, but that's fairly trivial.
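The "immediate uses, looking through phis" scheme described above might look roughly like this. It is a toy Python model, not GCC's or the patch's implementation: `users` is a hypothetical map from a memory access to its users, and `full_overwrite` records whether a given store completely overwrites the candidate.

```python
def store_is_dead(access, users, full_overwrite):
    """A store's access is dead if every user, looking through phis, is
    another store that completely overwrites it; any load (or a store
    that only partially clobbers) keeps it alive.
    `users` maps access -> [(kind, user)] with kind in {'phi', 'store',
    'load'}; `full_overwrite` maps a store to True if it fully
    overwrites the candidate."""
    work, seen = list(users.get(access, [])), set()
    while work:
        kind, u = work.pop()
        if u in seen:
            continue
        seen.add(u)
        if kind == "load":
            return False                      # the stored value is read
        if kind == "phi":
            work.extend(users.get(u, []))     # look through the MemoryPhi
        elif not full_overwrite.get(u, False):
            return False                      # clobbered, not fully overwritten
    return True
```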

I will try and get some compile time results together and see how things look.

test/Transforms/DeadStoreElimination/simple.ll
555 ↗(On Diff #124334)

Thanks for checking. I feel that pesky handleEnd function was doing more than it should have.

dmgreen updated this revision to Diff 125941.Dec 7 2017, 6:30 AM
dmgreen edited the summary of this revision. (Show Details)

New changes. Including changing some of how maythrows are handled, some extra comments and some optimisations for slow operations.

OK. I have some performance numbers. I'm compiling clang ("ninja clang") and using
-ftime-report/-stat to get info (with some extra precision for decimal places) and
summing the results for all the compiled files. The total runtime is a little noisy on
this machine, but these sub-numbers seem pretty stable between runs.

Firstly the good news: with this version we now remove more dead stores.
Old: 41310 New: 51660
With my "MemSSA can enable us to remove more stores" hat on, this is good stuff.

Some more good news is that DSE is now quicker, for the sum of time for each file:
Old: ~26s New: ~19s

The bad news is that we also need to add in the MemorySSA passes. I think we now
calculate this twice in the pipeline, not once as before, so the times roughly double.
Old: ~35s New: ~69s
I'm hoping that in the long run we can share the cost of this between other passes.
NewGVN is a couple of hops earlier in the LTO pass pipeline, and LICM is also quite close
in the normal one. Hopefully this cost can be shared out.

The other bad news is we use a post-dom tree (again, maybe sharable?):
Old: ~15s New: ~27s
But MemDep is somehow now quicker:
Old: ~13s New: ~8.5s

The total runtime here was on the order of 10000s, so it's hard to pick out the overall
cost exactly. These results suggest that the total is now ~30s more, and excluding
MemSSA we are at roughly the same time.

I'm going to try and take a look at the most costly files and see if we can knock the most
expensive ones down without making the total slower. As Daniel mentioned, there are some
good candidates for caching results here, like those in isOverwrite.

Maths isn't on my side for making the whole thing quicker. But it removes more dead stores :)

Ping.

Any further progress here?

aqjune added a subscriber: aqjune.May 29 2018, 7:35 PM
rnk commandeered this revision.Sep 5 2018, 3:49 PM
rnk added a reviewer: dmgreen.
rnk added a subscriber: rnk.

I started rebasing this, I'll upload it.

rnk updated this revision to Diff 164118.Sep 5 2018, 3:49 PM
rnk retitled this revision from MemorySSA backed Dead Store Elimination. to MemorySSA backed Dead Store Elimination..
  • rebase
jfb added a comment.Sep 5 2018, 3:59 PM

Can you add tests for volatile?

Thanks for taking a look at this; it would be good to make use of it. You may want to check the llvm-commits mails for more of Daniel's thoughts, as they unfortunately didn't make it into Phabricator.

IIRC my plan was to add some sort of caching to isOverwrite, to reduce the number of required calls to getUnderlyingObject and whatever else in that function can take time. That helped speed things up in some cases, but still left some cases where this wasn't as quick as I would like. It may still be quicker than the "old" pass overall.

There is also a patch somewhere to make LICM preserve PDTs, which would mean we get one for free. That's not needed before this though; it just cuts down the time again. From the old tests I did, it seemed that this version of DSE was quicker, but the analyses we need are not free. I was looking for cross-basic-block DSE, so was happy enough paying the price. I believe you are looking for something that has fewer degenerate compile-time regressions?

Can you add tests for volatile?

The idea here would be that enable-dse-memoryssa would be set to true eventually, so this should pass every existing DSE test. The old parts of this file could then be deleted. The multiblock tests are new, as that's not something the old pass could do. I'm not sure if this will still pass all the existing tests, or if the old DSE algorithm has learnt new tricks since then.

jfb added a comment.Sep 6 2018, 9:00 AM

Can you add tests for volatile?

The idea here would be that enable-dse-memoryssa would be set to true eventually, so this should pass every existing DSE test. The old parts of this file could then be deleted. The multiblock tests are new, as that's not something the old pass could do. I'm not sure if this will still pass all the existing tests, or if the old DSE algorithm has learnt new tricks since then.

Old LLVM code is often poorly tested. The DSE directory has two volatile stores, one volatile load, and no volatile atomics. It has a decent amount of atomic load/store, and no atomic RMW or cmpxchg. I agree that you want to pass the old tests (potentially with tweaks as some things change), but the current volatile and atomic situation is not solid ground for an algorithm swap.

I *want* DSE of dead stores around volatile and atomic operations. I'm not asking for this patch to implement it; I just want basic correctness so that, now and once we start being more aggressive, we know there's no breakage.

fhahn added a subscriber: fhahn.Jan 11 2019, 1:35 PM
fhahn added a comment.Fri, Dec 6, 5:36 AM

Hi!

I would be interested in giving dead store elimination using MemorySSA another push. We've hit plenty of DSE limitations, especially together with -ftrivial-auto-var-init, and I think modernizing DSE to use MemorySSA will help us improve the situation.

@rnk, @dmgreen, are you still interested in pushing for DSE + MSSA? Otherwise I would be happy to look into that. I think this patch is a great foundation, and it should be possible to break it up into a few distinct parts, starting with just replacing completely overwritten stores and adding additional cases as follow-ups. I also have a few potential simplifications in mind I'd like to try. I rebased the patch and tried to build the SPEC2006/MultiSource tests, but it does not get very far without crashing. Again, my main goal would be to break it up into more manageable pieces :)

Looking a bit further ahead, I'd also like to lift the post-domination requirement for cases where we can prove that an overwriting store exists on all paths to the exit/other reads.

Please let me know what you think!

Herald added a project: Restricted Project. Fri, Dec 6, 5:36 AM
Herald added a subscriber: asbirlea.
rnk added a comment.Fri, Dec 6, 3:04 PM

Yes, please feel free to grab this and take it forward. I got this rebased and realized it was going to be much more work to get it thoroughly tested.

Tyker added a subscriber: Tyker.Sat, Dec 7, 6:22 AM

Sounds great. It would be good to see some improved DSE in LLVM.

This will have bitrotted heavily over the last 2 years. I can imagine that most of the parts of the "old DSE" it was using will have changed, leading to different assumptions.

The algorithm here may not be the greatest thing ever created, but I think it is probably better than the current DSE in tree at the moment (which I have a low opinion of, but which is battle-tested). It can handle cross-basic-block dependencies, for example, which is awesome, but that can come at a price. There are places in here that are O(n^2), which I did run into on a normal bootstrap in my testing. They weren't super slow, and the old DSE can have the same problems, but it's something to watch out for. Both GCC and Swift have different DSE algorithms: IIRC, Swift's propagates store values backwards, and GCC's is closer to PDSE (which is like GVN but in reverse, from what I understand).