This is an archive of the discontinued LLVM Phabricator instance.

[DSE] Lift post-dominance restriction.
AbandonedPublic

Authored by fhahn on Jan 30 2020, 6:41 PM.

Details

Summary

To eliminate a store, we must ensure that it is not accessed before the
killing def on any path to a function exit. Currently we use
post-dominance to ensure that the killing def is executed on each path
from the killed def. But this excludes cases where, on the other paths,
there are no reads before another overwriting def. An example can be
found below. The first store (store i32 1, i32* %P) is dead, but neither
of the killing stores post-dominates it.

This patch adds support for such cases in a relatively straightforward
way. When checking for read clobbers, we already traverse all uses. If
there are any reads before clobbering writes along any path to an exit,
we will check them. We can stop adding uses to the worklist for
MemoryDefs that completely overwrite the original location.

The only remaining problem is accesses after returning from the
function. But I think we can make those reads explicit in the MemorySSA
by introducing an additional MemoryUse that clobbers all locations
visible to the caller at each exit. With those additional MemoryUses,
the existing algorithms should handle those cases naturally as well.

Currently the patch uses a hack to create those MemoryUses: it just
introduces a readonly function call at the end of each function and
treats it as clobbering all objects that may be visible to the caller.
If the solution makes sense, I'll replace the hack with a proper solution.

define void @test4(i32* noalias %P, i1 %c1) {

store i32 1, i32* %P
br i1 %c1, label %bb1, label %bb2

bb1:

store i32 0, i32* %P
br label %bb5

bb2:

store i32 3, i32* %P
br label %bb5

bb5:

call void @use(i32* %P)
ret void

}

Diff Detail

Event Timeline

fhahn created this revision.Jan 30 2020, 6:41 PM
Herald added a project: Restricted Project. Jan 30 2020, 6:41 PM


Currently the patch uses a hack to create those MemoryUses: it just introduces a readonly function call at the end of each function

I'm not deeply familiar with the MemorySSA datastructures, but that makes sense. I'm trying to think if any other passes care, though. I can't come up with anything off the top of my head, but maybe I'm forgetting something.

Currently the pass cleans up the artificial uses it introduces at the end, so other passes should not be impacted, unless I am missing something. The fact that they only alias objects visible to the caller after the function returns is only modelled directly in DSE and leaving them around could potentially slightly pessimize other passes.

fhahn updated this revision to Diff 243261.Feb 7 2020, 11:53 AM

Rebase after latest updates to earlier change.

fhahn updated this revision to Diff 243556.Feb 10 2020, 7:29 AM

Ping. Rebased after D72700 landed.

Tyker added a comment.Feb 29 2020, 5:13 AM

This seems like a great approach to me.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1615

The comment seems outdated.
I think that with this patch "dominating" will no longer be true,
and a MemoryPhi does not seem to cause a bail out.

Disclaimer: I haven't looked at the approach here.
Have you seen the MustBeExecutedContextExplorer and the path exploration it allows (D65593)?
Sounds like this might actually be applicable here. If the chosen approach is superior or more natural feel free to say so.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1510

I'm not sure this is allowed in a function pass.

uenoku added a subscriber: uenoku.Feb 29 2020, 5:51 PM
fhahn updated this revision to Diff 247596.Mar 2 2020, 4:13 AM

Rebased, updated the comment.

Disclaimer: I haven't looked at the approach here.
Have you seen the MustBeExecutedContextExplorer and the path exploration it allows (D65593)?
Sounds like this might actually be applicable here. If the chosen approach is superior or more natural feel free to say so.

We use MemorySSA traversals here, which allows us to check only the relevant linked memory instructions, without needing to explore the CFG. I had a brief look at MustExecute and I think the MemorySSA traversal is better suited to the problem at hand, as MemorySSA directly links the 'interesting' instructions (although I might be missing something about MustExecute).

Also, for a killing store, we need to find an overwriting store that must execute when the other store executes. But we also have to check for any aliasing reads that may execute in between.

fhahn marked 2 inline comments as done.Mar 2 2020, 4:21 AM
fhahn added inline comments.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1510

Do you mean adding a new global or adding a function attribute to the added global?

Ideally function passes would only modify things in the function scope, but I think adding a new global is quite common, as most function passes that add new calls may also need to add a declaration of the called function.

As for adding the attribute, this definitely needs to change before submitting! I think we can either use an intrinsic (with the right attributes) or model it directly in MemorySSA, whichever option is preferred.

1615

Yes, I've rebased the patch and updated the comment as well.

fhahn updated this revision to Diff 248817.Mar 6 2020, 1:45 PM

ping.

Updated to use an intrinsic and fixed a crash where AA with the new pass manager returns ModRef for our inserted read-only functions. As mentioned earlier, it could alternatively be modeled in MemorySSA directly, if preferred.

jdoerfert added inline comments.Mar 6 2020, 1:59 PM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1510

I thought there was a new function created, sorry.

I have the feeling this "exit read" should be part of MSSAs functionality (as it seems tied to MSSA and reusable) but I don't have a strong opinion about it.

fhahn updated this revision to Diff 251621.Mar 20 2020, 6:47 AM

Ping. @asbirlea, what do you think about modeling reads at the end of the function? Should it be done directly in MSSA?

Rebased after landing MemoryPhi support, pre-committed new tests.

I don't think MemorySSA should be adding that generic intrinsic. If we need such a hack, at least contain it to within the pass, not an analysis used throughout the pass pipeline.
+@george.burgess.iv for his thoughts on this too.

llvm/include/llvm/IR/Intrinsics.td
1160 ↗(On Diff #251621)

I wouldn't use this name. As far as I can see, it has nothing to do with MSSA; it's just a generic intrinsic that models a read-only function call.
This looks more akin to int_donothing_mayread.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1499–1500

Cleanup PDT if it's not used anymore?

1543

If this intrinsic is read only, then assert a MemoryUse was created?

fhahn updated this revision to Diff 252971.Mar 26 2020, 1:47 PM
fhahn marked 6 inline comments as done.

Thanks for taking a look!

Renamed intrinsic, removed PDT, added comment about why calls are inserted at the end of a function.

llvm/include/llvm/IR/Intrinsics.td
1160 ↗(On Diff #251621)

Sounds good to me.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1499–1500

Yes, it should indeed be removed!

1543

There is some weird behavior with the new pass manager, in which AA fails to infer writenone for the newly added call to the intrinsic. I am not sure exactly what is going on there; I'll take a closer look if the intrinsic approach is the way forward.

Thanks for working on this!

+1 to the general idea of "adding a read-everything intrinsic doesn't smell great," and a further +1 for containing it to only this pass if we do end up needing it.

This is the first time I'm glancing at this pass, so I'm going to err a bit on the side of overcommunication in hopes of making misunderstandings on my side more obvious. :)

I gather that getDomMemoryDef, given two defs A (Current) and B (KillingDef):

  • Determines some Def/Phi that dominates A, C.
  • Walks all of the MemoryAccesses that're transitively users of C, stopping at full overwrites of C + B.
  • If any of those MemoryAccesses is a potential read of A, gives up. Otherwise, returns A.

If we remove the postdom restriction, I'm uncertain about how this algo will scale. It used to be that all paths necessarily terminated at B, but now it looks like we'll potentially walk a ton more? e.g.,

bool branch(); // inaccessiblememonly

{
  // 1 = MemoryDef(LOE) -- DomAccess
  *a = 1;
  if (branch()) {
    // 2 = MemoryDef(1) -- The thing we want to replace it with
    *a = 2;
  }
  // 3 = MemoryPhi(2, 1)
  // Every Phi|Def below here is now walked (?)
}

To be clear, I'm not saying this walking isn't necessary; just that it now happens AFAICT.

In any case, the thing I see this new intrinsic standing in for is an oracle that responds to "here's the set of blocks we've observed complete overwrites in. Is there any way for us to exit this function without going through one of those blocks, starting from A->getBlock()?"

This is an expensive problem, but I think we can simplify it substantially, *if* my comment about the extra walking above is correct. I'd imagine we could do it essentially by constructing a sort-of def frontier for each ret terminator. In other words, a DenseSet<MemoryAccess *> // (MemoryDef || MemoryPhi) where membership in the set == "there exists a path where this Def/Phi is the final Def/Phi before we hit a ret."

Since lifting the postdom restriction makes us explore every Def/Phi reachable from C (stopping at full overwrites, and stopping after N steps), we should hopefully be able to consult that set in the getDomMemoryDef loop to get a similar effect to the fake read-everything calls?

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1615

(did this update get dropped accidentally? This comment still appears to say "Find a MemoryDef, ...")

1668

Tangential to this patch: should we be ignoring users U which are MemoryDefs whose only use of Acc is as their getOptimized() operand? Without that, it's not immediately obvious to me how we aren't overly conservative in situations like:

{
  // 1 = MemoryDef(LOE) -- this is DomAccess
  *a = 1;
  // 2 = MemoryDef(1)
  *b = 2;
  // 3 = MemoryDef(2) -- this is the obvious kill
  *a = 3;
  // 4 = MemoryDef(3) (optimized to 2)
  *c = 4;
  // MemoryUse(4) -- AA says MayAlias for `(d, a)`, so `isReadClobber()` should be `true` if we visit this?
  if (*d) {
    // ...
  }
}
fhahn added a comment.Mar 31 2020, 3:15 PM

Thanks for working on this!

+1 to the general idea of "adding a read-everything intrinsic doesn't smell great," and a further +1 for containing it to only this pass if we do end up needing it.

Regarding the intrinsic, I am not sure it is really needed; I guess it would be enough to check whether we reached the end of our walk for each path. I'll try that tomorrow.

This is the first time I'm glancing at this pass, so I'm going to err a bit on the side of overcommunication in hopes of making misunderstandings on my side more obvious. :)

Thank you very much for taking a look!

I gather that getDomMemoryDef, given two defs A (Current) and B (KillingDef):

  • Determines some Def/Phi that dominates A, C.
  • Walks all of the MemoryAccesses that're transitively users of C, stopping at full overwrites of C + B.
  • If any of those MemoryAccesses is a potential read of A, gives up. Otherwise, returns A.

If we remove the postdom restriction, I'm uncertain about how this algo will scale. It used to be that all paths necessarily terminated at B, but now it looks like we'll potentially walk a ton more? e.g.,

bool branch(); // inaccessiblememonly

{
  // 1 = MemoryDef(LOE) -- DomAccess
  *a = 1;
  if (branch()) {
    // 2 = MemoryDef(1) -- The thing we want to replace it with
    *a = 2;
  }
  // 3 = MemoryPhi(2, 1)
  // Every Phi|Def below here is now walked (?)
}

To be clear, I'm not saying this walking isn't necessary; just that it now happens AFAICT.

Yes, the summary above is correct. Removing the post-dominance restriction unfortunately requires us to explore potentially much further.

The main property we need to check is the following: given MemoryDefs A and B, where A dominates B and B completely overwrites A, we can eliminate A if there are no accesses that may read A before B overwrites it. Only considering defs where B post-dominates A restricts the paths we have to check to the paths 'between' A and B. Without requiring post-dominance, we have to check the property along all paths from A to the exit. In particular, the new case this allows us to eliminate is when A is not read on any path that does not go through B.

In any case, the thing I see this new intrinsic standing in for is an oracle that responds to "here's the set of blocks we've observed complete overwrites in. Is there any way for us to exit this function without going through one of those blocks, starting from A->getBlock()?" This is an expensive problem, but I think we can simplify it substantially, *if* my comment about the extra walking above is correct.

I think there are 2 cases to distinguish:

  1. For accesses to non-alloca objects, this matches what the intrinsic achieves, I think. We can only eliminate stores to objects that are visible after the function returns if they are overwritten along all paths to the exit. So if we have determined a set of blocks that overwrite A (with no reads in between), we could check whether all paths from A to the exit go through one of the overwriting blocks. I think that matches your suggestion.
  2. For objects that are not visible to the caller (e.g. allocas), the intrinsic achieves something slightly different: given A and B as above, we can eliminate A if there are no reads of A between A and any function exit reachable from A. We can stop walking any paths that contain overwrites of A.

I think 1. could be improved as you suggested. If we want to cover 2., I think we have to check all potentially aliasing accesses along all paths to the exit.

The approach I tried to follow to bring up MemorySSA-backed DSE is to add support for additional cases in smallish steps, to have a baseline (with compile-time controlled by cut-offs), and subsequently improve the parts that cause issues in practice. For example, lift the post-dominance restriction with potentially excessive walking, followed by patches to implement the special handling for the non-alloca case as suggested and future performance improvements (e.g. there is plenty of potential for caching, like in D75025). That would also allow us to quantify the impact of improvements. What do you think?

fhahn updated this revision to Diff 254267.Apr 1 2020, 12:04 PM

Avoid inserting calls by pre-computing the set of the last MemoryDefs or MemoryPhis before a function exit.

fhahn marked an inline comment as done.Apr 1 2020, 12:07 PM
fhahn added inline comments.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1668

Oh, the optimized access is also added to the uses of 2 = MemoryDef here? Yes, that would be overly conservative. Is there a way to force MemorySSA to be optimized before we run DSE, for testing? I'll put up a separate fix.

I think there are 2 cases to distinguish:

  1. For accesses to non-alloca objects, this matches what the intrinsic achieves I think. We can only eliminate stores to objects that are visible after the function returns, if they are overwritten along all paths to the exit. So if we have determined a set of blocks that overwrite A (and there are no reads in between), we could check if all paths to the exit from A must go through one of the overwriting blocks. I think that matches your suggestion.

Sounds like what I was thinking, yeah.

For objects that are not visible to the caller (e.g. alloca), the intrinsic achieves something slightly different: Given A and B as above, we can eliminate A, if there are no reads of A between A and any of the exits from A. We can stop walking any paths that contain overwrites of A.

I'm confused about why allocas need special treatment. This new set that we're building is essentially a fast way to answer the question "are there potentially unknowable reads after this Def/Phi?" If we know memory (e.g., from an alloca) is unreadable outside of the function, I'd hope we can ignore the set entirely, like this patch is doing now? Baked into my suggestion was the assumption that we were already:

check[ing] all potentially aliasing access along all paths to the exit

...since I think the transitive User walking we're doing inside of getDomMemoryDef should accomplish that?

In any case, if you're confident that this works for #1, I'm content to move discussion about "why don't things work as they are now," to a patch that doesn't block this one.

The approach I tried to follow to bring up MemorySSA-backed DSE is to add support for additional cases in smallish steps, to have a baseline (with compile-time controlled by cut-offs), and subsequently improve the parts that cause issues in practice. For example, lift the post-dominance restriction with potentially excessive walking, followed by patches to implement the special handling for the non-alloca case as suggested and future performance improvements (e.g. there is plenty of potential for caching, like in D75025). That would also allow us to quantify the impact of improvements. What do you think?

I like incremental approaches when they're possible, so I'm happy with this. :)

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1492

that is defs or phis without any users that are defs or phis.

This isn't quite complete; consider:

bool bar(); // inaccessiblememonly
void foo(int *a) {
  // 1 = MemoryDef(LOE)
  *a = 1;
  if (bar())
    return;
  // 2 = MemoryDef(1)
  *a = 2;
}

Despite having 2 as a User, we'd still want 1 to be in this set, since depending on the way the if goes, it's the last Def/Phi before the function ends.

1668

Is there a way to force MemorySSA to be optimized before we run DSE, for testing?

I'm unsure offhand. @asbirlea would likely know if there is.

If there's not and you'd find it useful, a debugging-only pass that boils down to for (MemoryAccess *MA : Fn) { MSSA->getWalker()->getClobberingMemoryAccess(MA, {}); } should force that. We preoptimize Uses, but not Defs.

fhahn marked an inline comment as done.Apr 2 2020, 1:46 PM

I think there are 2 cases to distinguish:

  1. For accesses to non-alloca objects, this matches what the intrinsic achieves I think. We can only eliminate stores to objects that are visible after the function returns, if they are overwritten along all paths to the exit. So if we have determined a set of blocks that overwrite A (and there are no reads in between), we could check if all paths to the exit from A must go through one of the overwriting blocks. I think that matches your suggestion.

Sounds like what I was thinking, yeah.

For objects that are not visible to the caller (e.g. alloca), the intrinsic achieves something slightly different: Given A and B as above, we can eliminate A, if there are no reads of A between A and any of the exits from A. We can stop walking any paths that contain overwrites of A.

I'm confused about why allocas need special treatment. This new set that we're building is essentially a fast way to answer the question "are there potentially unknowable reads after this Def/Phi?" If we know memory (e.g., from an alloca) is unreadable outside of the function, I'd hope we can ignore the set entirely, like this patch is doing now? Baked into my suggestion was the assumption that we were already:

Ah right, I think I was not sure if your suggestion also applied to the alloca case in the previous comment.

IIUC, the additional set would be used for the non-alloca cases, as we have to ensure that there are overwrites along all paths to the exit. For allocas, we have to explore all accesses along all paths to function exits, as this patch currently does. What I meant to say in my previous comment was that I think we have to stick to the walking as in the patch, which is also what your last comment said, unless I am missing something :)

In any case, if you're confident that this works for #1, I'm content to move discussion about "why don't things work as they are now," to a patch that doesn't block this one.

The discussion so far has been extremely helpful, thanks! I think it would make sense to address the alloca/non-alloca cases separately. If you are happy with the direction, I would update this patch to only handle the alloca case (basically this patch modulo the intrinsic/LastDefOrPhi) and address the non-alloca case in a subsequent patch.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1492

Of course, now I remember why the intrinsic was so convenient to start with! Anyway, I plan to split up the patch and probably look into using the suggested set straight away for the cases that required LastPhiOrDef.

IIUC, the additional set would be used for the non-alloca cases, as we have to ensure that there are overwrites along all paths to the exit.

Yup!

For allocas, we have to explore all accesses along all paths to function exits, as this patch currently does. What I meant to say in my previous comment was that I think we have to stick to the walking as in the patch, which is also what your last comment said, unless I am missing something :)

Ah, I see. Yeah, agreed :)

If you are happy with the direction, I would update this patch to only handle the alloca case

SGTM

fhahn planned changes to this revision.Apr 8 2020, 8:30 AM

If you are happy with the direction, I would update this patch to only handle the alloca case

SGTM

I put up a patch that only handles the alloca (and alloca-like non-escaping) case: D77736

I'll try to get an update on the second case as well soonish.

vsapsai added inline comments.
llvm/test/Transforms/DeadStoreElimination/MSSA/memset-missing-debugloc.ll
18

It would be great to update the comment to match the new IR, i.e., make b a function parameter instead of a local variable.

fhahn abandoned this revision.Apr 27 2020, 7:58 AM
fhahn marked an inline comment as done.

I've put up D78932, which implements eliminating stores to objects that are visible to the caller, along the lines of @george.burgess.iv's suggestion.

llvm/test/Transforms/DeadStoreElimination/MSSA/memset-missing-debugloc.ll
18

Thanks, will do when I submit the test change!