This is an archive of the discontinued LLVM Phabricator instance.

[DSE] Add first version of MemorySSA-backed DSE (Bottom up walk).
ClosedPublic

Authored by fhahn on Jan 14 2020, 6:12 AM.

Details

Summary

This patch adds a first version of a MemorySSA-based DSE. It is missing
a lot of features, which will get added as follow-ups, to help keep
the review manageable.

The patch uses the following general approach: given a MemoryDef, walk
upwards to find clobbering MemoryDefs that may be killed by the
starting def. Then check that there are no uses that may read the
location of the original MemoryDef in between both MemoryDefs. A bit
more concretely:

For all MemoryDefs StartDef (a rough sketch of this walk in code follows the list):

  1. Get the next dominating clobbering MemoryDef (DomAccess) by walking upwards.
  2. Check that there are no reads between DomAccess and StartDef by checking all uses starting at DomAccess and walking until we see StartDef.
  3. For each found dominating MemoryDef (DomDef), check that:
    1. There are no barrier instructions between DomDef and StartDef (like throws or stores with ordering constraints).
    2. StartDef is executed whenever DomDef is executed.
    3. StartDef completely overwrites DomDef.
  4. Erase DomDef from the function and MemorySSA.
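
A rough sketch of the walk in C++ (illustrative only; the helper names below are placeholders for the checks in steps 2-4, not the exact functions in this patch):

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/Analysis/MemorySSAUpdater.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Placeholder declarations for the checks in steps 2-4; not the exact helpers
// from the patch.
static bool noReadsBetween(MemoryDef *DomDef, MemoryDef *StartDef,
                           MemorySSA &MSSA, AliasAnalysis &AA);
static bool mayThrowBetween(MemoryDef *DomDef, MemoryDef *StartDef);
static bool isGuaranteedToExecuteFrom(MemoryDef *DomDef, MemoryDef *StartDef);
static bool isCompleteOverwrite(Instruction *Later, Instruction *Earlier,
                                AliasAnalysis &AA);

static void runBottomUpDSE(Function &F, MemorySSA &MSSA, AliasAnalysis &AA) {
  MemorySSAUpdater Updater(&MSSA);
  // Collect the candidate MemoryDefs up-front.
  SmallVector<MemoryDef *, 64> Stores;
  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (auto *MD = dyn_cast_or_null<MemoryDef>(MSSA.getMemoryAccess(&I)))
        Stores.push_back(MD);

  for (MemoryDef *StartDef : Stores) {
    // 1. Walk upwards to the next dominating clobbering access.
    MemoryAccess *DomAccess =
        MSSA.getWalker()->getClobberingMemoryAccess(StartDef);
    auto *DomDef = dyn_cast<MemoryDef>(DomAccess);
    if (!DomDef)
      continue; // MemoryPhis are not handled in this first version.
    // 2. No use in between may read the location written by DomDef.
    if (!noReadsBetween(DomDef, StartDef, MSSA, AA))
      continue;
    // 3. No barriers in between, and StartDef executes whenever DomDef does.
    if (mayThrowBetween(DomDef, StartDef) ||
        !isGuaranteedToExecuteFrom(DomDef, StartDef))
      continue;
    // 4. StartDef completely overwrites DomDef.
    if (!isCompleteOverwrite(StartDef->getMemoryInst(),
                             DomDef->getMemoryInst(), AA))
      continue;
    // 5. Erase DomDef from the function and from MemorySSA.
    Instruction *DeadInst = DomDef->getMemoryInst();
    Updater.removeMemoryAccess(DomDef);
    DeadInst->eraseFromParent();
  }
}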

The patch uses a very simple approach to guarantee that no throwing
instructions are between two stores: we only consider accesses to stack
objects, accesses in the same basic block if that block does not contain
any throwing instructions, or accesses in functions that do not contain
any throwing instructions at all. This restriction will be lifted later.
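
As a rough illustration of this restriction (a simplified stand-in with made-up names, not the code in the patch; whether the stored-to object is visible to the caller is passed in as a flag here):

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Simplified stand-in for the throw check described above.
static bool mayThrowBetweenSimple(Instruction *DomI, Instruction *KillingI,
                                  bool DefVisibleToCaller) {
  // Stores to objects the caller cannot see are unaffected by unwinding.
  if (!DefVisibleToCaller)
    return false;
  auto BlockMayThrow = [](const BasicBlock &BB) {
    return any_of(BB, [](const Instruction &I) { return I.mayThrow(); });
  };
  // Same block: only that block needs to be free of throwing instructions.
  if (DomI->getParent() == KillingI->getParent())
    return BlockMayThrow(*DomI->getParent());
  // Otherwise, require the whole function to be free of throwing instructions
  // (this restriction gets lifted in follow-ups).
  return any_of(*DomI->getFunction(), BlockMayThrow);
}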

Besides adding support for the missing cases, there is plenty of additional
potential for improvement as follow-up work, e.g. the way we visit stores
(this could be a traversal of the MemorySSA rather than collecting them
up-front), or using the alias information discovered during the walk to
optimize the MemorySSA.

This is loosely based on D40480 by Dave Green.

Diff Detail

Event Timeline

fhahn created this revision.Jan 14 2020, 6:12 AM
Herald added a project: Restricted Project. Jan 14 2020, 6:12 AM

asbirlea added inline comments.Jan 14 2020, 12:37 PM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1477

MemorySSA should handle this; if it doesn't, we should fix that.

1541

Nit: Use only DomDef, known to be non-null outside the loop.
Restrict scope of DomAccess to inside the loop above.

1544

Walking uses may miss cases due to aliasing not being transitive. This needs to be thoroughly analyzed. Here's a very rough example.

1 = Def(LoE) ; <----- DomDef  stores [0,3]
2 = Def(1) ; (2, 1)=NoAlias,   stores [4,7]
Use(2)      ; MayAlias 2 *and* 1, the Use points to the *first* Def it may alias, loads [0, 7].
3 = Def(1)  ; <---- Current  (3, 2)=NoAlias, (3,1)=MayAlias,  stores [0,3]

The situation may be simplified due to handling stores, but all Uses may need looking at. Note that recursing on uses of the uses will not work either.
A rough example of why:

1 = Def(LoE)
2 = Def(1)     ; <----- DomDef 
3 = Def(1) ; (3, 2)=NoAlias
Use(3)      ; MayAlias 3 *and* 2, the Use points to the *first* Def it may alias.
4 = Def(2)  ; <---- Current  (4, 3)=NoAlias, (4,2)=MayAlias
1546

You're right this can be partially lifted.
Here's an example:

%a:
  1 = Def(LoE)  ; <----- DomDef 
   br %cond1, %b, %c
%b:
  2 = Def(1)
   br %cond2, %d, %e
%c:
   br %e
%d:
   br %f
%e:
   3 = phi(1,2)
   br %f
%f:
   4 = Phi(2,3)
   5 = Def(4) ; <---- Current
1749

I don't see how this being in a loop will work. Shouldn't this be a "give up"?
Example:

; 1 = MemoryDef (LoE)
store a
; 2 = MemoryDef(1)
call_reading_a_and_overwriting_a
; 3 = MemoryDef(2)
store a

Once getClobbering finds a MayAlias of 3 with 2, even if an overwrite is not proven, def 1 cannot be removed.

I may have missed such a check in getDomMemoryDef.

Tyker added a subscriber: Tyker.Jan 15 2020, 4:23 AM

I did a few experiments with a bottom-up algorithm before the patch I showed on Phabricator. My implementation of the bottom-up approach had similar average compile-time to the current pass on the test-suite, but it was only barely removing more stores than the current pass.
I gave up on it because I didn't find a good way forward to make it deal with cases like the following:

  store i32 0, i32* %Ptr   ; DEAD
  br i1 %cond, label %a, label %b
a:
  store i32 2, i32* %Ptr
  br label %c
b:
  store i32 2, i32* %Ptr
  br label %c

which is IMO definitely something we want to do. By the way, GCC does this optimization: https://godbolt.org/z/VFBf-c
Maybe there is a bottom-up way to deal with this that I didn't think of. Any thoughts?

Would you be interested in a port of my top-down algorithm from D72182 to work with this patch series's framework?
Since I wrote it I have gotten many ideas for how to improve it, mostly on the compile-time side.

fhahn added a comment.Jan 15 2020, 5:53 AM

I did a few experiments with a bottom-up algorithm before the patch I showed on Phabricator. My implementation of the bottom-up approach had similar average compile-time to the current pass on the test-suite, but it was only barely removing more stores than the current pass.
I gave up on it because I didn't find a good way forward to make it deal with cases like the following:

  store i32 0, i32* %Ptr   ; DEAD
  br i1 %cond, label %a, label %b
a:
  store i32 2, i32* %Ptr
  br label %c
b:
  store i32 2, i32* %Ptr
  br label %c

which is IMO definitely something we want to do. By the way, GCC does this optimization: https://godbolt.org/z/VFBf-c
Maybe there is a bottom-up way to deal with this that I didn't think of. Any thoughts?

Yes I think we should make sure to cover this case. One approach could be to optimize the defining accesses of the accesses we start our bottom-up walks with. After visiting all MemoryDefs, to detect cases like the one above, we would just have to look for MemoryDefs D with multiple MemoryDefs as users (which we could probably collect along the way). Then we would just have to check that D is not read in between. Does that sound sensible?
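
A very rough sketch of what that post-processing could look like (all helper names are made up for illustration and would still need real implementations):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

// Made-up helpers standing in for the real checks.
static bool completelyOverwrites(MemoryDef *Killer, MemoryDef *Dead,
                                 AliasAnalysis &AA);
static bool coverAllPathsFrom(MemoryDef *Dead, ArrayRef<MemoryDef *> Killers);
static bool readBetweenAny(MemoryDef *Dead, ArrayRef<MemoryDef *> Killers,
                           AliasAnalysis &AA);

// After the main bottom-up walk: a def D is dead if several MemoryDef users
// of D jointly overwrite it on every path leaving D, and D is not read before
// any of them.
static void findBranchKilledStores(ArrayRef<MemoryDef *> CollectedDefs,
                                   AliasAnalysis &AA,
                                   SmallVectorImpl<MemoryDef *> &DeadDefs) {
  for (MemoryDef *D : CollectedDefs) {
    SmallVector<MemoryDef *, 4> Killers;
    for (User *U : D->users())
      if (auto *UD = dyn_cast<MemoryDef>(U))
        if (completelyOverwrites(UD, D, AA))
          Killers.push_back(UD);
    if (Killers.size() > 1 && coverAllPathsFrom(D, Killers) &&
        !readBetweenAny(D, Killers, AA))
      DeadDefs.push_back(D);
  }
}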

Would you be interested in a port of my top-down algorithm from D72182 to work with this patch series's framework?
Since I wrote it I have gotten many ideas for how to improve it, mostly on the compile-time side.

Yes, but I think it would be good to settle the bottom-up vs. top-down question first.

fhahn marked 2 inline comments as done.Jan 15 2020, 6:44 AM
fhahn added inline comments.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1544

Thanks for the example. I think I might be missing something for the optimized version though. From the example in MemorySSA.h:

/// Given this form, all the stores that could ever effect the load at %8 can be
/// gotten by using the MemoryUse associated with it, and walking from use to
/// def until you hit the top of the function.

From the comment above, shouldn't we visit both 2 and 3 when walking up from Use(3), as both may change the read location? In the example we would only see 3 and 1.

But even assuming we would visit both 2 and 3, I think I can see how we could end up with scenarios where we miss overlapping reads. I think we would have to look at all users of DomDef, and we specifically *cannot* skip any non-aliasing MemoryDefs for the read checks.

That would make things more expensive, but it is something we would have to do regardless of going bottom-up or top-down. Does that make sense to you? It would mean, however, that in the worst case we would have to do a top-down walk for the read checks, similar to the general top-down approach.
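
A minimal sketch of such a read check, assuming we push every user onto a worklist, including non-aliasing MemoryDefs, and only stop when we reach the killing def (illustrative only, not the exact code of the updated patch):

#include "llvm/ADT/SetVector.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

// Returns true if no access between DomAccess and StartDef may read DefLoc.
static bool noReadsBetween(MemoryAccess *DomAccess, MemoryDef *StartDef,
                           const MemoryLocation &DefLoc, AliasAnalysis &AA) {
  // A SetVector worklist avoids processing the same access twice.
  SmallSetVector<MemoryAccess *, 32> WorkList;
  for (User *U : DomAccess->users())
    WorkList.insert(cast<MemoryAccess>(U));

  for (unsigned I = 0; I < WorkList.size(); ++I) {
    MemoryAccess *UseAccess = WorkList[I];
    if (UseAccess == StartDef)
      continue; // Reached the killing def; stop walking down this path.
    if (isa<MemoryPhi>(UseAccess))
      return false; // Bail out on MemoryPhis in this first version.
    Instruction *UseInst = cast<MemoryUseOrDef>(UseAccess)->getMemoryInst();
    if (isRefSet(AA.getModRefInfo(UseInst, DefLoc)))
      return false; // A potential read of the location in between.
    // Keep walking through the users of MemoryDefs, even non-aliasing ones.
    for (User *U : UseAccess->users())
      WorkList.insert(cast<MemoryAccess>(U));
  }
  return true;
}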

1749

I had another look at getClobberingMemoryAccess, and we indeed need an additional check ensuring that the MemoryDef does not also read the original location!

Regarding handling the branching example, the solution fhahn proposes sounds reasonable to me.
I was thinking something similar, along the lines of: do the checks bottom-up from both the second and third stores during the main algorithm; both find the dead store but do not postdominate it. Don't abandon them, but keep this info (including the checks for no intervening reads) for after the main search. The data structure could be a Hashmap<PotentialDeadStore, ListOfDefsOrBlocksWhoFoundThisPotentialDeadStore>. If all paths are covered in the List for a PotentialDeadStore, then it's truly dead.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1544

I think I know what I missed. MemoryDefs keep two fields: the defining access and the optimized access. So 3 = Def(1) actually looks like this: 3 = Def(2) - Optimized: 1, and is a user of both 1 and 2.

Yes, I think you're right that the read checks for all accesses are needed regardless of which approach is taken. And yes, it will be expensive.
I came across something similar in LICM, and I limited or outright avoided analyzing all uses against a store to avoid the cost of analyzing all of them (see ~LICM.cpp:1300)
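
For reference, the two fields are directly accessible through the MemorySSA API (not code from the patch, just a tiny illustration):

#include "llvm/Analysis/MemorySSA.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// A MemoryDef keeps both a defining access and (once optimized) an optimized
// access, so it sits on the use lists of both; that is why walking the users
// of 1 above also reaches 3 = Def(2).
static void dumpDefOperands(const MemoryDef *MD) {
  errs() << "defining:  " << *MD->getDefiningAccess() << "\n";
  if (MD->isOptimized())
    errs() << "optimized: " << *MD->getOptimized() << "\n";
}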

Regarding handling the branching example, the solution fhahn proposes sounds reasonable to me.
I was thinking something similar, along the lines of: do the checks bottom-up from both the second and third stores during the main algorithm; both find the dead store but do not postdominate it. Don't abandon them, but keep this info (including the checks for no intervening reads) for after the main search. The data structure could be a Hashmap<PotentialDeadStore, ListOfDefsOrBlocksWhoFoundThisPotentialDeadStore>. If all paths are covered in the List for a PotentialDeadStore, then it's truly dead.

This seems like it should work.

By the way, do you have any thoughts on adding all calls that may throw to the MemorySSA graph, even if they are specified to not access memory?
Being able to rely on this would greatly simplify, and probably speed up, any DSE algorithm. And it is a corner case: readnone functions that may throw can't occur in C++, and I expect most other languages to be the same.

fhahn planned changes to this revision.Jan 17 2020, 9:31 AM

Regarding handling the branching example, the solution fhahn proposes sounds reasonable to me.
I was thinking something similar, along the lines of: do the checks bottom-up from both the second and third stores during the main algorithm; both find the dead store but do not postdominate it. Don't abandon them, but keep this info (including the checks for no intervening reads) for after the main search. The data structure could be a Hashmap<PotentialDeadStore, ListOfDefsOrBlocksWhoFoundThisPotentialDeadStore>. If all paths are covered in the List for a PotentialDeadStore, then it's truly dead.

This seems like it should work.

Sounds good! I'll update this patch in the next few days to improve the read-clobber checks as suggested. I'll update the follow-on patches to work with the bottom-up approach as well.

fhahn updated this revision to Diff 241544.Jan 30 2020, 12:27 PM

Address comments, rework to correctly check for reads between Current and DomAccess by visiting all uses (including non-aliasing MemoryDefs) until we reach the original def.

There are some cases where we might visit more uses than necessary (e.g. if Current does not post-dominate DomAccess), but I think it is better to get a correct and fairly complete version in first and then tune for compile-time. I will also post a bunch of follow-up changes that implement various additional cases. I measured compile-time on X86 for the full patch series (covering a lot more cases) and without much tuning the worst compile-time difference is around 1.5%.

It would be great if you could have another look.

fhahn edited the summary of this revision.
fhahn added a comment.Jan 30 2020, 6:45 PM

Regarding handling the branching example, the solution fhahn proposes sounds reasonable to me.
I was thinking something similar, along the lines of: do the checks bottom-up from both the second and third stores during the main algorithm; both find the dead store but do not postdominate it. Don't abandon them, but keep this info (including the checks for no intervening reads) for after the main search. The data structure could be a Hashmap<PotentialDeadStore, ListOfDefsOrBlocksWhoFoundThisPotentialDeadStore>. If all paths are covered in the List for a PotentialDeadStore, then it's truly dead.

This seems like it should work.

I've experimented with a different, more lightweight approach to support that case, which seems to fit more naturally: if we insert MemoryUses that clobber all locations visible to the caller just before each exit, we should be able to cover this case directly with the read checks: D73763

Some small comments.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1468

s/modeled/modelled
s/MemroySSA/MemorySSA

1470

Consider adding a cl::opt limit on the things you look to process in F. We've had this issue with generated code where a BB has on the order of thousands of stores.
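
Something along these lines (flag name and default made up for illustration):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical cl::opt capping how many MemoryDefs per basic block are
// considered as candidates before giving up on that block.
static cl::opt<unsigned> MemorySSADefsPerBlockLimit(
    "dse-memoryssa-defs-per-block-limit", cl::init(5000), cl::Hidden,
    cl::desc("The maximum number of MemoryDefs considered per basic block"));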

1592

Reduce scope of DomAccess to inside the do-while loop and use DomDef here, hence no need for the if (isa<MemoryDef>(DomAccess)) check - see previous comment (please mark them as done when updating).

1597

SmallSetVector to avoid processing duplicates?

1648

Can this happen? Wouldn't the getClobbering above have found UseAccess/UseInst instead of DomDef?

1691

s/arn't/aren't

fhahn updated this revision to Diff 242070.Feb 3 2020, 7:29 AM
fhahn marked 16 inline comments as done.
fhahn edited the summary of this revision.

Address comments, thanks!

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1468

I'm not a native speaker, but I think modeled is the US spelling and modelled is the UK spelling. I thought we prefer the US spelling, but I don't mind either way.

1470

I've added a rather generous limit for Defs per BB, but we can always adjust it later. We still have to check the whole basic block for throwing instructions.

1477

It's handled by MemorySSA; I've dropped that.

1544

I've added a test for the scenario: overlapping_read in multiblock-simple.ll

1546

Restrictions related to MemoryPhis are lifted in D72148

1597

Done. I *think* in the initial version the only nodes we might add multiple times are MemoryPhis and we bail out on the first one we see. I've added such a test to llvm/test/Transforms/DeadStoreElimination/MSSA/multiblock-memoryphis.ll

1648

I think it can happen in loops where a store in the header is killed by a store in the exit, but the location is also stored to in the loop body. I've added a few additional test cases (loop_multiple_def*) to llvm/test/Transforms/DeadStoreElimination/MSSA/multiblock-loops.ll.

I've also moved the post-dominance check into the function and updated the check here to skip PushMemUses for MemoryDefs that completely overwrite DefLoc. This helps avoid unnecessary checks.

asbirlea marked 4 inline comments as done.Feb 3 2020, 11:48 AM
asbirlea added inline comments.
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1468

Yes, you're right, I'm not a native speaker either. Thank you for the correction!

1470

sgtm

1597

I think when you process uses, you may find 2 different operands for a Def, if that Def is optimized. Something like:

1 = Def(LoE)
2 = Def (1)
3 = Def (2) - Optimized field = 1

So when adding uses of 1, one may add (2, 3); then when processing 2, (3) is a use and gets added again. This may be too convoluted to actually happen, but I thought it's worth having the safety net. Thank you for the update!

1648

I see, yes, I wasn't thinking of the case with the MemoryPhi restriction lifted.

1684

s/it's/its

fhahn updated this revision to Diff 242172.Feb 3 2020, 1:33 PM
fhahn marked 3 inline comments as done.

Change it's -> its and regenerate check lines.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1597

Thanks for the example! That should be handled now :)

1684

Updated, thanks!

Some more minor comments, but I think this is a reasonable first version to check in.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1383

s/there/there are

1523

Can you add a comment here, along the lines of:
UseInst has a MemoryDef associated in MemorySSA. It's possible for a MemoryDef to not write to memory, e.g. a volatile load is modeled as a MemoryDef.

1692

The MemDep variant of DSE also attempts to keep debug info. Does this also make sense here?

// Try to preserve debug information attached to the dead instruction.
salvageDebugInfo(*DeadInst);
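
For instance, it could slot into the removal step roughly like this (a sketch with made-up structure, not the patch's removal code):

#include "llvm/Analysis/MemorySSA.h"
#include "llvm/Analysis/MemorySSAUpdater.h"
#include "llvm/Transforms/Utils/Local.h"
using namespace llvm;

// Sketch: salvage debug info before erasing the dead store and its access.
static void deleteDeadStore(Instruction *DeadInst, MemorySSA &MSSA,
                            MemorySSAUpdater &Updater) {
  // Try to preserve debug information attached to the dead instruction.
  salvageDebugInfo(*DeadInst);
  if (MemoryAccess *MA = MSSA.getMemoryAccess(DeadInst))
    Updater.removeMemoryAccess(MA); // Keep MemorySSA in sync first.
  DeadInst->eraseFromParent();
}
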
1769

Nit: Move declaration of Next inside while condition?

Just curious..

Can you compare this solution vs. GCC's solution vs. PDSE (https://reviews.llvm.org/D29866)?

fhahn updated this revision to Diff 242423.Feb 4 2020, 1:44 PM
fhahn marked 4 inline comments as done.

Addressed latest comments, thanks!

Just curious..

Can you compare this solution vs. GCC's solution vs. PDSE (https://reviews.llvm.org/D29866)?

I unfortunately do not have in-depth knowledge of either, but I think GCC uses a similar approach (I think GCC's virtual operands are similar to MemorySSA).

IIUC PDSE tries to remove partially redundant stores, by inserting/moving stores to split points. I think that is mostly orthogonal to the MemorySSA-backed DSE in this patch, although there might be some overlap in the handled cases.

fhahn added inline comments.Feb 4 2020, 1:46 PM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1692

Yes, we should definitely do this! I've updated it and there's a debug info test that passes now.

fhahn added a comment.Feb 7 2020, 12:00 PM

Some more minor comments, but I think this is a reasonable first version to check in.

That's great, thanks for all the feedback!

asbirlea accepted this revision.Feb 7 2020, 3:06 PM

I do think there are outstanding issues that need answers, but I believe the way to make progress is to have an initial good version and iterate on it.

  1. The major issue is performance, and to start testing this out we need a working version in tree.

There are many MemorySSA and AA calls in this variant that we may be able to do better on. For example: build a walker that does a single getClobbering query with all the preconditions on what instructions are safe to skip, instead of doing this in a loop here.

  2. Add all missing cases from the add-on patches and get parity with, or eliminate more stores than, the current DSE. Merge tests when this happens to make this clear.
  3. The performance problem's scope goes beyond DSE. The current pass pipeline for both pass managers has a sequence of (gvn, memcpyopt, dse), where all of these use MemDepAnalysis. Switching DSE to MemorySSA may initially get worse compile times, as we need to build both MemDepAnalysis and MemorySSA, but switching all three (use NewGVN and port memcpyopt and dse) may be worthwhile. This is the long-term goal I have in mind.

Thanks for all the work on this!

This revision is now accepted and ready to land.Feb 7 2020, 3:06 PM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Feb 10 2020, 4:45 AM

I do think there are outstanding issues that need answers, but I believe the way to make progress is to have an initial good version and iterate on it.

Great, thanks for all the helpful comments!

  1. The major issue is performance, and to start testing this out we need a working version in tree.

There are many MemorySSA and AA calls in this variant that we may be able to do better on. For example: build a walker that does a single getClobbering query with all the preconditions on what instructions are safe to skip, instead of doing this in a loop here.

Agreed! I think it makes sense to look into perf-tuning once we cover most cases legacy DSE handles. Otherwise perf comparisons might be a bit skewed.

  2. Add all missing cases from the add-on patches and get parity with, or eliminate more stores than, the current DSE. Merge tests when this happens to make this clear.

One missing piece that is not covered yet in the patch series is load->store forwarding. If anybody is interested in pushing this forward, that would be great! Otherwise I'll get to that in a bit.

  3. The performance problem's scope goes beyond DSE. The current pass pipeline for both pass managers has a sequence of (gvn, memcpyopt, dse), where all of these use MemDepAnalysis. Switching DSE to MemorySSA may initially get worse compile times, as we need to build both MemDepAnalysis and MemorySSA, but switching all three (use NewGVN and port memcpyopt and dse) may be worthwhile. This is the long-term goal I have in mind.

Yes, one follow-up to MSSA-backed DSE is MSSA-backed MemCpyOpt. My hope is that we can re-use/share some or most of the walking strategy & safety checks between DSE and MemCpyOpt.

Thanks for all the work on this!