This is an archive of the discontinued LLVM Phabricator instance.

[DSE,MSSA] Relax post-dom restriction for objs visible after return.
ClosedPublic

Authored by fhahn on Apr 27 2020, 7:56 AM.

Details

Summary

This patch relaxes the post-dominance requirement for accesses to
objects visible after the function returns.

Instead of requiring the killing def to post-dominate the access to
eliminate, the set of 'killing blocks' (= blocks that completely
overwrite the original access) is collected.

If all paths from the access to eliminate to an exit block go through a
killing block, the access can be removed.

To check this property, we first get the common post-dominator block for
the killing blocks. If this block does not post-dominate the access
block, there may be a path from DomAccess to an exit block not involving
any killing block.

Otherwise we have to check if there is a path from DomAccess to the common
post-dominator that does not contain a killing block. If there is no such
path, we can remove DomAccess. For this check, we start at the common
post-dominator and then traverse the CFG backwards. Paths are terminated
when we hit a killing block or a block that is not executed between
DomAccess and a killing block according to the post-order numbering (if the
post-order number of a block is greater than the one of DomAccess, the block
cannot be in a path starting at DomAccess).
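
In pseudo-code, the check looks roughly like the sketch below (a simplified illustration; the function name, parameter names, and the fixed limit of 50 are assumptions for the example, not the exact code in the patch):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/Analysis/PostDominators.h"
#include "llvm/IR/CFG.h"
using namespace llvm;

// Sketch of the CFG check described above. PostOrderNumbers is assumed to be
// filled once per function via post_order(&F), KillingBlocks holds the blocks
// that completely overwrite DomAccess, and CommonPred is the nearest common
// post-dominator of the killing blocks.
static bool killedOnAllExitPaths(
    const MemoryAccess *DomAccess, BasicBlock *CommonPred,
    const SmallPtrSetImpl<BasicBlock *> &KillingBlocks,
    const DenseMap<BasicBlock *, unsigned> &PostOrderNumbers,
    const PostDominatorTree &PDT) {
  BasicBlock *DomBB = DomAccess->getBlock();
  // If the common post-dominator does not post-dominate DomAccess's block,
  // some path from DomAccess to an exit avoids every killing block.
  if (!PDT.dominates(CommonPred, DomBB))
    return false;

  // Walk the CFG backwards from the common post-dominator. If we can reach
  // DomAccess's block without crossing a killing block, there is a path on
  // which the original access is still live.
  unsigned DomNum = PostOrderNumbers.lookup(DomBB);
  SetVector<BasicBlock *> Worklist(pred_begin(CommonPred),
                                   pred_end(CommonPred));
  for (unsigned I = 0; I < Worklist.size(); ++I) {
    if (I >= 50) // illustrative cut-off; made configurable later in the review
      return false;
    BasicBlock *Cur = Worklist[I];
    if (KillingBlocks.count(Cur))
      continue; // this path is terminated by a killing block
    if (Cur == DomBB)
      return false; // reached DomAccess without hitting a killing block
    // A post-order number greater than DomAccess's means the block cannot be
    // on a path from DomAccess to an exit; do not follow it.
    if (PostOrderNumbers.lookup(Cur) > DomNum)
      continue;
    Worklist.insert(pred_begin(Cur), pred_end(Cur));
  }
  return true; // every path from DomAccess to an exit goes through a kill
}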

This gives the following improvements on the total number of stores
after DSE for MultiSource, SPEC2K, SPEC2006:

Tests: 237
Same hash: 206 (filtered out)
Remaining: 31
Metric: dse.NumRemainingStores

Program base new100 diff
test-suite...CFP2000/188.ammp/188.ammp.test 3624.00 3544.00 -2.2%
test-suite...ch/g721/g721encode/encode.test 128.00 126.00 -1.6%
test-suite.../Benchmarks/Olden/mst/mst.test 73.00 72.00 -1.4%
test-suite...CFP2006/433.milc/433.milc.test 3202.00 3163.00 -1.2%
test-suite...000/186.crafty/186.crafty.test 5062.00 5010.00 -1.0%
test-suite...-typeset/consumer-typeset.test 40460.00 40248.00 -0.5%
test-suite...Source/Benchmarks/sim/sim.test 642.00 639.00 -0.5%
test-suite...nchmarks/McCat/09-vor/vor.test 642.00 644.00 0.3%
test-suite...lications/sqlite3/sqlite3.test 35664.00 35563.00 -0.3%
test-suite...T2000/300.twolf/300.twolf.test 7202.00 7184.00 -0.2%
test-suite...lications/ClamAV/clamscan.test 19475.00 19444.00 -0.2%
test-suite...INT2000/164.gzip/164.gzip.test 2199.00 2196.00 -0.1%
test-suite...peg2/mpeg2dec/mpeg2decode.test 2380.00 2378.00 -0.1%
test-suite.../Benchmarks/Bullet/bullet.test 39335.00 39309.00 -0.1%
test-suite...:: External/Povray/povray.test 36951.00 36927.00 -0.1%
test-suite...marks/7zip/7zip-benchmark.test 67396.00 67356.00 -0.1%
test-suite...6/464.h264ref/464.h264ref.test 31497.00 31481.00 -0.1%
test-suite...006/453.povray/453.povray.test 51441.00 51416.00 -0.0%
test-suite...T2006/401.bzip2/401.bzip2.test 4450.00 4448.00 -0.0%
test-suite...Applications/kimwitu++/kc.test 23481.00 23471.00 -0.0%
test-suite...chmarks/MallocBench/gs/gs.test 6286.00 6284.00 -0.0%
test-suite.../CINT2000/254.gap/254.gap.test 13719.00 13715.00 -0.0%
test-suite.../Applications/SPASS/SPASS.test 30345.00 30338.00 -0.0%
test-suite...006/450.soplex/450.soplex.test 15018.00 15016.00 -0.0%
test-suite...ications/JM/lencod/lencod.test 27780.00 27777.00 -0.0%
test-suite.../CINT2006/403.gcc/403.gcc.test 105285.00 105276.00 -0.0%

There might be potential to pre-compute, for each block, some of the
information about which blocks lie on a path to an exit, but the overall
benefit might be comparatively small.

On the set of benchmarks, we reach the CFG check 20322 times, and in 15738
of those cases the check is successful. The total number of iterations in
the CFG check is 187810, so on average we need fewer than 10 steps in the
check loop. Bumping the threshold in the loop from 50 to 150 gives a few
small improvements, but I don't think they warrant such a big bump at the
moment. This is all pending further tuning in the future.

Diff Detail

Event Timeline

fhahn created this revision. Apr 27 2020, 7:56 AM
Herald added a project: Restricted Project. Apr 27 2020, 7:56 AM

The PostDominatorTree doesn't include implicit unwind edges. I'm not completely sure whether your change handles that correctly... but either way, please make sure we have testcases.

fhahn added a comment. Apr 28 2020, 2:04 PM

The PostDominatorTree doesn't include implicit unwind edges. I'm not completely sure whether your change handles that correctly... but either way, please make sure we have testcases.

To double check, as I am not too familiar with the exception handling stuff, are you referring to an example like the one below, where ehcleanup exits to the caller through unwinding? IIUC the problematic scenario would be when we have a store before an invoke which gets overwritten on the path to the regular exit, but not on the exception path. Calls to throwing functions currently block DSE, and I think that would also prevent elimination in the cases mentioned before (even if the store is overwritten on the unwind edges too).

define void @test1(i32* %p, i1 %c) personality i32 (...)* @__CxxFrameHandler3 {
entry:
  store i32 2, i32* %p
  br i1 %c, label %exit, label %invoke.cont

invoke.cont:
  invoke void @throw()
  to label %exit unwind label %catch.dispatch

catch.dispatch:                                   ; preds = %invoke.cont
  %cs = catchswitch within none [label %invoke.cont1] unwind label %ehcleanup

invoke.cont1:                                     ; preds = %catch.dispatch
  %catch = catchpad within %cs [i8* null, i32 64, i8* null]
  invoke void @throw() [ "funclet"(token %catch) ]
  to label %exit unwind label %ehcleanup

ehcleanup:                                        ; preds = %invoke.cont1, %catch.dispatch
  %cleanup = cleanuppad within none []
  call void @release(i64 0) [ "funclet"(token %cleanup) ]
  cleanupret from %cleanup unwind to caller

exit:                                             ; preds = %entry, %invoke.cont, %invoke.cont1
  store i32 1, i32* %p
  ret void
}

declare i32 @__CxxFrameHandler3(...)

declare void @throw()

declare void @release(i64)

postdominance should work with invokes. I was thinking more of plain calls that unwind. But now I'm remembering: MemorySSA treats a potentially throwing call as a read of the value, even if the call itself can't actually read the value. But still, the new code isn't consistent with the mayThrowBetween() check. I forget the end result of the discussion of whether we still need that check at all.

postdominance should work with invokes. I was thinking more of plain calls that unwind. But now I'm remembering: MemorySSA treats a potentially throwing call as a read of the value, even if the call itself can't actually read the value. But still, the new code isn't consistent with the mayThrowBetween() check. I forget the end result of the discussion of whether we still need that check at all.

I've pushed additional tests for that scenario (e018b8bbb0ba).

mayThrowInBetween is needed for cases where we have may-throw functions marked as readnone. Those won't be part of MemorySSA, so the extra check is needed. We have to ensure that there are no throws between the access to eliminate and any of the killing blocks. mayThrowInBetween currently conservatively returns true if there are any throwing blocks, so for now it should be fine. If mayThrowInBetween gets more powerful in the future, we have to make sure all killing blocks are checked. I can add a TODO.
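
For illustration, the conservative behavior described above corresponds to something like the following sketch (the parameter names, the ThrowingBlocks set, and the same-block special case are assumptions for the example, not necessarily the exact helper in the patch):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Sketch: ThrowingBlocks is assumed to be populated up front with every block
// that contains an instruction that may throw, including may-throw calls
// marked readnone, which are invisible to MemorySSA.
static bool mayThrowInBetween(Instruction *KillingI, Instruction *DeadI,
                              const SmallPtrSetImpl<BasicBlock *> &ThrowingBlocks) {
  // Same block: only throws in that block matter.
  if (DeadI->getParent() == KillingI->getParent())
    return ThrowingBlocks.count(DeadI->getParent());
  // Different blocks: conservatively give up as soon as the function contains
  // any throwing block at all, which implicitly covers all killing blocks.
  return !ThrowingBlocks.empty();
}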

Okay, makes sense.

fhahn added a comment. May 4 2020, 11:17 AM

ping. @george.burgess.iv do you think this patch is along the lines you suggested while reviewing D73763?

Sorry for the latency -- a bit busy now, but I hope to get to this by EOD Wednesday :)

Thanks for the patch :)

I haven't reasoned through this fully, but the overall idea of "mark all killing blocks, then walk from the exits and see that all of them hit any of the killing blocks (or never pass through the block with the store)" seems functionally correct to me. I have a first round of comments/questions; happy to dig deeper on whatever direction we decide to head in.

do you think this patch is along the lines you suggested

What I envisioned was more or less keeping a SmallPtrSet<MemoryAccess *, N> populated with potentially-terminating Defs/Phis (e.g., for each Def/Phi in there, there exists a path in the function such that that's the last Def/Phi before the function exits). When walking for memory which is readable after the function exits, if we would call PushMemUses on something in this set, we give up. This set can be built once lazily, and then queried for the rest of the life of run(). We could build it by {B,D}FS'ing starting at all exit blocks for our current Function, and terminating the search each time we discover any MemoryDefs/MemoryPhis in a block (we have special use-lists for these, so this is a constant-time query per block).
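
Roughly something like the following sketch (names are made up; this is not code from the patch or from MemorySSA itself):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Sketch: collect the MemoryDefs/MemoryPhis that can be the last Def/Phi on
// some path to a function exit, by walking the CFG backwards from all exit
// blocks and stopping at the first block that contains any Defs/Phis.
static SmallPtrSet<const MemoryAccess *, 16>
collectPotentiallyTerminatingAccesses(Function &F, MemorySSA &MSSA) {
  SmallPtrSet<const MemoryAccess *, 16> Terminating;
  SmallVector<BasicBlock *, 8> Worklist;
  SmallPtrSet<BasicBlock *, 16> Seen;

  // Seed with all exit blocks (blocks without successors).
  for (BasicBlock &BB : F)
    if (succ_empty(&BB))
      Worklist.push_back(&BB);

  while (!Worklist.empty()) {
    BasicBlock *BB = Worklist.pop_back_val();
    if (!Seen.insert(BB).second)
      continue;
    // getBlockDefs is the per-block list of MemoryDefs/MemoryPhis mentioned
    // above; querying it is constant time per block.
    if (const auto *Defs = MSSA.getBlockDefs(BB)) {
      // The last Def/Phi in this block is the last one before the exit on
      // the path we walked backwards from; remember it and stop here.
      Terminating.insert(&Defs->back());
      continue;
    }
    // No Defs/Phis here; keep walking towards the entry.
    for (BasicBlock *Pred : predecessors(BB))
      Worklist.push_back(Pred);
  }
  return Terminating;
}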

If there aren't holes in it, I think that approach would be generally faster/simpler, but it also introduces the problem of cache invalidation. Cache invalidation makes many things harder, including writing tests + making test-cases. :)

Assuming the development practice we're trying to follow is still the "build what's needed, get feedback, iterate" approach that you mentioned on an earlier review, I'm happy to chalk that up as a future optimization if you'd prefer this approach.

if the post-order number of a block is greater than the one
of DomAccess, the block cannot be in a path starting at DomAccess

I'm not certain I'm interpreting "a path starting at" correctly, but the way I'm reading it, I don't agree:

  A
  |
  B<-----
 / \    |
C   D   |
    |   |
    E --|

(where all edges are downward except E -> B)

C = 1, E = 2, D = 3, B = 4, and A = 5, yet there's a path from E -> B, no? Admittedly, I'm unsure at this point if this is pedantry on my part, or if there are correctness implications on the algorithm as presented.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1664

Accesses to objects that are

was this accidentally left in?

1745–1756

nit: looks like we're only using KillingBlocks if DefVisibleToCallerAfterRet. Can we save work in the !DefVisibleToCallerAfterRet case by skipping this bit of the if, then?

1753

It's a bit odd to read

if (isCompleteOverwrite(...)) {
  if (isOverwrite(...) == OW_Complete) {
    // ...
  }
}

Is there a subtle difference between the two that I'm missing?

1778

nit: for consistency with elsewhere, can we use .count()?

1784

tiny nits:

  • one space between // and the comment please
  • also, s/post-dominates/post-dominate/
1799

I'm unsure how we reach this case. AFAICT, PDT.dominates(nullptr, AnythingElse) == false, since the PDT should fail to look up the node for nullptr, no?

1822

nit: 50 is a bit of a magic number to give up at here. Is it worth making it a -mllvm option (or perhaps hoisting it to a constant with a bit of prose about how it was chosen)?

fhahn updated this revision to Diff 262711. May 7 2020, 11:42 AM
fhahn marked 8 inline comments as done.

Address comments, thanks!

fhahn marked an inline comment as done. May 7 2020, 11:57 AM

Thanks for the patch :)

Thank you very much for taking the time to take a look :)

I haven't reasoned through this fully, but the overall idea of "mark all killing blocks, then walk from the exits and see that all of them hit any of the killing blocks (or never pass through the block with the store)" seems functionally correct to me. I have a first round of comments/questions; happy to dig deeper on whatever direction we decide to head in.

do you think this patch is along the lines you suggested

What I envisioned was more or less keeping a SmallPtrSet<MemoryAccess *, N> populated with potentially-terminating Defs/Phis (e.g., for each Def/Phi in there, there exists a path in the function such that that's the last Def/Phi before the function exits). When walking for memory which is readable after the function exits, if we would call PushMemUses on something in this set, we give up. This set can be built once lazily, and then queried for the rest of the life of run(). We could build it by {B,D}FS'ing starting at all exit blocks for our current Function, and terminating the search each time we discover any MemoryDefs/MemoryPhis in a block (we have special use-lists for these, so this is a constant-time query per block).

Oh, that's convenient; I was not aware of that. That should indeed make it quite straightforward to build this set.

If there aren't holes in it, I think that approach would be generally faster/simpler, but it also introduces the problem of cache invalidation. Cache invalidation makes many things harder, including writing tests + making test-cases. :)

Assuming the development practice we're trying to follow is still the "build what's needed, get feedback, iterate" approach that you mentioned on an earlier review, I'm happy to chalk that up as a future optimization if you'd prefer this approach.

IIUC, this approach is mostly the same as in the original patch, just with a different way of detecting 'last' MemoryDefs, right? I think there potentially are cases that would be handled more efficiently with the current approach (e.g. if there are a large number of memoryDefs/uses to traverse). If my understanding is correct, the alternative approach should be relatively straightforward to implement. I would be happy to iterate on that once we have a correct version in tree (that makes balancing the patches and benchmarking a bit easier I think), but I could also try to do it up-front, if that's preferred.

if the post-order number of a block is greater than the one
of DomAccess, the block cannot be in a path starting at DomAccess

I'm not certain I'm interpreting "a path starting at" correctly, but the way I'm reading it, I don't agree:

Sorry, the statement is a bit incomplete, I think. It was meant to refer to 'all paths to any exit block, starting at DomAccess'. In the case you described, there technically is a path through DomAccess, but that should be fine, as we already check all paths from DomAccess to any exit.

  A
  |
  B<-----
 / \    |
C   D   |
    |   |
    E --|

(where all edges are downward except E -> B)

C = 1, E = 2, D = 3, B = 4, and A = 5, yet there's a path from E -> B, no? Admittedly, I'm unsure at this point if this is pedantry on my part, or if there are correctness implications on the algorithm as presented.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1664

Yes, removed!

1753

Ah right, I've refactored isCompleteOverwrite to just use isOverwrite().

1799

We can hit this scenario when there's no single exit block that post-dominates all killing blocks, like in the example below. Instead the special 'nullptr' block is used, but for starting the traversal we need to start with all exit blocks, I think.

define void @test12(i32* %P) {
  store i32 0, i32* %P
  br i1 true, label %bb1, label %bb2

bb1:                                              ; preds = %0
  store i32 1, i32* %P
  br label %bb3

bb2:                                              ; preds = %0
  store i32 1, i32* %P
  ret void

bb3:                                              ; preds = %bb1
  ret void
}
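
For illustration, seeding the backward walk in that case could look roughly like this (a sketch with assumed names, not the exact code in the patch):

#include "llvm/ADT/SetVector.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Sketch: seed the backward walk either with the predecessors of the common
// post-dominator or, when the killing blocks have no common post-dominator
// inside the function (CommonPred is nullptr, i.e. the virtual root), with
// all exit blocks of the function.
static SetVector<BasicBlock *> seedWorklist(Function &F, BasicBlock *CommonPred) {
  SetVector<BasicBlock *> Worklist;
  if (CommonPred) {
    Worklist.insert(pred_begin(CommonPred), pred_end(CommonPred));
  } else {
    for (BasicBlock &BB : F)
      if (succ_empty(&BB))
        Worklist.insert(&BB);
  }
  return Worklist;
}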
1822

I've made it into an option, thanks!
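
For reference, such a limit is usually exposed via cl::opt; the flag name and default below are illustrative and may not match the committed patch exactly:

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Illustrative flag name and default; the committed option may differ.
static cl::opt<unsigned> MemorySSAPathCheckLimit(
    "dse-memoryssa-path-check-limit", cl::init(50), cl::Hidden,
    cl::desc("Maximum number of blocks to check when trying to prove that "
             "all paths from a dead store to a function exit go through a "
             "block that completely overwrites it"));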

Sorry for the latency :)

I would be happy to iterate on that once we have a correct version in tree (that makes balancing the patches and benchmarking a bit easier I think), but I could also try to do it up-front, if that's preferred.

WFM.

I think there potentially are cases that would be handled more efficiently with the current approach (e.g. if there are a large number of memoryDefs/uses to traverse)

Agreed, though again, the set-based approach only needs to be built once whereas this one needs to happen on every query, so it all depends on how many times we fall back to this codepath, how large the function is, and how sparse the Defs/Phis in the function are.

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1799

Sorry, I should've been a bit more clear. Specifically, this code is guarded by PDT.dominates(CommonPred, DomAccess->getBlock()). DomAccess->getBlock() should never be nullptr and PDT.dominates(nullptr, not_nullptr) should always be false AFAIK, so I think we should never hit this else if !CommonPred? Or maybe we do for DomAccesses which are in unreachable code (but in that case, do we care to do extra work?)

fhahn updated this revision to Diff 265771. May 22 2020, 10:38 AM
fhahn marked an inline comment as done.

Rebased

Sorry for the latency :)

I would be happy to iterate on that once we have a correct version in tree (that makes balancing the patches and benchmarking a bit easier I think), but I could also try to do it up-front, if that's preferred.

WFM.

Great!

I think there potentially are cases that would be handled more efficiently with the current approach (e.g. if there are a large number of memoryDefs/uses to traverse)

Agreed, though again, the set-based approach only needs to be built once whereas this one needs to happen on every query, so it all depends on how many times we fall back to this codepath, how large the function is, and how sparse the Defs/Phis in the function are.

Yep, we definitely should collect data on that. Hopefully I'll be able to put the alternative together in the next week or so. As said earlier, it would be easier to get the first approach in as a baseline.

fhahn added inline comments. May 22 2020, 10:40 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1799

PDT.dominates(nullptr, not_nullptr) should always be false

Hm, IIUC nullptr stands for the (virtual) root node in the post-dominator tree. Unless I am missing something, I think PDT.dominates(nullptr, not_nullptr) should always be true, i.e. each node in the function is post-dominated by the (virtual) root.

fhahn marked an inline comment as done. May 29 2020, 7:36 AM

ping :)

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1799

@george.burgess.iv does the above make sense?

george.burgess.iv accepted this revision. Jun 9 2020, 8:33 PM

Eesh -- sorry for taking forever. :)

I think my questions are answered, and seeing how this works SGTM. If we need something faster (or wanna experiment with other approaches), I think we have a solid way forward on that.

That said, LGTM. Thanks again!

llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1799

Yup!

This revision is now accepted and ready to land. Jun 9 2020, 8:33 PM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment. Jun 10 2020, 2:54 AM

Eesh -- sorry for taking forever. :)

I think my questions are answered, and seeing how this works SGTM. If we need something faster (or wanna experiment with other approaches), I think we have a solid way forward on that.

That said, LGTM. Thanks again!

Awesome, thanks! I think there are 2 more patches or so to catch most of the outstanding cases handled by legacy DSE. After that I'll move to compile-time tuning.