This is an archive of the discontinued LLVM Phabricator instance.

[DeadStoreElimination] Add support for non-local DSE
ClosedPublic

Authored by zzheng on Oct 1 2015, 1:16 PM.

Details

Summary

We extend the search for redundant stores to predecessor blocks that
unconditionally lead to the block BB with the current store instruction.
That also includes single-block loops that unconditionally lead to BB,
and if-then-else blocks where then- and else-blocks unconditionally lead
to BB.

Diff Detail

Repository
rL LLVM

Event Timeline

ivanbaev updated this revision to Diff 36285.Oct 1 2015, 1:16 PM
ivanbaev retitled this revision from to [DeadStoreElimination] Add support for DSE across blocks.
ivanbaev updated this object.
ivanbaev added a subscriber: llvm-commits.

Note that the code for BB DSE stays the same, it is just indented two spaces to the right.
The new code begins at:

// DSE across BB
else if (InstDep.isNonLocal()) { ...

and it includes helper FindUncondPredsIncludingSimpleLoops().

mbodart added a subscriber: mbodart.Oct 1 2015, 3:02 PM
eeckstein edited edge metadata.Oct 6 2015, 1:00 PM

Some comments:

*) Your algorithm has quadratic complexity in worst case. I know this is the same as for the single block algorithm, but extending it to multiple blocks might worsen the situation. Maybe you add some sort of upper bound for the number of dependency checks?

*) Your algorithm only handles certain types of cross-block DSE. As I understood, it cannot go beyond an if-then-else control flow split. I'm wondering if it makes sense to do a simple data flow analysis to collect the eligible predecessor blocks. This would handle all cases and I think it would be not much more implementation effort and doesn't add compile time.

*) I found it confusing that all the code is contained in this large functions with many nested loops and breaks and continues. I think it's better to extract some of the code to separate functions (e.g. to avoid the StopProcessPBAndItsPreds variable).

*) The tests look like they contain much code which is not really necessary to test what you want to test. Maybe you can reduce them. Also in the first test a comment would help, which describes why no DSE can be done.

ivanbaev updated this revision to Diff 36999.Oct 9 2015, 3:52 PM
ivanbaev edited edge metadata.

Addressed Erik's comments:
*) Your algorithm has quadratic complexity in worst case. I know this is the same as for the single block algorithm, but extending it to multiple blocks might worsen the situation. Maybe you add some sort of upper bound for the number of dependency checks?

  • An upper bound (MaxNonLocalAttempts = 100) is added on the number of non-local DSE attempts per function.

*) Your algorithm only handles certain types of cross-block DSE. As I understood, it cannot go beyond an if-then-else control flow split. I'm wondering if it makes sense to do a simple data flow analysis to collect the eligible predecessor blocks. This would handle all cases and I think it would be not much more implementation effort and doesn't add compile time.

  • Analyzing general loops is challenging and compile-time extensive due to underlying MemoryDependenceAnalysis. This patch iteratively goes through predecessors that unconditionally lead to the block with the store and search for redundant stores. Not sure if there are opportunities within a reasonable compile-time budget for a truly global DSE.

*) I found it confusing that all the code is contained in this large functions with many nested loops and breaks and continues. I think it's better to extract some of the code to separate functions (e.g. to avoid the StopProcessPBAndItsPreds variable).

  • The code is refactored with top algorithm extracted in HandleNonLocalDependency(Instruction *Inst).

*) The tests look like they contain much code which is not really necessary to test what you want to test. Maybe you can reduce them. Also in the first test a comment would help, which describes why no DSE can be done.

  • I reduced the cycle test to a much smaller version, and added comments.

In general I think your approach of using a reverse CFG search for this optimization is appropriate. But as Erik points out it doesn't handle if-then-else.
That can be addressed relatively easily:

Let StartBB be the block with the original StoreInst.
Let SafeBlocks be the set of blocks for which we know all paths emanating from them will lead to StoreInst
with no intervening reads of the memory location of interest. Initially, add StartBB to SafeBlocks.

The criteria for adding a new predecessor P to your WorkList are that P is reachable, P is not in SafeBlocks (not visited),
and all of P's successors are P itself (the single-block loop case) or are in SafeBlocks.

Seed your WorkList with the predecessors of StartBB that meet this criteria.
After processing a block B in HandleNonLocalDependency, when DependencyFound is false,
add B to SafeBlocks, and augment the WorkList with B's predecessors that meet the criteria.

The "P is not in SafeBlocks" acts like a "! Visited" check, except that terminal (unsafe) blocks may be
visited multiple times. To avoid that redundancy we'd need another set (UnsafeBlocks, or perhaps just VisitedBlocks).

As to the compile time threshold, I think it needs to be applied to the number of dependences and blocks seen
in HandleNonLocalDependency. That way we'll put a cap on the cost of optimizing an individual store,
rather than limiting the number of store instructions that may get optimized. I don't know what LLVM's
philosophy is on adding new command line options, but perhaps the threshold value should be overridable
that way. And perhaps a cl::opt<bool> EnableGDSE("enable-gdse"...) to control this global dead store elimination.

As for the while loop in runOnBasicBlock, I would prefer that it keep track of whether any reads were seen,
including the isSelfRead case, and then we handle non-local dependences after that loop terminates.
This is the same concept as your DependencySeen tracking in HandleNonLocalDependency, though I might
name it SafeToProcessNonLocalDeps.

lib/Transforms/Scalar/DeadStoreElimination.cpp
857

Instead of InstPt, shouldn't the search start with PB->end()?
It looks like getPointerDependencyFrom does a "--" operation on this iterator
to get the first instruction to start its analysis from.

900

I don't understand why we are stopping the search here.
If the overwrite is partial, there may still be preceding stores
that could be eliminated.

906

Shouldn't this check also include && Dep.isNonLocal()?

ivanbaev updated this revision to Diff 37712.Oct 18 2015, 5:23 PM
ivanbaev updated this object.

The new version now handles a redundant store in the if-block in if-then-else - based on a suggestion by Mitch. We use SafeBlocks set, which is initialized and updated in DSE::HandleNonLocalDependency() and used in helper FindSafePreds(). SafeBlocks keeps track of all predecessor blocks that are safe to search through. A new test - ifthenelse.ll - has been added.

Thanks for suggestions/comments, Mitch. My responses are inlined below.

In general I think your approach of using a reverse CFG search for this optimization is appropriate. But as Erik points out it doesn't handle if-then-else.
That can be addressed relatively easily:

Let StartBB be the block with the original StoreInst.
Let SafeBlocks be the set of blocks for which we know all paths emanating from them will lead to StoreInst
with no intervening reads of the memory location of interest. Initially, add StartBB to SafeBlocks.

The criteria for adding a new predecessor P to your WorkList are that P is reachable, P is not in SafeBlocks (not visited),
and all of P's successors are P itself (the single-block loop case) or are in SafeBlocks.

Seed your WorkList with the predecessors of StartBB that meet this criteria.
After processing a block B in HandleNonLocalDependency, when DependencyFound is false,
add B to SafeBlocks, and augment the WorkList with B's predecessors that meet the criteria.

The "P is not in SafeBlocks" acts like a "! Visited" check, except that terminal (unsafe) blocks may be
visited multiple times. To avoid that redundancy we'd need another set (UnsafeBlocks, or perhaps just VisitedBlocks).

  • I added set SafeBlocks to keep track of all predecessor blocks that are safe to search through.

It is seeded with BB, the block with the original store. If no Def/Clobber dependency is found in predecessor PB, then we add PB to SafeBlocks.
Helper FindSafePreds(Blocks, SafeBlocks, PB, DT) then uses SafeBlocks to add blocks with two successors (e.g. an if-block) to the worklist.

As to the compile time threshold, I think it needs to be applied to the number of dependences and blocks seen
in HandleNonLocalDependency. That way we'll put a cap on the cost of optimizing an individual store,
rather than limiting the number of store instructions that may get optimized. I don't know what LLVM's
philosophy is on adding new command line options, but perhaps the threshold value should be overridable
that way. And perhaps a cl::opt<bool> EnableGDSE("enable-gdse"...) to control this global dead store elimination.

As for the while loop in runOnBasicBlock, I would prefer that it keep track of whether any reads were seen,
including the isSelfRead case, and then we handle non-local dependences after that loop terminates.
This is the same concept as your DependencySeen tracking in HandleNonLocalDependency, though I might
name it SafeToProcessNonLocalDeps.

  • Use of static const unsigned MaxNonLocalAttempts = 100; in DSE is similar to the use of static const unsigned MaxIVUsers = 200; in LoopStrengthReduce.

Note that the large portion of DSE compile-time is due not to proper DSE but to the underlying MemoryDependenceAnalysis,
especially for computing the non-local dependency information. Controlling computation in MemoryDependenceAnalysis is
outside the scope of DSE. Given the iterative way of adding predecessors and searching them locally for non-local DSE
I don't expect a substantial increase in DSE compile-time.

lib/Transforms/Scalar/DeadStoreElimination.cpp, 857
Instead of InstPt, shouldn't the search start with PB->end()?
It looks like getPointerDependencyFrom does a "--" operation on this iterator
to get the first instruction to start its analysis from.

  • Each BB ends with a terminator instruction and these should not have contain dependencies for the store.

See also DSE::HandleFree().

line 900
I don't understand why we are stopping the search here.
If the overwrite is partial, there may still be preceding stores
that could be eliminated.

  • In the local BB case, "if (OR == OverwriteComplete) ..." is followed by

"else if (OR == OverwriteEnd && isShortenable(DepWrite)) ..."
I agree that we can be more aggressive here but suggest this to be added by a separate patch.

line 906
Shouldn't this check also include && Dep.isNonLocal()?

  • We know that the first dependence - which triggers invocation of HandleNonLocalDependency(Inst) - is non-local.

In HandleNonLocalDependency() we use MemoryDependenceAnalysis::getPointerDependencyFrom()
to performs local search: it walks backwards through the basic block, looking for dependencies.

In each iteration of:
while (!Blocks.empty()) {

  BasicBlock *PB = Blocks.pop_back_val();
  ...
  MemDepResult Dep = MD->getPointerDependencyFrom(Loc, false, InstPt, PB, SI);
  ...
}

we are looking for (local) dependencies within PB for the original store instruction. A non-local dependency found
when we search PB should not impact how we process PB.

majnemer added inline comments.
lib/Transforms/Scalar/DeadStoreElimination.cpp
88

Functions should start with a lower case letter.

616

Pointers go on the right side, please use auto * to avoid repetition.

627–630

Pointers lean right.

653–656

Please keep the else if on the same line as the closing }

804

Functions should start with a lowercase letter.

807–808

Can you use a range-base for loop with predecessors ?

812–820

It would be nicer to split this into more if (X) continue; to reduce the heavy nesting.

831–834

Could you use a range based for loop here?

846

Please use auto *.

857

Could this be cached in the constructor?

876

Please use auto *.

ivanbaev updated this revision to Diff 37824.Oct 19 2015, 7:24 PM

changes for style and coding standards

ivanbaev marked 8 inline comments as done.Oct 20 2015, 6:45 PM

ping?

lib/Transforms/Scalar/DeadStoreElimination.cpp
616

this is actually the current code; done

627–630

this is actually the current code; done

bruno added a subscriber: bruno.Oct 21 2015, 2:01 PM

Hi Ivan,

Thanks for working on this. It would be good to see some LNT compile time results of this patch beforehand since DSE triggered compile time bloat in the past.

majnemer added inline comments.Oct 21 2015, 2:40 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
804–805

This has yet to be addressed.

809–817

The nesting here is still very large. I'm also not entirely convinced that this code is correct. What if PredTI->getNumSuccessors() == 1 is true but PredTI->getSuccessor(0) == Pred is false? In that case, won't you access PredTI->getSuccessor(1) == Pred ? It would be safer and more clear to simply loop over successors.

828–831

This hasn't been addressed.

854

This hasn't been addressed.

ivanbaev marked 2 inline comments as done.Oct 22 2015, 9:58 AM
ivanbaev added inline comments.
lib/Transforms/Scalar/DeadStoreElimination.cpp
804–805

This is similar to the code for iterating over predecessors in existing void FindUnconditionalPreds() function. I suggest to change both in a separate patch.

809–817

if PredTI->getNumSuccessors() == 1 is true, then we exit the if statement checks at this point, so we will not access PredTI->getSuccessor(0), etc.

Do you think that the following is better?
if (PredTI->getNumSuccessors() == 1) {

PredBlocks.push_back(Pred);
continue;

}
if (PredTI->getNumSuccessors() == 2 &&

(PredTI->getSuccessor(0) == Pred || PredTI->getSuccessor(1) == Pred ||
(SafeBlocks.count(PredTI->getSuccessor(0)) &&
    SafeBlocks.count(PredTI->getSuccessor(1)))))
PredBlocks.push_back(Pred);
828–831

There are about 8 places in DeadStoreElimination.cpp where we can use range based for loops. I suggest these changes to be made in a separate patch.

854

That is a good point. There are many places in DSE where DataLayout is used or passed as argument. However, it seems that most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way.

mbodart added inline comments.Oct 22 2015, 12:43 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
572

I didn't see a response to my suggestion to keep track of SafeToProcessNonLocalDeps in this loop. It seems incongruous and unnecessarily limiting to let any isDef or isClobber dependency here inhibit global DSE, while HandleNonLocalDependency tolerates them as long as they don't interfere. That said, it may be better left for a future change set, as it will slightly complicate this loop.

653

For safety I think we also need to check that Inst is not a self read.

809–817

I second the suggestion to use a Successor iterator here.
Additionally, I think the checks for Pred == BB and DT->dominates(BB, Pred) are unnecessary.
Note that BB (and StartBB) should already be in SafeBlocks, so the SafeBlocks.count checks should provide sufficient safety. I would think the routine simplifies to something like this:

for each pred P of BB

if P is reachable and !SafeBlocks.count(P))
    PredOK = true;
    for each succ S of P
        if !SafeBlocks.count(S) && S != P
            PredOK = false;
    if PredOK, add P to PredBlocks
821

I'm still ramping up on LLVM and thus not deeply familiar with all of AA's capabilities.
But it struck me as odd that we would need a custom routine to look for a memory conflict.
Is there no existing AA or MemoryDep interface that meets our need, and if not, should one be added?

854

[Ivan] Each BB ends with a terminator instruction and these should not have contain dependencies for the store.

An Invoke is one instance of a terminator that can have memory dependences.
I don't know if our CFG search could ever get through both edges coming out of an Invoke,
but I think it is best to avoid any such assumptions, and let AA make the dependency determinations.

892

This test is loop invariant and should probably be cached before the loop.
This is especially true after generalizing FindSafePreds to allow an arbitrary number of successors.

911

This is true for the common case of exact overlap.
But it might be worth adding a TBD comment to the effect that we could reasonably continue the search if the dead instruction was only a partial overlap.

I'm not suggesting we do the full "shortening" transformation that local DSE does (though that might be fun). But catching simple scalar stores would just work.

919

[Ivan] We know that the first dependence - which triggers invocation of HandleNonLocalDependency(Inst) - is non-local

Perhaps I don't quite understand the possible return values from MD, but it seems like Dep could be Unknown or NonLocalFunc here, in which case we should terminate the search.

test/Transforms/DeadStoreElimination/cycle.ll
3–4

These tests seem fairly generic.
I would prefer that we remove the explicit datalayout and triple here.

mbodart added inline comments.Oct 22 2015, 12:43 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
654

I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based on how many blocks and dependences it has to search through. So even if MaxNonLocalAttempts is 1, we could theoretically encounter a compile time problem. I'm not averse to keeping MaxNonLocalAttempts here, but would like to see a throttle in HandleNonLocalDependency.

ivanbaev updated this revision to Diff 38338.Oct 24 2015, 5:28 PM
ivanbaev retitled this revision from [DeadStoreElimination] Add support for DSE across blocks to [DeadStoreElimination] Add support for non-local DSE.
ivanbaev marked 15 inline comments as done.Oct 24 2015, 5:45 PM
ivanbaev added inline comments.
lib/Transforms/Scalar/DeadStoreElimination.cpp
572–583

The new patch is more aggressive in searching for redundant stores. Let's defer your suggestion for a subsequent patch.

654

MemoryDependenceAnalysis has BlockScanLimit = 100 and NumResultsLimit = 100. With these and the new MaxNonLocalAttempts = 100 we have enough handles to control compile-time.

821

If there is a client outside DSE we can move this helper into a utility directory

892

The idea is to allow the search continue through simple loops, but not to remove stores within loops.

Hi Ivan,

I added a few more responses.

Please be patient as this is a non-trivial optimization, and for me at least it takes some time to poke around and verify things.
I'm still not confident enough in my understanding of alias analysis to know whether underlyingObjectsDoNotAlias is complete, for example.

Also, please try to address some of the comments from David regarding naming conventions.

regards,

  • Mitch
lib/Transforms/Scalar/DeadStoreElimination.cpp
572–583

That's fine with me.

653

The self read issue has not been addressed.

654

I see. Thanks for pointing that out.
Those seem reasonable as a starting point.
But as Bruno suggested, some actual compile time measurments are probably in order, just to confirm.

808

I still don't understand why we need to check DT->dominates(BB, Pred) here.
I'm guessing you're trying to guard against an infinite loop, which of course
we need to do. But I think that would be automatically handled by instead checking
for !SafeBlocks.count(Pred). In addition to allowing the search to proceed further, this
would eliminate the need for the special check of Pred == BB, since BB must be in SafeBlocks.

814

The Succ == BB check is correct, but redundant as SafeBlocks.count(BB) must be non-zero.
I think the code would read better if this and the next check were combined into one:

if (Succ != Pred && !SafeBlocks.count(Succ))

but would be OK with your current structure.
Either way, however, it would be useful to add a clarifying source comment to the effect:

// Pred is only known to be optimizeable if all of its successors are either safe or are Pred itself.

892

Sorry, I'm not sure which comment you were responding to.
Phabricator for some reason seems to be placing some of my comments
above the lines of interest, and others below.

Regarding my "loop invariant" comment, I'm saying that the test for whether the current block
is a single-block loop (getNumSuccessors() == 2, etc) should probably be done outside of this loop. I'm also not sure if checking explicitly for two successors is sufficient. A program could conceivably have a switch statement with one case having a goto back to a label just prior to the switch. I'm not familiar enough with LLVM optimizations to know whether that can lead to an indirect branch successor which is the block containing the indirect branch itself. But it's theoretically possible. My recommendation here is to use a successor iterator, to match the behavior present in FindSafePreds.

Hi Mitch, thanks for the thoughtful comments; we are making a good progress.
I will make the two changes: using successor iterator and adding the comment, in the next revision.

Regarding Bruno's request for LNT testing, I'm currently running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. Could someone run this patch with an x86-based compiler if there are concerns about compile-time.

Regarding David's comments, I believe all have been addressed, except caching DataLayout in the constructor.
There are many places in DSE where DataLayout is used or passed as argument. Since most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way, I suggest this change to be done in a future patch.

Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.

Ivan

lib/Transforms/Scalar/DeadStoreElimination.cpp
653

It looks like that some comments are not aligned to the corresponding lines of code.
The self read issue has been addressed in line 888 by adding there a call to:

isPossibleSelfRead(Inst, Loc, Dependency, *TLI, *AA)

It is similar to the local case.

654

I'm running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished.

808

We need DT->dominates(BB, Pred) check to avoid situations similar to the one in cycle.ll test below. This test is a simplified version of a larger function found in a SPEC benchmark.

814

Agree that Succ == BB check is redundant, but it is a shortcut compared to checking a membership in a set.
That will save some compile time.
Sure, I will add the comment you suggested.

892

I will use a successor iterator here, thanks.

Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.

Yes, I'm there on Friday. Looking forward to meet you.

Hi Bruno,

Here are LNT compile-time: ibaev-linuxclang_DEVaarch64 test results
Previous: aarch64 compiler with this patch, non-local DSE
Current: aarch64 compiler without this patch, that is top-of-trunk DSE

Tests Summary
Status Group #
Performance Regressions 141
Performance Improvements 134
Added Tests 1
Existing Failures 489
Unchanged Tests 1191
Total Tests 1956

  • Top regressions with Current (top-of-trunk DSE) ***

Performance Regressions - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-08-02-CastTest 400.00% 0.0080 0.0400 -
SingleSource/UnitTests/C++11/stdthreadbug 300.00% 0.0040 0.0160 -
SingleSource/UnitTests/2003-05-31-LongShifts 250.00% 0.0080 0.0280 -
SingleSource/UnitTests/2008-04-20-LoopBug2 200.00% 0.0120 0.0360 -
SingleSource/Regression/C/2004-02-03-AggregateCopy 150.00% 0.0160 0.0400 -
MultiSource/Benchmarks/Prolangs-C++/vcirc/vcirc 100.00% 0.0240 0.0480 -
SingleSource/Benchmarks/Misc/mandel-2 100.00% 0.0360 0.0720 -
SingleSource/UnitTests/2003-04-22-Switch 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/Vector/multiplies 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/block-byref-test 100.00% 0.0120 0.0240 -
....

  • Top improvents with Current (top-of-trunk DSE) ***

Performance Improvements - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-10-13-BadLoad -75.00% 0.0160 0.0040 -
SingleSource/UnitTests/2006-12-01-float_varg -71.43% 0.0280 0.0080 -
SingleSource/Regression/C++/2003-06-08-VirtualFunctions -66.67% 0.0240 0.0080 -
SingleSource/Regression/C++/2003-09-29-NonPODsByValue -66.67% 0.0360 0.0120 -
SingleSource/Regression/C++/EH/function_try_block -66.67% 0.0480 0.0160 -
SingleSource/UnitTests/byval-alignment -63.64% 0.0440 0.0160 -
SingleSource/Benchmarks/Misc/pi -58.33% 0.0480 0.0200 -
SingleSource/Regression/C++/short_circuit_dtor -58.33% 0.0480 0.0200 -
SingleSource/UnitTests/2005-05-11-Popcount-ffs-fls -55.56% 0.0720 0.0320 -
SingleSource/UnitTests/Vector/simple -55.56% 0.0360 0.0160 -
SingleSource/Regression/C++/EH/simple_rethrow -54.55% 0.0440 0.0200 -
...

Since this is my first LNT testing, could someone comment on these results. Any outliers?

Thanks,
Ivan

mbodart added inline comments.Oct 30 2015, 3:03 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
653

It seems useful for compile time to check for a self-read right here, as in that case there is no point in processing non-local dependences. Such a check is only complicated by the fact that isPossibleSelfRead requires a "DepWrite" parameter. We would need to modify that routine to allow a nullptr for DepWrite.

What I'm unsure of, however, is whether this being a non-self read is already implied.
Is it the case that a StoreInst will never be a self read, for example?

808

If nonlocal DSE would otherwise remove the store in the cycle.ll test, it seems to me that's a problem with how we are using alias analysis. The other store in that loop is not guaranteed to overwrite the removed store's location on every loop iteration.

Perhaps it's simply my unfamiliarity with LLVM's alias analysis interfaces, but it seems like there would be some support for checking this definitive overlap, without requiring clients to also factor in dominance information. I'd be interested in hearing what the other reviewers have to say.

ivanbaev updated this revision to Diff 38859.Nov 1 2015, 4:06 PM

addressed two comments by Mitch

ivanbaev marked 12 inline comments as done.Nov 1 2015, 4:31 PM
ivanbaev added inline comments.
lib/Transforms/Scalar/DeadStoreElimination.cpp
653

A StoreInst will never be a self read.

808

By design, DSE - local, non-local, other parts - depends on MemoryDependenceAnalysis, and only occasionally relies on alias analysis.

bruno added a comment.Nov 2 2015, 10:36 AM

Hi Ivan,

Thanks again for working on this!
I'm gonna run your patch in x86 and let you know about the performance results.

ivanbaev marked 2 inline comments as done.Nov 2 2015, 7:19 PM

Thanks, Bruno. If necessary we need to balance non-local DSE for compile-time and run-time performance.

bruno added a comment.Nov 3 2015, 9:27 AM

The patch fails to apply cleanly on ToT, can you update it?

mbodart added inline comments.Nov 3 2015, 11:52 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
653

If that's the case, then we should be able to remove the isSelfRead check in handleNonLocalDependency.

808

I was using the term alias analysis generically, referring to either AA or MD.
If usage of those interfaces is suspect in the circumstances of the cycle.ll test,
I'd like to understand why exactly.

887

Inst here is the original StoreInst. If StoreInst's can never be a self read, this check will always be false. Do we need an additional check to ensure that Dependency does not read any of Inst's memory, analagous to the check "if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)"
in DSE::runOnBasicBlock?

896

This break is just exiting the succ iterator loop. We need to exit the containing while loop.

ivanbaev updated this revision to Diff 39302.Nov 4 2015, 6:23 PM

Rebased.
Address Mitch's comments.

ivanbaev marked 3 inline comments as done.Nov 4 2015, 6:53 PM

Hi Bruno, the patch has been rebased, and should apply cleanly on ToT.

Mitch, thanks for your comments and suggestions.
Non-local DSE extension follows the structure of HandleFree(). The underlying analysis for DSE is MD, and for across-BB both - HandleFree and HandleNonLocalDependency - use MD->getPointerDependencyFrom(). Note that MD is not loop-aware and its clients should handle CFG and alias issues by relying on AA or DT. HandleFree only allows unconditional predecessors - see FundUnconditonalPreds(). Based on reviewers' feedback HandleNonLocalDependency also handles if-then-else predecessors and single-block-loops (actually one of our benchmarks has a redundant store across a single-block-loop).

Regading cycle.ll test, using DT->dominates(BB, Pred) check is sufficient to exclude for.inc BB from the set of safe predecessors of for.body BB. What alternative do you propose for determining that stores in for.body and for.inc may alias?

lib/Transforms/Scalar/DeadStoreElimination.cpp
887

Added AA->getModRefInfo(Dependency, Loc) & MRI_Ref) test.

mbodart added inline comments.Nov 5 2015, 5:30 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
808

OK, I think I get it now.

Fundamentally the isOverwrite routine assumes that (P1 == P2) means that pointers P1 and P2 always have the same value, and is not considering back edge (phi) issues.

I think your dominance check is sufficient to avoid problems in this area.
But as this is a subtle issue, please add a source comment at the dominance check
explaining the situation.

887

That looks right, thanks!

ivanbaev marked an inline comment as done.Nov 6 2015, 8:55 AM

Hi Bruno,

Did you have a chance to run this patch for x86? I rebased it, so please let me know if it can merge to ToT. Your feedback is very much appreciated.
Thanks!
Ivan

ivanbaev updated this revision to Diff 40106.Nov 12 2015, 8:26 PM

Added option -enable-nonlocal-dse to enable/disable the feature.

All QuIC internal correctness and performance tests are good.
On Arm/AArch64 processor at -O2, the patch improves run-time performance of:
eembc/networking_suite/qos by 8%,
eembc/comsumer_suite/cjpegv2data7 by 16%,
spec2006/sjeng by 1.5%

Please, take another look at the patch. I think all suggestions and comments have been addressed.
I appreciate if you could run the patch on your applications on additional targets and measure the impact.

Best regards,
Ivan

Any objection to commit this patch?

Hi Ivan,

Sorry for the delay. Below the x86 results, of a 5x run in a stable machine.
Highlights include: CINT2006/458_sjeng performing 1.1% slower while CINT2000/175_vpr is 1.5% faster.

Performance Regressions - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt 3.80% 2.5611 2.6583 0.0179
MultiSource/Benchmarks/sim/sim 2.93% 3.2302 3.3250 0.0019
SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall 2.70% 1.6456 1.6900 0.0172
MultiSource/Benchmarks/NPB-serial/is/is 1.17% 5.6993 5.7660 0.0172
External/SPEC/CINT2006/458_sjeng/458_sjeng 1.14% X Y Z
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm 1.13% 1.4695 1.4861 0.0055
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt 1.13% 2.2458 2.2711 0.0082

Performance Improvements - Compile Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -16.00% 0.0844 0.0709 0.0002

Performance Improvements - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -12.79% 0.4956 0.4322 0.0000
MultiSource/Benchmarks/mafft/pairlocalalign -5.84% 18.7424 17.6486 0.0154
External/SPEC/CINT2000/175_vpr/175_vpr -1.50% X Y Z
MultiSource/Applications/sqlite3/sqlite3 -1.27% 2.5050 2.4732 0.0107

Hi Bruno,

Thanks for running and sharing x86 results! Deltas look reasonable, in my opinion. Removing dead stores and related instructions across basic blocks should be a good thing by itself. There is a possibility for second order effects - e.g. longer live ranges, increased number of spills.

Results for SPEC2006/sjeng are interesting. On Arm/AArch64/A57 devices at -O2 we have the following run times (in seconds, so lower is better)

  • "-mllvm -enable-nonlocal-dse=false": 1795, 1799, 1801, 1804, 1805, 1806
  • "-mllvm -enable-nonlocal-dse=true": 1773, 1774,1775, 1778, 1778, 1786

At -O3, we do not see any performance delta for sjeng. What optimization level did you use for compiling SPEC on x86?

Using -stats, for both -O2 and -O3. on sjeng we are able to delete 13 non-local stores in total (found in 4 source files). No changes in the number of regalloc spills and other relevant stats other than decreases in machine instructions emitted and printed and isl dag tries (see below).

Glad that the DSE compile-time is under control, and actually improved for one Trimaran benchmark.
On sjeng, there is a decrease in isel - Number of times dag isel has to try another path - from 2380 to 2348 with this patch. How is this related to compile-time?

It would be good to hear what reviewers think about run- and compile-time results on x86 and AArch64 with this patch, and if we can commit it.

Best regards,
Ivan

Could someone with commit access commit this patch, if there is no objection from reviewers?

This is a good feature to have in llvm/DSE; the reviewers' (mbodard, eeckstein, majnemer, bruno) comments and suggestions have been addressed; the run-time and compile-time performance of the patch on x86 (bruno) and arm/hexagon (ivanbaev) is reasonable.

Thanks,
Ivan

Hi Ivan,

I would stll like to see a source comment near the DT->dominates(BB, Pred) check in findSafePreds,
as its necessity is not otherwise obvious. It is needed in order to prevent unsafe use of isOverwrite
and potentially other AA interfaces. For example isOverwrite cannot be trusted if the pointer values
in question could come from different iterations of an enclosing loop.

Otherwise LGTM.

I think you'll need to update to ToT once again, as there was another change in DeadStoreElimination.cpp
which prevented the patch from applying cleanly.

  • mitch
ivanbaev updated this revision to Diff 40722.Nov 19 2015, 5:10 PM

Added a comment to clarify avoidance of blocks (to search for redundant stores) that form a cycle with BB, as requested by Mitch.
Update to ToT is not necessary because the recent changes to DeadStoreElimination.cpp has been reverted.

ivanbaev marked an inline comment as done.Nov 19 2015, 5:13 PM

Mitch, I added a relevant comment above DT->dominates(BB, Pred).

Thanks Ivan!

LGTM

  • mitch
mcrosier commandeered this revision.Dec 3 2015, 1:01 PM
mcrosier edited reviewers, added: ivanbaev; removed: mcrosier.

Ivan has asked that I take over this work.

ivanbaev edited edge metadata.Dec 5 2015, 12:34 PM

Thanks, Chad. Can we commit the patch at this point?
It solves one DSE enhancement known for several years.
I believe the patch has been thoroughly reviewed and tested.

Ivan

zzheng commandeered this revision.Dec 9 2015, 11:59 AM
zzheng updated this revision to Diff 42326.
zzheng added a reviewer: mcrosier.
zzheng edited edge metadata.
zzheng set the repository for this revision to rL LLVM.

Fixed build error due to explicit ilist::iterator.

majnemer added inline comments.Dec 9 2015, 6:39 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
678

Pointers lean right.

680

Can a range-based for loop be used here? Something like: for (const BasicBlock *Pred : *BB)

712

Likewise, can you use one here?

741

Pointers lean right.

mcrosier accepted this revision.Dec 10 2015, 9:36 AM
mcrosier edited edge metadata.

Committed in r255247. I missed David's comments prior to the commit, but I did clang-format the patch before submission. That did address a few of the issues. I addressed David's suggestion for range-based loops in r255265.

This revision is now accepted and ready to land.Dec 10 2015, 9:36 AM
mcrosier closed this revision.Dec 10 2015, 9:36 AM

I ended up disabling this patch in r255286. I suspect this change is causing lots of bots to timeout as well as doubling the compile-time for Chromium. I'll have someone investigate, but hopefully the compile-time issues have been resolved for the time being.

Hi Chad, thanks for helping with that. Is compile-time increase the remaining problem?
One quick fix could be to change 100 to 10 for the max number on non-local attempts in a function:

static const unsigned MaxNonLocalAttempts = 100;

We've chosen 100 arbitrarily. The newly added DSE tests should pass with 10.

Ivan

I'd prefer if numbers were collected indicating which thresholds worked
best.

bruno added a comment.Dec 10 2015, 6:02 PM

+1
It would be good to have that first!

Agree, but this depends what tests/apps we are concerned about. LNT tests didn't show many outliers.
In addition to Chromium, which applications have been known as outliers in terms of DSE compile-time contribution in the past?

I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.

Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?

I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.

The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.

Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.

One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.

One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?

Finally, I collected some stats when compiling llvm-ar.ll.

There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.

Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7

In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.

Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.

  • mitch

Hi Mitch,
Since this patch has been closed IMO the path forward would be to upload a new revision for code review. Please include all the current reviewers/subscribers. You should also include a link to this patch for reference. From there we can begin the review process, which hopefully will be fairly short assuming the fixes were mostly trivial.

Chad

Hi Mitch,

Thanks for working on this.

I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.

Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?

Feel free to proceed as Chad suggested.

I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.

The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.

Is it something like:

SmallPtrSet<BasicBlock *, 8> Visited;
for (auto I = pred_begin(BB), E = pred_end(BB); I != E; ++I) {

if (!Visited.insert(*I).second)
  continue;

Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.

One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.

Do you propose replacing DT->dominates(BB, Pred) in

if (!DT->isReachableFromEntry(Pred) || DT->dominates(BB, Pred))

with DT->dominates(OrigBB, Pred), or adding the latter as a third check?

It would be great to have this irreducible CFG test added to the existing tests for DSE.

One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?

That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.

Finally, I collected some stats when compiling llvm-ar.ll.

There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.

Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7

In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.

Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.

Interesting data; it may vary from a source language to a source language. I agree that we need more samples to justify adding such heuristic. For now, I propose:

static const unsigned MaxNonLocalAttempts = 10;

Ivan

Hi Ivan,

I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.

I'm playing around with some heuristics in CGP's isFormingBranchFromSelectProfitable
to help mitigate these, and would like to have a feasible solution at hand before landing nonlocal DSE.

As to your specific questions, yes to the "Visited" technique, though I did restructure the checks a bit,
and used a SmallSet instead of SmallPtrSet. I do plan to replace the dominator check, not add an additional one.

That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.

For SafeBlocks we don't need a specific ordering, as it's only used for membership queries.
As for the worklist (Blocks) I think the implementation would work fine if this was unordered,
though I left its type unchanged (SmallVector).

Yes, I added a test for the irreducibility issue.

And I didn't intend to add another compile time throttle at this time.
That can be easily added should an issue arise.

regards,

  • mitch

I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.

Any updates, Mitch?

Unfortunately the tuning I had planned in isFormingBranchFromSelectProfitable is no longer applicable.
The relevant heuristic was the "load feeding compare", which I intended to relax a bit.
Per discussions in other code reviews in that area (I don't have the number off hand, but maybe
related to D16836 and D17288), that heurstic has been deleted altogether, and there seemed to be a
general consensus to move this transformation into the target.

I've been looking into that a bit, but it's slow going.

In the meantime, it seemed counterproductive to enable the dead store enhancements.
While the concept sounds good, in practice it seems there are only modest gains,
and some rather large downsides.

jfb added a subscriber: jfb.Jan 30 2019, 1:52 PM