This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Transforms/Scalar/
-
Transforms/
-
Scalar/
34/59
DeadStoreElimination.cpp
-
test/Transforms/DeadStoreElimination/
-
Transforms/
-
DeadStoreElimination/
1/1
cycle.ll
-
ifthen.ll
-
ifthenelse.ll
-
ifthenelse2.ll
-
loop.ll

Differential D13363

[DeadStoreElimination] Add support for non-local DSE
ClosedPublic

Authored by zzheng on Oct 1 2015, 1:16 PM.

Download Raw Diff

Details

Reviewers

reames
igor-laevsky
karthikthecool
eeckstein
ivanbaev
hfinkel
mcrosier

Summary

We extend the search for redundant stores to predecessor blocks that
unconditionally lead to the block BB with the current store instruction.
That also includes single-block loops that unconditionally lead to BB,
and if-then-else blocks where then- and else-blocks unconditionally lead
to BB.

Diff Detail

Repository: rL LLVM

Event Timeline

ivanbaev updated this revision to Diff 36285.Oct 1 2015, 1:16 PM

ivanbaev retitled this revision from to [DeadStoreElimination] Add support for DSE across blocks.

ivanbaev updated this object.

ivanbaev added reviewers: eeckstein, igor-laevsky, reames.

ivanbaev added a subscriber: llvm-commits.

Note that the code for BB DSE stays the same, it is just indented two spaces to the right.
The new code begins at:

// DSE across BB
else if (InstDep.isNonLocal()) { ...

and it includes helper FindUncondPredsIncludingSimpleLoops().

mbodart added a subscriber: mbodart.Oct 1 2015, 3:02 PM

ivanbaev added reviewers: karthikthecool, hfinkel.Oct 6 2015, 6:29 AM

Some comments:

*) Your algorithm has quadratic complexity in worst case. I know this is the same as for the single block algorithm, but extending it to multiple blocks might worsen the situation. Maybe you add some sort of upper bound for the number of dependency checks?

*) Your algorithm only handles certain types of cross-block DSE. As I understood, it cannot go beyond an if-then-else control flow split. I'm wondering if it makes sense to do a simple data flow analysis to collect the eligible predecessor blocks. This would handle all cases and I think it would be not much more implementation effort and doesn't add compile time.

*) I found it confusing that all the code is contained in this large functions with many nested loops and breaks and continues. I think it's better to extract some of the code to separate functions (e.g. to avoid the StopProcessPBAndItsPreds variable).

*) The tests look like they contain much code which is not really necessary to test what you want to test. Maybe you can reduce them. Also in the first test a comment would help, which describes why no DSE can be done.

Addressed Erik's comments:
*) Your algorithm has quadratic complexity in worst case. I know this is the same as for the single block algorithm, but extending it to multiple blocks might worsen the situation. Maybe you add some sort of upper bound for the number of dependency checks?

An upper bound (MaxNonLocalAttempts = 100) is added on the number of non-local DSE attempts per function.

Analyzing general loops is challenging and compile-time extensive due to underlying MemoryDependenceAnalysis. This patch iteratively goes through predecessors that unconditionally lead to the block with the store and search for redundant stores. Not sure if there are opportunities within a reasonable compile-time budget for a truly global DSE.

The code is refactored with top algorithm extracted in HandleNonLocalDependency(Instruction *Inst).

I reduced the cycle test to a much smaller version, and added comments.

ping?

In general I think your approach of using a reverse CFG search for this optimization is appropriate. But as Erik points out it doesn't handle if-then-else.
That can be addressed relatively easily:

Let StartBB be the block with the original StoreInst.
Let SafeBlocks be the set of blocks for which we know all paths emanating from them will lead to StoreInst
with no intervening reads of the memory location of interest. Initially, add StartBB to SafeBlocks.

The criteria for adding a new predecessor P to your WorkList are that P is reachable, P is not in SafeBlocks (not visited),
and all of P's successors are P itself (the single-block loop case) or are in SafeBlocks.

Seed your WorkList with the predecessors of StartBB that meet this criteria.
After processing a block B in HandleNonLocalDependency, when DependencyFound is false,
add B to SafeBlocks, and augment the WorkList with B's predecessors that meet the criteria.

The "P is not in SafeBlocks" acts like a "! Visited" check, except that terminal (unsafe) blocks may be
visited multiple times. To avoid that redundancy we'd need another set (UnsafeBlocks, or perhaps just VisitedBlocks).

As to the compile time threshold, I think it needs to be applied to the number of dependences and blocks seen
in HandleNonLocalDependency. That way we'll put a cap on the cost of optimizing an individual store,
rather than limiting the number of store instructions that may get optimized. I don't know what LLVM's
philosophy is on adding new command line options, but perhaps the threshold value should be overridable
that way. And perhaps a cl::opt<bool> EnableGDSE("enable-gdse"...) to control this global dead store elimination.

As for the while loop in runOnBasicBlock, I would prefer that it keep track of whether any reads were seen,
including the isSelfRead case, and then we handle non-local dependences after that loop terminates.
This is the same concept as your DependencySeen tracking in HandleNonLocalDependency, though I might
name it SafeToProcessNonLocalDeps.

lib/Transforms/Scalar/DeadStoreElimination.cpp
857	Instead of InstPt, shouldn't the search start with PB->end()? It looks like getPointerDependencyFrom does a "--" operation on this iterator to get the first instruction to start its analysis from.
900	I don't understand why we are stopping the search here. If the overwrite is partial, there may still be preceding stores that could be eliminated.
906	Shouldn't this check also include && Dep.isNonLocal()?

The new version now handles a redundant store in the if-block in if-then-else - based on a suggestion by Mitch. We use SafeBlocks set, which is initialized and updated in DSE::HandleNonLocalDependency() and used in helper FindSafePreds(). SafeBlocks keeps track of all predecessor blocks that are safe to search through. A new test - ifthenelse.ll - has been added.

Thanks for suggestions/comments, Mitch. My responses are inlined below.

In general I think your approach of using a reverse CFG search for this optimization is appropriate. But as Erik points out it doesn't handle if-then-else.
That can be addressed relatively easily:

I added set SafeBlocks to keep track of all predecessor blocks that are safe to search through.

It is seeded with BB, the block with the original store. If no Def/Clobber dependency is found in predecessor PB, then we add PB to SafeBlocks.
Helper FindSafePreds(Blocks, SafeBlocks, PB, DT) then uses SafeBlocks to add blocks with two successors (e.g. an if-block) to the worklist.

Use of static const unsigned MaxNonLocalAttempts = 100; in DSE is similar to the use of static const unsigned MaxIVUsers = 200; in LoopStrengthReduce.

Note that the large portion of DSE compile-time is due not to proper DSE but to the underlying MemoryDependenceAnalysis,
especially for computing the non-local dependency information. Controlling computation in MemoryDependenceAnalysis is
outside the scope of DSE. Given the iterative way of adding predecessors and searching them locally for non-local DSE
I don't expect a substantial increase in DSE compile-time.

lib/Transforms/Scalar/DeadStoreElimination.cpp, 857
Instead of InstPt, shouldn't the search start with PB->end()?
It looks like getPointerDependencyFrom does a "--" operation on this iterator
to get the first instruction to start its analysis from.

Each BB ends with a terminator instruction and these should not have contain dependencies for the store.

See also DSE::HandleFree().

line 900
I don't understand why we are stopping the search here.
If the overwrite is partial, there may still be preceding stores
that could be eliminated.

In the local BB case, "if (OR == OverwriteComplete) ..." is followed by

"else if (OR == OverwriteEnd && isShortenable(DepWrite)) ..."
I agree that we can be more aggressive here but suggest this to be added by a separate patch.

line 906
Shouldn't this check also include && Dep.isNonLocal()?

We know that the first dependence - which triggers invocation of HandleNonLocalDependency(Inst) - is non-local.

In HandleNonLocalDependency() we use MemoryDependenceAnalysis::getPointerDependencyFrom()
to performs local search: it walks backwards through the basic block, looking for dependencies.

In each iteration of:
while (!Blocks.empty()) {

  BasicBlock *PB = Blocks.pop_back_val();
  ...
  MemDepResult Dep = MD->getPointerDependencyFrom(Loc, false, InstPt, PB, SI);
  ...
}

we are looking for (local) dependencies within PB for the original store instruction. A non-local dependency found
when we search PB should not impact how we process PB.

majnemer added a subscriber: majnemer.Oct 18 2015, 8:03 PM

majnemer added inline comments.

lib/Transforms/Scalar/DeadStoreElimination.cpp
88	Functions should start with a lower case letter.
616	Pointers go on the right side, please use `auto *` to avoid repetition.
627–630	Pointers lean right.
653–656	Please keep the `else if` on the same line as the closing `}`
804	Functions should start with a lowercase letter.
807–808	Can you use a range-base for loop with `predecessors` ?
812–820	It would be nicer to split this into more `if (X) continue;` to reduce the heavy nesting.
831–834	Could you use a range based for loop here?
846	Please use `auto *`.
857	Could this be cached in the constructor?
876	Please use `auto *`.

changes for style and coding standards

ping?

lib/Transforms/Scalar/DeadStoreElimination.cpp
616	this is actually the current code; done
627–630	this is actually the current code; done

Hi Ivan,

Thanks for working on this. It would be good to see some LNT compile time results of this patch beforehand since DSE triggered compile time bloat in the past.

majnemer added inline comments.Oct 21 2015, 2:40 PM

lib/Transforms/Scalar/DeadStoreElimination.cpp
804–805	This has yet to be addressed.
809–817	The nesting here is still very large. I'm also not entirely convinced that this code is correct. What if `PredTI->getNumSuccessors() == 1` is true but `PredTI->getSuccessor(0) == Pred` is false? In that case, won't you access `PredTI->getSuccessor(1) == Pred` ? It would be safer and more clear to simply loop over `successors`.
828–831	This hasn't been addressed.
854	This hasn't been addressed.

ivanbaev marked 2 inline comments as done.Oct 22 2015, 9:58 AM

ivanbaev added inline comments.

lib/Transforms/Scalar/DeadStoreElimination.cpp
804–805	This is similar to the code for iterating over predecessors in existing void FindUnconditionalPreds() function. I suggest to change both in a separate patch.
809–817	if PredTI->getNumSuccessors() == 1 is true, then we exit the if statement checks at this point, so we will not access PredTI->getSuccessor(0), etc. Do you think that the following is better? if (PredTI->getNumSuccessors() == 1) { PredBlocks.push_back(Pred); continue; } if (PredTI->getNumSuccessors() == 2 && (PredTI->getSuccessor(0) == Pred \|\| PredTI->getSuccessor(1) == Pred \|\| (SafeBlocks.count(PredTI->getSuccessor(0)) && SafeBlocks.count(PredTI->getSuccessor(1))))) PredBlocks.push_back(Pred);
828–831	There are about 8 places in DeadStoreElimination.cpp where we can use range based for loops. I suggest these changes to be made in a separate patch.
854	That is a good point. There are many places in DSE where DataLayout is used or passed as argument. However, it seems that most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way.

mbodart added inline comments.Oct 22 2015, 12:43 PM

lib/Transforms/Scalar/DeadStoreElimination.cpp
572	I didn't see a response to my suggestion to keep track of SafeToProcessNonLocalDeps in this loop. It seems incongruous and unnecessarily limiting to let any isDef or isClobber dependency here inhibit global DSE, while HandleNonLocalDependency tolerates them as long as they don't interfere. That said, it may be better left for a future change set, as it will slightly complicate this loop.
653	For safety I think we also need to check that Inst is not a self read.
809–817	I second the suggestion to use a Successor iterator here. Additionally, I think the checks for Pred == BB and DT->dominates(BB, Pred) are unnecessary. Note that BB (and StartBB) should already be in SafeBlocks, so the SafeBlocks.count checks should provide sufficient safety. I would think the routine simplifies to something like this: for each pred P of BB if P is reachable and !SafeBlocks.count(P)) PredOK = true; for each succ S of P if !SafeBlocks.count(S) && S != P PredOK = false; if PredOK, add P to PredBlocks
821	I'm still ramping up on LLVM and thus not deeply familiar with all of AA's capabilities. But it struck me as odd that we would need a custom routine to look for a memory conflict. Is there no existing AA or MemoryDep interface that meets our need, and if not, should one be added?
854	[Ivan] Each BB ends with a terminator instruction and these should not have contain dependencies for the store. An Invoke is one instance of a terminator that can have memory dependences. I don't know if our CFG search could ever get through both edges coming out of an Invoke, but I think it is best to avoid any such assumptions, and let AA make the dependency determinations.
892	This test is loop invariant and should probably be cached before the loop. This is especially true after generalizing FindSafePreds to allow an arbitrary number of successors.
911	This is true for the common case of exact overlap. But it might be worth adding a TBD comment to the effect that we could reasonably continue the search if the dead instruction was only a partial overlap. I'm not suggesting we do the full "shortening" transformation that local DSE does (though that might be fun). But catching simple scalar stores would just work.
919	[Ivan] We know that the first dependence - which triggers invocation of HandleNonLocalDependency(Inst) - is non-local Perhaps I don't quite understand the possible return values from MD, but it seems like Dep could be Unknown or NonLocalFunc here, in which case we should terminate the search.
test/Transforms/DeadStoreElimination/cycle.ll
3–4	These tests seem fairly generic. I would prefer that we remove the explicit datalayout and triple here.

mbodart added inline comments.Oct 22 2015, 12:43 PM

lib/Transforms/Scalar/DeadStoreElimination.cpp
654	I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based on how many blocks and dependences it has to search through. So even if MaxNonLocalAttempts is 1, we could theoretically encounter a compile time problem. I'm not averse to keeping MaxNonLocalAttempts here, but would like to see a throttle in HandleNonLocalDependency.

ivanbaev updated this revision to Diff 38338.Oct 24 2015, 5:28 PM

ivanbaev retitled this revision from [DeadStoreElimination] Add support for DSE across blocks to [DeadStoreElimination] Add support for non-local DSE.

ivanbaev marked 15 inline comments as done.Oct 24 2015, 5:45 PM

ivanbaev added inline comments.

lib/Transforms/Scalar/DeadStoreElimination.cpp
572–583	The new patch is more aggressive in searching for redundant stores. Let's defer your suggestion for a subsequent patch.
654	MemoryDependenceAnalysis has BlockScanLimit = 100 and NumResultsLimit = 100. With these and the new MaxNonLocalAttempts = 100 we have enough handles to control compile-time.
821	If there is a client outside DSE we can move this helper into a utility directory
892	The idea is to allow the search continue through simple loops, but not to remove stores within loops.

ping?

Hi Ivan,

I added a few more responses.

Please be patient as this is a non-trivial optimization, and for me at least it takes some time to poke around and verify things.
I'm still not confident enough in my understanding of alias analysis to know whether underlyingObjectsDoNotAlias is complete, for example.

Also, please try to address some of the comments from David regarding naming conventions.

regards,

Mitch

lib/Transforms/Scalar/DeadStoreElimination.cpp
572–583	That's fine with me.
653	The self read issue has not been addressed.
654	I see. Thanks for pointing that out. Those seem reasonable as a starting point. But as Bruno suggested, some actual compile time measurments are probably in order, just to confirm.
808	I still don't understand why we need to check DT->dominates(BB, Pred) here. I'm guessing you're trying to guard against an infinite loop, which of course we need to do. But I think that would be automatically handled by instead checking for !SafeBlocks.count(Pred). In addition to allowing the search to proceed further, this would eliminate the need for the special check of Pred == BB, since BB must be in SafeBlocks.
814	The Succ == BB check is correct, but redundant as SafeBlocks.count(BB) must be non-zero. I think the code would read better if this and the next check were combined into one: if (Succ != Pred && !SafeBlocks.count(Succ)) but would be OK with your current structure. Either way, however, it would be useful to add a clarifying source comment to the effect: // Pred is only known to be optimizeable if all of its successors are either safe or are Pred itself.
892	Sorry, I'm not sure which comment you were responding to. Phabricator for some reason seems to be placing some of my comments above the lines of interest, and others below. Regarding my "loop invariant" comment, I'm saying that the test for whether the current block is a single-block loop (getNumSuccessors() == 2, etc) should probably be done outside of this loop. I'm also not sure if checking explicitly for two successors is sufficient. A program could conceivably have a switch statement with one case having a goto back to a label just prior to the switch. I'm not familiar enough with LLVM optimizations to know whether that can lead to an indirect branch successor which is the block containing the indirect branch itself. But it's theoretically possible. My recommendation here is to use a successor iterator, to match the behavior present in FindSafePreds.

Hi Mitch, thanks for the thoughtful comments; we are making a good progress.
I will make the two changes: using successor iterator and adding the comment, in the next revision.

Regarding Bruno's request for LNT testing, I'm currently running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. Could someone run this patch with an x86-based compiler if there are concerns about compile-time.

Regarding David's comments, I believe all have been addressed, except caching DataLayout in the constructor.
There are many places in DSE where DataLayout is used or passed as argument. Since most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way, I suggest this change to be done in a future patch.

Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.

Ivan

lib/Transforms/Scalar/DeadStoreElimination.cpp
653	It looks like that some comments are not aligned to the corresponding lines of code. The self read issue has been addressed in line 888 by adding there a call to: isPossibleSelfRead(Inst, Loc, Dependency, TLI, AA) It is similar to the local case.
654	I'm running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished.
808	We need DT->dominates(BB, Pred) check to avoid situations similar to the one in cycle.ll test below. This test is a simplified version of a larger function found in a SPEC benchmark.
814	Agree that Succ == BB check is redundant, but it is a shortcut compared to checking a membership in a set. That will save some compile time. Sure, I will add the comment you suggested.
892	I will use a successor iterator here, thanks.

Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.

Yes, I'm there on Friday. Looking forward to meet you.

Hi Bruno,

Here are LNT compile-time: ibaev-linuxclang_DEVaarch64 test results
Previous: aarch64 compiler with this patch, non-local DSE
Current: aarch64 compiler without this patch, that is top-of-trunk DSE

Tests Summary
Status Group #
Performance Regressions 141
Performance Improvements 134
Added Tests 1
Existing Failures 489
Unchanged Tests 1191
Total Tests 1956

Top regressions with Current (top-of-trunk DSE) ***

Performance Regressions - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-08-02-CastTest 400.00% 0.0080 0.0400 -
SingleSource/UnitTests/C++11/stdthreadbug 300.00% 0.0040 0.0160 -
SingleSource/UnitTests/2003-05-31-LongShifts 250.00% 0.0080 0.0280 -
SingleSource/UnitTests/2008-04-20-LoopBug2 200.00% 0.0120 0.0360 -
SingleSource/Regression/C/2004-02-03-AggregateCopy 150.00% 0.0160 0.0400 -
MultiSource/Benchmarks/Prolangs-C++/vcirc/vcirc 100.00% 0.0240 0.0480 -
SingleSource/Benchmarks/Misc/mandel-2 100.00% 0.0360 0.0720 -
SingleSource/UnitTests/2003-04-22-Switch 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/Vector/multiplies 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/block-byref-test 100.00% 0.0120 0.0240 -
....

Top improvents with Current (top-of-trunk DSE) ***

Performance Improvements - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-10-13-BadLoad -75.00% 0.0160 0.0040 -
SingleSource/UnitTests/2006-12-01-float_varg -71.43% 0.0280 0.0080 -
SingleSource/Regression/C++/2003-06-08-VirtualFunctions -66.67% 0.0240 0.0080 -
SingleSource/Regression/C++/2003-09-29-NonPODsByValue -66.67% 0.0360 0.0120 -
SingleSource/Regression/C++/EH/function_try_block -66.67% 0.0480 0.0160 -
SingleSource/UnitTests/byval-alignment -63.64% 0.0440 0.0160 -
SingleSource/Benchmarks/Misc/pi -58.33% 0.0480 0.0200 -
SingleSource/Regression/C++/short_circuit_dtor -58.33% 0.0480 0.0200 -
SingleSource/UnitTests/2005-05-11-Popcount-ffs-fls -55.56% 0.0720 0.0320 -
SingleSource/UnitTests/Vector/simple -55.56% 0.0360 0.0160 -
SingleSource/Regression/C++/EH/simple_rethrow -54.55% 0.0440 0.0200 -
...

Since this is my first LNT testing, could someone comment on these results. Any outliers?

Thanks,
Ivan

mbodart added inline comments.Oct 30 2015, 3:03 PM

lib/Transforms/Scalar/DeadStoreElimination.cpp
653	It seems useful for compile time to check for a self-read right here, as in that case there is no point in processing non-local dependences. Such a check is only complicated by the fact that isPossibleSelfRead requires a "DepWrite" parameter. We would need to modify that routine to allow a nullptr for DepWrite. What I'm unsure of, however, is whether this being a non-self read is already implied. Is it the case that a StoreInst will never be a self read, for example?
808	If nonlocal DSE would otherwise remove the store in the cycle.ll test, it seems to me that's a problem with how we are using alias analysis. The other store in that loop is not guaranteed to overwrite the removed store's location on every loop iteration. Perhaps it's simply my unfamiliarity with LLVM's alias analysis interfaces, but it seems like there would be some support for checking this definitive overlap, without requiring clients to also factor in dominance information. I'd be interested in hearing what the other reviewers have to say.

addressed two comments by Mitch

ivanbaev marked 12 inline comments as done.Nov 1 2015, 4:31 PM

ivanbaev added inline comments.

lib/Transforms/Scalar/DeadStoreElimination.cpp
653	A StoreInst will never be a self read.
808	By design, DSE - local, non-local, other parts - depends on MemoryDependenceAnalysis, and only occasionally relies on alias analysis.

Hi Ivan,

Thanks again for working on this!
I'm gonna run your patch in x86 and let you know about the performance results.

Thanks, Bruno. If necessary we need to balance non-local DSE for compile-time and run-time performance.

The patch fails to apply cleanly on ToT, can you update it?

mbodart added inline comments.Nov 3 2015, 11:52 AM

lib/Transforms/Scalar/DeadStoreElimination.cpp
653	If that's the case, then we should be able to remove the isSelfRead check in handleNonLocalDependency.
808	I was using the term alias analysis generically, referring to either AA or MD. If usage of those interfaces is suspect in the circumstances of the cycle.ll test, I'd like to understand why exactly.
887	Inst here is the original StoreInst. If StoreInst's can never be a self read, this check will always be false. Do we need an additional check to ensure that Dependency does not read any of Inst's memory, analagous to the check "if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)" in DSE::runOnBasicBlock?
896	This break is just exiting the succ iterator loop. We need to exit the containing while loop.

Rebased.
Address Mitch's comments.

Hi Bruno, the patch has been rebased, and should apply cleanly on ToT.

Mitch, thanks for your comments and suggestions.
Non-local DSE extension follows the structure of HandleFree(). The underlying analysis for DSE is MD, and for across-BB both - HandleFree and HandleNonLocalDependency - use MD->getPointerDependencyFrom(). Note that MD is not loop-aware and its clients should handle CFG and alias issues by relying on AA or DT. HandleFree only allows unconditional predecessors - see FundUnconditonalPreds(). Based on reviewers' feedback HandleNonLocalDependency also handles if-then-else predecessors and single-block-loops (actually one of our benchmarks has a redundant store across a single-block-loop).

Regading cycle.ll test, using DT->dominates(BB, Pred) check is sufficient to exclude for.inc BB from the set of safe predecessors of for.body BB. What alternative do you propose for determining that stores in for.body and for.inc may alias?

lib/Transforms/Scalar/DeadStoreElimination.cpp
887	Added AA->getModRefInfo(Dependency, Loc) & MRI_Ref) test.

mbodart added inline comments.Nov 5 2015, 5:30 PM

lib/Transforms/Scalar/DeadStoreElimination.cpp
808	OK, I think I get it now. Fundamentally the isOverwrite routine assumes that (P1 == P2) means that pointers P1 and P2 always have the same value, and is not considering back edge (phi) issues. I think your dominance check is sufficient to avoid problems in this area. But as this is a subtle issue, please add a source comment at the dominance check explaining the situation.
887	That looks right, thanks!

Hi Bruno,

Did you have a chance to run this patch for x86? I rebased it, so please let me know if it can merge to ToT. Your feedback is very much appreciated.
Thanks!
Ivan

Added option -enable-nonlocal-dse to enable/disable the feature.

All QuIC internal correctness and performance tests are good.
On Arm/AArch64 processor at -O2, the patch improves run-time performance of:
eembc/networking_suite/qos by 8%,
eembc/comsumer_suite/cjpegv2data7 by 16%,
spec2006/sjeng by 1.5%

Please, take another look at the patch. I think all suggestions and comments have been addressed.
I appreciate if you could run the patch on your applications on additional targets and measure the impact.

Best regards,
Ivan

Any objection to commit this patch?

Hi Ivan,

Sorry for the delay. Below the x86 results, of a 5x run in a stable machine.
Highlights include: CINT2006/458_sjeng performing 1.1% slower while CINT2000/175_vpr is 1.5% faster.

Performance Regressions - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt 3.80% 2.5611 2.6583 0.0179
MultiSource/Benchmarks/sim/sim 2.93% 3.2302 3.3250 0.0019
SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall 2.70% 1.6456 1.6900 0.0172
MultiSource/Benchmarks/NPB-serial/is/is 1.17% 5.6993 5.7660 0.0172
External/SPEC/CINT2006/458_sjeng/458_sjeng 1.14% X Y Z
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm 1.13% 1.4695 1.4861 0.0055
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt 1.13% 2.2458 2.2711 0.0082

Performance Improvements - Compile Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -16.00% 0.0844 0.0709 0.0002

Performance Improvements - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -12.79% 0.4956 0.4322 0.0000
MultiSource/Benchmarks/mafft/pairlocalalign -5.84% 18.7424 17.6486 0.0154
External/SPEC/CINT2000/175_vpr/175_vpr -1.50% X Y Z
MultiSource/Applications/sqlite3/sqlite3 -1.27% 2.5050 2.4732 0.0107

Hi Bruno,

Thanks for running and sharing x86 results! Deltas look reasonable, in my opinion. Removing dead stores and related instructions across basic blocks should be a good thing by itself. There is a possibility for second order effects - e.g. longer live ranges, increased number of spills.

Results for SPEC2006/sjeng are interesting. On Arm/AArch64/A57 devices at -O2 we have the following run times (in seconds, so lower is better)

"-mllvm -enable-nonlocal-dse=false": 1795, 1799, 1801, 1804, 1805, 1806
"-mllvm -enable-nonlocal-dse=true": 1773, 1774,1775, 1778, 1778, 1786

At -O3, we do not see any performance delta for sjeng. What optimization level did you use for compiling SPEC on x86?

Using -stats, for both -O2 and -O3. on sjeng we are able to delete 13 non-local stores in total (found in 4 source files). No changes in the number of regalloc spills and other relevant stats other than decreases in machine instructions emitted and printed and isl dag tries (see below).

Glad that the DSE compile-time is under control, and actually improved for one Trimaran benchmark.
On sjeng, there is a decrease in isel - Number of times dag isel has to try another path - from 2380 to 2348 with this patch. How is this related to compile-time?

It would be good to hear what reviewers think about run- and compile-time results on x86 and AArch64 with this patch, and if we can commit it.

Best regards,
Ivan

Could someone with commit access commit this patch, if there is no objection from reviewers?

This is a good feature to have in llvm/DSE; the reviewers' (mbodard, eeckstein, majnemer, bruno) comments and suggestions have been addressed; the run-time and compile-time performance of the patch on x86 (bruno) and arm/hexagon (ivanbaev) is reasonable.

Thanks,
Ivan

ping?

Hi Ivan,

I would stll like to see a source comment near the DT->dominates(BB, Pred) check in findSafePreds,
as its necessity is not otherwise obvious. It is needed in order to prevent unsafe use of isOverwrite
and potentially other AA interfaces. For example isOverwrite cannot be trusted if the pointer values
in question could come from different iterations of an enclosing loop.

Otherwise LGTM.

I think you'll need to update to ToT once again, as there was another change in DeadStoreElimination.cpp
which prevented the patch from applying cleanly.

mitch

Added a comment to clarify avoidance of blocks (to search for redundant stores) that form a cycle with BB, as requested by Mitch.
Update to ToT is not necessary because the recent changes to DeadStoreElimination.cpp has been reverted.

Mitch, I added a relevant comment above DT->dominates(BB, Pred).

Thanks Ivan!

LGTM

mitch

mcrosier added a reviewer: mcrosier.Nov 23 2015, 1:02 PM

Ivan has asked that I take over this work.

Thanks, Chad. Can we commit the patch at this point?
It solves one DSE enhancement known for several years.
I believe the patch has been thoroughly reviewed and tested.

Ivan

Fixed build error due to explicit ilist::iterator.

LGTM

majnemer added inline comments.Dec 9 2015, 6:39 PM

lib/Transforms/Scalar/DeadStoreElimination.cpp
678	Pointers lean right.
680	Can a range-based for loop be used here? Something like: `for (const BasicBlock Pred : BB)`
712	Likewise, can you use one here?
741	Pointers lean right.

Committed in r255247. I missed David's comments prior to the commit, but I did clang-format the patch before submission. That did address a few of the issues. I addressed David's suggestion for range-based loops in r255265.

This revision is now accepted and ready to land.Dec 10 2015, 9:36 AM

mcrosier closed this revision.Dec 10 2015, 9:36 AM

I ended up disabling this patch in r255286. I suspect this change is causing lots of bots to timeout as well as doubling the compile-time for Chromium. I'll have someone investigate, but hopefully the compile-time issues have been resolved for the time being.

jevinskie added a subscriber: jevinskie.Dec 10 2015, 2:22 PM

Hi Chad, thanks for helping with that. Is compile-time increase the remaining problem?
One quick fix could be to change 100 to 10 for the max number on non-local attempts in a function:

static const unsigned MaxNonLocalAttempts = 100;

We've chosen 100 arbitrarily. The newly added DSE tests should pass with 10.

Ivan

I'd prefer if numbers were collected indicating which thresholds worked
best.

+1
It would be good to have that first!

Agree, but this depends what tests/apps we are concerned about. LNT tests didn't show many outliers.
In addition to Chromium, which applications have been known as outliers in terms of DSE compile-time contribution in the past?

I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.

Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?

I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.

The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.

Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.

One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.

One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?

Finally, I collected some stats when compiling llvm-ar.ll.

There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.

Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7

In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.

Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.

mitch

Hi Mitch,
Since this patch has been closed IMO the path forward would be to upload a new revision for code review. Please include all the current reviewers/subscribers. You should also include a link to this patch for reference. From there we can begin the review process, which hopefully will be fairly short assuming the fixes were mostly trivial.

Chad

Hi Mitch,

Thanks for working on this.

In D13363#334021, @mbodart wrote:

I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.

Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?

Feel free to proceed as Chad suggested.

I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.

The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.

Is it something like:

SmallPtrSet<BasicBlock *, 8> Visited;
for (auto I = pred_begin(BB), E = pred_end(BB); I != E; ++I) {

if (!Visited.insert(*I).second)
  continue;

Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.

One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.

Do you propose replacing DT->dominates(BB, Pred) in

if (!DT->isReachableFromEntry(Pred) || DT->dominates(BB, Pred))

with DT->dominates(OrigBB, Pred), or adding the latter as a third check?

It would be great to have this irreducible CFG test added to the existing tests for DSE.

One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?

That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.

Finally, I collected some stats when compiling llvm-ar.ll.

There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.

Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7

In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.

Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.

Interesting data; it may vary from a source language to a source language. I agree that we need more samples to justify adding such heuristic. For now, I propose:

static const unsigned MaxNonLocalAttempts = 10;

Ivan

Hi Ivan,

I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.

I'm playing around with some heuristics in CGP's isFormingBranchFromSelectProfitable
to help mitigate these, and would like to have a feasible solution at hand before landing nonlocal DSE.

As to your specific questions, yes to the "Visited" technique, though I did restructure the checks a bit,
and used a SmallSet instead of SmallPtrSet. I do plan to replace the dominator check, not add an additional one.

That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.

For SafeBlocks we don't need a specific ordering, as it's only used for membership queries.
As for the worklist (Blocks) I think the implementation would work fine if this was unordered,
though I left its type unchanged (SmallVector).

Yes, I added a test for the irreducibility issue.

And I didn't intend to add another compile time throttle at this time.
That can be easily added should an issue arise.

regards,

mitch

In D13363#344496, @mbodart wrote:

I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.

Any updates, Mitch?

Unfortunately the tuning I had planned in isFormingBranchFromSelectProfitable is no longer applicable.
The relevant heuristic was the "load feeding compare", which I intended to relax a bit.
Per discussions in other code reviews in that area (I don't have the number off hand, but maybe
related to D16836 and D17288), that heurstic has been deleted altogether, and there seemed to be a
general consensus to move this transformation into the target.

I've been looking into that a bit, but it's slow going.

In the meantime, it seemed counterproductive to enable the dead store enhancements.
While the concept sounds good, in practice it seems there are only modest gains,
and some rather large downsides.

flyingforyou added a subscriber: flyingforyou.Apr 1 2016, 5:57 PM

mcrosier mentioned this in D22568: [DSE] Implement dead store elimination using MemorySSA (disabled by default)..Jul 20 2016, 7:00 AM

jfb added a subscriber: jfb.Jan 30 2019, 1:52 PM

Revision Contents

Path

Size

lib/

Transforms/

Scalar/

DeadStoreElimination.cpp

333 lines

test/

Transforms/

DeadStoreElimination/

26 lines

22 lines

27 lines

34 lines

42 lines

Diff 42326

lib/Transforms/Scalar/DeadStoreElimination.cpp

Context not available.
	//	//
	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//
	//	//
	// This file implements a trivial dead store elimination that only considers	// This file implements dead store elimination that considers redundant stores
	// basic-block local redundant stores.	// within a basic-block as well as across basic blocks in a reverse CFG order.
	//
	// FIXME: This should eventually be extended to be a post-dominator tree
	// traversal. Doing so would be pretty trivial.
	//	//
	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//

Context not available.
	STATISTIC(NumRedundantStores, "Number of redundant stores deleted");	STATISTIC(NumRedundantStores, "Number of redundant stores deleted");
	STATISTIC(NumFastStores, "Number of stores deleted");	STATISTIC(NumFastStores, "Number of stores deleted");
	STATISTIC(NumFastOther , "Number of other instrs removed");	STATISTIC(NumFastOther , "Number of other instrs removed");
		STATISTIC(NumNonLocalStores, "Number of non-local stores deleted");

		static cl::opt<bool> EnableNonLocalDSE("enable-nonlocal-dse", cl::init(true));

		/// MaxNonLocalAttempts is an arbitrary threshold that provides
		/// an early opportunitiy for bail out to control compile time.
		static const unsigned MaxNonLocalAttempts = 100;

	namespace {	namespace {
	struct DSE : public FunctionPass {	struct DSE : public FunctionPass {
Context not available.
	bool runOnBasicBlock(BasicBlock &BB);	bool runOnBasicBlock(BasicBlock &BB);
	bool MemoryIsNotModifiedBetween(Instruction FirstI, Instruction SecondI);	bool MemoryIsNotModifiedBetween(Instruction FirstI, Instruction SecondI);
	bool HandleFree(CallInst *F);	bool HandleFree(CallInst *F);
		bool handleNonLocalDependency(Instruction *Inst);
	bool handleEndBlock(BasicBlock &BB);	bool handleEndBlock(BasicBlock &BB);
		majnemerUnsubmitted Done Reply Inline Actions Functions should start with a lower case letter. majnemer: Functions should start with a lower case letter.
	void RemoveAccessedObjects(const MemoryLocation &LoadedLoc,	void RemoveAccessedObjects(const MemoryLocation &LoadedLoc,
	SmallSetVector<Value *, 16> &DeadStackObjects,	SmallSetVector<Value *, 16> &DeadStackObjects,
Context not available.
	bool DSE::runOnBasicBlock(BasicBlock &BB) {	bool DSE::runOnBasicBlock(BasicBlock &BB) {
	const DataLayout &DL = BB.getModule()->getDataLayout();	const DataLayout &DL = BB.getModule()->getDataLayout();
	bool MadeChange = false;	bool MadeChange = false;
		unsigned NumNonLocalAttempts = 0;

	// Do a top-down walk on the BB.	// Do a top-down walk on the BB.
	for (BasicBlock::iterator BBI = BB.begin(), BBE = BB.end(); BBI != BBE; ) {	for (BasicBlock::iterator BBI = BB.begin(), BBE = BB.end(); BBI != BBE; ) {
Context not available.

	MemDepResult InstDep = MD->getDependency(Inst);	MemDepResult InstDep = MD->getDependency(Inst);

	// Ignore any store where we can't find a local dependence.	if (InstDep.isDef() \|\| InstDep.isClobber()) {
	// FIXME: cross-block DSE would be fun. :)	// Figure out what location is being stored to.
	if (!InstDep.isDef() && !InstDep.isClobber())	MemoryLocation Loc = getLocForWrite(Inst, *AA);
	continue;

	// Figure out what location is being stored to.
	MemoryLocation Loc = getLocForWrite(Inst, *AA);

	// If we didn't get a useful location, fail.
	if (!Loc.Ptr)
	continue;

	while (InstDep.isDef() \|\| InstDep.isClobber()) {	// If we didn't get a useful location, fail.
	// Get the memory clobbered by the instruction we depend on. MemDep will	if (!Loc.Ptr)
	// skip any instructions that 'Loc' clearly doesn't interact with. If we	continue;
	// end up depending on a may- or must-aliased load, then we can't optimize
	// away the store and we bail out. However, if we depend on on something
	// that overwrites the memory location we can potentially optimize it.
	//
	// Find out what memory location the dependent instruction stores.
	Instruction *DepWrite = InstDep.getInst();
	MemoryLocation DepLoc = getLocForWrite(DepWrite, *AA);
	// If we didn't get a useful location, or if it isn't a size, bail out.
	if (!DepLoc.Ptr)
	break;

	// If we find a write that is a) removable (i.e., non-volatile), b) is	while (InstDep.isDef() \|\| InstDep.isClobber()) {
	// completely obliterated by the store to 'Loc', and c) which we know that	// Get the memory clobbered by the instruction we depend on. MemDep will
		mbodartUnsubmitted Not Done Reply Inline Actions I didn't see a response to my suggestion to keep track of SafeToProcessNonLocalDeps in this loop. It seems incongruous and unnecessarily limiting to let any isDef or isClobber dependency here inhibit global DSE, while HandleNonLocalDependency tolerates them as long as they don't interfere. That said, it may be better left for a future change set, as it will slightly complicate this loop. mbodart: I didn't see a response to my suggestion to keep track of SafeToProcessNonLocalDeps in this…
	// 'Inst' doesn't load from, then we can remove it.	// skip any instructions that 'Loc' clearly doesn't interact with. If we
	if (isRemovable(DepWrite) &&	// end up depending on a may- or must-aliased load, then we can't optimize
	!isPossibleSelfRead(Inst, Loc, DepWrite, TLI, AA)) {	// away the store and we bail out. However, if we depend on on something
	int64_t InstWriteOffset, DepWriteOffset;	// that overwrites the memory location we can potentially optimize it.
	OverwriteResult OR =	//
	isOverwrite(Loc, DepLoc, DL, *TLI, DepWriteOffset, InstWriteOffset);	// Find out what memory location the dependent instruction stores.
	if (OR == OverwriteComplete) {	Instruction *DepWrite = InstDep.getInst();
	DEBUG(dbgs() << "DSE: Remove Dead Store:\n DEAD: "	MemoryLocation DepLoc = getLocForWrite(DepWrite, *AA);
	<< DepWrite << "\n KILLER: " << Inst << '\n');	// If we didn't get a useful location, or if it isn't a size, bail out.
		if (!DepLoc.Ptr)
	// Delete the store and now-dead instructions that feed it.	break;
		ivanbaevUnsubmitted Not Done Reply Inline Actions The new patch is more aggressive in searching for redundant stores. Let's defer your suggestion for a subsequent patch. ivanbaev: The new patch is more aggressive in searching for redundant stores. Let's defer your suggestion…
		mbodartUnsubmitted Not Done Reply Inline Actions That's fine with me. mbodart: That's fine with me.
	DeleteDeadInstruction(DepWrite, MD, TLI);
	++NumFastStores;	// If we find a write that is a) removable (i.e., non-volatile), b) is
	MadeChange = true;	// completely obliterated by the store to 'Loc', and c) which we know that
		// 'Inst' doesn't load from, then we can remove it.
	// DeleteDeadInstruction can delete the current instruction in loop	if (isRemovable(DepWrite) &&
	// cases, reset BBI.	!isPossibleSelfRead(Inst, Loc, DepWrite, TLI, AA)) {
	BBI = Inst->getIterator();	int64_t InstWriteOffset, DepWriteOffset;
	if (BBI != BB.begin())	OverwriteResult OR =
	--BBI;	isOverwrite(Loc, DepLoc, DL, *TLI, DepWriteOffset, InstWriteOffset);
	break;	if (OR == OverwriteComplete) {
	} else if (OR == OverwriteEnd && isShortenable(DepWrite)) {	DEBUG(dbgs() << "DSE: Remove Dead Store:\n DEAD: "
	// TODO: base this on the target vector size so that if the earlier	<< DepWrite << "\n KILLER: " << Inst << '\n');
	// store was too small to get vector writes anyway then its likely
	// a good idea to shorten it	// Delete the store and now-dead instructions that feed it.
	// Power of 2 vector writes are probably always a bad idea to optimize	DeleteDeadInstruction(DepWrite, MD, TLI);
	// as any store/memset/memcpy is likely using vector instructions so	++NumFastStores;
	// shortening it to not vector size is likely to be slower
	MemIntrinsic* DepIntrinsic = cast<MemIntrinsic>(DepWrite);
	unsigned DepWriteAlign = DepIntrinsic->getAlignment();
	if (llvm::isPowerOf2_64(InstWriteOffset) \|\|
	((DepWriteAlign != 0) && InstWriteOffset % DepWriteAlign == 0)) {

	DEBUG(dbgs() << "DSE: Remove Dead Store:\n OW END: "
	<< *DepWrite << "\n KILLER (offset "
	<< InstWriteOffset << ", "
	<< DepLoc.Size << ")"
	<< *Inst << '\n');

	Value* DepWriteLength = DepIntrinsic->getLength();
	Value* TrimmedLength = ConstantInt::get(DepWriteLength->getType(),
	InstWriteOffset -
	DepWriteOffset);
	DepIntrinsic->setLength(TrimmedLength);
	MadeChange = true;	MadeChange = true;

		// DeleteDeadInstruction can delete the current instruction in loop
		// cases, reset BBI.
		BBI = Inst->getIterator();
		if (BBI != BB.begin())
		--BBI;
		break;
		} else if (OR == OverwriteEnd && isShortenable(DepWrite)) {
		// TODO: base this on the target vector size so that if the earlier
		// store was too small to get vector writes anyway then its likely
		// a good idea to shorten it
		// Power of 2 vector writes are probably always a bad idea to optimize
		// as any store/memset/memcpy is likely using vector instructions so
		// shortening it to not vector size is likely to be slower
		MemIntrinsic* DepIntrinsic = cast<MemIntrinsic>(DepWrite);
		unsigned DepWriteAlign = DepIntrinsic->getAlignment();
		majnemerUnsubmitted Done Reply Inline Actions Pointers go on the right side, please use `auto ` to avoid repetition. majnemer:* Pointers go on the right side, please use `auto *` to avoid repetition.
		ivanbaevUnsubmitted Not Done Reply Inline Actions this is actually the current code; done ivanbaev: this is actually the current code; done
		if (llvm::isPowerOf2_64(InstWriteOffset) \|\|
		((DepWriteAlign != 0) && InstWriteOffset % DepWriteAlign == 0)) {

		DEBUG(dbgs() << "DSE: Remove Dead Store:\n OW END: "
		<< *DepWrite << "\n KILLER (offset "
		<< InstWriteOffset << ", "
		<< DepLoc.Size << ")"
		<< *Inst << '\n');

		Value* DepWriteLength = DepIntrinsic->getLength();
		Value* TrimmedLength = ConstantInt::get(DepWriteLength->getType(),
		InstWriteOffset -
		DepWriteOffset);
		DepIntrinsic->setLength(TrimmedLength);
		majnemerUnsubmitted Done Reply Inline Actions Pointers lean right. majnemer: Pointers lean right.
		ivanbaevUnsubmitted Not Done Reply Inline Actions this is actually the current code; done ivanbaev: this is actually the current code; done
		MadeChange = true;
		}
	}	}
	}	}
	}

	// If this is a may-aliased store that is clobbering the store value, we	// If this is a may-aliased store that is clobbering the store value, we
	// can keep searching past it for another must-aliased pointer that stores	// can keep searching past it for another must-aliased pointer that stores
	// to the same location. For example, in:	// to the same location. For example, in:
	// store -> P	// store -> P
	// store -> Q	// store -> Q
	// store -> P	// store -> P
	// we can remove the first store to P even though we don't know if P and Q	// we can remove the first store to P even though we don't know if P and Q
	// alias.	// alias.
	if (DepWrite == &BB.front()) break;	if (DepWrite == &BB.front()) break;

	// Can't look past this instruction if it might read 'Loc'.	// Can't look past this instruction if it might read 'Loc'.
	if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)	if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)
	break;	break;

	InstDep = MD->getPointerDependencyFrom(Loc, false,	InstDep = MD->getPointerDependencyFrom(Loc, false,
	DepWrite->getIterator(), &BB);	DepWrite->getIterator(), &BB);
		}
		} else if (EnableNonLocalDSE && InstDep.isNonLocal()) { // DSE across BB
		mbodartUnsubmitted Done Reply Inline Actions For safety I think we also need to check that Inst is not a self read. mbodart: For safety I think we also need to check that Inst is not a self read.
		mbodartUnsubmitted Done Reply Inline Actions The self read issue has not been addressed. mbodart: The self read issue has not been addressed.
		ivanbaevUnsubmitted Done Reply Inline Actions It looks like that some comments are not aligned to the corresponding lines of code. The self read issue has been addressed in line 888 by adding there a call to: isPossibleSelfRead(Inst, Loc, Dependency, TLI, AA) It is similar to the local case. ivanbaev: It looks like that some comments are not aligned to the corresponding lines of code. The self…
		mbodartUnsubmitted Done Reply Inline Actions It seems useful for compile time to check for a self-read right here, as in that case there is no point in processing non-local dependences. Such a check is only complicated by the fact that isPossibleSelfRead requires a "DepWrite" parameter. We would need to modify that routine to allow a nullptr for DepWrite. What I'm unsure of, however, is whether this being a non-self read is already implied. Is it the case that a StoreInst will never be a self read, for example? mbodart: It seems useful for compile time to check for a self-read right here, as in that case there is…
		ivanbaevUnsubmitted Not Done Reply Inline Actions A StoreInst will never be a self read. ivanbaev: A StoreInst will never be a self read.
		mbodartUnsubmitted Done Reply Inline Actions If that's the case, then we should be able to remove the isSelfRead check in handleNonLocalDependency. mbodart: If that's the case, then we should be able to remove the isSelfRead check in…
		if (++NumNonLocalAttempts < MaxNonLocalAttempts)
		mbodartUnsubmitted Not Done Reply Inline Actions I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based on how many blocks and dependences it has to search through. So even if MaxNonLocalAttempts is 1, we could theoretically encounter a compile time problem. I'm not averse to keeping MaxNonLocalAttempts here, but would like to see a throttle in HandleNonLocalDependency. mbodart: I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based…
		ivanbaevUnsubmitted Not Done Reply Inline Actions MemoryDependenceAnalysis has BlockScanLimit = 100 and NumResultsLimit = 100. With these and the new MaxNonLocalAttempts = 100 we have enough handles to control compile-time. ivanbaev: MemoryDependenceAnalysis has BlockScanLimit = 100 and NumResultsLimit = 100. With these and the…
		mbodartUnsubmitted Done Reply Inline Actions I see. Thanks for pointing that out. Those seem reasonable as a starting point. But as Bruno suggested, some actual compile time measurments are probably in order, just to confirm. mbodart: I see. Thanks for pointing that out. Those seem reasonable as a starting point. But as Bruno…
		ivanbaevUnsubmitted Done Reply Inline Actions I'm running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. ivanbaev: I'm running LNT tests with ARM/AArch64 compiler without and with this patch, and will report…
		MadeChange \|= handleNonLocalDependency(Inst);
	}	}
		majnemerUnsubmitted Done Reply Inline Actions Please keep the `else if` on the same line as the closing `}` majnemer: Please keep the `else if` on the same line as the closing `}`
	}	}

Context not available.
	return MadeChange;	return MadeChange;
	}	}

		/// A helper for handleNonLocalDependency() function to find all blocks
		/// that lead to the input block BB and append them to the output PredBlocks.
		/// PredBlocks will include not only predecessors of BB that unconditionally
		/// lead to BB but also:
		/// - single-block loops that lead to BB, and
		/// - if-blocks for which one edge goes to BB and the other edge goes to
		/// a block in the input SafeBlocks.
		/// PredBlocks will not include blocks unreachable from the entry block, nor
		/// blocks that form cycles with BB.
		static void
		findSafePreds(SmallVectorImpl<BasicBlock *> &PredBlocks,
		SmallSetVector<BasicBlock*, 8> &SafeBlocks,
		majnemerUnsubmitted Not Done Reply Inline Actions Pointers lean right. majnemer: Pointers lean right.
		BasicBlock BB, DominatorTree DT) {
		for (auto I = pred_begin(BB), E = pred_end(BB); I != E; ++I) {
		majnemerUnsubmitted Not Done Reply Inline Actions Can a range-based for loop be used here? Something like: `for (const BasicBlock Pred : BB)` majnemer: Can a range-based for loop be used here? Something like: `for (const BasicBlock Pred : BB)`
		BasicBlock Pred = I;
		if (Pred == BB)
		continue;
		// The second check below prevents adding blocks that form a cycle with BB
		// in order to avoid potential problems due to MemoryDependenceAnalysis,
		// isOverwrite, etc. being not loop-aware.
		if (!DT->isReachableFromEntry(Pred) \|\| DT->dominates(BB, Pred))
		continue;

		bool PredIsSafe = true;
		for (auto II = succ_begin(Pred), EE = succ_end(Pred); II != EE; ++II) {
		BasicBlock Succ = II;
		if (Succ == BB \|\| Succ == Pred) // shortcut, BB should be in SafeBlocks
		continue;
		if (!SafeBlocks.count(Succ)) {
		PredIsSafe = false;
		break;
		}
		}
		if (PredIsSafe)
		PredBlocks.push_back(Pred);
		}
		}

		static bool underlyingObjectsDoNotAlias(StoreInst SI, LoadInst LI,
		const DataLayout &DL,
		AliasAnalysis &AA) {
		Value *AObj = GetUnderlyingObject(SI->getPointerOperand(), DL);
		SmallVector<Value *, 4> Pointers;
		GetUnderlyingObjects(LI->getPointerOperand(), Pointers, DL);

		for (auto I = Pointers.begin(), E = Pointers.end(); I != E; ++I) {
		majnemerUnsubmitted Not Done Reply Inline Actions Likewise, can you use one here? majnemer: Likewise, can you use one here?
		Value BObj = I;
		if (!AA.isNoAlias(AObj, DL.getTypeStoreSize(AObj->getType()), BObj,
		DL.getTypeStoreSize(BObj->getType())))
		return false;
		}
		return true;
		}

		/// handleNonLocalDependency - Handle a non-local dependency on
		/// the input instruction Inst located in BB in attempt to remove
		/// redundant stores outside BB.
		bool DSE::handleNonLocalDependency(Instruction *Inst) {
		auto *SI = dyn_cast<StoreInst>(Inst);
		if (!SI)
		return false;
		// Get the location being stored to.
		// If we don't get a useful location, bail out.
		MemoryLocation Loc = getLocForWrite(SI, *AA);
		if (!Loc.Ptr)
		return false;

		bool MadeChange = false;
		BasicBlock *BB = Inst->getParent();
		const DataLayout &DL = BB->getModule()->getDataLayout();

		// Worklist of predecessor blocks of BB
		SmallVector<BasicBlock *, 8> Blocks;
		// Keep track of all predecessor blocks that are safe to search through
		SmallSetVector<BasicBlock*, 8> SafeBlocks;
		majnemerUnsubmitted Not Done Reply Inline Actions Pointers lean right. majnemer: Pointers lean right.
		SafeBlocks.insert(BB);
		findSafePreds(Blocks, SafeBlocks, BB, DT);

		while (!Blocks.empty()) {
		BasicBlock *PB = Blocks.pop_back_val();
		MemDepResult Dep = MD->getPointerDependencyFrom(Loc, false, PB->end(),
		PB, SI);
		while (Dep.isDef() \|\| Dep.isClobber()) {
		Instruction *Dependency = Dep.getInst();

		// Filter out false dependency from a load to SI looking through phis.
		if (auto *LI = dyn_cast<LoadInst>(Dependency)) {
		if (underlyingObjectsDoNotAlias(SI, LI, DL, *AA)) {
		Dep = MD->getPointerDependencyFrom(Loc, false,
		Dependency->getIterator(), PB, SI);
		continue;
		}
		}

		// If we don't get a useful location for the dependent instruction,
		// it doesn't write memory, it is not removable, or it might read Loc,
		// then bail out.
		MemoryLocation DepLoc = getLocForWrite(Dependency, *AA);
		if (!DepLoc.Ptr \|\| !hasMemoryWrite(Dependency, *TLI) \|\|
		!isRemovable(Dependency) \|\|
		(AA->getModRefInfo(Dependency, Loc) & MRI_Ref))
		break;

		// Don't remove a store within single-block loops;
		// we need more analysis: e.g. looking for an interferring load
		// above the store within the loop, etc.
		bool SingleBlockLoop = false;
		for (auto I = succ_begin(PB), E = succ_end(PB); I != E; ++I) {
		BasicBlock Succ = I;
		if (Succ == PB) {
		SingleBlockLoop = true;
		break;
		}
		}
		if (SingleBlockLoop)
		break;

		int64_t InstWriteOffset, DepWriteOffset;
		OverwriteResult OR =
		isOverwrite(Loc, DepLoc, DL, *TLI, DepWriteOffset, InstWriteOffset);
		if (OR == OverwriteComplete) {
		DEBUG(dbgs() << "DSE: Remove Non-Local Dead Store:\n DEAD: "
		<< Dependency << "\n KILLER: " << SI << '\n');

		// Delete redundant store and now-dead instructions that feed it.
		auto Next = std::next(Dependency->getIterator());
		DeleteDeadInstruction(Dependency, MD, TLI);
		++NumNonLocalStores;
		MadeChange = true;
		Dep = MD->getPointerDependencyFrom(Loc, false, Next, PB, SI);
		continue;
		}
		// TODO: attempt shortening of Dependency inst as in the local case
		break;
		}

		if (Dep.isNonLocal()) {
		SafeBlocks.insert(PB);
		majnemerUnsubmitted Done Reply Inline Actions Functions should start with a lowercase letter. majnemer: Functions should start with a lowercase letter.
		findSafePreds(Blocks, SafeBlocks, PB, DT);
		majnemerUnsubmitted Done Reply Inline Actions This has yet to be addressed. majnemer: This has yet to be addressed.
		ivanbaevUnsubmitted Not Done Reply Inline Actions This is similar to the code for iterating over predecessors in existing void FindUnconditionalPreds() function. I suggest to change both in a separate patch. ivanbaev: This is similar to the code for iterating over predecessors in existing void…
		}
		}

		majnemerUnsubmitted Done Reply Inline Actions Can you use a range-base for loop with `predecessors` ? majnemer: Can you use a range-base for loop with `predecessors` ?
		mbodartUnsubmitted Done Reply Inline Actions I still don't understand why we need to check DT->dominates(BB, Pred) here. I'm guessing you're trying to guard against an infinite loop, which of course we need to do. But I think that would be automatically handled by instead checking for !SafeBlocks.count(Pred). In addition to allowing the search to proceed further, this would eliminate the need for the special check of Pred == BB, since BB must be in SafeBlocks. mbodart: I still don't understand why we need to check DT->dominates(BB, Pred) here. I'm guessing you're…
		ivanbaevUnsubmitted Done Reply Inline Actions We need DT->dominates(BB, Pred) check to avoid situations similar to the one in cycle.ll test below. This test is a simplified version of a larger function found in a SPEC benchmark. ivanbaev: We need DT->dominates(BB, Pred) check to avoid situations similar to the one in cycle.ll test…
		mbodartUnsubmitted Done Reply Inline Actions If nonlocal DSE would otherwise remove the store in the cycle.ll test, it seems to me that's a problem with how we are using alias analysis. The other store in that loop is not guaranteed to overwrite the removed store's location on every loop iteration. Perhaps it's simply my unfamiliarity with LLVM's alias analysis interfaces, but it seems like there would be some support for checking this definitive overlap, without requiring clients to also factor in dominance information. I'd be interested in hearing what the other reviewers have to say. mbodart: If nonlocal DSE would otherwise remove the store in the cycle.ll test, it seems to me that's a…
		ivanbaevUnsubmitted Not Done Reply Inline Actions By design, DSE - local, non-local, other parts - depends on MemoryDependenceAnalysis, and only occasionally relies on alias analysis. ivanbaev: By design, DSE - local, non-local, other parts - depends on MemoryDependenceAnalysis, and only…
		mbodartUnsubmitted Not Done Reply Inline Actions I was using the term alias analysis generically, referring to either AA or MD. If usage of those interfaces is suspect in the circumstances of the cycle.ll test, I'd like to understand why exactly. mbodart: I was using the term alias analysis generically, referring to either AA or MD. If usage of…
		mbodartUnsubmitted Done Reply Inline Actions OK, I think I get it now. Fundamentally the isOverwrite routine assumes that (P1 == P2) means that pointers P1 and P2 always have the same value, and is not considering back edge (phi) issues. I think your dominance check is sufficient to avoid problems in this area. But as this is a subtle issue, please add a source comment at the dominance check explaining the situation. mbodart: OK, I think I get it now. Fundamentally the isOverwrite routine assumes that (P1 == P2) means…
		return MadeChange;
		}

	/// Returns true if the memory which is accessed by the second instruction is not	/// Returns true if the memory which is accessed by the second instruction is not
	/// modified between the first and the second instruction.	/// modified between the first and the second instruction.
	/// Precondition: Second instruction must be dominated by the first	/// Precondition: Second instruction must be dominated by the first
Context not available.
		mbodartUnsubmitted Done Reply Inline Actions I don't understand why we are stopping the search here. If the overwrite is partial, there may still be preceding stores that could be eliminated. mbodart: I don't understand why we are stopping the search here. If the overwrite is partial, there may…
		mbodartUnsubmitted Done Reply Inline Actions Shouldn't this check also include && Dep.isNonLocal()? mbodart: Shouldn't this check also include && Dep.isNonLocal()?
		mbodartUnsubmitted Done Reply Inline Actions Instead of InstPt, shouldn't the search start with PB->end()? It looks like getPointerDependencyFrom does a "--" operation on this iterator to get the first instruction to start its analysis from. mbodart: Instead of InstPt, shouldn't the search start with PB->end()? It looks like…
		mbodartUnsubmitted Done Reply Inline Actions [Ivan] Each BB ends with a terminator instruction and these should not have contain dependencies for the store. An Invoke is one instance of a terminator that can have memory dependences. I don't know if our CFG search could ever get through both edges coming out of an Invoke, but I think it is best to avoid any such assumptions, and let AA make the dependency determinations. mbodart: > [Ivan] Each BB ends with a terminator instruction and these should not have contain…
		majnemerUnsubmitted Done Reply Inline Actions Could you use a range based for loop here? majnemer: Could you use a range based for loop here?
		majnemerUnsubmitted Done Reply Inline Actions This hasn't been addressed. majnemer: This hasn't been addressed.
		ivanbaevUnsubmitted Done Reply Inline Actions There are about 8 places in DeadStoreElimination.cpp where we can use range based for loops. I suggest these changes to be made in a separate patch. ivanbaev: There are about 8 places in DeadStoreElimination.cpp where we can use range based for loops. I…
		majnemerUnsubmitted Not Done Reply Inline Actions Could this be cached in the constructor? majnemer: Could this be cached in the constructor?
		majnemerUnsubmitted Not Done Reply Inline Actions This hasn't been addressed. majnemer: This hasn't been addressed.
		ivanbaevUnsubmitted Not Done Reply Inline Actions That is a good point. There are many places in DSE where DataLayout is used or passed as argument. However, it seems that most transformations in Transform/, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way. ivanbaev:* That is a good point. There are many places in DSE where DataLayout is used or passed as…
		majnemerUnsubmitted Done Reply Inline Actions Please use `auto `. majnemer:* Please use `auto *`.
		majnemerUnsubmitted Done Reply Inline Actions Please use `auto `. majnemer:* Please use `auto *`.
		mbodartUnsubmitted Not Done Reply Inline Actions I'm still ramping up on LLVM and thus not deeply familiar with all of AA's capabilities. But it struck me as odd that we would need a custom routine to look for a memory conflict. Is there no existing AA or MemoryDep interface that meets our need, and if not, should one be added? mbodart: I'm still ramping up on LLVM and thus not deeply familiar with all of AA's capabilities. But it…
		ivanbaevUnsubmitted Not Done Reply Inline Actions If there is a client outside DSE we can move this helper into a utility directory ivanbaev: If there is a client outside DSE we can move this helper into a utility directory
		mbodartUnsubmitted Not Done Reply Inline Actions This test is loop invariant and should probably be cached before the loop. This is especially true after generalizing FindSafePreds to allow an arbitrary number of successors. mbodart: This test is loop invariant and should probably be cached before the loop. This is especially…
		ivanbaevUnsubmitted Not Done Reply Inline Actions The idea is to allow the search continue through simple loops, but not to remove stores within loops. ivanbaev: The idea is to allow the search continue through simple loops, but not to remove stores within…
		mbodartUnsubmitted Done Reply Inline Actions Sorry, I'm not sure which comment you were responding to. Phabricator for some reason seems to be placing some of my comments above the lines of interest, and others below. Regarding my "loop invariant" comment, I'm saying that the test for whether the current block is a single-block loop (getNumSuccessors() == 2, etc) should probably be done outside of this loop. I'm also not sure if checking explicitly for two successors is sufficient. A program could conceivably have a switch statement with one case having a goto back to a label just prior to the switch. I'm not familiar enough with LLVM optimizations to know whether that can lead to an indirect branch successor which is the block containing the indirect branch itself. But it's theoretically possible. My recommendation here is to use a successor iterator, to match the behavior present in FindSafePreds. mbodart: Sorry, I'm not sure which comment you were responding to. Phabricator for some reason seems to…
		ivanbaevUnsubmitted Done Reply Inline Actions I will use a successor iterator here, thanks. ivanbaev: I will use a successor iterator here, thanks.
		mbodartUnsubmitted Done Reply Inline Actions [Ivan] We know that the first dependence - which triggers invocation of HandleNonLocalDependency(Inst) - is non-local Perhaps I don't quite understand the possible return values from MD, but it seems like Dep could be Unknown or NonLocalFunc here, in which case we should terminate the search. mbodart: > [Ivan] We know that the first dependence - which triggers invocation of…
		mbodartUnsubmitted Not Done Reply Inline Actions This is true for the common case of exact overlap. But it might be worth adding a TBD comment to the effect that we could reasonably continue the search if the dead instruction was only a partial overlap. I'm not suggesting we do the full "shortening" transformation that local DSE does (though that might be fun). But catching simple scalar stores would just work. mbodart: This is true for the common case of exact overlap. But it might be worth adding a TBD comment…
		mbodartUnsubmitted Done Reply Inline Actions The Succ == BB check is correct, but redundant as SafeBlocks.count(BB) must be non-zero. I think the code would read better if this and the next check were combined into one: if (Succ != Pred && !SafeBlocks.count(Succ)) but would be OK with your current structure. Either way, however, it would be useful to add a clarifying source comment to the effect: // Pred is only known to be optimizeable if all of its successors are either safe or are Pred itself. mbodart: The Succ == BB check is correct, but redundant as SafeBlocks.count(BB) must be non-zero. I…
		ivanbaevUnsubmitted Done Reply Inline Actions Agree that Succ == BB check is redundant, but it is a shortcut compared to checking a membership in a set. That will save some compile time. Sure, I will add the comment you suggested. ivanbaev: Agree that Succ == BB check is redundant, but it is a shortcut compared to checking a…
		mbodartUnsubmitted Done Reply Inline Actions This break is just exiting the succ iterator loop. We need to exit the containing while loop. mbodart: This break is just exiting the succ iterator loop. We need to exit the containing while loop.
		mbodartUnsubmitted Done Reply Inline Actions Inst here is the original StoreInst. If StoreInst's can never be a self read, this check will always be false. Do we need an additional check to ensure that Dependency does not read any of Inst's memory, analagous to the check "if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)" in DSE::runOnBasicBlock? mbodart: Inst here is the original StoreInst. If StoreInst's can never be a self read, this check will…
		ivanbaevUnsubmitted Not Done Reply Inline Actions Added AA->getModRefInfo(Dependency, Loc) & MRI_Ref) test. ivanbaev: Added AA->getModRefInfo(Dependency, Loc) & MRI_Ref) test.
		mbodartUnsubmitted Not Done Reply Inline Actions That looks right, thanks! mbodart: That looks right, thanks!

test/Transforms/DeadStoreElimination/cycle.ll

This file was added.

				; RUN: opt < %s -basicaa -dse -S \| FileCheck %s

				@Table = global [535 x i32] zeroinitializer, align 4

				mbodartUnsubmitted Done Reply Inline Actions These tests seem fairly generic. I would prefer that we remove the explicit datalayout and triple here. mbodart: These tests seem fairly generic. I would prefer that we remove the explicit datalayout and…
				; The store in for.inc block should NOT be removed by non-local DSE.
				; CHECK: store i32 64, i32* %arrayidx
				;
				define void @foo() {
				entry:
				br label %for.body

				for.body:
				%i = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
				%arrayidx = getelementptr inbounds [535 x i32], [535 x i32]* @Table, i32 0, i32 %i
				store i32 %i, i32* %arrayidx, align 4
				%cmp1 = icmp slt i32 %i, 64
				br i1 %cmp1, label %for.inc, label %for.end

				for.inc:
				store i32 64, i32* %arrayidx, align 4
				%inc = add nsw i32 %i, 1
				br label %for.body

				for.end:
				ret void
				}

test/Transforms/DeadStoreElimination/ifthen.ll

This file was added.

				; RUN: opt < %s -basicaa -dse -S \| FileCheck %s

				; The store and add in if.then block should be removed by non-local DSE.
				; CHECK-NOT: %stval = add
				; CHECK-NOT: store i32 %stval
				;
				define void @foo(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %c) {
				entry:
				%cmp = icmp sgt i32 %c, 0
				br i1 %cmp, label %if.then, label %if.end

				if.then:
				%0 = load i32, i32* %b, align 4
				%stval = add nsw i32 %0, 1
				store i32 %stval, i32* %a, align 4
				br label %if.end

				if.end:
				%m.0 = phi i32 [ 13, %if.then ], [ 10, %entry ]
				store i32 %m.0, i32* %a, align 4
				ret void
				}

test/Transforms/DeadStoreElimination/ifthenelse.ll

This file was added.

				; RUN: opt < %s -basicaa -dse -S \| FileCheck %s

				; The add and store in entry block should be removed by non-local DSE.
				; CHECK-NOT: %stval = add
				; CHECK-NOT: store i32 %stval
				;
				define void @foo(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %c) {
				entry:
				%0 = load i32, i32* %b, align 4
				%stval = add nsw i32 %0, 1
				store i32 %stval, i32* %a, align 4
				%cmp = icmp sgt i32 %c, 0
				br i1 %cmp, label %if.then, label %if.else

				if.then:
				%1 = add nsw i32 %c, 10
				br label %if.end

				if.else:
				%2 = add nsw i32 %c, 13
				br label %if.end

				if.end:
				%3 = phi i32 [ %1, %if.then ], [ %2, %if.else ]
				store i32 %3, i32* %a, align 4
				ret void
				}

test/Transforms/DeadStoreElimination/ifthenelse2.ll

This file was added.

				; RUN: opt < %s -basicaa -dse -S \| FileCheck %s

				; The add and store in entry block should be removed by non-local DSE.
				; CHECK-NOT: %stval = add
				; CHECK-NOT: store i32 %stval
				;
				; The stores in if.then and if.else blocks should be removed by non-local DSE.
				; CHECK-NOT: store i32 %1
				; CHECK-NOT: store i32 %2
				;
				define void @foo(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %c) {
				entry:
				%0 = load i32, i32* %b, align 4
				%stval = add nsw i32 %0, 1
				store i32 %stval, i32* %a, align 4
				%cmp = icmp sgt i32 %c, 0
				br i1 %cmp, label %if.then, label %if.else

				if.then:
				%1 = add nsw i32 %c, 10
				store i32 %1, i32* %a, align 4
				br label %if.end

				if.else:
				%2 = add nsw i32 %c, 13
				store i32 %2, i32* %a, align 4
				br label %if.end

				if.end:
				%3 = phi i32 [ %1, %if.then ], [ %2, %if.else ]
				%4 = sub nsw i32 %3, 6
				store i32 %4, i32* %a, align 4
				ret void
				}

test/Transforms/DeadStoreElimination/loop.ll

This file was added.

				; RUN: opt < %s -basicaa -dse -S \| FileCheck %s

				; The store in for.body block should be removed by non-local DSE.
				; CHECK-NOT: store i32 0, i32* %arrayidx
				;
				define void @sum(i32 %N, i32* noalias nocapture %C, i32* noalias nocapture readonly %A, i32* noalias nocapture readonly %B) {
				entry:
				%cmp24 = icmp eq i32 %N, 0
				br i1 %cmp24, label %for.end11, label %for.body

				for.body:
				%i.025 = phi i32 [ %inc10, %for.cond1.for.inc9_crit_edge ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i32, i32* %C, i32 %i.025
				store i32 0, i32* %arrayidx, align 4
				%mul = mul i32 %i.025, %N
				%arrayidx4.gep = getelementptr i32, i32* %A, i32 %mul
				br label %for.body3

				for.body3:
				%0 = phi i32 [ 0, %for.body ], [ %add8, %for.body3 ]
				%arrayidx4.phi = phi i32* [ %arrayidx4.gep, %for.body ], [ %arrayidx4.inc, %for.body3 ]
				%arrayidx5.phi = phi i32* [ %B, %for.body ], [ %arrayidx5.inc, %for.body3 ]
				%j.023 = phi i32 [ 0, %for.body ], [ %inc, %for.body3 ]
				%1 = load i32, i32* %arrayidx4.phi, align 4
				%2 = load i32, i32* %arrayidx5.phi, align 4
				%add6 = add nsw i32 %2, %1
				%add8 = add nsw i32 %add6, %0
				%inc = add i32 %j.023, 1
				%exitcond = icmp ne i32 %inc, %N
				%arrayidx4.inc = getelementptr i32, i32* %arrayidx4.phi, i32 1
				%arrayidx5.inc = getelementptr i32, i32* %arrayidx5.phi, i32 1
				br i1 %exitcond, label %for.body3, label %for.cond1.for.inc9_crit_edge

				for.cond1.for.inc9_crit_edge:
				store i32 %add8, i32* %arrayidx, align 4
				%inc10 = add i32 %i.025, 1
				%exitcond26 = icmp ne i32 %inc10, %N
				br i1 %exitcond26, label %for.body, label %for.end11

				for.end11:
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

[DeadStoreElimination] Add support for non-local DSEClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 42326

lib/Transforms/Scalar/DeadStoreElimination.cpp

test/Transforms/DeadStoreElimination/cycle.ll

test/Transforms/DeadStoreElimination/ifthen.ll

test/Transforms/DeadStoreElimination/ifthenelse.ll

test/Transforms/DeadStoreElimination/ifthenelse2.ll

test/Transforms/DeadStoreElimination/loop.ll

[DeadStoreElimination] Add support for non-local DSE
ClosedPublic