We extend the search for redundant stores to predecessor blocks that
unconditionally lead to the block BB with the current store instruction.
That also includes single-block loops that unconditionally lead to BB,
and if-then-else blocks where then- and else-blocks unconditionally lead
to BB.
Details
Diff Detail
Event Timeline
Note that the code for BB DSE stays the same, it is just indented two spaces to the right.
The new code begins at:
// DSE across BB else if (InstDep.isNonLocal()) { ...
and it includes helper FindUncondPredsIncludingSimpleLoops().
Some comments:
*) Your algorithm has quadratic complexity in worst case. I know this is the same as for the single block algorithm, but extending it to multiple blocks might worsen the situation. Maybe you add some sort of upper bound for the number of dependency checks?
*) Your algorithm only handles certain types of cross-block DSE. As I understood, it cannot go beyond an if-then-else control flow split. I'm wondering if it makes sense to do a simple data flow analysis to collect the eligible predecessor blocks. This would handle all cases and I think it would be not much more implementation effort and doesn't add compile time.
*) I found it confusing that all the code is contained in this large functions with many nested loops and breaks and continues. I think it's better to extract some of the code to separate functions (e.g. to avoid the StopProcessPBAndItsPreds variable).
*) The tests look like they contain much code which is not really necessary to test what you want to test. Maybe you can reduce them. Also in the first test a comment would help, which describes why no DSE can be done.
Addressed Erik's comments:
*) Your algorithm has quadratic complexity in worst case. I know this is the same as for the single block algorithm, but extending it to multiple blocks might worsen the situation. Maybe you add some sort of upper bound for the number of dependency checks?
- An upper bound (MaxNonLocalAttempts = 100) is added on the number of non-local DSE attempts per function.
*) Your algorithm only handles certain types of cross-block DSE. As I understood, it cannot go beyond an if-then-else control flow split. I'm wondering if it makes sense to do a simple data flow analysis to collect the eligible predecessor blocks. This would handle all cases and I think it would be not much more implementation effort and doesn't add compile time.
- Analyzing general loops is challenging and compile-time extensive due to underlying MemoryDependenceAnalysis. This patch iteratively goes through predecessors that unconditionally lead to the block with the store and search for redundant stores. Not sure if there are opportunities within a reasonable compile-time budget for a truly global DSE.
*) I found it confusing that all the code is contained in this large functions with many nested loops and breaks and continues. I think it's better to extract some of the code to separate functions (e.g. to avoid the StopProcessPBAndItsPreds variable).
- The code is refactored with top algorithm extracted in HandleNonLocalDependency(Instruction *Inst).
*) The tests look like they contain much code which is not really necessary to test what you want to test. Maybe you can reduce them. Also in the first test a comment would help, which describes why no DSE can be done.
- I reduced the cycle test to a much smaller version, and added comments.
In general I think your approach of using a reverse CFG search for this optimization is appropriate. But as Erik points out it doesn't handle if-then-else.
That can be addressed relatively easily:
Let StartBB be the block with the original StoreInst.
Let SafeBlocks be the set of blocks for which we know all paths emanating from them will lead to StoreInst
with no intervening reads of the memory location of interest. Initially, add StartBB to SafeBlocks.
The criteria for adding a new predecessor P to your WorkList are that P is reachable, P is not in SafeBlocks (not visited),
and all of P's successors are P itself (the single-block loop case) or are in SafeBlocks.
Seed your WorkList with the predecessors of StartBB that meet this criteria.
After processing a block B in HandleNonLocalDependency, when DependencyFound is false,
add B to SafeBlocks, and augment the WorkList with B's predecessors that meet the criteria.
The "P is not in SafeBlocks" acts like a "! Visited" check, except that terminal (unsafe) blocks may be
visited multiple times. To avoid that redundancy we'd need another set (UnsafeBlocks, or perhaps just VisitedBlocks).
As to the compile time threshold, I think it needs to be applied to the number of dependences and blocks seen
in HandleNonLocalDependency. That way we'll put a cap on the cost of optimizing an individual store,
rather than limiting the number of store instructions that may get optimized. I don't know what LLVM's
philosophy is on adding new command line options, but perhaps the threshold value should be overridable
that way. And perhaps a cl::opt<bool> EnableGDSE("enable-gdse"...) to control this global dead store elimination.
As for the while loop in runOnBasicBlock, I would prefer that it keep track of whether any reads were seen,
including the isSelfRead case, and then we handle non-local dependences after that loop terminates.
This is the same concept as your DependencySeen tracking in HandleNonLocalDependency, though I might
name it SafeToProcessNonLocalDeps.
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
857 | Instead of InstPt, shouldn't the search start with PB->end()? | |
900 | I don't understand why we are stopping the search here. | |
906 | Shouldn't this check also include && Dep.isNonLocal()? |
The new version now handles a redundant store in the if-block in if-then-else - based on a suggestion by Mitch. We use SafeBlocks set, which is initialized and updated in DSE::HandleNonLocalDependency() and used in helper FindSafePreds(). SafeBlocks keeps track of all predecessor blocks that are safe to search through. A new test - ifthenelse.ll - has been added.
Thanks for suggestions/comments, Mitch. My responses are inlined below.
In general I think your approach of using a reverse CFG search for this optimization is appropriate. But as Erik points out it doesn't handle if-then-else.
That can be addressed relatively easily:
Let StartBB be the block with the original StoreInst.
Let SafeBlocks be the set of blocks for which we know all paths emanating from them will lead to StoreInst
with no intervening reads of the memory location of interest. Initially, add StartBB to SafeBlocks.
The criteria for adding a new predecessor P to your WorkList are that P is reachable, P is not in SafeBlocks (not visited),
and all of P's successors are P itself (the single-block loop case) or are in SafeBlocks.
Seed your WorkList with the predecessors of StartBB that meet this criteria.
After processing a block B in HandleNonLocalDependency, when DependencyFound is false,
add B to SafeBlocks, and augment the WorkList with B's predecessors that meet the criteria.
The "P is not in SafeBlocks" acts like a "! Visited" check, except that terminal (unsafe) blocks may be
visited multiple times. To avoid that redundancy we'd need another set (UnsafeBlocks, or perhaps just VisitedBlocks).
- I added set SafeBlocks to keep track of all predecessor blocks that are safe to search through.
It is seeded with BB, the block with the original store. If no Def/Clobber dependency is found in predecessor PB, then we add PB to SafeBlocks.
Helper FindSafePreds(Blocks, SafeBlocks, PB, DT) then uses SafeBlocks to add blocks with two successors (e.g. an if-block) to the worklist.
As to the compile time threshold, I think it needs to be applied to the number of dependences and blocks seen
in HandleNonLocalDependency. That way we'll put a cap on the cost of optimizing an individual store,
rather than limiting the number of store instructions that may get optimized. I don't know what LLVM's
philosophy is on adding new command line options, but perhaps the threshold value should be overridable
that way. And perhaps a cl::opt<bool> EnableGDSE("enable-gdse"...) to control this global dead store elimination.
As for the while loop in runOnBasicBlock, I would prefer that it keep track of whether any reads were seen,
including the isSelfRead case, and then we handle non-local dependences after that loop terminates.
This is the same concept as your DependencySeen tracking in HandleNonLocalDependency, though I might
name it SafeToProcessNonLocalDeps.
- Use of static const unsigned MaxNonLocalAttempts = 100; in DSE is similar to the use of static const unsigned MaxIVUsers = 200; in LoopStrengthReduce.
Note that the large portion of DSE compile-time is due not to proper DSE but to the underlying MemoryDependenceAnalysis,
especially for computing the non-local dependency information. Controlling computation in MemoryDependenceAnalysis is
outside the scope of DSE. Given the iterative way of adding predecessors and searching them locally for non-local DSE
I don't expect a substantial increase in DSE compile-time.
lib/Transforms/Scalar/DeadStoreElimination.cpp, 857
Instead of InstPt, shouldn't the search start with PB->end()?
It looks like getPointerDependencyFrom does a "--" operation on this iterator
to get the first instruction to start its analysis from.
- Each BB ends with a terminator instruction and these should not have contain dependencies for the store.
See also DSE::HandleFree().
line 900
I don't understand why we are stopping the search here.
If the overwrite is partial, there may still be preceding stores
that could be eliminated.
- In the local BB case, "if (OR == OverwriteComplete) ..." is followed by
"else if (OR == OverwriteEnd && isShortenable(DepWrite)) ..."
I agree that we can be more aggressive here but suggest this to be added by a separate patch.
line 906
Shouldn't this check also include && Dep.isNonLocal()?
- We know that the first dependence - which triggers invocation of HandleNonLocalDependency(Inst) - is non-local.
In HandleNonLocalDependency() we use MemoryDependenceAnalysis::getPointerDependencyFrom()
to performs local search: it walks backwards through the basic block, looking for dependencies.
In each iteration of:
while (!Blocks.empty()) {
BasicBlock *PB = Blocks.pop_back_val(); ... MemDepResult Dep = MD->getPointerDependencyFrom(Loc, false, InstPt, PB, SI); ... }
we are looking for (local) dependencies within PB for the original store instruction. A non-local dependency found
when we search PB should not impact how we process PB.
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
88 | Functions should start with a lower case letter. | |
616 | Pointers go on the right side, please use auto * to avoid repetition. | |
627–630 | Pointers lean right. | |
653–656 | Please keep the else if on the same line as the closing } | |
804 | Functions should start with a lowercase letter. | |
807–808 | Can you use a range-base for loop with predecessors ? | |
812–820 | It would be nicer to split this into more if (X) continue; to reduce the heavy nesting. | |
831–834 | Could you use a range based for loop here? | |
846 | Please use auto *. | |
857 | Could this be cached in the constructor? | |
876 | Please use auto *. |
Hi Ivan,
Thanks for working on this. It would be good to see some LNT compile time results of this patch beforehand since DSE triggered compile time bloat in the past.
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
804–805 | This has yet to be addressed. | |
809–817 | The nesting here is still very large. I'm also not entirely convinced that this code is correct. What if PredTI->getNumSuccessors() == 1 is true but PredTI->getSuccessor(0) == Pred is false? In that case, won't you access PredTI->getSuccessor(1) == Pred ? It would be safer and more clear to simply loop over successors. | |
828–831 | This hasn't been addressed. | |
854 | This hasn't been addressed. |
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
804–805 | This is similar to the code for iterating over predecessors in existing void FindUnconditionalPreds() function. I suggest to change both in a separate patch. | |
809–817 | if PredTI->getNumSuccessors() == 1 is true, then we exit the if statement checks at this point, so we will not access PredTI->getSuccessor(0), etc. Do you think that the following is better? PredBlocks.push_back(Pred); continue; } (PredTI->getSuccessor(0) == Pred || PredTI->getSuccessor(1) == Pred || (SafeBlocks.count(PredTI->getSuccessor(0)) && SafeBlocks.count(PredTI->getSuccessor(1))))) PredBlocks.push_back(Pred); | |
828–831 | There are about 8 places in DeadStoreElimination.cpp where we can use range based for loops. I suggest these changes to be made in a separate patch. | |
854 | That is a good point. There are many places in DSE where DataLayout is used or passed as argument. However, it seems that most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way. |
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
572 | I didn't see a response to my suggestion to keep track of SafeToProcessNonLocalDeps in this loop. It seems incongruous and unnecessarily limiting to let any isDef or isClobber dependency here inhibit global DSE, while HandleNonLocalDependency tolerates them as long as they don't interfere. That said, it may be better left for a future change set, as it will slightly complicate this loop. | |
653 | For safety I think we also need to check that Inst is not a self read. | |
809–817 | I second the suggestion to use a Successor iterator here. for each pred P of BB if P is reachable and !SafeBlocks.count(P)) PredOK = true; for each succ S of P if !SafeBlocks.count(S) && S != P PredOK = false; if PredOK, add P to PredBlocks | |
821 | I'm still ramping up on LLVM and thus not deeply familiar with all of AA's capabilities. | |
854 |
An Invoke is one instance of a terminator that can have memory dependences. | |
892 | This test is loop invariant and should probably be cached before the loop. | |
911 | This is true for the common case of exact overlap. I'm not suggesting we do the full "shortening" transformation that local DSE does (though that might be fun). But catching simple scalar stores would just work. | |
919 |
Perhaps I don't quite understand the possible return values from MD, but it seems like Dep could be Unknown or NonLocalFunc here, in which case we should terminate the search. | |
test/Transforms/DeadStoreElimination/cycle.ll | ||
3–4 | These tests seem fairly generic. |
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
654 | I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based on how many blocks and dependences it has to search through. So even if MaxNonLocalAttempts is 1, we could theoretically encounter a compile time problem. I'm not averse to keeping MaxNonLocalAttempts here, but would like to see a throttle in HandleNonLocalDependency. |
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
572–583 | The new patch is more aggressive in searching for redundant stores. Let's defer your suggestion for a subsequent patch. | |
654 | MemoryDependenceAnalysis has BlockScanLimit = 100 and NumResultsLimit = 100. With these and the new MaxNonLocalAttempts = 100 we have enough handles to control compile-time. | |
821 | If there is a client outside DSE we can move this helper into a utility directory | |
892 | The idea is to allow the search continue through simple loops, but not to remove stores within loops. |
Hi Ivan,
I added a few more responses.
Please be patient as this is a non-trivial optimization, and for me at least it takes some time to poke around and verify things.
I'm still not confident enough in my understanding of alias analysis to know whether underlyingObjectsDoNotAlias is complete, for example.
Also, please try to address some of the comments from David regarding naming conventions.
regards,
- Mitch
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
572–583 | That's fine with me. | |
653 | The self read issue has not been addressed. | |
654 | I see. Thanks for pointing that out. | |
808 | I still don't understand why we need to check DT->dominates(BB, Pred) here. | |
814 | The Succ == BB check is correct, but redundant as SafeBlocks.count(BB) must be non-zero. if (Succ != Pred && !SafeBlocks.count(Succ)) but would be OK with your current structure. // Pred is only known to be optimizeable if all of its successors are either safe or are Pred itself. | |
892 | Sorry, I'm not sure which comment you were responding to. Regarding my "loop invariant" comment, I'm saying that the test for whether the current block |
Hi Mitch, thanks for the thoughtful comments; we are making a good progress.
I will make the two changes: using successor iterator and adding the comment, in the next revision.
Regarding Bruno's request for LNT testing, I'm currently running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. Could someone run this patch with an x86-based compiler if there are concerns about compile-time.
Regarding David's comments, I believe all have been addressed, except caching DataLayout in the constructor.
There are many places in DSE where DataLayout is used or passed as argument. Since most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way, I suggest this change to be done in a future patch.
Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.
Ivan
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
653 | It looks like that some comments are not aligned to the corresponding lines of code. isPossibleSelfRead(Inst, Loc, Dependency, *TLI, *AA) It is similar to the local case. | |
654 | I'm running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. | |
808 | We need DT->dominates(BB, Pred) check to avoid situations similar to the one in cycle.ll test below. This test is a simplified version of a larger function found in a SPEC benchmark. | |
814 | Agree that Succ == BB check is redundant, but it is a shortcut compared to checking a membership in a set. | |
892 | I will use a successor iterator here, thanks. |
Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.
Yes, I'm there on Friday. Looking forward to meet you.
Hi Bruno,
Here are LNT compile-time: ibaev-linuxclang_DEVaarch64 test results
Previous: aarch64 compiler with this patch, non-local DSE
Current: aarch64 compiler without this patch, that is top-of-trunk DSE
Tests Summary
Status Group #
Performance Regressions 141
Performance Improvements 134
Added Tests 1
Existing Failures 489
Unchanged Tests 1191
Total Tests 1956
- Top regressions with Current (top-of-trunk DSE) ***
Performance Regressions - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-08-02-CastTest 400.00% 0.0080 0.0400 -
SingleSource/UnitTests/C++11/stdthreadbug 300.00% 0.0040 0.0160 -
SingleSource/UnitTests/2003-05-31-LongShifts 250.00% 0.0080 0.0280 -
SingleSource/UnitTests/2008-04-20-LoopBug2 200.00% 0.0120 0.0360 -
SingleSource/Regression/C/2004-02-03-AggregateCopy 150.00% 0.0160 0.0400 -
MultiSource/Benchmarks/Prolangs-C++/vcirc/vcirc 100.00% 0.0240 0.0480 -
SingleSource/Benchmarks/Misc/mandel-2 100.00% 0.0360 0.0720 -
SingleSource/UnitTests/2003-04-22-Switch 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/Vector/multiplies 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/block-byref-test 100.00% 0.0120 0.0240 -
....
- Top improvents with Current (top-of-trunk DSE) ***
Performance Improvements - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-10-13-BadLoad -75.00% 0.0160 0.0040 -
SingleSource/UnitTests/2006-12-01-float_varg -71.43% 0.0280 0.0080 -
SingleSource/Regression/C++/2003-06-08-VirtualFunctions -66.67% 0.0240 0.0080 -
SingleSource/Regression/C++/2003-09-29-NonPODsByValue -66.67% 0.0360 0.0120 -
SingleSource/Regression/C++/EH/function_try_block -66.67% 0.0480 0.0160 -
SingleSource/UnitTests/byval-alignment -63.64% 0.0440 0.0160 -
SingleSource/Benchmarks/Misc/pi -58.33% 0.0480 0.0200 -
SingleSource/Regression/C++/short_circuit_dtor -58.33% 0.0480 0.0200 -
SingleSource/UnitTests/2005-05-11-Popcount-ffs-fls -55.56% 0.0720 0.0320 -
SingleSource/UnitTests/Vector/simple -55.56% 0.0360 0.0160 -
SingleSource/Regression/C++/EH/simple_rethrow -54.55% 0.0440 0.0200 -
...
Since this is my first LNT testing, could someone comment on these results. Any outliers?
Thanks,
Ivan
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
653 | It seems useful for compile time to check for a self-read right here, as in that case there is no point in processing non-local dependences. Such a check is only complicated by the fact that isPossibleSelfRead requires a "DepWrite" parameter. We would need to modify that routine to allow a nullptr for DepWrite. What I'm unsure of, however, is whether this being a non-self read is already implied. | |
808 | If nonlocal DSE would otherwise remove the store in the cycle.ll test, it seems to me that's a problem with how we are using alias analysis. The other store in that loop is not guaranteed to overwrite the removed store's location on every loop iteration. Perhaps it's simply my unfamiliarity with LLVM's alias analysis interfaces, but it seems like there would be some support for checking this definitive overlap, without requiring clients to also factor in dominance information. I'd be interested in hearing what the other reviewers have to say. |
Hi Ivan,
Thanks again for working on this!
I'm gonna run your patch in x86 and let you know about the performance results.
Thanks, Bruno. If necessary we need to balance non-local DSE for compile-time and run-time performance.
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
653 | If that's the case, then we should be able to remove the isSelfRead check in handleNonLocalDependency. | |
808 | I was using the term alias analysis generically, referring to either AA or MD. | |
887 | Inst here is the original StoreInst. If StoreInst's can never be a self read, this check will always be false. Do we need an additional check to ensure that Dependency does not read any of Inst's memory, analagous to the check "if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)" | |
896 | This break is just exiting the succ iterator loop. We need to exit the containing while loop. |
Hi Bruno, the patch has been rebased, and should apply cleanly on ToT.
Mitch, thanks for your comments and suggestions.
Non-local DSE extension follows the structure of HandleFree(). The underlying analysis for DSE is MD, and for across-BB both - HandleFree and HandleNonLocalDependency - use MD->getPointerDependencyFrom(). Note that MD is not loop-aware and its clients should handle CFG and alias issues by relying on AA or DT. HandleFree only allows unconditional predecessors - see FundUnconditonalPreds(). Based on reviewers' feedback HandleNonLocalDependency also handles if-then-else predecessors and single-block-loops (actually one of our benchmarks has a redundant store across a single-block-loop).
Regading cycle.ll test, using DT->dominates(BB, Pred) check is sufficient to exclude for.inc BB from the set of safe predecessors of for.body BB. What alternative do you propose for determining that stores in for.body and for.inc may alias?
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
889 | Added AA->getModRefInfo(Dependency, Loc) & MRI_Ref) test. |
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
810 | OK, I think I get it now. Fundamentally the isOverwrite routine assumes that (P1 == P2) means that pointers P1 and P2 always have the same value, and is not considering back edge (phi) issues. I think your dominance check is sufficient to avoid problems in this area. | |
889 | That looks right, thanks! |
Hi Bruno,
Did you have a chance to run this patch for x86? I rebased it, so please let me know if it can merge to ToT. Your feedback is very much appreciated.
Thanks!
Ivan
Added option -enable-nonlocal-dse to enable/disable the feature.
All QuIC internal correctness and performance tests are good.
On Arm/AArch64 processor at -O2, the patch improves run-time performance of:
eembc/networking_suite/qos by 8%,
eembc/comsumer_suite/cjpegv2data7 by 16%,
spec2006/sjeng by 1.5%
Please, take another look at the patch. I think all suggestions and comments have been addressed.
I appreciate if you could run the patch on your applications on additional targets and measure the impact.
Best regards,
Ivan
Hi Ivan,
Sorry for the delay. Below the x86 results, of a 5x run in a stable machine.
Highlights include: CINT2006/458_sjeng performing 1.1% slower while CINT2000/175_vpr is 1.5% faster.
Performance Regressions - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt 3.80% 2.5611 2.6583 0.0179
MultiSource/Benchmarks/sim/sim 2.93% 3.2302 3.3250 0.0019
SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall 2.70% 1.6456 1.6900 0.0172
MultiSource/Benchmarks/NPB-serial/is/is 1.17% 5.6993 5.7660 0.0172
External/SPEC/CINT2006/458_sjeng/458_sjeng 1.14% X Y Z
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm 1.13% 1.4695 1.4861 0.0055
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt 1.13% 2.2458 2.2711 0.0082
Performance Improvements - Compile Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -16.00% 0.0844 0.0709 0.0002
Performance Improvements - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -12.79% 0.4956 0.4322 0.0000
MultiSource/Benchmarks/mafft/pairlocalalign -5.84% 18.7424 17.6486 0.0154
External/SPEC/CINT2000/175_vpr/175_vpr -1.50% X Y Z
MultiSource/Applications/sqlite3/sqlite3 -1.27% 2.5050 2.4732 0.0107
Hi Bruno,
Thanks for running and sharing x86 results! Deltas look reasonable, in my opinion. Removing dead stores and related instructions across basic blocks should be a good thing by itself. There is a possibility for second order effects - e.g. longer live ranges, increased number of spills.
Results for SPEC2006/sjeng are interesting. On Arm/AArch64/A57 devices at -O2 we have the following run times (in seconds, so lower is better)
- "-mllvm -enable-nonlocal-dse=false": 1795, 1799, 1801, 1804, 1805, 1806
- "-mllvm -enable-nonlocal-dse=true": 1773, 1774,1775, 1778, 1778, 1786
At -O3, we do not see any performance delta for sjeng. What optimization level did you use for compiling SPEC on x86?
Using -stats, for both -O2 and -O3. on sjeng we are able to delete 13 non-local stores in total (found in 4 source files). No changes in the number of regalloc spills and other relevant stats other than decreases in machine instructions emitted and printed and isl dag tries (see below).
Glad that the DSE compile-time is under control, and actually improved for one Trimaran benchmark.
On sjeng, there is a decrease in isel - Number of times dag isel has to try another path - from 2380 to 2348 with this patch. How is this related to compile-time?
It would be good to hear what reviewers think about run- and compile-time results on x86 and AArch64 with this patch, and if we can commit it.
Best regards,
Ivan
Could someone with commit access commit this patch, if there is no objection from reviewers?
This is a good feature to have in llvm/DSE; the reviewers' (mbodard, eeckstein, majnemer, bruno) comments and suggestions have been addressed; the run-time and compile-time performance of the patch on x86 (bruno) and arm/hexagon (ivanbaev) is reasonable.
Thanks,
Ivan
Hi Ivan,
I would stll like to see a source comment near the DT->dominates(BB, Pred) check in findSafePreds,
as its necessity is not otherwise obvious. It is needed in order to prevent unsafe use of isOverwrite
and potentially other AA interfaces. For example isOverwrite cannot be trusted if the pointer values
in question could come from different iterations of an enclosing loop.
Otherwise LGTM.
I think you'll need to update to ToT once again, as there was another change in DeadStoreElimination.cpp
which prevented the patch from applying cleanly.
- mitch
Added a comment to clarify avoidance of blocks (to search for redundant stores) that form a cycle with BB, as requested by Mitch.
Update to ToT is not necessary because the recent changes to DeadStoreElimination.cpp has been reverted.
Thanks, Chad. Can we commit the patch at this point?
It solves one DSE enhancement known for several years.
I believe the patch has been thoroughly reviewed and tested.
Ivan
Committed in r255247. I missed David's comments prior to the commit, but I did clang-format the patch before submission. That did address a few of the issues. I addressed David's suggestion for range-based loops in r255265.
I ended up disabling this patch in r255286. I suspect this change is causing lots of bots to timeout as well as doubling the compile-time for Chromium. I'll have someone investigate, but hopefully the compile-time issues have been resolved for the time being.
Hi Chad, thanks for helping with that. Is compile-time increase the remaining problem?
One quick fix could be to change 100 to 10 for the max number on non-local attempts in a function:
static const unsigned MaxNonLocalAttempts = 100;
We've chosen 100 arbitrarily. The newly added DSE tests should pass with 10.
Ivan
Agree, but this depends what tests/apps we are concerned about. LNT tests didn't show many outliers.
In addition to Chromium, which applications have been known as outliers in terms of DSE compile-time contribution in the past?
I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.
Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?
I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.
The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.
Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.
One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.
One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?
Finally, I collected some stats when compiling llvm-ar.ll.
There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.
Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7
In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.
Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.
- mitch
Hi Mitch,
Since this patch has been closed IMO the path forward would be to upload a new revision for code review. Please include all the current reviewers/subscribers. You should also include a link to this patch for reference. From there we can begin the review process, which hopefully will be fairly short assuming the fixes were mostly trivial.
Chad
Hi Mitch,
Thanks for working on this.
Feel free to proceed as Chad suggested.
I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.
Is it something like:
SmallPtrSet<BasicBlock *, 8> Visited;
for (auto I = pred_begin(BB), E = pred_end(BB); I != E; ++I) {
if (!Visited.insert(*I).second) continue;
Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.
Do you propose replacing DT->dominates(BB, Pred) in
if (!DT->isReachableFromEntry(Pred) || DT->dominates(BB, Pred))
with DT->dominates(OrigBB, Pred), or adding the latter as a third check?
It would be great to have this irreducible CFG test added to the existing tests for DSE.
One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?
That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.
Finally, I collected some stats when compiling llvm-ar.ll.
There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.
Interesting data; it may vary from a source language to a source language. I agree that we need more samples to justify adding such heuristic. For now, I propose:
static const unsigned MaxNonLocalAttempts = 10;
Ivan
Hi Ivan,
I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.
I'm playing around with some heuristics in CGP's isFormingBranchFromSelectProfitable
to help mitigate these, and would like to have a feasible solution at hand before landing nonlocal DSE.
As to your specific questions, yes to the "Visited" technique, though I did restructure the checks a bit,
and used a SmallSet instead of SmallPtrSet. I do plan to replace the dominator check, not add an additional one.
That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.
For SafeBlocks we don't need a specific ordering, as it's only used for membership queries.
As for the worklist (Blocks) I think the implementation would work fine if this was unordered,
though I left its type unchanged (SmallVector).
Yes, I added a test for the irreducibility issue.
And I didn't intend to add another compile time throttle at this time.
That can be easily added should an issue arise.
regards,
- mitch
Unfortunately the tuning I had planned in isFormingBranchFromSelectProfitable is no longer applicable.
The relevant heuristic was the "load feeding compare", which I intended to relax a bit.
Per discussions in other code reviews in that area (I don't have the number off hand, but maybe
related to D16836 and D17288), that heurstic has been deleted altogether, and there seemed to be a
general consensus to move this transformation into the target.
I've been looking into that a bit, but it's slow going.
In the meantime, it seemed counterproductive to enable the dead store enhancements.
While the concept sounds good, in practice it seems there are only modest gains,
and some rather large downsides.
Functions should start with a lower case letter.