We extend the search for redundant stores to predecessor blocks that
unconditionally lead to the block BB with the current store instruction.
That also includes single-block loops that unconditionally lead to BB,
and if-then-else blocks where then- and else-blocks unconditionally lead
to BB.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
654 | I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based on how many blocks and dependences it has to search through. So even if MaxNonLocalAttempts is 1, we could theoretically encounter a compile time problem. I'm not averse to keeping MaxNonLocalAttempts here, but would like to see a throttle in HandleNonLocalDependency. |
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
572–583 | The new patch is more aggressive in searching for redundant stores. Let's defer your suggestion for a subsequent patch. | |
654 | MemoryDependenceAnalysis has BlockScanLimit = 100 and NumResultsLimit = 100. With these and the new MaxNonLocalAttempts = 100 we have enough handles to control compile-time. | |
821 | If there is a client outside DSE we can move this helper into a utility directory | |
892 | The idea is to allow the search continue through simple loops, but not to remove stores within loops. |
Hi Ivan,
I added a few more responses.
Please be patient as this is a non-trivial optimization, and for me at least it takes some time to poke around and verify things.
I'm still not confident enough in my understanding of alias analysis to know whether underlyingObjectsDoNotAlias is complete, for example.
Also, please try to address some of the comments from David regarding naming conventions.
regards,
- Mitch
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
572–583 | That's fine with me. | |
653 | The self read issue has not been addressed. | |
654 | I see. Thanks for pointing that out. | |
808 | I still don't understand why we need to check DT->dominates(BB, Pred) here. | |
814 | The Succ == BB check is correct, but redundant as SafeBlocks.count(BB) must be non-zero. if (Succ != Pred && !SafeBlocks.count(Succ)) but would be OK with your current structure. // Pred is only known to be optimizeable if all of its successors are either safe or are Pred itself. | |
892 | Sorry, I'm not sure which comment you were responding to. Regarding my "loop invariant" comment, I'm saying that the test for whether the current block |
Hi Mitch, thanks for the thoughtful comments; we are making a good progress.
I will make the two changes: using successor iterator and adding the comment, in the next revision.
Regarding Bruno's request for LNT testing, I'm currently running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. Could someone run this patch with an x86-based compiler if there are concerns about compile-time.
Regarding David's comments, I believe all have been addressed, except caching DataLayout in the constructor.
There are many places in DSE where DataLayout is used or passed as argument. Since most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way, I suggest this change to be done in a future patch.
Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.
Ivan
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
653 | It looks like that some comments are not aligned to the corresponding lines of code. isPossibleSelfRead(Inst, Loc, Dependency, *TLI, *AA) It is similar to the local case. | |
654 | I'm running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. | |
808 | We need DT->dominates(BB, Pred) check to avoid situations similar to the one in cycle.ll test below. This test is a simplified version of a larger function found in a SPEC benchmark. | |
814 | Agree that Succ == BB check is redundant, but it is a shortcut compared to checking a membership in a set. | |
892 | I will use a successor iterator here, thanks. |
Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.
Yes, I'm there on Friday. Looking forward to meet you.
Hi Bruno,
Here are LNT compile-time: ibaev-linuxclang_DEVaarch64 test results
Previous: aarch64 compiler with this patch, non-local DSE
Current: aarch64 compiler without this patch, that is top-of-trunk DSE
Tests Summary
Status Group #
Performance Regressions 141
Performance Improvements 134
Added Tests 1
Existing Failures 489
Unchanged Tests 1191
Total Tests 1956
- Top regressions with Current (top-of-trunk DSE) ***
Performance Regressions - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-08-02-CastTest 400.00% 0.0080 0.0400 -
SingleSource/UnitTests/C++11/stdthreadbug 300.00% 0.0040 0.0160 -
SingleSource/UnitTests/2003-05-31-LongShifts 250.00% 0.0080 0.0280 -
SingleSource/UnitTests/2008-04-20-LoopBug2 200.00% 0.0120 0.0360 -
SingleSource/Regression/C/2004-02-03-AggregateCopy 150.00% 0.0160 0.0400 -
MultiSource/Benchmarks/Prolangs-C++/vcirc/vcirc 100.00% 0.0240 0.0480 -
SingleSource/Benchmarks/Misc/mandel-2 100.00% 0.0360 0.0720 -
SingleSource/UnitTests/2003-04-22-Switch 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/Vector/multiplies 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/block-byref-test 100.00% 0.0120 0.0240 -
....
- Top improvents with Current (top-of-trunk DSE) ***
Performance Improvements - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-10-13-BadLoad -75.00% 0.0160 0.0040 -
SingleSource/UnitTests/2006-12-01-float_varg -71.43% 0.0280 0.0080 -
SingleSource/Regression/C++/2003-06-08-VirtualFunctions -66.67% 0.0240 0.0080 -
SingleSource/Regression/C++/2003-09-29-NonPODsByValue -66.67% 0.0360 0.0120 -
SingleSource/Regression/C++/EH/function_try_block -66.67% 0.0480 0.0160 -
SingleSource/UnitTests/byval-alignment -63.64% 0.0440 0.0160 -
SingleSource/Benchmarks/Misc/pi -58.33% 0.0480 0.0200 -
SingleSource/Regression/C++/short_circuit_dtor -58.33% 0.0480 0.0200 -
SingleSource/UnitTests/2005-05-11-Popcount-ffs-fls -55.56% 0.0720 0.0320 -
SingleSource/UnitTests/Vector/simple -55.56% 0.0360 0.0160 -
SingleSource/Regression/C++/EH/simple_rethrow -54.55% 0.0440 0.0200 -
...
Since this is my first LNT testing, could someone comment on these results. Any outliers?
Thanks,
Ivan
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
653 | It seems useful for compile time to check for a self-read right here, as in that case there is no point in processing non-local dependences. Such a check is only complicated by the fact that isPossibleSelfRead requires a "DepWrite" parameter. We would need to modify that routine to allow a nullptr for DepWrite. What I'm unsure of, however, is whether this being a non-self read is already implied. | |
808 | If nonlocal DSE would otherwise remove the store in the cycle.ll test, it seems to me that's a problem with how we are using alias analysis. The other store in that loop is not guaranteed to overwrite the removed store's location on every loop iteration. Perhaps it's simply my unfamiliarity with LLVM's alias analysis interfaces, but it seems like there would be some support for checking this definitive overlap, without requiring clients to also factor in dominance information. I'd be interested in hearing what the other reviewers have to say. |
Hi Ivan,
Thanks again for working on this!
I'm gonna run your patch in x86 and let you know about the performance results.
Thanks, Bruno. If necessary we need to balance non-local DSE for compile-time and run-time performance.
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
653 | If that's the case, then we should be able to remove the isSelfRead check in handleNonLocalDependency. | |
808 | I was using the term alias analysis generically, referring to either AA or MD. | |
887 | Inst here is the original StoreInst. If StoreInst's can never be a self read, this check will always be false. Do we need an additional check to ensure that Dependency does not read any of Inst's memory, analagous to the check "if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)" | |
896 | This break is just exiting the succ iterator loop. We need to exit the containing while loop. |
Hi Bruno, the patch has been rebased, and should apply cleanly on ToT.
Mitch, thanks for your comments and suggestions.
Non-local DSE extension follows the structure of HandleFree(). The underlying analysis for DSE is MD, and for across-BB both - HandleFree and HandleNonLocalDependency - use MD->getPointerDependencyFrom(). Note that MD is not loop-aware and its clients should handle CFG and alias issues by relying on AA or DT. HandleFree only allows unconditional predecessors - see FundUnconditonalPreds(). Based on reviewers' feedback HandleNonLocalDependency also handles if-then-else predecessors and single-block-loops (actually one of our benchmarks has a redundant store across a single-block-loop).
Regading cycle.ll test, using DT->dominates(BB, Pred) check is sufficient to exclude for.inc BB from the set of safe predecessors of for.body BB. What alternative do you propose for determining that stores in for.body and for.inc may alias?
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
887 | Added AA->getModRefInfo(Dependency, Loc) & MRI_Ref) test. |
lib/Transforms/Scalar/DeadStoreElimination.cpp | ||
---|---|---|
808 | OK, I think I get it now. Fundamentally the isOverwrite routine assumes that (P1 == P2) means that pointers P1 and P2 always have the same value, and is not considering back edge (phi) issues. I think your dominance check is sufficient to avoid problems in this area. | |
887 | That looks right, thanks! |
Hi Bruno,
Did you have a chance to run this patch for x86? I rebased it, so please let me know if it can merge to ToT. Your feedback is very much appreciated.
Thanks!
Ivan
Added option -enable-nonlocal-dse to enable/disable the feature.
All QuIC internal correctness and performance tests are good.
On Arm/AArch64 processor at -O2, the patch improves run-time performance of:
eembc/networking_suite/qos by 8%,
eembc/comsumer_suite/cjpegv2data7 by 16%,
spec2006/sjeng by 1.5%
Please, take another look at the patch. I think all suggestions and comments have been addressed.
I appreciate if you could run the patch on your applications on additional targets and measure the impact.
Best regards,
Ivan
Hi Ivan,
Sorry for the delay. Below the x86 results, of a 5x run in a stable machine.
Highlights include: CINT2006/458_sjeng performing 1.1% slower while CINT2000/175_vpr is 1.5% faster.
Performance Regressions - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt 3.80% 2.5611 2.6583 0.0179
MultiSource/Benchmarks/sim/sim 2.93% 3.2302 3.3250 0.0019
SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall 2.70% 1.6456 1.6900 0.0172
MultiSource/Benchmarks/NPB-serial/is/is 1.17% 5.6993 5.7660 0.0172
External/SPEC/CINT2006/458_sjeng/458_sjeng 1.14% X Y Z
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm 1.13% 1.4695 1.4861 0.0055
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt 1.13% 2.2458 2.2711 0.0082
Performance Improvements - Compile Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -16.00% 0.0844 0.0709 0.0002
Performance Improvements - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -12.79% 0.4956 0.4322 0.0000
MultiSource/Benchmarks/mafft/pairlocalalign -5.84% 18.7424 17.6486 0.0154
External/SPEC/CINT2000/175_vpr/175_vpr -1.50% X Y Z
MultiSource/Applications/sqlite3/sqlite3 -1.27% 2.5050 2.4732 0.0107
Hi Bruno,
Thanks for running and sharing x86 results! Deltas look reasonable, in my opinion. Removing dead stores and related instructions across basic blocks should be a good thing by itself. There is a possibility for second order effects - e.g. longer live ranges, increased number of spills.
Results for SPEC2006/sjeng are interesting. On Arm/AArch64/A57 devices at -O2 we have the following run times (in seconds, so lower is better)
- "-mllvm -enable-nonlocal-dse=false": 1795, 1799, 1801, 1804, 1805, 1806
- "-mllvm -enable-nonlocal-dse=true": 1773, 1774,1775, 1778, 1778, 1786
At -O3, we do not see any performance delta for sjeng. What optimization level did you use for compiling SPEC on x86?
Using -stats, for both -O2 and -O3. on sjeng we are able to delete 13 non-local stores in total (found in 4 source files). No changes in the number of regalloc spills and other relevant stats other than decreases in machine instructions emitted and printed and isl dag tries (see below).
Glad that the DSE compile-time is under control, and actually improved for one Trimaran benchmark.
On sjeng, there is a decrease in isel - Number of times dag isel has to try another path - from 2380 to 2348 with this patch. How is this related to compile-time?
It would be good to hear what reviewers think about run- and compile-time results on x86 and AArch64 with this patch, and if we can commit it.
Best regards,
Ivan
Could someone with commit access commit this patch, if there is no objection from reviewers?
This is a good feature to have in llvm/DSE; the reviewers' (mbodard, eeckstein, majnemer, bruno) comments and suggestions have been addressed; the run-time and compile-time performance of the patch on x86 (bruno) and arm/hexagon (ivanbaev) is reasonable.
Thanks,
Ivan
Hi Ivan,
I would stll like to see a source comment near the DT->dominates(BB, Pred) check in findSafePreds,
as its necessity is not otherwise obvious. It is needed in order to prevent unsafe use of isOverwrite
and potentially other AA interfaces. For example isOverwrite cannot be trusted if the pointer values
in question could come from different iterations of an enclosing loop.
Otherwise LGTM.
I think you'll need to update to ToT once again, as there was another change in DeadStoreElimination.cpp
which prevented the patch from applying cleanly.
- mitch
Added a comment to clarify avoidance of blocks (to search for redundant stores) that form a cycle with BB, as requested by Mitch.
Update to ToT is not necessary because the recent changes to DeadStoreElimination.cpp has been reverted.
Thanks, Chad. Can we commit the patch at this point?
It solves one DSE enhancement known for several years.
I believe the patch has been thoroughly reviewed and tested.
Ivan
Committed in r255247. I missed David's comments prior to the commit, but I did clang-format the patch before submission. That did address a few of the issues. I addressed David's suggestion for range-based loops in r255265.
I ended up disabling this patch in r255286. I suspect this change is causing lots of bots to timeout as well as doubling the compile-time for Chromium. I'll have someone investigate, but hopefully the compile-time issues have been resolved for the time being.
Hi Chad, thanks for helping with that. Is compile-time increase the remaining problem?
One quick fix could be to change 100 to 10 for the max number on non-local attempts in a function:
static const unsigned MaxNonLocalAttempts = 100;
We've chosen 100 arbitrarily. The newly added DSE tests should pass with 10.
Ivan
Agree, but this depends what tests/apps we are concerned about. LNT tests didn't show many outliers.
In addition to Chromium, which applications have been known as outliers in terms of DSE compile-time contribution in the past?
I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.
Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?
I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.
The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.
Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.
One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.
One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?
Finally, I collected some stats when compiling llvm-ar.ll.
There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.
Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7
In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.
Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.
- mitch
Hi Mitch,
Since this patch has been closed IMO the path forward would be to upload a new revision for code review. Please include all the current reviewers/subscribers. You should also include a link to this patch for reference. From there we can begin the review process, which hopefully will be fairly short assuming the fixes were mostly trivial.
Chad
Hi Mitch,
Thanks for working on this.
Feel free to proceed as Chad suggested.
I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.
Is it something like:
SmallPtrSet<BasicBlock *, 8> Visited;
for (auto I = pred_begin(BB), E = pred_end(BB); I != E; ++I) {
if (!Visited.insert(*I).second) continue;
Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.
Do you propose replacing DT->dominates(BB, Pred) in
if (!DT->isReachableFromEntry(Pred) || DT->dominates(BB, Pred))
with DT->dominates(OrigBB, Pred), or adding the latter as a third check?
It would be great to have this irreducible CFG test added to the existing tests for DSE.
One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?
That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.
Finally, I collected some stats when compiling llvm-ar.ll.
There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.
Interesting data; it may vary from a source language to a source language. I agree that we need more samples to justify adding such heuristic. For now, I propose:
static const unsigned MaxNonLocalAttempts = 10;
Ivan
Hi Ivan,
I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.
I'm playing around with some heuristics in CGP's isFormingBranchFromSelectProfitable
to help mitigate these, and would like to have a feasible solution at hand before landing nonlocal DSE.
As to your specific questions, yes to the "Visited" technique, though I did restructure the checks a bit,
and used a SmallSet instead of SmallPtrSet. I do plan to replace the dominator check, not add an additional one.
That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.
For SafeBlocks we don't need a specific ordering, as it's only used for membership queries.
As for the worklist (Blocks) I think the implementation would work fine if this was unordered,
though I left its type unchanged (SmallVector).
Yes, I added a test for the irreducibility issue.
And I didn't intend to add another compile time throttle at this time.
That can be easily added should an issue arise.
regards,
- mitch
Unfortunately the tuning I had planned in isFormingBranchFromSelectProfitable is no longer applicable.
The relevant heuristic was the "load feeding compare", which I intended to relax a bit.
Per discussions in other code reviews in that area (I don't have the number off hand, but maybe
related to D16836 and D17288), that heurstic has been deleted altogether, and there seemed to be a
general consensus to move this transformation into the target.
I've been looking into that a bit, but it's slow going.
In the meantime, it seemed counterproductive to enable the dead store enhancements.
While the concept sounds good, in practice it seems there are only modest gains,
and some rather large downsides.