Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Avoid migrating existing patches. Phabricator shutdown timeline

[DeadStoreElimination] Add support for non-local DSE
ClosedPublic

Authored by zzheng on Oct 1 2015, 1:16 PM.

Details

Summary

We extend the search for redundant stores to predecessor blocks that
unconditionally lead to the block BB with the current store instruction.
That also includes single-block loops that unconditionally lead to BB,
and if-then-else blocks where then- and else-blocks unconditionally lead
to BB.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mbodart added inline comments.Oct 22 2015, 12:43 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
654

I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based on how many blocks and dependences it has to search through. So even if MaxNonLocalAttempts is 1, we could theoretically encounter a compile time problem. I'm not averse to keeping MaxNonLocalAttempts here, but would like to see a throttle in HandleNonLocalDependency.

ivanbaev updated this revision to Diff 38338.Oct 24 2015, 5:28 PM
ivanbaev retitled this revision from [DeadStoreElimination] Add support for DSE across blocks to [DeadStoreElimination] Add support for non-local DSE.
ivanbaev marked 15 inline comments as done.Oct 24 2015, 5:45 PM
ivanbaev added inline comments.
lib/Transforms/Scalar/DeadStoreElimination.cpp
572–583

The new patch is more aggressive in searching for redundant stores. Let's defer your suggestion for a subsequent patch.

654

MemoryDependenceAnalysis has BlockScanLimit = 100 and NumResultsLimit = 100. With these and the new MaxNonLocalAttempts = 100 we have enough handles to control compile-time.

821

If there is a client outside DSE we can move this helper into a utility directory

892

The idea is to allow the search continue through simple loops, but not to remove stores within loops.

Hi Ivan,

I added a few more responses.

Please be patient as this is a non-trivial optimization, and for me at least it takes some time to poke around and verify things.
I'm still not confident enough in my understanding of alias analysis to know whether underlyingObjectsDoNotAlias is complete, for example.

Also, please try to address some of the comments from David regarding naming conventions.

regards,

  • Mitch
lib/Transforms/Scalar/DeadStoreElimination.cpp
572–583

That's fine with me.

653

The self read issue has not been addressed.

654

I see. Thanks for pointing that out.
Those seem reasonable as a starting point.
But as Bruno suggested, some actual compile time measurments are probably in order, just to confirm.

808

I still don't understand why we need to check DT->dominates(BB, Pred) here.
I'm guessing you're trying to guard against an infinite loop, which of course
we need to do. But I think that would be automatically handled by instead checking
for !SafeBlocks.count(Pred). In addition to allowing the search to proceed further, this
would eliminate the need for the special check of Pred == BB, since BB must be in SafeBlocks.

814

The Succ == BB check is correct, but redundant as SafeBlocks.count(BB) must be non-zero.
I think the code would read better if this and the next check were combined into one:

if (Succ != Pred && !SafeBlocks.count(Succ))

but would be OK with your current structure.
Either way, however, it would be useful to add a clarifying source comment to the effect:

// Pred is only known to be optimizeable if all of its successors are either safe or are Pred itself.

892

Sorry, I'm not sure which comment you were responding to.
Phabricator for some reason seems to be placing some of my comments
above the lines of interest, and others below.

Regarding my "loop invariant" comment, I'm saying that the test for whether the current block
is a single-block loop (getNumSuccessors() == 2, etc) should probably be done outside of this loop. I'm also not sure if checking explicitly for two successors is sufficient. A program could conceivably have a switch statement with one case having a goto back to a label just prior to the switch. I'm not familiar enough with LLVM optimizations to know whether that can lead to an indirect branch successor which is the block containing the indirect branch itself. But it's theoretically possible. My recommendation here is to use a successor iterator, to match the behavior present in FindSafePreds.

Hi Mitch, thanks for the thoughtful comments; we are making a good progress.
I will make the two changes: using successor iterator and adding the comment, in the next revision.

Regarding Bruno's request for LNT testing, I'm currently running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished. Could someone run this patch with an x86-based compiler if there are concerns about compile-time.

Regarding David's comments, I believe all have been addressed, except caching DataLayout in the constructor.
There are many places in DSE where DataLayout is used or passed as argument. Since most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way, I suggest this change to be done in a future patch.

Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.

Ivan

lib/Transforms/Scalar/DeadStoreElimination.cpp
653

It looks like that some comments are not aligned to the corresponding lines of code.
The self read issue has been addressed in line 888 by adding there a call to:

isPossibleSelfRead(Inst, Loc, Dependency, *TLI, *AA)

It is similar to the local case.

654

I'm running LNT tests with ARM/AArch64 compiler without and with this patch, and will report results when finished.

808

We need DT->dominates(BB, Pred) check to avoid situations similar to the one in cycle.ll test below. This test is a simplified version of a larger function found in a SPEC benchmark.

814

Agree that Succ == BB check is redundant, but it is a shortcut compared to checking a membership in a set.
That will save some compile time.
Sure, I will add the comment you suggested.

892

I will use a successor iterator here, thanks.

Mitch, Erik, David, Bruno,
Are you planning to attend the LLVM developers meeting on Thu/Fri? That would be a good opportunity to discuss this patch in person if you would like to.

Yes, I'm there on Friday. Looking forward to meet you.

Hi Bruno,

Here are LNT compile-time: ibaev-linuxclang_DEVaarch64 test results
Previous: aarch64 compiler with this patch, non-local DSE
Current: aarch64 compiler without this patch, that is top-of-trunk DSE

Tests Summary
Status Group #
Performance Regressions 141
Performance Improvements 134
Added Tests 1
Existing Failures 489
Unchanged Tests 1191
Total Tests 1956

  • Top regressions with Current (top-of-trunk DSE) ***

Performance Regressions - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-08-02-CastTest 400.00% 0.0080 0.0400 -
SingleSource/UnitTests/C++11/stdthreadbug 300.00% 0.0040 0.0160 -
SingleSource/UnitTests/2003-05-31-LongShifts 250.00% 0.0080 0.0280 -
SingleSource/UnitTests/2008-04-20-LoopBug2 200.00% 0.0120 0.0360 -
SingleSource/Regression/C/2004-02-03-AggregateCopy 150.00% 0.0160 0.0400 -
MultiSource/Benchmarks/Prolangs-C++/vcirc/vcirc 100.00% 0.0240 0.0480 -
SingleSource/Benchmarks/Misc/mandel-2 100.00% 0.0360 0.0720 -
SingleSource/UnitTests/2003-04-22-Switch 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/Vector/multiplies 100.00% 0.0200 0.0400 -
SingleSource/UnitTests/block-byref-test 100.00% 0.0120 0.0240 -
....

  • Top improvents with Current (top-of-trunk DSE) ***

Performance Improvements - Compile Time Δ Previous Current σ
SingleSource/UnitTests/2002-10-13-BadLoad -75.00% 0.0160 0.0040 -
SingleSource/UnitTests/2006-12-01-float_varg -71.43% 0.0280 0.0080 -
SingleSource/Regression/C++/2003-06-08-VirtualFunctions -66.67% 0.0240 0.0080 -
SingleSource/Regression/C++/2003-09-29-NonPODsByValue -66.67% 0.0360 0.0120 -
SingleSource/Regression/C++/EH/function_try_block -66.67% 0.0480 0.0160 -
SingleSource/UnitTests/byval-alignment -63.64% 0.0440 0.0160 -
SingleSource/Benchmarks/Misc/pi -58.33% 0.0480 0.0200 -
SingleSource/Regression/C++/short_circuit_dtor -58.33% 0.0480 0.0200 -
SingleSource/UnitTests/2005-05-11-Popcount-ffs-fls -55.56% 0.0720 0.0320 -
SingleSource/UnitTests/Vector/simple -55.56% 0.0360 0.0160 -
SingleSource/Regression/C++/EH/simple_rethrow -54.55% 0.0440 0.0200 -
...

Since this is my first LNT testing, could someone comment on these results. Any outliers?

Thanks,
Ivan

mbodart added inline comments.Oct 30 2015, 3:03 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
653

It seems useful for compile time to check for a self-read right here, as in that case there is no point in processing non-local dependences. Such a check is only complicated by the fact that isPossibleSelfRead requires a "DepWrite" parameter. We would need to modify that routine to allow a nullptr for DepWrite.

What I'm unsure of, however, is whether this being a non-self read is already implied.
Is it the case that a StoreInst will never be a self read, for example?

808

If nonlocal DSE would otherwise remove the store in the cycle.ll test, it seems to me that's a problem with how we are using alias analysis. The other store in that loop is not guaranteed to overwrite the removed store's location on every loop iteration.

Perhaps it's simply my unfamiliarity with LLVM's alias analysis interfaces, but it seems like there would be some support for checking this definitive overlap, without requiring clients to also factor in dominance information. I'd be interested in hearing what the other reviewers have to say.

ivanbaev updated this revision to Diff 38859.Nov 1 2015, 4:06 PM

addressed two comments by Mitch

ivanbaev marked 12 inline comments as done.Nov 1 2015, 4:31 PM
ivanbaev added inline comments.
lib/Transforms/Scalar/DeadStoreElimination.cpp
653

A StoreInst will never be a self read.

808

By design, DSE - local, non-local, other parts - depends on MemoryDependenceAnalysis, and only occasionally relies on alias analysis.

bruno added a comment.Nov 2 2015, 10:36 AM

Hi Ivan,

Thanks again for working on this!
I'm gonna run your patch in x86 and let you know about the performance results.

ivanbaev marked 2 inline comments as done.Nov 2 2015, 7:19 PM

Thanks, Bruno. If necessary we need to balance non-local DSE for compile-time and run-time performance.

bruno added a comment.Nov 3 2015, 9:27 AM

The patch fails to apply cleanly on ToT, can you update it?

mbodart added inline comments.Nov 3 2015, 11:52 AM
lib/Transforms/Scalar/DeadStoreElimination.cpp
653

If that's the case, then we should be able to remove the isSelfRead check in handleNonLocalDependency.

808

I was using the term alias analysis generically, referring to either AA or MD.
If usage of those interfaces is suspect in the circumstances of the cycle.ll test,
I'd like to understand why exactly.

887

Inst here is the original StoreInst. If StoreInst's can never be a self read, this check will always be false. Do we need an additional check to ensure that Dependency does not read any of Inst's memory, analagous to the check "if (AA->getModRefInfo(DepWrite, Loc) & MRI_Ref)"
in DSE::runOnBasicBlock?

896

This break is just exiting the succ iterator loop. We need to exit the containing while loop.

ivanbaev updated this revision to Diff 39302.Nov 4 2015, 6:23 PM

Rebased.
Address Mitch's comments.

ivanbaev marked 3 inline comments as done.Nov 4 2015, 6:53 PM

Hi Bruno, the patch has been rebased, and should apply cleanly on ToT.

Mitch, thanks for your comments and suggestions.
Non-local DSE extension follows the structure of HandleFree(). The underlying analysis for DSE is MD, and for across-BB both - HandleFree and HandleNonLocalDependency - use MD->getPointerDependencyFrom(). Note that MD is not loop-aware and its clients should handle CFG and alias issues by relying on AA or DT. HandleFree only allows unconditional predecessors - see FundUnconditonalPreds(). Based on reviewers' feedback HandleNonLocalDependency also handles if-then-else predecessors and single-block-loops (actually one of our benchmarks has a redundant store across a single-block-loop).

Regading cycle.ll test, using DT->dominates(BB, Pred) check is sufficient to exclude for.inc BB from the set of safe predecessors of for.body BB. What alternative do you propose for determining that stores in for.body and for.inc may alias?

lib/Transforms/Scalar/DeadStoreElimination.cpp
887

Added AA->getModRefInfo(Dependency, Loc) & MRI_Ref) test.

mbodart added inline comments.Nov 5 2015, 5:30 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
808

OK, I think I get it now.

Fundamentally the isOverwrite routine assumes that (P1 == P2) means that pointers P1 and P2 always have the same value, and is not considering back edge (phi) issues.

I think your dominance check is sufficient to avoid problems in this area.
But as this is a subtle issue, please add a source comment at the dominance check
explaining the situation.

887

That looks right, thanks!

ivanbaev marked an inline comment as done.Nov 6 2015, 8:55 AM

Hi Bruno,

Did you have a chance to run this patch for x86? I rebased it, so please let me know if it can merge to ToT. Your feedback is very much appreciated.
Thanks!
Ivan

ivanbaev updated this revision to Diff 40106.Nov 12 2015, 8:26 PM

Added option -enable-nonlocal-dse to enable/disable the feature.

All QuIC internal correctness and performance tests are good.
On Arm/AArch64 processor at -O2, the patch improves run-time performance of:
eembc/networking_suite/qos by 8%,
eembc/comsumer_suite/cjpegv2data7 by 16%,
spec2006/sjeng by 1.5%

Please, take another look at the patch. I think all suggestions and comments have been addressed.
I appreciate if you could run the patch on your applications on additional targets and measure the impact.

Best regards,
Ivan

Any objection to commit this patch?

Hi Ivan,

Sorry for the delay. Below the x86 results, of a 5x run in a stable machine.
Highlights include: CINT2006/458_sjeng performing 1.1% slower while CINT2000/175_vpr is 1.5% faster.

Performance Regressions - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt 3.80% 2.5611 2.6583 0.0179
MultiSource/Benchmarks/sim/sim 2.93% 3.2302 3.3250 0.0019
SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall 2.70% 1.6456 1.6900 0.0172
MultiSource/Benchmarks/NPB-serial/is/is 1.17% 5.6993 5.7660 0.0172
External/SPEC/CINT2006/458_sjeng/458_sjeng 1.14% X Y Z
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm 1.13% 1.4695 1.4861 0.0055
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt 1.13% 2.2458 2.2711 0.0082

Performance Improvements - Compile Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -16.00% 0.0844 0.0709 0.0002

Performance Improvements - Execution Time Δ Previous Current σ
MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 -12.79% 0.4956 0.4322 0.0000
MultiSource/Benchmarks/mafft/pairlocalalign -5.84% 18.7424 17.6486 0.0154
External/SPEC/CINT2000/175_vpr/175_vpr -1.50% X Y Z
MultiSource/Applications/sqlite3/sqlite3 -1.27% 2.5050 2.4732 0.0107

Hi Bruno,

Thanks for running and sharing x86 results! Deltas look reasonable, in my opinion. Removing dead stores and related instructions across basic blocks should be a good thing by itself. There is a possibility for second order effects - e.g. longer live ranges, increased number of spills.

Results for SPEC2006/sjeng are interesting. On Arm/AArch64/A57 devices at -O2 we have the following run times (in seconds, so lower is better)

  • "-mllvm -enable-nonlocal-dse=false": 1795, 1799, 1801, 1804, 1805, 1806
  • "-mllvm -enable-nonlocal-dse=true": 1773, 1774,1775, 1778, 1778, 1786

At -O3, we do not see any performance delta for sjeng. What optimization level did you use for compiling SPEC on x86?

Using -stats, for both -O2 and -O3. on sjeng we are able to delete 13 non-local stores in total (found in 4 source files). No changes in the number of regalloc spills and other relevant stats other than decreases in machine instructions emitted and printed and isl dag tries (see below).

Glad that the DSE compile-time is under control, and actually improved for one Trimaran benchmark.
On sjeng, there is a decrease in isel - Number of times dag isel has to try another path - from 2380 to 2348 with this patch. How is this related to compile-time?

It would be good to hear what reviewers think about run- and compile-time results on x86 and AArch64 with this patch, and if we can commit it.

Best regards,
Ivan

Could someone with commit access commit this patch, if there is no objection from reviewers?

This is a good feature to have in llvm/DSE; the reviewers' (mbodard, eeckstein, majnemer, bruno) comments and suggestions have been addressed; the run-time and compile-time performance of the patch on x86 (bruno) and arm/hexagon (ivanbaev) is reasonable.

Thanks,
Ivan

Hi Ivan,

I would stll like to see a source comment near the DT->dominates(BB, Pred) check in findSafePreds,
as its necessity is not otherwise obvious. It is needed in order to prevent unsafe use of isOverwrite
and potentially other AA interfaces. For example isOverwrite cannot be trusted if the pointer values
in question could come from different iterations of an enclosing loop.

Otherwise LGTM.

I think you'll need to update to ToT once again, as there was another change in DeadStoreElimination.cpp
which prevented the patch from applying cleanly.

  • mitch
ivanbaev updated this revision to Diff 40722.Nov 19 2015, 5:10 PM

Added a comment to clarify avoidance of blocks (to search for redundant stores) that form a cycle with BB, as requested by Mitch.
Update to ToT is not necessary because the recent changes to DeadStoreElimination.cpp has been reverted.

ivanbaev marked an inline comment as done.Nov 19 2015, 5:13 PM

Mitch, I added a relevant comment above DT->dominates(BB, Pred).

Thanks Ivan!

LGTM

  • mitch
mcrosier commandeered this revision.Dec 3 2015, 1:01 PM
mcrosier edited reviewers, added: ivanbaev; removed: mcrosier.

Ivan has asked that I take over this work.

ivanbaev edited edge metadata.Dec 5 2015, 12:34 PM

Thanks, Chad. Can we commit the patch at this point?
It solves one DSE enhancement known for several years.
I believe the patch has been thoroughly reviewed and tested.

Ivan

zzheng commandeered this revision.Dec 9 2015, 11:59 AM
zzheng updated this revision to Diff 42326.
zzheng added a reviewer: mcrosier.
zzheng edited edge metadata.
zzheng set the repository for this revision to rL LLVM.

Fixed build error due to explicit ilist::iterator.

majnemer added inline comments.Dec 9 2015, 6:39 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
678

Pointers lean right.

680

Can a range-based for loop be used here? Something like: for (const BasicBlock *Pred : *BB)

712

Likewise, can you use one here?

741

Pointers lean right.

mcrosier accepted this revision.Dec 10 2015, 9:36 AM
mcrosier edited edge metadata.

Committed in r255247. I missed David's comments prior to the commit, but I did clang-format the patch before submission. That did address a few of the issues. I addressed David's suggestion for range-based loops in r255265.

This revision is now accepted and ready to land.Dec 10 2015, 9:36 AM
mcrosier closed this revision.Dec 10 2015, 9:36 AM

I ended up disabling this patch in r255286. I suspect this change is causing lots of bots to timeout as well as doubling the compile-time for Chromium. I'll have someone investigate, but hopefully the compile-time issues have been resolved for the time being.

Hi Chad, thanks for helping with that. Is compile-time increase the remaining problem?
One quick fix could be to change 100 to 10 for the max number on non-local attempts in a function:

static const unsigned MaxNonLocalAttempts = 100;

We've chosen 100 arbitrarily. The newly added DSE tests should pass with 10.

Ivan

I'd prefer if numbers were collected indicating which thresholds worked
best.

bruno added a comment.Dec 10 2015, 6:02 PM

+1
It would be good to have that first!

Agree, but this depends what tests/apps we are concerned about. LNT tests didn't show many outliers.
In addition to Chromium, which applications have been known as outliers in terms of DSE compile-time contribution in the past?

I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.

Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?

I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.

The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.

Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.

One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.

One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?

Finally, I collected some stats when compiling llvm-ar.ll.

There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.

Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7

In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.

Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.

  • mitch

Hi Mitch,
Since this patch has been closed IMO the path forward would be to upload a new revision for code review. Please include all the current reviewers/subscribers. You should also include a link to this patch for reference. From there we can begin the review process, which hopefully will be fairly short assuming the fixes were mostly trivial.

Chad

Hi Mitch,

Thanks for working on this.

I have a resolution for the issue causing bot timeouts,
as well as one unrelated fix for irreducible control flow.
Fortunately both fixes are trivial. Details below.

Chad or Ivan, how do you want to proceed?
Can you apply these fixes and re-land the change?

Feel free to proceed as Chad suggested.

I used Rafael's llvm-ar.ll reproducer at
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20151207/318782.html
to chase the bot timeouts.

The issue is in findSafePreds, which was getting tripped up by a switch
statement, and inadvertently adding the same block to the worklist
multiple times. The predecessor iterator needs to avoid processing
a given predecessor more than once. Keeping track with a "PredsVisited"
set resolves that, and there is no longer any noticeable compile time
difference w/wo nonlocal DSE for llvm-ar.ll.

Is it something like:

SmallPtrSet<BasicBlock *, 8> Visited;
for (auto I = pred_begin(BB), E = pred_end(BB); I != E; ++I) {

if (!Visited.insert(*I).second)
  continue;

Wrt irreducibility, I was able to construct a test case for which the
CFG search goes into a cycle and revisits the original safe block
that instigated the nonlocal DSE search, SI->getParent() in
DSE::handleNonLocalDependency(). My earlier suggestion for the search
routine had an explicit check for !SafeBlocks.count(P) to avoid this
situation, but that was elided in favor of the dominator check
DT->dominates(BB, Pred). With irreducible control flow we can have
a case where Pred is the original safe block, but BB does not dominate it.

One fix here is to pass in that original block to findSafePreds,
and replace the dominator check with DT->dominates(OrigBB, Pred).
While fairly subtle, this prevents processing any block more than once,
and is still sufficient to avoid problems with the "non loop aware"
dependence analysis.

Do you propose replacing DT->dominates(BB, Pred) in

if (!DT->isReachableFromEntry(Pred) || DT->dominates(BB, Pred))

with DT->dominates(OrigBB, Pred), or adding the latter as a third check?

It would be great to have this irreducible CFG test added to the existing tests for DSE.

One other nit: The SafeBlocks set is unordered, and only needed for
membership queries. Should it be declared as a SmallSet instead of
a SmallSetVector?

That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.

Finally, I collected some stats when compiling llvm-ar.ll.

There are 23138 calls to handleNonLocalDependency which
reach the worklist processing, with 84 dead instructions found.
94% of the time we process 5 or fewer predecessor blocks.
In the worst case we looked at 266 predecessors.

Blocks Frequency
0 11553 50%
1 5047 22%
2 3858 17%
3 778 3%
4 285
5 221
6 148
7 112
8 193
9 57
10 53
11-50 410
51-100 36
100-266 7

In one instance we found 5 dead instructions by searching 22 blocks.
I didn't dig into whether those could have been found in fewer blocks.
Searching more than 22 blocks was never fruitful for llvm-ar.ll.

Though we would of course want more samples to make any decisions,
limiting the number of worklist blocks processed is another
candidate for a compile time threshold, should one be needed.
OTOH, it may only matter in pathalogical cases.

Interesting data; it may vary from a source language to a source language. I agree that we need more samples to justify adding such heuristic. For now, I propose:

static const unsigned MaxNonLocalAttempts = 10;

Ivan

Hi Ivan,

I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.

I'm playing around with some heuristics in CGP's isFormingBranchFromSelectProfitable
to help mitigate these, and would like to have a feasible solution at hand before landing nonlocal DSE.

As to your specific questions, yes to the "Visited" technique, though I did restructure the checks a bit,
and used a SmallSet instead of SmallPtrSet. I do plan to replace the dominator check, not add an additional one.

That's fine. We actually need a SmallSetVector with reverse insertion order iteration characteristics.

For SafeBlocks we don't need a specific ordering, as it's only used for membership queries.
As for the worklist (Blocks) I think the implementation would work fine if this was unordered,
though I left its type unchanged (SmallVector).

Yes, I added a test for the irreducibility issue.

And I didn't intend to add another compile time throttle at this time.
That can be easily added should an issue arise.

regards,

  • mitch

I do plan to follow up with this, but unfortunately am seeing some rather large performance regressions
due to second order effects. Several benchmarks in the EEMBC 1.1 automotive suite drop 10 to 20%.

Any updates, Mitch?

Unfortunately the tuning I had planned in isFormingBranchFromSelectProfitable is no longer applicable.
The relevant heuristic was the "load feeding compare", which I intended to relax a bit.
Per discussions in other code reviews in that area (I don't have the number off hand, but maybe
related to D16836 and D17288), that heurstic has been deleted altogether, and there seemed to be a
general consensus to move this transformation into the target.

I've been looking into that a bit, but it's slow going.

In the meantime, it seemed counterproductive to enable the dead store enhancements.
While the concept sounds good, in practice it seems there are only modest gains,
and some rather large downsides.

jfb added a subscriber: jfb.Jan 30 2019, 1:52 PM