This is an archive of the discontinued LLVM Phabricator instance.

limit the number of instructions per block examined by dead store elimination
ClosedPublic

Authored by inglorion on Dec 15 2015, 11:10 AM.

Details

Summary

Dead store elimination gets very expensive when large numbers of instructions need to be analyzed. This patch limits the number of instructions analyzed per store to the value of the memdep-block-scan-limit parameter (which defaults to 100). This resulted in no observed difference in performance of the generated code, and no change in the statistics for the dead store elimination pass, but improved compilation time on some files by more than an order of magnitude.

Diff Detail

Event Timeline

inglorion updated this revision to Diff 42880.Dec 15 2015, 11:10 AM
inglorion retitled this revision from to limit the number of instructions per block examined by dead store elimination.
inglorion updated this object.
inglorion added a subscriber: llvm-commits.
bruno added a reviewer: bruno.Jan 5 2016, 3:41 PM
mbodart added a subscriber: mbodart.Jan 5 2016, 4:39 PM
dberlin edited edge metadata.Feb 2 2016, 2:39 PM

Can you provide tests for these cases, so we at least have some idea when the limit should hit?

Any update on this patch? I have an internal source file for which DSE is taking an inordinate amount of time:

1427.7928 ( 99.9%)   0.1237 ( 61.0%)  1427.9166 ( 99.9%)  1428.6744 ( 99.9%)  Dead Store Elimination
 ...
1428.8521 (100.0%)   0.2030 (100.0%)  1429.0551 (100.0%)  1429.8275 (100.0%)  Total

Applying this patch reduces this immensely:

17.7091 ( 94.3%)   0.0123 ( 14.7%)  17.7214 ( 93.9%)  17.7251 ( 93.8%)  Dead Store Elimination
 ...
18.7871 (100.0%)   0.0837 (100.0%)  18.8707 (100.0%)  18.8947 (100.0%)  Total
davidxl edited edge metadata.Apr 11 2016, 9:41 AM
davidxl added a subscriber: eraman.
davidxl added a subscriber: davidxl.

Easwaran may also hit some very long compile time problems due to this.

Feel free to invest time switching it to memoryssa :)

yes -- MemSSA will get there. Short term workaround might also be needed :)

I'd be a lot more willing to say yes (though you can get yes from someone
else too!) if someone was committed to doing the work to actually making it
short term.

Because the short term things all over the place rarely end up short term :)
While plan is to convert the existing memdep passes, me doing it alone will
a longer process than folks are likely to want here.

yes -- MemSSA will get there. Short term workaround might also be needed :)

I think we need a stopgap fix for this non-linear compile time issue. I don't think waiting until it is reimplemented to use MemorySSA is a good strategy given the extreme compile time issues, and I can't even find a switch to disable this pass.

BTW, the author had provided a test case on the mailing list (there are two threads for this patch):

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160201/330335.html

FWIW, I think it is reasonable to workaround quadratic compile time problems with temporary limits until a well scaling algorithm lands. However, I don't think this patch as-is is the correct approach.

lib/Analysis/MemoryDependenceAnalysis.cpp
398–399

Exposing the search limit of MemDep and allowing the dynamic search depth to be surfaced in the API seems like a really bad API.

Fundamentally, I think we should first bound the DSE scan when it has to call through to MemDep, which this effectively does, but in a very round-about way.

Then, if it necessary to give MemDep increasingly small scan limits, you should just add a vanilla input here, and decrease that limit in the DSE loop below rather than carrying-over the limit from one query to another query. But I'd really like to understand if all you actually need is to limit the DSE scan.

davidxl added inline comments.Apr 12 2016, 9:49 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
398–399

One data point -- DSE is not the only client pass that triggers the problem. We have seen GVN triggered one as well. (I have not examined this patch in detail).

chandlerc added inline comments.Apr 12 2016, 9:57 AM
lib/Analysis/MemoryDependenceAnalysis.cpp
398–399

With out a very clear description of why the *same* limiting technique will work with two different consumers, I think it is a big mistake to put it into the API.

Thank you for all your comments, folks. I will be happy to improve this patch so we can get stop the long compiles while a better solution is being worked on.

@chandlerc, I am not 100% sure I understand your suggestion. If I understand you correctly, you are saying that you think it is bad API design to add the getDefaultBlockScanLimit to MemDep and allow clients of MemDep to pass in a limit to getPointerDependencyFrom. Ok. Then you say "we should first bound the DSE scan when it has to call through to MemDep" and "you should just add a vanilla input here" (in getSimplePointerDependencyFrom). What do you mean? Perhaps you could illustrate the API you're looking for with some pseudocode.

For what it's worth, what I am trying to accomplish with this patch is essentially to better enforce the limit that MemDep already has. Without the patch, we limit the number of instructions we will examine in a single call to getSimplePointerDependencyFrom to (by default) the last 100 instructions before the one where the search starts. In DSE, we then look at the result to determine if it lets us determine whether we can definitely perform or definitely not perform the optimization, but, in some cases, the result actually doesn't tell us one way or another, so we call into MemDep again to get the next possibly relevant instruction. This then searches up to 100 instructions from the location of the previous result. Since this can continue indefinitely, we can end up way more than 100 instructions away from where the search originally started, which is what causes the long compiles. This patch limits the search to 100 instructions (or whatever the parameter is set to) from where the DSE pass started the search, effectively closing the loophole that let us run past the limit the original search had.

Ping ..

Chandler, do you have time to reply to Bob's question? What this patch
does essentially is to enforce a global limit on for MemDep queries instead
of using a per-query local limit as is done today.

Your suggestion of having a fixed limit in DSE loop may not work well in
some cases and lead to more pessimistic result. The main problem with that
approach is that it does not know how expensive (how many instructions are
touched in backward walking) in each individual query. If there is a way to
communicate this cost information back to the caller of MD interface, that
will be fine too -- but that is essentially the same as what this patch
does.

We have been beaten badly by this problem in both sanitizer builds and FDO
builds. With Sanitizer buiild, we can throw in O0 as a walkaround (but
also suffer in runtime), but for FDO this is not an option.

David

Ping. Chandler, It has been a while (3 months) since my last reply (May 6) to the thread.

thanks,

davide added a subscriber: davide.Aug 22 2016, 12:26 PM
reames edited edge metadata.Aug 23 2016, 9:38 AM

I just want to make sure I understand the big picture view of this change. I'm going to try to summarize, please correct me if I'm wrong.

Today, we will walk back through the list of defs/clobbers provided by MDA (within a single block) without limit in DSE. Internally, MDA will only find defs/clobbers which are within a limited distance of each other. As a result, a series of adjacent clobbers will be scanned, but the same series of adjacent clobbers with a single long break of non-memory related instructions will not be. Right?

With the patch, we will walk backwards a fixed distance (in number of instructions) considering any def/clobber we see in that window.

Is that a correct summary?

(p.s. Using the same default value from the original implementation with the new one seems highly suspect since the old implementation would have been much more aggressive in practice..)

reames added inline comments.Aug 23 2016, 12:04 PM
lib/Transforms/Scalar/DeadStoreElimination.cpp
512–513

FYI: this patch clearly needs rebased

512–513

I'm wondering whether we can solve the practical problem with a caching change.

What if we simply chose to cache the intermediate MDA results? (Note that MDA does not do this for the getPointerDependencyFrom interface.)

I'm picturing a simple cache structure of Map<pair<MemoryLoc, Instruction>, MemDepResult>

This wouldn't reduce the worst case complexity (it could be each instruction has a different memory location associated with it which is may alias with all others), but how many unique memory locations do we see in practice? And out of those, how many are mayalias (i.e. not a possible read or def which terminates a chain)?

inglorion updated this revision to Diff 69170.Aug 24 2016, 2:04 PM
inglorion edited edge metadata.

Thanks for rebasing my changes, @davidxl! Since your updated version applies cleanly and matches the changes I wanted to make, I'm updating the code on Phabricator to match yours.

davidxl accepted this revision.Aug 25 2016, 9:38 AM
davidxl added a reviewer: davidxl.

lgtm

Based on discussions, we will take this as solution to tackle quadratic behavior in DSE until memSSA is in place.

This revision is now accepted and ready to land.Aug 25 2016, 9:38 AM
inglorion closed this revision.Aug 26 2016, 9:25 AM