Dead store elimination gets very expensive when large numbers of instructions need to be analyzed. This patch limits the number of instructions analyzed per store to the value of the memdep-block-scan-limit parameter (which defaults to 100). This resulted in no observed difference in performance of the generated code, and no change in the statistics for the dead store elimination pass, but improved compilation time on some files by more than an order of magnitude.
Any update on this patch? I have an internal source file for which DSE is taking an inordinate amount of time:
1427.7928 ( 99.9%)  0.1237 ( 61.0%)  1427.9166 ( 99.9%)  1428.6744 ( 99.9%)  Dead Store Elimination
1428.8521 (100.0%)  0.2030 (100.0%)  1429.0551 (100.0%)  1429.8275 (100.0%)  Total
Applying this patch reduces this immensely:
17.7091 ( 94.3%)  0.0123 ( 14.7%)  17.7214 ( 93.9%)  17.7251 ( 93.8%)  Dead Store Elimination
18.7871 (100.0%)  0.0837 (100.0%)  18.8707 (100.0%)  18.8947 (100.0%)  Total
I'd be a lot more willing to say yes (though you can get a yes from someone
else too!) if someone was committed to doing the work to actually make it happen.
Because the short-term things all over the place rarely end up short term :)
While the plan is to convert the existing memdep passes, me doing it alone will
be a longer process than folks are likely to want here.
I think we need a stopgap fix for this non-linear compile time issue. I don't think waiting until it is reimplemented to use MemorySSA is a good strategy given the extreme compile time issues, and I can't even find a switch to disable this pass.
BTW, the author had provided a test case on the mailing list (there are two threads for this patch):
FWIW, I think it is reasonable to work around quadratic compile time problems with temporary limits until a well-scaling algorithm lands. However, I don't think this patch as-is is the correct approach.
Exposing MemDep's search limit and surfacing the dynamic search depth in the API seems like really bad API design.
Fundamentally, I think we should first bound the DSE scan when it has to call through to MemDep, which this effectively does, but in a very roundabout way.
Then, if it is necessary to give MemDep increasingly small scan limits, you should just add a vanilla input here, and decrease that limit in the DSE loop below rather than carrying the limit over from one query to another. But I'd really like to understand if all you actually need is to limit the DSE scan.
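If I'm reading the suggestion right, it could look something like the sketch below. This is a hypothetical, self-contained model with stand-in types, not the real LLVM API: `queryWithLimit` plays the role of getSimplePointerDependencyFrom with a plain limit input, and `dseWalk` is the DSE loop that owns the overall budget and shrinks each successive query's limit itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a single MemDep query: scan backwards from Pos for the
// previous def/clobber, giving up after `Limit` instructions.  `Limit`
// is a plain input; MemDep threads no state between queries.
struct QueryResult {
  bool Conclusive;   // did the scan reach a def/clobber?
  unsigned Scanned;  // how many instructions this query looked at
};

QueryResult queryWithLimit(const std::vector<bool> &IsClobber,
                           std::size_t &Pos, unsigned Limit) {
  unsigned Scanned = 0;
  while (Pos > 0 && Scanned < Limit) {
    --Pos;
    ++Scanned;
    if (IsClobber[Pos])
      return {true, Scanned};
  }
  return {false, Scanned};
}

// The DSE loop owns the overall budget and decreases each successive
// query's limit itself, instead of MemDep carrying the limit over.
// Returns the total number of instructions scanned across all queries.
unsigned dseWalk(const std::vector<bool> &IsClobber, std::size_t Start,
                 unsigned Budget) {
  std::size_t Pos = Start;
  unsigned Total = 0;
  while (Pos > 0 && Budget > 0) {
    QueryResult R = queryWithLimit(IsClobber, Pos, Budget);
    Total += R.Scanned;
    Budget -= R.Scanned;
    if (!R.Conclusive)
      break; // hit the limit (or block start) without an answer
    // Otherwise DSE inspects the clobber; if it is inconclusive for
    // the store being considered, it loops and queries again.
  }
  return Total;
}
```

With a 1000-instruction block and a clobber every 50 instructions, `dseWalk` with a budget of 100 scans exactly 100 instructions no matter how many chained queries that takes; the bookkeeping lives entirely in the caller.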
One data point -- DSE is not the only client pass that triggers the problem. We have seen GVN trigger it as well. (I have not examined this patch in detail.)
Without a very clear description of why the *same* limiting technique will work with two different consumers, I think it is a big mistake to put it into the API.
Thank you for all your comments, folks. I will be happy to improve this patch so we can stop the long compiles while a better solution is being worked on.
@chandlerc, I am not 100% sure I understand your suggestion. If I understand you correctly, you are saying that you think it is bad API design to add the getDefaultBlockScanLimit to MemDep and allow clients of MemDep to pass in a limit to getPointerDependencyFrom. Ok. Then you say "we should first bound the DSE scan when it has to call through to MemDep" and "you should just add a vanilla input here" (in getSimplePointerDependencyFrom). What do you mean? Perhaps you could illustrate the API you're looking for with some pseudocode.
For what it's worth, what I am trying to accomplish with this patch is essentially to better enforce the limit that MemDep already has. Without the patch, we limit the number of instructions we will examine in a single call to getSimplePointerDependencyFrom to (by default) the last 100 instructions before the one where the search starts. In DSE, we then look at the result to determine if it lets us determine whether we can definitely perform or definitely not perform the optimization, but, in some cases, the result actually doesn't tell us one way or another, so we call into MemDep again to get the next possibly relevant instruction. This then searches up to 100 instructions from the location of the previous result. Since this can continue indefinitely, we can end up way more than 100 instructions away from where the search originally started, which is what causes the long compiles. This patch limits the search to 100 instructions (or whatever the parameter is set to) from where the DSE pass started the search, effectively closing the loophole that let us run past the limit the original search had.
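To make that loophole concrete, here is a small self-contained C++ model of the pre-patch behavior (hypothetical code, not the real MemDep implementation): each chained query restarts with a fresh limit, so as long as consecutive defs/clobbers sit closer together than the limit, the walk is bounded only by the block itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of the pre-patch behavior: every query gets a brand-new budget
// of `Limit`, so chained inconclusive queries can walk arbitrarily far.
// Returns the total number of instructions scanned.
std::size_t scanPerQueryLimit(const std::vector<bool> &IsClobber,
                              std::size_t Start, std::size_t Limit) {
  std::size_t Scanned = 0;
  std::size_t I = Start;
  while (I > 0) {
    std::size_t Budget = Limit; // fresh budget for every query
    bool FoundClobber = false;
    while (I > 0 && Budget > 0) {
      --I;
      --Budget;
      ++Scanned;
      if (IsClobber[I]) {
        FoundClobber = true; // this query returns a def/clobber
        break;
      }
    }
    if (!FoundClobber)
      return Scanned; // the query hit its limit: the caller stops
    // The caller finds the result inconclusive and queries again
    // from the returned clobber, with a brand-new limit.
  }
  return Scanned;
}
```

With a 1000-instruction block and a clobber every 50 instructions, the walk scans all 999 predecessors despite the nominal limit of 100; with no clobbers at all it stops after 100, which is the bound the limit was meant to guarantee.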
Chandler, do you have time to reply to Bob's question? What this patch
does essentially is to enforce a global limit on MemDep queries instead
of using a per-query local limit as is done today.
Your suggestion of having a fixed limit in the DSE loop may not work well in
some cases and can lead to more pessimistic results. The main problem with that
approach is that the caller does not know how expensive each individual query
is (how many instructions are touched in the backward walk). If there were a
way to communicate this cost information back to the caller of the MD
interface, that would be fine too -- but that is essentially the same as what
this patch does.
We have been beaten badly by this problem in both sanitizer builds and FDO
builds. With sanitizer builds, we can fall back to O0 as a workaround (but
then also suffer at runtime); for FDO this is not an option.
I just want to make sure I understand the big picture view of this change. I'm going to try to summarize, please correct me if I'm wrong.
Today, we will walk back through the list of defs/clobbers provided by MDA (within a single block) without limit in DSE. Internally, MDA will only find defs/clobbers which are within a limited distance of each other. As a result, a series of adjacent clobbers will be scanned, but the same series of adjacent clobbers with a single long break of non-memory related instructions will not be. Right?
With the patch, we will walk backwards a fixed distance (in number of instructions) considering any def/clobber we see in that window.
Is that a correct summary?
(p.s. Using the same default value from the original implementation with the new one seems highly suspect, since the old implementation would have been much more aggressive in practice.)
I'm wondering whether we can solve the practical problem with a caching change.
What if we simply chose to cache the intermediate MDA results? (Note that MDA does not do this for the getPointerDependencyFrom interface.)
I'm picturing a simple cache structure of Map<pair<MemoryLoc, Instruction>, MemDepResult>
This wouldn't reduce the worst-case complexity (each instruction could have a different memory location associated with it which may alias with all the others), but how many unique memory locations do we see in practice? And out of those, how many are MayAlias (i.e. not a possible read or def which terminates a chain)?
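As a sketch of the mechanics only, here is a minimal memoizing wrapper (hypothetical stand-in types throughout -- the real key would be a (MemoryLocation, Instruction*) pair and the value an llvm::MemDepResult; `Compute` stands in for the expensive backwards scan):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-ins for the real LLVM types, chosen only to keep this runnable.
using MemoryLoc = std::string;
using InstrId = unsigned;
struct DepResult {
  InstrId DefiningInstr;
  bool IsClobber;
};

// Memoize per-(location, instruction) query results so repeated DSE
// walks over the same block reuse earlier answers instead of rescanning.
class CachingMemDep {
  std::map<std::pair<MemoryLoc, InstrId>, DepResult> Cache;
  unsigned Misses = 0;

public:
  template <typename F>
  DepResult getPointerDependencyFrom(const MemoryLoc &Loc, InstrId At,
                                     F Compute) {
    auto Key = std::make_pair(Loc, At);
    auto It = Cache.find(Key);
    if (It != Cache.end())
      return It->second; // cache hit: no rescan
    ++Misses;            // cache miss: do the expensive scan once
    DepResult R = Compute(Loc, At);
    Cache.emplace(Key, R);
    return R;
  }
  unsigned misses() const { return Misses; }
};
```

Repeating a query for the same (location, instruction) pair then costs one map lookup rather than another backwards scan; the open question above -- how many unique locations show up in practice -- decides whether the cache actually pays off.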
FYI: this patch clearly needs to be rebased.