Differential D19821
[EarlyCSE] Optionally use MemorySSA. NFC.
Authored by gberry on May 2 2016, 12:18 PM.

Details

Use MemorySSA, if requested, to do less conservative memory dependency checking. This change doesn't enable the MemorySSA-enhanced EarlyCSE in the default pipelines, so it should be NFC.
Comment Actions

At a meta level, I'm not convinced that updating EarlyCSE to work with MemorySSA is the right approach. EarlyCSE is focused on being really, really fast at cleaning up stupidly redundant IR so that the rest of the pass pipeline doesn't need to worry about it. MemorySSA is relatively expensive to construct. Given that, I'm really not sure putting it at the very beginning of the pipeline is a good design choice.

Now, having said that, it might make sense to have a dominance-order CSE pass based on MemorySSA for use later in the pass pipeline. Currently we use EarlyCSE for two distinct purposes; it's possible that it might be time to split them. Can you justify why this is the right approach?
Comment Actions

In particular, I'm trying to understand if your concerns are *mostly* "we …

Comment Actions

@reames I've attempted to resolve most of your individual concerns (or at least made them explicit in the change). The bigger question of whether this is worth the compile time remains to be determined. Do you think more tests need to be added in addition to the already existing EarlyCSE tests? Adding additional RUN lines to those tests to enable -early-cse-use-memoryssa seems like overkill to me, but I don't feel too strongly about it. Or are you more concerned about adding new tests for cases that are only caught by MemorySSA (both positive and negative)?

@dberlin, @george.burgess.iv There are a couple of FIXME comments in this change that identify cases where MemorySSA may be too conservative (e.g. when dealing with release fences and atomic loads). Do you think it is reasonable to refine these cases in MemorySSA, or is the conservatism specific to EarlyCSE's usage, in which case we should deal with it in EarlyCSE? Similarly, what do you think of Philip's suggestion to look at using ValueHandles in MemorySSA to make invalidation on instruction removal more automatic?

Comment Actions

@reames @dberlin Regarding the compile-time impact, do you think it would be worth pursuing a change to make EarlyCSE's use of MemorySSA optional? That way we could avoid using it for the early EarlyCSE runs and only use it for later ones, perhaps even influenced by optimization level. A related aspect of the plan for MemorySSA that I'd like to understand is how well we think we'll be able to amortize the cost of building it by preserving/maintaining it across passes (see the preservation sketch below). Daniel, can you share your thoughts on that?

Comment Actions

So, I would like to see real numbers that say this is going to slow down …
We can amortize the cost quite well. It should essentially cost nearly …
The entire plan is actually to amortize the cost. At the outset, with a little work, we should have to compute MemorySSA …
But it's also interesting to note that none of these passes preserve MemDep …
It definitely can be made to be so. Shoving this in EarlyCSE, if it's fast enough, seems reasonable at a …
That needs to be traded off against how much better/easier/etc. it makes …

Comment Actions

Update to use MemorySSA in EarlyCSE if it is available, and make it …

Comment Actions

Compile-time tests for the llvm test-suite on aarch64 (at -O3) were mostly a wash: some faster, some slower, no big outliers. The net change was slightly better compile times. Notable performance improvements (no significant regressions): …

@reames I've added some additional lit test coverage; is there more lit test coverage you'd like to see?

Comment Actions

I'm also interested.

Comment Actions

Here are the worst llvm test-suite compile-time regressions. I've filtered out the very small test cases. The data shown are the percent diffs of compile times from 5 different runs with and without the above change.

llvm-test-suite/SingleSource/Benchmarks/Misc/ffbench:normal PASS +1.079%, +2.837%, +17.951%, +21.615%, +33.206%

Comment Actions

@dberlin Yeah, I'm in the process of double-checking some of these to make sure my testing methodology was sound.

Comment Actions

Updated llvm-test-suite compile-time regressions using LNT methodology: …

Comment Actions

Looks like the affected benchmarks changed in the new measurements; does that hold for the performance improvements as well?
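The "amortize the cost" plan discussed above boils down to passes both consuming and preserving MemorySSA, so it isn't rebuilt from scratch before every user. Below is a minimal sketch of the mechanics under the legacy pass manager; the pass name is hypothetical, while MemorySSAWrapperPass and the AnalysisUsage calls are the real LLVM API. Any IR changes the pass made would also have to update MemorySSA incrementally, which is elided here.

```cpp
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/IR/Function.h"
#include "llvm/Pass.h"
using namespace llvm;

namespace {
// Hypothetical consumer pass, used only to illustrate preservation.
struct HypotheticalMSSAUser : public FunctionPass {
  static char ID;
  HypotheticalMSSAUser() : FunctionPass(ID) {}

  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.addRequired<MemorySSAWrapperPass>();
    // The key line: declare MemorySSA preserved so the pass manager keeps
    // it alive for the next consumer instead of rebuilding it.
    AU.addPreserved<MemorySSAWrapperPass>();
  }

  bool runOnFunction(Function &F) override {
    MemorySSA &MSSA = getAnalysis<MemorySSAWrapperPass>().getMSSA();
    // ... query MSSA here, keeping it up to date across any IR changes ...
    (void)MSSA;
    return false;
  }
};
} // end anonymous namespace

char HypotheticalMSSAUser::ID = 0;
```

Under the new pass manager, the equivalent is returning a PreservedAnalyses set that marks MemorySSAAnalysis as preserved.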
Comment Actions

George, can you stare at these quickly and see if any of your caching … (It's fine if not, just trying to avoid duplicating work.)

Comment Actions
That depends on what exactly is slowing the benchmarks down so much. If our usage pattern is query -> remove -> query -> remove, then our cache may become useless, since (worst case) we drop the entire thing on each removal. If we primarily query defs, then this pattern gives us the same effectively-N^2 behavior as MemDep. One of the big goals of the new cache/walker is to allow us to drop as little as possible.

In terms of pure walker/cache speed, the current walker is happy to do a lot of potentially useless work walking phis it can't optimize; the one I'm working on will do as little work as possible in that case. Also, the current walker potentially does a lot of domtree queries when caching results, whereas the one I'm working on does none (except in asserts).

Glancing at some of the benchmarks, I'm not sure if any of that is what's slowing us down here, though. If you'd like, I'm happy to profile/poke around and give you a more definitive answer.
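To make the query/removal interleaving described above concrete, here is a minimal sketch of the pattern, assuming MemorySSA has already been built for the function. isRedundant() is a hypothetical stand-in for EarlyCSE's value-numbering lookup, and the removal goes through the later MemorySSAUpdater API (at the time of this review, removal lived on MemorySSA itself).

```cpp
#include "llvm/Analysis/MemorySSA.h"
#include "llvm/Analysis/MemorySSAUpdater.h"
#include "llvm/IR/InstIterator.h"
using namespace llvm;

// Hypothetical stand-in for EarlyCSE's hash-table lookup (always false here).
static bool isRedundant(Instruction *, MemoryAccess *) { return false; }

void queryRemoveLoop(Function &F, MemorySSA &MSSA, MemorySSAUpdater &MSSAU) {
  MemorySSAWalker *Walker = MSSA.getWalker();
  for (inst_iterator It = inst_begin(F), E = inst_end(F); It != E;) {
    Instruction &I = *It++; // advance before any erasure
    MemoryUseOrDef *MA = MSSA.getMemoryAccess(&I);
    if (!MA)
      continue; // instruction doesn't read or write memory
    // Query: walk to the nearest clobbering access; the walker caches this.
    MemoryAccess *Clobber = Walker->getClobberingMemoryAccess(MA);
    if (isRedundant(&I, Clobber)) {
      // Remove: in the worst case this invalidates the walker's cache, so
      // the next query repeats the walk -- the effectively-N^2 behavior
      // described above.
      MSSAU.removeMemoryAccess(MA);
      I.eraseFromParent();
    }
  }
}
```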
Comment Actions

The performance changes are mostly the same. Updated perf data (this is on an A57-like out-of-order aarch64 core; time deltas, negative is better):

llvm-test-suite/MultiSource/Benchmarks/mafft/pairlocalalign +0.399%, +0.430%, +0.458%, +0.507%, +0.593%, +0.626%, +0.778%, +1.097%, +1.195%, +1.309%

Comment Actions

I re-ran the llvm test-suite compile-time numbers with more samples and found no significant changes (improvements or regressions) in compile time.

Comment Actions

I've put this on hold until I can re-run the compile-time numbers after George's changes to the MemorySSA caching code go in (http://reviews.llvm.org/D21777).

Comment Actions

Sorry for not responding to this for so long. My objection is primarily a compile-time concern. Right now, EarlyCSE is a *very* cheap pass to run. If you can keep it fast (even when we have to reconstruct MemorySSA), I don't object to having EarlyCSE be MemorySSA-based. I think that is a very hard bar to pass in practice. In particular, the bar is not total O3 time; it's EarlyCSE time. I fully expect that the more precise analysis may speed up other passes, but we can't assume that happens for all inputs. (As I write this, I'm recognizing that this might be too high a bar to set. If you think I'm being unreasonable, argue why and what a better line should be.)

Given that I'm not going to have time to be actively involved in this thread, I'm going to defer to other reviewers. If they think this is a good idea, I will not actively block it.

Comment Actions

p.s. The newer structure of using the original fast-path check, with MemorySSA as an optional backup when available, is much cleaner than the original code. I'm okay with something like this landing (once other reviewers have signed off), even if the compile-time question isn't fully resolved, provided that forcing MemorySSA stays under an off-by-default option. As structured, this doesn't complicate the existing code much at all.

Comment Actions

New version that adds a pass parameter to control whether MemorySSA is used. Also changed the memory generation check to do a simpler MemorySSA … (see the sketch below).

Comment Actions

I've collected some compile-time stats when enabling the MemorySSA EarlyCSE just for the EarlyCSE pass added at the beginning of addFunctionSimplificationPasses at O2 and higher.
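For reference, here is one plausible shape for the simpler memory-generation check mentioned in the "New version" comment, under the assumption that it keys off a single clobber query: two accesses to the same location observe the same memory state if the later access's clobbering definition dominates the earlier access. The helper name echoes the discussion; the MemorySSA calls are real API, but the exact form in the patch may differ.

```cpp
#include "llvm/Analysis/MemorySSA.h"
using namespace llvm;

static bool isSameMemGeneration(MemorySSA &MSSA, MemoryAccess *EarlierMA,
                                MemoryUseOrDef *LaterMA) {
  // Walk from the later access to the definition that actually clobbers it.
  MemoryAccess *LaterDef =
      MSSA.getWalker()->getClobberingMemoryAccess(LaterMA);
  // If that clobbering definition dominates (or is) the earlier access, no
  // write touched the location between the two instructions, so both see
  // the same "memory generation".
  return MSSA.dominates(LaterDef, EarlierMA);
}
```

The appeal over EarlyCSE's existing scheme is precision: the current code bumps a single generation counter on any instruction that may write memory, whereas a clobber query is per-location, at the price of one walker query per candidate.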
Comment Actions

Can this be committed as a separate change?