This patch updates IPSCCP to use PredicateInfo to propagate
facts to true branches predicated by EQ and to false branches
predicated by NE.
As a follow up, we should be able to extend it to also propagate additional facts about nonnull.
Differential D45330
[IPSCCP] Use PredicateInfo to propagate facts from cmp instructions. fhahn on Apr 5 2018, 10:57 AM. Authored by
Details
This patch updates IPSCCP to use PredicateInfo to propagate As a follow up, we should be able to extend it to also propagate additional facts about nonnull.
Diff Detail
Event TimelineComment Actions This is interesting. One of the weaknesses of the SCCP algorithm is that it's not path-sensitive; a value is only "constant" if it has the same value on all possible paths. The use of ssa_copy provides a way around this restriction, to some extent: the result of an ssa_copy can be a constant even if the operand is overdefined. That said, this is adding substantial complexity to IPSCCP, and I'm not sure how much benefit you'll get in practice; the example @test2 isn't really compelling. Building the PredicateInfo up-front is probably fine; I would guess it's not that expensive compared to the other work IPSCCP is doing (but you should measure to quantify that).
We need to be very cautious about propagating facts about floats... things which are "obviously" true might not actually be true due to fast-math. Comment Actions Thanks for having a look! The test cases at the moment are just very simple to briefly illustrate what is going on. One case in particular I think this could be helpful with is propagating null/nonnull to callsites. I will make the patch slightly more powerful and try to gather some numbers in the next few days. Comment Actions So, predicateinfo is one of the fastest passes (or was) that we had. I'm not horribly worried about speed (i could make it faster, too). I did have plans to try to turn it back into a virtual form at some point (which would make it faster), i think i found an not horrible way of doing it. The one overall comment i'd have here is that SCCP is fairly limited in how it can use this, because of how limited the lattice is. That said, Comment Actions Yep. One of the slightly longer-term things I want to do is either make IPSCCP use a lattice including (integer) ranges or have a separate pass doing value range propagation. Do you think having it as a separate pass would be beneficial?
Looks like the comment got cut off here? Comment Actions whoops, hit send early. Comment Actions Fix some problems, SingleSource, MultiSource, SPEC2006 now pass with this patch and LTO. Still needs some cleanup and a few more tests. Comment Actions I collected some stats for SPEC2006, MultiSource and SingleSource with -O3 -flto. For sccp.IPNumInstRemoved I get 88490 with this patch instead of 81585 without it (~ +8% eliminated instructions). I think number of instructions eliminated looks quite promising. What do you think? Comment Actions
That's much better than I expected. Although, I'm not sure how much I would trust that number; IPSCCP runs before CorrelatedValuePropagation, so you might be optimizing sequences which would be optimized later anyway. Comment Actions I also aggregated all stats by CorrelatedValuePropagation for the same set of benchmarks: With this patch, CVP triggered in 23754 cases, without this patch it triggered in 24904 cases. I guess this suggests some of the cases eliminated with this patch would also be handled by CVP. But the majority would not be. One additional thing we could do by using PredicateInfo is propagate non-null constraints to call sites, which might help the inliner. Comment Actions Rebased and fixed a few problems. This now passes the test-suite, SPEC2000 & SPEC2006 with LTO. I've now also done a benchmarking run with the suites mentioned earlier on a cortex-a57 with LTO. Overall the geomean runtime improvement is 0.65 percent. The biggest improvements were SingleSource/Benchmarks/BenchmarkGame/recursive -13.16% SingleSource/Benchmarks/Stanford/RealMM -9.89% MultiSource/Benchmarks/McCat/08-main/main -7.90% MultiSource/Benchmarks/MiBench/telecomm-CRC32/telecomm-CRC32 -6.64% MultiSource/Benchmarks/Prolangs-C++/ocean/ocean -6.26% MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk -6.13% MultiSource/Benchmarks/Olden/mst/mst -6.08% MultiSource/Benchmarks/mediabench/adpcm/rawcaudio/rawcaudio -5.21% SingleSource/Benchmarks/Shootout-C++/Shootout-C++-objinst -5.19% SingleSource/Benchmarks/Polybench/stencils/jacobi-1d-imper/jacobi-1d-imper -4.48% MultiSource/Benchmarks/Prolangs-C++/simul/simul -4.26% SingleSource/Benchmarks/Shootout/Shootout-objinst -3.72% There are a few regression ranging from 0 to up 4% worse execution time The sum of all SCCP stats for all benchmarks is 401732 over 374020 without this patch (+7%). Comment Actions Ping. Do you think that the results for this patch are convincing enough on it's own? Comment Actions I think the numbers are good enough to be worth it perf wise. If they look good, i'd say let's commit it. Comment Actions Thanks for having a look! I did not post them, but for the tests I run (test-suite, spec2000, spec2006) on AArch64, there were no compile time regressions. One change is that we now compute the dominator trees before IPSCCP. I suppose we should try and preserve them in IPSCCP. I'll try to put a patch together before I submit this change. Comment Actions Thanks for having a look. Updated diff to include changes to pipeline tests. I plan to commit this tomorrow, unless there are any more concerns.
Comment Actions I had to revert this patch in rL333323 because it caused a mismatch in the stage3 and stage4 binaries built by the clang-with-thin-lto-ubuntu bot. I will investigate. Comment Actions So it looks like the codegen differences come from PredicateInfo, rather than from the propagation. With only the parts of this patch to build and remove PredicateInfo, I get a small difference in the strtab section between a stage3 and stage4 build of clang with thin LTO, there's no difference in the .text section. I also had a look at the IR after IPSCCP for both stages, the only differences are some variable names/labels. I am not entirely sure yet, but I think that might be caused by small differences in the order the ssa_copy intrinsics for the predicates are emitted. Any ideas or pointers would be greatly appreciated :) Comment Actions My first guess would be some sort of non-determinism. If you're trying to determine whether to IR modules in memory are identical, make sure to include the use-list order in the dump. Comment Actions Thanks for you suggestions. With D48230, there is no difference in the stage3 and 4 clang binaries with thinlto for me locally. Comment Actions This is now committed as rL335206 and the LTO/Thin LTO bots are happy now. I will follow this up with a change to propagate non-null info. Comment Actions Ping, still some issues? Since it was reverted again recently :( I am especially interested in the follow up - nonnull info propagation.
Comment Actions Yep, there was one bot left that was failing because of this. I still have to reproduce the problem. Hope to have some time for that soon. |
Set name?