- User Since
- Jan 12 2017, 6:15 AM (350 w, 1 d)
Jun 22 2021
Sorry for the inconvenience; let me add a comment to the commit message.
I think I managed to figure out what's going on here in the end, but it would help to explain what the bug is in the commit message, something like:
May 17 2021
@nikic this fixed the regression, thank you!
May 11 2021
Hi, we've got a ~6% regression in SPEC INT 2006 462.libquantum on AArch64 (both -flto and -Ofast) that comes back to this change. See here for a reproducer https://godbolt.org/z/dq98Gqqxn (-fno-vectorize is not strictly necessary, but it does make the difference easier to spot). @dmgreen mentioned to me that we could probably fix this up in the AArch64 backend, but a fix in the mid-end might be more generally useful too.
Apr 22 2021
Thanks for the comments. I'm happy to leave LoopAccessAnalysis alone for now and focus on Loop Distribute's cost model. I was hoping that a cost model tweak or two might enable some more loop distribution in TSVC, but it looks like that isn't the case.
Apr 21 2021
Looking at TSVC a bit, @xbolva00:
- s221 won't distribute because the second read of a[i] is removed by EarlyCSE, so there is no unique load instruction for a second loop. For this loop I'm not convinced that distribution is likely to help performance; it's a trade-off between (some) vectorization and re-loading both a[i] and d[i].
- s222 also gets mangled by EarlyCSE, but the result would still be distributable if it weren't for the order of the stores to e[i] and a[i]. This runs into a limitation of LoopAccessAnalysis, which can't reorder instructions. Perhaps it could help to do a bit of scheduling on IR?
- s2275 as mentioned above, this runs into another LoopAccessAnalysis limitation: it only handles innermost loops. I'm not sure how easy (if at all possible) it would be to lift that restriction.
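To make the s221 discussion above concrete, here is a minimal C sketch of a loop with the same *shape* (this is not the actual TSVC kernel; array names and sizes are illustrative). Distribution splits the vectorizable `a[i]` update away from the serial recurrence on `b[i]`, at the cost of re-loading `a[i]` and `d[i]` in the second loop:

```c
#define N 64

/* Original loop: the recurrence on b[i] forces the whole body to run
 * serially, even though the a[i] update on its own is vectorizable. */
static void fused(float *a, float *b, const float *c, const float *d, int n) {
    for (int i = 1; i < n; i++) {
        a[i] = a[i] + c[i] * d[i];      /* independent: vectorizable */
        b[i] = b[i - 1] + a[i] + d[i];  /* loop-carried recurrence: serial */
    }
}

/* After distribution: loop 1 can vectorize, loop 2 stays scalar.
 * The trade-off is that a[i] and d[i] must be loaded again here. */
static void distributed(float *a, float *b, const float *c, const float *d, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i] + c[i] * d[i];
    for (int i = 1; i < n; i++)
        b[i] = b[i - 1] + a[i] + d[i];
}
```

Both versions perform the same arithmetic per element in the same order, so they produce bit-identical results; whether the split pays off depends on whether the vectorization win outweighs the extra loads.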
Apr 20 2021
Now, there are no differences in distributed loops in the test suite and SPEC, before and after the patch, as intended.
Rebased, and addressed the discrepancy in the loops distributed. The difference hinges on loops that contain backward dependences which the loop vectorizer can handle, but which would frustrate loop distribution. In this case, we don't distribute the loop and leave it to the loop vectorizer.
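As a hypothetical illustration of the kind of backward dependence meant here (not taken from the patch or the test suite): a load of `a[i - 2]` feeding a store to `a[i]` is a backward dependence with distance 2, which the loop vectorizer can still handle with a vectorization factor of at most 2, so there is nothing for distribution to gain by splitting it out.

```c
/* a[i] reads a[i - 2]: a backward loop-carried dependence with
 * distance 2. Vectorizing with VF <= 2 preserves the semantics,
 * so the vectorizer can take this loop as-is. */
static void backward_dep(float *a, int n) {
    for (int i = 2; i < n; i++)
        a[i] = a[i - 2] + 1.0f;
}
```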
Apr 15 2021
Apr 14 2021
Apr 13 2021
Since this is mostly just moving pre-existing code, I think it's fine to address the style issues in a separate NFC commit. I believe all comments have been addressed, but please wait a day or two before committing in case there is anything else. LGTM.
Apr 12 2021
+1 on eagerly awaiting a fix. a 3% regression on astar (AArch64, LTO) bisects to @lebedev.ri 's revert: https://reviews.llvm.org/rG6270b3a1eafaba4279e021418c5a2c5a35abc002 .
Thanks! This seems to be moving in the right direction. Now that SCCPInstVisitor is separate, its declaration can move into SCCPSolver.cpp entirely.
Apr 7 2021
Good spot! Thanks!
Apr 1 2021
I think the issue here is that quite a number of implementation details are leaking into the header file where they don't belong because they are not part of the public interface. You're right that code using the header file doesn't have access to these by virtue of being private, but this does not make for a very readable header file, IMO.
Mar 30 2021
Mar 8 2021
Mar 5 2021
Mar 1 2021
Add a few more tests and some further test reduction.
Feb 24 2021
Feb 18 2021
Committed as https://reviews.llvm.org/rG93d9a4c95aff . Looks like I forgot to add the "Differential Revision:" line :(
Feb 17 2021
I've adopted your suggestions, thanks!
Feb 15 2021
Feb 8 2021
This looks good to me. @fhahn are you happy with the suggested approach to tighten up the SCCPSolver class in subsequent patches, when function inlining starts using it?
Maybe I'm missing something, but this change doesn't seem to be effective anymore after the new pass manager switcheroo. Did this pass not get ported, in favour of SimpleLoopUnswitch? This now shows up as a regression relative to (what will be) LLVM 12.
Feb 4 2021
This looks like a fine first step to separate SCCPSolver from SCCP. I agree with @fhahn that the SCCPSolver class should be improved by pulling out implementation details into (separate, hidden) classes, but this is probably best reviewed in separate patches. At least it looks like the public interface of SCCPSolver is already restricted to what is needed by SCCP.
Feb 3 2021
Hey, we've run into a case where this patch causes a dead loop to appear which doesn't subsequently get removed. It's not a huge deal (we're seeing some great speed-ups from this patch too!), but a niggle that might be nice to address. Reproducer:
Jan 27 2021
As the approach here is pretty much identical to D36432, do you have any plans on the cost model for this pass? The earlier proposal seems to have been abandoned due to a lack of good heuristics to trade off between the (potentially huge) increase in code size (as well as compile time) and performance improvements (which may or may not arise due to later optimizations). Perhaps it would help to look at what trade-offs GCC makes? Specifically, it would be good to have an idea of how you would implement getSpecializationCost and getSpecializationBonus.
Jan 18 2021
Looks good to me!
Jan 15 2021
Jan 14 2021
So on SPEC 2006 this fixes astar (+9.6%) as well as shakes things up enough to "fix" h264ref (+9.0%, see D93946). Other notable changes are libquantum (+3.2%) and omnetpp (-2.8%). Geomean is +1.5%.
Turns out the failing benchmarks are due to a miscompilation with -g3 (which we add to profiled runs). The patch does seem to make that miscompilation more likely. I'll try to reduce that separately, but at least I'll have some performance numbers shortly.
Jan 12 2021
Hi folks, I'm afraid an 8% regression in h264ref (SPEC INT 2006) bisects to this patch. This in on AArch64 with -flto on a Neoverse-N1 based system, but I wouldn't be surprised if this showed up elsewhere too.
Jan 8 2021
Yep, definitely a problem on our end; I've got similar symptoms in other runs. Sorry about the noise, and I will try to get some perf data once we sort this out.
Our flags are -mcpu=native -O3 -fomit-frame-pointer -flto on a Neoverse-N1 based system, so -mcpu=neoverse-n1 should do the same thing.
Jan 7 2021
@fhahn I'm afraid I'm seeing runtime and verification errors on perlbench, gcc, gobmk, and astar in SPEC INT 2006. I'm guessing something later in the pipe really doesn't like it when certain loops aren't rotated?
Thanks, Florian. I'll give the patch a run through our benchmarking.
@pzheng I don't have perf data for Cortex-A57, but I would certainly expect to see an uplift there too.
Jan 6 2021
Dec 15 2020
Hi, our flags are "-Ofast -flto -mcpu=native -fomit-frame-pointer", where "native" is a Neoverse-N1 system. Let me know if that helps.
Dec 1 2020
Just wanted to add to @fhahn that we're seeing this too, as an even more notable 8% regression (same benchmark, different hardware). Florian's explanation matches what I'm seeing, but I hadn't had a chance to confirm what was happening.
Nov 10 2020
Nov 9 2020
Nov 6 2020
Hi, this is regressing a few internal workloads (physics simulations, AArch64) by a few percent. Did you do any performance measurements for this change?
Nov 3 2020
Reverted due to new test failing on a bunch of buildbots. I'll try again tomorrow, looks like the other pipeline tests manage to work around it.
Oct 21 2020
@SjoerdMeijer yeaahhh these pipeline tests are a bit of a pain, but nothing some big-brain sed scripting can't solve.
Added LTO pipeline test
Sep 29 2020
Thanks @tpopp, that'll unblock all of us.
Sep 28 2020
Committed as d5fd3d9b903e
Sep 22 2020
SPEC 2017 on AArch64 is neutral on the geomean. The only slight worry is omnetpp with a 1% regression, but this is balanced by a 0.8% improvement on mcf. Other changes are in the noise.
Sep 18 2020
I know this has already been reverted but just FYI that I've bisected a ~2% regression in SPEC2017 x264_r on AArch64 to this commit. Presumably this is due to the extra unrolling / cost modelling issue already mentioned?
Sep 17 2020
Fix for when there is no fp16 faddp + testing
LGTM, thanks for fixing this! Could you wait a day or two before committing to allow others to comment?
Sep 16 2020
Extend to f16, f32, f64 and i64
Rework to match faddp in AArch64 ISel lowering
Thanks for the feedback. I agree that ideally we'd be generating reduction intrinsics in IR and matching that in the backends. I don't think the pairwise add can be represented with the current intrinsics though: we'd need a <2 x float> variant, or a predicated version of the <4 x float> intrinsic to do this for strict FP math, I believe.
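For readers unfamiliar with why strict FP math is the sticking point: a pairwise (tree-shaped) reduction, which is what AArch64's faddp computes, can round differently from the strict in-order sum, so the two are not interchangeable under strict FP semantics. A minimal sketch for four lanes:

```c
/* Pairwise (tree) reduction over four lanes: (v0+v1) and (v2+v3)
 * are added first, then combined. This is the shape faddp gives. */
static float pairwise4(const float v[4]) {
    float lo = v[0] + v[1];
    float hi = v[2] + v[3];
    return lo + hi;
}

/* Strict in-order reduction: accumulate left to right. */
static float ordered4(const float v[4]) {
    return ((v[0] + v[1]) + v[2]) + v[3];
}
```

With inputs like `{1e8f, 1.0f, -1e8f, 1.0f}` the two associations round differently (the ordered sum keeps one of the 1.0 contributions, the pairwise sum loses both to absorption), which is why a dedicated `<2 x float>` or predicated reduction intrinsic would be needed to express the pairwise form faithfully.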