This is an archive of the discontinued LLVM Phabricator instance.

AArch64: Disable the latency heuristic
ClosedPublic

Authored by MatzeB on Oct 13 2015, 2:59 PM.

Details

Summary

This patch disables the machine scheduler heuristic that attempts to balance scheduling between multiple long latency chains. In our benchmarks this heuristic tended to increase register pressure and occasionally led to spilling, but didn't appear to have a positive effect on any benchmark (it seems long latency chains are scarce in practice and out-of-order cores tend to handle them well).

The main question for review is whether I should guard the changes to AArch64SubTarget.cpp with an "if (isCyclone())" or whether the changes are fine on other AArch64 cores as well.
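To make the question concrete, the guarded variant could look roughly like the sketch below. This is a minimal, self-contained illustration and not the code from the diff: `MachineSchedPolicy`, `Subtarget`, `CPUKind` and `policyFor` are simplified stand-ins defined here, and the names `overrideSchedPolicy` and `DisableLatencyHeuristic` are assumptions modeled on LLVM's scheduler interface rather than taken from this revision.

```cpp
// Simplified stand-ins for illustration only; these are NOT the real LLVM
// classes, which carry many more fields and live in the LLVM source tree.
struct MachineSchedPolicy {
  bool DisableLatencyHeuristic = false; // false: latency heuristic stays on
};

// Hypothetical CPU list, limited to the cores mentioned in this review.
enum class CPUKind { Cyclone, CortexA53, CortexA57 };

class Subtarget {
  CPUKind CPU;

public:
  explicit Subtarget(CPUKind K) : CPU(K) {}

  bool isCyclone() const { return CPU == CPUKind::Cyclone; }

  // Assumed hook through which a target adjusts the scheduler defaults.
  void overrideSchedPolicy(MachineSchedPolicy &Policy) const {
    // Guarded variant: only Cyclone opts out of the latency heuristic.
    // The unguarded variant would set the flag unconditionally.
    if (isCyclone())
      Policy.DisableLatencyHeuristic = true;
  }
};

// Convenience helper: the policy a given CPU would end up with.
MachineSchedPolicy policyFor(CPUKind K) {
  MachineSchedPolicy Policy;
  Subtarget(K).overrideSchedPolicy(Policy);
  return Policy;
}
```

With the guard in place, `policyFor(CPUKind::Cyclone)` has the heuristic disabled, while the Cortex-A53 and Cortex-A57 policies keep the default.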

Diff Detail

Repository
rL LLVM

Event Timeline

MatzeB updated this revision to Diff 37289. Oct 13 2015, 2:59 PM
MatzeB retitled this revision to AArch64: Disable the latency heuristic.
MatzeB updated this object.
MatzeB added reviewers: jmolloy, rengolin, aadg.
MatzeB set the repository for this revision to rL LLVM.
MatzeB added a subscriber: llvm-commits.
jmolloy edited edge metadata. Oct 21 2015, 8:13 AM

Hi Matthias,

Sorry for taking so long to spot this on my backlog.

I've just run a bunch of benchmarks, and I can fairly conclusively say that your patch reduces performance overall on both Cortex-A57 and Cortex-A53 (although the two cores regress on disjoint sets of benchmarks, which is unusual).

I see a 43% regression in lnt.MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1 on Cortex-A57, and a 15% regression on lnt.SingleSource/Benchmarks/Misc/salsa20 on Cortex-A53 (there are many more regressions and some improvements - these are just the top).

So I think this should indeed be gated on Cyclone.

James

Thanks for benchmarking. TL;DR: I will change the policy only for Cyclone CPUs.

For the record: I re-checked my llvm-testsuite results, though I consider them less critical than the "big benchmarks" (spec*, geekbench*), for which the change is neutral on all but two benchmarks, which improve by ~10%.

Over the whole llvm-testsuite I see a bunch of ups and downs. Most of the ones in my top 20 are just noisy benchmarks. enc-pc1 regressed only 1% for me, which is consistent with my not seeing any important changes in the assembly. salsa20 shows a 7% regression, and it appears to be the first genuine test case I have seen where the latency heuristic makes sense: there is indeed a very long loop containing just arithmetic instructions, where latency hiding has an effect even on out-of-order CPUs. (This is somewhat unfortunate, because the source code was already scheduled in a nice way; LLVM performs some impressive load/store optimisations on the benchmark but also reorders the instructions somewhere, making the scheduling heuristic necessary.) On the other hand, I see an improvement of 15% in matmul_f64_4x4.

- Matthias
This revision was automatically updated to reflect the committed changes.