The Cortex-A55 schedule currently attempts to model a lot of the effective latencies by marking most integer instructions as having a latency of 3, and then adding forwarding latencies between classes of instructions. When this works it does OK, but it is very easy to either get the effective latencies wrong or be tripped up by instructions, such as pseudo instructions, that knock the latency back to 3 without considering forwarding. That in turn can make the resulting scheduling decisions suboptimal. This patch simplifies that by setting the latencies more directly, lining them up with the values from the Software Optimization Guide. In reality the core is more sophisticated than either scheme.
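As a rough illustration of the difference between the two schemes, here is a small Python sketch. It is not the scheduling model itself: the class names (`alu`, `pseudo`), the forwarding table and the cycle counts other than the blanket latency of 3 are all made up for the example, and the clamping to zero is an assumption about how a forwarding adjustment is applied.

```python
# Illustrative only: class names and cycle counts are hypothetical, not
# taken from the Cortex-A55 model or the Software Optimization Guide.

def effective_latency(write_latency, read_advance=0):
    """Latency a consumer sees once forwarding is taken into account."""
    return max(0, write_latency - read_advance)

# Old style: a blanket latency of 3, corrected per producer/consumer pair.
OLD_LATENCY = 3
old_forwarding = {("alu", "alu"): 2}  # hypothetical forwarding entry

def old_scheme(producer, consumer):
    # A pair with no forwarding entry silently keeps the full latency of 3.
    advance = old_forwarding.get((producer, consumer), 0)
    return effective_latency(OLD_LATENCY, advance)

# New style: each instruction class carries its latency directly.
new_latency = {"alu": 1}  # hypothetical value

def new_scheme(producer, consumer):
    return new_latency[producer]

print(old_scheme("alu", "alu"))     # forwarding entry applies
print(old_scheme("pseudo", "alu"))  # no entry, so the latency stays at 3
print(new_scheme("alu", "alu"))
```

The second call shows the failure mode described above: any producer that is missing from the forwarding table ends up with the blanket latency, whereas the direct scheme has no such fallback to get wrong.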
As expected for the AArch64 default schedule, this alters quite a lot of codegen. Almost all of the tests are the same instructions in a slightly different order. The ones with interesting differences are:
- andorbrcompare.ll - Now chooses ccmp vs branch, due to some bad use of latencies in the AArch64ConditionalCompares pass.
- arm64-ldp.ll - Now uses more postinc. Yay.
- arm64-neon-mul-div.ll - Has slightly more spills. Boo.
- Some other changes like this, where there are slightly fewer or more instructions.
- neon-mla-mls.ll - Chooses mul;sub as opposed to neg;mla. I believe this is generally better, and the differences are coming from better consideration of COPYs.
- llvm-mca tests no longer show instructions taking three cycles.
For all the measurements I've been collecting, the performance is on average between flat and a slight increase, depending on the core and the benchmarks being run. The knock-on effects from different instruction ordering can make any individual test better or worse, but across a range of benchmarks they tend to roughly average out.