This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Update latencies for Cortex-A55 schedule.
Accepted · Public

Authored by dmgreen on Jul 10 2022, 10:03 AM.

Details

Summary

The Cortex-A55 schedule currently attempts to model a lot of the effective latencies by marking most integer instructions as having a latency of 3, and then adding forwarding latencies between classes of instructions. When this works it does OK, but it is very easy to either get the effective latencies wrong or be tripped up by instructions, such as pseudo instructions, that knock the latency back to 3 without considering forwarding. That in turn can lead to suboptimal scheduling decisions. This patch simplifies that by just setting the latencies more directly, lining them up with the values from the Software Optimization Guide. In reality the core is more sophisticated than either scheme.
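As a rough illustration of the difference between the two schemes, here is a minimal Python sketch. It is not taken from the patch or from the LLVM scheduling model: the class names, the `READ_ADVANCE` table, and every number other than the base latency of 3 are hypothetical, chosen only to show how a consumer without a forwarding entry falls back to the full latency.

```python
# Toy model of the two latency schemes described above.
# All names and numbers are illustrative assumptions, except the base
# latency of 3 that the summary mentions for the old schedule.

OLD_BASE_LATENCY = 3

# Hypothetical forwarding table: consumer class -> cycles saved when the
# value is forwarded from an integer producer.
READ_ADVANCE = {"alu": 2, "shift": 1}

# Hypothetical direct latencies per producer class (new scheme).
DIRECT_LATENCY = {"alu": 1, "shift": 2}


def old_effective_latency(consumer: str) -> int:
    """Old scheme: base latency minus whatever forwarding the consumer declares.

    A consumer with no forwarding entry (for example a pseudo instruction
    that nobody annotated) sees the full base latency.
    """
    return max(OLD_BASE_LATENCY - READ_ADVANCE.get(consumer, 0), 0)


def new_latency(producer: str) -> int:
    """New scheme: the latency is a property of the producer alone."""
    return DIRECT_LATENCY[producer]


if __name__ == "__main__":
    for consumer in ("alu", "shift", "pseudo"):
        print(consumer, old_effective_latency(consumer))
    print("direct alu latency:", new_latency("alu"))
```

In this toy setup the unannotated `pseudo` consumer is charged the full 3 cycles under the old scheme, while the direct scheme gives the same answer regardless of which instruction consumes the result.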

As expected for the AArch64 default schedule, this alters quite a lot of codegen. Almost all of the tests are the same instructions in a slightly different order. The ones with interesting differences are:

  • andorbrcompare.ll - Now chooses ccmp vs branch, due to some bad use of latencies in the AArch64ConditionalCompares pass.
  • arm64-ldp.ll - Now uses more postinc. Yay.
  • arm64-neon-mul-div.ll - Has slightly more spills. Boo.
  • Some other changes like this, where there are slightly fewer or more instructions.
  • neon-mla-mls.ll - Chooses mul;sub as opposed to neg;mla. I believe this is generally better, and the differences are coming from better consideration of COPYs.
  • llvm-mca tests no longer show instructions taking three cycles.

Across all the measurements I've been collecting, performance is on average between flat and a slight increase, depending on the core and the benchmarks being run. The knock-on effects from different instruction ordering can make any individual test better or worse, but across a range of benchmarks they tend to roughly average one another out.

Event Timeline

dmgreen created this revision. Jul 10 2022, 10:03 AM
dmgreen requested review of this revision. Jul 10 2022, 10:03 AM
Herald added a project: Restricted Project. Jul 10 2022, 10:03 AM
SjoerdMeijer accepted this revision. Jul 11 2022, 1:00 AM

I think you just love updating a lot of tests... ;-)

More seriously, the change makes sense because, like you said, the latencies line up better with the SWOG.
But since perf is flat or slightly better, I was just curious about your motivation for doing this. I.e., is the slight perf increase worth it, or is this an enabler for other things?

This revision is now accepted and ready to land. Jul 11 2022, 1:00 AM

Thanks. The latest issue we ran into was from the machine combiner cost model in D125588, but we've been noticing the increased latencies causing issues here and there for a while now. I'm hoping that with more normal values it can generally make better decisions.

But I may wait until after the branch point to commit this, as it does change a lot of codegen and we are not very far away now.

Matt added a subscriber: Matt. Jul 13 2022, 4:15 PM

Cheers, sounds good.