This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Correctly determine if {ADD,SUB}{W,X}rs instructions are cheap
ClosedPublic

Authored by chill on Jun 13 2023, 9:23 AM.

Details

Summary

These are marked to be "as cheap as a move".

According to publicly available Software Optimization Guides, they
have one cycle latency and maximum throughput only on some
microarchitectures, only for LSL and only for some shift amounts.

This patch uses the subtarget feature FeatureLSLFast to determine
how cheap the instructions are. As a consequence, each subtarget
with FeatureLSLFast now also has FeatureCustomCheapAsMoveHandling added.

Diff Detail

Event Timeline

chill created this revision.Jun 13 2023, 9:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 9:23 AM
chill requested review of this revision.Jun 13 2023, 9:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 9:23 AM

Adding handling for ADDWrs with LSLFast sounds good to me. I'm not so sure about this CustomCheapAsMoveHandling though. Ignoring the fact that isCheapAsMove is a bit of a strange concept nowadays, it seems to be missing some of the instructions usually marked as isAsCheapAsAMove (things like COPY and IMPLICIT_DEF, as well as target nodes like FMOV's and MOVI's). I would also like it if either all cpus used hasCustomCheapAsMoveHandling (especially for the testing) or if they were at least closer in terms of functionality.

Do we think that isAsCheapAsAMove should be all instructions that are really as cheap as a mov? (As in, zero latency move instructions, handled in rename). Or should it be all instructions that are single cycle, like ADDs and ORRs, etc?

Do you have any performance results? And can you add a test.

chill added a comment.Jul 6 2023, 5:18 AM

Adding handling for ADDWrs with LSLFast sounds good to me. I'm not so sure about this CustomCheapAsMoveHandling though. Ignoring the fact that isCheapAsMove is a bit of a strange concept nowadays, it seems to be missing some of the instructions usually marked as isAsCheapAsAMove (things like COPY and IMPLICIT_DEF, as well as target nodes like FMOV's and MOVI's). I would also like it if either all cpus used hasCustomCheapAsMoveHandling (especially for the testing) or if they were at least closer in terms of functionality.

TBH, I greatly dislike how AArch64InstrInfo::isAsCheapAsAMove looks right now.

It seems quite more nicer and streamlined if it was organised like:

if (Exynos things) {
  return doExynosThings();
}

switch (Opcode) {
  default: return  MI.isAsCheapAsMove(); // fallback to the default instead of `false`

// Opcode specific processing, look at target features
 case Opc1:
...
  case Opc1:
...
}

It'll handle COPY and IMPLICT_DEF and no need for FeatureCustomCheapAsMoveHandling.

Do we think that isAsCheapAsAMove should be all instructions that are really as cheap as a mov? (As in, zero latency move instructions, handled in rename). Or should it be all instructions that are single cycle, like ADDs and ORRs, etc?

What I can tell from looking at the source, it's used as a heuristic, guiding re-materialisation (?), so it's usually in the context of shortening/not-extending live ranges
and trying to avoid spills, in that sense one-cycle ADD could be more expensive on paper that a zero-latency MOV, but in practice I'm not sure MOV's advantage
materialises that often.

Do you have any performance results? And can you add a test.

TBD

chill updated this revision to Diff 538151.Jul 7 2023, 8:17 AM

Split a part into D154722

Adding handling for ADDWrs with LSLFast sounds good to me. I'm not so sure about this CustomCheapAsMoveHandling though. Ignoring the fact that isCheapAsMove is a bit of a strange concept nowadays, it seems to be missing some of the instructions usually marked as isAsCheapAsAMove (things like COPY and IMPLICIT_DEF, as well as target nodes like FMOV's and MOVI's). I would also like it if either all cpus used hasCustomCheapAsMoveHandling (especially for the testing) or if they were at least closer in terms of functionality.

TBH, I greatly dislike how AArch64InstrInfo::isAsCheapAsAMove looks right now.

It seems quite more nicer and streamlined if it was organised like:

if (Exynos things) {
  return doExynosThings();
}

switch (Opcode) {
  default: return  MI.isAsCheapAsMove(); // fallback to the default instead of `false`

// Opcode specific processing, look at target features
 case Opc1:
...
  case Opc1:
...
}

It'll handle COPY and IMPLICT_DEF and no need for FeatureCustomCheapAsMoveHandling.

Yeah that sounds like a good idea. We can be more precise and have less unnecessary alternative code paths. I'll take a look.

Do we think that isAsCheapAsAMove should be all instructions that are really as cheap as a mov? (As in, zero latency move instructions, handled in rename). Or should it be all instructions that are single cycle, like ADDs and ORRs, etc?

What I can tell from looking at the source, it's used as a heuristic, guiding re-materialisation (?), so it's usually in the context of shortening/not-extending live ranges
and trying to avoid spills, in that sense one-cycle ADD could be more expensive on paper that a zero-latency MOV, but in practice I'm not sure MOV's advantage
materialises that often.

That sounds OK to me. We can treat it as cheap instructions that are roughly cost=1, and hopefully the benchmarks will agree.

BTW, we found recently that LSLFast may want to be split into two different features. One for whether the add with shift will be cheap and another for the addressing modes. That shouldn't alter much here though, it just might be split out in the future.

dmgreen accepted this revision.Jul 27 2023, 12:52 AM

LGTM. Thanks.

This revision is now accepted and ready to land.Jul 27 2023, 12:52 AM
chill updated this revision to Diff 556263.Sep 8 2023, 8:13 AM
This revision was landed with ongoing or failed builds.Sep 21 2023, 10:48 AM
This revision was automatically updated to reflect the committed changes.