This is an archive of the discontinued LLVM Phabricator instance.

Differential D152827

[AArch64] Correctly determine if {ADD,SUB}{W,X}rs instructions are cheap
ClosedPublic

Authored by chill on Jun 13 2023, 9:23 AM.

Details

Reviewers

efriedma
dmgreen

Commits

rG0eb0a65d0f9c: [AArch64] Correctly determine if {ADD,SUB}{W,X}rs instructions are cheap

Summary

These are marked to be "as cheap as a move".

According to publicly available Software Optimization Guides, they
have one cycle latency and maximum throughput only on some
microarchitectures, only for LSL and only for some shift amounts.

This patch uses the subtarget feature FeatureLSLFast to determine
how cheap the instructions are. As a consequence, each subtarget
with FeatureLSLFast now also has FeatureCustomCheapAsMoveHandling added.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

chill created this revision. Jun 13 2023, 9:23 AM

Herald added a project: Restricted Project. Jun 13 2023, 9:23 AM

Herald added subscribers: hiraditya, kristof.beyls.

chill requested review of this revision. Jun 13 2023, 9:23 AM

Herald added a subscriber: llvm-commits.

chill added a parent revision: D143897: [CodeGenPrepare] Estimate liveness of loop invariants when checking for address folding profitability. Jun 13 2023, 9:23 AM

chill added a child revision: D152828: [MachineSink][AArch64] Sink instruction copies when they can replace copy into hard register or folded into addressing mode. Jun 13 2023, 9:36 AM

chill added reviewers: efriedma, dmgreen. Jun 13 2023, 9:49 AM

Harbormaster completed remote builds in B238516: Diff 530939. Jun 13 2023, 10:56 AM

Adding handling for ADDWrs with LSLFast sounds good to me. I'm not so sure about this CustomCheapAsMoveHandling though. Ignoring the fact that isCheapAsMove is a bit of a strange concept nowadays, it seems to be missing some of the instructions usually marked as isAsCheapAsAMove (things like COPY and IMPLICIT_DEF, as well as target nodes like FMOV's and MOVI's). I would also like it if either all cpus used hasCustomCheapAsMoveHandling (especially for the testing) or if they were at least closer in terms of functionality.

Do we think that isAsCheapAsAMove should be all instructions that are really as cheap as a mov? (As in, zero latency move instructions, handled in rename). Or should it be all instructions that are single cycle, like ADDs and ORRs, etc?

Do you have any performance results? And can you add a test.

In D152827#4434233, @dmgreen wrote:

Adding handling for ADDWrs with LSLFast sounds good to me. I'm not so sure about this CustomCheapAsMoveHandling though. Ignoring the fact that isCheapAsMove is a bit of a strange concept nowadays, it seems to be missing some of the instructions usually marked as isAsCheapAsAMove (things like COPY and IMPLICIT_DEF, as well as target nodes like FMOV's and MOVI's). I would also like it if either all cpus used hasCustomCheapAsMoveHandling (especially for the testing) or if they were at least closer in terms of functionality.

TBH, I greatly dislike how AArch64InstrInfo::isAsCheapAsAMove looks right now.

It would be much nicer and more streamlined if it were organised like:

if (Exynos things) {
  return doExynosThings();
}

switch (Opcode) {
  default: return MI.isAsCheapAsAMove(); // fall back to the default instead of `false`

  // Opcode-specific processing, look at target features
  case Opc1:
    ...
  case Opc2:
    ...
}

It'll handle COPY and IMPLICIT_DEF, and there will be no need for FeatureCustomCheapAsMoveHandling.

Do we think that isAsCheapAsAMove should be all instructions that are really as cheap as a mov? (As in, zero latency move instructions, handled in rename). Or should it be all instructions that are single cycle, like ADDs and ORRs, etc?

From what I can tell from looking at the source, it's used as a heuristic guiding re-materialisation (?), so it's usually in the context of shortening/not extending live ranges and trying to avoid spills. In that sense a one-cycle ADD could be more expensive on paper than a zero-latency MOV, but in practice I'm not sure MOV's advantage materialises that often.

Do you have any performance results? And can you add a test.

TBD

Split a part into D154722

Harbormaster completed remote builds in B243783: Diff 538151. Jul 7 2023, 8:18 AM

chill edited parent revisions, added: D154722: [AArch64] Refactor AArch64InstrInfo::isAsCheapAsAMove (NFC); removed: D143897: [CodeGenPrepare] Estimate liveness of loop invariants when checking for address folding profitability. Jul 7 2023, 8:19 AM

In D152827#4476856, @chill wrote:
In D152827#4434233, @dmgreen wrote:

Adding handling for ADDWrs with LSLFast sounds good to me. I'm not so sure about this CustomCheapAsMoveHandling though. Ignoring the fact that isCheapAsMove is a bit of a strange concept nowadays, it seems to be missing some of the instructions usually marked as isAsCheapAsAMove (things like COPY and IMPLICIT_DEF, as well as target nodes like FMOV's and MOVI's). I would also like it if either all cpus used hasCustomCheapAsMoveHandling (especially for the testing) or if they were at least closer in terms of functionality.

TBH, I greatly dislike how AArch64InstrInfo::isAsCheapAsAMove looks right now.

It would be much nicer and more streamlined if it were organised like:

if (Exynos things) {
  return doExynosThings();
}

switch (Opcode) {
  default: return MI.isAsCheapAsAMove(); // fall back to the default instead of `false`

  // Opcode-specific processing, look at target features
  case Opc1:
    ...
  case Opc2:
    ...
}

It'll handle COPY and IMPLICIT_DEF, and there will be no need for FeatureCustomCheapAsMoveHandling.

Yeah that sounds like a good idea. We can be more precise and have less unnecessary alternative code paths. I'll take a look.

Do we think that isAsCheapAsAMove should be all instructions that are really as cheap as a mov? (As in, zero latency move instructions, handled in rename). Or should it be all instructions that are single cycle, like ADDs and ORRs, etc?

From what I can tell from looking at the source, it's used as a heuristic guiding re-materialisation (?), so it's usually in the context of shortening/not extending live ranges and trying to avoid spills. In that sense a one-cycle ADD could be more expensive on paper than a zero-latency MOV, but in practice I'm not sure MOV's advantage materialises that often.

That sounds OK to me. We can treat it as cheap instructions that are roughly cost=1, and hopefully the benchmarks will agree.

BTW, we found recently that LSLFast may want to be split into two different features. One for whether the add with shift will be cheap and another for the addressing modes. That shouldn't alter much here though, it just might be split out in the future.

LGTM. Thanks.

This revision is now accepted and ready to land. Jul 27 2023, 12:52 AM

chill added a child revision: D157116: [AArch64] Pre-commit some tests for D152828 (NFC). Aug 4 2023, 9:34 AM

chill removed a child revision: D152828: [MachineSink][AArch64] Sink instruction copies when they can replace copy into hard register or folded into addressing mode.

chill updated this revision to Diff 556263. Sep 8 2023, 8:13 AM

Harbormaster completed remote builds in B256861: Diff 556263. Sep 8 2023, 10:11 AM

This revision was landed with ongoing or failed builds. Sep 21 2023, 10:48 AM

Closed by commit rG0eb0a65d0f9c: [AArch64] Correctly determine if {ADD,SUB}{W,X}rs instructions are cheap (authored by chill).

This revision was automatically updated to reflect the committed changes.

chill added a commit: rG0eb0a65d0f9c: [AArch64] Correctly determine if {ADD,SUB}{W,X}rs instructions are cheap.

Revision Contents

Path                                                             Size
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp                     7 lines
llvm/test/CodeGen/AArch64/addsub-shifted-reg-cheap-as-move.ll    132 lines

Diff 557189

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp

   if (Subtarget.hasExynosCheapAsMoveHandling()) {
     if (isExynosCheapAsMove(MI))
       return true;
     return MI.isAsCheapAsAMove();
   }

   switch (MI.getOpcode()) {
   default:
     return MI.isAsCheapAsAMove();

+  case AArch64::ADDWrs:
+  case AArch64::ADDXrs:
+  case AArch64::SUBWrs:
+  case AArch64::SUBXrs:
+    return Subtarget.hasALULSLFast() && MI.getOperand(3).getImm() <= 4;
+
   // If MOVi32imm or MOVi64imm can be expanded into ORRWri or
   // ORRXri, it is as cheap as MOV.
   // Likewise if it can be expanded to MOVZ/MOVN/MOVK.
   case AArch64::MOVi32imm:
     return isCheapImmediate(MI, 32);
   case AArch64::MOVi64imm:
     return isCheapImmediate(MI, 64);
   }
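
The check `MI.getOperand(3).getImm() <= 4` reads like a plain shift-amount test, but operand 3 of the `rs` forms holds a packed shifter immediate. The sketch below assumes the packing that, as far as I understand, the AArch64 backend uses: the shift kind sits in the bits above a six-bit amount, and LSL is encoded as 0. Under that assumption any non-LSL shift encodes to at least 64, so the single comparison rejects other shift kinds as well as LSL amounts above 4. `ShiftKind`, `encodeShifterImm` and `isCheapShiftedAddSub` are hypothetical names used for illustration only, not LLVM APIs.

```cpp
#include <cassert>

// Shift kinds in the order this sketch assumes they are encoded
// (an assumption; the enum is local to this example).
enum class ShiftKind : unsigned { LSL = 0, LSR = 1, ASR = 2, ROR = 3 };

// Pack shift kind and amount the way the shifted-register operand is
// assumed to be stored: kind above bit 5, amount in bits [5:0].
unsigned encodeShifterImm(ShiftKind Kind, unsigned Amount) {
  assert(Amount < 64 && "shift amount must fit in six bits");
  return (static_cast<unsigned>(Kind) << 6) | Amount;
}

// Hypothetical stand-in for the new case in isAsCheapAsAMove: cheap only
// when the subtarget has the fast-LSL feature and the packed immediate
// is at most 4, which implies LSL with an amount of 0..4.
bool isCheapShiftedAddSub(bool HasALULSLFast, unsigned ShifterImm) {
  return HasALULSLFast && ShifterImm <= 4;
}
```

With this model, `lsl #4` on a fast-LSL subtarget is cheap, while `lsl #5`, any `lsr`, and any shift on a subtarget without the feature are not.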

llvm/test/CodeGen/AArch64/addsub-shifted-reg-cheap-as-move.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
				; RUN: llc < %s -o - | FileCheck %s
				; RUN: llc -mattr=+alu-lsl-fast < %s -o - | FileCheck %s -check-prefix=LSLFAST
				target triple = "aarch64-linux"

				declare void @g(...)

				; Check that ADDWrs/ADDXrs with shift > 4 is considered relatively
				; slow, thus CSE-d.
				define void @f0(i1 %c0, i1 %c1, ptr %a, i64 %i) {
				; CHECK-LABEL: f0:
				; CHECK: // %bb.0: // %E
				; CHECK-NEXT: tbz w0, #0, .LBB0_5
				; CHECK-NEXT: // %bb.1: // %A
				; CHECK-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: .cfi_offset w30, -16
				; CHECK-NEXT: add x0, x2, x3, lsl #5
				; CHECK-NEXT: tbz w1, #0, .LBB0_3
				; CHECK-NEXT: // %bb.2: // %B
				; CHECK-NEXT: bl g
				; CHECK-NEXT: b .LBB0_4
				; CHECK-NEXT: .LBB0_3: // %C
				; CHECK-NEXT: mov x1, x0
				; CHECK-NEXT: bl g
				; CHECK-NEXT: .LBB0_4:
				; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: .LBB0_5: // %X
				; CHECK-NEXT: ret
				;
				; LSLFAST-LABEL: f0:
				; LSLFAST: // %bb.0: // %E
				; LSLFAST-NEXT: tbz w0, #0, .LBB0_5
				; LSLFAST-NEXT: // %bb.1: // %A
				; LSLFAST-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
				; LSLFAST-NEXT: .cfi_def_cfa_offset 16
				; LSLFAST-NEXT: .cfi_offset w30, -16
				; LSLFAST-NEXT: add x0, x2, x3, lsl #5
				; LSLFAST-NEXT: tbz w1, #0, .LBB0_3
				; LSLFAST-NEXT: // %bb.2: // %B
				; LSLFAST-NEXT: bl g
				; LSLFAST-NEXT: b .LBB0_4
				; LSLFAST-NEXT: .LBB0_3: // %C
				; LSLFAST-NEXT: mov x1, x0
				; LSLFAST-NEXT: bl g
				; LSLFAST-NEXT: .LBB0_4:
				; LSLFAST-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
				; LSLFAST-NEXT: .LBB0_5: // %X
				; LSLFAST-NEXT: ret
				E:
				%p0 = getelementptr {i64, i64, i64, i64}, ptr %a, i64 %i
				br i1 %c0, label %A, label %X

				A:
				br i1 %c1, label %B, label %C

				B:
				call void @g(ptr %p0)
				br label %X

				C:
				%p1 = getelementptr {i64, i64, i64, i64}, ptr %a, i64 %i
				call void @g(ptr %p1, ptr %p0)
				br label %X

				X:
				ret void
				}

				; Check that ADDWrs/ADDXrs with shift <= 4 is considered relatively fast on sub-targets
				; with feature +alu-lsl-fast, thus not CSE-d.
				define void @f1(i1 %c0, i1 %c1, ptr %a, i64 %i) {
				; CHECK-LABEL: f1:
				; CHECK: // %bb.0: // %E
				; CHECK-NEXT: tbz w0, #0, .LBB1_5
				; CHECK-NEXT: // %bb.1: // %A
				; CHECK-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: .cfi_offset w30, -16
				; CHECK-NEXT: add x0, x2, x3, lsl #4
				; CHECK-NEXT: tbz w1, #0, .LBB1_3
				; CHECK-NEXT: // %bb.2: // %B
				; CHECK-NEXT: bl g
				; CHECK-NEXT: b .LBB1_4
				; CHECK-NEXT: .LBB1_3: // %C
				; CHECK-NEXT: mov x1, x0
				; CHECK-NEXT: bl g
				; CHECK-NEXT: .LBB1_4:
				; CHECK-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: .LBB1_5: // %X
				; CHECK-NEXT: ret
				;
				; LSLFAST-LABEL: f1:
				; LSLFAST: // %bb.0: // %E
				; LSLFAST-NEXT: tbz w0, #0, .LBB1_5
				; LSLFAST-NEXT: // %bb.1: // %A
				; LSLFAST-NEXT: str x30, [sp, #-16]! // 8-byte Folded Spill
				; LSLFAST-NEXT: .cfi_def_cfa_offset 16
				; LSLFAST-NEXT: .cfi_offset w30, -16
				; LSLFAST-NEXT: add x8, x2, x3, lsl #4
				; LSLFAST-NEXT: tbz w1, #0, .LBB1_3
				; LSLFAST-NEXT: // %bb.2: // %B
				; LSLFAST-NEXT: mov x0, x8
				; LSLFAST-NEXT: bl g
				; LSLFAST-NEXT: b .LBB1_4
				; LSLFAST-NEXT: .LBB1_3: // %C
				; LSLFAST-NEXT: add x0, x2, x3, lsl #4
				; LSLFAST-NEXT: mov x1, x8
				; LSLFAST-NEXT: bl g
				; LSLFAST-NEXT: .LBB1_4:
				; LSLFAST-NEXT: ldr x30, [sp], #16 // 8-byte Folded Reload
				; LSLFAST-NEXT: .LBB1_5: // %X
				; LSLFAST-NEXT: ret
				E:
				%p0 = getelementptr {i64, i64}, ptr %a, i64 %i
				br i1 %c0, label %A, label %X

				A:
				br i1 %c1, label %B, label %C

				B:
				call void @g(ptr %p0)
				br label %X

				C:
				%p1 = getelementptr {i64, i64}, ptr %a, i64 %i
				call void @g(ptr %p1, ptr %p0)
				br label %X

				X:
				ret void
				}
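
The shift amounts in the two tests follow from the element sizes used in the `getelementptr` instructions: indexing `{i64, i64, i64, i64}` steps by 32 bytes, hence `lsl #5` in `f0`, and indexing `{i64, i64}` steps by 16 bytes, hence `lsl #4` in `f1`. A minimal C++ sketch of that arithmetic, assuming the usual 8-byte `int64_t` layout with no padding; `Quad`, `Pair`, `shiftForStride` and `shiftedAdd` are hypothetical names for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Quad { std::int64_t a, b, c, d; }; // mirrors {i64, i64, i64, i64}
struct Pair { std::int64_t a, b; };       // mirrors {i64, i64}

// Returns log2 of a power-of-two element size, i.e. the LSL amount
// needed to scale an index by that size.
unsigned shiftForStride(std::size_t Stride) {
  assert(Stride != 0 && (Stride & (Stride - 1)) == 0 &&
         "power-of-two stride expected");
  unsigned Shift = 0;
  while ((std::size_t{1} << Shift) < Stride)
    ++Shift;
  return Shift;
}

// Models `add xD, xBase, xIndex, lsl #Shift`.
std::uint64_t shiftedAdd(std::uint64_t Base, std::uint64_t Index,
                         unsigned Shift) {
  return Base + (Index << Shift);
}
```

So `f0` exercises a shift of 5, which falls outside the `<= 4` limit and produces identical output for both RUN lines, while `f1` exercises a shift of 4, where the `LSLFAST` output recomputes the `add` in block `C` instead of reusing the earlier result.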