This is an archive of the discontinued LLVM Phabricator instance.

Differential D123512

[MachineCombiner]: Avoid including transient instructions in latency calculation
Needs Review · Public

Authored by malharJ on Apr 11 2022, 6:59 AM.

Details

Reviewers

fhahn
georges

Summary

The MachineCombiner pattern-matches machine instruction sequences and generates
a new instruction sequence that is hopefully more efficient.

Currently, when MachineCombiner computes the depth of the root of the new
(transformed) instruction sequence, the latency calculation includes the latency
of transient instructions (i.e. machine instructions such as COPY that will be
removed later during register allocation).

This seems incorrect: in some cases, such as those in the affected test files,
the depth of the new sequence comes out higher than that of the old sequence.
The critical path then appears longer, and the MachineCombiner rejects
the transform on efficiency grounds.

Also, MachineTraceMetrics::Ensemble::updateDepth(), which is used to calculate
the latency/depth of the old instruction sequence, excludes the latency of
transient instructions from its calculation.
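
To illustrate the calculation described in the summary, here is a minimal, self-contained sketch of a depth computation in which transient defining instructions contribute zero latency. It is a toy model, not the MachineCombiner code: `ToyInstr`, `rootDepth`, and the `SkipTransient` flag are hypothetical names, each instruction is assumed to have a single latency value, and operands are assumed to refer only to earlier instructions in the sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a machine instruction; this is not an LLVM type.
struct ToyInstr {
  bool IsTransient;                  // e.g. a COPY expected to disappear later
  unsigned Latency;                  // cycles until the result is available
  std::vector<std::size_t> Operands; // indices of earlier defining instructions
};

// Returns the depth of the last instruction in Seq (the "root").
// The depth of an instruction is the maximum, over its operands, of
// depth(def) + latency(def). When SkipTransient is true, a transient
// defining instruction contributes latency 0.
unsigned rootDepth(const std::vector<ToyInstr> &Seq, bool SkipTransient) {
  std::vector<unsigned> Depth;
  for (const ToyInstr &I : Seq) {
    unsigned IDepth = 0;
    for (std::size_t Def : I.Operands) {
      assert(Def < Depth.size() && "operand must be defined earlier");
      unsigned LatencyOp = 0;
      if (!(SkipTransient && Seq[Def].IsTransient))
        LatencyOp = Seq[Def].Latency;
      IDepth = std::max(IDepth, Depth[Def] + LatencyOp);
    }
    Depth.push_back(IDepth);
  }
  return Depth.empty() ? 0 : Depth.back();
}
```

For example, with a transient COPY (made-up latency 1) feeding a NEG (made-up latency 2) feeding an MLA, `rootDepth` returns 3 when `SkipTransient` is false and 2 when it is true, so the new sequence looks one cycle shallower once the COPY is ignored.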

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

malharJ created this revision. Apr 11 2022, 6:59 AM

Herald added a project: Restricted Project. Apr 11 2022, 6:59 AM

Herald added a subscriber: hiraditya.

malharJ requested review of this revision. Apr 11 2022, 6:59 AM

Herald added a project: Restricted Project. Apr 11 2022, 6:59 AM

Herald added a subscriber: llvm-commits.

malharJ added reviewers: fhahn, georges. Apr 11 2022, 7:00 AM

Harbormaster completed remote builds in B159006: Diff 421913. Apr 11 2022, 7:47 AM

fhahn added inline comments. Apr 11 2022, 7:54 AM

llvm/test/CodeGen/AArch64/neon-mla-mls.ll
Line 141: Is this actually profitable compared to the original code? Shouldn't this use `mls`?

georges added inline comments. Apr 11 2022, 9:29 AM

llvm/test/CodeGen/AArch64/neon-mla-mls.ll
Line 141: I don't think you can use `mls` here since the negation is on the input accumulator rather than the multiplicand. Apart from that, I agree with @fhahn that this doesn't look obviously profitable. `fmov`/`mov` are not zero-latency on a lot of micro-architectures. The equivalent case without the `fmov` (e.g. accumulating into `A` rather than `C` so it ends up in `v0` naturally) could be believably faster due to the late accumulator operand forwarding present on many micro-architectures, and at worst shouldn't be worse than the existing code. I'm not sure if that already happened prior to this change though.
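
The point about `mls` can be checked with a small scalar model of one 8-bit lane. This is only a sketch under stated assumptions: the function names are hypothetical, and the operand roles follow the usual AArch64 definitions, where `mla` computes accumulator + (a * b) and `mls` computes accumulator - (a * b), both with wraparound.

```cpp
#include <cassert>
#include <cstdint>

// One 8-bit lane of each sequence, modelled with wraparound arithmetic.
// These are illustrative scalar models, not compiler intrinsics.

// Original code in the test: mul, then sub, giving A*B - C.
std::uint8_t mulThenSub(std::uint8_t A, std::uint8_t B, std::uint8_t C) {
  return static_cast<std::uint8_t>(A * B - C);
}

// Transformed code: neg on the accumulator, then mla, giving (-C) + A*B.
std::uint8_t negThenMla(std::uint8_t A, std::uint8_t B, std::uint8_t C) {
  std::uint8_t Acc = static_cast<std::uint8_t>(-C);
  return static_cast<std::uint8_t>(Acc + A * B);
}

// A single mls with C as the accumulator would give C - A*B instead.
std::uint8_t mlsLane(std::uint8_t A, std::uint8_t B, std::uint8_t C) {
  return static_cast<std::uint8_t>(C - A * B);
}
```

With A = 3, B = 5, C = 7, both `mulThenSub` and `negThenMla` give 8, while `mlsLane` gives 248 (that is, -8 modulo 256), the negation of the wanted result.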

malharJ added inline comments. Apr 11 2022, 11:02 AM

llvm/test/CodeGen/AArch64/neon-mla-mls.ll
Line 141: I understand the concern about the additional fmov/mov, but as @georges already pointed out, this is more related to the calling convention of returning the result in r0, and is unrelated to my patch, which just fixes the latency/depth calculation. I am happy to interchange the order of %A and %C, as that eliminates the fmov/mov, if that's what is required for this patch. Also, independent of the above, I think the test can be renamed to use "mla" (or maybe "negmla") instead of "mls".

Are you sure this code is doing what it is intended to be doing? If it is adding the latencies (depths) between operands of inserted instructions and their defs, and we say that the latency of transient instructions is ignored, then what adds the latency between the last instruction and whatever will end up using it?

The scheduling info this is using on its own doesn't suggest this should be changing mul;sub to neg;mla: https://godbolt.org/z/G1WTW5P9x

mgabka added a subscriber: mgabka. Apr 13 2022, 1:40 AM

dmgreen mentioned this in D124564: [MachineCombiner, AArch64] Add a new pattern A-(B+C) => (A-B)-C to reduce latency. May 12 2022, 1:11 AM

Is this by any chance fixed by D129615?

Herald added a subscriber: StephenFan. Jan 9 2023, 9:38 AM

Matt added a subscriber: Matt. Aug 24 2023, 11:42 AM

Revision Contents

Path                                                       Size
llvm/lib/CodeGen/MachineCombiner.cpp                       12 lines
llvm/test/CodeGen/AArch64/aarch64-combine-fmul-fsub.mir    4 lines
llvm/test/CodeGen/AArch64/neon-mla-mls.ll                  30 lines

Diff 421913

llvm/lib/CodeGen/MachineCombiner.cpp

     for (const MachineOperand &MO : InstrPtr->operands()) {
[...]
         // Operand is new virtual register not in trace
         assert(II->second < InstrDepth.size() && "Bad Index");
         MachineInstr *DefInstr = InsInstrs[II->second];
         assert(DefInstr &&
                "There must be a definition for a new virtual register");
         DepthOp = InstrDepth[II->second];
         int DefIdx = DefInstr->findRegisterDefOperandIdx(MO.getReg());
         int UseIdx = InstrPtr->findRegisterUseOperandIdx(MO.getReg());
+        // Add latency if DefInstr is a real instruction. Transients get latency 0.
+        if (!DefInstr->isTransient())
           LatencyOp = TSchedModel.computeOperandLatency(DefInstr, DefIdx,
                                                         InstrPtr, UseIdx);
       } else {
         MachineInstr *DefInstr = getOperandDef(MO);
         if (DefInstr) {
           DepthOp = BlockTrace.getInstrCycles(*DefInstr).Depth;
+          // Add latency if DefInstr is a real instruction. Transients get latency 0.
+          if (!DefInstr->isTransient())
             LatencyOp = TSchedModel.computeOperandLatency(
                 DefInstr, DefInstr->findRegisterDefOperandIdx(MO.getReg()),
                 InstrPtr, InstrPtr->findRegisterUseOperandIdx(MO.getReg()));
         }
       }
       IDepth = std::max(IDepth, DepthOp + LatencyOp);
     }
     InstrDepth.push_back(IDepth);
   }
   unsigned NewRootIdx = InsInstrs.size() - 1;
   return InstrDepth[NewRootIdx];
[...]

llvm/test/CodeGen/AArch64/aarch64-combine-fmul-fsub.mir

   bb.0.entry:
[...]
     %1:fpr64 = COPY $d1
     %0:fpr64 = COPY $d0
     %3:fpr64 = FMULv2f32 %0, %1
     %4:fpr64 = FSUBv2f32 killed %3, %2
     $d0 = COPY %4
     RET_ReallyLR implicit $d0

 ...
-# UNPROFITABLE-LABEL: name: f1_2s
-# UNPROFITABLE: %3:fpr64 = FMULv2f32 %0, %1
-# UNPROFITABLE-NEXT: FSUBv2f32 killed %3, %2
-#
 # PROFITABLE-LABEL: name: f1_2s
 # PROFITABLE: [[R1:%[0-9]+]]:fpr64 = FNEGv2f32 %2
 # PROFITABLE-NEXT: FMLAv2f32 killed [[R1]], %0, %1
 ---
 name: f1_4s
 registers:
   - { id: 0, class: fpr128 }
   - { id: 1, class: fpr128 }
[...]

llvm/test/CodeGen/AArch64/neon-mla-mls.ll

 ; CHECK-NEXT: ret
[...]
   %tmp2 = sub <4 x i32> %C, %tmp1;
   ret <4 x i32> %tmp2
 }


 define <8 x i8> @mls2v8xi8(<8 x i8> %A, <8 x i8> %B, <8 x i8> %C) {
 ; CHECK-LABEL: mls2v8xi8:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: mul v0.8b, v0.8b, v1.8b
-; CHECK-NEXT: sub v0.8b, v0.8b, v2.8b
+; CHECK-NEXT: neg v2.8b, v2.8b
+; CHECK-NEXT: mla v2.8b, v0.8b, v1.8b
+; CHECK-NEXT: fmov d0, d2
 ; CHECK-NEXT: ret
   %tmp1 = mul <8 x i8> %A, %B;
   %tmp2 = sub <8 x i8> %tmp1, %C;
   ret <8 x i8> %tmp2
 }

 define <16 x i8> @mls2v16xi8(<16 x i8> %A, <16 x i8> %B, <16 x i8> %C) {
 ; CHECK-LABEL: mls2v16xi8:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: mul v0.16b, v0.16b, v1.16b
-; CHECK-NEXT: sub v0.16b, v0.16b, v2.16b
+; CHECK-NEXT: neg v2.16b, v2.16b
+; CHECK-NEXT: mla v2.16b, v0.16b, v1.16b
+; CHECK-NEXT: mov v0.16b, v2.16b
 ; CHECK-NEXT: ret
   %tmp1 = mul <16 x i8> %A, %B;
   %tmp2 = sub <16 x i8> %tmp1, %C;
   ret <16 x i8> %tmp2
 }

 define <4 x i16> @mls2v4xi16(<4 x i16> %A, <4 x i16> %B, <4 x i16> %C) {
 ; CHECK-LABEL: mls2v4xi16:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: mul v0.4h, v0.4h, v1.4h
-; CHECK-NEXT: sub v0.4h, v0.4h, v2.4h
+; CHECK-NEXT: neg v2.4h, v2.4h
+; CHECK-NEXT: mla v2.4h, v0.4h, v1.4h
+; CHECK-NEXT: fmov d0, d2
 ; CHECK-NEXT: ret
   %tmp1 = mul <4 x i16> %A, %B;
   %tmp2 = sub <4 x i16> %tmp1, %C;
   ret <4 x i16> %tmp2
 }

 define <8 x i16> @mls2v8xi16(<8 x i16> %A, <8 x i16> %B, <8 x i16> %C) {
 ; CHECK-LABEL: mls2v8xi16:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: mul v0.8h, v0.8h, v1.8h
-; CHECK-NEXT: sub v0.8h, v0.8h, v2.8h
+; CHECK-NEXT: neg v2.8h, v2.8h
+; CHECK-NEXT: mla v2.8h, v0.8h, v1.8h
+; CHECK-NEXT: mov v0.16b, v2.16b
 ; CHECK-NEXT: ret
   %tmp1 = mul <8 x i16> %A, %B;
   %tmp2 = sub <8 x i16> %tmp1, %C;
   ret <8 x i16> %tmp2
 }

 define <2 x i32> @mls2v2xi32(<2 x i32> %A, <2 x i32> %B, <2 x i32> %C) {
 ; CHECK-LABEL: mls2v2xi32:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: mul v0.2s, v0.2s, v1.2s
-; CHECK-NEXT: sub v0.2s, v0.2s, v2.2s
+; CHECK-NEXT: neg v2.2s, v2.2s
+; CHECK-NEXT: mla v2.2s, v0.2s, v1.2s
+; CHECK-NEXT: fmov d0, d2
 ; CHECK-NEXT: ret
   %tmp1 = mul <2 x i32> %A, %B;
   %tmp2 = sub <2 x i32> %tmp1, %C;
   ret <2 x i32> %tmp2
 }

 define <4 x i32> @mls2v4xi32(<4 x i32> %A, <4 x i32> %B, <4 x i32> %C) {
 ; CHECK-LABEL: mls2v4xi32:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: mul v0.4s, v0.4s, v1.4s
-; CHECK-NEXT: sub v0.4s, v0.4s, v2.4s
+; CHECK-NEXT: neg v2.4s, v2.4s
+; CHECK-NEXT: mla v2.4s, v0.4s, v1.4s
+; CHECK-NEXT: mov v0.16b, v2.16b
 ; CHECK-NEXT: ret
   %tmp1 = mul <4 x i32> %A, %B;
   %tmp2 = sub <4 x i32> %tmp1, %C;
   ret <4 x i32> %tmp2
 }