Details

Reviewers

spatel
dmgreen
efriedma

Commits

rG74c2d4f6024c: [AArch64][SelectionDAG] Lower multiplication by a constant to shl+add+shl+add
rG4a549be9c367: [AArch64] Lower multiplication by a negative constant to shl+sub+shl

Summary

Change the cost model to lower a = b * C, where C = (1 + 2^m) * (1 + 2^n), to

add   w8, w0, w0, lsl #m
add   w0, w8, w8, lsl #n

Note: The latency can vary depending on the shift amount.
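
As a sanity check of the identity behind this lowering, the following C++ sketch performs the same two shift-and-add steps on unsigned 32-bit values, so the result can be compared against a plain multiply. It is not part of the patch, and the function name mulByTwoShiftAdds is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Computes x * (1 + 2^m) * (1 + 2^n) with two shift-and-add steps, mirroring
// the two `add ..., lsl #k` instructions in the summary. Unsigned arithmetic
// is used so that overflow wraps around, as it does for `mul`.
// Assumes m and n are both less than 32.
uint32_t mulByTwoShiftAdds(uint32_t x, unsigned m, unsigned n) {
  uint32_t t = x + (x << m); // add w8, w0, w0, lsl #m  ->  x * (1 + 2^m)
  return t + (t << n);       // add w0, w8, w8, lsl #n  ->  t * (1 + 2^n)
}
```

For example, mulByTwoShiftAdds(x, 2, 3) equals x * 45u for any 32-bit x, because 45 = (1 + 4) * (1 + 8).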

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Allen created this revision. Oct 7 2022, 4:44 AM

Herald added a project: Restricted Project. Oct 7 2022, 4:44 AM

Herald added subscribers: ecnelises, hiraditya, kristof.beyls.

Allen requested review of this revision. Oct 7 2022, 4:44 AM

Herald added a project: Restricted Project. Oct 7 2022, 4:44 AM

Herald added a subscriber: llvm-commits.

Allen added a parent revision: D134934: [AArch64] Lower multiplication by a negative constant to shl+sub+shl. Oct 7 2022, 4:53 AM

Harbormaster completed remote builds in B190926: Diff 466040. Oct 7 2022, 5:53 AM

This is getting into the territory of actually being slower than a "mul", depending on the latency of "mul" and "add-with-shift" on the target CPU... we probably need CPU-specific modeling if you want to go this direction.

In D135441#3843605, @efriedma wrote:

This is getting into the territory of actually being slower than a "mul", depending on the latency of "mul" and "add-with-shift" on the target CPU... we probably need CPU-specific modeling if you want to go this direction.

Thanks. Since SelectionDAG doesn't include the scheduling model, should this be checked in the machine combiner instead?

MF.getSubtarget().getSchedModel() should work in SelectionDAG.

The tricky things here are:

Actually getting the right variant out of the scheduler is a little tricky. You basically have to construct an MCInst. (Note that the latency can vary depending on the shift amount.)
We don't actually have accurate scheduling models for all the chips we care about. It tends to be something that requires a lot of effort for very little effect, so a lot of CPUs just use the A57 model.

Maybe to start, just try to figure out which targets have "free" shifts, and turn on the optimization for the cases that involve those shifts? (Multiple cores have a small shift optimization, where left shifts of 4 or less don't increase the latency of an add.)

Out of interest, what cases do you have where mul is worse than add+shift + add+shift? From the look of the tv100 scheduling model it would seem to be 3/4 cycles for the mul (depending on whether it is i32 or i64) vs 2+2 for the add+shifts. Are small shifts really free, as in FeatureLSLFast?

In D135441#3852104, @dmgreen wrote:

Out of interest, what cases do you have where mul is worse than add+shift + add+shift? From the look of the tv100 scheduling model it would seem to be 3/4 cycles for the mul (depending on whether it is i32 or i64) vs 2+2 for the add+shifts. Are small shifts really free, as in FeatureLSLFast?

According to a new spec that I'm working with, the latency of add+shift is 1 when the shift amount is small. I also happened to see a TODO in the upstream code :)
Thanks for your detailed suggestion. I'll add a check for FeatureLSLFast, or distinguish more shift-related values.
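
To make the trade-off discussed above concrete, here is a toy latency comparison in C++. It is only a sketch: the struct, the function names, and the cycle counts are illustrative assumptions loosely based on the numbers quoted in this thread (about 3 cycles for a 32-bit mul, 2 for an add with a shifted operand, 1 when the shift is small). None of them are read from an LLVM scheduling model.

```cpp
// Hypothetical per-core latencies in cycles (assumed values, not from LLVM).
struct ToyLatencies {
  unsigned Mul;            // latency of a 32-bit mul
  unsigned AddShiftSlow;   // add with a shifted operand, general case
  unsigned AddShiftFast;   // add with a "small" shifted operand
  unsigned FastShiftLimit; // largest shift amount that counts as small
};

// Latency of a single add-with-shift for the given shift amount.
unsigned addShiftLatency(const ToyLatencies &L, unsigned Shift) {
  return Shift <= L.FastShiftLimit ? L.AddShiftFast : L.AddShiftSlow;
}

// The second add depends on the first, so the two latencies add up.
// The lowering is only a win if that sum is strictly below the mul latency.
bool twoAddsBeatMul(const ToyLatencies &L, unsigned ShiftM, unsigned ShiftN) {
  return addShiftLatency(L, ShiftM) + addShiftLatency(L, ShiftN) < L.Mul;
}
```

With the assumed values {3, 2, 1, 3}, shifts of 2 and 3 give 1 + 1 = 2 cycles and beat the mul, while shifts of 2 and 4 give 1 + 2 = 3 cycles and do not. On a core without fast shifts the pair costs 2 + 2 = 4 cycles, which is the case the first review comment warns about.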

khchen added a subscriber: khchen. Oct 12 2022, 7:39 AM

Add the condition Subtarget->hasLSLFast(), as suggested in the comment.

Harbormaster completed remote builds in B192340: Diff 468008. Oct 15 2022, 4:56 AM

dmgreen added inline comments. Oct 17 2022, 12:19 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145	-> TODO, shift. The documentation for LSLFast says that shifts <= 3 places are fast, which is the limit for most address offsets. Modern cores usually have free shifts <= 4 places. (They tend to have cheap multiplies too, if they can perform fast shifts.) I was considering putting a LSLFast4 option in when I recently enabled LSLFast for Arm cores, but as far as I understand the LSLFast option currently doesn't actually apply to Add instructions like it should. We can check that ShiftM1 and ShiftN1 are <= 3 here though, and maybe change the subtarget feature for shifts of 4?

Add condition ShiftM1 <= 3 && ShiftN1 <= 3 for LSLFast

Harbormaster completed remote builds in B192502: Diff 468208. Oct 17 2022, 8:08 AM

Allen marked an inline comment as done. Oct 17 2022, 8:08 AM

Allen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145	Apply your comment, thanks

Can you add some tests for multiplying by larger values? Maybe 165 and 297.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145	You can remove the comment now then.
15151	Can this be moved into the `if..`

Add 3 new test cases

Harbormaster completed remote builds in B192750: Diff 468544. Oct 18 2022, 7:59 AM

Allen marked an inline comment as done. Oct 18 2022, 8:00 AM

Allen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145	Done
15151	Done, thanks

Retry the commit, as the pre-merge checks failed unexpectedly.

Harbormaster completed remote builds in B193155: Diff 469122. Oct 19 2022, 11:34 PM

Allen removed a parent revision: D134934: [AArch64] Lower multiplication by a negative constant to shl+sub+shl. Oct 20 2022, 12:56 AM

Allen added a commit: rG4a549be9c367: [AArch64] Lower multiplication by a negative constant to shl+sub+shl. Oct 20 2022, 12:58 AM

Any new suggestions about the last update?

LGTM, thanks for adding the tests.

This revision is now accepted and ready to land. Oct 20 2022, 9:24 AM

Closed by commit rG74c2d4f6024c: [AArch64][SelectionDAG] Lower multiplication by a constant to shl+add+shl+add (authored by Allen). Oct 20 2022, 9:37 AM

This revision was automatically updated to reflect the committed changes.

Allen added a commit: rG74c2d4f6024c: [AArch64][SelectionDAG] Lower multiplication by a constant to shl+add+shl+add.

Diff 469262

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[... 15,059 lines hidden ...] static SDValue performMulCombine(SDNode *N, SelectionDAG &DAG,

   // Multiplication of a power of two plus/minus one can be done more
   // cheaply as as shift+add/sub. For now, this is true unilaterally. If
   // future CPUs have a cheaper MADD instruction, this may need to be
   // gated on a subtarget feature. For Cyclone, 32-bit MADD is 4 cycles and
   // 64-bit is 5 cycles, so this is always a win.
   // More aggressively, some multiplications N0 * C can be lowered to
   // shift+add+shift if the constant C = A * B where A = 2^N + 1 and B = 2^M,
-  // e.g. 6=3*2=(2+1)*2.
-  // TODO: lower more cases, e.g. C = 45 which equals to (1+2)*16-(1+2).
+  // e.g. 6=3*2=(2+1)*2, 45=(1+4)*(1+8)
+  // TODO: lower more cases.

   // TrailingZeroes is used to test if the mul can be lowered to
   // shift+add+shift.
   unsigned TrailingZeroes = ConstValue.countTrailingZeros();
   if (TrailingZeroes) {
     // Conservatively do not lower to shift+add+shift if the mul might be
     // folded into smul or umul.
     if (N0->hasOneUse() && (isSignExtended(N0.getNode(), DAG) ||
[... 20 lines hidden ...] static SDValue performMulCombine(SDNode *N, SelectionDAG &DAG,
   auto Sub = [&](SDValue N0, SDValue N1) {
     return DAG.getNode(ISD::SUB, DL, VT, N0, N1);
   };
   auto Negate = [&](SDValue N) {
     SDValue Zero = DAG.getConstant(0, DL, VT);
     return DAG.getNode(ISD::SUB, DL, VT, Zero, N);
   };

+  // Can the const C be decomposed into (1+2^M1)*(1+2^N1), eg:
+  // C = 45 is equal to (1+4)*(1+8), we don't decompose it into (1+2)*(16-1) as
+  // the (2^N - 1) can't be execused via a single instruction.
+  auto isPowPlusPlusConst = [](APInt C, APInt &M, APInt &N) {
+    unsigned BitWidth = C.getBitWidth();
+    for (unsigned i = 1; i < BitWidth / 2; i++) {
+      APInt Rem;
+      APInt X(BitWidth, (1 << i) + 1);
+      APInt::sdivrem(C, X, N, Rem);
+      APInt NVMinus1 = N - 1;
+      if (Rem == 0 && NVMinus1.isPowerOf2()) {
+        M = X;
+        return true;
+      }
+    }
+    return false;
+  };
+
   if (ConstValue.isNonNegative()) {
     // (mul x, (2^N + 1) * 2^M) => (shl (add (shl x, N), x), M)
     // (mul x, 2^N - 1) => (sub (shl x, N), x)
     // (mul x, (2^(N-M) - 1) * 2^M) => (sub (shl x, N), (shl x, M))
+    // (mul x, (2^M + 1) * (2^N + 1))
+    //     => MV = (add (shl x, M), x); (add (shl MV, N), MV)
     APInt SCVMinus1 = ShiftedConstValue - 1;
     APInt SCVPlus1 = ShiftedConstValue + 1;
     APInt CVPlus1 = ConstValue + 1;
+    APInt CVM, CVN;
     if (SCVMinus1.isPowerOf2()) {
       ShiftAmt = SCVMinus1.logBase2();
       return Shl(Add(Shl(N0, ShiftAmt), N0), TrailingZeroes);
     } else if (CVPlus1.isPowerOf2()) {
       ShiftAmt = CVPlus1.logBase2();
       return Sub(Shl(N0, ShiftAmt), N0);
     } else if (SCVPlus1.isPowerOf2()) {
       ShiftAmt = SCVPlus1.logBase2() + TrailingZeroes;
       return Sub(Shl(N0, ShiftAmt), Shl(N0, TrailingZeroes));
+    } else if (Subtarget->hasLSLFast() &&
+               isPowPlusPlusConst(ConstValue, CVM, CVN)) {
+      APInt CVMMinus1 = CVM - 1;
+      APInt CVNMinus1 = CVN - 1;
+      unsigned ShiftM1 = CVMMinus1.logBase2();
+      unsigned ShiftN1 = CVNMinus1.logBase2();
+      // LSLFast implicate that Shifts <= 3 places are fast
+      if (ShiftM1 <= 3 && ShiftN1 <= 3) {
+        SDValue MVal = Add(Shl(N0, ShiftM1), N0);
+        return Add(Shl(MVal, ShiftN1), MVal);
+      }
     }
   } else {
     // (mul x, -(2^N - 1)) => (sub x, (shl x, N))
     // (mul x, -(2^N + 1)) => - (add (shl x, N), x)
     // (mul x, -(2^(N-M) - 1) * 2^M) => (sub (shl x, M), (shl x, N))
     APInt SCVPlus1 = -ShiftedConstValue + 1;
     APInt CVNegPlus1 = -ConstValue + 1;
     APInt CVNegMinus1 = -ConstValue - 1;
[... 7,852 lines hidden ...]
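
The decomposition check and the shift gate can also be tried outside LLVM. The following is a minimal C++17 sketch under stated assumptions: it uses uint64_t instead of APInt, the helper names (isPowerOf2, log2Floor, decomposeTwoShiftAdds) and the MaxShift parameter are made up for illustration, and the search is a simplified variant that folds the "shift <= 3" gate into the loop. It is not the code that was committed.

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// True if V is a non-zero power of two.
static bool isPowerOf2(uint64_t V) { return V != 0 && (V & (V - 1)) == 0; }

// floor(log2(V)); the caller must pass V != 0.
static unsigned log2Floor(uint64_t V) {
  unsigned L = 0;
  while (V >>= 1)
    ++L;
  return L;
}

// Tries to write C as (1 + 2^M) * (1 + 2^N) with 1 <= M, N <= MaxShift and
// returns the shift amounts {M, N}, or std::nullopt if no such pair exists.
// MaxShift plays the role of the "ShiftM1 <= 3 && ShiftN1 <= 3" gate.
std::optional<std::pair<unsigned, unsigned>>
decomposeTwoShiftAdds(uint64_t C, unsigned MaxShift) {
  for (unsigned M = 1; M <= MaxShift && M < 63; ++M) {
    uint64_t X = (uint64_t{1} << M) + 1; // candidate factor 1 + 2^M
    if (C % X != 0)
      continue;
    uint64_t Q = C / X;                  // must have the form 1 + 2^N, N >= 1
    if (Q < 3 || !isPowerOf2(Q - 1))
      continue;
    unsigned N = log2Floor(Q - 1);
    if (N <= MaxShift)
      return std::make_pair(M, N);
  }
  return std::nullopt;
}
```

With MaxShift = 3 this accepts 25 as {2, 2} and 45 as {2, 3}, and rejects 85 and 297, which matches the positive and negative tests in mul_pow2.ll. Raising MaxShift to 5 would accept 297 as {3, 5}, which is the kind of change a separate subtarget feature for larger fast shifts would enable.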

llvm/test/CodeGen/AArch64/mul_pow2.ll

[... 487 lines hidden ...]
 ; GISEL: // %bb.0:
 ; GISEL-NEXT: lsl w0, w0, #4
 ; GISEL-NEXT: ret

   %mul = mul nsw i32 %x, 16
   ret i32 %mul
 }

+define i32 @test25_fast_shift(i32 %x) "target-features"="+lsl-fast" {
+; CHECK-LABEL: test25_fast_shift:
+; CHECK: // %bb.0:
+; CHECK-NEXT: add w8, w0, w0, lsl #2
+; CHECK-NEXT: add w0, w8, w8, lsl #2
+; CHECK-NEXT: ret
+;
+; GISEL-LABEL: test25_fast_shift:
+; GISEL: // %bb.0:
+; GISEL-NEXT: mov w8, #25
+; GISEL-NEXT: mul w0, w0, w8
+; GISEL-NEXT: ret
+
+  %mul = mul nsw i32 %x, 25 ; 25 = (1+4)*(1+4)
+  ret i32 %mul
+}
+
+define i32 @test45_fast_shift(i32 %x) "target-features"="+lsl-fast" {
+; CHECK-LABEL: test45_fast_shift:
+; CHECK: // %bb.0:
+; CHECK-NEXT: add w8, w0, w0, lsl #2
+; CHECK-NEXT: add w0, w8, w8, lsl #3
+; CHECK-NEXT: ret
+;
+; GISEL-LABEL: test45_fast_shift:
+; GISEL: // %bb.0:
+; GISEL-NEXT: mov w8, #45
+; GISEL-NEXT: mul w0, w0, w8
+; GISEL-NEXT: ret
+
+  %mul = mul nsw i32 %x, 45 ; 45 = (1+4)*(1+8)
+  ret i32 %mul
+}
+
+; Negative test: Keep MUL as don't have the feature LSLFast
+define i32 @test45(i32 %x) {
+; CHECK-LABEL: test45:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #45
+; CHECK-NEXT: mul w0, w0, w8
+; CHECK-NEXT: ret
+;
+; GISEL-LABEL: test45:
+; GISEL: // %bb.0:
+; GISEL-NEXT: mov w8, #45
+; GISEL-NEXT: mul w0, w0, w8
+; GISEL-NEXT: ret
+
+  %mul = mul nsw i32 %x, 45 ; 45 = (1+4)*(1+8)
+  ret i32 %mul
+}
+
+; Negative test: The shift amount 4 larger than 3
+define i32 @test85_fast_shift(i32 %x) "target-features"="+lsl-fast" {
+; CHECK-LABEL: test85_fast_shift:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #85
+; CHECK-NEXT: mul w0, w0, w8
+; CHECK-NEXT: ret
+;
+; GISEL-LABEL: test85_fast_shift:
+; GISEL: // %bb.0:
+; GISEL-NEXT: mov w8, #85
+; GISEL-NEXT: mul w0, w0, w8
+; GISEL-NEXT: ret
+
+  %mul = mul nsw i32 %x, 85 ; 85 = (1+4)*(1+16)
+  ret i32 %mul
+}
+
+; Negative test: The shift amount 5 larger than 3
+define i32 @test297_fast_shift(i32 %x) "target-features"="+lsl-fast" {
+; CHECK-LABEL: test297_fast_shift:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov w8, #297
+; CHECK-NEXT: mul w0, w0, w8
+; CHECK-NEXT: ret
+;
+; GISEL-LABEL: test297_fast_shift:
+; GISEL: // %bb.0:
+; GISEL-NEXT: mov w8, #297
+; GISEL-NEXT: mul w0, w0, w8
+; GISEL-NEXT: ret
+
+  %mul = mul nsw i32 %x, 297 ; 297 = (1+8)*(1+32)
+  ret i32 %mul
+}
+
 ; Convert mul x, -pow2 to shift.
 ; Convert mul x, -(pow2 +/- 1) to shift + add/sub.
 ; Lowering other negative constants are not supported yet.

 define i32 @ntest2(i32 %x) {
 ; CHECK-LABEL: ntest2:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: neg w0, w0, lsl #1
[... 261 lines hidden ...]
 ; CHECK-NEXT: movi v2.4s, #1, msl #16
 ; CHECK-NEXT: shl v0.4s, v0.4s, #6
 ; CHECK-NEXT: sub v0.4s, v1.4s, v0.4s
 ; CHECK-NEXT: and v0.16b, v0.16b, v2.16b
 ; CHECK-NEXT: ret
 ;
 ; GISEL-LABEL: muladd_demand_commute:
 ; GISEL: // %bb.0:
-; GISEL-NEXT: adrp x8, .LCPI44_1
-; GISEL-NEXT: ldr q2, [x8, :lo12:.LCPI44_1]
-; GISEL-NEXT: adrp x8, .LCPI44_0
+; GISEL-NEXT: adrp x8, .LCPI49_1
+; GISEL-NEXT: ldr q2, [x8, :lo12:.LCPI49_1]
+; GISEL-NEXT: adrp x8, .LCPI49_0
 ; GISEL-NEXT: mla v1.4s, v0.4s, v2.4s
-; GISEL-NEXT: ldr q0, [x8, :lo12:.LCPI44_0]
+; GISEL-NEXT: ldr q0, [x8, :lo12:.LCPI49_0]
 ; GISEL-NEXT: and v0.16b, v1.16b, v0.16b
 ; GISEL-NEXT: ret
   %m = mul <4 x i32> %x, <i32 131008, i32 131008, i32 131008, i32 131008>
   %a = add <4 x i32> %m, %y
   %r = and <4 x i32> %a, <i32 131071, i32 131071, i32 131071, i32 131071>
   ret <4 x i32> %r
 }

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SelectionDAG] Lower multiplication by a constant to shl+add+shl+add
ClosedPublic