This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SelectionDAG] Optimize multiplication by constant
AbandonedPublic

Authored by Allen on Aug 20 2022, 6:45 PM.

Details

Summary

InstCombine merges x*5+x into x*6, and the backend then decomposes a MUL
by constant into shift/add or shift/sub. Change the cost model to lower
a = b * C, where C = (2^n + 1) * 2^m, to:

add     w0, w0, w0, lsl n
lsl     w0, w0, m

Fixes AArch64 part of https://github.com/llvm/llvm-project/issues/57255.
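The transform in the summary can be sketched as scalar C++; this is an illustration of the arithmetic identity, not the patch's lowering code:

```cpp
#include <cstdint>

// Scalar model of the lowering above: for C = (2^n + 1) * 2^m,
//   b * C == ((b + (b << n)) << m)
// which matches
//   add w0, w0, w0, lsl n   ; b * (2^n + 1)
//   lsl w0, w0, m           ; ... * 2^m
uint32_t mulByConstDecomposed(uint32_t b, unsigned n, unsigned m) {
  uint32_t t = b + (b << n); // add with shifted operand
  return t << m;             // final shift
}
```

For example, with n = 2 and m = 3 the constant is C = (2^2 + 1) * 2^3 = 40, and the two-instruction sequence replaces a single mul.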

Diff Detail

Event Timeline

Allen created this revision.Aug 20 2022, 6:45 PM
Allen requested review of this revision.Aug 20 2022, 6:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2022, 6:45 PM
Allen edited reviewers, added: spatel, craig.topper; removed: luismarques, MaskRay.Aug 20 2022, 6:49 PM
craig.topper added inline comments.Aug 20 2022, 6:56 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13735

Is this copied from RISC-V? That doesn't look like an AArch64 instruction name

Allen updated this revision to Diff 454269.Aug 20 2022, 7:20 PM

update comment

Allen marked an inline comment as done.Aug 20 2022, 7:25 PM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13735

Yes, I updated the comment, thanks @craig.topper

hiraditya added inline comments.Aug 22 2022, 5:08 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13732

What is the rationale behind checking Imm+1 etc?

Allen marked 2 inline comments as done.Aug 22 2022, 6:04 PM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13732

I think shl + sub can be used to replace the mul, for example:
https://gcc.godbolt.org/z/szh7Eb9K8

BTW: other targets have similar logic, e.g. RISCVTargetLowering::decomposeMulByConstant
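The shl + sub pattern mentioned here can be sketched in scalar C++; a hedged illustration of the identity, not code from this patch or from RISCVTargetLowering:

```cpp
#include <cstdint>

// For C = 2^n - 1, x * C == (x << n) - x, so a shift plus a subtract
// can replace the multiply, e.g. x * 7 == (x << 3) - x.
uint32_t mulByPow2Minus1(uint32_t x, unsigned n) {
  return (x << n) - x; // lsl + sub instead of mul
}
```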

llvm/test/CodeGen/AArch64/mul_pow2.ll
217

will be fixed with D132325

hiraditya added inline comments.Aug 22 2022, 6:14 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13732

as per godbolt link, it is already happening in some cases?

It's not obvious that the replacement sequences are consistently faster. At least on some cores, "add x8, x8, w0, sxtw #1" and "smull x0, w0, w8" have exactly the same throughput, so transforming from the smull to a two-instruction sequence involving the add isn't really profitable.

On a related note, many cores have optimizations for arithmetic with lsl #n, so we should prefer that over uxtw/sxtw.

llvm/test/CodeGen/AArch64/mul_pow2.ll
179

I think the suggestion misses a zero-extension. Should be able to ubfiz, though.

281

I think you can save an instruction here: instead of "-(x*4+x*2)", compute x*2-x*8.
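The two sequences being compared can be checked as scalar C++; both compute x * -6 on the positive inputs tested here, and the second form needs one fewer instruction (this is an arithmetic illustration, not the patch's code):

```cpp
// -(x*4 + x*2): shift, add-with-shifted-operand, then negate.
long mulNeg6_threeOps(long x) { return -((x << 2) + (x << 1)); }

// x*2 - x*8: shift, then sub-with-shifted-operand; one instruction fewer.
long mulNeg6_twoOps(long x) { return (x << 1) - (x << 3); }
```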

500

This seems to be overriding our existing logic here to produce a worse result.

llvm/test/CodeGen/AArch64/sve-intrinsics-counting-elems-i32.ll
169 ↗(On Diff #454269)

Probably need some logic to allow folding inch/dech, assuming there isn't some reason to avoid them.

Hi - I had been looking at mul vs add+shift recently, but had not had the time to get very far and had not got to the important part yet - exactly where and when we should be doing the transform, especially when you consider all the various CPUs present.

AArch64 usually handles mul vs add/shift here: https://github.com/llvm/llvm-project/blob/6c6c4f6a9b3ef2d7db937cb78784245ea8a61418/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp#L14465
With a cost-model of sorts here: https://github.com/llvm/llvm-project/blob/6c6c4f6a9b3ef2d7db937cb78784245ea8a61418/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp#L14442

My understanding (but some of this may not be correct) is that:

  • The existing cost model is conservative and could do with adjustment, but is there for a reason. It prevents the transform when there is an add/sub user (to create a madd) or if there is a zext/sext operand (to create a umull).
  • A good cost model is hard to be precise about, but roughly: an add costs 1, a shift costs 1, and add+shift usually costs 2 but can be 1 for some CPUs and constants. A mul costs somewhere between 2 and 5, depending on the CPU, the size (i32/i64) and the values in the registers. A madd costs the same as a mul (so the add is free). Newer CPUs have a lower cost for mul, especially i64 mul.
  • Replacing a mul with 2 instructions is probably good, a mul with 3 instructions is more iffy.
  • I was using https://godbolt.org/z/cPTEMnP5x with different C to compare when we do the transform compared to gcc.
  • The cost model is certainly worse where the operands lead into a load/store pointer operand, as the add can be folded into the addressing mode so it won't form a madd. Small shifts can sometimes be free again. I was probably planning to alter the existing profitability checks.
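The profitability rule of thumb in the bullets above can be sketched as a tiny hypothetical check; the cost numbers are illustrative midpoints, not taken from any real scheduling model or from this patch:

```cpp
// Hypothetical profitability check: decomposing is worthwhile only when
// the shift/add sequence is cheaper than the mul it replaces. Per the
// discussion, add/shift each cost ~1 and a mul costs ~2-5 depending on
// the CPU, so 2 replacement instructions are usually fine and 3 are iffy.
bool isDecomposeProfitable(unsigned numShiftAddInsts, unsigned mulCost) {
  return numShiftAddInsts < mulCost;
}
```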

So whilst this may be an improvement on some targets over what we have already, a lot of the changes here either look like obvious regressions (cntd/decd/etc), are non-obvious as to whether they are regressions or not (madd vs add+lsl+add_lsl), or are no longer testing the point of the tests (machine-combiner-madd.ll).

Allen planned changes to this revision.Aug 23 2022, 1:51 AM
Allen marked an inline comment as done.

Thanks for the information; it seems this is not easy work :)

llvm/test/CodeGen/AArch64/mul_pow2.ll
179

Yes, this has been addressed with D132325, thanks.

llvm/test/CodeGen/AArch64/sve-intrinsics-counting-elems-i32.ll
169 ↗(On Diff #454269)

Thanks, I'll try this first

I agree with @eli.friedman that decomposing a multiply into two separate instructions may not be profitable.
Even when the latency of the multiply is high, having two instructions in the pipeline adds complexity (dependencies, re-ordering, retiring etc).

Another thing to consider would be the number of ALUs and MACs (multiply-accumulate units).
In case the ALU pressure is high, MAC operations may be more efficient as there won't be stalls.

Allen updated this revision to Diff 459409.Sep 12 2022, 12:19 AM

Add more restrictions to disable the cases that have a negative effect

Allen updated this revision to Diff 459651.Sep 12 2022, 11:22 PM

rebase to fix the newly added case Transforms/SeparateConstOffsetFromGEP/AArch64/split-gep.ll

dmgreen added inline comments.Sep 13 2022, 6:37 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13726

We can't just expect that the first user of a constant will be the mul we are interested in.

All the other handling of decomposing mul into add+shift is currently done in performMulCombine. Would it be better to just alter the code that is already there? It would make it easier to be more precise with the costmodel.

llvm/test/CodeGen/AArch64/mul_pow2.ll
147–148

I think that (considering the mov as free in terms of latency), in this case the madd would cost 2-3, and the lsl+add_lsl+add would cost 3-4. It would depend heavily on the exact CPU though.
For i64 muls the madd would have a higher cost (it was 4 on one cpu I tested, but newer cpus are better).

307

Is this the case you are interested in? Could we change the existing costmodel to be more precise with which sub operand it considers free?

// Conservatively do not lower to shift+add+shift if the mul might be
// folded into madd or msub.
if (N->hasOneUse() && (N->use_begin()->getOpcode() == ISD::ADD ||
                       N->use_begin()->getOpcode() == ISD::SUB))
  return SDValue();
Allen updated this revision to Diff 460100.Sep 14 2022, 8:10 AM
Allen edited the summary of this revision. (Show Details)

refactor with costmodel

Allen marked 2 inline comments as done.Sep 14 2022, 8:13 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13726

Applied your comment and refactored it with the cost model, thanks

llvm/test/CodeGen/AArch64/mul_pow2.ll
307

Thanks for the idea, I can try this, but why can the sub operand be considered free?

307

Done, thanks

efriedma added inline comments.Sep 14 2022, 2:24 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14562

Do you want to check specifically for a constant that fits into the immediate operand of an add/sub? If it doesn't fit, it doesn't really matter if it's constant.

14634

Move the computation of "ShiftedVal1" into the if statement?

llvm/test/CodeGen/AArch64/mul_pow2.ll
306

There is one other possible sequence here:

mov w8, #6
mov w9, #-1
madd w0, w0, w8, w9

This is obviously not that exciting in isolation, but it might make sense if we can hoist the "mov" instructions out of a loop.

dmgreen added inline comments.Sep 14 2022, 2:43 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14562

I don't think it matters if the AddSub has one use? (It might for some other optimizations, but that can be a separate patch.)

It matters that for a sub the AddSub->operand(1) == N. If it is AddSub->operand(0) == N, then we cannot fold to a msub, so that shouldn't block the transform to add+shifts.
The isIntOrFPConstant check can check for isLegalAddImmediate, to be a little more precise where it will generate a mul+addri or subri.

Allen marked an inline comment as done.Sep 15 2022, 11:17 PM
Allen added inline comments.
llvm/test/CodeGen/AArch64/mul_pow2.ll
306

Hi @efriedma, I tried to generate the new sequence as you showed, but I don't know how to put a constant into a register, as the operand of madd should not be a constant value.

For the constant **AddSub->getOperand(1)**, even when I use an ISD::ADD, it still returns a constant value:
 SDValue Const = DAG.getNode(ISD::ADD, DL, VT, AddSub->getOperand(1), DAG.getConstant(0, DL, VT));

(gdb) p Const->dump()
t5: i32 = Constant<-1>
$10 = void
(gdb) p AddSub->getOperand(1)->dump()
 t5: i32 = Constant<-1>
Allen updated this revision to Diff 460676.Sep 16 2022, 1:59 AM

address comment

Allen marked 4 inline comments as done.Sep 16 2022, 4:12 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14562

Thanks @dmgreen and @efriedma for the detailed suggestions

14634

Done, thanks

llvm/test/CodeGen/AArch64/mul_pow2.ll
147–148

Done; blocked the change as we can make use of madd

efriedma added inline comments.Sep 16 2022, 9:45 AM
llvm/test/CodeGen/AArch64/mul_pow2.ll
306

You don't have to do anything special to "put a const in a register". For example, if you write unsigned a(unsigned x) { return x * 0x33333333 + 0xF0F0F0F0; }, you get code like I suggested. Getting the same code with smaller immediates is just a matter of making sure we pick the pattern for madd, and not the pattern for sub-with-imm.

Allen marked 4 inline comments as done.Sep 21 2022, 12:40 AM
Allen added inline comments.
llvm/test/CodeGen/AArch64/mul_pow2.ll
306

I tried to address the comment with D134336

Allen abandoned this revision.Oct 7 2022, 4:51 AM
Allen marked an inline comment as done.

Abandoned, as the work was accepted in D134934 and D134336