This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SelectionDAG] Optimize multiplication by constant
AbandonedPublic

Authored by Allen on Aug 20 2022, 6:45 PM.

Details

Summary

InstCombine merges x*5+x into x*6, and the backend then decomposes a MUL
by constant into shift/add or shift/sub. Change the cost model to lower
a = b * C, where C = (2^n + 1) * 2^m, to:

add     w0, w0, w0, lsl n
lsl     w0, w0, m

Fixes AArch64 part of https://github.com/llvm/llvm-project/issues/57255.
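The transform in the summary can be sketched as scalar C++; this is an illustration of the arithmetic identity, not the patch's lowering code:

```cpp
#include <cstdint>

// Scalar model of the lowering above: for C = (2^n + 1) * 2^m,
//   b * C == ((b + (b << n)) << m)
// which matches
//   add w0, w0, w0, lsl n   ; b * (2^n + 1)
//   lsl w0, w0, m           ; ... * 2^m
uint32_t mulByConstDecomposed(uint32_t b, unsigned n, unsigned m) {
  uint32_t t = b + (b << n); // add with shifted operand
  return t << m;             // final shift
}
```

For example, with n = 2 and m = 3 the constant is C = (2^2 + 1) * 2^3 = 40, and the two-instruction sequence replaces a single mul.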

Diff Detail

Event Timeline

Allen created this revision.Aug 20 2022, 6:45 PM
Allen requested review of this revision.Aug 20 2022, 6:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 20 2022, 6:45 PM
Allen edited reviewers, added: spatel, craig.topper; removed: luismarques, MaskRay.Aug 20 2022, 6:49 PM
craig.topper added inline comments.Aug 20 2022, 6:56 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13735

Is this copied from RISC-V? That doesn't look like an AArch64 instruction name

Allen updated this revision to Diff 454269.Aug 20 2022, 7:20 PM

update comment

Allen marked an inline comment as done.Aug 20 2022, 7:25 PM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13735

Yes, I updated the comment, thanks @craig.topper

hiraditya added inline comments.Aug 22 2022, 5:08 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13732

What is the rationale behind checking Imm+1 etc?

Allen marked 2 inline comments as done.Aug 22 2022, 6:04 PM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13732

I think shl + sub can be used to replace the mul, for example:
https://gcc.godbolt.org/z/szh7Eb9K8

BTW: other targets have similar logic, e.g. RISCVTargetLowering::decomposeMulByConstant
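The shl + sub pattern mentioned here can be sketched in scalar C++; a hedged illustration of the identity, not code from this patch or from RISCVTargetLowering:

```cpp
#include <cstdint>

// For C = 2^n - 1, x * C == (x << n) - x, so a shift plus a subtract
// can replace the multiply, e.g. x * 7 == (x << 3) - x.
uint32_t mulByPow2Minus1(uint32_t x, unsigned n) {
  return (x << n) - x; // lsl + sub instead of mul
}
```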

llvm/test/CodeGen/AArch64/mul_pow2.ll
217

will be fixed with D132325

hiraditya added inline comments.Aug 22 2022, 6:14 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13732

as per godbolt link, it is already happening in some cases?

It's not obvious that the replacement sequences are consistently faster. At least on some cores, "add x8, x8, w0, sxtw #1" and "smull x0, w0, w8" have exactly the same throughput, so transforming from the smull to a two-instruction sequence involving the add isn't really profitable.

On a related note, many cores have optimizations for arithmetic with lsl #n, so we should prefer that over uxtw/sxtw.

llvm/test/CodeGen/AArch64/mul_pow2.ll
179

I think the suggestion misses a zero-extension. Should be able to ubfiz, though.

281

I think you can save an instruction here: instead of "-(x*4+x*2)", compute x*2-x*8.
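The two sequences being compared can be checked as scalar C++; both compute x * -6 on the positive inputs tested here, and the second form needs one fewer instruction (this is an arithmetic illustration, not the patch's code):

```cpp
// -(x*4 + x*2): shift, add-with-shifted-operand, then negate.
long mulNeg6_threeOps(long x) { return -((x << 2) + (x << 1)); }

// x*2 - x*8: shift, then sub-with-shifted-operand; one instruction fewer.
long mulNeg6_twoOps(long x) { return (x << 1) - (x << 3); }
```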

500

This seems to be overriding our existing logic here to produce a worse result.

llvm/test/CodeGen/AArch64/sve-intrinsics-counting-elems-i32.ll
169 ↗(On Diff #454269)

Probably need some logic to allow folding inch/dech, assuming there isn't some reason to avoid them.

Hi - I had been looking at mul vs add+shift recently, but had not had the time to get very far and had not got to the important part yet - exactly where and when we should be doing the transform, especially when you consider all the various CPUs present.

AArch64 usually handles mul vs add/shift here: https://github.com/llvm/llvm-project/blob/6c6c4f6a9b3ef2d7db937cb78784245ea8a61418/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp#L14465
With a cost-model of sorts here: https://github.com/llvm/llvm-project/blob/6c6c4f6a9b3ef2d7db937cb78784245ea8a61418/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp#L14442

My understanding (but some of this may not be correct) is that:

  • The existing cost model is conservative and could do with adjustment, but is there for a reason. It prevents the transform when there is an add/sub user (to create a madd) or if there is a zext/sext operand (to create a umull).
  • A good cost model is hard to be precise about, but roughly: an add costs 1, a shift costs 1, and add+shift usually costs 2 but can be 1 for some CPUs and constants. A mul costs somewhere between 2 and 5, depending on the CPU, the size (i32/i64) and the values in the registers. A madd costs the same as a mul (so the add is free). Newer CPUs have a lower cost for mul, especially i64 mul.
  • Replacing a mul with 2 instructions is probably good, a mul with 3 instructions is more iffy.
  • I was using https://godbolt.org/z/cPTEMnP5x with different C to compare when we do the transform compared to gcc.
  • The cost model is certainly worse where the operands lead into a load/store pointer operand, as the add can be folded into the addressing mode so it won't form a madd. Small shifts can sometimes be free again. I was probably planning to alter the existing profitability checks.
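The profitability rule of thumb in the bullets above can be sketched as a tiny hypothetical check; the cost numbers are illustrative midpoints, not taken from any real scheduling model or from this patch:

```cpp
// Hypothetical profitability check: decomposing is worthwhile only when
// the shift/add sequence is cheaper than the mul it replaces. Per the
// discussion, add/shift each cost ~1 and a mul costs ~2-5 depending on
// the CPU, so 2 replacement instructions are usually fine and 3 are iffy.
bool isDecomposeProfitable(unsigned numShiftAddInsts, unsigned mulCost) {
  return numShiftAddInsts < mulCost;
}
```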

So whilst this may be an improvement on some targets over what we have already, a lot of the changes here either look like obvious regressions (cntd/decd/etc), are non-obvious as to whether they are regressions or not (madd vs add+lsl+add_lsl), or are no longer testing the point of the tests (machine-combiner-madd.ll).

Allen planned changes to this revision.Aug 23 2022, 1:51 AM
Allen marked an inline comment as done.

Thanks for the information; it seems this is not easy work :)

llvm/test/CodeGen/AArch64/mul_pow2.ll
179

Yes, this has been addressed with D132325, thanks.

llvm/test/CodeGen/AArch64/sve-intrinsics-counting-elems-i32.ll
169 ↗(On Diff #454269)

Thanks, I'll try this first

I agree with @eli.friedman that decomposing a multiply into two separate instructions may not be profitable.
Even when the latency of the multiply is high, having two instructions in the pipeline adds complexity (dependencies, re-ordering, retiring etc).

Another thing to consider would be the number of ALUs and MACs (multiply-accumulate units).
In case the ALU pressure is high, MAC operations may be more efficient as there won't be stalls.

Allen updated this revision to Diff 459409.Sep 12 2022, 12:19 AM

Add more restrictions to disable the cases that have a negative effect

Allen updated this revision to Diff 459651.Sep 12 2022, 11:22 PM

rebase to fix the newly added case Transforms/SeparateConstOffsetFromGEP/AArch64/split-gep.ll

dmgreen added inline comments.Sep 13 2022, 6:37 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13726

We can't just expect that the first user of a constant will be the mul we are interested in.

All the other handling of decomposing mul into add+shift is currently done in performMulCombine. Would it be better to just alter the code that is already there? It would make it easier to be more precise with the costmodel.

llvm/test/CodeGen/AArch64/mul_pow2.ll
147–148

I think that (considering the mov as free in terms of latency), in this case the madd would cost 2-3, and the lsl+add_lsl+add would cost 3-4. It would depend heavily on the exact CPU though.
For i64 muls the madd would have a higher cost (it was 4 on one cpu I tested, but newer cpus are better).

307

Is this the case you are interested in? Could we change the existing costmodel to be more precise with which sub operand it considers free?

// Conservatively do not lower to shift+add+shift if the mul might be
// folded into madd or msub.
if (N->hasOneUse() && (N->use_begin()->getOpcode() == ISD::ADD ||
                       N->use_begin()->getOpcode() == ISD::SUB))
  return SDValue();
Allen updated this revision to Diff 460100.Sep 14 2022, 8:10 AM
Allen edited the summary of this revision. (Show Details)

refactor with costmodel

Allen marked 2 inline comments as done.Sep 14 2022, 8:13 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13726

Applied your comment and refactored it with the cost model, thanks

llvm/test/CodeGen/AArch64/mul_pow2.ll
307

Thanks for the idea, I can try this, but why can the sub operand be considered free?

307

Done, thanks

efriedma added inline comments.Sep 14 2022, 2:24 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14562

Do you want to check specifically for a constant that fits into the immediate operand of an add/sub? If it doesn't fit, it doesn't really matter if it's constant.

14634

Move the computation of "ShiftedVal1" into the if statement?

llvm/test/CodeGen/AArch64/mul_pow2.ll
306

There is one other possible sequence here:

mov w8, #6
mov w9, #-1
madd w0, w0, w8, w9

This is obviously not that exciting in isolation, but it might make sense if we can hoist the "mov" instructions out of a loop.

dmgreen added inline comments.Sep 14 2022, 2:43 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14562

I don't think it matters if the AddSub has one use? (It might for some other optimizations, but that can be a separate patch.)

It matters that for a sub the AddSub->operand(1) == N. If it is AddSub->operand(0) == N, then we cannot fold to a msub, so that shouldn't block the transform to add+shifts.
The isIntOrFPConstant check can check for isLegalAddImmediate, to be a little more precise where it will generate a mul+addri or subri.

Allen marked an inline comment as done.Sep 15 2022, 11:17 PM
Allen added inline comments.
llvm/test/CodeGen/AArch64/mul_pow2.ll
306

Hi @efriedma, I tried to generate the new sequence as you showed, but I don't know how to put a constant into a register, as the operand of madd should not be a constant value.

For the constant **AddSub->getOperand(1)**, even when I use an ISD::ADD, it still returns a constant value:
 SDValue Const = DAG.getNode(ISD::ADD, DL, VT, AddSub->getOperand(1), DAG.getConstant(0, DL, VT));

(gdb) p Const->dump()
t5: i32 = Constant<-1>
$10 = void
(gdb) p AddSub->getOperand(1)->dump()
 t5: i32 = Constant<-1>
Allen updated this revision to Diff 460676.Sep 16 2022, 1:59 AM

address comment

Allen marked 4 inline comments as done.Sep 16 2022, 4:12 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14562

Thanks @dmgreen and @efriedma for the detailed suggestions

14634

Done, thanks

llvm/test/CodeGen/AArch64/mul_pow2.ll
147–148

Done; blocked the change as we can make use of madd

efriedma added inline comments.Sep 16 2022, 9:45 AM
llvm/test/CodeGen/AArch64/mul_pow2.ll
306

You don't have to do anything special to "put a const in a register". For example, if you write unsigned a(unsigned x) { return x * 0x33333333 + 0xF0F0F0F0; }, you get code like I suggested. Getting the same code with smaller immediates is just a matter of making sure we pick the pattern for madd, and not the pattern for sub-with-imm.

Allen marked 4 inline comments as done.Sep 21 2022, 12:40 AM
Allen added inline comments.
llvm/test/CodeGen/AArch64/mul_pow2.ll
306

I tried to address the comment with D134336

Allen abandoned this revision.Oct 7 2022, 4:51 AM
Allen marked an inline comment as done.

Abandoned, as the work was accepted in D134934 and D134336