This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower multiplication by a constant int to shl+add+shl
ClosedPublic

Authored by haicheng on Oct 25 2016, 2:23 PM.

Details

Reviewers
Gerolf
mcrosier
Summary

GCC can lower a = b * C where C = (2^n + 1) * 2^m to

add     w0, w0, w0, lsl n
lsl     w0, w0, m

also lower C = (2^n - 1) * 2^m to

lsl     w1, w0, n
sub     w0, w1, w0
lsl     w0, w0, m

LLVM cannot do either above transformations and generate code like this

mov     w8, C    
mul     w0, w0, w8

This change considers the first case, since the second case requires an extra instruction. The change is also very conservative to try not to touch the mul that can be folded into s(u)mull, madd(sub), s(u)madd(sub)l since their costs seem unknown during ISelLowering.

I am also open to suggestions about a better place (machine-combiner???) to implement the transformation. If I have the information of the cycles of 32 and 64bit mul, I can consider more constants such as C = (2^m + 1) * (2^n+1), C = (2^m + 1) * 2^n + 1, or C = ((2^m + 1) * 2^n + 1) * 2^p

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng updated this revision to Diff 75789.Oct 25 2016, 2:23 PM
haicheng retitled this revision from to [AArch64] Lower multiplication by a constant int to shl+add+shl.
haicheng updated this object.
haicheng added reviewers: mcrosier, Gerolf.
haicheng set the repository for this revision to rL LLVM.
haicheng added a subscriber: llvm-commits.

Hi Haicheng,

This looks simple enough to be worth, even if the benefit is probably very small. But as it is, the code is complicating one part of two identical twins (positive and negative) and not the other, which complicates the code.

I recommend you to change that part of the code entirely into setting temporary variables, like +1/-1 ISD::ADD/ISD::SUB, based on the result of isNonNegative, and use the same piece of code for both paths.

A more generic approach could be done with some smarter constant-splitting, but this patch is simple as it is and there is already plenty of prior art for that, so let's stick to the pattern.

Also, I was expecting a much larger body of tests, with different constant sizes, multiple edge cases, and those that cannot be done, remaining a mul.

I have some comments inline, but my only additional question is: What is the motivation behind this? Benchmark numbers? Can you share them?

cheers,
--renato

lib/Target/AArch64/AArch64ISelLowering.cpp
7634

why not early return?

why is this not a problem for the previous case as well?

7643

Shift is not a good name, since this implies the "shift amount" not the "shifted value".

test/CodeGen/AArch64/mul_pow2.ll
5

You say "shift+add+shift" but your tests are on the form "add+shift".

Gerolf edited edge metadata.Oct 27 2016, 7:29 PM

Hi Haicheng,

I just have a few observations/food for thought:

  • Nit: In your Summary I think you swapped n and m in your code snippets vs your formulas. Your code is correct though.
  • The 2^N-1 * 2^M reduction increases code size, so it should not fire under Oz. Otherwise similar consideration as to your major case apply
  • The 2^N+1 * 2^M reduction increases schedule height (at least on most processors). It might also increase code when e.g. add+mul could be combined to madd. But when code size is *not* a concern and latency(lsl) + 1 < latency (mul), latency(madd) it should always be a win. But that target dependence is not checked in your code yet.
  • I would look at the machine combiner only for cases that need more global scheduling context to decide

Like Renato I'm also curious about your gains. How big? Which benchmarks?

Cheers
Gerolf

haicheng updated this object.Oct 28 2016, 9:52 AM
haicheng edited edge metadata.
haicheng updated this revision to Diff 76308.Oct 29 2016, 8:49 AM

Rewrite performMulCombine(), make the conversion a little less conservative to improve the performance and reduce the compilation time, add more tests.

Thank you, Renato. I rewrote my change and added more tests, please let me know if I did what you recommended.

I have some comments inline, but my only additional question is: What is the motivation behind this? Benchmark numbers? Can you share them?

The biggest motivation is that GCC can do this, but LLVM cannot. My patch is conservative and it does not make big change to the performance. I have not observed any noticeable regression, but the gain is small. spec2006/h264ref and spec2006/povray have around 1% improvement. One internal benchmark which is integer multiplication centric has much larger improvement.

rengolin added inline comments.Oct 29 2016, 9:02 AM
test/CodeGen/AArch64/mul_pow2.ll
279

Please use {{w[0-9]+}} instead of w8.

The biggest motivation is that GCC can do this, but LLVM cannot. My patch is conservative and it does not make big change to the performance. I have not observed any noticeable regression, but the gain is small. spec2006/h264ref and spec2006/povray have around 1% improvement. One internal benchmark which is integer multiplication centric has much larger improvement.

Any regressions? Not that I'm expecting any, but... :)

I'll come back a bit later once I've done a proper review.

Thanks!

Thank you, Gerolf

Hi Haicheng,

I just have a few observations/food for thought:

  • Nit: In your Summary I think you swapped n and m in your code snippets vs your formulas. Your code is correct though.

Thank you for catching this. I updated the summary.

  • The 2^N-1 * 2^M reduction increases code size, so it should not fire under Oz. Otherwise similar consideration as to your major case apply
  • The 2^N+1 * 2^M reduction increases schedule height (at least on most processors). It might also increase code when e.g. add+mul could be combined to madd. But when code size is *not* a concern and latency(lsl) + 1 < latency (mul), latency(madd) it should always be a win. But that target dependence is not checked in your code yet.
  • I would look at the machine combiner only for cases that need more global scheduling context to decide

I agree everything you said. I tried to be conservative in this patch to not increase code size or impact the generation of madd. If I want to support my cases, I think I need to check the target and compare the cost of different code sequences.

Like Renato I'm also curious about your gains. How big? Which benchmarks?

Please see my response to Renato above.

haicheng updated this revision to Diff 76412.Oct 31 2016, 8:27 AM
haicheng marked an inline comment as done.

Update the tests according to Renato's comments.

Hi,

I have a few variable name proposals, mostly to aid the understanding of the code. I apologise for beating on that key, but the code is getting more dense, less repetitive, so it has to be well understood.

I'd welcome another review from @Gerolf at this point. :)

cheers,
--renato

lib/Target/AArch64/AArch64ISelLowering.cpp
7630

I think one simple formula here would be more than enough:

(+/- 2^N +/- 1) * (+/- 2^M +/- 1)
7630

Feel free to hoist those two flags out of the conditional. This will make it clear that they're invariants here.

7638

N0 refers to x, so maybe calling it Var or something more meaningful would be easier to understand.

7640

Lg2 implies log base 2 of Value, which is not true.

TrailingZeroes is a better name for this.

7657

Better call this SwapValues, as this is the intention of the flag.

7661

VM1 implies V Minus One and VP1 implies V Plus One, but the Vs are different.

In the case where VM1 *is* a power of two, then Lg2 is zero and ShiftedInt == Value, but not always.

I wouldn't mind ShiftedMinus1 and ValuePlus1.

haicheng updated this revision to Diff 76465.Oct 31 2016, 1:30 PM
haicheng marked 8 inline comments as done.

Rename the variable names as suggested by Renato. Thank you.

Gerolf added a comment.Nov 1 2016, 2:26 PM

I thought I understand this until about the middle of the review. Now I could use some help perhaps with variable names and comments that reflect more clearly on the expression(s) you simplify. I think this is what Renato is looking for, too.

Thanks
Gerolf

lib/Target/AArch64/AArch64ISelLowering.cpp
7627โ€“7653

You could tie it more to the code, e.g. some multiplications Var * C can be ...

7630

I liked the spaces in Renato's comment. Or would it be clearer to say " constants C = A*B where A, B are of type +/- (2^N +/- 1)"?

7630

After this point I think you can assert(IntValue == 2^N, some power of 2).

7632

I think Value should be ShiftedMinus1 from here on.

7635

ValueOfC?

7643

no -> not

7648

dito

7654

If you declare e.g C = A * B then ShiftedInt could be ConstantA etc

7657

Operation would be more general than AddOrSub

7659

Please add a comment like what values get swapped.
ExtraNeg -> NegExp? Or NegSubExp?

7661

ShiftedMinus1 could be ConstandAMinus1

7662

Shouldn't ValuePlus1 be ConstantAPlus1? But then it should be ConstantAPlus1 = ShiftedInt (ConstantA) -1. At this point I can't match your specification and your code. However, if I"m right about this I will need to dig deep into your test cases, too ...

7689โ€“7705

I'll take another look at this code after I (think I) understand the code above.

7695

It is not clear to me why TrailingZeros and ExtraNeg are exclusive.

haicheng updated this revision to Diff 76762.Nov 2 2016, 1:07 PM
haicheng marked 11 inline comments as done.

Address Gerolf's comments.

haicheng added inline comments.Nov 9 2016, 8:03 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
7632

Similarly, I do not support
(mul x, -(2^N - 1) * 2^M) => (shl (sub x, (shl x, N)), M)
(mul x, -(2^N + 1) * 2^M) => -(shl (add (shl x, N), x), M)
So I use ValueOfC here.

7662

ValueofC is used here to support (mul x, 2^N - 1) => (sub (shl x, N), x). This patch does not support (mul x, (2^N - 1) * 2^M) => (shl (sub (shl x, N), x), M) yet. If we want to support it in the future, we just need to use ConstantA here as you said.

Hi Gerolf,

Would you please take another look? Does my latest update make the code easier to read?

Thank you,

Haicheng

mcrosier edited edge metadata.Nov 11 2016, 10:02 AM

I've done a bit of refactoring in r286601 and r286606, which I hope makes this a much easier code review. If you rebase the patch, I'd be happy to take a look.

FYI, Chad has made some big refactoring in this area, you will have to re-base:

http://llvm.org/viewvc/llvm-project?rev=286606&view=rev

haicheng updated this revision to Diff 77712.Nov 11 2016, 11:12 PM
haicheng edited edge metadata.

Rebase the code.

With the minor comment, this looks good to me. But I'll let @Gerolf and @mcrosier have a final look and approve.

Thanks!

lib/Target/AArch64/AArch64ISelLowering.cpp
7663

IIGIR, the old case is still covered because if TrailingZeroes == 0, then ConstantA == ConstValue.

Your comment should reflect that. Not here, but above, before ConstantA's instantiation.

Here, you can just add the new case:

// (mul x, (2^N + 1) * 2^M) => (shl (add (shl x, N), x), M)
haicheng updated this revision to Diff 77734.Nov 12 2016, 9:34 PM
haicheng marked an inline comment as done.

Address Renato's comment. Thank you.

mcrosier accepted this revision.Nov 14 2016, 6:52 AM
mcrosier edited edge metadata.

LGTM with a few nits about naming of variables.

lib/Target/AArch64/AArch64ISelLowering.cpp
7633

'Var' isn't descriptive. I'd prefer 'N0' as this is a common idiom in ISel lowering.

7649

I'd prefer 'ShiftedConstValue' over 'ConstantA'.

7663

CAMinus1 -> SCVMinus1

7665

CAMinus1 -> SCVMinus1

7692

Var -> N0

7695

Var -> N0

7696

Var -> N0

7698

Please add an assert showing that TrailingZeroes and NegateResult can't both be true.

Then the last bit of logic here can be written as:

// Negate the result.
if (NegateResult)
  return DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), Res);

// Shift the result.
if (TrailingZeroes)
  return DAG.getNode(ISD::SHL, DL, VT, Res, DAG.getConstant(TrailingZeroes, DL, MVT::i64));

return Res;
This revision is now accepted and ready to land.Nov 14 2016, 6:52 AM

Thanks for following up!
LGTM

lib/Target/AArch64/AArch64ISelLowering.cpp
7650

CAMinus1 is consistent with the comment. Perhaps ConstantAMinus1 would be even more clear.

mcrosier closed this revision.Nov 16 2016, 6:00 AM

This was committed in r287019.