This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower multiplication by a constant int to shl+add+shl
ClosedPublic

Authored by haicheng on Oct 25 2016, 2:23 PM.

Details

Reviewers
Gerolf
mcrosier
Summary

GCC can lower a = b * C where C = (2^n + 1) * 2^m to

add     w0, w0, w0, lsl n
lsl     w0, w0, m

also lower C = (2^n - 1) * 2^m to

lsl     w1, w0, n
sub     w0, w1, w0
lsl     w0, w0, m

LLVM cannot do either above transformations and generate code like this

mov     w8, C    
mul     w0, w0, w8

This change considers the first case, since the second case requires an extra instruction. The change is also very conservative to try not to touch the mul that can be folded into s(u)mull, madd(sub), s(u)madd(sub)l since their costs seem unknown during ISelLowering.

I am also open to suggestions about a better place (machine-combiner???) to implement the transformation. If I have the information of the cycles of 32 and 64bit mul, I can consider more constants such as C = (2^m + 1) * (2^n+1), C = (2^m + 1) * 2^n + 1, or C = ((2^m + 1) * 2^n + 1) * 2^p

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng updated this revision to Diff 75789.Oct 25 2016, 2:23 PM
haicheng retitled this revision from to [AArch64] Lower multiplication by a constant int to shl+add+shl.
haicheng updated this object.
haicheng added reviewers: mcrosier, Gerolf.
haicheng set the repository for this revision to rL LLVM.
haicheng added a subscriber: llvm-commits.

Hi Haicheng,

This looks simple enough to be worth, even if the benefit is probably very small. But as it is, the code is complicating one part of two identical twins (positive and negative) and not the other, which complicates the code.

I recommend you to change that part of the code entirely into setting temporary variables, like +1/-1 ISD::ADD/ISD::SUB, based on the result of isNonNegative, and use the same piece of code for both paths.

A more generic approach could be done with some smarter constant-splitting, but this patch is simple as it is and there is already plenty of prior art for that, so let's stick to the pattern.

Also, I was expecting a much larger body of tests, with different constant sizes, multiple edge cases, and those that cannot be done, remaining a mul.

I have some comments inline, but my only additional question is: What is the motivation behind this? Benchmark numbers? Can you share them?

cheers,
--renato

lib/Target/AArch64/AArch64ISelLowering.cpp
7624

why not early return?

why is this not a problem for the previous case as well?

7633

Shift is not a good name, since this implies the "shift amount" not the "shifted value".

test/CodeGen/AArch64/mul_pow2.ll
5

You say "shift+add+shift" but your tests are on the form "add+shift".

Gerolf edited edge metadata.Oct 27 2016, 7:29 PM

Hi Haicheng,

I just have a few observations/food for thought:

  • Nit: In your Summary I think you swapped n and m in your code snippets vs your formulas. Your code is correct though.
  • The 2^N-1 * 2^M reduction increases code size, so it should not fire under Oz. Otherwise similar consideration as to your major case apply
  • The 2^N+1 * 2^M reduction increases schedule height (at least on most processors). It might also increase code when e.g. add+mul could be combined to madd. But when code size is *not* a concern and latency(lsl) + 1 < latency (mul), latency(madd) it should always be a win. But that target dependence is not checked in your code yet.
  • I would look at the machine combiner only for cases that need more global scheduling context to decide

Like Renato I'm also curious about your gains. How big? Which benchmarks?

Cheers
Gerolf

haicheng updated this object.Oct 28 2016, 9:52 AM
haicheng edited edge metadata.
haicheng updated this revision to Diff 76308.Oct 29 2016, 8:49 AM

Rewrite performMulCombine(), make the conversion a little less conservative to improve the performance and reduce the compilation time, add more tests.

Thank you, Renato. I rewrote my change and added more tests, please let me know if I did what you recommended.

I have some comments inline, but my only additional question is: What is the motivation behind this? Benchmark numbers? Can you share them?

The biggest motivation is that GCC can do this, but LLVM cannot. My patch is conservative and it does not make big change to the performance. I have not observed any noticeable regression, but the gain is small. spec2006/h264ref and spec2006/povray have around 1% improvement. One internal benchmark which is integer multiplication centric has much larger improvement.

rengolin added inline comments.Oct 29 2016, 9:02 AM
test/CodeGen/AArch64/mul_pow2.ll
279

Please use {{w[0-9]+}} instead of w8.

The biggest motivation is that GCC can do this, but LLVM cannot. My patch is conservative and it does not make big change to the performance. I have not observed any noticeable regression, but the gain is small. spec2006/h264ref and spec2006/povray have around 1% improvement. One internal benchmark which is integer multiplication centric has much larger improvement.

Any regressions? Not that I'm expecting any, but... :)

I'll come back a bit later once I've done a proper review.

Thanks!

Thank you, Gerolf

Hi Haicheng,

I just have a few observations/food for thought:

  • Nit: In your Summary I think you swapped n and m in your code snippets vs your formulas. Your code is correct though.

Thank you for catching this. I updated the summary.

  • The 2^N-1 * 2^M reduction increases code size, so it should not fire under Oz. Otherwise similar consideration as to your major case apply
  • The 2^N+1 * 2^M reduction increases schedule height (at least on most processors). It might also increase code when e.g. add+mul could be combined to madd. But when code size is *not* a concern and latency(lsl) + 1 < latency (mul), latency(madd) it should always be a win. But that target dependence is not checked in your code yet.
  • I would look at the machine combiner only for cases that need more global scheduling context to decide

I agree everything you said. I tried to be conservative in this patch to not increase code size or impact the generation of madd. If I want to support my cases, I think I need to check the target and compare the cost of different code sequences.

Like Renato I'm also curious about your gains. How big? Which benchmarks?

Please see my response to Renato above.

haicheng updated this revision to Diff 76412.Oct 31 2016, 8:27 AM
haicheng marked an inline comment as done.

Update the tests according to Renato's comments.

Hi,

I have a few variable name proposals, mostly to aid the understanding of the code. I apologise for beating on that key, but the code is getting more dense, less repetitive, so it has to be well understood.

I'd welcome another review from @Gerolf at this point. :)

cheers,
--renato

lib/Target/AArch64/AArch64ISelLowering.cpp
7620

I think one simple formula here would be more than enough:

(+/- 2^N +/- 1) * (+/- 2^M +/- 1)
7628

N0 refers to x, so maybe calling it Var or something more meaningful would be easier to understand.

7630

Lg2 implies log base 2 of Value, which is not true.

TrailingZeroes is a better name for this.

7647

Better call this SwapValues, as this is the intention of the flag.

7651

VM1 implies V Minus One and VP1 implies V Plus One, but the Vs are different.

In the case where VM1 *is* a power of two, then Lg2 is zero and ShiftedInt == Value, but not always.

I wouldn't mind ShiftedMinus1 and ValuePlus1.

7665

Feel free to hoist those two flags out of the conditional. This will make it clear that they're invariants here.

haicheng updated this revision to Diff 76465.Oct 31 2016, 1:30 PM
haicheng marked 8 inline comments as done.

Rename the variable names as suggested by Renato. Thank you.

Gerolf added a comment.Nov 1 2016, 2:26 PM

I thought I understand this until about the middle of the review. Now I could use some help perhaps with variable names and comments that reflect more clearly on the expression(s) you simplify. I think this is what Renato is looking for, too.

Thanks
Gerolf

lib/Target/AArch64/AArch64ISelLowering.cpp
7617โ€“7660

You could tie it more to the code, e.g. some multiplications Var * C can be ...

7620

I liked the spaces in Renato's comment. Or would it be clearer to say " constants C = A*B where A, B are of type +/- (2^N +/- 1)"?

7625

ValueOfC?

7633

no -> not

7638

dito

7644

If you declare e.g C = A * B then ShiftedInt could be ConstantA etc

7647

Operation would be more general than AddOrSub

7649

Please add a comment like what values get swapped.
ExtraNeg -> NegExp? Or NegSubExp?

7651

ShiftedMinus1 could be ConstandAMinus1

7652

Shouldn't ValuePlus1 be ConstantAPlus1? But then it should be ConstantAPlus1 = ShiftedInt (ConstantA) -1. At this point I can't match your specification and your code. However, if I"m right about this I will need to dig deep into your test cases, too ...

7670

After this point I think you can assert(IntValue == 2^N, some power of 2).

7672

I think Value should be ShiftedMinus1 from here on.

7689โ€“7702

I'll take another look at this code after I (think I) understand the code above.

7695

It is not clear to me why TrailingZeros and ExtraNeg are exclusive.

haicheng updated this revision to Diff 76762.Nov 2 2016, 1:07 PM
haicheng marked 11 inline comments as done.

Address Gerolf's comments.

haicheng added inline comments.Nov 9 2016, 8:03 AM
lib/Target/AArch64/AArch64ISelLowering.cpp
7652

ValueofC is used here to support (mul x, 2^N - 1) => (sub (shl x, N), x). This patch does not support (mul x, (2^N - 1) * 2^M) => (shl (sub (shl x, N), x), M) yet. If we want to support it in the future, we just need to use ConstantA here as you said.

7672

Similarly, I do not support
(mul x, -(2^N - 1) * 2^M) => (shl (sub x, (shl x, N)), M)
(mul x, -(2^N + 1) * 2^M) => -(shl (add (shl x, N), x), M)
So I use ValueOfC here.

Hi Gerolf,

Would you please take another look? Does my latest update make the code easier to read?

Thank you,

Haicheng

mcrosier edited edge metadata.Nov 11 2016, 10:02 AM

I've done a bit of refactoring in r286601 and r286606, which I hope makes this a much easier code review. If you rebase the patch, I'd be happy to take a look.

FYI, Chad has made some big refactoring in this area, you will have to re-base:

http://llvm.org/viewvc/llvm-project?rev=286606&view=rev

haicheng updated this revision to Diff 77712.Nov 11 2016, 11:12 PM
haicheng edited edge metadata.

Rebase the code.

With the minor comment, this looks good to me. But I'll let @Gerolf and @mcrosier have a final look and approve.

Thanks!

lib/Target/AArch64/AArch64ISelLowering.cpp
7617

IIGIR, the old case is still covered because if TrailingZeroes == 0, then ConstantA == ConstValue.

Your comment should reflect that. Not here, but above, before ConstantA's instantiation.

Here, you can just add the new case:

// (mul x, (2^N + 1) * 2^M) => (shl (add (shl x, N), x), M)
haicheng updated this revision to Diff 77734.Nov 12 2016, 9:34 PM
haicheng marked an inline comment as done.

Address Renato's comment. Thank you.

mcrosier accepted this revision.Nov 14 2016, 6:52 AM
mcrosier edited edge metadata.

LGTM with a few nits about naming of variables.

lib/Target/AArch64/AArch64ISelLowering.cpp
7616โ€“7617

CAMinus1 -> SCVMinus1

7617

CAMinus1 -> SCVMinus1

7622

'Var' isn't descriptive. I'd prefer 'N0' as this is a common idiom in ISel lowering.

7638

I'd prefer 'ShiftedConstValue' over 'ConstantA'.

7688โ€“7702

Var -> N0

7688โ€“7702

Var -> N0

7688โ€“7702

Please add an assert showing that TrailingZeroes and NegateResult can't both be true.

Then the last bit of logic here can be written as:

// Negate the result.
if (NegateResult)
  return DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), Res);

// Shift the result.
if (TrailingZeroes)
  return DAG.getNode(ISD::SHL, DL, VT, Res, DAG.getConstant(TrailingZeroes, DL, MVT::i64));

return Res;
7689

Var -> N0

This revision is now accepted and ready to land.Nov 14 2016, 6:52 AM

Thanks for following up!
LGTM

lib/Target/AArch64/AArch64ISelLowering.cpp
7639

CAMinus1 is consistent with the comment. Perhaps ConstantAMinus1 would be even more clear.

mcrosier closed this revision.Nov 16 2016, 6:00 AM

This was committed in r287019.