This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold series of instructions into mull
ClosedPublic

Authored by Allen on Oct 15 2022, 5:01 AM.

Details

Summary

The following sequence should be folded into in0 * in1

In0Lo = in0 & 0xffffffff; In0Hi = in0 >> 32;
In1Lo = in1 & 0xffffffff; In1Hi = in1 >> 32;
m01 = In1Hi * In0Lo; m10 = In1Lo * In0Hi; m00 = In1Lo * In0Lo;
addc = m01 + m10;
ResLo = m00 + (addc >> 32);

Diff Detail

Event Timeline

Allen created this revision.Oct 15 2022, 5:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 15 2022, 5:01 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
Allen requested review of this revision.Oct 15 2022, 5:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 15 2022, 5:01 AM

What is the motivation of this change? I feel a little strange to do this in instcombine.
And if we really need to do this, we do need more negative tests.

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
856 ↗(On Diff #468010)

This pattern can work for any types with even bit width I think, not only i64.

864 ↗(On Diff #468010)

Need one-use here for addc.

Allen added a comment.EditedOct 15 2022, 7:16 AM

Thanks for your attention, I do this as there is case https://godbolt.org/z/x5jMhqW8s is our benchmark,
and the source is equel to an mull operater for two 64bits integer vaules, so it should be fold to similar assemble.
This is the 1st step try to generate the mul. so now I only enable it with i64 as the instruction umulh.

mul   x8,x0,x1
umulh x9,x0,x1
str   x8,[x2]
str   x9,[x3]
bcl5980 resigned from this revision.Oct 15 2022, 6:08 PM

Thanks for your attention, I do this as there is case https://godbolt.org/z/x5jMhqW8s is our benchmark,
and the source is equel to an mull operater for two 64bits integer vaules, so it should be fold to similar assemble.
This is the 1st step try to generate the mul. so now I only enable it with i64 as the instruction umulh.

mul   x8,x0,x1
umulh x9,x0,x1
str   x8,[x2]
str   x9,[x3]

Maybe you can do it in AArch64 SDAG if you are only interested in AArch64.
I think the detect pattern is too long in instcombine so I have a little worry about the change.
But I'm not senior enough to review the patch, so I will resign as reviewer.

Allen updated this revision to Diff 468205.Oct 17 2022, 8:00 AM

Add conditon m_OneUse(Addc)

Allen marked an inline comment as done.Oct 17 2022, 8:02 AM
Allen added inline comments.
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
856 ↗(On Diff #468010)

The source https://godbolt.org/z/x5jMhqW8s is equel to an mull operater for two 64bits integer vaules, so it should be fold to similar assemble.
This is the 1st step try to generate the mul, so now I only enable it with i64 as the instruction umulh

We're not creating a new multiply that is wider than we started with, so I'm assuming codegen can't be worse.
As mentioned earlier, the code should be generalized to handle any even bitwidth; we don't want highly type-specific transforms in IR canonicalization.
https://alive2.llvm.org/ce/z/2BqKLt

The commutative pattern matching doesn't look correct at first glance, so we need tests that exercise all of those possible patterns. The instructions with constants will always have the constant as operand 1, so you don't need to worry about those. But the 3 muls and 2 adds can all be commuted, so that's 16 potential patterns?

Since we are only creating a single new instruction, there's no need to check for m_OneUse on any of the existing values (but we should include at least one test with extra uses to show that works as expected).

Allen updated this revision to Diff 468860.Oct 19 2022, 3:57 AM

Delete condtion m_OneUse and I.getType()->getIntegerBitWidth() == 64, and Add relavant test cases

Allen marked an inline comment as done.Oct 19 2022, 4:51 AM
Allen added inline comments.
llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
856 ↗(On Diff #468010)

delete the checking I.getType()->getIntegerBitWidth() == 64, thanks.

spatel added inline comments.Oct 19 2022, 7:40 AM
llvm/lib/Transforms/InstCombine/InstCombineInternal.h
550 ↗(On Diff #468860)

There's no need to make a class function for this transform. Just create a static function above InstCombinerImpl::visitAdd().

Use the raw BinaryOperator::CreateMul() to return an Instruction, so we don't need to pass the Builder or use replaceInstUsesWith().

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
856 ↗(On Diff #468860)

The type check is insufficient in at least 2 ways and over-restrictive in other.

So we need at least 3 more tests like this:

define i9 @mul9_low(i9 %in0, i9 %in1) {
  %In0Lo = and i9 %in0, 15
  %In0Hi = lshr i9 %in0, 4
  %In1Lo = and i9 %in1, 15
  %In1Hi = lshr i9 %in1, 4
  %m10 = mul i9 %In1Hi, %In0Lo
  %m01 = mul i9 %In1Lo, %In0Hi
  %m00 = mul i9 %In1Lo, %In0Lo
  %addc = add i9 %m10, %m01
  %shl = shl i9 %addc, 4
  %addc9 = add i9 %shl, %m00
  ret i9 %addc9
}

define <2 x i8> @mul_v2i8_low(<2 x i8> %in0, <2 x i8> %in1) {
  %In0Lo = and <2 x i8> %in0, <i8 15, i8 15>
  %In0Hi = lshr <2 x i8> %in0, <i8 4, i8 4>
  %In1Lo = and <2 x i8> %in1, <i8 15, i8 15>
  %In1Hi = lshr <2 x i8> %in1, <i8 4, i8 4>
  %m10 = mul <2 x i8> %In1Hi, %In0Lo
  %m01 = mul <2 x i8> %In1Lo, %In0Hi
  %m00 = mul <2 x i8> %In1Lo, %In0Lo
  %addc = add <2 x i8> %m10, %m01
  %shl = shl <2 x i8> %addc, <i8 4, i8 4>
  %addc9 = add <2 x i8> %shl, %m00
  ret <2 x i8> %addc9
}

define i128 @mul128_low(i128 %in0, i128 %in1) {
  %In0Lo = and i128 %in0, 18446744073709551615
  %In0Hi = lshr i128 %in0, 64
  %In1Lo = and i128 %in1, 18446744073709551615
  %In1Hi = lshr i128 %in1, 64
  %m10 = mul i128 %In1Hi, %In0Lo
  %m01 = mul i128 %In1Lo, %In0Hi
  %m00 = mul i128 %In1Lo, %In0Lo
  %addc = add i128 %m10, %m01
  %shl = shl i128 %addc, 64
  %addc9 = add i128 %shl, %m00
  ret i128 %addc9
}
866 ↗(On Diff #468860)

The structure of these matches is confusing. I'd prefer to organize it more like this:

  // R = (CrossSum << HalfBits) + (XLo * YLo)
  Value *XLo, *YLo;
  Value *CrossSum;
  if (!match(&I, m_c_Add(m_Shl(m_Value(CrossSum), m_SpecificInt(HalfBits)),
                         m_Mul(m_Value(XLo), m_Value(YLo)))))
    return nullptr;

  // XLo = X & HalfMask
  // YLo = Y & HalfMask
  Value *X, *Y;
  if (!match(XLo, m_And(m_Value(X), m_SpecificInt(HalfMask))) ||
      !match(YLo, m_And(m_Value(Y), m_SpecificInt(HalfMask))))
    return nullptr;

  // CrossSum = (X' * (Y >> Halfbits)) + (Y' * (X >> HalfBits))
...

IIUC, X' can be either X or XLo in the pattern (and the same for Y'). You can probably use m_CombineOr(m_Specific(), m_Specific()) to match that with minimal code.

llvm/test/Transforms/InstCombine/mul.ll
1578 ↗(On Diff #468860)

The tests are incomplete for commutative patterns. As I said earlier, I think we need at least 16 tests to verify that the matching is working as expected.

Once we have the right tests in place, please pre-commit the baseline tests (CHECK lines without the code change), so we will only show diffs in this patch.

Allen updated this revision to Diff 469121.Oct 19 2022, 11:33 PM
Allen marked 2 inline comments as done.

1、 use BinaryOperator::CreateMul() to avoid the use of replaceInstUsesWith()
2、 Add 3 more cases according comment
3、 Use m_CombineOr to match that with minimal code
4、create a static function above InstCombinerImpl::visitAdd()

Allen updated this revision to Diff 469189.Oct 20 2022, 5:34 AM
Allen marked 2 inline comments as done.

update after precommit the testcases

spatel added inline comments.Oct 20 2022, 6:31 AM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
1270

Add a better description for the full transform. Something like:

/// Reduce a sequence of masked half-width multiplies to a single multiply.
/// ((XLow * YHigh) + (YLow * XHigh)) << HalfBits) + (XLow * YLow) --> X * Y
1271

Function names should start with lower-case letter. "Simplify" has a distinct meaning in LLVM combining - it suggests that we are not creating a new instruction. Even though it is misused in other places including in this file, we shouldn't do that again.

I suggest naming this "foldLongMultiply" or "foldBoxMultiply" ( https://www.ixl.com/math/grade-4/box-multiplication ) or something like that, so it's more obvious that we are reducing a sequence of mul and add to something else.

1275

I don't see a reason to exclude vectors from this transform. Just change this line?

unsigned BitWidth = I.getType()->getScalarSizeInBits();
1277

Similarly, why exclude wide widths? We're already using APInt::getMaxValue(), so just use that APInt in the m_SpecificInt() calls?

any chance we could get vector support/tests please?

llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
1296

What about if the AND has been removed by SimplifyDemandedBits? Maybe also test for KnownBits known leading zeros?

Allen added inline comments.Oct 20 2022, 7:36 AM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
1277

exclude the wide/vectors widths as hey are unusual get the IR from C/C++ code, and can be expand when needed later? or a seperate patch, now we already need too many cases to cover the pattern?

llvm/lib/Transforms/InstCombine/InstCombineInternal.h
550 ↗(On Diff #468860)

Done, thanks for detail suggestions

llvm/lib/Transforms/InstCombine/InstructionCombining.cpp
856 ↗(On Diff #468860)

Thanks for detail examples.

866 ↗(On Diff #468860)

Apply your comment, thanks

llvm/test/Transforms/InstCombine/mul.ll
1578 ↗(On Diff #468860)

Addressed in D136340

Allen updated this revision to Diff 469236.Oct 20 2022, 8:05 AM

a) rename function name to foldBoxMultiply and it's description
b) use APInt in m_SpecificInt directly
c) update getIntegerBitWidth with getScalarSizeInBits

llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
1296

Thanks for your suggestion, I'll record this issue, and I'll try out your suggestions later with a separate patch ?

Allen marked 2 inline comments as done.Oct 20 2022, 8:06 AM
RKSimon added inline comments.Oct 20 2022, 8:39 AM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
1296

A TODO comment is fine for now - cheers

spatel added inline comments.Oct 20 2022, 10:36 AM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
1277

There's really no difference in the testing - just change one test to i130 or something like that?
And the code difference is just to remove that clause in the if on line 1278 - nothing else changes?
But if you think there's some risk from handling that, then please add a TODO comment, so we can relax the constraint in a follow-up patch.

RKSimon added a subscriber: chfast.Oct 21 2022, 8:17 AM

CC @chfast who was looking at something similar in D56214

In https://reviews.llvm.org/D56214 similar pattern match was applied in AggressiveInstCombine.

Do you want me to submit test cases from there?

Allen updated this revision to Diff 469842.Oct 21 2022, 6:23 PM

rebase as the precommit tests update

Allen marked 5 inline comments as done.Oct 21 2022, 6:26 PM

In https://reviews.llvm.org/D56214 similar pattern match was applied in AggressiveInstCombine.

Do you want me to submit test cases from there?

Yes please @chfast, if you think we can just use this patch then maybe just move them (and tweak for -instcombine).

In https://reviews.llvm.org/D56214 similar pattern match was applied in AggressiveInstCombine.

Do you want me to submit test cases from there?

Yes please @chfast, if you think we can just use this patch then maybe just move them (and tweak for -instcombine).

Added in https://reviews.llvm.org/rG119c34e7f9c66dbdb77f69d67bb50507c91dc2ef.

@Allen please can you rebase?

Allen updated this revision to Diff 470033.Oct 23 2022, 7:14 PM

rebase top as the precommit test

Allen added a comment.Oct 23 2022, 7:16 PM

@Allen please can you rebase?

Done, thanks @RKSimon/@chfast for your precommit tests.

Please rebase again after 41c42f5b1825 / 56c6b612aed1.
If I did that correctly, we won't see any changes for the final value in each test from this revision, but we'll test this patch directly and get a better coverage for commuted patterns.
After that, I think this patch will be complete.

chfast added inline comments.Oct 24 2022, 7:15 AM
llvm/test/Transforms/InstCombine/mul_full_64.ll
452

Interestingly, it hasn't folded this one.

Allen added a comment.Oct 24 2022, 7:19 AM

Please rebase again after 41c42f5b1825 / 56c6b612aed1.
If I did that correctly, we won't see any changes for the final value in each test from this revision, but we'll test this patch directly and get a better coverage for commuted patterns.
After that, I think this patch will be complete.

Done, thanks very much for your changes. And I don't completely understand why need the use at the beginning of a function? eg:

define i8 @mul8_low_A0_B2(i8 %in0, i8 %p) {
  %in1 = call i8 @use8(i8 %p) ; thwart complexity-based canonicalization
  %In0Lo = and i8 %in0, 15
  %In0Hi = lshr i8 %in0, 4
  %In1Lo = and i8 %in1, 15
  %In1Hi = lshr i8 %in1, 4
  %m10 = mul i8 %In1Hi, %in0
  %m01 = mul i8 %in1, %In0Hi
  %m00 = mul i8 %In1Lo, %In0Lo
  %addc = add i8 %m01, %m10
  %shl = shl i8 %addc, 4
  %retLo = add i8 %shl, %m00
  ret i8 %retLo
}
spatel accepted this revision.Oct 24 2022, 9:16 AM

LGTM

Done, thanks very much for your changes. And I don't completely understand why need the use at the beginning of a function? eg:

define i8 @mul8_low_A0_B2(i8 %in0, i8 %p) {
  %in1 = call i8 @use8(i8 %p) ; thwart complexity-based canonicalization

If you remove that line, notice that the values in the later multiply get commuted. That happens before we reach this transform, so the test is trying to ensure that the exact placement of the values at runtime is the same as specified in the test.

llvm/test/Transforms/InstCombine/mul_full_64.ll
452

This patch assumes we are ending with an "add", but this test changes to an "or". We'd need to add another check for hasNoCommonBitsSet() to catch it?

Here's another potential fold:
https://alive2.llvm.org/ce/z/hUm56R
...but it needs to freeze the inputs to be poison-safe because they have multiple uses.

This revision is now accepted and ready to land.Oct 24 2022, 9:16 AM

If you remove that line, notice that the values in the later multiply get commuted. That happens before we reach this transform, so the test is trying to ensure that the exact placement of the values at runtime is the same as specified in the test.

Thanks very much for your guidance.

This revision was automatically updated to reflect the committed changes.
Allen added a subscriber: tgt.Oct 24 2022, 6:56 PM
Allen added inline comments.
llvm/test/Transforms/InstCombine/mul_full_64.ll
452

hi @chfast

I think the case **@mullo** should not be matched? https://alive2.llvm.org/ce/z/jH4kU7

hi, @spatel

 As the case in link https://alive2.llvm.org/ce/z/hUm56R, it's result not equal to **mul i8 %y, %x**, so it need some other logic to match ? maybe defined with a new helper function. see detail https://alive2.llvm.org/ce/z/FEgEU7
```

define i8 @tgt(i8 %x, i8 %y) {

%m = mul i8 %y, %x
ret i8 %m

}

chfast added inline comments.Oct 24 2022, 11:56 PM
llvm/test/Transforms/InstCombine/mul_full_64.ll
452
I think the case **@mullo** should not be matched? https://alive2.llvm.org/ce/z/jH4kU7

There is a typo in the example. You changed or to and but the original pattern starts at add. I.e. all patterns starting at add, or and xor should work, the one starting at and should not. https://alive2.llvm.org/ce/z/y26zaW

I'm not sure it is worth to expand the matching to or and `xor.

Allen added inline comments.Oct 31 2022, 7:32 PM
llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp
1296

hi @RKSimon

As this revision is accept, so it is time to consider your refactor suggestion, do you have some idea about the extra tests ? thanks.
llvm/test/Transforms/InstCombine/mul_full_64.ll
452

Thanks @chfast for your case. I take a look at your case more, except the above add VS or, there is some other diffirence with my initail case. https://alive2.llvm.org/ce/z/ZKmrJB