This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold mul nuw+lshr to a single multiplication when the latter is a factor
ClosedPublic

Authored by bcl5980 on Apr 9 2022, 4:40 AM.

Details

Summary

if c is divisible by (1 << ShAmtC), we can fold this pattern:
lshr (mul nuw x, c), ShAmtC -> mul nuw x, (c >> ShAmtC)

https://alive2.llvm.org/ce/z/ox4wAt

Fix https://github.com/llvm/llvm-project/issues/54824

Diff Detail

Event Timeline

bcl5980 created this revision.Apr 9 2022, 4:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 9 2022, 4:40 AM
bcl5980 requested review of this revision.Apr 9 2022, 4:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 9 2022, 4:40 AM

Comment should be more clear

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1172

mul nuw, c??

1172

qc is what?

bcl5980 added inline comments.Apr 9 2022, 6:44 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1172

I'm sorry, should be :
lshr exact (mul nuw x, c * (1 << ShAmtC)), ShAmtC -> mul nuw, c
I will fix the comment later

bcl5980 updated this revision to Diff 421727.Apr 9 2022, 6:47 AM
bcl5980 edited the summary of this revision. (Show Details)

fix typo

xbolva00 added inline comments.Apr 9 2022, 7:01 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1172

mul nuw x, c?

bcl5980 added inline comments.Apr 9 2022, 7:16 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1172

yeah

craig.topper added inline comments.Apr 9 2022, 8:36 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1172

c * (1 << ShAmtC) is (c << ShAmtC) isn't it?

craig.topper added inline comments.Apr 9 2022, 8:50 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1165–1166

This comment needs to move inside the first.

Please can you post the link to the general proof?

Please can you post the link to the general proof?

I'm sorry but can you help to explain what the general proof is?

bcl5980 updated this revision to Diff 421744.Apr 9 2022, 10:14 AM
bcl5980 edited the summary of this revision. (Show Details)

Fix comment
Update alive2 link

Please can you post the link to the general proof?

I'm sorry but can you help to explain what the general proof is?

The one you linked has hardcoded constants, yet the constants aren't hardcoded in the transform itself.

xbolva00 added inline comments.Apr 9 2022, 10:19 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1176

Once again:

mul nuw x, c

bcl5980 updated this revision to Diff 421745.Apr 9 2022, 10:37 AM
bcl5980 edited the summary of this revision. (Show Details)

Fix comment
Update general proof link

bcl5980 edited the summary of this revision. (Show Details)Apr 9 2022, 10:38 AM
bcl5980 edited the summary of this revision. (Show Details)

Please can you post the link to the general proof?

I'm sorry but can you help to explain what the general proof is?

The one you linked has hardcoded constants, yet the constants aren't hardcoded in the transform itself.

Thanks for the mentioning. I'll pay attention next time for constants.

bcl5980 marked 6 inline comments as done.Apr 9 2022, 10:47 AM
bcl5980 added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1176

Sorry for that. Miss again

lebedev.ri edited the summary of this revision. (Show Details)Apr 9 2022, 10:49 AM
lebedev.ri edited the summary of this revision. (Show Details)
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177
  1. No one-use check needed, only a single instruction is produced.
  2. I believe, while that lshr is obviously exact, your own proof shows that there is no need to check that it is marked as such?
bcl5980 marked an inline comment as done.Apr 9 2022, 11:08 AM
bcl5980 added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177

If there is no one-use, we may generate 2 mul instead of mul+shr

define i64 @lshr_mul_negative_oneuse(i64 %0) {
; CHECK-LABEL: @lshr_mul_negative_oneuse(
; CHECK-NEXT:    [[TMP2:%.*]] = mul nuw i64 [[TMP0:%.*]], 52
; CHECK-NEXT:    call void @use(i64 [[TMP2]])
; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw i64 [[TMP0]], 13
; CHECK-NEXT:    ret i64 [[TMP3]]
;
  %2 = mul nuw i64 %0, 52
  call void @use(i64 %2)
  %3 = lshr i64 %2, 2
  ret i64 %3
}

We need a condition to make sure ShAmtC is divisible by NewMulC. The proof use shl that is always sure. But in the code we still need to check eact flag.

bcl5980 added inline comments.Apr 9 2022, 11:12 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177

We need a condition to make sure ShAmtC is divisible by NewMulC. The proof use shl that is always sure. But in the code we still need to check eact flag.

Should be "make sure MulC is divisible by NewMulC"

lebedev.ri added inline comments.Apr 9 2022, 11:16 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177

If there is no one-use, we may generate 2 mul instead of mul+shr

We only create only a single mul here, do we not?

We need a condition to make sure ShAmtC is divisible by NewMulC. The proof use shl that is always sure. But in the code we still need to check eact flag.

Can you show the counter-proof that shows that not checking for exact is incorrect?

bcl5980 added inline comments.Apr 9 2022, 11:22 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177

If there is no one-use, we may generate 2 mul instead of mul+shr

We only create only a single mul here, do we not?

Yeah, we create only a single mul, but most of time mul should be heavier than lshr, am I right?

We need a condition to make sure ShAmtC is divisible by NewMulC. The proof use shl that is always sure. But in the code we still need to check eact flag.

Can you show the counter-proof that shows that not checking for exact is incorrect?

I don't know how to counter-proof with alive2 but this is the negative case on my machine after remove exact:

define i64 @lshr_mul_negative_noexact(i64 %0) {
; CHECK-LABEL: @lshr_mul_negative_noexact(
; CHECK-NEXT:    [[TMP2:%.*]] = mul nuw i64 [[TMP0:%.*]], 13
; CHECK-NEXT:    ret i64 [[TMP2]]
;
  %2 = mul nuw i64 %0, 53
  %3 = lshr i64 %2, 2
  ret i64 %3
}

I see. Then the proof is wrong.

bcl5980 added inline comments.Apr 9 2022, 11:32 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177

If there is no one-use, we may generate 2 mul instead of mul+shr

We only create only a single mul here, do we not?

declare i64 @use(i64, i64)
define i64 @lshr_mul_negative_oneuse(i64 %0) {
; CHECK-LABEL: @lshr_mul_negative_oneuse(
; CHECK-NEXT:    [[TMP2:%.*]] = mul nuw i64 [[TMP0:%.*]], 52
; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw i64 [[TMP0]], 13
; CHECK-NEXT:    [[TMP4:%.*]] = call i64 @use(i64 [[TMP2]], i64 [[TMP3]])
; CHECK-NEXT:    ret i64 [[TMP4]]
;
  %2 = mul nuw i64 %0, 52
  %3 = lshr i64 %2, 2
  %4 = call i64 @use(i64 %2, i64 %3)
  ret i64 %4
}

This case should more clear why we need one use I think.

bcl5980 updated this revision to Diff 421748.Apr 9 2022, 11:36 AM

Fix test error

I see. Then the proof is wrong.

I'm sorry I really don't know how to proof by alive2. Can you teach me how to proof it?

bcl5980 updated this revision to Diff 421751.Apr 9 2022, 12:11 PM
bcl5980 edited the summary of this revision. (Show Details)

Fix comment

I see. Then the proof is wrong.

I'm sorry I really don't know how to proof by alive2. Can you teach me how to proof it?

You need to either adjust the fold to do what the proof says, or write another proof for the change in question.

Abstracting a bit, is there a generalization of this pattern where the shift amounts don't match?

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177

The number of instructions didn't increase, so we're fine.

lebedev.ri added a comment.EditedApr 9 2022, 12:59 PM

For example, what about this:
https://alive2.llvm.org/ce/z/aov-GB
https://alive2.llvm.org/ce/z/ox4wAt (no need for exact, only nuw)

For example, what about this:
https://alive2.llvm.org/ce/z/aov-GB
https://alive2.llvm.org/ce/z/ox4wAt (no need for exact, only nuw)

%t0 = lshr i8 %C1, %C2
%t1 = shl i8 %t0, %C2
%precond = icmp eq i8 %t1, %C1
call void @llvm.assume(i1 %precond)

This is the proof of exact. We can't remove exact check in the code. Or we need to do something like:

if (Op0->hasOneUse()) {
  APInt NewMulC = MulC->lshr(ShAmtC);
  if (MulC->eq(NewMulC.shl(ShAmtC)))
    return BinaryOperator::CreateNUWMul(X, ConstantInt::get(Ty, NewMulC));
}
bcl5980 added inline comments.Apr 9 2022, 8:46 PM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1177

The number of instructions didn't increase, so we're fine.

I still insist one-use, mul is slower than lshr on most targets.

For example, what about this:
https://alive2.llvm.org/ce/z/aov-GB
https://alive2.llvm.org/ce/z/ox4wAt (no need for exact, only nuw)

%t0 = lshr i8 %C1, %C2
%t1 = shl i8 %t0, %C2
%precond = icmp eq i8 %t1, %C1
call void @llvm.assume(i1 %precond)

This is the proof of exact. We can't remove exact check in the code. Or we need to do something like:

if (Op0->hasOneUse()) {
  APInt NewMulC = MulC->lshr(ShAmtC);
  if (MulC->eq(NewMulC.shl(ShAmtC)))
    return BinaryOperator::CreateNUWMul(X, ConstantInt::get(Ty, NewMulC));
}

Couldn't the exact be set because the LHS of the multiply is known to have trailing zero bits and have nothing to do with the trailing zeros of the constant on the RHS. Would the transform still be valid in that case?

For example, what about this:
https://alive2.llvm.org/ce/z/aov-GB
https://alive2.llvm.org/ce/z/ox4wAt (no need for exact, only nuw)

%t0 = lshr i8 %C1, %C2
%t1 = shl i8 %t0, %C2
%precond = icmp eq i8 %t1, %C1
call void @llvm.assume(i1 %precond)

This is the proof of exact. We can't remove exact check in the code. Or we need to do something like:

if (Op0->hasOneUse()) {
  APInt NewMulC = MulC->lshr(ShAmtC);
  if (MulC->eq(NewMulC.shl(ShAmtC)))
    return BinaryOperator::CreateNUWMul(X, ConstantInt::get(Ty, NewMulC));
}

Couldn't the exact be set because the LHS of the multiply is known to have trailing zero bits and have nothing to do with the trailing zeros of the constant on the RHS. Would the transform still be valid in that case?

Thanks for the finding. That's what I haven't expect this case. So we should use the check if (MulC->eq(NewMulC.shl(ShAmtC))) to avoid mul's LHS involve exact.

bcl5980 updated this revision to Diff 421765.Apr 9 2022, 9:17 PM
bcl5980 edited the summary of this revision. (Show Details)
bcl5980 edited the summary of this revision. (Show Details)
bcl5980 updated this revision to Diff 421766.Apr 9 2022, 9:24 PM
bcl5980 retitled this revision from [InstCombine] Fold mul nuw+lshr exact to a single multiplication when the latter is a factor to [InstCombine] Fold mul nuw+lshr to a single multiplication when the latter is a factor.Apr 10 2022, 9:42 PM
craig.topper added inline comments.Apr 11 2022, 9:58 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1172

Not related to this patch, but this existing transform is incorrect if MulC happens to be 2 and BitWidth is 2 and ShAmtC is 1. MulC-1 would be 1 which is a power 2 but it doesn't have 2 bits set like the transform expects. Hopefully we would always canonicalize the Mul to a shl first so this transform won't fire for that case, but I'm not sure. @spatel @lebedev.ri

spatel added inline comments.Apr 11 2022, 12:56 PM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1172

Good catch. I'm not sure how to write a test for it either, but I added a "BitWidth > 2" clause:
1206a18d417a

bcl5980 updated this revision to Diff 422123.Apr 11 2022, 10:07 PM

fix build error

@lebedev.ri @craig.topper
What can I do now to continue the patch?

  1. Please pre-commit the tests with baseline CHECKs. Add one more positive test with vector type, so we know that works correctly -- at least for splat (uniform) constants. A negative test with 'nsw' would also be good. What happens if we have both 'nsw' and 'nuw'?
  1. There's an existing fold for: (X * C2) << C1 --> X * (C2 << C1)

...and it does not check for one-use. For consistency, we probably don't want the one-use restriction here either. If there's already a multiply in the pattern before this transform, another one is probably fine? The backend could theoretically decompose it back to shift (but I don't think we have that transform currently).

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1178–1179

Replace 'c' with 'MulC' in this comment to make it clearer.

bcl5980 added a comment.EditedApr 20 2022, 12:13 AM
void test(unsigned* pa, unsigned* pb, unsigned short i)
{
    unsigned a = i * 100;
    unsigned b = a >> 2;
    *pa = a;
    *pb = b;
}

I write a test to verify backend can optimize the pattern without one use.
This is the AArch64 baseline with one-use:

// %bb.0:                               // %entry
	and	w8, w2, #0xffff
	mov	w9, #100
	mul	w8, w8, w9
	lsr	w9, w8, #2
	str	w8, [x0]
	str	w9, [x1]
	ret
                                        // -- End function

This is the AArch64 result we remove one-use:

"?test@@YAXPEAI0G@Z":                   // @"?test@@YAXPEAI0G@Z"
// %bb.0:                               // %entry
	mov	w8, #100
	and	w9, w2, #0xffff
	mov	w10, #25
	mul	w8, w9, w8
	mul	w9, w9, w10
	str	w8, [x0]
	str	w9, [x1]
	ret
                                        // -- End function

This is the X86 baseline with one-use:

"?test@@YAXPEAI0G@Z":                   # @"?test@@YAXPEAI0G@Z"
# %bb.0:                                # %entry
	movzwl	%r8w, %eax
	imull	$100, %eax, %eax
	movl	%eax, (%rcx)
	shrl	$2, %eax
	movl	%eax, (%rdx)
	retq
                                        # -- End function

This is the X86 result we remove one-use:

"?test@@YAXPEAI0G@Z":                   # @"?test@@YAXPEAI0G@Z"
# %bb.0:                                # %entry
	movzwl	%r8w, %eax
	imull	$100, %eax, %r8d
	leal	(%rax,%rax,4), %eax
	leal	(%rax,%rax,4), %eax
	movl	%r8d, (%rcx)
	movl	%eax, (%rdx)
	retq
                                        # -- End function

Backend is not easy to figure out the case as we need to loop all use of mul's operand 0 to find the candidate.

  1. There's an existing fold for: (X * C2) << C1 --> X * (C2 << C1)

...and it does not check for one-use. For consistency, we probably don't want the one-use restriction here either. If there's already a multiply in the pattern before this transform, another one is probably fine? The backend could theoretically decompose it back to shift (but I don't think we have that transform currently).

Can we add one-use for this pattern also? https://godbolt.org/z/x3bo7q54j

bcl5980 updated this revision to Diff 423835.Apr 20 2022, 1:25 AM

update comments and rebase

Backend is not easy to figure out the case as we need to loop all use of mul's operand 0 to find the candidate.

Ok - I agree that the posted examples don't look like wins. There is no quick fix for the backend, so we can leave the one-use check in for now. Please add a code comment though to explain why have the one-use limitation.

If you want to make the other transform (for "shl") consistent with this one by adding a one-use check to it too, that can be a follow-up patch. That transform has been around for a long time, so it is possible that someone may notice an independent performance difference from that change.

llvm/test/Transforms/InstCombine/shift-logic.ll
256–304

If we do not propagate the 'nsw' on this, we will not be able to recover it in general:
https://alive2.llvm.org/ce/z/tbckkt

bcl5980 updated this revision to Diff 423915.Apr 20 2022, 8:31 AM

keep nsw flag

bcl5980 marked an inline comment as done.Apr 20 2022, 8:32 AM
spatel accepted this revision.Apr 20 2022, 8:46 AM

LGTM - see inline remark for minor edit to a code comment.

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
1176

I would phrase this differently:

// The one-use check is not strictly necessary, but codegen may not be able 
// to invert the transform and perf may suffer with an extra mul instruction.
This revision is now accepted and ready to land.Apr 20 2022, 8:46 AM
This revision was landed with ongoing or failed builds.Apr 20 2022, 9:15 AM
This revision was automatically updated to reflect the committed changes.