if c is divisible by (1 << ShAmtC), we can fold this pattern:
lshr (mul nuw x, c), ShAmtC -> mul nuw x, (c >> ShAmtC)
Details
Diff Detail
Event Timeline
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1172 | I'm sorry, should be : |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1172 | mul nuw x, c? |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1172 | yeah |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1172 | c * (1 << ShAmtC) is (c << ShAmtC) isn't it? |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1165–1166 | This comment needs to move inside the first. |
The one you linked has hardcoded constants, yet the constants aren't hardcoded in the transform itself.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1176 | Once again: mul nuw x, c |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1176 | Sorry for that. Miss again |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 |
|
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 | If there is no one-use, we may generate 2 mul instead of mul+shr define i64 @lshr_mul_negative_oneuse(i64 %0) { ; CHECK-LABEL: @lshr_mul_negative_oneuse( ; CHECK-NEXT: [[TMP2:%.*]] = mul nuw i64 [[TMP0:%.*]], 52 ; CHECK-NEXT: call void @use(i64 [[TMP2]]) ; CHECK-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP0]], 13 ; CHECK-NEXT: ret i64 [[TMP3]] ; %2 = mul nuw i64 %0, 52 call void @use(i64 %2) %3 = lshr i64 %2, 2 ret i64 %3 } We need a condition to make sure ShAmtC is divisible by NewMulC. The proof use shl that is always sure. But in the code we still need to check eact flag. |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 |
Should be "make sure MulC is divisible by NewMulC" |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 |
We only create only a single mul here, do we not?
Can you show the counter-proof that shows that not checking for exact is incorrect? |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 |
Yeah, we create only a single mul, but most of time mul should be heavier than lshr, am I right?
I don't know how to counter-proof with alive2 but this is the negative case on my machine after remove exact: define i64 @lshr_mul_negative_noexact(i64 %0) { ; CHECK-LABEL: @lshr_mul_negative_noexact( ; CHECK-NEXT: [[TMP2:%.*]] = mul nuw i64 [[TMP0:%.*]], 13 ; CHECK-NEXT: ret i64 [[TMP2]] ; %2 = mul nuw i64 %0, 53 %3 = lshr i64 %2, 2 ret i64 %3 } |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 |
declare i64 @use(i64, i64) define i64 @lshr_mul_negative_oneuse(i64 %0) { ; CHECK-LABEL: @lshr_mul_negative_oneuse( ; CHECK-NEXT: [[TMP2:%.*]] = mul nuw i64 [[TMP0:%.*]], 52 ; CHECK-NEXT: [[TMP3:%.*]] = mul nuw i64 [[TMP0]], 13 ; CHECK-NEXT: [[TMP4:%.*]] = call i64 @use(i64 [[TMP2]], i64 [[TMP3]]) ; CHECK-NEXT: ret i64 [[TMP4]] ; %2 = mul nuw i64 %0, 52 %3 = lshr i64 %2, 2 %4 = call i64 @use(i64 %2, i64 %3) ret i64 %4 } This case should more clear why we need one use I think. |
I'm sorry I really don't know how to proof by alive2. Can you teach me how to proof it?
You need to either adjust the fold to do what the proof says, or write another proof for the change in question.
Abstracting a bit, is there a generalization of this pattern where the shift amounts don't match?
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 | The number of instructions didn't increase, so we're fine. |
For example, what about this:
https://alive2.llvm.org/ce/z/aov-GB
https://alive2.llvm.org/ce/z/ox4wAt (no need for exact, only nuw)
%t0 = lshr i8 %C1, %C2 %t1 = shl i8 %t0, %C2 %precond = icmp eq i8 %t1, %C1 call void @llvm.assume(i1 %precond)
This is the proof of exact. We can't remove exact check in the code. Or we need to do something like:
if (Op0->hasOneUse()) { APInt NewMulC = MulC->lshr(ShAmtC); if (MulC->eq(NewMulC.shl(ShAmtC))) return BinaryOperator::CreateNUWMul(X, ConstantInt::get(Ty, NewMulC)); }
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1177 |
I still insist one-use, mul is slower than lshr on most targets. |
Couldn't the exact be set because the LHS of the multiply is known to have trailing zero bits and have nothing to do with the trailing zeros of the constant on the RHS. Would the transform still be valid in that case?
Thanks for the finding. That's what I haven't expect this case. So we should use the check if (MulC->eq(NewMulC.shl(ShAmtC))) to avoid mul's LHS involve exact.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1172 | Not related to this patch, but this existing transform is incorrect if MulC happens to be 2 and BitWidth is 2 and ShAmtC is 1. MulC-1 would be 1 which is a power 2 but it doesn't have 2 bits set like the transform expects. Hopefully we would always canonicalize the Mul to a shl first so this transform won't fire for that case, but I'm not sure. @spatel @lebedev.ri |
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1172 | Good catch. I'm not sure how to write a test for it either, but I added a "BitWidth > 2" clause: |
- Please pre-commit the tests with baseline CHECKs. Add one more positive test with vector type, so we know that works correctly -- at least for splat (uniform) constants. A negative test with 'nsw' would also be good. What happens if we have both 'nsw' and 'nuw'?
- There's an existing fold for: (X * C2) << C1 --> X * (C2 << C1)
...and it does not check for one-use. For consistency, we probably don't want the one-use restriction here either. If there's already a multiply in the pattern before this transform, another one is probably fine? The backend could theoretically decompose it back to shift (but I don't think we have that transform currently).
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1178–1179 | Replace 'c' with 'MulC' in this comment to make it clearer. |
void test(unsigned* pa, unsigned* pb, unsigned short i) { unsigned a = i * 100; unsigned b = a >> 2; *pa = a; *pb = b; }
I write a test to verify backend can optimize the pattern without one use.
This is the AArch64 baseline with one-use:
// %bb.0: // %entry and w8, w2, #0xffff mov w9, #100 mul w8, w8, w9 lsr w9, w8, #2 str w8, [x0] str w9, [x1] ret // -- End function
This is the AArch64 result we remove one-use:
"?test@@YAXPEAI0G@Z": // @"?test@@YAXPEAI0G@Z" // %bb.0: // %entry mov w8, #100 and w9, w2, #0xffff mov w10, #25 mul w8, w9, w8 mul w9, w9, w10 str w8, [x0] str w9, [x1] ret // -- End function
This is the X86 baseline with one-use:
"?test@@YAXPEAI0G@Z": # @"?test@@YAXPEAI0G@Z" # %bb.0: # %entry movzwl %r8w, %eax imull $100, %eax, %eax movl %eax, (%rcx) shrl $2, %eax movl %eax, (%rdx) retq # -- End function
This is the X86 result we remove one-use:
"?test@@YAXPEAI0G@Z": # @"?test@@YAXPEAI0G@Z" # %bb.0: # %entry movzwl %r8w, %eax imull $100, %eax, %r8d leal (%rax,%rax,4), %eax leal (%rax,%rax,4), %eax movl %r8d, (%rcx) movl %eax, (%rdx) retq # -- End function
Backend is not easy to figure out the case as we need to loop all use of mul's operand 0 to find the candidate.
Can we add one-use for this pattern also? https://godbolt.org/z/x3bo7q54j
Ok - I agree that the posted examples don't look like wins. There is no quick fix for the backend, so we can leave the one-use check in for now. Please add a code comment though to explain why have the one-use limitation.
If you want to make the other transform (for "shl") consistent with this one by adding a one-use check to it too, that can be a follow-up patch. That transform has been around for a long time, so it is possible that someone may notice an independent performance difference from that change.
llvm/test/Transforms/InstCombine/shift-logic.ll | ||
---|---|---|
256–304 | If we do not propagate the 'nsw' on this, we will not be able to recover it in general: |
LGTM - see inline remark for minor edit to a code comment.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp | ||
---|---|---|
1176 | I would phrase this differently: // The one-use check is not strictly necessary, but codegen may not be able // to invert the transform and perf may suffer with an extra mul instruction. |
This comment needs to move inside the first.