Teach AggressiveInstCombine's TruncInstCombine how to include ICmp instructions in chains that can be type-reduced.
This seems not generic enough to me. Can we not require the icmp operands to be constants/[sz]ext's,
but instead try to see whether the icmp can be evaluated in a smaller bitwidth (which is what the rest of the code here does)?
As long as the icmp can be shrunk to a bitwidth at least as small as the one we need to get rid of the cast,
I think we can always pick the actual bitwidth we'll use, which might be wider than the minimum.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 171: You want to operate on APInt, not assuming that it fits into 64 bits.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 224: Hm, can't we already get a constant operand in one of the supported instructions?
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 224: Ah, I see what I'm missing: AIC does not perform DCE,
After thinking about this, I'd be okay with not doing that straight away.
It should be doable, but it would require a substantial redesign to nicely model more than one DAG and their connections/dependencies.
Exactly, there are many improvements that could be added to this pass in many ways.
What I add here is a small and simple improvement that doesn't need a serious redesign.
It has its added value (not a big one, though) at a minimal cost.
I think this is missing a test-case where the select could be truncated if you ignored the icmp, but not if you require the icmp to also be truncated. That should show up as a regression in the test diffs.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 173: Why copy?
  Line 178: APInt::getActiveBits()
  Line 180: Correct me if I'm wrong, but won't this just work?

    static unsigned getConstMinBitWidth(bool IsSigned, ConstantInt *C) {
      const APInt &Val = C->getValue();
      unsigned NonSignBits = Val.getBitWidth() - Val.getNumSignBits();
      return IsSigned ? 1 + NonSignBits : NonSignBits;
    }
Not sure I understand what you mean exactly.
Shrinking the cmp is not *required* in order to shrink the select. If there is a cmp instruction that can be shrunk along with the select, we shrink it. If not, it does not affect the shrinking of the select.
Is the following test case relevant?
define dso_local i16 @cmp_shrink_select_not_cmp(i8 %a, i8 %b, i32 %c, i32 %d) {
; CHECK-LABEL: @cmp_shrink_select_not_cmp(
; CHECK-NEXT: entry:
; CHECK-NEXT: [[CONV:%.*]] = sext i8 [[A:%.*]] to i16
; CHECK-NEXT: [[CONV2:%.*]] = sext i8 [[B:%.*]] to i16
; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[C:%.*]], [[D:%.*]]
; CHECK-NEXT: [[COND:%.*]] = select i1 [[CMP]], i16 [[CONV2]], i16 [[CONV]]
; CHECK-NEXT: ret i16 [[COND]]
;
entry:
  %conv = sext i8 %a to i32
  %conv2 = sext i8 %b to i32
  %cmp = icmp slt i32 %c, %d
  %cond = select i1 %cmp, i32 %conv2, i32 %conv
  %conv4 = trunc i32 %cond to i16
  ret i16 %conv4
}
@aymanmus I have something along these lines in mind...
%conv = sext i8 %a to i32
%conv2 = sext i8 %b to i32
%conv3 = sext i8 %c to i32
%cmp = icmp slt i32 %conv3, TOO_BIG_CONST
%cond = select i1 %cmp, i32 %conv2, i32 %conv
%conv4 = trunc i32 %cond to i16
ret i16 %conv4
That is, where the icmp has a form that passes the isConstOrExt(C->getOperand(0)) && isConstOrExt(C->getOperand(1)) check, but later gets rejected due to constant bitwidth.
Not sure if that example does it, but I think there must be some case like this, unless I'm misunderstanding the patch.
Totally agree.
Fixed the code to exclude the ICmp instruction from the part where we make the decision whether to transform or not.
The ICmp instruction can only be replaced once we are already applying the transformation, and only if its operands can fit in the type we are converting to.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 180: IsSigned indicates whether the use of the constant is signed or not. However, the previous comments are 100% legit.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 178: Please explicitly spell out unsigned here
  Line 180: Right. What about this then?

    static unsigned getConstMinBitWidth(bool IsSigned, ConstantInt *C) {
      const APInt &Val = C->getValue();
      return IsSigned ? Val.getMinSignedBits() : Val.getActiveBits();
    }
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 377: Doesn't the sext/zext case have to check the IsSigned flag as well? I don't think it's as simple as IsSigned => sext and !IsSigned => zext (e.g. it's fine if you have an equality comparison and both use the same extension type), but the same correctness checks are needed here.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Lines 369–370: Wait, why are we giving up on non-scalars here?
Fix the constant aux functions to correctly handle vector constants.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 377: The IsSigned is not relevant in this part of the code.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 377: Not sure I get it. Converting (zext i16 %x to i32) ult (sext i16 %y to i32) to %x ult %y is not legal (https://rise4fun.com/Alive/wVZy), and I don't think anything prevents that from happening with your current code.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 377: Now I understand.
This is a lot more complicated than expected :(
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 41: I'd suggest Scl -> Scalar, we're not saving a lot of characters here :)
  Line 49: return false doesn't make sense here. I believe that in the case where it is not analyzable, you need to return the bitwidth of the type. It should be possible to test this by making the constant a constant expression.
  Line 81: nit: "fits in Ty in case"
  Line 109: Is it really sufficient that just one of them is a constant? Say we have (zext i16 %x to i32) ult -1, which is always true; converting it to i16 %x ult -1 would no longer be always true.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 109: Making sure the const operand is valid for shrinking is done before we reach this point (line 82):

    if (Ty && Ty->getScalarSizeInBits() < getConstMinBitWidth(IsSigned, C))
      return false;

  If we reach this point, then the constant is valid.
I have not looked at the code changes, but I noticed the first few tests are all min/max patterns. Can you check your motivating apps/benchmarks to see if rGf2fbdf76d8d0 helps?
Just checked the patch @spatel mentioned.
Doesn't really help with the cases provided here.
Thanks for checking. In that case, it would be helpful to comment on which of the tests are the real motivation (and possibly remove the tests that are not canonical IR).
For example, -instcombine alone is enough to reduce these:
define dso_local i16 @cmp_shrink_select_not_cmp(i8 %a, i8 %b, i32 %c, i32 %d) {
  %conv = sext i8 %a to i32
  %conv2 = sext i8 %b to i32
  %cmp = icmp slt i32 %c, %d
  %cond = select i1 %cmp, i32 %conv2, i32 %conv
  %conv4 = trunc i32 %cond to i16
  ret i16 %conv4
}

define i16 @cmp_select_bigConst_cmp(i8 %a, i8 %b, i8 %c) {
  %conv = sext i8 %a to i32
  %conv2 = sext i8 %b to i32
  %conv3 = sext i8 %c to i32
  %cmp = icmp slt i32 %conv3, 70000
  %cond = select i1 %cmp, i32 %conv2, i32 %conv
  %conv4 = trunc i32 %cond to i16
  ret i16 %conv4
}
$ ./opt -instcombine cmpsel.ll -S

define dso_local i16 @cmp_shrink_select_not_cmp(i8 %a, i8 %b, i32 %c, i32 %d) {
entry:
  %cmp = icmp slt i32 %c, %d
  %cond.v = select i1 %cmp, i8 %b, i8 %a
  %conv4 = sext i8 %cond.v to i16
  ret i16 %conv4
}

define i16 @cmp_select_bigConst_cmp(i8 %a, i8 %b, i8 %c) {
  %conv4 = sext i8 %b to i16
  ret i16 %conv4
}
@spatel: Basically the test cases with the real motivation are the first 4.
Most of the other cases are there to check various "edge" cases of the added code's behavior.
I think that even if some of the tests are not in canonical form (and can be optimized by instcombine), they still add value here by letting us check the behavior of this specific pass on similar cases.
Don't you agree?
Do you mean the first 4 with diffs, or the first 4 that are being added as new tests?
> Most of the other cases are there to check various "edge" cases of the added code's behavior.
> I think that even if some of the tests are not in canonical form (and can be optimized by instcombine), they still add value here by letting us check the behavior of this specific pass on similar cases.
> Don't you agree?
Yes, I agree that we want to have tests for edge cases to make sure that the logic is correct here. But we also should have tests that show why this patch is necessary - functions that could not be solved in regular instcombine easily.
The first 4 with diff.
> > Most of the other cases are there to check various "edge" cases of the added code's behavior.
> > I think that even if some of the tests are not in canonical form (and can be optimized by instcombine), they still add value here by letting us check the behavior of this specific pass on similar cases.
> > Don't you agree?
>
> Yes, I agree that we want to have tests for edge cases to make sure that the logic is correct here. But we also should have tests that show why this patch is necessary - functions that could not be solved in regular instcombine easily.
I agree with you 100%. That's why we have several test cases.
Running the same test with -instcombine instead of -aggressive-instcombine, out of the 23 test cases in the file, the following get optimized:
- cmp_select_zext_i8_noTransformation
- cmp_select_zext_sext_i8_noTransformation
- cmp_select_signed_const_i16Const_noTransformation
- cmp_select_unsigned_const_i16Const
- cmp_select_unsigned_const_i16Const_noTransformation
- cmp_select_unsigned_const_i16_MSB1
- cmp_select_bigConst_cmp
- cmp_zext_and_minus1_noTransformation
Out of these cases, the ones with the "_noTransformation" suffix are there to make sure our pass does not apply any transformation, and the others include some special immediate values.
Actually, I feel that I might have misunderstood your comment's intention. Can you explain what you are suggesting we do?
Thanks,
Ayman
I'm trying to understand which patterns are not capable of being reduced to the optimal form by instcombine. If there is some artificial limitation within regular instcombine that we can overcome (like unnecessary bypassing of min/max patterns), then we might be able to avoid a specialized solution in this pass. We want to show that there's a good reason to add code here in aggressive-instcombine to handle these optimizations. So if you can add comments in the test file that explain why instcombine can't handle some transform, that will explain to future readers of the code/tests why we have this particular transform here. Also, if instcombine somehow gets smarter in the future, then we would know if this code became redundant.
As per @spatel's request,
added a comment explaining the motivation behind implementing the transformation in AggressiveInstCombine instead of regular InstCombine.
I still haven't had a chance to actually look at the code here.
@nikic / @lebedev.ri - are your concerns handled?
llvm/test/Transforms/AggressiveInstCombine/trunc_select_cmp.ll
  Lines 5–6: This isn't entirely accurate for the tests as shown. The problem in several of these tests is not that there are extra uses; it's that we don't canonicalize icmp and min/max as well as possible. That was what rGf2fbdf76d8d0 was trying to solve, but I don't think there's a quick fix to restore that. Please add some extra uses in these first 4 tests, so we have a better representation of the motivating tests. One way to do that is to add something like:
llvm/test/Transforms/AggressiveInstCombine/trunc_select_cmp.ll
  Lines 5–6: I must disagree with you this time.
Sorry, I did not explain my previous comment as well as possible. Let's take this minimal example:
define i1 @f(i8 %a) {
  %conv = sext i8 %a to i32
  %cmp = icmp slt i32 %conv, 109
  ret i1 %cmp
}
With instcombine today, we can reduce this as:
define i1 @f(i8 %a) {
  %cmp = icmp slt i8 %a, 109
  ret i1 %cmp
}
We can also see that with instcombine today, we can reduce this example even if the cast (sext in this example) has other uses:
declare void @use(i32)

define i1 @f(i8 %a) {
  %conv = sext i8 %a to i32
  call void @use(i32 %conv)
  %cmp = icmp slt i32 %conv, 109
  ret i1 %cmp
}
Narrow 'cmp' here even though the sext is not eliminated:
define i1 @f(i8 %a) {
  %conv = sext i8 %a to i32
  call void @use(i32 %conv)
  %cmp = icmp slt i8 %a, 109
  ret i1 %cmp
}
So that's what I was trying to suggest - the limitation is not entirely about the multiple uses in these small examples. The 1st regression test example would have been altered by rGf2fbdf76d8d0 to become:
define i16 @cmp_select_sext_const(i8 %a) {
  %c = icmp sgt i8 %a, 109
  %narrow = select i1 %c, i8 %a, i8 109
  %conv4 = zext i8 %narrow to i16
  ret i16 %conv4
}
And even with an extra use of the sext, we would get the narrow icmp/select:
define i16 @cmp_select_sext_const(i8 %a) {
  %conv = sext i8 %a to i32
  call void @use(i32 %conv)
  %1 = icmp sgt i8 %a, 109
  %narrow = select i1 %1, i8 %a, i8 109
  %conv4 = zext i8 %narrow to i16
  ret i16 %conv4
}
In fact, even when the select also has an extra use (you can reproduce this on current LLVM: https://godbolt.org/z/P6Kws_ ):
define i16 @cmp_select_sext_const(i8 %a) {
  %conv = sext i8 %a to i32
  call void @use(i32 %conv)
  %cmp = icmp slt i8 %a, 109
  %cond = select i1 %cmp, i32 109, i32 %conv
  call void @use(i32 %cond)
  %conv4 = trunc i32 %cond to i16
  ret i16 %conv4
}
Instcombine will produce this (which seems unintended because we increased the instruction count):
define i16 @cmp_select_sext_const(i8 %a) {
  %conv = sext i8 %a to i32
  call void @use(i32 %conv)
  %1 = icmp sgt i8 %a, 109
  %narrow = select i1 %1, i8 %a, i8 109
  %cond = zext i8 %narrow to i32
  call void @use(i32 %cond)
  %conv4 = zext i8 %narrow to i16
  ret i16 %conv4
}
So I'm suggesting that we cover that last case in addition to the simpler test, so we will know what happens with multiple extra uses.
Also note that we have floated the idea again of adding min/max intrinsics to IR in D68408. Obviously, we need to post that as a proposal on llvm-dev for wider review...but if we had those intrinsics in IR, it probably affects at least some of the motivation for this patch, so I'd be interested in your thoughts on that.
Sorry, but I think the motivation of the patch is still not clear.
The whole optimization (which we're trying to improve) is triggered by a truncate instruction that dominates a certain sequence.
This trunc can allow shrinking a whole sequence of instructions to a new type, thereby removing the need for the trunc instruction itself.
The example you give here has no trunc at all, so it's not really relevant.
> And even with an extra use of the sext, we would get the narrow icmp/select:
>
>   define i16 @cmp_select_sext_const(i8 %a) {
>     %conv = sext i8 %a to i32
>     call void @use(i32 %conv)
>     %1 = icmp sgt i8 %a, 109
>     %narrow = select i1 %1, i8 %a, i8 109
>     %conv4 = zext i8 %narrow to i16
>     ret i16 %conv4
>   }
Same here.
> In fact, even when the select also has an extra use (you can reproduce this on current LLVM: https://godbolt.org/z/P6Kws_ ):
>
>   define i16 @cmp_select_sext_const(i8 %a) {
>     %conv = sext i8 %a to i32
>     call void @use(i32 %conv)
>     %cmp = icmp slt i8 %a, 109
>     %cond = select i1 %cmp, i32 109, i32 %conv
>     call void @use(i32 %cond)
>     %conv4 = trunc i32 %cond to i16
>     ret i16 %conv4
>   }
This example does have a truncate instruction, but as we can see, the optimization applied to it did not originate from the truncate.
As a matter of fact, it applied despite it: we can see that shrinking the select's type actually increased the instruction count.
So this is not a relevant example either.
What we are trying to solve in this patch are cases where a full sequence that is terminated by a truncate instruction can be shrunk to a new type, which subsequently removes the need for the terminating truncate instruction.
A perfect example would be:
define dso_local i16 @cmp_select_sext(i8 %a, i8 %b) {
entry:
  %conv = sext i8 %a to i32
  %conv2 = sext i8 %b to i32
  %cmp = icmp slt i32 %conv, %conv2
  %cond = select i1 %cmp, i32 %conv2, i32 %conv
  %conv4 = trunc i32 %cond to i16
  ret i16 %conv4
}
which will not be changed by instcombine, but with the new optimization the whole sequence can be shrunk to the i16 type (according to the destination type of the truncate instruction), and the truncate can be thrown away.
> Also note that we have floated the idea again of adding min/max intrinsics to IR in D68408. Obviously, we need to post that as a proposal on llvm-dev for wider review... but if we had those intrinsics in IR, it probably affects at least some of the motivation for this patch, so I'd be interested in your thoughts on that.
Actually, this might help in cases where the icmp+select implements min/max semantics; in that case we could simply add the new nodes' opcodes (min/max) to the list of opcodes the optimization handles.
But we still need to handle the general icmp+select case as well.
llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp
  Line 59: No, unfortunately I didn't take that into account. Is

    if (auto *VT = dyn_cast<VectorType>(C->getType())) {
      for (unsigned i = 0; i < VT->getNumElements(); i++) {

  a correct approach to such a case?