This patch canonicalizes conditions based on the constant range information of the dominating branch condition. For example: %cmp = icmp slt i64 %a, 0 br i1 %cmp, label %land.lhs.true, label %lor.rhs lor.rhs: %cmp2 = icmp sgt i64 %a, 0 Would now be canonicalized into: %cmp = icmp slt i64 %a, 0 br i1 %cmp, label %land.lhs.true, label %lor.rhs lor.rhs: %cmp2 = icmp ne i64 %a, 0
Details
Diff Detail
Event Timeline
Is this the best pass for this change to be in? ISTM that something like jump threading would be more appropriate.
I see no reason to only solve this one way. This could be handled at instsimplify or early-cse or jumpthreading or instcombine or gvn. I felt instcombine to be better place because we have dominator tree info and there already exists a function isSignBitCheck that checks if the comparison only checks the sign bit.
Fixed a bug. We need to guarantee that the dominating branch condition holds at the compare instruction, so it is not enough to check the immediate dominator. We also need to ensure that the edge dominates the compare instruction.
Addressed few minor nits to reduce the amount of work done and also added another test for an interesting case found in spec2006/povray.
FWIW, this change has 8% speedup in spec2006/soplex on Kryo and some small wins in other CINT2006 benchmarks.
Ping #2?
I plan to improve this patch to also handle PR23333, but wanted to get consensus on the current approach. I believe this can be also solved in JumpThreading too but it will have to depend on LazyValueInfo which is known to have compile-time implications. Please let me know if this approach is reasonable.
Hi,
I think this is an overtly specific solution. Ideally we'd do
something that would generalize to cases like
if (a >= 5) if (a < 6) { FOO; }
to
if (a == 5) { FOO; }
as well.
IMO the right layering for all the predicate logic you have here is in
ConstantRange. Just like we have makeAllowedICmpRegion and
makeSatisfyingICmpRegion, we should have a makeExactICmpRegion
where the RHS is not a ConstantRange but an APInt and have it
return *exactly* the set of values that satisfy the predicate (i.e. a
value satisfies the predicate *iff* it is in that set). Then the
current algorithm could become:
For a dominating icmp instruction "Var Pred RHS", compute R_Dom = makeAllowedICmpRegion(Pred, getRange(RHS)); If the current icmp is of the form "Var Pred ConstRHS" then compute R_Self = makeExactICmpRegion(Pred, ConstRHS) if R_Self.intersect(R_Dom) is empty: replace current icmp with false if R_Self.intersect(R_Dom) has exactly one element E: replace current icmp with "Var == E" if R_Self.intersect(R_Dom).complement() has exact one element E: replace current icmp with "Var != E"
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3263 | If you're going to restrict yourself to this case, I don't think you should depend on DT. Instead you can just check that the only predecessor of Parent is TrueBB or FalseBB (depending on TrueIfSigned) and be done with it, since the only value of IDom you'll succeed with here is Parent->getSinglePredecessor(). |
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3341 | I think you also need to check that TrueBB and FalseBB are not equal. | |
3345 | I think what you want is makeAllowedICmpRegion(getInversePredicate(Pred), CI2). In this case (where Other is a singleton set) what you have is fine, but generally (i.e. if the RHS was not a constant integer, but a range) it is not. Actually, it might be clearer to introduce a new helper: ConstantRange::makeExactICmpRegion that takes an APInt as RHS, and returns a set that is equal to both the "allowed" and "satisfying" icmp range. | |
3347 | This is confusingly named -- by "RHS", I'd normally understand CI or CI2; but what you mean here is the dominated LHS. How about calling it just that: DominatedLHSRange? | |
3352 | Why do you care about isSignBit? | |
test/Transforms/InstCombine/icmp.ll | ||
2015 | Can you add a dominated check with, say, 10 or something just to demonstrate that we're general here? |
Addressed Sanjoy's comments and added more tests.
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3341 | I think we won't match this pattern if TrueBB == FalseBB because instcombine will already have transformed such IR into br i1 undef, label %dest, label %dest. I will add an assert to make sure they aren't equal. | |
3345 | I am not sure if I follow this correctly, could you please give an example?. | |
3347 | How about using CR and DominatingCR corresponding to CI and CI2? | |
3352 | If it is a SignBit, canonicalizing could transform for example an icmp slt 0 into icmp eq/ne 0 which we want to avoid in situations where the icmp is only used as input to a branch because it will pessimize the codegen by generating a compare and branch on zero instruction (cbz) instead of a test and branch(tbz). tbz has better branch displacement than a cbz. If the icmp is an input to anything other than branches such as a select we would be better of canonicalizing it to eq/ne. Added a helper function isZBranchCheck to restrict it further for this case and added testcases. |
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
117 | I'd use an architecture neutral name here, if possible. | |
3341 | My bias is that it is better to have these transforms be correct in isolation, and not rely on ordering across a large pass like -instcombine. But if you're sure that the "branch on undef" optimization is guaranteed to kick in before this optimization (i.e. that borders around "obvious"), then feel free to add an assert instead of checking and bailing out. | |
3345 | Yeah, I wasn't as clear as I should have been. The "problem" here is that the logic is structured in a way that will not generalize to RHS being a non-singleton range. For instance, if we have if (lhs u< "[2, 5)") { ... } else { // This is what you're trying to optimize if (lhs u< 3) print("hello"); } then currently you will infer the range of lhs on the else branch to be [0, 4).inverse() == [4, 0), and fold lhs u< 3 to false. But lhs u< 3 can be true also, if the RHS (the value we know is in [2, 5)) was 2 and lhs was 2. In other words, in the above IR you know that in the "then" case lhs cannot be unsigned-greater than 4, but that does not mean that in the else case lhs *has to be* unsigned greater than 4. Now I think this won't be an observable bug if RHS is always a singleton set, but I'd like to have some measures in place that make it clear what's going on. There are two ways we can do that:
makeExactICmpRegion(...) { assert(makeAllowedICmpRegion(...) == makeSatisfyingICmpRegion(...)); return makeAllowedICmpRegion(...); } and to use that here instead of makeAllowedICmpRegion. It would take an APInt as RHS, and if someone comes along and tries to generalize this code to work on non-singleton RHSs, they'll immediately know that it isn't a 100% straightforward.
| |
3347 | SGTM |
Addressed Sanjoy's comments.
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
117 | Does 'isBranchOnSignBitCheck' sound good? | |
3341 | Since instcombine visits the instruction in topological order, we will always visit the dominating branch before visting the icmp instruction and so it is guranteed that the branch condition is undef when the condition is irrelevant. | |
3345 |
I think this assumption is not always true, see my previous example.
I see your point here, I have already modified this to 'makeAllowedICmpRegion(getInversePredicate(Pred), CI2)' based on your earlier suggestion.
I have no idea if non-singleton RHS ranges occur enough to be worthwhile to generalize. Do you mind if I defer this suggestion? |
Hi Balaram,
I've replied to your inline comment, but it is possible that we're just talking past each other. I'm usually on #llvm as sanjoyd on PDT daytime, we can chat there in realtime if you think that'll help.
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3342 | Add a comment here on why that is true. | |
3345 |
Are you talking about the "Taking a general example if Pred is ult and Other is i8 [2, 5) makeAllowedICmpRegion would return [0, 4) and make SatisfyingICmpRegion would return [0, 2). Isn't a set that is equal to both same as the "satisfying" icmp range i.e [0, 2) in this example?" bit? In that case isn't the RHS (i.e. Other) [2, 5), specifically not a singleton set? If you're talking about another example, then I must have missed it.
That's not the only place where the current code will not generalize to a non-singleton RHS. For instance, the Intersection.isSingleElement() check won't generalize either -- you cannot transform if (a u< "[2, 5)") { // AllowedRegion = [0, 4) if (a >=u "[3, 5)") { // AllowedRegion = [3, 0) print(); } else abort(); } to if (a u< "[2, 5)") if (a == 3) print(); else abort(); since for "[2, 5)" == 4 and "[3, 5)" == 4 and a == 3, the first program would have called abort() but the new program will call print(). On the other hand, for the Intersection.isEmptySet() you do need to call AllowedICmpRegion on both the conditions. There is a general pattern here -- makeAllowedICmpRegion only lets you "rule out" certain values (e.g. if a >=u "[3, 5)" then a definitely cannot be in {0, 1, 2}, but it can be anything else), but it does not guarantee that the comparison will fold one way or the other. IOW, it is the "only if" part of the constraint (the comparison succeeds only if the LHS in the returned set), but not the "if" part (just because LHS is in the returned set does not mean the comparison will succeed).
Sure, but in that case we need to have the logic phrased in a way that makes it obvious that the current logic is not fully general (i.e. with a makeExactICmpRegion). I don't think non-singletom RHS ranges are too rare: if (a u< 100) if (b u< a) if (b >u 500) In the inner condition b u< a, you know the RHS is in [0, 100), and you can use that information to fold the innermost condition to false. However, it is fine to not handle these cases as a starting point. |
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3341 | This is only true for the very first iteration of InstCombine. InstCombine adds the users of all instructions that it modifies to a worklist, this worklist is not toplogically sorted. |
Thanks Sanjoy, sure I will be available on #llvm as bmakam. I think I have misinterpreted what you mean by non-singleton RHS. However, I think the current logic as is should be able to generalize to what you define as non-singleton RHS ranges.
since for "[2, 5)" == 4 and "[3, 5)" == 4 and a == 3, the first program would have called abort() but the new program will call print().
The condition here is u< [2,5) so the allowed range is [0,4) which means a can never be == 4 for the first/dominating if condition and the program will abort() as expected.
If it helps to clarify, I think of the if conditions allowed ranges as set A for the dominating if condition and set B for the second(Parent) if condition. If dominating if condition is true, then the second if condition is true iff A ∩ B is true. So if {A ∩ B} == {ε} then it is sufficient to canonicalize the second if condition to icmp eq ε. OTOH, the second if condition is false iff A\B (aka relative complement) is true. So if {A\B} == {ε} then it is sufficient to canonicalize the second if condition to icmp ne ε. Similarly, if the dominating if condition is false, then the second if condition is true iff ~A ∩ B is true and the second if condition is false iff ~A\B is true. For non-singleton RHS ranges, the satisfying range is ⊆ allowed range, so it is sufficient to check for allowed range.
The condition here is u< [2,5) so the allowed range is [0,4) which
means a can never be == 4 for the first/dominating if condition and
the program will abort() as expected.
I didn't mean to say a was 4. By "[2, 5)" == 4 I meant the
RHS we know is in "[2, 5)" is 4 (similar for the other time I used
that notation). a itself is 3 in the above example.
[Edit: added this example]IOW, at runtime the ranged values resolve as
if (a u< 4) { // AllowedRegion = [0, 4) if (a >=u 4") { // AllowedRegion = [3, 0) print(); } else abort(); }
If it helps to clarify, I think of the if conditions allowed ranges
as set A for the dominating if condition and set B for the
second(Parent) if condition. If dominating if condition is true,
then the second if condition is true iff A ∩ B is true.
As I said in my previous email, the second condition is true only
if A ∩ B is true, not iff A ∩ B is true (follows from the
fact that the second condition is true only if B is true).
I.e. the second condition implies A ∩ B but not the other way around.
makeAllowedICmpRegion gives you a set with the only if property.
makeSatisfyingICmpRegion gives you a set with the if property.
If they return equal sets, then you have iff on that set (or if
you take an intersection of the sets they return).
makeAllowedICmpRegion gives you a set with the only if property.
makeSatisfyingICmpRegion gives you a set with the if property.
If they return equal sets, then you have iff on that set (or if
you take an intersection of the sets they return).
Add a helper makeExactICmpRegion that will return the region if makeSatisfyingICmpRegion is exactly same as makeAllowedICmpRegion.
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3355 | if Difference is empty then the inner condition is always true. |
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3341 |
Since it isn't very obvious, to be on the safer side I have changed the assert back into a check. |
lib/IR/ConstantRange.cpp | ||
---|---|---|
132 ↗ | (On Diff #55071) | Can you add a comment on why this assert is valid (i.e. that we're relying on knowledge of how the two makeAllowedXXXRegion s are implemented). |
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
117 | We no longer repeat function names in comments. | |
118 | This isn't an exploded icmp right? | |
3356 | Add a comment on why you need isBranchOnSignBitCheck here. | |
3357 | These are better written as: if (auto *AI = Intersection.getSingleElement()) return new ICmpInst(..., Builder->getInt(*AI)); etc. |
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
118 | Right, I have corrected the comment. |
Generally speaking, InstCombine canonicalization helps out down stream analysis by simplifying the IR or exposes further opportunities for canonicalization. i'm not entirely sure that this transform fits that bill.
I wonder if a more appropriate place for this to live is in CodeGenPrepare seeing as how sensitive this transform is to the peculiarities of the target ISA, perhaps with a TargetLowering check which verifies that it would make sense to perform on such a target.
FWIW, the examples listed in the testcases demonstrate that it helps out by simplifying the IR. In my tests on spec2006 benchmarks it has reduced more than 3000 static instructions in xalancbmk and 100s of static instructions in gcc and several other benchmarks. This is similar to the transformations done in FoldAndOfICmps except that here we consider dominating condition as the LHS. I have constrained this only to the first predecessor to find the dominating icmp and could still remove several static instructions which should improve the code quality and compile time.
I wonder if a more appropriate place for this to live is in CodeGenPrepare seeing as how sensitive this transform is to the peculiarities of the target ISA, perhaps with a TargetLowering check which verifies that it would make sense to perform on such a target.
I am not totally against moving this to the backend and infact I started off by implementing it at the very end of the pipeline in D18572, but realized that it would catch more such cases and might help other downstream optimizations when we do this transform in instcombine. What do you think?
Does this result in instructions removed at the LLVM IR or does it result in fewer instructions emitted in the backend?
If it results in fewer IR instructions, can you determine why this is the case? If it results in the same number of IR instructions, then performing this transform might make more sense in CodeGenPrepare.
I wonder if a more appropriate place for this to live is in CodeGenPrepare seeing as how sensitive this transform is to the peculiarities of the target ISA, perhaps with a TargetLowering check which verifies that it would make sense to perform on such a target.
I am not totally against moving this to the backend and infact I started off by implementing it at the very end of the pipeline in D18572, but realized that it would catch more such cases and might help other downstream optimizations when we do this transform in instcombine. What do you think?
I would have done this optimization a little earlier at around the CodeGenPrepare level.
Does this result in instructions removed at the LLVM IR or does it result in fewer instructions emitted in the backend?
If it results in fewer IR instructions, can you determine why this is the case? If it results in the same number of IR instructions, then performing this transform might make more sense in CodeGenPrepare.
I generated the IR after LTO using -Wl,plugin-opt=emit-llvm for xalancbmk and with this patch the number of IR instructions were 1059942. Compared to the baseline without my patch the number of IR instructions were 1067179. So there was a net reduction of 7237 IR instructions which translated into 3531 fewer instructions emitted in the backend. Over 1000 add instructions, and branch instructions were removed. Most of the cases were similar to the tests I have added in this patch.
This looks okay to me (pending the one inline comment), assuming @majnemer is fine with too.
Btw, looks like instcombine already canonicalizes towards equality, for instance:
define i1 @f(i32 %x) { %v = icmp ugt i32 %x, 0 ret i1 %v }
is instcombined to
define i1 @f(i32 %x) { %v = icmp ne i32 %x, 0 ret i1 %v }
and this change looks like it is doing more of the same.
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3336 | I think you can just use dyn_cast here. For a well formed CFG, getTerminator should always return nonnull. |
Thank you very much for the review Sanjoy.
@majnemer, Is it reasonable to commit this patch?
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3355–3366 | The prior two replacements seem like pure goodness: they make conditional branches branch on true or false. These two seem weird to have in InstCombine. Seeing as how you are concerned with pessimization, can we move these to CGP and have them operate on a TargetLowering hook? |
lib/Transforms/InstCombine/InstCombineCompares.cpp | ||
---|---|---|
3355–3366 | On second thought, this is a nice canonicalization regardless because it makes the comparisons easier to analyze. LGTM. |
I'd use an architecture neutral name here, if possible.