This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Canonicalize icmp instructions based on dominating conditions.
ClosedPublic

Authored by bmakam on Apr 6 2016, 12:53 PM.

Details

Summary
This patch canonicalizes conditions based on the constant range information
of the dominating branch condition.
For example:

  %cmp = icmp slt i64 %a, 0
  br i1 %cmp, label %land.lhs.true, label %lor.rhs
  lor.rhs:
    %cmp2 = icmp sgt i64 %a, 0

Would now be canonicalized into:

  %cmp = icmp slt i64 %a, 0
  br i1 %cmp, label %land.lhs.true, label %lor.rhs
  lor.rhs:
    %cmp2 = icmp ne i64 %a, 0

Diff Detail

Repository
rL LLVM

Event Timeline

bmakam updated this revision to Diff 52839.Apr 6 2016, 12:53 PM
bmakam retitled this revision from to [InstCombine] Canonicalize icmp instructions based on dominating conditions..
bmakam updated this object.

Is this the best pass for this change to be in? ISTM that something like jump threading would be more appropriate.

bmakam added a comment.Apr 6 2016, 2:17 PM

Is this the best pass for this change to be in? ISTM that something like jump threading would be more appropriate.

I see no reason to only solve this one way. This could be handled at instsimplify or early-cse or jumpthreading or instcombine or gvn. I felt instcombine to be better place because we have dominator tree info and there already exists a function isSignBitCheck that checks if the comparison only checks the sign bit.

bmakam updated this revision to Diff 52920.Apr 7 2016, 7:45 AM

Fixed a bug. We need to guarantee that the dominating branch condition holds at the compare instruction, so it is not enough to check the immediate dominator. We also need to ensure that the edge dominates the compare instruction.

bmakam updated this revision to Diff 53296.Apr 11 2016, 11:43 AM
bmakam added a subscriber: MatzeB.

Addressed few minor nits to reduce the amount of work done and also added another test for an interesting case found in spec2006/povray.
FWIW, this change has 8% speedup in spec2006/soplex on Kryo and some small wins in other CINT2006 benchmarks.

Ping #2?
I plan to improve this patch to also handle PR23333, but wanted to get consensus on the current approach. I believe this can be also solved in JumpThreading too but it will have to depend on LazyValueInfo which is known to have compile-time implications. Please let me know if this approach is reasonable.

sanjoy edited edge metadata.Apr 17 2016, 11:04 PM

Hi,

I think this is an overtly specific solution. Ideally we'd do
something that would generalize to cases like

if (a >= 5)
  if (a < 6) {
    FOO;
  }

to

if (a == 5) {
  FOO;
}

as well.

IMO the right layering for all the predicate logic you have here is in
ConstantRange. Just like we have makeAllowedICmpRegion and
makeSatisfyingICmpRegion, we should have a makeExactICmpRegion
where the RHS is not a ConstantRange but an APInt and have it
return *exactly* the set of values that satisfy the predicate (i.e. a
value satisfies the predicate *iff* it is in that set). Then the
current algorithm could become:

For a dominating icmp instruction "Var Pred RHS",
    compute R_Dom = makeAllowedICmpRegion(Pred, getRange(RHS));
If the current icmp is of the form "Var Pred ConstRHS" then
    compute R_Self = makeExactICmpRegion(Pred, ConstRHS)
    if R_Self.intersect(R_Dom) is empty:
      replace current icmp with false
    if R_Self.intersect(R_Dom) has exactly one element E:
      replace current icmp with "Var == E"
    if R_Self.intersect(R_Dom).complement() has exact one element E:
      replace current icmp with "Var != E"
lib/Transforms/InstCombine/InstCombineCompares.cpp
3263 ↗(On Diff #53296)

If you're going to restrict yourself to this case, I don't think you should depend on DT. Instead you can just check that the only predecessor of Parent is TrueBB or FalseBB (depending on TrueIfSigned) and be done with it, since the only value of IDom you'll succeed with here is Parent->getSinglePredecessor().

bmakam updated this revision to Diff 54417.Apr 20 2016, 1:54 PM
bmakam updated this object.
bmakam edited edge metadata.
bmakam marked an inline comment as done.

Thanks for the review Sanjoy. Please let me know if this looks reasonable.

sanjoy requested changes to this revision.Apr 21 2016, 11:37 PM
sanjoy edited edge metadata.
sanjoy added inline comments.
lib/Transforms/InstCombine/InstCombineCompares.cpp
3284 ↗(On Diff #54417)

I think you also need to check that TrueBB and FalseBB are not equal.

3288 ↗(On Diff #54417)

I think what you want is makeAllowedICmpRegion(getInversePredicate(Pred), CI2). In this case (where Other is a singleton set) what you have is fine, but generally (i.e. if the RHS was not a constant integer, but a range) it is not.

Actually, it might be clearer to introduce a new helper: ConstantRange::makeExactICmpRegion that takes an APInt as RHS, and returns a set that is equal to both the "allowed" and "satisfying" icmp range.

3290 ↗(On Diff #54417)

This is confusingly named -- by "RHS", I'd normally understand CI or CI2; but what you mean here is the dominated LHS. How about calling it just that: DominatedLHSRange?

3295 ↗(On Diff #54417)

Why do you care about isSignBit?

test/Transforms/InstCombine/icmp.ll
2038 ↗(On Diff #54417)

Can you add a dominated check with, say, 10 or something just to demonstrate that we're general here?

This revision now requires changes to proceed.Apr 21 2016, 11:37 PM
bmakam updated this revision to Diff 54645.Apr 22 2016, 7:22 AM
bmakam edited edge metadata.
bmakam marked 2 inline comments as done.

Addressed Sanjoy's comments and added more tests.

lib/Transforms/InstCombine/InstCombineCompares.cpp
3284 ↗(On Diff #54417)

I think we won't match this pattern if TrueBB == FalseBB because instcombine will already have transformed such IR into br i1 undef, label %dest, label %dest. I will add an assert to make sure they aren't equal.

3288 ↗(On Diff #54417)

I am not sure if I follow this correctly, could you please give an example?.
If the new helper takes an APInt as RHS isn't it a singleton set in this case?
Taking a general example if Pred is ult and Other is i8 [2, 5) makeAllowedICmpRegion would return [0, 4) and make SatisfyingICmpRegion would return [0, 2). Isn't a set that is equal to both same as the "satisfying" icmp range i.e [0, 2) in this example?

3290 ↗(On Diff #54417)

How about using CR and DominatingCR corresponding to CI and CI2?

3295 ↗(On Diff #54417)

If it is a SignBit, canonicalizing could transform for example an icmp slt 0 into icmp eq/ne 0 which we want to avoid in situations where the icmp is only used as input to a branch because it will pessimize the codegen by generating a compare and branch on zero instruction (cbz) instead of a test and branch(tbz). tbz has better branch displacement than a cbz. If the icmp is an input to anything other than branches such as a select we would be better of canonicalizing it to eq/ne. Added a helper function isZBranchCheck to restrict it further for this case and added testcases.

sanjoy requested changes to this revision.Apr 22 2016, 8:13 PM
sanjoy edited edge metadata.
sanjoy added inline comments.
lib/Transforms/InstCombine/InstCombineCompares.cpp
118 ↗(On Diff #54645)

I'd use an architecture neutral name here, if possible.

3293 ↗(On Diff #54645)

My bias is that it is better to have these transforms be correct in isolation, and not rely on ordering across a large pass like -instcombine. But if you're sure that the "branch on undef" optimization is guaranteed to kick in before this optimization (i.e. that borders around "obvious"), then feel free to add an assert instead of checking and bailing out.

3297 ↗(On Diff #54645)

Yeah, I wasn't as clear as I should have been.

The "problem" here is that the logic is structured in a way that will not generalize to RHS being a non-singleton range. For instance, if we have

if (lhs u< "[2, 5)") {
  ...
} else {
  // This is what you're trying to optimize
  if (lhs u< 3)
    print("hello");
}

then currently you will infer the range of lhs on the else branch to be [0, 4).inverse() == [4, 0), and fold lhs u< 3 to false. But lhs u< 3 can be true also, if the RHS (the value we know is in [2, 5)) was 2 and lhs was 2. In other words, in the above IR you know that in the "then" case lhs cannot be unsigned-greater than 4, but that does not mean that in the else case lhs *has to be* unsigned greater than 4.

Now I think this won't be an observable bug if RHS is always a singleton set, but I'd like to have some measures in place that make it clear what's going on. There are two ways we can do that:

  • For a singleton RHS, I think (but please verify) makeAllowedICmpRegion and makeSatisfyingICmpRegion will return the same exact set, which contains _exactly_ the value that the predicate is true for. If this assumption is true, my original suggestion was to add a makeExactICmpRegion method as:
makeExactICmpRegion(...) {
  assert(makeAllowedICmpRegion(...) == makeSatisfyingICmpRegion(...));
  return makeAllowedICmpRegion(...);
}

and to use that here instead of makeAllowedICmpRegion. It would take an APInt as RHS, and if someone comes along and tries to generalize this code to work on non-singleton RHSs, they'll immediately know that it isn't a 100% straightforward.

  • The second possibility is to generalize this logic to work with a non-singleton RHS range. To do that, we can start by allowing the RHS to be a load instruction with !range metadata, and eventually move over to even more sophisticated analyses.
3299 ↗(On Diff #54645)

SGTM

This revision now requires changes to proceed.Apr 22 2016, 8:13 PM
bmakam updated this revision to Diff 54834.Apr 25 2016, 4:52 AM
bmakam edited edge metadata.
bmakam marked an inline comment as done.

Addressed Sanjoy's comments.

lib/Transforms/InstCombine/InstCombineCompares.cpp
118 ↗(On Diff #54645)

Does 'isBranchOnSignBitCheck' sound good?

3293 ↗(On Diff #54645)

Since instcombine visits the instruction in topological order, we will always visit the dominating branch before visting the icmp instruction and so it is guranteed that the branch condition is undef when the condition is irrelevant.

3297 ↗(On Diff #54645)

For a singleton RHS, I think (but please verify) makeAllowedICmpRegion and makeSatisfyingICmpRegion will return the same exact set, which contains _exactly_ the value that the predicate is true for.

I think this assumption is not always true, see my previous example.

then currently you will infer the range of lhs on the else branch to be [0, 4).inverse() == [4, 0)

I see your point here, I have already modified this to 'makeAllowedICmpRegion(getInversePredicate(Pred), CI2)' based on your earlier suggestion.

The second possibility is to generalize this logic to work with a non-singleton RHS range.

I have no idea if non-singleton RHS ranges occur enough to be worthwhile to generalize. Do you mind if I defer this suggestion?

bmakam updated this revision to Diff 54879.Apr 25 2016, 10:57 AM
bmakam edited edge metadata.

correctly update the use of isBranchOnSignBitCheck()

Hi Balaram,

I've replied to your inline comment, but it is possible that we're just talking past each other. I'm usually on #llvm as sanjoyd on PDT daytime, we can chat there in realtime if you think that'll help.

lib/Transforms/InstCombine/InstCombineCompares.cpp
3294 ↗(On Diff #54879)

Add a comment here on why that is true.

3297 ↗(On Diff #54879)

I think this assumption is not always true, see my previous example.

Are you talking about the "Taking a general example if Pred is ult and Other is i8 [2, 5) makeAllowedICmpRegion would return [0, 4) and make SatisfyingICmpRegion would return [0, 2). Isn't a set that is equal to both same as the "satisfying" icmp range i.e [0, 2) in this example?" bit? In that case isn't the RHS (i.e. Other) [2, 5), specifically not a singleton set? If you're talking about another example, then I must have missed it.

I have already modified this to 'makeAllowedICmpRegion(getInversePredicate(Pred), CI2)' based on your earlier suggestion.

That's not the only place where the current code will not generalize to a non-singleton RHS. For instance, the Intersection.isSingleElement() check won't generalize either -- you cannot transform

if (a u< "[2, 5)") {
  // AllowedRegion = [0, 4)
  if (a >=u "[3, 5)") {
    // AllowedRegion = [3, 0)
    print();
  } else
    abort();
}

to

if (a u< "[2, 5)")
  if (a == 3)
    print();
  else
    abort();

since for "[2, 5)" == 4 and "[3, 5)" == 4 and a == 3, the first program would have called abort() but the new program will call print().

On the other hand, for the Intersection.isEmptySet() you do need to call AllowedICmpRegion on both the conditions.

There is a general pattern here -- makeAllowedICmpRegion only lets you "rule out" certain values (e.g. if a >=u "[3, 5)" then a definitely cannot be in {0, 1, 2}, but it can be anything else), but it does not guarantee that the comparison will fold one way or the other. IOW, it is the "only if" part of the constraint (the comparison succeeds only if the LHS in the returned set), but not the "if" part (just because LHS is in the returned set does not mean the comparison will succeed).

I have no idea if non-singleton RHS ranges occur enough to be worthwhile to generalize. Do you mind if I defer this suggestion?

Sure, but in that case we need to have the logic phrased in a way that makes it obvious that the current logic is not fully general (i.e. with a makeExactICmpRegion).

I don't think non-singletom RHS ranges are too rare:

if (a u< 100)
  if (b u< a)
    if (b >u 500)

In the inner condition b u< a, you know the RHS is in [0, 100), and you can use that information to fold the innermost condition to false. However, it is fine to not handle these cases as a starting point.

majnemer added inline comments.Apr 25 2016, 11:38 PM
lib/Transforms/InstCombine/InstCombineCompares.cpp
3293 ↗(On Diff #54879)

This is only true for the very first iteration of InstCombine. InstCombine adds the users of all instructions that it modifies to a worklist, this worklist is not toplogically sorted.

Hi Balaram,

I've replied to your inline comment, but it is possible that we're just talking past each other. I'm usually on #llvm as sanjoyd on PDT daytime, we can chat there in realtime if you think that'll help.

Thanks Sanjoy, sure I will be available on #llvm as bmakam. I think I have misinterpreted what you mean by non-singleton RHS. However, I think the current logic as is should be able to generalize to what you define as non-singleton RHS ranges.

since for "[2, 5)" == 4 and "[3, 5)" == 4 and a == 3, the first program would have called abort() but the new program will call print().

The condition here is u< [2,5) so the allowed range is [0,4) which means a can never be == 4 for the first/dominating if condition and the program will abort() as expected.

If it helps to clarify, I think of the if conditions allowed ranges as set A for the dominating if condition and set B for the second(Parent) if condition. If dominating if condition is true, then the second if condition is true iff A ∩ B is true. So if {A ∩ B} == {ε} then it is sufficient to canonicalize the second if condition to icmp eq ε. OTOH, the second if condition is false iff A\B (aka relative complement) is true. So if {A\B} == {ε} then it is sufficient to canonicalize the second if condition to icmp ne ε. Similarly, if the dominating if condition is false, then the second if condition is true iff ~A ∩ B is true and the second if condition is false iff ~A\B is true. For non-singleton RHS ranges, the satisfying range is ⊆ allowed range, so it is sufficient to check for allowed range.

sanjoy added a comment.EditedApr 26 2016, 9:53 AM

Hi Balaram,

I've replied to your inline comment, but it is possible that we're just talking past each other. I'm usually on #llvm as sanjoyd on PDT daytime, we can chat there in realtime if you think that'll help.

Thanks Sanjoy, sure I will be available on #llvm as bmakam. I think I have misinterpreted what you mean by non-singleton RHS. However, I think the current logic as is should be able to generalize to what you define as non-singleton RHS ranges.

since for "[2, 5)" == 4 and "[3, 5)" == 4 and a == 3, the first program would have called abort() but the new program will call print().

The condition here is u< [2,5) so the allowed range is [0,4) which
means a can never be == 4 for the first/dominating if condition and
the program will abort() as expected.

I didn't mean to say a was 4. By "[2, 5)" == 4 I meant the
RHS we know is in "[2, 5)" is 4 (similar for the other time I used
that notation). a itself is 3 in the above example.

[Edit: added this example]IOW, at runtime the ranged values resolve as

if (a u< 4) {
  // AllowedRegion = [0, 4)
  if (a >=u 4") {
    // AllowedRegion = [3, 0)
    print();
  } else
    abort();
}

If it helps to clarify, I think of the if conditions allowed ranges
as set A for the dominating if condition and set B for the
second(Parent) if condition. If dominating if condition is true,
then the second if condition is true iff A ∩ B is true.

As I said in my previous email, the second condition is true only
if
A ∩ B is true, not iff A ∩ B is true (follows from the
fact that the second condition is true only if B is true).
I.e. the second condition implies A ∩ B but not the other way around.

makeAllowedICmpRegion gives you a set with the only if property.
makeSatisfyingICmpRegion gives you a set with the if property.
If they return equal sets, then you have iff on that set (or if
you take an intersection of the sets they return).

bmakam updated this revision to Diff 55071.EditedApr 26 2016, 12:49 PM

makeAllowedICmpRegion gives you a set with the only if property.
makeSatisfyingICmpRegion gives you a set with the if property.
If they return equal sets, then you have iff on that set (or if
you take an intersection of the sets they return).

Add a helper makeExactICmpRegion that will return the region if makeSatisfyingICmpRegion is exactly same as makeAllowedICmpRegion.

bmakam marked an inline comment as done.Apr 26 2016, 12:51 PM
bmakam added inline comments.
lib/Transforms/InstCombine/InstCombineCompares.cpp
3307 ↗(On Diff #55071)

if Difference is empty then the inner condition is always true.

bmakam added inline comments.Apr 26 2016, 12:56 PM
lib/Transforms/InstCombine/InstCombineCompares.cpp
3293 ↗(On Diff #55071)

This is only true for the very first iteration of InstCombine. InstCombine adds the users of all instructions that it modifies to a worklist, this worklist is not toplogically sorted.

Since it isn't very obvious, to be on the safer side I have changed the assert back into a check.

sanjoy requested changes to this revision.Apr 27 2016, 3:19 PM
sanjoy edited edge metadata.
sanjoy added inline comments.
lib/IR/ConstantRange.cpp
132 ↗(On Diff #55071)

Can you add a comment on why this assert is valid (i.e. that we're relying on knowledge of how the two makeAllowedXXXRegion s are implemented).

lib/Transforms/InstCombine/InstCombineCompares.cpp
118 ↗(On Diff #55071)

We no longer repeat function names in comments.

119 ↗(On Diff #55071)

This isn't an exploded icmp right?

3308 ↗(On Diff #55071)

Add a comment on why you need isBranchOnSignBitCheck here.

3309 ↗(On Diff #55071)

These are better written as:

if (auto *AI = Intersection.getSingleElement())
  return new ICmpInst(..., Builder->getInt(*AI));

etc.

This revision now requires changes to proceed.Apr 27 2016, 3:19 PM
bmakam updated this revision to Diff 55404.Apr 28 2016, 5:58 AM
bmakam edited edge metadata.

Thanks Sanjoy, please let me know if the comments added make sense.

bmakam marked 3 inline comments as done.Apr 28 2016, 6:00 AM
bmakam added inline comments.
lib/Transforms/InstCombine/InstCombineCompares.cpp
118 ↗(On Diff #55404)

Right, I have corrected the comment.

Generally speaking, InstCombine canonicalization helps out down stream analysis by simplifying the IR or exposes further opportunities for canonicalization. i'm not entirely sure that this transform fits that bill.

I wonder if a more appropriate place for this to live is in CodeGenPrepare seeing as how sensitive this transform is to the peculiarities of the target ISA, perhaps with a TargetLowering check which verifies that it would make sense to perform on such a target.

Generally speaking, InstCombine canonicalization helps out down stream analysis by simplifying the IR or exposes further opportunities for canonicalization. i'm not entirely sure that this transform fits that bill.

FWIW, the examples listed in the testcases demonstrate that it helps out by simplifying the IR. In my tests on spec2006 benchmarks it has reduced more than 3000 static instructions in xalancbmk and 100s of static instructions in gcc and several other benchmarks. This is similar to the transformations done in FoldAndOfICmps except that here we consider dominating condition as the LHS. I have constrained this only to the first predecessor to find the dominating icmp and could still remove several static instructions which should improve the code quality and compile time.

I wonder if a more appropriate place for this to live is in CodeGenPrepare seeing as how sensitive this transform is to the peculiarities of the target ISA, perhaps with a TargetLowering check which verifies that it would make sense to perform on such a target.

I am not totally against moving this to the backend and infact I started off by implementing it at the very end of the pipeline in D18572, but realized that it would catch more such cases and might help other downstream optimizations when we do this transform in instcombine. What do you think?

Generally speaking, InstCombine canonicalization helps out down stream analysis by simplifying the IR or exposes further opportunities for canonicalization. i'm not entirely sure that this transform fits that bill.

FWIW, the examples listed in the testcases demonstrate that it helps out by simplifying the IR. In my tests on spec2006 benchmarks it has reduced more than 3000 static instructions in xalancbmk and 100s of static instructions in gcc and several other benchmarks. This is similar to the transformations done in FoldAndOfICmps except that here we consider dominating condition as the LHS. I have constrained this only to the first predecessor to find the dominating icmp and could still remove several static instructions which should improve the code quality and compile time.

Does this result in instructions removed at the LLVM IR or does it result in fewer instructions emitted in the backend?

If it results in fewer IR instructions, can you determine why this is the case? If it results in the same number of IR instructions, then performing this transform might make more sense in CodeGenPrepare.

I wonder if a more appropriate place for this to live is in CodeGenPrepare seeing as how sensitive this transform is to the peculiarities of the target ISA, perhaps with a TargetLowering check which verifies that it would make sense to perform on such a target.

I am not totally against moving this to the backend and infact I started off by implementing it at the very end of the pipeline in D18572, but realized that it would catch more such cases and might help other downstream optimizations when we do this transform in instcombine. What do you think?

I would have done this optimization a little earlier at around the CodeGenPrepare level.

Does this result in instructions removed at the LLVM IR or does it result in fewer instructions emitted in the backend?

If it results in fewer IR instructions, can you determine why this is the case? If it results in the same number of IR instructions, then performing this transform might make more sense in CodeGenPrepare.

I generated the IR after LTO using -Wl,plugin-opt=emit-llvm for xalancbmk and with this patch the number of IR instructions were 1059942. Compared to the baseline without my patch the number of IR instructions were 1067179. So there was a net reduction of 7237 IR instructions which translated into 3531 fewer instructions emitted in the backend. Over 1000 add instructions, and branch instructions were removed. Most of the cases were similar to the tests I have added in this patch.

bmakam added a comment.May 2 2016, 4:27 AM

gentle ping.

sanjoy accepted this revision.May 2 2016, 2:19 PM
sanjoy edited edge metadata.

This looks okay to me (pending the one inline comment), assuming @majnemer is fine with too.

Btw, looks like instcombine already canonicalizes towards equality, for instance:

define i1 @f(i32 %x) {
  %v = icmp ugt i32 %x, 0
  ret i1 %v
}

is instcombined to

define i1 @f(i32 %x) {
  %v = icmp ne i32 %x, 0
  ret i1 %v
}

and this change looks like it is doing more of the same.

lib/Transforms/InstCombine/InstCombineCompares.cpp
3287 ↗(On Diff #55404)

I think you can just use dyn_cast here. For a well formed CFG, getTerminator should always return nonnull.

This revision is now accepted and ready to land.May 2 2016, 2:19 PM
bmakam updated this revision to Diff 55968.May 3 2016, 4:12 AM
bmakam edited edge metadata.

Thank you very much for the review Sanjoy.

@majnemer, Is it reasonable to commit this patch?

bmakam marked an inline comment as done.May 3 2016, 4:13 AM
majnemer added inline comments.May 4 2016, 10:18 AM
lib/Transforms/InstCombine/InstCombineCompares.cpp
3306–3317 ↗(On Diff #55968)

The prior two replacements seem like pure goodness: they make conditional branches branch on true or false. These two seem weird to have in InstCombine. Seeing as how you are concerned with pessimization, can we move these to CGP and have them operate on a TargetLowering hook?

majnemer accepted this revision.May 4 2016, 10:20 AM
majnemer added a reviewer: majnemer.
majnemer added inline comments.
lib/Transforms/InstCombine/InstCombineCompares.cpp
3306–3317 ↗(On Diff #55968)

On second thought, this is a nice canonicalization regardless because it makes the comparisons easier to analyze. LGTM.

Thank you very much David! I will commit this patch shortly.

This revision was automatically updated to reflect the committed changes.