Page MenuHomePhabricator

[InstCombine] Generalize InstCombiner::foldAndOrOfICmpsOfAndWithPow2().
Needs ReviewPublic

Authored by huihuiz on Fri, Jul 5, 8:52 PM.

Details

Summary

In the middle-end, (A & (signbit l>> K)) ==/!= 0 can be folded into
(A l<< K) s>=/s< 0.

When folding 'And' or 'Or' of ICmps, need to extend the iszero() check
of (A & (signbit l>> K)) to its equivalent form of (A l<< K) s>=/s< 0.

Diff Detail

Repository
rL LLVM

Event Timeline

huihuiz created this revision.Fri, Jul 5, 8:52 PM
lebedev.ri requested changes to this revision.Tue, Jul 9, 1:06 AM

Thanks for working on this!
This is for sure missing some one-use checks (i.e. increases instruction count).
Could you please add 5 copies of @foo1_and_signbit_lshr_without_shifting_signbit, each one with different instruction having extra use?
It is also a good idea to add a test where both sides of or/and are in this new form.

llvm/test/Transforms/InstCombine/onehot_merge.ll
163–166

I'm guessing all these instructions must be one-use.

This revision now requires changes to proceed.Tue, Jul 9, 1:06 AM
huihuiz updated this revision to Diff 209121.Wed, Jul 10, 10:02 PM

Addressed review comments

huihuiz added a comment.EditedWed, Jul 10, 10:15 PM

Thanks for working on this!
This is for sure missing some one-use checks (i.e. increases instruction count).
Could you please add 5 copies of @foo1_and_signbit_lshr_without_shifting_signbit, each one with different instruction having extra use?

Thank you for catching this!
Yes indeed, increases instruction count.

For (A & K) ==/!= 0, need to restrict one use for 'and' and 'cmp' , there is no need to check K ('shift') for more than one use.
For equivalent form (A l<< K) s<=/s> 0, need to restrict one use for 'shift' and 'cmp'

It is also a good idea to add a test where both sides of or/and are in this new form.

Test cases added, please expand to see (as there is no diff)
I believe we should not try to decompose (A l<< K) s<=/s> 0 into (A & (signbit l>> K)) ==/!= 0 when both sides are in this new form, as this decomposition introduce additional instruction of signbit shift.
Also, foldLogOpOfMaskedICmps() in InstCombineCompares.cpp will try to fold ((A l<< K1) s<=/> 0) &/| ((A l<< K2) s<=/> 0) into a single icmp.

I think this looks ok, but i'd like to see more test coverage :)

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
888

could just return match(...);

889

Some test should "regress" (improve due to this), could you please add similar extra-use test coverage for this too?

908–910

To be pointed out, this indeed creates an instruction while we don't know yet whether we will be able to do the fold,
not great, but i think this is the smallest evil, the alternative solution would indeed be uglier..

llvm/test/Transforms/InstCombine/onehot_merge.ll
27–32

Similarly, could you please add 6 tests with different of each of these instructions having an extra use?

lebedev.ri requested changes to this revision.Thu, Jul 11, 12:26 AM

(just marking as reviewed)

This revision now requires changes to proceed.Thu, Jul 11, 12:26 AM
spatel added inline comments.Thu, Jul 11, 10:59 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
908–910

Hmm...do we have a test where the temporary instruction is created, but the fold fails?
Last time I made that mistake, we infinite looped: the temporary instruction gets created and deleted triggering another round of combining and repeat forever.

lebedev.ri added inline comments.Thu, Jul 11, 11:09 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
908–910

Oh, hmm.

huihuiz marked an inline comment as done.Thu, Jul 11, 11:42 AM
huihuiz added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
908–910

Thank you for point this out! Indeed there are chances of infinite looping

see test

define i1 @foo(i32 %k, i32 %c1, i32 %c2) {
  %t0 = shl i32 3, %c1
  %t1 = and i32 %t0, %k
  %t2 = icmp eq i32 %t1, 0
  %t3 = shl i32 %k, %c2
  %t4 = icmp sgt i32 %t3, -1
  %or = or i1 %t2, %t4
  ret i1 %or
}

I am working on a solution now.

lebedev.ri added inline comments.Thu, Jul 11, 11:49 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
908–910

Can we just manually delete the instruction if the fold fails?
I'm sure that can be nicely wrapped into RAII wrapper.

spatel added inline comments.Thu, Jul 11, 1:12 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
908–910

I'm skeptical. I could be wrong, but it's not done anywhere else in instcombine?
Could we send 'B2' out as a partial result (for example, if it's not null, we matched), and then have the caller create the shift if needed?

lebedev.ri added inline comments.Thu, Jul 11, 1:16 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
908–910

If we don't create this instruction, i guess e.g. isCompareToZeroOrEquivalentForm(), isKnownToBeAPowerOfTwo() will need to be also modified, no?

huihuiz updated this revision to Diff 209341.Thu, Jul 11, 2:37 PM

I think we should only create signbit shift instruction when fold is supposed to happen.

I update the differential, let me know what you guys think?

huihuiz marked 2 inline comments as done.Thu, Jul 11, 2:42 PM

test case foo1_and_signbit_lshr_without_shifting_signbit_not_pwr2() aim to check that we are not creating additional shift instruction when fold fails.

delete the signbit shift instruction might not break the cycle of infinite "create and delete".

Hm, not that ugly; this is still correct?

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
893–897

Can we end up needing to create this shift on both sides?
If yes, then can you make this a lambda helper?
If not, create it once and std::swap() into proper variable?

huihuiz updated this revision to Diff 209368.Thu, Jul 11, 4:09 PM
huihuiz marked an inline comment as done.
huihuiz added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
893–897

We are not supposed to create for both sides, this will be handled by early exit.

lambda helper probably better.

lebedev.ri added inline comments.Thu, Jul 11, 4:14 PM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
927

This needs a comment as to why we're creating a node (in what pattern are we?)

llvm/test/Transforms/InstCombine/onehot_merge.ll
246

Looks like this one should?

315

Looks like this one should?

338

Could fold, no instruction count increase

huihuiz updated this revision to Diff 209382.Thu, Jul 11, 5:36 PM
huihuiz marked 3 inline comments as done.
huihuiz added inline comments.
llvm/test/Transforms/InstCombine/onehot_merge.ll
246

I am OK with allowing 'and' to have more than one use.

Folding cut instruction count by 2, the worst case is no instruction count increase.

338

We should not fold when 'cmp' has more than one use, this will increase instruction count when one side is using the new form.

Take this as an example, we need to keep the extra 'cmp' that can't be reused, while the benefit of folding is offset by the added signbit shift instruction.

define i1 @foo(i32 %k, i32 %c1, i32 %c2, i1* %p) {
  %t0 = shl i32 1, %c1
  %t1 = and i32 %t0, %k
  %t2 = icmp eq i32 %t1, 0
  store i1 %t2, i1* %p  ; extra use of cmp
  %t3 = shl i32 %k, %c2
  %t4 = icmp sgt i32 %t3, -1
  %or = or i1 %t2, %t4
  ret i1 %or
}
huihuiz updated this revision to Diff 209384.Thu, Jul 11, 5:50 PM

forget to update the test comment, sorry

lebedev.ri marked an inline comment as done.Fri, Jul 12, 3:06 PM
lebedev.ri added inline comments.
llvm/test/Transforms/InstCombine/onehot_merge.ll
246

the worst case is no instruction count increase.

Precisely.

338

I believe i'm missing the point in @foo1_and_extra_use_cmp2.
We were doing this fold previously (in llvm trunk), now we don't because of use-count.
But this restriction does not seem to improve anything?
We had 8 instructions, and we have 8 instructions without the fold?

huihuiz marked an inline comment as done.Fri, Jul 12, 4:10 PM
huihuiz added inline comments.
llvm/test/Transforms/InstCombine/onehot_merge.ll
338

This restriction doesn't improve anything, but make sure we don't increase instruction count for certain cases.

If 'cmp' has extra use, and one side is using the new form, e.g., @foo1_and_signbit_lshr_without_shifting_signbit_extra_use_cmp1/2 (please expand the test file to see)
Without this restriction, instruction count actually increased.

If 'cmp' has extra use, and both sides are using old form, e.g., @foo1_and_extra_use_cmp, we have 8 instructions with and without the fold.

Here is the diff without this restriction:

--- a/llvm/test/Transforms/InstCombine/onehot_merge.ll
+++ b/llvm/test/Transforms/InstCombine/onehot_merge.ll
@@ -274,10 +274,10 @@ define i1 @foo1_and_extra_use_cmp(i32 %k, i32 %c1, i32 %c2, i1* %p) {
 ; CHECK-NEXT:    [[T2:%.*]] = and i32 [[T0]], [[K:%.*]]
 ; CHECK-NEXT:    [[T3:%.*]] = icmp eq i32 [[T2]], 0
 ; CHECK-NEXT:    store i1 [[T3]], i1* [[P:%.*]], align 1
-; CHECK-NEXT:    [[T4:%.*]] = and i32 [[T1]], [[K]]
-; CHECK-NEXT:    [[T5:%.*]] = icmp eq i32 [[T4]], 0
-; CHECK-NEXT:    [[OR:%.*]] = or i1 [[T3]], [[T5]]
-; CHECK-NEXT:    ret i1 [[OR]]
+; CHECK-NEXT:    [[TMP1:%.*]] = or i32 [[T0]], [[T1]]
+; CHECK-NEXT:    [[TMP2:%.*]] = and i32 [[TMP1]], [[K]]
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ne i32 [[TMP2]], [[TMP1]]
+; CHECK-NEXT:    ret i1 [[TMP3]]
 ;
   %t0 = shl i32 1, %c1
   %t1 = shl i32 1, %c2
@@ -340,13 +340,13 @@ define i1 @foo1_and_extra_use_cmp2(i32 %k, i32 %c1, i32 %c2, i1* %p) {
 ; CHECK-LABEL: @foo1_and_extra_use_cmp2(
 ; CHECK-NEXT:    [[T0:%.*]] = shl i32 1, [[C1:%.*]]
 ; CHECK-NEXT:    [[T1:%.*]] = shl i32 1, [[C2:%.*]]
-; CHECK-NEXT:    [[T2:%.*]] = and i32 [[T0]], [[K:%.*]]
-; CHECK-NEXT:    [[T3:%.*]] = icmp eq i32 [[T2]], 0
-; CHECK-NEXT:    [[T4:%.*]] = and i32 [[T1]], [[K]]
+; CHECK-NEXT:    [[T4:%.*]] = and i32 [[T1]], [[K:%.*]]
 ; CHECK-NEXT:    [[T5:%.*]] = icmp eq i32 [[T4]], 0
 ; CHECK-NEXT:    store i1 [[T5]], i1* [[P:%.*]], align 1
-; CHECK-NEXT:    [[OR:%.*]] = or i1 [[T3]], [[T5]]
-; CHECK-NEXT:    ret i1 [[OR]]
+; CHECK-NEXT:    [[TMP1:%.*]] = or i32 [[T0]], [[T1]]
+; CHECK-NEXT:    [[TMP2:%.*]] = and i32 [[TMP1]], [[K]]
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ne i32 [[TMP2]], [[TMP1]]
+; CHECK-NEXT:    ret i1 [[TMP3]]
 ;
   %t0 = shl i32 1, %c1
   %t1 = shl i32 1, %c2
@@ -410,10 +410,11 @@ define i1 @foo1_and_signbit_lshr_without_shifting_signbit_extra_use_cmp1(i32 %k,
 ; CHECK-NEXT:    [[T1:%.*]] = and i32 [[T0]], [[K:%.*]]
 ; CHECK-NEXT:    [[T2:%.*]] = icmp eq i32 [[T1]], 0
 ; CHECK-NEXT:    store i1 [[T2]], i1* [[P:%.*]], align 1
-; CHECK-NEXT:    [[T3:%.*]] = shl i32 [[K]], [[C2:%.*]]
-; CHECK-NEXT:    [[T4:%.*]] = icmp sgt i32 [[T3]], -1
-; CHECK-NEXT:    [[OR:%.*]] = or i1 [[T2]], [[T4]]
-; CHECK-NEXT:    ret i1 [[OR]]
+; CHECK-NEXT:    [[TMP1:%.*]] = lshr i32 -2147483648, [[C2:%.*]]
+; CHECK-NEXT:    [[TMP2:%.*]] = or i32 [[T0]], [[TMP1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = and i32 [[TMP2]], [[K]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], [[TMP2]]
+; CHECK-NEXT:    ret i1 [[TMP4]]
 ;
   %t0 = shl i32 1, %c1
   %t1 = and i32 %t0, %k

 define i1 @foo1_and_signbit_lshr_without_shifting_signbit_extra_use_cmp2(i32 %k, i32 %c1, i32 %c2, i1* %p) {
 ; CHECK-LABEL: @foo1_and_signbit_lshr_without_shifting_signbit_extra_use_cmp2(
 ; CHECK-NEXT:    [[T0:%.*]] = shl i32 1, [[C1:%.*]]
-; CHECK-NEXT:    [[T1:%.*]] = and i32 [[T0]], [[K:%.*]]
-; CHECK-NEXT:    [[T2:%.*]] = icmp eq i32 [[T1]], 0
-; CHECK-NEXT:    [[T3:%.*]] = shl i32 [[K]], [[C2:%.*]]
+; CHECK-NEXT:    [[T3:%.*]] = shl i32 [[K:%.*]], [[C2:%.*]]
 ; CHECK-NEXT:    [[T4:%.*]] = icmp sgt i32 [[T3]], -1
 ; CHECK-NEXT:    store i1 [[T4]], i1* [[P:%.*]], align 1
-; CHECK-NEXT:    [[OR:%.*]] = or i1 [[T2]], [[T4]]
-; CHECK-NEXT:    ret i1 [[OR]]
+; CHECK-NEXT:    [[TMP1:%.*]] = lshr i32 -2147483648, [[C2]]
+; CHECK-NEXT:    [[TMP2:%.*]] = or i32 [[T0]], [[TMP1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = and i32 [[TMP2]], [[K]]
+; CHECK-NEXT:    [[TMP4:%.*]] = icmp ne i32 [[TMP3]], [[TMP2]]
+; CHECK-NEXT:    ret i1 [[TMP4]]
 ;
   %t0 = shl i32 1, %c1
   %t1 = and i32 %t0, %k
lebedev.ri added inline comments.Fri, Jul 12, 11:06 PM
llvm/test/Transforms/InstCombine/onehot_merge.ll
338

What i'm saying is, yes, sure, that one-use check is nessesary for some of the patterns, but for this one
it causes regressions for no benefit. Can that check be tightened up to not have *this* false-positive?