
[InstCombine] Introduce fold for icmp pred (and X, (sh signbit, Y)), 0.
Needs Review · Public

Authored by huihuiz on Jun 3 2019, 10:23 AM.

Details

Summary

Fold:

(X & (signbit l>> Y)) ==/!= 0 -> (X << Y) s>=/s< 0
(X & (signbit << Y)) ==/!= 0 -> (X l>> Y) s>=/s< 0

Diff Detail

Repository
rL LLVM

Event Timeline

huihuiz added a comment.EditedJun 3 2019, 10:40 AM

For the Thumb target, this optimization allows generation of more compact instructions.
Run: clang -mcpu=cortex-m0 -target armv6m-none-eabi icmp-shl-and.ll -O2 -S -o t.s

@ %bb.0:                                @ %entry
        subs    r0, r0, #1
        lsls    r1, r0
        cmp     r1, #0
        blt     .LBB0_2
@ %bb.1:                                @ %entry
        mov     r2, r3
.LBB0_2:                                @ %entry
        mov     r0, r2
        bx      lr

Otherwise, more instructions are generated for the signmask shifting:

@ %bb.0:                                @ %entry
        .save   {r4, lr}
        push    {r4, lr}
        subs    r0, r0, #1
        movs    r4, #1
        lsls    r4, r4, #31
        lsrs    r4, r0
        tst     r4, r1
        beq     .LBB0_2
@ %bb.1:                                @ %entry
        mov     r3, r2
.LBB0_2:                                @ %entry
        mov     r0, r3
        pop     {r4, pc}

The ARM and Thumb2 targets allow a flexible second operand; in this case, a test-bit instruction with a shift. This optimization does not affect the performance of the generated instructions.
Run: clang -mcpu=cortex-a53 -target armv8-none-musleabi icmp-shl-and.ll -O2 -S -o t.s

With this optimization

@ %bb.0:                                @ %entry
        sub     r0, r0, #1
        lsl     r0, r1, r0
        cmp     r0, #0
        movge   r2, r3
        mov     r0, r2
        bx      lr

Without this optimization:

@ %bb.0:                                @ %entry
        sub     r12, r0, #1
        mov     r0, #-2147483648
        tst     r1, r0, lsr r12
        moveq   r2, r3
        mov     r0, r2
        bx      lr
lebedev.ri requested changes to this revision.Jun 3 2019, 10:54 AM
lebedev.ri added a subscriber: lebedev.ri.

This looks like a missing backend-level transform, either a generic-one in DAGCombiner, or in ARMISelLowering.cpp.

This fix is not the right thing to do, because even if you disable this fold
you can still receive the 'bad' IR you are trying to avoid here,
and you will still end up generating bad ASM.

This revision now requires changes to proceed.Jun 3 2019, 10:54 AM

Though this transform is also bad for X86: https://godbolt.org/z/KFM3gQ
That is, of course, only when the C2 << Y isn't hoisted out of the loop.

So we're missing an undo fold: https://rise4fun.com/Alive/w25
Not sure if it should be guarded by a TTI hook; I would expect it to be always beneficial.
(That doesn't mean the original fold is never beneficial, though.)
I'll try to take a look.

On the instcombine side, one thing worth noting which isn't called out in the commit message is the interaction with other instcombine patterns. In the testcase, note that the final IR actually doesn't contain any mask; instead, it checks icmp slt i32 [[SHL]], 0. Huihui, please update the commit message to make this clear.

It's possible we should also implement the related pattern to transform (x & (signbit >> y)) != 0 to (x << y) < 0, sure.

In terms of whether it's universally profitable, I'm not sure... I guess if somehow "icmp ne X, 0" is free, but "icmp slt X, 0" isn't, it could be an issue, but I don't think that applies to any architecture I can think of.

I'm about to post the DAGCombiner undo-fold, hold on...

On the instcombine side, one thing worth noting which isn't called out in the commit message is the interaction with other instcombine patterns. In the testcase, note that the final IR actually doesn't contain any mask; instead, it checks icmp slt i32 [[SHL]], 0. Huihui, please update the commit message to make this clear.

It's possible we should also implement the related pattern to transform (x & (signbit >> y)) != 0 to (x << y) < 0, sure.

Yes, now that would be a good patch, +see inline comment.

In terms of whether it's universally profitable, I'm not sure... I guess if somehow "icmp ne X, 0" is free, but "icmp slt X, 0" isn't, it could be an issue, but I don't think that applies to any architecture I can think of.

I think there may be some confusion here. We are in the middle-end here. Other than via TTI,
we don't really care about whatever a backend/target may find troubling/unprofitable.
We only care about producing the simplest IR, the IR that is most suited for further transforms.
That new IR may, or may not, be optimal for any particular target.
If the IR is not optimal for a backend, then the opposite transform should be present in that backend.

lib/Transforms/InstCombine/InstCombineCompares.cpp
1606–1611 ↗(On Diff #202749)

There should also be a sibling fold with swapped shift directions

test/Transforms/InstCombine/icmp-shl-and.ll
12–13 ↗(On Diff #202749)

Hmm, this already should be folding: https://godbolt.org/z/77mvnv
I guess the order of folds is wrong.

And posted: D62871

As for the instcombine side,
I guess I would recommend a new differential,
with actual folds, not this blacklisting.

The other approach could be changing the order of folding. Move foldICmpBinOpEqualityWithConstant to the very beginning of foldICmpInstWithConstant.
foldICmpBinOpEqualityWithConstant has rules to replace (and X, (1 << size(X)-1) != 0) with x s< 0.
Let me know if this approach is more preferable?

The other approach could be changing the order of folding. Move foldICmpBinOpEqualityWithConstant to the very beginning of foldICmpInstWithConstant.
foldICmpBinOpEqualityWithConstant has rules to replace (and X, (1 << size(X)-1) != 0) with x s< 0.
Let me know if this approach is more preferable?

You want (1+1*2)*2 = 6 folds: https://rise4fun.com/Alive/Y8Ct

huihuiz updated this revision to Diff 203024.Jun 4 2019, 2:28 PM
huihuiz retitled this revision from [InstCombine] Allow ((X << Y) & SignMask) != 0 to be optimized as (X << Y) s< 0. to [InstCombine] Change order of ICmp fold..
huihuiz edited the summary of this revision. (Show Details)

Yes, changing the order would allow these folds:

(X & (signbit l>> Y)) != 0  ->  (X << Y) s< 0
(X & (signbit l>> Y)) == 0  ->  (X << Y) s>= 0
((X << Y) & signbit) != 0  ->  (X << Y) s< 0
((X << Y) & signbit) == 0  ->  (X << Y) s>= 0
lebedev.ri added inline comments.Jun 4 2019, 3:11 PM
lib/Transforms/InstCombine/InstCombineCompares.cpp
1762–1778 ↗(On Diff #203024)

Eww, this looks too much like backend pattern matching :)
Here you want something more like

// (V0 & (signbit l>> V1)) ==/!= 0 -> (V0 << V1) >=/< 0
// (V0 & (signbit << V1)) ==/!= 0 -> (V0 l>> V1) >=/< 0
Value *V0, *V1, *Shift, *Zero;
ICmpInst::Predicate Pred;
if (match(&Cmp,
          m_ICmp(Pred,
                 m_OneUse(m_c_And(
                     m_CombineAnd(
                         m_CombineAnd(m_Shift(m_SignMask(), m_Value(V1)),
                                      m_Value(Shift)),
                         m_CombineOr(m_Shl(m_Value(), m_Value()),
                                     m_LShr(m_Value(), m_Value()))),
                     m_Value(V0))),
                 m_CombineAnd(m_Zero(), m_Value(Zero)))) &&
    Cmp.isEquality(Pred)) {
  Value *NewShift = cast<Instruction>(Shift)->getOpcode() == Instruction::LShr
                        ? Builder.CreateShl(V0, V1)
                        : Builder.CreateLShr(V0, V1);
  ICmpInst::Predicate NewPred =
      Pred == CmpInst::ICMP_EQ ? CmpInst::ICMP_SGE : CmpInst::ICMP_SLT;
  return new ICmpInst(NewPred, NewShift, Zero);
}
lebedev.ri added inline comments.Jun 4 2019, 3:14 PM
lib/Transforms/InstCombine/InstCombineCompares.cpp
2664–2666 ↗(On Diff #203024)

I'm not looking forward to seeing the fallout of this move.
I will be extremely surprised if, while fixing the target problem,
this doesn't expose numerous other fold-order issues.

Can you instead simply follow the TODO, and refactor the single interesting fold
out of foldICmpBinOpEqualityWithConstant() into foldICmpAndConstant(), I guess?

huihuiz updated this revision to Diff 203284.Jun 5 2019, 10:26 PM
huihuiz retitled this revision from [InstCombine] Change order of ICmp fold. to [InstCombine] Fix fold order for icmp pred (and (sh X, Y), C), 0..
huihuiz edited the summary of this revision. (Show Details)

The test cases in icmp-shift-and-signbit.ll show that the updated fold order can generate better IR.

huihuiz marked 2 inline comments as done.Jun 5 2019, 10:46 PM

Nice, getting closer.
Could you please split this up:

  1. A patch that adds your original motivational testcase that shows that the fold order is wrong.
  2. The move of the // Replace (and X, (1 << size(X)-1) != 0) with x s< 0 fold
  3. A patch with just new test/Transforms/InstCombine/icmp-shift-and-signbit.ll
  4. The fold itself, showing the changes to the check lines
lib/Transforms/InstCombine/InstCombineCompares.cpp
1646 ↗(On Diff #203284)

Here the codegen is irrelevant.
We do this because it results in simpler IR.
Not sure if that new comment adds anything useful

1660 ↗(On Diff #203284)

C2->negate().isPowerOf2()

2791–2792 ↗(On Diff #203284)

Uhm, where did the check that we were comparing with 0 go?

test/Transforms/InstCombine/icmp-shift-and-signbit.ll
2 ↗(On Diff #203284)

Please

  1. Move this to a new differential
  2. In that same patch, re-add your initial motivational pattern, that shows that fold reordering did something
  3. Use llvm/utils/update_test_checks.py to generate check lines
  4. Rebase *this* diff ontop of that new patch, so this diff shows how the check lines change
13 ↗(On Diff #203284)

select is not relevant for this pattern, drop it

68 ↗(On Diff #203284)

You also want a few extra tests:

  1. A trivial vector test with <i32 -2147483648, i32 -2147483648> and <i32 0, i32 0>
  2. 3 vector tests with undefs:
    • <i32 -2147483648, i32 undef, i32 -2147483648> and <i32 0, i32 0, i32 0>
    • <i32 -2147483648, i32 -2147483648, i32 -2147483648> and <i32 0, i32 undef, i32 0>
    • <i32 -2147483648, i32 undef, i32 -2147483648> and <i32 0, i32 undef, i32 0>
  3. Tests to verify the single-use constraints:
    • a test with extra use on %shr (should get folded, but not others)
    • a test with extra use on %and
    • a test with extra use on %shr and %and.

      For how to introduce extra uses, see e.g. llvm/test/Transforms/InstCombine/unfold-masked-merge-with-const-mask-scalar.ll
test/Transforms/InstCombine/pr17827.ll
66 ↗(On Diff #203284)

These don't look like improvements to me.
Looks like that reordering exposes yet another missing fold.

huihuiz marked 5 inline comments as done.Jun 7 2019, 2:29 PM

Thank you so much for all the review feedback, really appreciate it! :)

lib/Transforms/InstCombine/InstCombineCompares.cpp
1660 ↗(On Diff #203284)

We should not call C2->negate(): negate() modifies C2 in place, so even when the negated value turns out not to be a power of 2, C2 has already been replaced with its negation.
C2 should not be modified.

2791–2792 ↗(On Diff #203284)

What happened was: C can be 0, the signbit, or some other number.

  1. We are OK with C being 0.
  2. If C is the signbit, consider the test: X & signbit == signbit

The folds:

X & -C == -C -> X u> ~C
X & -C != -C -> X u<= ~C

and the fold (for i32): x u> 2147483647 -> x s< 0 (true if the sign bit is set)

are scheduled before the fold: (and X, (1 << size(X)-1) != 0) -> x s< 0

  3. If C is some other number, SimplifyICmpInst will do its job.
test/Transforms/InstCombine/pr17827.ll
66 ↗(On Diff #203284)

in D63026
I am moving fold ((X & ~7) == 0) --> X < 8 ahead.

If X is (BinOp Y, C3), we should allow other rules to fold C3 with C2,
e.g. ((X >> C3) & C2) != C1 -> (X & (C2 << C3)) != (C1 << C3)

huihuiz updated this revision to Diff 203627.Jun 7 2019, 2:40 PM
huihuiz retitled this revision from [InstCombine] Fix fold order for icmp pred (and (sh X, Y), C), 0. to [InstCombine] Introduce fold for icmp pred (and X, (sh signbit, Y)), 0..
huihuiz edited the summary of this revision. (Show Details)

D62818 is now split into D63025, D63026, D63028, and D62818.

More signum/sgn patterns: https://godbolt.org/z/tE00f4

Hey @xbolva00, I don't see much difference between the codegen of x86-clang and x86-gcc.
Let's focus on the missing folds we are trying to resolve here:

(X & (signbit l>> Y)) ==/!= 0 -> (X << Y) s>=/s< 0
(X & (signbit << Y)) ==/!= 0 -> (X l>> Y) s>=/s< 0

and the fold order issue among

((X << Y) & signbit) ==/!= 0 -> (X << Y) s>=/s< 0;
((X << Y) & ~C) ==/!= 0 -> (X << Y) u</u>= C+1, where C+1 is a power of 2;
and
((X << Y) & C) == 0 -> (X & (C >> Y)) == 0.

Oh, i thought i commented on these reviews, apparently not :(
I still see random changes to test coverage (new tests being added) in non-NFC patches.
Let me rephrase: can you put all the test updates and new tests into *ONE* review, so that the rest of the patches do not add new tests or change existing ones?

huihuiz updated this revision to Diff 204436.Jun 12 2019, 11:52 PM

The original test cases are added in D63025; hopefully that will be good coverage :)
D63026 fixes the fold order issue.
This differential introduces the new fold for icmp pred (and X, (sh signbit, Y)), 0.

Is this the only remaining patch?
I don't think i should review my own code, perhaps @spatel can take a look?

lib/Transforms/InstCombine/InstCombineCompares.cpp
1796 ↗(On Diff #204436)

I'm not sure why I have added m_OneUse() here; it should not be here.

spatel added inline comments.Wed, Jun 26, 9:23 AM
lib/Transforms/InstCombine/InstCombineCompares.cpp
1792–1795 ↗(On Diff #204436)

m_LogicalShift() ?

test/Transforms/InstCombine/signbit-shl-and-icmpeq-zero.ll
180 ↗(On Diff #204436)

I didn't step through the transforms, but it seems wrong to call this a 'negative test'. This patch must have fired and allowed further simplification?

huihuiz updated this revision to Diff 206784.Wed, Jun 26, 11:30 PM
huihuiz marked 3 inline comments as done.

I simplified the pattern-matching code to make it more readable.

huihuiz added inline comments.Wed, Jun 26, 11:33 PM
lib/Transforms/InstCombine/InstCombineCompares.cpp
1796 ↗(On Diff #204436)

I agree that m_OneUse() should not be in the pattern matching. V0 might be a constant value, which can have more than one use outside of the current function.

Actually I added it, not you; sorry about that.

There is a regression, see test file: test/Transforms/InstCombine/onehot_merge.ll

define i1 @foo1_and(i32 %k, i32 %c1, i32 %c2) {
bb:
  %tmp = shl i32 1, %c1
  %tmp4 = lshr i32 -2147483648, %c2
  %tmp1 = and i32 %tmp, %k
  %tmp2 = icmp eq i32 %tmp1, 0
  %tmp5 = and i32 %tmp4, %k
  %tmp6 = icmp eq i32 %tmp5, 0
  %or = or i1 %tmp2, %tmp6
  ret i1 %or
}

It failed to fold
(iszero(A&K1) | iszero(A&K2)) -> (A&(K1|K2)) != (K1|K2), where K1 and K2 are 'one-hot' (only one bit is on).
Here K1 is 1, K2 is the signbit.

I am still thinking how to get over this regression.

test/Transforms/InstCombine/signbit-shl-and-icmpeq-zero.ll
180 ↗(On Diff #204436)

Yes, X being constant is a positive case. The fold happened and allowed further simplification.

lebedev.ri added inline comments.Thu, Jun 27, 1:33 AM
llvm/lib/Transforms/InstCombine/InstCombineCompares.cpp
1806–1808

m_Value(V0) will always match, it's best to swap them.

llvm/test/Transforms/InstCombine/onehot_merge.ll
17–18

Can you please regenerate the original test?

17–18

I'm not sure what's on the LHS of the diff, but ignoring the instruction count this looks like improvement to me.

lebedev.ri added inline comments.Thu, Jun 27, 2:49 AM
llvm/lib/Transforms/InstCombine/InstCombineCompares.cpp
1801–1802
// (V0 & (signbit l>> V1)) ==/!= 0 -> (V0 << V1) s>=/s< 0
// (V0 & (signbit << V1)) ==/!= 0 -> (V0 l>> V1) s>=/s< 0
llvm/test/Transforms/InstCombine/onehot_merge.ll
17–18

Ah, you also want to string-replace %tmp with %t; it likely confuses the update script.

spatel added inline comments.Thu, Jun 27, 7:25 AM
llvm/test/Transforms/InstCombine/onehot_merge.ll
17–18
huihuiz updated this revision to Diff 206890.Thu, Jun 27, 11:04 AM
huihuiz marked 6 inline comments as done.
huihuiz edited the summary of this revision. (Show Details)

Rebased patch, and addressed review comments.

llvm/test/Transforms/InstCombine/onehot_merge.ll
17–18

Actually, we missed the fold for

((k & (1 l<< C1)) == 0) || ((k & (signbit l>> C2)) == 0)
-->
(k & ((1 l<< C1) | (signbit l>> C2))) != ((1 l<< C1) | (signbit l>> C2))
lebedev.ri added inline comments.Thu, Jun 27, 11:21 AM
llvm/test/Transforms/InstCombine/onehot_merge.ll
17–18

Thanks for the analysis!
@spatel, does this fall into the current reasoning that we shouldn't be doing too many folds into bit-math here in instcombine? I'm almost tempted to say that this isn't a regression,
and that the original fold that now no longer happens should be removed instead.

huihuiz updated this revision to Diff 206958.Thu, Jun 27, 4:18 PM

For onehot_merge.ll: mathematically speaking,

(signbit l>> C)

is equivalent to

(one l<< (bitwidth - C - 1))

In D63903, I updated the test input so that we are still checking the fold for 'or' of icmps and 'and' of icmps.

spatel added inline comments.Fri, Jun 28, 10:58 AM
llvm/test/Transforms/InstCombine/onehot_merge.ll
17–18

If we say that the longer IR sequence is more canonical, then we'd want to add a transform to create that longer sequence starting from the shorter sequence. Are we willing to do that to improve analysis in IR?

As a practical matter, we probably also want to look at asm output for the alternatives on a few targets to see how much backend logic is required to do/undo this.

spatel added inline comments.Fri, Jun 28, 11:05 AM
llvm/test/Transforms/InstCombine/onehot_merge.ll
17–18

Sorry - I haven't followed this patch and its friends closely; scrolling back through the comments, I think the backend questions are covered by D62871.

This isn't specific to the sign bit; the more general pattern is https://rise4fun.com/Alive/2zpl
I'm apparently working on it...

lebedev.ri added inline comments.Wed, Jul 3, 1:11 PM
llvm/test/Transforms/InstCombine/onehot_merge.ll
115–122

Looks like to support this pattern, InstCombiner::foldAndOrOfICmpsOfAndWithPow2() will need to be generalized.

huihuiz marked 2 inline comments as done.Wed, Jul 3, 10:26 PM
huihuiz added inline comments.
llvm/test/Transforms/InstCombine/onehot_merge.ll
115–122

I am looking into this, hold on.

huihuiz updated this revision to Diff 208252.Fri, Jul 5, 8:55 PM

Generalize InstCombiner::foldAndOrOfICmpsOfAndWithPow2() in D64275