This is an archive of the discontinued LLVM Phabricator instance.

Add use condition for combine SetCCMOVMSK
ClosedPublic

Authored by xiangzhangllvm on Apr 12 2022, 10:24 PM.

Details

Summary

We encounter a performance drop caused by

6997f4d07fa4b462dd3a02838a2cfed45db9c8a0

We should not try fold allof(cmpeq(x,y)) -> ptest(sub(x,y)) if the cmpeq has other users.

I meet the problem from following case:

// The cmp result has another user at llvm.cttz.i32

  %427 = load <32 x i8>, <32 x i8>* %426, align 1
  %428 = load <32 x i8>, <32 x i8>* %424, align 1
  %**429 **= icmp ne <32 x i8> %427, %428
  %**430 **= bitcast <32 x i1> %**429 **to i32
  %431 = icmp eq i32 %430, 0
  br i1 %431, label %436, label %432

432:                                              ; preds = %421
  %433 = call i32 @llvm.cttz.i32(i32 %**430**, i1 true), !dbg !13967, !range !12902

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2022, 10:24 PM
xiangzhangllvm requested review of this revision.Apr 12 2022, 10:24 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2022, 10:24 PM
wxiao3 added a subscriber: wxiao3.Apr 12 2022, 11:23 PM

Please can you add test coverage - preferably for both cases?

Please can you add test coverage - preferably for both cases?

I want to, if it is easy to reproduced by a small case, I had add the lit test : )
This combine directly works on DAG, hard to control/tune the "input" DAG for this combine.
I spend hours to write a workable small reproducer, but failed.
I tried to fetch the problem code from the perf-drop project :

 1 ; llc -mtriple=x86_64-unknown -mattr=+avx2
 2
 3 define i1 @zx(<16 x i8> %x, <16 x i8>%y) {
 4   %a = icmp ne <16 x i8> %x, %y
 5   %b = bitcast <16 x i1> %a to i16
 6   %c = icmp eq i16 %b, 0
 7   br i1 %c, label %zx1, label %zx2
 8
 9 zx1:
10   ret i1 1
11
12 zx2:
13  ; call void @fun(i16 %b)
14   ret i1 0
15 }
16
17 declare void @fun(i16 %b)

It can reproduce the "combine allof(cmpeq(x,y)) -> ptest(sub(x,y))", out put:

 8 # %bb.0:
 9         vpsubb  %ymm1, %ymm0, %ymm0
10         vptest  %ymm0, %ymm0
11         je      .LBB0_1
12 # %bb.2:                                # %zx2
13         xorl    %eax, %eax
14         vzeroupper
15         retq
16 .LBB0_1:                                # %zx1
17         movb    $1, %al
18         vzeroupper
19         retq

When I uncomment the test's line 13 " ; call void @fun(i16 %b)" to " call void @fun(i16 %b)", let the cmp result has more uses.
The previous optimization will change the DAG
from

SelectionDAG has 14 nodes:
  t0: ch = EntryToken
            t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
            t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
          t26: v32i8 = X86ISD::PCMPEQ t2, t4
        t31: i32 = X86ISD::MOVMSK t26
      t34: i32,i32 = X86ISD::SUB t31, Constant:i32<-1>
    t36: ch = X86ISD::BRCOND t0, BasicBlock:ch<zx2 0xb5ed448>, TargetConstant:i8<5>, t34:1
  t16: ch = br t36, BasicBlock:ch<zx1 0xb5ed360>

to (change the SUB to XOR )

SelectionDAG has 16 nodes:
  t0: ch = EntryToken
        t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
        t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %2
      t28: v32i8 = X86ISD::PCMPEQ t2, t4
    t33: i32 = X86ISD::MOVMSK t28
  t35: i32,i32 = X86ISD::XOR t33, Constant:i32<-1>
      t9: ch = CopyToReg t0, Register:i32 %0, t35
    t37: ch = X86ISD::BRCOND t9, BasicBlock:ch<zx2 0xb5ed288>, TargetConstant:i8<5>, t35:1
  t18: ch = br t37, BasicBlock:ch<zx1 0xb5ed1a0>

So, it fail to meet the logic to do such combine. (fail to reproduce).

But from the DAG level, take combine "MOVMSK(PCMPEQ(X,Y)) == -1 -> PTESTZ(SUB(X,Y),SUB(X,Y))" for example:
This is a obvious bug to combine the "mid-node" without considering if the "mid-node" has other uses.

Maybe use llvm-reduce to reduce the original IR? You can use llvm-extract --bb=FunctionName:DAGCombineBB YourIR.ll to get the bb first.

Maybe use llvm-reduce to reduce the original IR? You can use llvm-extract --bb=FunctionName:DAGCombineBB YourIR.ll to get the bb first.

I have no *.ll file from the perf test, it only can reproduced with -flto.

The small test case is easy to reproduced on llvm.14 PLS refer EY93Kc77E
it continue to do the combine ( gen vpsubb + vptest ) even there is more uses of the cmp result.

but hard to reproduce on llvm trunc. (But the bug is still here)

Try this test case?

define i32 @test_v32i8_muti_uses(<32 x i8> %x, <32 x i8>%y, i32 %z) {
  %a = icmp eq <32 x i8> %x, %y
  %b = bitcast <32 x i1> %a to i32
  %c = icmp eq i32 %b, -1

  %res = select i1 %c, i32 16, i32 %b
  ret i32 %res
}

Try this test case?

That is cool!

Add yuanke's test

please can you rebase?

please can you rebase?

Sure! I am try to first commit the test, when I update the code, I find you have commit it! : ) Thank you very much!

Rebased. Thank you!

LuoYuanke added inline comments.Apr 20 2022, 6:48 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
45012

Move the check to line 45062? I'm not sure if we should check it for all cases. Currently the existing test cases don't cover all scenarios. The same for IsAnyof.

xiangzhangllvm added inline comments.Apr 20 2022, 7:04 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
45012

I think all the combine should stop if the mid-node cmp result has muti-uses. And from current llvm test. We don't find any tests go bad.
@RKSimon How do you think about it. I thinks you are more familiar with here code.

RKSimon added inline comments.Apr 21 2022, 4:03 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
45010

I'm struggling to get a test case for the any-of pattern (the PTEST fold only occurs for all-of) - have you seen anything?

45012

Yes, most of the folds are canonicalizations that shouldn't care much about multi-uses - it might make sense to instead split this off:

bool IsOneUseAllOf = IsAllOf && CmpOp.getNode()->hasOneUse();

And then just use IsOneUseAllOf in the PTEST fold.

I created a test case in test/CodeGen/X86/vector-compare-any_of.ll. Pls reabse.

Thank you for your effort !!
Hi @RKSimon , a small question: do your local lit test can be checked by “make check-all” these 2 days ?
I ask it because I try to change the code and do make check-all to find out which tests will be affect. But I find we (out team) all meet “make check-all” problem these days. (Can not finish in very long time)

LuoYuanke added inline comments.Apr 24 2022, 7:49 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
45012

I think we can use IsOneUseAllOf and IsOneUseAnyOf to fix 2 the cases that we know one use should be checked first. If there are more cases show only one use is optimizable, we can fix them in another patch.

xiangzhangllvm added inline comments.Apr 24 2022, 8:39 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
45012

No problem,
just want to mention 2 point,:
1 this is a optimization issue not correctness issue.
2 here is no muti-uses tested before, it mean the old authors didn't consider muti-uses.

Use IsOneUse, let it more readable.

LuoYuanke accepted this revision.EditedApr 25 2022, 8:35 PM

LGTM, thanks. Pls wait for 1 or 2 days in case Simon has comments on the patch.

This revision is now accepted and ready to land.Apr 25 2022, 8:35 PM
RKSimon accepted this revision.Apr 26 2022, 12:29 AM

LGTM - cheers

Many thanks for your careful reviewing !!

This revision was landed with ongoing or failed builds.Apr 26 2022, 1:44 AM
This revision was automatically updated to reflect the committed changes.