Page MenuHomePhabricator

[X86] Handle inverted inputs when matching VPTERNLOG from 2 binary ops.
ClosedPublic

Authored by craig.topper on Sep 5 2021, 10:12 AM.

Details

Summary

This is a more general version of D109273. Though it doesn't
peek through bitcasts or rearange broadcasts.

Diff Detail

Event Timeline

craig.topper created this revision.Sep 5 2021, 10:12 AM
craig.topper requested review of this revision.Sep 5 2021, 10:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 5 2021, 10:12 AM

LGTM but @pengfei might have seen some other cases that D109273 would address.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4048

Missing assert messages

llvm/test/CodeGen/X86/avx512vl-logic.ll
989

Please can you precommit this?

LuoYuanke added inline comments.Sep 6 2021, 5:23 AM
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4231

It seems for VPTERNLOG instruction we can accept the 4th operand whose value is allZero or allOne no matter what is logic operation is.

4233

Is the constant operand canonicalized as operand(1)?

4245

C.getOperand(0)?

LuoYuanke added inline comments.Sep 6 2021, 5:25 AM
llvm/test/CodeGen/X86/avx512vl-logic.ll
985

Miss the test case for ~B and ~C?

LGTM but @pengfei might have seen some other cases that D109273 would address.

The general approach looks great. I don't have other cases. Thanks Craig.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4231

Do you mean the FALSE and TRUE in table 5-10 and 5-11? I think we don't need a VPTERNLOG to generate allZero and allOne.

4233

We checked it in line 4225.

LuoYuanke added inline comments.Sep 6 2021, 6:03 AM
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4231

VPTERNLOG select all the possible result of 3 bits. I mean it can be extent to 4 bit as long as the 4th bit is compile-time fixed 0 or 1. For this case the node is xor (X, -1), the same approach can be applied to xor(X, 0), and(X, -1), andnp(X, 0) and so on.

I found another example:

define dso_local <4 x i64> @foo2(<4 x i64> %0, <4 x i64> %1, <4 x i64> %2) {
  %4 = xor <4 x i64> %2, <i64 -1, i64 -1, i64 -1, i64 -1>
  %5 = or <4 x i64> %4, %1
  %6 = or <4 x i64> %0, %1
  %7 = and <4 x i64> %5, %6
  ret <4 x i64> %7
}

Can we simply it to below in the approach?

vpor    %ymm1, %ymm0, %ymm0
vpternlogq      $208, %ymm2, %ymm1, %ymm0
retq
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4231

But other cases can be simplied directly, e.g. xor(X, 0) -> X, and(X, -1) -> X, andnp(X, 0) -> 0 etc.

I found another example:

define dso_local <4 x i64> @foo2(<4 x i64> %0, <4 x i64> %1, <4 x i64> %2) {
  %4 = xor <4 x i64> %2, <i64 -1, i64 -1, i64 -1, i64 -1>
  %5 = or <4 x i64> %4, %1
  %6 = or <4 x i64> %0, %1
  %7 = and <4 x i64> %5, %6
  ret <4 x i64> %7
}

Can we simply it to below in the approach?

vpor    %ymm1, %ymm0, %ymm0
vpternlogq      $208, %ymm2, %ymm1, %ymm0
retq

Seem no with current vpternlog framework. We currently only support A op1 (B op2 C). Not figured out how to extend the framework to accept more operators as long as there is 3 source bit.

I found another example:

define dso_local <4 x i64> @foo2(<4 x i64> %0, <4 x i64> %1, <4 x i64> %2) {
  %4 = xor <4 x i64> %2, <i64 -1, i64 -1, i64 -1, i64 -1>
  %5 = or <4 x i64> %4, %1
  %6 = or <4 x i64> %0, %1
  %7 = and <4 x i64> %5, %6
  ret <4 x i64> %7
}

Can we simply it to below in the approach?

vpor    %ymm1, %ymm0, %ymm0
vpternlogq      $208, %ymm2, %ymm1, %ymm0
retq

How about gcc? Can gcc generate one vpternlogq instruction?

I found another example:

define dso_local <4 x i64> @foo2(<4 x i64> %0, <4 x i64> %1, <4 x i64> %2) {
  %4 = xor <4 x i64> %2, <i64 -1, i64 -1, i64 -1, i64 -1>
  %5 = or <4 x i64> %4, %1
  %6 = or <4 x i64> %0, %1
  %7 = and <4 x i64> %5, %6
  ret <4 x i64> %7
}

Can we simply it to below in the approach?

vpor    %ymm1, %ymm0, %ymm0
vpternlogq      $208, %ymm2, %ymm1, %ymm0
retq

Seem no with current vpternlog framework. We currently only support A op1 (B op2 C). Not figured out how to extend the framework to accept more operators as long as there is 3 source bit.

I meant simplified from current generation:

vpcmpeqd        %ymm3, %ymm3, %ymm3
vpternlogq      $222, %ymm2, %ymm1, %ymm3
vpternlogq      $200, %ymm1, %ymm3, %ymm0
retq

We can save one vpternlogq.

I meant simplified from current generation:

vpcmpeqd        %ymm3, %ymm3, %ymm3
vpternlogq      $222, %ymm2, %ymm1, %ymm3
vpternlogq      $200, %ymm1, %ymm3, %ymm0
retq

We can save one vpternlogq.

I think we may have another algorithm which iterate 8 possible composition of 3 bits and calculate the result with multi-operates and get the immediate operand of VPTERNLOGD.

VPTERNLOGD reg1, reg2, src3
Bit(reg1) Bit(reg2) Bit(src3)
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

craig.topper added inline comments.Sep 6 2021, 10:14 AM
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4231

Right those all should have been simplified by DAGCombine.

4233

DAGCombine should canonicalize XOR with constant to have it on the RHS.

Address review comments. Rebase after pre-committing tests.

Rename IsNot lambda to PeekThroughNot and sync more code into it.

xbolva00 added a subscriber: xbolva00.EditedSep 6 2021, 11:00 AM

Also some more cases for ternlog

__m512i notBorC(__m512i B, __m512i C) {
    return ~(B|C); // 0x11 
}

__m512i notBandC(__m512i B, __m512i C) {
    return ~(B&C); // 0x77
}

__m512i notBxorC(__m512i B, __m512i C) {
    return ~(B^C); // 0x99
}

Also some more cases for ternlog

__m512i notBorC(__m512i B, __m512i C) {
    return ~(B|C); // 0x11 
}

__m512i notBandC(__m512i B, __m512i C) {
    return ~(B&C); // 0x77
}

__m512i notBxorC(__m512i B, __m512i C) {
    return ~(B^C); // 0x99
}

We should be a little careful there. As far as I know, vpternlog doesn't break dependencies on inputs that aren't used by the immediate. So we should try to use one of the other registers twice to prevent false dependencies. If we can fold a load, we need to make sure we don't duplicate that register and prevent the folding.

LuoYuanke accepted this revision.Sep 6 2021, 5:21 PM

LGTM, thanks.

This revision is now accepted and ready to land.Sep 6 2021, 5:21 PM
pengfei accepted this revision.Sep 6 2021, 5:46 PM