This is a more general version of D109273, though it doesn't peek through bitcasts or rearrange broadcasts.

# Diff Detail

- Repository
- rG LLVM Github Monorepo

### Event Timeline

llvm/test/CodeGen/X86/avx512vl-logic.ll | |
---|---|
985 | Are the test cases for ~B and ~C missing? |

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp | |
---|---|
4231 | VPTERNLOG can select any possible result of 3 input bits. I mean it can be extended to 4 bits as long as the 4th bit is a compile-time constant 0 or 1. In this case the node is xor(X, -1); the same approach can be applied to xor(X, 0), and(X, -1), andnp(X, 0), and so on. |

I found another example:

    define dso_local <4 x i64> @foo2(<4 x i64> %0, <4 x i64> %1, <4 x i64> %2) {
      %4 = xor <4 x i64> %2, <i64 -1, i64 -1, i64 -1, i64 -1>
      %5 = or <4 x i64> %4, %1
      %6 = or <4 x i64> %0, %1
      %7 = and <4 x i64> %5, %6
      ret <4 x i64> %7
    }

Can we simplify it to the following with this approach?

    vpor %ymm1, %ymm0, %ymm0
    vpternlogq $208, %ymm2, %ymm1, %ymm0
    retq

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp | |
---|---|
4231 | But the other cases can be simplified directly, e.g. xor(X, 0) -> X, and(X, -1) -> X, andnp(X, 0) -> 0, etc. |

It seems not, with the current vpternlog framework. We currently only support A op1 (B op2 C). I haven't figured out how to extend the framework to accept more operators as long as there are only 3 source bits.

I meant simplified from current generation:

    vpcmpeqd %ymm3, %ymm3, %ymm3
    vpternlogq $222, %ymm2, %ymm1, %ymm3
    vpternlogq $200, %ymm1, %ymm3, %ymm0
    retq

We can save one vpternlogq.
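To make the comparison concrete, here is a minimal sketch that models one 64-bit lane of vpternlogq in plain C++ and checks both sequences against the IR of foo2. The names `ternlog`, `reference`, `current`, and `proposed` are made up for this illustration and are not LLVM code. The model assumes the usual encoding where the destination register supplies the high index bit and the last source the low index bit.

```cpp
#include <cstdint>

// Illustrative bitwise model of VPTERNLOG on one 64-bit lane.
// a = destination register, b = second operand, c = third operand (Intel order).
// For every bit position the three input bits index into imm.
constexpr std::uint64_t ternlog(std::uint64_t a, std::uint64_t b,
                                std::uint64_t c, std::uint8_t imm) {
  std::uint64_t out = 0;
  for (unsigned i = 0; i < 64; ++i) {
    unsigned idx = static_cast<unsigned>((((a >> i) & 1) << 2) |
                                         (((b >> i) & 1) << 1) |
                                         ((c >> i) & 1));
    out |= static_cast<std::uint64_t>((imm >> idx) & 1) << i;
  }
  return out;
}

// Semantics of foo2 taken from the IR: (~%2 | %1) & (%0 | %1).
constexpr std::uint64_t reference(std::uint64_t x0, std::uint64_t x1,
                                  std::uint64_t x2) {
  return (~x2 | x1) & (x0 | x1);
}

// Current generation, with ymm0 = x0, ymm1 = x1, ymm2 = x2.
constexpr std::uint64_t current(std::uint64_t x0, std::uint64_t x1,
                                std::uint64_t x2) {
  std::uint64_t ymm3 = ~std::uint64_t{0}; // vpcmpeqd %ymm3, %ymm3, %ymm3
  ymm3 = ternlog(ymm3, x1, x2, 222);      // vpternlogq $222, %ymm2, %ymm1, %ymm3
  return ternlog(x0, ymm3, x1, 200);      // vpternlogq $200, %ymm1, %ymm3, %ymm0
}

// Proposed sequence.
constexpr std::uint64_t proposed(std::uint64_t x0, std::uint64_t x1,
                                 std::uint64_t x2) {
  std::uint64_t ymm0 = x0 | x1;           // vpor %ymm1, %ymm0, %ymm0
  return ternlog(ymm0, x1, x2, 208);      // vpternlogq $208, %ymm2, %ymm1, %ymm0
}

// These three patterns put all eight (x0, x1, x2) bit combinations into every
// byte, so agreement here implies agreement for all inputs of a bitwise function.
constexpr std::uint64_t P0 = 0xF0F0F0F0F0F0F0F0ULL;
constexpr std::uint64_t P1 = 0xCCCCCCCCCCCCCCCCULL;
constexpr std::uint64_t P2 = 0xAAAAAAAAAAAAAAAAULL;

static_assert(current(P0, P1, P2) == reference(P0, P1, P2), "current matches IR");
static_assert(proposed(P0, P1, P2) == reference(P0, P1, P2), "proposed matches IR");
```

Under this model both sequences compute the same value, so the shorter one would be a valid replacement as far as the logic goes.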

I think we could use another algorithm that iterates over the 8 possible combinations of the 3 input bits, evaluates the result through the whole chain of operations, and derives the immediate operand of VPTERNLOGD from it.

VPTERNLOGD reg1, reg2, src3

Bit(reg1) | Bit(reg2) | Bit(src3)
---|---|---
0 | 0 | 0
0 | 0 | 1
0 | 1 | 0
0 | 1 | 1
1 | 0 | 0
1 | 0 | 1
1 | 1 | 0
1 | 1 | 1
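A rough sketch of that iteration, assuming the row index is (Bit(reg1) << 2) | (Bit(reg2) << 1) | Bit(src3) and that bit `row` of the immediate holds the result for that row. `ternlogImm`, `andOrNot`, and `foo2` are hypothetical names for this illustration, not code from the patch.

```cpp
#include <cstdint>

// Hypothetical helper: builds the VPTERNLOG imm8 for any bitwise function
// f(a, b, c) by walking the eight rows of the truth table.
// a is the bit from reg1 (the destination), b from reg2, c from src3.
template <typename F>
constexpr std::uint8_t ternlogImm(F f) {
  std::uint8_t imm = 0;
  for (unsigned row = 0; row < 8; ++row) {
    bool a = (row >> 2) & 1;
    bool b = (row >> 1) & 1;
    bool c = row & 1;
    if (f(a, b, c))
      imm |= static_cast<std::uint8_t>(1u << row);
  }
  return imm;
}

// Second instruction of the proposed two-instruction sequence: a & (b | ~c).
constexpr bool andOrNot(bool a, bool b, bool c) { return a && (b || !c); }

// The whole foo2 expression from the IR: (~c | b) & (a | b).
constexpr bool foo2(bool a, bool b, bool c) { return (!c || b) && (a || b); }

static_assert(ternlogImm(andOrNot) == 208, "matches vpternlogq $208");
static_assert(ternlogImm(foo2) == 0xDC, "foo2 evaluated as one truth table");
```

Because the function is evaluated as a black box, the number of operators in the expression does not matter, only that it has three inputs. In this encoding the complete foo2 expression comes out as 0xDC; whether instruction selection can reach that form is a separate question.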

Also some more cases for ternlog

    __m512i notBorC(__m512i B, __m512i C) {
      return ~(B|C); // 0x11
    }
    __m512i notBandC(__m512i B, __m512i C) {
      return ~(B&C); // 0x77
    }
    __m512i notBxorC(__m512i B, __m512i C) {
      return ~(B^C); // 0x99
    }
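The immediates in those comments can be double-checked by evaluating each expression on three byte constants, one per truth-table column. This is only a sketch: the `Magic*` and `imm*` names are chosen for this example, and it assumes A is the destination operand, with B and C the two sources in Intel operand order.

```cpp
#include <cstdint>

// Each constant is one column of the VPTERNLOG truth table read as a byte,
// with row index (A << 2) | (B << 1) | C.
constexpr std::uint8_t MagicA = 0xF0;
constexpr std::uint8_t MagicB = 0xCC;
constexpr std::uint8_t MagicC = 0xAA;

// Evaluating an expression on the magic bytes yields its imm8 directly.
// The casts truncate the int result of ~ back to 8 bits.
constexpr std::uint8_t immNotBorC = static_cast<std::uint8_t>(~(MagicB | MagicC));
constexpr std::uint8_t immNotBandC = static_cast<std::uint8_t>(~(MagicB & MagicC));
constexpr std::uint8_t immNotBxorC = static_cast<std::uint8_t>(~(MagicB ^ MagicC));

// An expression that also uses A: A & (B | ~C), i.e. the $208 immediate.
constexpr std::uint8_t immAandBorNotC =
    static_cast<std::uint8_t>(MagicA & (MagicB | ~MagicC));

static_assert(immNotBorC == 0x11, "~(B|C)");
static_assert(immNotBandC == 0x77, "~(B&C)");
static_assert(immNotBxorC == 0x99, "~(B^C)");
static_assert(immAandBorNotC == 0xD0, "A & (B | ~C)");
```

None of the three two-input cases reads A, so the immediate is the same whichever value ends up in the destination operand.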

We should be a little careful there. As far as I know, vpternlog doesn't break dependencies on inputs that aren't used by the immediate. So we should try to use one of the other registers twice to prevent false dependencies. If we can fold a load, we need to make sure we don't duplicate that register and prevent the folding.