This is the patch that lowers x86 intrinsics to native IR

in order to enable optimizations. The patch also includes folding

of previously missing saturation patterns so that IR emits the same

machine instructions as the intrinsics.

Lowering in clang: https://reviews.llvm.org/D44786

# Details

# Diff Detail

- Repository
- rL LLVM
- Build Status
**Buildable 17393**Build 17393: arc lint + arc unit

### Event Timeline

Please can you rebase to latest? We've already introduced SplitOpsAndApply to do what you're doing with LowerVectorOpToLegalSize

I rebased it locally and it seems I'll need to partly rework it - in the meantime somebody added a pattern for matching signed saturation and I'll need to adjust to it.

I rebased to current version of the compiler. Adjusted my patch to existing signed saturation pattern, changed subus pattern to canonical represantation and corrected minor mistakes.

Note: now for specific intrinsics fast-isel emits suboptimal code.

We should probably have a test file for the IR that isn't the fast-isel or schedule test.

We should also test 512 bit vectors with AVX2/AVX512F and 1024-bit vectors with AVX512BW, etc.

lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|

36017 | What ensures we don't create X86ISD::ADDS for with a type smaller than 128 bits? For example if VT was v4i8 | |

36059 | Line this up with LHS on the line above | |

36061 | Line this up | |

36066 | Identation |

It would need to be additional tests. We need the -upgrade.ll tests to test the AutoUpgrade functionality so they need to keep the old calls.

lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|

36017 | You probably still need a power of 2 check. And a minimum elements check. I don't think SplitOpsAndApply can handle say a 384-bit vector on AVX2. It will try to split it in 256-bit pieces. But that's great test case to add. |

Now there are IR tests in regular test files and previous tests in 'upgrade' test files.

I also added some 512 bit AVX2 tests and 1024 bit AVX512BW tests to ensure the vectors get split properly.

isPowerOf2_32 check is restored.

Move the new IR tests to a new test file. vector-arith-sat.ll or something like that. Intrinsic tests should only call intrinsic functions.

lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|

36007 | Correct indentation |

I moved the IR tests to a new file, corrected fast-isel tests and improved pattern recognition in X86ISelLowering to also detect situations where one input vector is composed of constants.

Heads up.

We are seeing some failures in JIT'ed code due to this revision. The

symptom is a jump to zero with a corrupt stack--or at least a really wacky

one. The reproduction steps are sufficiently complicated that I don't have

a good explanation for why at the moment, but wanted to get this out there.

The simplest example I have of what's breaking is the llvmpipe test code; src/gallium/drivers/llvmpipe/lp_test_blend.c from https://cgit.freedesktop.org/mesa/mesa/. Both 17.0.3 and 18.0.0 break.

llvmpipe was explicitly issuing sse2.psubs and sse2.padds in src/gallium/auxiliary/gallivm/lp_bld_arit.c; that produced working code before this patch, but after this patch immediately crashes. By removing the issue of the sse2 padds/psubs our test case works again.

Debug output looks like this:

llc -mattr option(s): +sse2,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,-prfchw,+bmi2,-cldemote,+fsgsbase,-xsavec,+popcnt,+aes,-avx512bitalg,-xsaves,-avx512er,-avx512vnni,-avx512vpopcntdq,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,-rdseed,-ibt,-sse4a,-avx512bw,-clflushopt,+xsave,-avx512vbmi2,-avx512vl,-avx512cd,+avx,-vaes,-rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,-shstk,+cmov,-avx512vbmi,+movbe,+xsaveopt,-avx512dq,-adx,-avx512pf,+sse3

llc -mcpu option: haswell

test:

0: pushq %rbp 1: movq %rsp, %rbp 4: pushq %rbx 5: subq $40, %rsp 9: movq %r8, %rbx 12: vmovdqa (%rdi), %xmm0 16: vmovdqa (%rdx), %xmm1 20: movabsq $140737174016000, %rax 30: vpand (%rax), %xmm0, %xmm2 34: vpsrld $8, %xmm2, %xmm3 39: vpor %xmm2, %xmm3, %xmm2 43: vpcmpeqd %xmm3, %xmm3, %xmm3 47: vpxor %xmm3, %xmm0, %xmm3 51: movabsq $140737174016032, %rax 61: vpbroadcastd (%rax), %xmm4 66: vmovdqa %xmm4, -48(%rbp) 71: vpblendvb %xmm4, (%rsi), %xmm3, %xmm3 77: vpsrld $16, %xmm2, %xmm4 82: vpor %xmm2, %xmm4, %xmm2 86: vpmovzxbw %xmm0, %xmm4 91: vpxor %xmm5, %xmm5, %xmm5 95: vpunpckhbw %xmm5, %xmm0, %xmm0 99: vpmovzxbw %xmm2, %xmm6 104: vpmullw %xmm4, %xmm6, %xmm4 108: vpunpckhbw %xmm5, %xmm2, %xmm2 112: vpmullw %xmm0, %xmm2, %xmm0 116: vpsrlw $8, %xmm4, %xmm2 121: movabsq $140737174016016, %rax 131: vmovdqa (%rax), %xmm6 135: vpaddw %xmm6, %xmm4, %xmm4 139: vpaddw %xmm4, %xmm2, %xmm2 143: vpsrlw $8, %xmm2, %xmm2 148: vpsrlw $8, %xmm0, %xmm4 153: vpaddw %xmm6, %xmm0, %xmm0 157: vpaddw %xmm0, %xmm4, %xmm0 161: vpsrlw $8, %xmm0, %xmm0 166: vpackuswb %xmm0, %xmm2, %xmm0 170: vpmovzxbw %xmm1, %xmm2 175: vpunpckhbw %xmm5, %xmm1, %xmm1 179: vpmovzxbw %xmm3, %xmm4 184: vpmullw %xmm4, %xmm2, %xmm2 188: vpunpckhbw %xmm5, %xmm3, %xmm3 192: vpmullw %xmm3, %xmm1, %xmm1 196: vpsrlw $8, %xmm2, %xmm3 201: vpaddw %xmm6, %xmm2, %xmm2 205: vpaddw %xmm2, %xmm3, %xmm2 209: vpsrlw $8, %xmm2, %xmm2 214: vpsrlw $8, %xmm1, %xmm3 219: vpaddw %xmm6, %xmm1, %xmm1 223: vpaddw %xmm1, %xmm3, %xmm1 227: vpsrlw $8, %xmm1, %xmm1 232: vpackuswb %xmm1, %xmm2, %xmm1 236: vpminub %xmm1, %xmm0, %xmm2 240: vmovdqa %xmm2, -32(%rbp) 245: movabsq $0, %rax 255: callq *%rax 257: vmovdqa -48(%rbp), %xmm1 262: vpblendvb %xmm1, -32(%rbp), %xmm0, %xmm0 269: vmovdqa %xmm0, (%rbx) 273: addq $40, %rsp 277: popq %rbx 278: popq %rbp 279: retq

After we return from the callq, we seem to have a corrupt stack.

Is the address of the call here supposed to be 0?

245: movabsq $0, %rax 255: callq *%rax

~Craig

I've been trying to see if I can spot anything that would expalin the problem here, but I don't see anything so far. I didn't find some other things I missed during the original review.

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|

36006 ↗ | (On Diff #143074) | Is this isSimple check needed? It limits the maximum vector width we can recognize to 2048 bits, but if someone ever changes the largest vXi16/vXi8 type in MachineValueType.h due to some other target in the future this would change the behavior of this code. |

36056 ↗ | (On Diff #143074) | If the middle end can determine the sign bit of the input to the extend to be 0, it might replace the sext with a zext. Should we be using computeNumSignbits/MaskedValueIsZero here instead of checking specific opcodes? |

36065 ↗ | (On Diff #143074) | Don't we need to verify the constant has the right number of sign bits or zero bits? |

RKSimon, is this what you're asking for?

; Function Attrs: nounwind readnone

declare <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16>, <8 x i16>) #0

; Function Attrs: nounwind

declare <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8>, <16 x i8>) #1

define void @test(<16 x i8>*, <16 x i8>*, <16 x i8>*, <16 x i8>*, <16 x i8>*) {

entry:

%src = load <16 x i8>, <16 x i8>* %0 %src1 = load <16 x i8>, <16 x i8>* %1 %dst = load <16 x i8>, <16 x i8>* %2 %const = load <16 x i8>, <16 x i8>* %3 %5 = and <16 x i8> %src, <i8 0, i8 0, i8 0, i8 -1, i8 0, i8 0, i8 0, i8 -1, i8 0, i8 0, i8 0, i8 -1, i8 0, i8 0, i8 0, i8 -1> %6 = bitcast <16 x i8> %5 to <4 x i32> %7 = lshr <4 x i32> %6, <i32 8, i32 8, i32 8, i32 8> %8 = or <4 x i32> %6, %7 %9 = lshr <4 x i32> %8, <i32 16, i32 16, i32 16, i32 16> %10 = or <4 x i32> %8, %9 %11 = bitcast <4 x i32> %10 to <16 x i8> %12 = xor <16 x i8> %src, <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1> %13 = select <16 x i1> <i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true>, <16 x i8> %12, <16 x i8> %src1 %14 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %15 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %16 = bitcast <16 x i8> %14 to <8 x i16> %17 = bitcast <16 x i8> %15 to <8 x i16> %18 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %19 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %20 = bitcast <16 x i8> %18 to <8 x i16> %21 = bitcast <16 x i8> %19 to <8 x i16> %22 = mul <8 x i16> %16, %20 %23 = lshr <8 x i16> %22, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %24 = add <8 x i16> %22, %23 %25 = add <8 x i16> %24, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %26 = lshr <8 x i16> %25, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %27 = mul <8 x i16> %17, %21 %28 = lshr <8 x i16> %27, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %29 = add <8 x i16> %27, %28 %30 = add <8 x i16> %29, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %31 = lshr <8 x i16> %30, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %32 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %26, <8 x i16> %31) %33 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %34 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %35 = bitcast <16 x i8> %33 to <8 x i16> %36 = bitcast <16 x i8> %34 to <8 x i16> %37 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %38 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %39 = bitcast <16 x i8> %37 to <8 x i16> %40 = bitcast <16 x i8> %38 to <8 x i16> %41 = mul <8 x i16> %35, %39 %42 = lshr <8 x i16> %41, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %43 = add <8 x i16> %41, %42 %44 = add <8 x i16> %43, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %45 = lshr <8 x i16> %44, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %46 = mul <8 x i16> %36, %40 %47 = lshr <8 x i16> %46, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %48 = add <8 x i16> %46, %47 %49 = add <8 x i16> %48, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %50 = lshr <8 x i16> %49, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %51 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %45, <8 x i16> %50) %52 = icmp ult <16 x i8> %32, %51 %53 = sext <16 x i1> %52 to <16 x i8> %54 = trunc <16 x i8> %53 to <16 x i1> %55 = select <16 x i1> %54, <16 x i8> %32, <16 x i8> %51 %56 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %57 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %58 = bitcast <16 x i8> %56 to <8 x i16> %59 = bitcast <16 x i8> %57 to <8 x i16> %60 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %61 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %62 = bitcast <16 x i8> %60 to <8 x i16> %63 = bitcast <16 x i8> %61 to <8 x i16> %64 = mul <8 x i16> %58, %62 %65 = lshr <8 x i16> %64, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %66 = add <8 x i16> %64, %65 %67 = add <8 x i16> %66, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %68 = lshr <8 x i16> %67, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %69 = mul <8 x i16> %59, %63 %70 = lshr <8 x i16> %69, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %71 = add <8 x i16> %69, %70 %72 = add <8 x i16> %71, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %73 = lshr <8 x i16> %72, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %74 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %68, <8 x i16> %73) %75 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %76 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %77 = bitcast <16 x i8> %75 to <8 x i16> %78 = bitcast <16 x i8> %76 to <8 x i16> %79 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23> %80 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31> %81 = bitcast <16 x i8> %79 to <8 x i16> %82 = bitcast <16 x i8> %80 to <8 x i16> %83 = mul <8 x i16> %77, %81 %84 = lshr <8 x i16> %83, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %85 = add <8 x i16> %83, %84 %86 = add <8 x i16> %85, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %87 = lshr <8 x i16> %86, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %88 = mul <8 x i16> %78, %82 %89 = lshr <8 x i16> %88, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %90 = add <8 x i16> %88, %89 %91 = add <8 x i16> %90, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128> %92 = lshr <8 x i16> %91, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8> %93 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %87, <8 x i16> %92) %94 = call <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8> %74, <16 x i8> %93) %res = select <16 x i1> <i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true>, <16 x i8> %94, <16 x i8> %55 store <16 x i8> %res, <16 x i8>* %4 ret void

}

Is the IR being generated on the fly and then fed to the JIT? Which means

it doesn't go through the autoupgrade code since that is only done by the

bitcode reader and the ll parser? If that's the case, you'll need to

generate the replacement sequence directly instead of using the intrinsic.

We still need to fix the issues that I raised in the review though.

~Craig

No luck - I think you're right that its something to do with their pipeline not including the autoupgrade stage.

Reported to llvmpipe at https://bugs.freedesktop.org/show_bug.cgi?id=106231; if I've misunderstood any of your discussion here please add corrections so they can look at how they're using you.

Thomasz, can you open a new review for the changes?

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|

36074 ↗ | (On Diff #143074) | Don't we need to make sure the input type to extends is the same as VT? |

FWIW this is correct, it's using jit so no autoupgrade, so we'll have to fix this in mesa.

However, in the past when intrinsics disappeared, we actually got an error when compiling the IR. See https://bugs.llvm.org/show_bug.cgi?id=28176 for instance: LLVM ERROR: Program used external function '_llvm.x86.sse2.pminu.b' which could not be resolved!

That of course made it a lot more obvious what's going on - no error and just calling a 0 function in the generated code is really not helpful.

Should I upload the whole patch there or just the changes to what is already in trunk?

This is actually crashing normal compiles. Reverting for now. See http://llvm.org/PR37260 for details and test case.