
Lowering x86 adds/addus/subs/subus intrinsics (llvm part)
ClosedPublic

Authored by tkrupa on Mar 22 2018, 9:21 AM.

Details

Summary

This patch lowers x86 adds/addus/subs/subus intrinsics to native IR
in order to enable optimizations. The patch also includes folding
of previously missing saturation patterns so that the IR emits the same
machine instructions as the intrinsics.
Lowering in clang: https://reviews.llvm.org/D44786

Diff Detail

Event Timeline

tkrupa created this revision.Mar 22 2018, 9:21 AM
tkrupa edited the summary of this revision. (Show Details)Mar 22 2018, 9:25 AM

Please can you rebase to latest? We've already introduced SplitOpsAndApply to do what you're doing with LowerVectorOpToLegalSize

Please can you rebase to latest? We've already introduced SplitOpsAndApply to do what you're doing with LowerVectorOpToLegalSize

I rebased it locally and it seems I'll need to partly rework it - in the meantime somebody added a pattern for matching signed saturation and I'll need to adjust to it.

craig.topper added inline comments.Mar 28 2018, 8:33 PM
lib/IR/AutoUpgrade.cpp
87–99

The next release will be 7.0 not 6.0.

lib/Target/X86/X86ISelLowering.cpp
34015

Is this line longer than 80 characters?

tkrupa marked an inline comment as done.Mar 29 2018, 7:13 AM
tkrupa updated this revision to Diff 140232.EditedMar 29 2018, 7:15 AM

I rebased to the current version of the compiler, adjusted my patch to the existing signed saturation pattern, changed the subus pattern to its canonical representation, and corrected minor mistakes.
Note: fast-isel now emits suboptimal code for specific intrinsics.

We should probably have a test file for the IR that isn't the fast-isel or schedule test.

We should also test 512-bit vectors with AVX2/AVX512F and 1024-bit vectors with AVX512BW, etc.

lib/Target/X86/X86ISelLowering.cpp
36017

What ensures we don't create X86ISD::ADDS with a type smaller than 128 bits? For example, if VT were v4i8.

36059

Line this up with LHS on the line above

36061

Line this up

36066

Indentation

tkrupa updated this revision to Diff 140400.Mar 30 2018, 3:09 AM
tkrupa marked 4 inline comments as done.

We should probably have a test file for the IR that isn't the fast-isel or schedule test.

Should I write additional ones or replace calls with IR in upgrade files?

It would need to be additional tests. We need the -upgrade.ll tests to test the AutoUpgrade functionality so they need to keep the old calls.

lib/Target/X86/X86ISelLowering.cpp
36017

You probably still need a power of 2 check, and a minimum element count check. I don't think SplitOpsAndApply can handle, say, a 384-bit vector on AVX2. It will try to split it into 256-bit pieces. But that's a great test case to add.

Oops, that should have said "SplitOpsAndApply CAN'T handle".

tkrupa updated this revision to Diff 141114.Apr 5 2018, 1:32 AM
tkrupa marked an inline comment as done.

Now there are IR tests in the regular test files and the previous tests in 'upgrade' test files.
I also added some 512-bit AVX2 tests and 1024-bit AVX512BW tests to ensure the vectors get split properly.
The isPowerOf2_32 check is restored.

Move the new IR tests to a new test file. vector-arith-sat.ll or something like that. Intrinsic tests should only call intrinsic functions.

lib/Target/X86/X86ISelLowering.cpp
36007

Correct indentation

What happens when one of the inputs is a constant (e.g. saturated increment: _mm_adds_epi8(X, _mm_set1_epi8(1)))?

test/CodeGen/X86/avx2-intrinsics-fast-isel.ll
5

As noted here, you should be using the same IR as is generated in the clang builtins tests updated in D44786

tkrupa marked 2 inline comments as done.Apr 11 2018, 6:47 AM
tkrupa updated this revision to Diff 142011.Apr 11 2018, 7:52 AM

I moved the IR tests to a new file, corrected fast-isel tests and improved pattern recognition in X86ISelLowering to also detect situations where one input vector is composed of constants.

Do we have test cases for types less than 128 bits to make sure we don't convert them?

lib/Target/X86/X86ISelLowering.cpp
36054

Use SDValue instead of auto. LLVM tends to be conservative with use of auto.

test/CodeGen/X86/vector-arith-sat.ll
11

Add a blank line here.

tkrupa updated this revision to Diff 142181.Apr 12 2018, 8:01 AM
tkrupa marked 2 inline comments as done.

Added some tests for 64-bit vectors.

This revision is now accepted and ready to land.Apr 12 2018, 9:57 AM
This revision was automatically updated to reflect the committed changes.

Heads up.

We are seeing some failures in JIT'ed code due to this revision. The
symptom is a jump to zero with a corrupt stack, or at least a really wacky
one. The reproduction steps are sufficiently complicated that I don't have
a good explanation for why at the moment, but I wanted to get this out there.

How do I reproduce it? Standard LLVM and Clang tests are passing for me.

The simplest example I have of what's breaking is the llvmpipe test code; src/gallium/drivers/llvmpipe/lp_test_blend.c from https://cgit.freedesktop.org/mesa/mesa/. Both 17.0.3 and 18.0.0 break.

llvmpipe was explicitly issuing sse2.psubs and sse2.padds in src/gallium/auxiliary/gallivm/lp_bld_arit.c; that produced working code before this patch, but afterwards it crashes immediately. Removing the emission of the sse2 padds/psubs makes our test case work again.

Debug output looks like this:

llc -mattr option(s): +sse2,+cx16,+sahf,-tbm,-avx512ifma,-sha,-gfni,-fma4,-vpclmulqdq,-prfchw,+bmi2,-cldemote,+fsgsbase,-xsavec,+popcnt,+aes,-avx512bitalg,-xsaves,-avx512er,-avx512vnni,-avx512vpopcntdq,-clwb,-avx512f,-clzero,-pku,+mmx,-lwp,-rdpid,-xop,-rdseed,-ibt,-sse4a,-avx512bw,-clflushopt,+xsave,-avx512vbmi2,-avx512vl,-avx512cd,+avx,-vaes,-rtm,+fma,+bmi,+rdrnd,-mwaitx,+sse4.1,+sse4.2,+avx2,-wbnoinvd,+sse,+lzcnt,+pclmul,-prefetchwt1,+f16c,+ssse3,-sgx,-shstk,+cmov,-avx512vbmi,+movbe,+xsaveopt,-avx512dq,-adx,-avx512pf,+sse3
llc -mcpu option: haswell

test:

  0:         pushq   %rbp
  1:         movq    %rsp, %rbp
  4:         pushq   %rbx
  5:         subq    $40, %rsp
  9:         movq    %r8, %rbx
 12:         vmovdqa (%rdi), %xmm0
 16:         vmovdqa (%rdx), %xmm1
 20:         movabsq $140737174016000, %rax
 30:         vpand   (%rax), %xmm0, %xmm2
 34:         vpsrld  $8, %xmm2, %xmm3
 39:         vpor    %xmm2, %xmm3, %xmm2
 43:         vpcmpeqd        %xmm3, %xmm3, %xmm3
 47:         vpxor   %xmm3, %xmm0, %xmm3
 51:         movabsq $140737174016032, %rax
 61:         vpbroadcastd    (%rax), %xmm4
 66:         vmovdqa %xmm4, -48(%rbp)
 71:         vpblendvb       %xmm4, (%rsi), %xmm3, %xmm3
 77:         vpsrld  $16, %xmm2, %xmm4
 82:         vpor    %xmm2, %xmm4, %xmm2
 86:         vpmovzxbw       %xmm0, %xmm4
 91:         vpxor   %xmm5, %xmm5, %xmm5
 95:         vpunpckhbw      %xmm5, %xmm0, %xmm0
 99:         vpmovzxbw       %xmm2, %xmm6
104:         vpmullw %xmm4, %xmm6, %xmm4
108:         vpunpckhbw      %xmm5, %xmm2, %xmm2
112:         vpmullw %xmm0, %xmm2, %xmm0
116:         vpsrlw  $8, %xmm4, %xmm2
121:         movabsq $140737174016016, %rax
131:         vmovdqa (%rax), %xmm6
135:         vpaddw  %xmm6, %xmm4, %xmm4
139:         vpaddw  %xmm4, %xmm2, %xmm2
143:         vpsrlw  $8, %xmm2, %xmm2
148:         vpsrlw  $8, %xmm0, %xmm4
153:         vpaddw  %xmm6, %xmm0, %xmm0
157:         vpaddw  %xmm0, %xmm4, %xmm0
161:         vpsrlw  $8, %xmm0, %xmm0
166:         vpackuswb       %xmm0, %xmm2, %xmm0
170:         vpmovzxbw       %xmm1, %xmm2
175:         vpunpckhbw      %xmm5, %xmm1, %xmm1
179:         vpmovzxbw       %xmm3, %xmm4
184:         vpmullw %xmm4, %xmm2, %xmm2
188:         vpunpckhbw      %xmm5, %xmm3, %xmm3
192:         vpmullw %xmm3, %xmm1, %xmm1
196:         vpsrlw  $8, %xmm2, %xmm3
201:         vpaddw  %xmm6, %xmm2, %xmm2
205:         vpaddw  %xmm2, %xmm3, %xmm2
209:         vpsrlw  $8, %xmm2, %xmm2
214:         vpsrlw  $8, %xmm1, %xmm3
219:         vpaddw  %xmm6, %xmm1, %xmm1
223:         vpaddw  %xmm1, %xmm3, %xmm1
227:         vpsrlw  $8, %xmm1, %xmm1
232:         vpackuswb       %xmm1, %xmm2, %xmm1
236:         vpminub %xmm1, %xmm0, %xmm2
240:         vmovdqa %xmm2, -32(%rbp)
245:         movabsq $0, %rax
255:         callq   *%rax
257:         vmovdqa -48(%rbp), %xmm1
262:         vpblendvb       %xmm1, -32(%rbp), %xmm0, %xmm0
269:         vmovdqa %xmm0, (%rbx)
273:         addq    $40, %rsp
277:         popq    %rbx
278:         popq    %rbp
279:         retq

After we return from the callq, we seem to have a corrupt stack.

Is the address of the call here supposed to be 0?

245:         movabsq $0, %rax
255:         callq   *%rax

~Craig

Can you attach the .ll file for llc?

~Craig

I've been trying to see if I can spot anything that would explain the problem here, but I don't see anything so far. I did, however, find some other things I missed during the original review.

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
36006 ↗(On Diff #143074)

Is this isSimple check needed? It limits the maximum vector width we can recognize to 2048 bits, but if someone ever changes the largest vXi16/vXi8 type in MachineValueType.h due to some other target in the future this would change the behavior of this code.

36056 ↗(On Diff #143074)

If the middle end can determine the sign bit of the input to the extend to be 0, it might replace the sext with a zext. Should we be using computeNumSignBits/MaskedValueIsZero here instead of checking specific opcodes?

36065 ↗(On Diff #143074)

Don't we need to verify the constant has the right number of sign bits or zero bits?

Does anyone have the IR that llvmpipe is creating?

RKSimon, is this what you're asking for?

; Function Attrs: nounwind readnone
declare <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16>, <8 x i16>) #0

; Function Attrs: nounwind
declare <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8>, <16 x i8>) #1

define void @test(<16 x i8>*, <16 x i8>*, <16 x i8>*, <16 x i8>*, <16 x i8>*) {
entry:

%src = load <16 x i8>, <16 x i8>* %0
%src1 = load <16 x i8>, <16 x i8>* %1
%dst = load <16 x i8>, <16 x i8>* %2
%const = load <16 x i8>, <16 x i8>* %3
%5 = and <16 x i8> %src, <i8 0, i8 0, i8 0, i8 -1, i8 0, i8 0, i8 0, i8 -1, i8 0, i8 0, i8 0, i8 -1, i8 0, i8 0, i8 0, i8 -1>
%6 = bitcast <16 x i8> %5 to <4 x i32>
%7 = lshr <4 x i32> %6, <i32 8, i32 8, i32 8, i32 8>
%8 = or <4 x i32> %6, %7
%9 = lshr <4 x i32> %8, <i32 16, i32 16, i32 16, i32 16>
%10 = or <4 x i32> %8, %9
%11 = bitcast <4 x i32> %10 to <16 x i8>
%12 = xor <16 x i8> %src, <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
%13 = select <16 x i1> <i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true>, <16 x i8> %12, <16 x i8> %src1
%14 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%15 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%16 = bitcast <16 x i8> %14 to <8 x i16>
%17 = bitcast <16 x i8> %15 to <8 x i16>
%18 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%19 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%20 = bitcast <16 x i8> %18 to <8 x i16>
%21 = bitcast <16 x i8> %19 to <8 x i16>
%22 = mul <8 x i16> %16, %20
%23 = lshr <8 x i16> %22, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%24 = add <8 x i16> %22, %23
%25 = add <8 x i16> %24, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%26 = lshr <8 x i16> %25, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%27 = mul <8 x i16> %17, %21
%28 = lshr <8 x i16> %27, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%29 = add <8 x i16> %27, %28
%30 = add <8 x i16> %29, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%31 = lshr <8 x i16> %30, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%32 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %26, <8 x i16> %31)
%33 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%34 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%35 = bitcast <16 x i8> %33 to <8 x i16>
%36 = bitcast <16 x i8> %34 to <8 x i16>
%37 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%38 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%39 = bitcast <16 x i8> %37 to <8 x i16>
%40 = bitcast <16 x i8> %38 to <8 x i16>
%41 = mul <8 x i16> %35, %39
%42 = lshr <8 x i16> %41, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%43 = add <8 x i16> %41, %42
%44 = add <8 x i16> %43, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%45 = lshr <8 x i16> %44, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%46 = mul <8 x i16> %36, %40
%47 = lshr <8 x i16> %46, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%48 = add <8 x i16> %46, %47
%49 = add <8 x i16> %48, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%50 = lshr <8 x i16> %49, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%51 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %45, <8 x i16> %50)
%52 = icmp ult <16 x i8> %32, %51
%53 = sext <16 x i1> %52 to <16 x i8>
%54 = trunc <16 x i8> %53 to <16 x i1>
%55 = select <16 x i1> %54, <16 x i8> %32, <16 x i8> %51
%56 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%57 = shufflevector <16 x i8> %src, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%58 = bitcast <16 x i8> %56 to <8 x i16>
%59 = bitcast <16 x i8> %57 to <8 x i16>
%60 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%61 = shufflevector <16 x i8> %11, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%62 = bitcast <16 x i8> %60 to <8 x i16>
%63 = bitcast <16 x i8> %61 to <8 x i16>
%64 = mul <8 x i16> %58, %62
%65 = lshr <8 x i16> %64, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%66 = add <8 x i16> %64, %65
%67 = add <8 x i16> %66, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%68 = lshr <8 x i16> %67, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%69 = mul <8 x i16> %59, %63
%70 = lshr <8 x i16> %69, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%71 = add <8 x i16> %69, %70
%72 = add <8 x i16> %71, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%73 = lshr <8 x i16> %72, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%74 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %68, <8 x i16> %73)
%75 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%76 = shufflevector <16 x i8> %dst, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%77 = bitcast <16 x i8> %75 to <8 x i16>
%78 = bitcast <16 x i8> %76 to <8 x i16>
%79 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
%80 = shufflevector <16 x i8> %13, <16 x i8> zeroinitializer, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
%81 = bitcast <16 x i8> %79 to <8 x i16>
%82 = bitcast <16 x i8> %80 to <8 x i16>
%83 = mul <8 x i16> %77, %81
%84 = lshr <8 x i16> %83, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%85 = add <8 x i16> %83, %84
%86 = add <8 x i16> %85, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%87 = lshr <8 x i16> %86, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%88 = mul <8 x i16> %78, %82
%89 = lshr <8 x i16> %88, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%90 = add <8 x i16> %88, %89
%91 = add <8 x i16> %90, <i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128, i16 128>
%92 = lshr <8 x i16> %91, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%93 = call <16 x i8> @llvm.x86.sse2.packuswb.128(<8 x i16> %87, <8 x i16> %92)
%94 = call <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8> %74, <16 x i8> %93)
%res = select <16 x i1> <i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true>, <16 x i8> %94, <16 x i8> %55
store <16 x i8> %res, <16 x i8>* %4
ret void

}

Simon or Tomasz, have you had any luck reproducing this?

~Craig

Is the IR being generated on the fly and then fed to the JIT? Which means
it doesn't go through the autoupgrade code since that is only done by the
bitcode reader and the ll parser? If that's the case, you'll need to
generate the replacement sequence directly instead of using the intrinsic.

We still need to fix the issues that I raised in the review though.

~Craig

Simon or Tomasz, have you had any luck reproducing this?

No luck - I think you're right that it's something to do with their pipeline not including the autoupgrade stage.

tkrupa reopened this revision.Apr 24 2018, 11:59 PM
tkrupa marked an inline comment as done.
This revision is now accepted and ready to land.Apr 24 2018, 11:59 PM
tkrupa updated this revision to Diff 143865.Apr 25 2018, 12:02 AM

No luck either, Craig.
I fixed the issues you pointed out.

tkrupa marked 2 inline comments as done.Apr 25 2018, 12:03 AM

Reported to llvmpipe at https://bugs.freedesktop.org/show_bug.cgi?id=106231; if I've misunderstood any of your discussion here, please add corrections so they can see how they're using the intrinsics.

Tomasz, can you open a new review for the changes?

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
36074 ↗(On Diff #143074)

Don't we need to make sure the input type to extends is the same as VT?

Is the IR being generated on the fly and then fed to the JIT? Which means
it doesn't go through the autoupgrade code since that is only done by the
bitcode reader and the ll parser? If that's the case, you'll need to
generate the replacement sequence directly instead of using the intrinsic.

FWIW this is correct: it's using JIT, so there's no autoupgrade, and we'll have to fix this in mesa.
However, in the past when intrinsics disappeared, we actually got an error when compiling the IR. See https://bugs.llvm.org/show_bug.cgi?id=28176 for instance: LLVM ERROR: Program used external function '_llvm.x86.sse2.pminu.b' which could not be resolved!
That, of course, made it a lot more obvious what was going on; getting no error and just a call to address 0 in the generated code is really not helpful.

Tomasz, can you open a new review for the changes?

Should I upload the whole patch there or just the changes to what is already in trunk?

Just the differences with what's already in trunk.

This is actually crashing normal compiles. Reverting for now. See http://llvm.org/PR37260 for details and test case.

I made a new revision, D46179. I uploaded the whole thing there because of the reversion.

tkrupa closed this revision.Apr 27 2018, 1:47 AM