This is follow up of D107082, which enable vector support according to psABI.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| llvm/lib/Target/X86/X86InstrSSE.td | ||
|---|---|---|
| 571 | Why did you move the pattern down? | |
| llvm/lib/Target/X86/X86InstrSSE.td | ||
|---|---|---|
| 635 | Should we use UseSSE2 here? | |
| llvm/test/CodeGen/X86/vec_fp_to_int.ll | ||
|---|---|---|
| 2160 | Why are we removing these? | |
| llvm/test/CodeGen/X86/vec_fp_to_int.ll | ||
|---|---|---|
| 2160 | It has been moved to "llvm/test/CodeGen/X86/vector-half-conversions.ll" | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 1845 | it seems you forgot v32f16's fadd. now you make v32f16 type legal, then v32f16's fadd become legal, but there is no v32f16's fadd's instruction for avx512 without fp16. it will lead to crash. for (unsigned Opc : { ISD::FADD, ISD::FSUB, ISD::FMUL, ISD::FDIV })
  setOperationPromotedToType(Opc, MVT::v32f16, MVT::v32f32); | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 1063 | besides, since we have addps(xmm) for sse2, so we can lower v8f16 into two v4f32, here setting v8f16's fadd as expand will lead to performance issue. | |
hi, we're seeing flaky crashes after this patch and I'm having trouble figuring out what's going wrong
llc -O3 /tmp/xla.ll
diff before/after this patch
3,4c3,4 < .section .rodata,"a",@progbits < .p2align 1 # -- Begin function main.34 --- > .section .rodata.cst16,"aM",@progbits,16 > .p2align 4 # -- Begin function main.34 6a7,17 > .short 0x6056 # half 555 > .short 0x6056 # half 555 > .short 0x6056 # half 555 > .short 0x6056 # half 555 > .short 0x6056 # half 555 > .short 0x6056 # half 555 > .short 0x6056 # half 555 > .section .rodata,"a",@progbits > .p2align 1 > .LCPI0_1: > .short 0x6056 # half 555 15a27,28 > movdqa .LCPI0_0(%rip), %xmm0 # xmm0 = [5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2] > movdqa %xmm0, (%rax) 17,18d29 < movq %rsi, (%rax) < movq %rsi, 8(%rax) 21c32 < pinsrw $0, .LCPI0_0(%rip), %xmm0 --- > pinsrw $0, .LCPI0_1(%rip), %xmm0
full assembly
we had theories that there was something to do with alignment with the movdqa?
Hi @aeubanks, I think it should be an inherent problem in the application and just exposed by this patch. The diff in the assembly is as expected. The problem is the align 16 in below IR:
%fusion = load ptr, ptr %buffer_table, align 8, !invariant.load !0, !dereferenceable !2, !align !1 store half 0xH6056, ptr %fusion, align 16, !alias.scope !3, !noalias !6
which makes codegen to select movdqa, while the flaky crashes turn out %fusion is not always aligned to 16.
The pointers in %buffer_table are known to be always 16-byte aligned, so this shouldn't be a problem. If I run this with llc -mcpu=skx I get
movq (%rcx), %rax # load %fusion into %rax ... vmovdqu %ymm0, 44(%rax) vmovdqa 44(%rax), %xmm0 # crash here
I haven't figured out yet why this happens, but adding 44 to a 16-byte aligned pointer will never be 16-byte aligned.
Hi @pengfei, this patch is causing clang to crash with a "fatal error: error in backend" when compiling some code with -mno-avx2. A (mostly automatically) reduced test case (thanks @joanahalili for preparing it) is below. I imagine it may already be a not perfectly valid C++, but the original code was a quite hairy template beast to be a good example. I hope, this helps fixing the issue soon.
$ cat input.cc typedef long __m128i __attribute__((__vector_size__(16))); struct half { short x; }; __m128i pset1(half from) { short __w7, __w6, __w5, __w4, __w3, __w2; return (__attribute__((__vector_size__(8 * sizeof(short)))) short){ from.x, from.x, __w2, __w3, __w4, __w5, __w6, __w7}; } __attribute__((__vector_size__(8 * sizeof(float)))) float pmul___trans_tmp_8; void g(int *); struct S { S() { g(&n); } virtual void c() { pmul___trans_tmp_8 = __builtin_ia32_vcvtph2ps256(pset1(m_value)); } half m_value; int n; }; void f() { S(); } $ ./clang-good -std=gnu++17 -O3 --target=x86_64--linux-gnu -m64 -march=haswell -maes -mprefer-vector-width=128 -mno-avx2 -c input.cc $ ./clang-good --version clang version trunk (e97b2d413849d3dbc8b49740ce5a07ed0382309c) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: ... $ ./clang -std=gnu++17 -O3 --target=x86_64--linux-gnu -m64 -march=haswell -maes -mprefer-vector-width=128 -c input.cc $ ./clang -std=gnu++17 -O3 --target=x86_64--linux-gnu -m64 -march=haswell -maes -mprefer-vector-width=128 -mno-avx2 -c input.cc fatal error: error in backend: Cannot select: 0x55233f9ad270: v8f16,ch = X86ISD::VBROADCAST_LOAD<(dereferenceable load (s16) from %ir.0 + 8, align 8)> 0x55233f922068, 0x55233f9ad4e0 0x55233f9ad4e0: i64 = add 0x55233f9ad750, Constant:i64<8> 0x55233f9ad750: i64,ch = CopyFromReg 0x55233f922068, Register:i64 %0 0x55233f9ad478: i64 = Register %0 0x55233f9ad680: i64 = Constant<8> In function: _ZN1S1cEv clang: error: clang frontend command failed with exit code 70 (use -v to see invocation) clang version trunk (f18794816270244f9942e9217b96e23a94a7f32c) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: ... clang: note: diagnostic msg: ******************** PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT: Preprocessed source(s) and associated run script(s) are located at: clang: note: diagnostic msg: /tmp/input-188997.cpp clang: note: diagnostic msg: /tmp/input-188997.sh clang: note: diagnostic msg: ********************
besides, since we have addps(xmm) for sse2, so we can lower v8f16 into two v4f32, here setting v8f16's fadd as expand will lead to performance issue.