This is an archive of the discontinued LLVM Phabricator instance.

[X86][FP16] Enable vector support for FP16 emulation
ClosedPublic

Authored by pengfei on Jun 16 2022, 9:24 AM.

Details

Diff Detail

Event Timeline

pengfei created this revision.Jun 16 2022, 9:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2022, 9:24 AM
pengfei requested review of this revision.Jun 16 2022, 9:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2022, 9:24 AM
pengfei updated this revision to Diff 443443.Jul 9 2022, 8:31 AM

Rebase + Fix FP16 <-> INT problem.

skan added inline comments.Jul 13 2022, 12:17 AM
llvm/lib/Target/X86/X86InstrSSE.td
571

Why did you move the pattern down?

skan added inline comments.Jul 13 2022, 12:23 AM
llvm/lib/Target/X86/X86InstrSSE.td
635

Should we use UseSSE2 here?

pengfei updated this revision to Diff 444579.Jul 14 2022, 3:19 AM

Address Shengchen's comments and fix an infinite loop issue.

llvm/lib/Target/X86/X86InstrSSE.td
571

No idea. Should be accident :)

635

I think AVX instruction is always preferred, but yes, UseSSE2 can make sure of that.

RKSimon added inline comments.Jul 14 2022, 3:32 AM
llvm/test/CodeGen/X86/vec_fp_to_int.ll
2160

Why are we removing these?

pengfei added inline comments.Jul 14 2022, 3:59 AM
llvm/test/CodeGen/X86/vec_fp_to_int.ll
2160

It has been moved to "llvm/test/CodeGen/X86/vector-half-conversions.ll"
The reason is it's failed with -mtriple=i686 and -mtriple=i686 -mattr=+sse in this file now. This is expected because the ABI requests targets that at least have SSE2 to support the half type.

skan accepted this revision.Jul 14 2022, 5:04 AM

LGTM

This revision is now accepted and ready to land.Jul 14 2022, 5:04 AM

Thanks Shengchen!

This revision was landed with ongoing or failed builds.Jul 15 2022, 6:55 PM
This revision was automatically updated to reflect the committed changes.
yubing added a subscriber: yubing.Jul 18 2022, 12:37 AM
yubing added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
1858

it seems you forgot v32f16's fadd. now you make v32f16 type legal, then v32f16's fadd become legal, but there is no v32f16's fadd's instruction for avx512 without fp16. it will lead to crash.
i would like to suggest that we can promote v32f16's fadd to v32f32's fadd, which can be split into two v16f32's fadd by legalizeType(after vector legalization). what do you think?

for (unsigned Opc : { ISD::FADD, ISD::FSUB, ISD::FMUL, ISD::FDIV })
  setOperationPromotedToType(Opc, MVT::v32f16, MVT::v32f32);
yubing added inline comments.Jul 18 2022, 2:14 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
1077

besides, since we have addps(xmm) for sse2, so we can lower v8f16 into two v4f32, here setting v8f16's fadd as expand will lead to performance issue.

hi, we're seeing flaky crashes after this patch and I'm having trouble figuring out what's going wrong

llc -O3 /tmp/xla.ll

diff before/after this patch

3,4c3,4
<       .section        .rodata,"a",@progbits
<       .p2align        1                               # -- Begin function main.34
---
>       .section        .rodata.cst16,"aM",@progbits,16
>       .p2align        4                               # -- Begin function main.34
6a7,17
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .section        .rodata,"a",@progbits
>       .p2align        1
> .LCPI0_1:
>       .short  0x6056                          # half 555
15a27,28
>       movdqa  .LCPI0_0(%rip), %xmm0           # xmm0 = [5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2]
>       movdqa  %xmm0, (%rax)
17,18d29
<       movq    %rsi, (%rax)
<       movq    %rsi, 8(%rax)
21c32
<       pinsrw  $0, .LCPI0_0(%rip), %xmm0
---
>       pinsrw  $0, .LCPI0_1(%rip), %xmm0

full assembly


we had theories that there was something to do with alignment with the movdqa?

hi, we're seeing flaky crashes after this patch and I'm having trouble figuring out what's going wrong

llc -O3 /tmp/xla.ll

diff before/after this patch

3,4c3,4
<       .section        .rodata,"a",@progbits
<       .p2align        1                               # -- Begin function main.34
---
>       .section        .rodata.cst16,"aM",@progbits,16
>       .p2align        4                               # -- Begin function main.34
6a7,17
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .short  0x6056                          # half 555
>       .section        .rodata,"a",@progbits
>       .p2align        1
> .LCPI0_1:
>       .short  0x6056                          # half 555
15a27,28
>       movdqa  .LCPI0_0(%rip), %xmm0           # xmm0 = [5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2,5.55E+2]
>       movdqa  %xmm0, (%rax)
17,18d29
<       movq    %rsi, (%rax)
<       movq    %rsi, 8(%rax)
21c32
<       pinsrw  $0, .LCPI0_0(%rip), %xmm0
---
>       pinsrw  $0, .LCPI0_1(%rip), %xmm0

full assembly


we had theories that there was something to do with alignment with the movdqa?

Hi @aeubanks, I think it should be an inherent problem in the application and just exposed by this patch. The diff in the assembly is as expected. The problem is the align 16 in below IR:

%fusion = load ptr, ptr %buffer_table, align 8, !invariant.load !0, !dereferenceable !2, !align !1
store half 0xH6056, ptr %fusion, align 16, !alias.scope !3, !noalias !6

which makes codegen to select movdqa, while the flaky crashes turn out %fusion is not always aligned to 16.

Hi @aeubanks, I think it should be an inherent problem in the application and just exposed by this patch. The diff in the assembly is as expected. The problem is the align 16 in below IR:

%fusion = load ptr, ptr %buffer_table, align 8, !invariant.load !0, !dereferenceable !2, !align !1
store half 0xH6056, ptr %fusion, align 16, !alias.scope !3, !noalias !6

which makes codegen to select movdqa, while the flaky crashes turn out %fusion is not always aligned to 16.

The pointers in %buffer_table are known to be always 16-byte aligned, so this shouldn't be a problem. If I run this with llc -mcpu=skx I get

movq    (%rcx), %rax  # load %fusion into %rax
...
vmovdqu %ymm0, 44(%rax)
vmovdqa 44(%rax), %xmm0  # crash here

I haven't figured out yet why this happens, but adding 44 to a 16-byte aligned pointer will never be 16-byte aligned.

Hi @pengfei, this patch is causing clang to crash with a "fatal error: error in backend" when compiling some code with -mno-avx2. A (mostly automatically) reduced test case (thanks @joanahalili for preparing it) is below. I imagine it may already be a not perfectly valid C++, but the original code was a quite hairy template beast to be a good example. I hope, this helps fixing the issue soon.

$ cat input.cc
typedef long __m128i __attribute__((__vector_size__(16)));
struct half {
  short x;
};
__m128i pset1(half from) {
  short __w7, __w6, __w5, __w4, __w3, __w2;
  return (__attribute__((__vector_size__(8 * sizeof(short)))) short){
          from.x, from.x, __w2, __w3, __w4, __w5, __w6, __w7};
}
__attribute__((__vector_size__(8 * sizeof(float)))) float pmul___trans_tmp_8;
void g(int *);
struct S {
  S() { g(&n); }
  virtual void c() {
    pmul___trans_tmp_8 = __builtin_ia32_vcvtph2ps256(pset1(m_value));
  }
  half m_value;
  int n;
};
void f() { S(); }
$ ./clang-good -std=gnu++17 -O3 --target=x86_64--linux-gnu  -m64 -march=haswell -maes -mprefer-vector-width=128 -mno-avx2 -c input.cc
$ ./clang-good --version
clang version trunk (e97b2d413849d3dbc8b49740ce5a07ed0382309c)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ...
$ ./clang -std=gnu++17 -O3 --target=x86_64--linux-gnu  -m64 -march=haswell -maes -mprefer-vector-width=128 -c input.cc
$ ./clang -std=gnu++17 -O3 --target=x86_64--linux-gnu  -m64 -march=haswell -maes -mprefer-vector-width=128 -mno-avx2 -c input.cc
fatal error: error in backend: Cannot select: 0x55233f9ad270: v8f16,ch = X86ISD::VBROADCAST_LOAD<(dereferenceable load (s16) from %ir.0 + 8, align 8)> 0x55233f922068, 0x55233f9ad4e0
  0x55233f9ad4e0: i64 = add 0x55233f9ad750, Constant:i64<8>
    0x55233f9ad750: i64,ch = CopyFromReg 0x55233f922068, Register:i64 %0
      0x55233f9ad478: i64 = Register %0
    0x55233f9ad680: i64 = Constant<8>
In function: _ZN1S1cEv
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version trunk (f18794816270244f9942e9217b96e23a94a7f32c)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ...
clang: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /tmp/input-188997.cpp
clang: note: diagnostic msg: /tmp/input-188997.sh
clang: note: diagnostic msg:

********************

Hi @pengfei, this patch is causing clang to crash with a "fatal error: error in backend" when compiling some code with -mno-avx2. A (mostly automatically) reduced test case (thanks @joanahalili for preparing it) is below. I imagine it may already be a not perfectly valid C++, but the original code was a quite hairy template beast to be a good example. I hope, this helps fixing the issue soon.

$ cat input.cc
typedef long __m128i __attribute__((__vector_size__(16)));
struct half {
  short x;
};
__m128i pset1(half from) {
  short __w7, __w6, __w5, __w4, __w3, __w2;
  return (__attribute__((__vector_size__(8 * sizeof(short)))) short){
          from.x, from.x, __w2, __w3, __w4, __w5, __w6, __w7};
}
__attribute__((__vector_size__(8 * sizeof(float)))) float pmul___trans_tmp_8;
void g(int *);
struct S {
  S() { g(&n); }
  virtual void c() {
    pmul___trans_tmp_8 = __builtin_ia32_vcvtph2ps256(pset1(m_value));
  }
  half m_value;
  int n;
};
void f() { S(); }
$ ./clang-good -std=gnu++17 -O3 --target=x86_64--linux-gnu  -m64 -march=haswell -maes -mprefer-vector-width=128 -mno-avx2 -c input.cc
$ ./clang-good --version
clang version trunk (e97b2d413849d3dbc8b49740ce5a07ed0382309c)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ...
$ ./clang -std=gnu++17 -O3 --target=x86_64--linux-gnu  -m64 -march=haswell -maes -mprefer-vector-width=128 -c input.cc
$ ./clang -std=gnu++17 -O3 --target=x86_64--linux-gnu  -m64 -march=haswell -maes -mprefer-vector-width=128 -mno-avx2 -c input.cc
fatal error: error in backend: Cannot select: 0x55233f9ad270: v8f16,ch = X86ISD::VBROADCAST_LOAD<(dereferenceable load (s16) from %ir.0 + 8, align 8)> 0x55233f922068, 0x55233f9ad4e0
  0x55233f9ad4e0: i64 = add 0x55233f9ad750, Constant:i64<8>
    0x55233f9ad750: i64,ch = CopyFromReg 0x55233f922068, Register:i64 %0
      0x55233f9ad478: i64 = Register %0
    0x55233f9ad680: i64 = Constant<8>
In function: _ZN1S1cEv
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version trunk (f18794816270244f9942e9217b96e23a94a7f32c)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ...
clang: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /tmp/input-188997.cpp
clang: note: diagnostic msg: /tmp/input-188997.sh
clang: note: diagnostic msg:

********************

Thanks @alexfh ! The test case is good enough. I'll investigate it. Thanks again!