Page MenuHomePhabricator

[X86][RFC] Enable `_Float16` type support on X86 following the psABI
ClosedPublic

Authored by pengfei on Jul 29 2021, 8:18 AM.

Details

Summary

GCC and Clang/LLVM will support _Float16 on X86 in C/C++, following
the latest X86 psABI. (https://gitlab.com/x86-psABIs)

_Float16 arithmetic will be performed using native half-precision. If
native arithmetic instructions are not available, it will be performed
at a higher precision (currently always float) and then truncated down
to _Float16 immediately after each single arithmetic operation.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
LuoYuanke added inline comments.Jun 8 2022, 2:00 AM
llvm/lib/Target/X86/X86InstrAVX512.td
4107

Previous prd only apply to "def rr"? Is it a bug for previous code?

4352

Why previous code don't have predicates?

11657

Why set AddedComplexity to -10? There no such addtional complexity in previous code. Add comments for it?

llvm/lib/Target/X86/X86InstrSSE.td
3970

Why AddedComplexity = -10? Add comments for it?

3978

Miss pattern for store?

5215

Why no AddedComplexity for it?

llvm/lib/Target/X86/X86RegisterInfo.td
540

When there is avx512fp16 feature, is the spill size still 32?

pengfei updated this revision to Diff 435151.Jun 8 2022, 7:15 AM
pengfei marked 3 inline comments as done.

Address Yuanke's comments. Thanks!

llvm/lib/Target/X86/X86ISelLowering.cpp
616

Yeah, we can use LibCall instead.

763

It's in the scope of if (UseX87). And we need to lower fpext half %0 to x86_fp80.

22474

Return SDValue() will extent later. This can save the code.

22548

Yes.

22790

No. We use MVT::v8i16 when we enabled F16C instructions.

22800

I don't think we have pattern to extract f16 from v8i16. Besides, I think keeping the bitcast makes the flow clear.

llvm/lib/Target/X86/X86InstrAVX512.td
1476

Good catch. Added in X86InstrSSE.td

4107

No. previous code works well because no mask variants before AVX512 and no f16 before FP16. The latter is not true now.

4352

Because no legal f16 previously.

11657

We used it before, but very little. We need to make sure select FP16 instructions first if available.

llvm/lib/Target/X86/X86InstrSSE.td
3970

This is to avoid FP16 instructions been overridden.

3978

It's in line 5214.

5215

We don't need it if no BWI.

llvm/lib/Target/X86/X86RegisterInfo.td
540

Yes, it's more efficient to use movss that insert/extrct. And we also use FR32X for AVX512 targets without FP16.

LuoYuanke added inline comments.Jun 8 2022, 7:33 AM
llvm/test/Analysis/CostModel/X86/fptoi_sat.ll
852

It seems the cost is reduced in general. Is it because we pass/return f16 by xmm register?

llvm/test/CodeGen/MIR/X86/inline-asm-registers.mir
31–32

Why f16 patch affect this test case? There is no fp instruction in this test case.

llvm/test/CodeGen/X86/atomic-non-integer.ll
253

I notice X86-SSE1 return by GPR. Should we also return by GPR for X64-SSE?

llvm/test/CodeGen/X86/avx512-insert-extract.ll
2315

Is code less efficient than previous code? Why previous code still works without convert half to float?

llvm/test/CodeGen/X86/avx512-masked_memop-16-8.ll
156

It seems parameter %val is useless.

llvm/test/CodeGen/X86/callbr-asm-bb-exports.ll
20

Why this test is affacted? Is it caused by calling convention change?

llvm/test/CodeGen/X86/fmf-flags.ll
115

Does __gnu_h2f_ieee retrun from xmm?

LuoYuanke added inline comments.Jun 8 2022, 11:51 PM
llvm/test/CodeGen/X86/fpclamptosat.ll
569–570

I'm curious why there is 1 more compare in this patch.

776

Ditto.

llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Is the vector <4 x half> split to 4 scalar and pass by xmm? What's the ABI for vector half? Is there any case that test the scenario that run out of register and pass parameter through stack?

llvm/test/CodeGen/X86/fptosi-sat-scalar.ll
2139

It seems less efficient than previous code on NAN, zero handling, but we can improve later.

llvm/test/CodeGen/X86/half.ll
934

Why the x87 instruction is generated?

pengfei updated this revision to Diff 435583.Jun 9 2022, 9:03 AM
pengfei marked an inline comment as done.

Address Yuanke's comments.

llvm/test/Analysis/CostModel/X86/fptoi_sat.ll
852

No. It's because we don't have cost model for f16. I added some in D127386 to address this.

llvm/test/CodeGen/MIR/X86/inline-asm-registers.mir
31–32

I this it's newly added FR16 that affects all number the other number register class. We met the problem when enabling FP16 too.

llvm/test/CodeGen/X86/atomic-non-integer.ll
253

No. The result in X86-SSE in UB. We support the emulation on SSE2 and later.

llvm/test/CodeGen/X86/avx512-insert-extract.ll
2315

Yes. The previous code using i16 for FP16. Improved, thanks!

llvm/test/CodeGen/X86/callbr-asm-bb-exports.ll
20

No. It's caused by newly added FR16 register class.

llvm/test/CodeGen/X86/fmf-flags.ll
115

There does not exist a __gnu_h2f_ieee on X86 before. It's ARM/AArch64 specific.

llvm/test/CodeGen/X86/fpclamptosat.ll
569–570

It's an optimization implemented by D111976. We don't meet the requirment that isOperationLegalOrCustom. It's not easy to solve because we need to check the promoted type instead. I'll leave it as is.

776

The same as above.

llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Good question! Previously, I discussed with GCC folks we won't support vector in emulation. I expected the FE with pass whole vector through stack. So a vector in IR is illegal to ABI and can be splited.
But seems GCC passes it by vector register. https://godbolt.org/z/a67rMhTW6
I'll double confirm with GCC folks.

llvm/test/CodeGen/X86/fptosi-sat-scalar.ll
2139

Yes. Added FIXMEs.

llvm/test/CodeGen/X86/half.ll
934

On 32 bit, float and double are passed by x87 register.

pengfei added inline comments.Jun 10 2022, 9:34 AM
llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Discussed with GCC folks today. We should support the vector ABI. But we have to adding more patterns to support load/store etc. operations for vector type. I'd like to address this as a follow up.

LuoYuanke accepted this revision.Jun 10 2022, 10:17 PM

LGTM, thanks.

This revision is now accepted and ready to land.Jun 10 2022, 10:17 PM
This revision was landed with ongoing or failed builds.Jun 11 2022, 8:40 PM
This revision was automatically updated to reflect the committed changes.
mehdi_amini added a subscriber: mehdi_amini.EditedJun 12 2022, 8:20 AM

This broke the bot here: https://lab.llvm.org/buildbot/#/builders/61/builds/27616

The cmake invocation includes some GPU specific options that you can omit (-DMLIR_ENABLE_CUDA_RUNNER=1 , -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc, -DMLIR_ENABLE_VULKAN_RUNNER=1, -DMLIR_RUN_CUDA_TENSOR_CORE_TESTS=ON), which should leave out:

cmake ../llvm.src/llvm -DLLVM_BUILD_EXAMPLES=ON '-DLLVM_TARGETS_TO_BUILD=host;NVPTX' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON  -DBUILD_SHARED_LIBS=ON -DLLVM_CCACHE_BUILD=ON -DMLIR_ENABLE_BINDINGS_PYTHON=ON  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

You can probably leave out other options too:

cmake ../llvm.src/llvm '-DLLVM_TARGETS_TO_BUILD=host' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

This broke the bot here: https://lab.llvm.org/buildbot/#/builders/61/builds/27616

The cmake invocation includes some GPU specific options that you can omit (-DMLIR_ENABLE_CUDA_RUNNER=1 , -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc, -DMLIR_ENABLE_VULKAN_RUNNER=1, -DMLIR_RUN_CUDA_TENSOR_CORE_TESTS=ON), which should leave out:

cmake ../llvm.src/llvm -DLLVM_BUILD_EXAMPLES=ON '-DLLVM_TARGETS_TO_BUILD=host;NVPTX' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON  -DBUILD_SHARED_LIBS=ON -DLLVM_CCACHE_BUILD=ON -DMLIR_ENABLE_BINDINGS_PYTHON=ON  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

You can probably leave out other options too:

cmake ../llvm.src/llvm '-DLLVM_TARGETS_TO_BUILD=host' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

@mehdi_amini Thanks for the commands, I can reproduce it on my local now. Will look into it.

pengfei added inline comments.Jun 16 2022, 9:25 AM
llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Addressed by D127982.

MaskRay added a subscriber: MaskRay.EditedJun 24 2022, 4:38 PM

Please include Differential Revision: line for reland commits as well so that people know that this patch has a reland.
Please include the full description, not just what has changed from the previous commit.

https://github.com/llvm/llvm-project/issues/56204 is related to 655ba9c8a1d22075443711cc749f0b032e07adee

In addition, don't use Reland "Reland "Reland "Reland ... One Reland is sufficient.

I'll take care next time. Thanks @MaskRay !

Same as in https://reviews.llvm.org/D114099
It breaks the build on ubuntu bionic, Hirsute, etc on amd64:

"/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/build-llvm/./bin/clang" --target=x86_64-pc-linux-gnu -DVISIBILITY_HIDDEN  -fstack-protector-strong -Wformat -Werror=format-security -Wno-unused-command-line-argument -Wdate-time -D_FORTIFY_SOURCE=2 -O3 -DNDEBUG -m32 -DCOMPILER_RT_HAS_FLOAT16 -std=c11 -fPIC -fno-builtin -fvisibility=hidden -fomit-frame-pointer -MD -MT CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -MF CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o.d -o CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -c '/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c'
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c:11:
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend_impl.inc:38:
/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend.h:44:9: error: _Float16 is not supported on this target
typedef _Float16 src_t;
        ^
1 error generated.

Same as in https://reviews.llvm.org/D114099
It breaks the build on ubuntu bionic, Hirsute, etc on amd64:

"/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/build-llvm/./bin/clang" --target=x86_64-pc-linux-gnu -DVISIBILITY_HIDDEN  -fstack-protector-strong -Wformat -Werror=format-security -Wno-unused-command-line-argument -Wdate-time -D_FORTIFY_SOURCE=2 -O3 -DNDEBUG -m32 -DCOMPILER_RT_HAS_FLOAT16 -std=c11 -fPIC -fno-builtin -fvisibility=hidden -fomit-frame-pointer -MD -MT CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -MF CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o.d -o CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -c '/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c'
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c:11:
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend_impl.inc:38:
/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend.h:44:9: error: _Float16 is not supported on this target
typedef _Float16 src_t;
        ^
1 error generated.

Hi @sylvestre.ledru , thanks for reporting this issue.

It looks to me a configuration (or option mismatch) problem in compiler-rt. We support the _Float16 type on targets that have SSE2 and/or up features. A 32-bit target doesn't enable SSE2 feature by default. This should be fine because the cmake of compiler-rt will detect the buildable of _Float16 first and set COMPILER_RT_HAS_FLOAT16 accordingly. So this issue looks to me it passed the detection of _Float16 with a SSE2 enabled option but built the compiler-rt with a different option (SSE2 disabled).

I'd suggest to add an extra -msse2 when build it if possible. Otherwise, don't let -DCOMPILER_RT_HAS_FLOAT16 been passed here.

@pengfei I am not convinced it is an issue on my side. I don't have anything particular in this area and using a stage2 build system.

Anyway, this patch fixes the issue on my side:
https://salsa.debian.org/pkg-llvm-team/llvm-toolchain/-/blob/snapshot/debian/patches/force-sse2-compiler-rt.diff

@pengfei I am not convinced it is an issue on my side. I don't have anything particular in this area and using a stage2 build system.

Anyway, this patch fixes the issue on my side:
https://salsa.debian.org/pkg-llvm-team/llvm-toolchain/-/blob/snapshot/debian/patches/force-sse2-compiler-rt.diff

I don't have much experience in compiler-rt and multi stage build. So I may be wrong. It looks to me like an existing problem just exposed by this patch. The diff is another proof.
The build command tells us it's a 32-bit build. But the change for x86_64 solves it, which confirms my previous guess: You are using one configure for CMake (probobally 64 bit) but build for 32 bit target.
Although the diff works, it doesn't look a clean solution to me. But I don't have better suggestion either.

Hi @pengfei, I am working on flang, and after this patch, we started to see some bugs in Fortran programs using REAL(2) (which is fp16 in flang). I am not an expert in LLVM codegen and the builtins, but I am wondering if there is not issue with how llvm codegen thinks __truncsfhf2 returns its value and how the runtime actually does return it.

Here is an llvm IR reproducer for a bug we saw:

define void @bug(ptr %addr, i32 %i) {
  %1 = sitofp i32 %i to half
  store half %1, ptr %addr, align 2
  ret void
}

After this patch the generated assembly on X86 is:

bug:                                    # @bug
        push    rbx
        mov     rbx, rdi
        cvtsi2ss        xmm0, esi
        call    __truncsfhf2@PLT
        pextrw  eax, xmm0, 0
        mov     word ptr [rbx], ax
        pop     rbx
        ret

When running this from a C program to test integers are casted to floats, I am only seeing the bytes of the passed address being set to zero (regardless of the input). It seems to me that there is an issue around the __truncsfhf2 interface. The pextrw eax, xmm0, 0 after the call seems to suggest LLVM codegen is looking for the result in xmm0 register, but it seems that __truncsfhf2 is only returning it in eax.

Do you have any idea what could be the issue ?

Hi @jeanPerier , yes, you are right. This patch changes the calling conversion of fp16 from GPRs to XMMs. So you need to update the runtime. If you are using compiler-rt, you could simply re-build it with trunk code, or at least after rGabeeae57. If you are using your own runtime, you can solve the problem through the way in https://github.com/llvm/llvm-project/issues/56156

Hi @jeanPerier , yes, you are right. This patch changes the calling conversion of fp16 from GPRs to XMMs. So you need to update the runtime. If you are using compiler-rt, you could simply re-build it with trunk code, or at least after rGabeeae57. If you are using your own runtime, you can solve the problem through the way in https://github.com/llvm/llvm-project/issues/56156

Thanks for the quick reply. I was using a compiler-rt from the trunk source but not building it with a clang compiler compiled from the trunk. I did not know the version of clang used to compiled compiler-rt mattered that much. Using clang from the trunk (or at least after the commit you mentionnned) solved my problem. Thanks !

Thanks for confirming it! I don't have much experience in compiler-rt. But I think the version of clang matters much to compiler-rt particular in ABI changing cases like this :)

@pengfei We are also hitting the following assertion with this patch. Do you have any idea why?

/llvm-project/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:4333: void {anonymous}::SelectionDAGLegalize::ConvertNodeToLibcall(llvm::SDNode*): Assertion `cast<ConstantSDNode>(Node->getOperand(IsStrict ? 2 : 1))->isZero() && "Unable to expand as libcall if it is not normal rounding"' failed.

LLVM IR triggering the assertion.

; ModuleID = 'FIRModule'
source_filename = "FIRModule"
target triple = "x86_64-unknown-linux-gnu"

@_QMhp237Ea11 = global half 0xH3D00
@_QMhp237Eb1a = global half 0xH5640
@_QMf90_kindECascii = external constant i32
@_QMf90_kindECbyte = external constant i32
@_QMf90_kindECdouble = external constant i32
@_QMiso_fortran_envECint16 = external constant i32
@_QMiso_fortran_envECint32 = external constant i32
@_QMiso_fortran_envECint64 = external constant i32
@_QMiso_fortran_envECint8 = external constant i32
@_QMf90_kindECjis = external constant i32
@_QMiso_fortran_envEClogical16 = external constant i32
@_QMiso_fortran_envEClogical32 = external constant i32
@_QMiso_fortran_envEClogical64 = external constant i32
@_QMiso_fortran_envEClogical8 = external constant i32
@_QMf90_kindECnot_available = external constant i32
@_QMf90_kindECquad = external constant i32
@_QMiso_fortran_envECreal128 = external constant i32
@_QMf90_kindECreal16 = external constant i32
@_QMiso_fortran_envECreal32 = external constant i32
@_QMiso_fortran_envECreal64 = external constant i32
@_QMf90_kindECreal64x2 = external constant i32
@_QMf90_kindECsingle = external constant i32
@_QMf90_kindECtwobyte = external constant i32
@_QMf90_kindECucs2 = external constant i32
@_QMf90_kindECucs4 = external constant i32
@_QMf90_kindECword = external constant i32
@_QQcl.2E2F627567312E66393000 = linkonce constant [11 x i8] c"./bug1.f90\00"
@_QQcl.2831362C313629 = linkonce constant [7 x i8] c"(16,16)"
@_QQcl.28346631302E3329 = linkonce constant [8 x i8] c"(4f10.3)"

declare ptr @malloc(i64)

declare void @free(ptr)

define void @_QQmain() !dbg !3 {
  %1 = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8, !dbg !7
  %2 = alloca half, i64 1, align 2, !dbg !9
  %3 = call ptr @_FortranAioBeginExternalListOutput(i32 -1, ptr @_QQcl.2E2F627567312E66393000, i32 9), !dbg !10
  %4 = call i1 @_FortranAioOutputAscii(ptr %3, ptr @_QQcl.2831362C313629, i64 7), !dbg !11
  %5 = call i32 @_FortranAioEndIoStatement(ptr %3), !dbg !12
  %6 = call ptr @_FortranAioBeginExternalFormattedOutput(ptr @_QQcl.28346631302E3329, i64 8, i32 -1, ptr @_QQcl.2E2F627567312E66393000, i32 10), !dbg !13
  %7 = load half, ptr @_QMhp237Ea11, align 2, !dbg !14
  %8 = load half, ptr @_QMhp237Eb1a, align 2, !dbg !15
  %9 = fpext half %7 to float, !dbg !16
  %10 = fpext half %8 to float, !dbg !17
  %11 = call float @llvm.copysign.f32(float %9, float %10), !dbg !18
  %12 = fptrunc float %11 to half, !dbg !19
  store half %12, ptr %2, align 2, !dbg !20
  %13 = insertvalue { ptr, i64, i32, i8, i8, i8, i8 } { ptr undef, i64 2, i32 20180515, i8 0, i8 25, i8 0, i8 0 }, ptr %2, 0, !dbg !7
  store { ptr, i64, i32, i8, i8, i8, i8 } %13, ptr %1, align 8, !dbg !7
  %14 = call i1 @_FortranAioOutputDescriptor(ptr %6, ptr %1), !dbg !21
  %15 = call i32 @_FortranAioEndIoStatement(ptr %6), !dbg !22
  ret void, !dbg !23
}

declare ptr @_FortranAioBeginExternalListOutput(i32, ptr, i32)

declare i1 @_FortranAioOutputAscii(ptr, ptr, i64)

declare i32 @_FortranAioEndIoStatement(ptr)

declare ptr @_FortranAioBeginExternalFormattedOutput(ptr, i64, i32, ptr, i32)

declare i1 @_FortranAioOutputDescriptor(ptr, ptr)

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.copysign.f32(float, float) #0

attributes #0 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!2}

!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "mlir", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
!1 = !DIFile(filename: "FIRModule", directory: "/")
!2 = !{i32 2, !"Debug Info Version", i32 3}
!3 = distinct !DISubprogram(name: "_QQmain", linkageName: "_QQmain", scope: null, file: !4, line: 9, type: !5, scopeLine: 9, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !6)
!4 = !DIFile(filename: "<stdin>", directory: "/local/home/vclement/llvm-project/build")
!5 = !DISubroutineType(types: !6)
!6 = !{}
!7 = !DILocation(line: 39, column: 9, scope: !8)
!8 = !DILexicalBlockFile(scope: !3, file: !4, discriminator: 0)
!9 = !DILocation(line: 10, column: 8, scope: !8)
!10 = !DILocation(line: 17, column: 8, scope: !8)
!11 = !DILocation(line: 22, column: 8, scope: !8)
!12 = !DILocation(line: 23, column: 9, scope: !8)
!13 = !DILocation(line: 31, column: 9, scope: !8)
!14 = !DILocation(line: 32, column: 9, scope: !8)
!15 = !DILocation(line: 33, column: 9, scope: !8)
!16 = !DILocation(line: 34, column: 9, scope: !8)
!17 = !DILocation(line: 35, column: 9, scope: !8)
!18 = !DILocation(line: 36, column: 9, scope: !8)
!19 = !DILocation(line: 37, column: 9, scope: !8)
!20 = !DILocation(line: 38, column: 3, scope: !8)
!21 = !DILocation(line: 41, column: 9, scope: !8)
!22 = !DILocation(line: 42, column: 9, scope: !8)
!23 = !DILocation(line: 43, column: 3, scope: !8)

Thanks @clementval for reporting it and the reproducer. Put a patch D129294 to address it.