This is an archive of the discontinued LLVM Phabricator instance.

[X86][RFC] Enable `_Float16` type support on X86 following the psABI
ClosedPublic

Authored by pengfei on Jul 29 2021, 8:18 AM.

Details

Summary

GCC and Clang/LLVM will support _Float16 on X86 in C/C++, following
the latest X86 psABI. (https://gitlab.com/x86-psABIs)

_Float16 arithmetic will be performed using native half-precision. If
native arithmetic instructions are not available, it will be performed
at a higher precision (currently always float) and then truncated down
to _Float16 immediately after each single arithmetic operation.

Diff Detail

Event Timeline

pengfei created this revision.Jul 29 2021, 8:18 AM
pengfei requested review of this revision.Jul 29 2021, 8:18 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJul 29 2021, 8:18 AM

I sent out this patch mainly for PoC of the ABI changes, I'll fix the performance regressions in next phase.
LLVM was using a different calling conversion on x86 when passing and returning half type. It conflicts with current X86 psABI.
I have evaluated the risk internally and think the change of ABI has low risk due to Clang doesn't use such calling conversion. But I may not be thoughtful enough. Questions and comments are appreciated.

pengfei added inline comments.Jul 29 2021, 8:38 AM
llvm/include/llvm/IR/RuntimeLibcalls.def
293–294 ↗(On Diff #362792)

GCC12 will provide functions __extendhfsf2 and __truncsfhf2. I wonder if I can change it directly here or do extra customization for ARM/AArch64? Other targets?

craig.topper added a comment.EditedJul 29 2021, 8:52 AM

I haven't had a chance to look at this patch in detail, but I wanted to ask if you considered doing what ARM and RISCV do for this. They pass the f16 in the lower bits on an f32 by only changing the ABI handling code in the backend. The type legalizer takes care of the rest. That seems simpler than this patch. See for example https://reviews.llvm.org/D98670

I haven't had a chance to look at this patch in detail, but I wanted to ask if you considered doing what ARM and RISCV do for this. They pass the f16 in the lower bits on an f32 by only changing the ABI handling code in the backend. The type legalizer takes care of the rest. That seems simpler than this patch. See for example https://reviews.llvm.org/D98670

Thanks Craig for the information. I referenced implementation in AArch64. I think we have to add a legal f16 type in this way because:

  1. We will support _Float16 type in Clang on SSE2 and above to keep the same behavior with GCC. So a legal type is a must.
  2. Using lower 16bits of f32 may not satisfice the requirment from calling conversion of aggregation type and complex type defined by psABI.
  3. We have some optimizations to leverage F16C or AVX512 ps2ph/ph2ps instructions. A legal type is easy to customize.

Besides, we have full arithmatic f16 support in AVX512FP16. Most of the code here are shared and served for both scenarios. We just need to promote for most FP operations and expand or customize FP_ROUND and FP_EXTEND here.

pengfei updated this revision to Diff 362958.Jul 29 2021, 7:37 PM

Remove unused vector f16 definitions.

pengfei updated this revision to Diff 362974.Jul 29 2021, 11:43 PM

Add more conversion tests.

pengfei updated this revision to Diff 363461.Aug 2 2021, 6:02 AM
  1. Reverted several unrelated changes.
  2. Improved conversions to/from f64/f80 etc under f16c.
  3. Added combine to reduce intermediate move instructions.
  4. Refactor for several trivial problems.

After the last refactor, I think this patch is mostly ready.
This patch strips most of the ABI and _Float16 type related code from D105263, which can be leaving with only AVX512-FP16 ISA enabling code.
I think it should be more friendly for review. The defect is we make all FP16 enabling patches depend on and been blocked by this one. So I hope we could have a quick review and land it earlier.

Herald added a project: Restricted Project. · View Herald TranscriptMay 11 2022, 12:52 AM
Herald added a subscriber: StephenFan. · View Herald Transcript
pengfei updated this revision to Diff 430019.May 17 2022, 6:21 AM

Rebased on the avx512fp16 implementation. Still WIP for optimizations and a fast RA issue.

pengfei updated this revision to Diff 430314.May 18 2022, 3:47 AM

Use 32-bit spill slot for half type. Others still on going.

pengfei updated this revision to Diff 430349.May 18 2022, 5:38 AM

Replace gnu_f2h_ieee/gnu_h2f_ieee with truncsfhf2/extendhfsf2 to match with GCC.

pengfei planned changes to this revision.May 18 2022, 5:49 AM
pengfei retitled this revision from [X86][RFC] Enable `_Float16` type support on X86 following the psABI to [X86][RFC][WIP] Enable `_Float16` type support on X86 following the psABI.
pengfei updated this revision to Diff 430897.May 20 2022, 12:39 AM

Adjust libcall lowering according to GCC code generation.

pengfei retitled this revision from [X86][RFC][WIP] Enable `_Float16` type support on X86 following the psABI to [X86][RFC] Enable `_Float16` type support on X86 following the psABI.May 20 2022, 8:31 AM
pengfei updated this revision to Diff 430985.May 20 2022, 8:31 AM
pengfei retitled this revision from [X86][RFC] Enable `_Float16` type support on X86 following the psABI to [X86][RFC][WIP] Enable `_Float16` type support on X86 following the psABI.

Fix a few minor issues. I think it's mature for review now.

skan added inline comments.May 24 2022, 3:47 AM
llvm/docs/ReleaseNotes.rst
136

Just for curiosity, why is SSE2?

llvm/lib/Target/X86/X86ISelLowering.cpp
593

Promote to which type?

5714

Add comments for hasBWI?

5726

Why is this diffferent from isScalarFPTypeInSSEReg in X86FastISel.cpp?

bool isScalarFPTypeInSSEReg(EVT VT) const {
    return ((VT == MVT::f16 || VT == MVT::f64) && Subtarget->hasSSE2()) ||
           (VT == MVT::f32 && Subtarget->hasSSE1());
  }
20809

Need comments

21277

Need comments

llvm/lib/Target/X86/X86InstrAVX512.td
4101

Why do we need compare the prd w/ HasFP16 here?
Couldn't we just use [prd, OptForSize]?

llvm/lib/Target/X86/X86RegisterInfo.td
540

The alignment is not same as the size?

pengfei updated this revision to Diff 432298.May 26 2022, 8:41 AM
pengfei marked 2 inline comments as done.

Address Shengchen's review comments.

llvm/docs/ReleaseNotes.rst
136

We are following to GCC. The more background about why chosing SSE2 can be found here

llvm/lib/Target/X86/X86ISelLowering.cpp
593

If we didn't set the promote type, LLVM will automaticly find the next available type in the same type class as the order defined in MachineValueType.h

5714

Good question! Taking a look at the caller, this is used to combine integer logic to FP logic instructions. Since we don't have FP logic instructions on f16 type, we don't need set true for it.

5726

Good catch! Unlike f32 and f64, we only use SSE register for f16. So no need to check the condition. I'll update the FastISel part, thanks!

llvm/lib/Target/X86/X86InstrAVX512.td
4101

No, we can't. The predicate list AND all its predicates, which means we don't have a pattern for non OptForSize case.

llvm/lib/Target/X86/X86RegisterInfo.td
540

No. This is spill size instead of alignment.

pengfei retitled this revision from [X86][RFC][WIP] Enable `_Float16` type support on X86 following the psABI to [X86][RFC] Enable `_Float16` type support on X86 following the psABI.May 31 2022, 6:04 PM
LuoYuanke added inline comments.Jun 8 2022, 2:00 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
616

Just confused how to expand it. Will the expand fail and finally turns to libcall?

763

Why f16 emulation affect f80 type? Are we checking isTypeLegal(MVT::f80)?

22099

Not sure if it is better to wrapper it into a readable function (e.g., isSoftF16).

22447

Why we don't extent to f32 here?

22521

Why we don't extent to f32 here? Will it be promoted finally?

22764

Should MVT::v8i16 be MVT::v8f16?

22765

Is it rounding control? Can we use a macro or add comments for what is the rounding control?

22774

MVT::f16 and delete the bitcast?

44207

Not sure if it is better to wrapper it into a readable function (e.g., isSoftF16).

llvm/lib/Target/X86/X86InstrAVX512.td
1476

If target don't have avx512bw feature. There is some other pattern to lower the node or fp16 broadcast node is invalid?

4107

Previous prd only apply to "def rr"? Is it a bug for previous code?

4352

Why previous code don't have predicates?

LuoYuanke added inline comments.Jun 8 2022, 2:00 AM
llvm/lib/Target/X86/X86InstrAVX512.td
11657

Why set AddedComplexity to -10? There no such addtional complexity in previous code. Add comments for it?

llvm/lib/Target/X86/X86InstrSSE.td
3970

Why AddedComplexity = -10? Add comments for it?

3978

Miss pattern for store?

5214

Why no AddedComplexity for it?

llvm/lib/Target/X86/X86RegisterInfo.td
540

When there is avx512fp16 feature, is the spill size still 32?

pengfei updated this revision to Diff 435151.Jun 8 2022, 7:15 AM
pengfei marked 3 inline comments as done.

Address Yuanke's comments. Thanks!

llvm/lib/Target/X86/X86ISelLowering.cpp
616

Yeah, we can use LibCall instead.

763

It's in the scope of if (UseX87). And we need to lower fpext half %0 to x86_fp80.

22447

Return SDValue() will extent later. This can save the code.

22521

Yes.

22764

No. We use MVT::v8i16 when we enabled F16C instructions.

22774

I don't think we have pattern to extract f16 from v8i16. Besides, I think keeping the bitcast makes the flow clear.

llvm/lib/Target/X86/X86InstrAVX512.td
1476

Good catch. Added in X86InstrSSE.td

4107

No. previous code works well because no mask variants before AVX512 and no f16 before FP16. The latter is not true now.

4352

Because no legal f16 previously.

11657

We used it before, but very little. We need to make sure select FP16 instructions first if available.

llvm/lib/Target/X86/X86InstrSSE.td
3970

This is to avoid FP16 instructions been overridden.

3978

It's in line 5214.

5214

We don't need it if no BWI.

llvm/lib/Target/X86/X86RegisterInfo.td
540

Yes, it's more efficient to use movss that insert/extrct. And we also use FR32X for AVX512 targets without FP16.

LuoYuanke added inline comments.Jun 8 2022, 7:33 AM
llvm/test/Analysis/CostModel/X86/fptoi_sat.ll
852

It seems the cost is reduced in general. Is it because we pass/return f16 by xmm register?

llvm/test/CodeGen/MIR/X86/inline-asm-registers.mir
31–32

Why f16 patch affect this test case? There is no fp instruction in this test case.

llvm/test/CodeGen/X86/atomic-non-integer.ll
253

I notice X86-SSE1 return by GPR. Should we also return by GPR for X64-SSE?

llvm/test/CodeGen/X86/avx512-insert-extract.ll
2307

Is code less efficient than previous code? Why previous code still works without convert half to float?

llvm/test/CodeGen/X86/avx512-masked_memop-16-8.ll
156

It seems parameter %val is useless.

llvm/test/CodeGen/X86/callbr-asm-bb-exports.ll
20

Why this test is affacted? Is it caused by calling convention change?

llvm/test/CodeGen/X86/fmf-flags.ll
115

Does __gnu_h2f_ieee retrun from xmm?

LuoYuanke added inline comments.Jun 8 2022, 11:51 PM
llvm/test/CodeGen/X86/fpclamptosat.ll
569–570

I'm curious why there is 1 more compare in this patch.

776

Ditto.

llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Is the vector <4 x half> split to 4 scalar and pass by xmm? What's the ABI for vector half? Is there any case that test the scenario that run out of register and pass parameter through stack?

llvm/test/CodeGen/X86/fptosi-sat-scalar.ll
2138

It seems less efficient than previous code on NAN, zero handling, but we can improve later.

llvm/test/CodeGen/X86/half.ll
946

Why the x87 instruction is generated?

pengfei updated this revision to Diff 435583.Jun 9 2022, 9:03 AM
pengfei marked an inline comment as done.

Address Yuanke's comments.

llvm/test/Analysis/CostModel/X86/fptoi_sat.ll
852

No. It's because we don't have cost model for f16. I added some in D127386 to address this.

llvm/test/CodeGen/MIR/X86/inline-asm-registers.mir
31–32

I this it's newly added FR16 that affects all number the other number register class. We met the problem when enabling FP16 too.

llvm/test/CodeGen/X86/atomic-non-integer.ll
253

No. The result in X86-SSE in UB. We support the emulation on SSE2 and later.

llvm/test/CodeGen/X86/avx512-insert-extract.ll
2307

Yes. The previous code using i16 for FP16. Improved, thanks!

llvm/test/CodeGen/X86/callbr-asm-bb-exports.ll
20

No. It's caused by newly added FR16 register class.

llvm/test/CodeGen/X86/fmf-flags.ll
115

There does not exist a __gnu_h2f_ieee on X86 before. It's ARM/AArch64 specific.

llvm/test/CodeGen/X86/fpclamptosat.ll
569–570

It's an optimization implemented by D111976. We don't meet the requirment that isOperationLegalOrCustom. It's not easy to solve because we need to check the promoted type instead. I'll leave it as is.

776

The same as above.

llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Good question! Previously, I discussed with GCC folks we won't support vector in emulation. I expected the FE with pass whole vector through stack. So a vector in IR is illegal to ABI and can be splited.
But seems GCC passes it by vector register. https://godbolt.org/z/a67rMhTW6
I'll double confirm with GCC folks.

llvm/test/CodeGen/X86/fptosi-sat-scalar.ll
2138

Yes. Added FIXMEs.

llvm/test/CodeGen/X86/half.ll
946

On 32 bit, float and double are passed by x87 register.

pengfei added inline comments.Jun 10 2022, 9:34 AM
llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Discussed with GCC folks today. We should support the vector ABI. But we have to adding more patterns to support load/store etc. operations for vector type. I'd like to address this as a follow up.

LuoYuanke accepted this revision.Jun 10 2022, 10:17 PM

LGTM, thanks.

This revision is now accepted and ready to land.Jun 10 2022, 10:17 PM
This revision was landed with ongoing or failed builds.Jun 11 2022, 8:40 PM
This revision was automatically updated to reflect the committed changes.
mehdi_amini added a subscriber: mehdi_amini.EditedJun 12 2022, 8:20 AM

This broke the bot here: https://lab.llvm.org/buildbot/#/builders/61/builds/27616

The cmake invocation includes some GPU specific options that you can omit (-DMLIR_ENABLE_CUDA_RUNNER=1 , -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc, -DMLIR_ENABLE_VULKAN_RUNNER=1, -DMLIR_RUN_CUDA_TENSOR_CORE_TESTS=ON), which should leave out:

cmake ../llvm.src/llvm -DLLVM_BUILD_EXAMPLES=ON '-DLLVM_TARGETS_TO_BUILD=host;NVPTX' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON  -DBUILD_SHARED_LIBS=ON -DLLVM_CCACHE_BUILD=ON -DMLIR_ENABLE_BINDINGS_PYTHON=ON  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

You can probably leave out other options too:

cmake ../llvm.src/llvm '-DLLVM_TARGETS_TO_BUILD=host' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

This broke the bot here: https://lab.llvm.org/buildbot/#/builders/61/builds/27616

The cmake invocation includes some GPU specific options that you can omit (-DMLIR_ENABLE_CUDA_RUNNER=1 , -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc, -DMLIR_ENABLE_VULKAN_RUNNER=1, -DMLIR_RUN_CUDA_TENSOR_CORE_TESTS=ON), which should leave out:

cmake ../llvm.src/llvm -DLLVM_BUILD_EXAMPLES=ON '-DLLVM_TARGETS_TO_BUILD=host;NVPTX' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON  -DBUILD_SHARED_LIBS=ON -DLLVM_CCACHE_BUILD=ON -DMLIR_ENABLE_BINDINGS_PYTHON=ON  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

You can probably leave out other options too:

cmake ../llvm.src/llvm '-DLLVM_TARGETS_TO_BUILD=host' -DLLVM_ENABLE_PROJECTS=mlir  -DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON '-DLLVM_LIT_ARGS=-v -vv' -GNinja

@mehdi_amini Thanks for the commands, I can reproduce it on my local now. Will look into it.

pengfei added inline comments.Jun 16 2022, 9:25 AM
llvm/test/CodeGen/X86/fpclamptosat_vec.ll
605

Addressed by D127982.

MaskRay added a subscriber: MaskRay.EditedJun 24 2022, 4:38 PM

Please include Differential Revision: line for reland commits as well so that people know that this patch has a reland.
Please include the full description, not just what has changed from the previous commit.

https://github.com/llvm/llvm-project/issues/56204 is related to 655ba9c8a1d22075443711cc749f0b032e07adee

In addition, don't use Reland "Reland "Reland "Reland ... One Reland is sufficient.

I'll take care next time. Thanks @MaskRay !

Same as in https://reviews.llvm.org/D114099
It breaks the build on ubuntu bionic, Hirsute, etc on amd64:

"/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/build-llvm/./bin/clang" --target=x86_64-pc-linux-gnu -DVISIBILITY_HIDDEN  -fstack-protector-strong -Wformat -Werror=format-security -Wno-unused-command-line-argument -Wdate-time -D_FORTIFY_SOURCE=2 -O3 -DNDEBUG -m32 -DCOMPILER_RT_HAS_FLOAT16 -std=c11 -fPIC -fno-builtin -fvisibility=hidden -fomit-frame-pointer -MD -MT CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -MF CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o.d -o CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -c '/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c'
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c:11:
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend_impl.inc:38:
/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend.h:44:9: error: _Float16 is not supported on this target
typedef _Float16 src_t;
        ^
1 error generated.

Same as in https://reviews.llvm.org/D114099
It breaks the build on ubuntu bionic, Hirsute, etc on amd64:

"/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/build-llvm/./bin/clang" --target=x86_64-pc-linux-gnu -DVISIBILITY_HIDDEN  -fstack-protector-strong -Wformat -Werror=format-security -Wno-unused-command-line-argument -Wdate-time -D_FORTIFY_SOURCE=2 -O3 -DNDEBUG -m32 -DCOMPILER_RT_HAS_FLOAT16 -std=c11 -fPIC -fno-builtin -fvisibility=hidden -fomit-frame-pointer -MD -MT CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -MF CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o.d -o CMakeFiles/clang_rt.builtins-i386.dir/extendhfsf2.c.o -c '/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c'
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/extendhfsf2.c:11:
In file included from /build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend_impl.inc:38:
/build/llvm-toolchain-snapshot-15~++20220702091600+23ee84f43201/compiler-rt/lib/builtins/fp_extend.h:44:9: error: _Float16 is not supported on this target
typedef _Float16 src_t;
        ^
1 error generated.

Hi @sylvestre.ledru , thanks for reporting this issue.

It looks to me a configuration (or option mismatch) problem in compiler-rt. We support the _Float16 type on targets that have SSE2 and/or up features. A 32-bit target doesn't enable SSE2 feature by default. This should be fine because the cmake of compiler-rt will detect the buildable of _Float16 first and set COMPILER_RT_HAS_FLOAT16 accordingly. So this issue looks to me it passed the detection of _Float16 with a SSE2 enabled option but built the compiler-rt with a different option (SSE2 disabled).

I'd suggest to add an extra -msse2 when build it if possible. Otherwise, don't let -DCOMPILER_RT_HAS_FLOAT16 been passed here.

@pengfei I am not convinced it is an issue on my side. I don't have anything particular in this area and using a stage2 build system.

Anyway, this patch fixes the issue on my side:
https://salsa.debian.org/pkg-llvm-team/llvm-toolchain/-/blob/snapshot/debian/patches/force-sse2-compiler-rt.diff

@pengfei I am not convinced it is an issue on my side. I don't have anything particular in this area and using a stage2 build system.

Anyway, this patch fixes the issue on my side:
https://salsa.debian.org/pkg-llvm-team/llvm-toolchain/-/blob/snapshot/debian/patches/force-sse2-compiler-rt.diff

I don't have much experience in compiler-rt and multi stage build. So I may be wrong. It looks to me like an existing problem just exposed by this patch. The diff is another proof.
The build command tells us it's a 32-bit build. But the change for x86_64 solves it, which confirms my previous guess: You are using one configure for CMake (probobally 64 bit) but build for 32 bit target.
Although the diff works, it doesn't look a clean solution to me. But I don't have better suggestion either.

Hi @pengfei, I am working on flang, and after this patch, we started to see some bugs in Fortran programs using REAL(2) (which is fp16 in flang). I am not an expert in LLVM codegen and the builtins, but I am wondering if there is not issue with how llvm codegen thinks __truncsfhf2 returns its value and how the runtime actually does return it.

Here is an llvm IR reproducer for a bug we saw:

define void @bug(ptr %addr, i32 %i) {
  %1 = sitofp i32 %i to half
  store half %1, ptr %addr, align 2
  ret void
}

After this patch the generated assembly on X86 is:

bug:                                    # @bug
        push    rbx
        mov     rbx, rdi
        cvtsi2ss        xmm0, esi
        call    __truncsfhf2@PLT
        pextrw  eax, xmm0, 0
        mov     word ptr [rbx], ax
        pop     rbx
        ret

When running this from a C program to test integers are casted to floats, I am only seeing the bytes of the passed address being set to zero (regardless of the input). It seems to me that there is an issue around the __truncsfhf2 interface. The pextrw eax, xmm0, 0 after the call seems to suggest LLVM codegen is looking for the result in xmm0 register, but it seems that __truncsfhf2 is only returning it in eax.

Do you have any idea what could be the issue ?

Hi @jeanPerier , yes, you are right. This patch changes the calling conversion of fp16 from GPRs to XMMs. So you need to update the runtime. If you are using compiler-rt, you could simply re-build it with trunk code, or at least after rGabeeae57. If you are using your own runtime, you can solve the problem through the way in https://github.com/llvm/llvm-project/issues/56156

Hi @jeanPerier , yes, you are right. This patch changes the calling conversion of fp16 from GPRs to XMMs. So you need to update the runtime. If you are using compiler-rt, you could simply re-build it with trunk code, or at least after rGabeeae57. If you are using your own runtime, you can solve the problem through the way in https://github.com/llvm/llvm-project/issues/56156

Thanks for the quick reply. I was using a compiler-rt from the trunk source but not building it with a clang compiler compiled from the trunk. I did not know the version of clang used to compiled compiler-rt mattered that much. Using clang from the trunk (or at least after the commit you mentionnned) solved my problem. Thanks !

Thanks for confirming it! I don't have much experience in compiler-rt. But I think the version of clang matters much to compiler-rt particular in ABI changing cases like this :)

@pengfei We are also hitting the following assertion with this patch. Do you have any idea why?

/llvm-project/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:4333: void {anonymous}::SelectionDAGLegalize::ConvertNodeToLibcall(llvm::SDNode*): Assertion `cast<ConstantSDNode>(Node->getOperand(IsStrict ? 2 : 1))->isZero() && "Unable to expand as libcall if it is not normal rounding"' failed.

LLVM IR triggering the assertion.

; ModuleID = 'FIRModule'
source_filename = "FIRModule"
target triple = "x86_64-unknown-linux-gnu"

@_QMhp237Ea11 = global half 0xH3D00
@_QMhp237Eb1a = global half 0xH5640
@_QMf90_kindECascii = external constant i32
@_QMf90_kindECbyte = external constant i32
@_QMf90_kindECdouble = external constant i32
@_QMiso_fortran_envECint16 = external constant i32
@_QMiso_fortran_envECint32 = external constant i32
@_QMiso_fortran_envECint64 = external constant i32
@_QMiso_fortran_envECint8 = external constant i32
@_QMf90_kindECjis = external constant i32
@_QMiso_fortran_envEClogical16 = external constant i32
@_QMiso_fortran_envEClogical32 = external constant i32
@_QMiso_fortran_envEClogical64 = external constant i32
@_QMiso_fortran_envEClogical8 = external constant i32
@_QMf90_kindECnot_available = external constant i32
@_QMf90_kindECquad = external constant i32
@_QMiso_fortran_envECreal128 = external constant i32
@_QMf90_kindECreal16 = external constant i32
@_QMiso_fortran_envECreal32 = external constant i32
@_QMiso_fortran_envECreal64 = external constant i32
@_QMf90_kindECreal64x2 = external constant i32
@_QMf90_kindECsingle = external constant i32
@_QMf90_kindECtwobyte = external constant i32
@_QMf90_kindECucs2 = external constant i32
@_QMf90_kindECucs4 = external constant i32
@_QMf90_kindECword = external constant i32
@_QQcl.2E2F627567312E66393000 = linkonce constant [11 x i8] c"./bug1.f90\00"
@_QQcl.2831362C313629 = linkonce constant [7 x i8] c"(16,16)"
@_QQcl.28346631302E3329 = linkonce constant [8 x i8] c"(4f10.3)"

declare ptr @malloc(i64)

declare void @free(ptr)

define void @_QQmain() !dbg !3 {
  %1 = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8, !dbg !7
  %2 = alloca half, i64 1, align 2, !dbg !9
  %3 = call ptr @_FortranAioBeginExternalListOutput(i32 -1, ptr @_QQcl.2E2F627567312E66393000, i32 9), !dbg !10
  %4 = call i1 @_FortranAioOutputAscii(ptr %3, ptr @_QQcl.2831362C313629, i64 7), !dbg !11
  %5 = call i32 @_FortranAioEndIoStatement(ptr %3), !dbg !12
  %6 = call ptr @_FortranAioBeginExternalFormattedOutput(ptr @_QQcl.28346631302E3329, i64 8, i32 -1, ptr @_QQcl.2E2F627567312E66393000, i32 10), !dbg !13
  %7 = load half, ptr @_QMhp237Ea11, align 2, !dbg !14
  %8 = load half, ptr @_QMhp237Eb1a, align 2, !dbg !15
  %9 = fpext half %7 to float, !dbg !16
  %10 = fpext half %8 to float, !dbg !17
  %11 = call float @llvm.copysign.f32(float %9, float %10), !dbg !18
  %12 = fptrunc float %11 to half, !dbg !19
  store half %12, ptr %2, align 2, !dbg !20
  %13 = insertvalue { ptr, i64, i32, i8, i8, i8, i8 } { ptr undef, i64 2, i32 20180515, i8 0, i8 25, i8 0, i8 0 }, ptr %2, 0, !dbg !7
  store { ptr, i64, i32, i8, i8, i8, i8 } %13, ptr %1, align 8, !dbg !7
  %14 = call i1 @_FortranAioOutputDescriptor(ptr %6, ptr %1), !dbg !21
  %15 = call i32 @_FortranAioEndIoStatement(ptr %6), !dbg !22
  ret void, !dbg !23
}

declare ptr @_FortranAioBeginExternalListOutput(i32, ptr, i32)

declare i1 @_FortranAioOutputAscii(ptr, ptr, i64)

declare i32 @_FortranAioEndIoStatement(ptr)

declare ptr @_FortranAioBeginExternalFormattedOutput(ptr, i64, i32, ptr, i32)

declare i1 @_FortranAioOutputDescriptor(ptr, ptr)

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare float @llvm.copysign.f32(float, float) #0

attributes #0 = { nocallback nofree nosync nounwind readnone speculatable willreturn }

!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!2}

!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "mlir", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
!1 = !DIFile(filename: "FIRModule", directory: "/")
!2 = !{i32 2, !"Debug Info Version", i32 3}
!3 = distinct !DISubprogram(name: "_QQmain", linkageName: "_QQmain", scope: null, file: !4, line: 9, type: !5, scopeLine: 9, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !6)
!4 = !DIFile(filename: "<stdin>", directory: "/local/home/vclement/llvm-project/build")
!5 = !DISubroutineType(types: !6)
!6 = !{}
!7 = !DILocation(line: 39, column: 9, scope: !8)
!8 = !DILexicalBlockFile(scope: !3, file: !4, discriminator: 0)
!9 = !DILocation(line: 10, column: 8, scope: !8)
!10 = !DILocation(line: 17, column: 8, scope: !8)
!11 = !DILocation(line: 22, column: 8, scope: !8)
!12 = !DILocation(line: 23, column: 9, scope: !8)
!13 = !DILocation(line: 31, column: 9, scope: !8)
!14 = !DILocation(line: 32, column: 9, scope: !8)
!15 = !DILocation(line: 33, column: 9, scope: !8)
!16 = !DILocation(line: 34, column: 9, scope: !8)
!17 = !DILocation(line: 35, column: 9, scope: !8)
!18 = !DILocation(line: 36, column: 9, scope: !8)
!19 = !DILocation(line: 37, column: 9, scope: !8)
!20 = !DILocation(line: 38, column: 3, scope: !8)
!21 = !DILocation(line: 41, column: 9, scope: !8)
!22 = !DILocation(line: 42, column: 9, scope: !8)
!23 = !DILocation(line: 43, column: 3, scope: !8)

Thanks @clementval for reporting it and the reproducer. Put a patch D129294 to address it.