[X86][MS] Add 80bit long double support for Windows
ClosedPublic

Authored by pengfei on Dec 9 2021, 5:08 AM.

Details

Summary

MSVC currently doesn't support 80-bit long double, but ICC does support
it on Windows. In addition, some users have asked for this feature;
discussions can be found on Stack Overflow, MSDN, etc.

Given that Clang already supports -mlong-double-80, extending it to
Windows seems worthwhile.

Event Timeline

pengfei created this revision. Dec 9 2021, 5:08 AM
pengfei requested review of this revision. Dec 9 2021, 5:08 AM
Herald added projects: Restricted Project, Restricted Project. Dec 9 2021, 5:08 AM
erichkeane accepted this revision. Dec 9 2021, 5:58 AM

This looks good to me, and mirrors something I implemented in our downstream a few years ago. Please don't submit for another day or two to give others a chance to review.

This revision is now accepted and ready to land. Dec 9 2021, 5:58 AM

Doesn’t icc also emit code into main to change the FPCW precision control field? Is making long double 80 bits useful if you don’t increase the precision in hardware?

> Doesn’t icc also emit code into main to change the FPCW precision control field? Is making long double 80 bits useful if you don’t increase the precision in hardware?

Yes, but that is controlled by an independent option, /Qpc80. We can do a follow-up if it is required. Currently, I think this is still useful for building libraries and for functional compatibility with ICC. Even without /Qpc80, when objects built by the two compilers are linked together, we still get a less precise result rather than a wrong one. Besides, users can always set the precision control manually.
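A minimal sketch, not part of the patch, that separates the two things discussed above: LDBL_MANT_DIG reports the long double format the compiler chose, while the hypothetical helper measured_long_double_bits() probes how many significand bits long double arithmetic actually delivers at run time. If the x87 precision-control field is left at 53 bits (what /Qpc80 changes for ICC), the measured value would be expected to stay at 53 even when LDBL_MANT_DIG is 64.

```cpp
#include <cfloat>

// Significand bits of the long double storage format chosen by the compiler:
// 53 when long double is a plain double, 64 for the x87 80-bit format.
constexpr int format_long_double_bits() { return LDBL_MANT_DIG; }

// Hypothetical helper: counts the significand bits (including the leading
// bit) that long double arithmetic actually delivers at run time. The
// volatile temporaries force every intermediate result through memory, so
// the compiler cannot fold the loop away at compile time.
int measured_long_double_bits() {
    volatile long double one = 1.0L;
    volatile long double eps = 1.0L;
    int bits = 1;  // the leading bit of 1.0
    for (;;) {
        volatile long double half = eps / 2;
        volatile long double sum = one + half;
        if (sum == one) {
            break;  // 1 + half rounded back to 1: no further bits available
        }
        eps = half;
        ++bits;
    }
    return bits;
}
```

Comparing the two values in a program built with -mlong-double-80 is one way to check whether the hardware precision actually matches the type.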

rnk added a subscriber: mstorsjo. Dec 10 2021, 7:57 PM

I seem to recall assuming that Windows long double was 64-bits in many, many places. Unfortunately, I have no idea where that could've happened. Something that comes to mind immediately is the MSVC name mangler. I don't think that's a blocking issue, but it's something you should be aware of if you want to promote this flag's usage.

@mstorsjo, can you advise what GCC does here? I've forgotten how this is supposed to work.

Inline comment on clang/lib/Basic/Targets/X86.h, line 537 (on Diff #393120):

If GCC aligns f80 to 16 bytes, we might as well make the change here and share it with the mingw target.

> I seem to recall assuming that Windows long double was 64-bits in many, many places. Unfortunately, I have no idea where that could've happened.

Is it because MSDN explicitly declares it? https://docs.microsoft.com/en-us/previous-versions/9cx8xs15(v=vs.140)?redirectedfrom=MSDN

> Something that comes to mind immediately is the MSVC name mangler. I don't think that's a blocking issue, but it's something you should be aware of if you want to promote this flag's usage.

Not sure if I got your point. I checked that both Clang and MSVC can mangle/demangle it, e.g., "long double foo(long double a, long double b)" <==> "?foo@@YAOOO@Z". So this is not a problem?

>undname "?foo@@YAOOO@Z"
Microsoft (R) C++ Name Undecorator
Copyright (C) Microsoft Corporation. All rights reserved.

Undecoration of :- "?foo@@YAOOO@Z"
is :- "long double __cdecl foo(long double,long double)"

> I seem to recall assuming that Windows long double was 64-bits in many, many places. Unfortunately, I have no idea where that could've happened.

Nothing comes to mind for me about that - in _most_ cases, Windows is kinda oblivious to long double, as nothing in Windows public API uses that type.

However, outside of the core OS, any CRT function that uses long double is going to be wrong. The C99 runtime has plenty of long double functions: a separate l-suffixed version of most math functions, but also more important conversion functions like strtold.
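To make the CRT concern concrete, here is a hedged self-check (the helper names are hypothetical). strtold and snprintf("%Lg") come from the CRT, while the literal 0.1L and the constant 1.0L / 3.0L are evaluated by the compiler. With a CRT that treats long double as a 64-bit double and a compiler emitting 80-bit values, these checks would be expected to fail; with a consistent toolchain they should pass.

```cpp
#include <cstdio>
#include <cstdlib>

// Does the CRT parse decimal text into the same value the compiler produces?
bool crt_strtold_matches_compiler() {
    long double parsed = std::strtold("0.1", nullptr);
    return parsed == 0.1L;
}

// Does the CRT's long double formatting survive a round trip through strtold?
bool crt_printf_round_trips() {
    const long double value = 1.0L / 3.0L;
    char buf[64];
    // 40 significant digits exceed what x87 (21) or IEEE quad (36) need
    // for an exact round trip.
    std::snprintf(buf, sizeof buf, "%.40Lg", value);
    return std::strtold(buf, nullptr) == value;
}
```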

> Something that comes to mind immediately is the MSVC name mangler. I don't think that's a blocking issue, but it's something you should be aware of if you want to promote this flag's usage.

Oh, right, I have no familiarity with those aspects and what might break there.

> @mstorsjo, can you advise what GCC does here? I've forgotten how this is supposed to work.

In GCC on Windows (and clang in mingw mode), long double is always 80 bit on x86. (On i386, sizeof(long double) == 12, while on x86_64 it's 16.)

Regarding the initial FPU state, I think the statically linked CRT startup bits differ from MSVC in this aspect, so the x87 state is initialized in 80 bit mode.

Then for runtime functions, mingw handles this by providing its own (statically linked) reimplementation of essentially all functions that touch long doubles. For math and similar functions it's pretty straightforward, but printf is a bit of a mess: we'd otherwise want to use UCRT's (otherwise standards-compliant) printf functions, but whenever long doubles are involved (very rarely in practice, though libc++'s test suite does exercise them), the mingw-provided version has to be used.

For mingw on arm (32 and 64) I haven't wanted to introduce any further deviance from MSVC, so there it's all identical to plain double.

Thanks @mstorsjo for the inputs.

> However outside of the core OS, any function in the CRT, that uses long doubles, is going to be wrong

Good point! I hadn't thought much about the CRT library. But I don't think this is a blocking issue, given that:

  1. The option is off by default, so it doesn't break code that uses the default CRT.
  2. Users who enable this option should be aware of the difference in the long double type. I can think of two usage scenarios:
    • Users who have their own CRT libraries. This is the case for our downstream compiler.
    • Users who are using a freestanding environment, or a CRT with their own implementation of the long double functions.

> In GCC on Windows (and clang in mingw mode), long double is always 80 bit on x86. (On i386, sizeof(long double) == 12, while on x86_64 it's 16.)

How about the alignment? I can see that in the i386 Linux case the alignment is 4; I assume it is also 4 for GCC on Windows, right?

>> In GCC on Windows (and clang in mingw mode), long double is always 80 bit on x86. (On i386, sizeof(long double) == 12, while on x86_64 it's 16.)

> How about the alignment? I can see that in the i386 Linux case the alignment is 4; I assume it is also 4 for GCC on Windows, right?

Yes, it's 4 for i386 in GCC on Windows too (and Clang in mingw mode). For x86_64, both sizeof and alignof are 16.
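The sizes and alignments stated in this thread can be written down as compile-time checks. This is a sketch under the assumption that the usual predefined macros (__MINGW32__, __i386__, __x86_64__, __arm__, __aarch64__, __linux__, __ANDROID__) identify the target; Android is excluded because it uses a different long double format, and targets not listed (including MSVC targets) are left unchecked.

```cpp
#include <cstddef>

// Layouts stated in this thread, checked at compile time.
#if defined(__MINGW32__) && defined(__i386__)
static_assert(sizeof(long double) == 12 && alignof(long double) == 4,
              "GCC/Clang mingw i386: 80-bit value in 12 bytes, 4-byte aligned");
#elif defined(__MINGW32__) && defined(__x86_64__)
static_assert(sizeof(long double) == 16 && alignof(long double) == 16,
              "GCC/Clang mingw x86_64: 80-bit value in 16 bytes, 16-byte aligned");
#elif defined(__MINGW32__) && (defined(__arm__) || defined(__aarch64__))
static_assert(sizeof(long double) == sizeof(double),
              "mingw on ARM: long double is identical to double");
#elif defined(__linux__) && !defined(__ANDROID__) && defined(__i386__)
static_assert(alignof(long double) == 4, "i386 Linux: 4-byte alignment");
#endif

constexpr std::size_t long_double_size() { return sizeof(long double); }
constexpr std::size_t long_double_align() { return alignof(long double); }
```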

rnk added a comment. Dec 13 2021, 10:01 AM

>>> In GCC on Windows (and clang in mingw mode), long double is always 80 bit on x86. (On i386, sizeof(long double) == 12, while on x86_64 it's 16.)

>> How about the alignment? I can see that in the i386 Linux case the alignment is 4; I assume it is also 4 for GCC on Windows, right?

> Yes, it's 4 for i386 in GCC on Windows too (and Clang in mingw mode). For x86_64, both sizeof and alignof are 16.

Yeah, the alignment is the key thing which is generating a lot of the MSVC-specific complexity.

I have a thought. Why do you need to change the LLVM data layout at all? Clang's record layout is distinct from LLVM's data layout. This is similar to how -malign-double works, which does not affect LLVM's data layout, it is entirely a frontend change.
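A small illustration of the frontend-side record layout mentioned above; the struct names are hypothetical. The member offsets are chosen by the compiler's record layout rules for the target, and on i386, building with and without -malign-double is expected to move WithDouble::d from offset 4 to offset 8.

```cpp
#include <cstddef>

// Hypothetical structs for inspecting the frontend's record layout.
struct WithDouble {
    char c;
    double d;  // i386: offset 4 by default, 8 with -malign-double (expected)
};

struct WithLongDouble {
    char c;
    long double d;  // padded up to alignof(long double)
};

constexpr std::size_t double_member_offset() { return offsetof(WithDouble, d); }
constexpr std::size_t long_double_member_offset() {
    return offsetof(WithLongDouble, d);
}
```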

Does any of this impact the -f128 support? We use f128 x-float on OpenVMS. We've historically only aligned on 8-byte boundaries for legacy reasons (I'm not opposed to having my own mods to control the record layout and/or data layout)

> I have a thought. Why do you need to change the LLVM data layout at all? Clang's record layout is distinct from LLVM's data layout. This is similar to how -malign-double works, which does not affect LLVM's data layout, it is entirely a frontend change.

We have to change the LLVM data layout because it's required by the calling convention.

  1. We specify the alignment of f80 on 32-bit as 0: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86CallingConv.td#L869
  2. This means the alignment is actually determined by the data layout: https://github.com/llvm/llvm-project/blob/main/llvm/utils/TableGen/CallingConvEmitter.cpp#L200

As far as I understand it, -malign-double only affects the alignment of local and global variables. It has nothing to do with the calling convention.
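The two-step rule above can be modelled with a few lines of arithmetic. This is a simplified sketch, not LLVM's implementation: it assumes that a rule like CCAssignToStack<0, 0> takes both the slot size and the alignment from the data layout, and that each stack argument is placed at the next offset rounded up to its alignment. The ArgType, align_up and assign_stack_offsets names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ArgType {
    std::size_t size;   // alloc size taken from the data layout, in bytes
    std::size_t align;  // ABI alignment taken from the data layout, in bytes
};

// Rounds offset up to the next multiple of align (a power of two).
std::size_t align_up(std::size_t offset, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

// Places each argument at the next suitably aligned offset in the
// outgoing argument area and returns the offsets in order.
std::vector<std::size_t> assign_stack_offsets(const std::vector<ArgType>& args) {
    std::vector<std::size_t> offsets;
    std::size_t next = 0;
    for (const ArgType& a : args) {
        next = align_up(next, a.align);
        offsets.push_back(next);
        next += a.size;
    }
    return offsets;
}
```

For foo(int, long double, int), a 4-byte-aligned 12-byte f80 gives offsets 0, 4 and 16, while a 16-byte-aligned 16-byte f80 gives 0, 16 and 32, the 16-byte spacing that shows up in the icl disassembly later in this thread.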

> Does any of this impact the -f128 support? We use f128 x-float on OpenVMS. We've historically only aligned on 8-byte boundaries for legacy reasons (I'm not opposed to having my own mods to control the record layout and/or data layout)

I don't think so. f128 has its own layout entry (-f128:xxx), though I found that only X86-32 MCU defines it in the trunk code: https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/X86.h#L628

rnk added a comment. Dec 14 2021, 1:00 PM

> We have to change the LLVM data layout because it's required by the calling convention.

Is that necessary? It would be simpler to leave the fp80 value 4 byte aligned, which I believe is consistent with the way doubles are passed unaligned. GCC doesn't align fp80 long double to 16 bytes on i686, so I see no reason for LLVM to do it. Is there some other compiler that you need ABI compatibility with?

Also consider that in LLVM, the alignment of arguments passed in memory is not observable (unless byval or inalloca is used). If the user takes the address of an argument, they actually take the address of a local alloca, which is a copy of the argument. The frontend (clang) decides the alignment of the alloca.

> GCC doesn't align fp80 long double to 16 bytes on i686, so I see no reason for LLVM to do it. Is there some other compiler that you need ABI compatibility with?

Yes. ICC aligns long double to 16 bytes on 32-bit Windows (I mentioned it in the summary :). In contrast with GCC, ICC is more compatible with MSVC, so I think it's reasonable to align with ICC rather than GCC.

> Also consider that in LLVM, the alignment of arguments passed in memory is not observable (unless byval or inalloca is used). If the user takes the address of an argument, they actually take the address of a local alloca, which is a copy of the argument. The frontend (clang) decides the alignment of the alloca.

We cannot force users to always pass arguments by address. Once they are passed by value (which is common in practice; we usually write foo(double a) rather than foo(double *a)), the alignment falls under the calling convention. It's true that all basic types except f80 are aligned to 4 when passed by value on 32-bit. But 32-bit Windows is not alone: 32-bit Darwin uses the same calling convention for f80 too.

rnk added a comment. Dec 15 2021, 10:04 AM

Let me know if it would be more helpful to set up a call; that might help us reach agreement sooner. I've used Discord for this previously if that works for you, my username there is the same (@rnk#8591).

>> GCC doesn't align fp80 long double to 16 bytes on i686, so I see no reason for LLVM to do it. Is there some other compiler that you need ABI compatibility with?

> Yes. ICC aligns long double to 16 bytes on 32-bit Windows (I mentioned it in the summary :). In contrast with GCC, ICC is more compatible with MSVC, so I think it's reasonable to align with ICC rather than GCC.

My comment here refers to the alignment of argument values, not user-declared variables. The frontend controls the alignment of user-declared variables by setting the alloca alignment. GCC and ICC appear to align long double arguments to 4 bytes: https://gcc.godbolt.org/z/PbobWdrPf

>> Also consider that in LLVM, the alignment of arguments passed in memory is not observable (unless byval or inalloca is used). If the user takes the address of an argument, they actually take the address of a local alloca, which is a copy of the argument. The frontend (clang) decides the alignment of the alloca.

> We cannot force users to always pass arguments by address. Once they are passed by value (which is common in practice; we usually write foo(double a) rather than foo(double *a)), the alignment falls under the calling convention. It's true that all basic types except f80 are aligned to 4 when passed by value on 32-bit. But 32-bit Windows is not alone: 32-bit Darwin uses the same calling convention for f80 too.

I don't intend to ask users to pass arguments by address. What I mean is, LLVM will copy an under-aligned long double argument passed in memory to properly aligned memory. Consider this example using double:
https://gcc.godbolt.org/z/v6oTfrTr5

void escape(void*);
void foo(int x, double d, int y) {
  escape(&x);
  escape(&d);
  escape(&y);
}

-->

"?foo@@YAXHNH@Z": # @"?foo@@YAXHNH@Z"
  push ebp
  mov ebp, esp
  # Align stack to 8 bytes
  and esp, -8
  sub esp, 8
  # Copy bytes of d to aligned memory
  movsd xmm0, qword ptr [ebp + 12] # xmm0 = mem[0],zero
  movsd qword ptr [esp], xmm0
  lea eax, [ebp + 8]
...
  # Take address of aligned alloca at ESP+0
  mov eax, esp
  push eax
  call "?escape@@YAXPAX@Z"

So, I'm still not convinced we have to change the LLVM data layout. Is there some other aspect of calling convention lowering that needs to know the alignment of long double?

Hi @rnk, mine is Phoebe#3036. I haven't really used it before. Not sure if I invited you correctly, so I'll try to explain here.

> My comment here refers to the alignment of argument values, not user-declared variables. The frontend controls the alignment of user-declared variables by setting the alloca alignment.

Sure. We have the same goal :)

> GCC and ICC appear to align long double arguments to 4 bytes: https://gcc.godbolt.org/z/PbobWdrPf

Unfortunately, Compiler Explorer doesn't provide the Windows version of ICC, and ICC's alignment on Windows is different from Linux. Here are my local results:

> cat ex5.cpp
void escape(void*);
void foo(int x, long double d, int y) {
  escape(&x);
  escape(&d);
  escape(&y);
}

> icl -c ex5.cpp /Qlong-double /Qpc80
> dumpbin.exe /disasm ex5.obj
Microsoft (R) COFF/PE Dumper Version 14.29.30133.0
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file ex5.obj

File Type: COFF OBJECT

?foo@@YAXH_TH@Z (void __cdecl foo(int,decltype(auto),int)):
  00000000: 8D 44 24 04        lea         eax,[esp+4]
  00000004: 50                 push        eax
  00000005: E8 00 00 00 00     call        ?escape@@YAXPAX@Z
  0000000A: 8D 44 24 18        lea         eax,[esp+18h]
  0000000E: 50                 push        eax
  0000000F: E8 00 00 00 00     call        ?escape@@YAXPAX@Z
  00000014: 8D 44 24 2C        lea         eax,[esp+2Ch]
  00000018: 50                 push        eax
  00000019: E8 00 00 00 00     call        ?escape@@YAXPAX@Z
  0000001E: 83 C4 0C           add         esp,0Ch
  00000021: C3                 ret
  00000022: 0F 1F 80 00 00 00  nop         dword ptr [eax]
            00
  00000029: 0F 1F 80 00 00 00  nop         dword ptr [eax]
            00

  Summary

          B7 .drectve
          30 .text

As you can see, x and d are 16 bytes apart.
I realize I should have posted these results earlier to avoid the ambiguity. Sorry for the inconvenience!
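The 16-byte distance can be checked from the dump by correcting each lea displacement for the push eax instructions executed before it. A small sketch of that arithmetic (entry_offset is a hypothetical helper), with offsets relative to ESP at function entry:

```cpp
// At entry, [esp] holds the return address and the first argument starts at
// [esp+4]. Each earlier `push eax` has moved ESP down by 4 bytes; the calls
// themselves leave ESP unchanged, because the pushed arguments are only
// popped by the final `add esp,0Ch`.
constexpr int entry_offset(int lea_displacement, int pushes_before) {
    return lea_displacement - 4 * pushes_before;
}

// Displacements copied from the dump: [esp+4], [esp+18h], [esp+2Ch].
constexpr int x_off = entry_offset(0x04, 0);  // 4
constexpr int d_off = entry_offset(0x18, 1);  // 20
constexpr int y_off = entry_offset(0x2C, 2);  // 36

static_assert(d_off - x_off == 16, "x and d are 16 bytes apart");
static_assert(y_off - d_off == 16, "d occupies a 16-byte slot");
```

Under GCC's i386 layout (4-byte alignment, 12-byte size), the same arithmetic would put d at offset 8 and y at offset 20.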

rnk added a comment. Dec 16 2021, 9:29 AM

I see, thanks for the info. Can you please add a targeted LLVM test for long double arguments? From what I can tell, the auto-generated update_llc_test_checks.py style tests are not a good fit for testing parameter passing because they pattern-match away the stack offsets which are relevant to the test.

I think it's also worth breaking this into LLVM and clang-side patches:

  1. make clang emit x86_fp80 with -mlong-double-80
  2. update LLVM datalayout (must affect clang via clang/lib/Basic/Targets/X86.h) to align fp80 to 16 bytes in the MSVC environment
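Step 2 amounts to changing the f80 entry of the target's data layout string. In LLVM's data layout syntax, an entry of the form f<size>:<abi>[:<pref>] gives the alignment of a floating-point type in bits. The helper below (a hypothetical name) extracts that ABI alignment; the layout strings used with it in the note afterwards are illustrative and abbreviated, not copied from Clang.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the ABI alignment, in bits, that an LLVM data
// layout string gives the float type of the given width through an
// "f<size>:<abi>[:<pref>]" entry, or 0 if there is no such entry.
// Entries in the string are separated by '-'.
unsigned float_abi_align_bits(const std::string& layout, unsigned width) {
    const std::string key = "f" + std::to_string(width) + ":";
    std::size_t start = 0;
    while (start <= layout.size()) {
        std::size_t end = layout.find('-', start);
        if (end == std::string::npos) {
            end = layout.size();
        }
        const std::string spec = layout.substr(start, end - start);
        if (spec.compare(0, key.size(), key) == 0) {
            // strtoul stops at the ':' before an optional preferred alignment.
            return static_cast<unsigned>(
                std::strtoul(spec.c_str() + key.size(), nullptr, 10));
        }
        start = end + 1;
    }
    return 0;
}
```

For example, float_abi_align_bits("e-m:x-p:32:32-i64:64-f80:32-n8:16:32-a:0:32-S32", 80) returns 32 (4 bytes), and the same string with f80:128 returns 128 (16 bytes).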

pengfei updated this revision to Diff 395128. Dec 17 2021, 7:30 AM
This comment was removed by pengfei.
pengfei updated this revision to Diff 395129. Dec 17 2021, 7:32 AM

Split the LLVM datalayout to a different patch.

rnk accepted this revision. Jan 6 2022, 2:32 PM

lgtm

I believe you can go ahead and land this, it doesn't depend on the data layout changes.

This revision was landed with ongoing or failed builds. Feb 13 2022, 9:51 PM
This revision was automatically updated to reflect the committed changes.