This is an archive of the discontinued LLVM Phabricator instance.

Fix i386 ABI "__m64" type bug
Needs RevisionPublic

Authored by wxiao3 on Mar 23 2019, 9:08 PM.

Details

Summary

According to System V i386 ABI: the __m64 type paramater and return value are
passed by MMX registers. But current implementation treats __m64 as i64
which results in parameter passing by stack and returning by EDX and EAX.

This patch fixes the bug (https://bugs.llvm.org/show_bug.cgi?id=41029) for Linux
and NetBSD.

Diff Detail

Event Timeline

wxiao3 created this revision.Mar 23 2019, 9:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 23 2019, 9:08 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
wxiao3 edited the summary of this revision. (Show Details)Mar 24 2019, 1:10 AM
wxiao3 added a comment.Apr 3 2019, 7:37 PM

Dear reviewers, any comments?

rnk added inline comments.Apr 4 2019, 2:49 PM
lib/CodeGen/TargetInfo.cpp
9478–9479

The Sys V rules apply to every non-Windows OS, not just Linux. I think you should add the parameter regardless of the OS

RKSimon added inline comments.Apr 8 2019, 2:12 AM
test/CodeGen/x86_32-mmx-linux.c
2 ↗(On Diff #192021)

Test on more triples and add the test file to trunk with current codegen so this patch shows the delta

wxiao3 updated this revision to Diff 195096.Apr 14 2019, 10:58 PM
wxiao3 marked 2 inline comments as done.
RKSimon added inline comments.Apr 15 2019, 10:19 AM
test/CodeGen/x86_32-m64-darwin.c
1 ↗(On Diff #195096)

You should be able to merge all of these triples into the same test file, each with their own RUN: line, you will need to add a FileCheck prefix, something like:

| FileCheck %s --check-prefixes=CHECK,DARWIN
| FileCheck %s --check-prefixes=CHECK,IAMCU
| FileCheck %s --check-prefixes=CHECK,LINUX
| FileCheck %s --check-prefixes=CHECK,WIN32
wxiao3 updated this revision to Diff 195291.Apr 15 2019, 7:24 PM
wxiao3 marked an inline comment as done.

One last style comment from me but we need somebody better with the different ABIs to finally approve this.

lib/CodeGen/TargetInfo.cpp
1421

superfluous braces?

wxiao3 updated this revision to Diff 195659.Apr 17 2019, 6:25 PM
wxiao3 updated this revision to Diff 195660.Apr 17 2019, 6:33 PM
wxiao3 marked an inline comment as done.
rnk added inline comments.
lib/CodeGen/TargetInfo.cpp
919

I think looking at the LLVM type to decide how something should be passed is a bad pattern to follow. We should look at the clang AST to decide how things will be passed, not LLVM types. Would that be complicated? Are there aggregate types that end up getting passed directly in MMX registers?

9480

I think this needs to preserve existing behavior for Darwin and PS4 based on comments from @rjmccall and @dexonsmith in D60748.

wxiao3 marked 2 inline comments as done.Apr 27 2019, 8:08 AM
wxiao3 added inline comments.
lib/CodeGen/TargetInfo.cpp
919

For x86 32 bit target, no aggregate types end up getting passed in MMX register.
The only type passed by MMX is

__m64

which is defined in header file (mmintrin.h):

typedef long long __m64 __attribute__((__vector_size__(8), __aligned__(8)));

Yes, it would be good if we define _m64 as a builtin type and handle it in AST level. But I'm afraid that it won't be a trivial work. Since GCC also handles __m64 in the same way as Clang currently does, can we just keep current implementation as it is?

9480

ok, I will follow it.

rnk added inline comments.May 1 2019, 1:09 PM
lib/CodeGen/TargetInfo.cpp
919

That's not quite what I'm suggesting. I'm saying that IsX86_MMXType should take a QualType parameter, and it should check if that qualtype looks like the __m64 vector type, instead of converting the QualType to llvm::Type and then checking if the llvm::Type is a 64-bit vector. Does that seem reasonable? See the code near the call site conditionalized on IsDarwinVectorABI which already has similar logic.

RKSimon resigned from this revision.May 2 2019, 3:15 AM
wxiao3 updated this revision to Diff 199233.May 13 2019, 3:37 AM
wxiao3 marked an inline comment as done.May 13 2019, 3:46 AM
wxiao3 added inline comments.
lib/CodeGen/TargetInfo.cpp
919

Yes, it's unnecessary to convert QualType to llvm::Type just for the _m64 vector type checking.
Since It's very simple to check _m64 type based on QualType with pre-conditioned type assertion

if (const VectorType *VT = RetTy->getAs<VectorType>())

I just remove the utility function: IsX86_MMXType.

wxiao3 added a reviewer: krytarowski.
wxiao3 updated this revision to Diff 202579.Jun 1 2019, 7:53 PM
wxiao3 edited the summary of this revision. (Show Details)
wxiao3 added a comment.Jun 1 2019, 8:03 PM

Hi all,

With the latest version, I have made below changes according to all your comments:

  1. Only apply the fix to Linux where many libraries are built by GCC.
  2. Avoid converting the QualType to llvm::Type and then checking if the llvm::Type is a 64-bit vector, which is unnecessary and inefficient. Furthermore, I remove the utility function: IsX86_MMXType since It's very simple to check _m64 type based on QualType.

Ok for merge now?

krytarowski added a comment.EditedJun 2 2019, 1:50 AM

sysv abi is not only for Linux but most non-Windows ones (BSDs, HAIKU, ...).

wxiao3 added a comment.Jun 3 2019, 7:13 AM

Consider other Systems (e.g Darwin, PS4 and FreeBSD) don't want to spend any effort dealing with the ramifications of ABI breaks (as discussed in https://reviews.llvm.org/D60748) at present, I only fix the bug for Linux. If other system wants the fix, the only thing needed is to add a flag (like "IsLinuxABI" ) to enable it.

Consider other Systems (e.g Darwin, PS4 and FreeBSD) don't want to spend any effort dealing with the ramifications of ABI breaks (as discussed in https://reviews.llvm.org/D60748) at present, I only fix the bug for Linux. If other system wants the fix, the only thing needed is to add a flag (like "IsLinuxABI" ) to enable it.

CC @mgorny and @joerg - do we want this for NetBSD?

mgorny added a comment.Jun 4 2019, 7:10 AM

Consider other Systems (e.g Darwin, PS4 and FreeBSD) don't want to spend any effort dealing with the ramifications of ABI breaks (as discussed in https://reviews.llvm.org/D60748) at present, I only fix the bug for Linux. If other system wants the fix, the only thing needed is to add a flag (like "IsLinuxABI" ) to enable it.

CC @mgorny and @joerg - do we want this for NetBSD?

Probably yes. FWICS, gcc uses %mm0 and %mm1 on NetBSD while clang doesn't.

Consider other Systems (e.g Darwin, PS4 and FreeBSD) don't want to spend any effort dealing with the ramifications of ABI breaks (as discussed in https://reviews.llvm.org/D60748) at present, I only fix the bug for Linux. If other system wants the fix, the only thing needed is to add a flag (like "IsLinuxABI" ) to enable it.

CC @mgorny and @joerg - do we want this for NetBSD?

Probably yes. FWICS, gcc uses %mm0 and %mm1 on NetBSD while clang doesn't.

Unless Joerg will protest, @wxiao3 please enable it on NetBSD as well.. but personally I would enable it unconditionally for all sysv ABIs.

joerg added a comment.Jun 5 2019, 1:22 PM

I think MMX code is obscure enough at this point that it doesn't matter much either way. Even less across DSO boundaries. That's why I don't really care either way.

mgorny added inline comments.Jun 6 2019, 6:33 AM
lib/CodeGen/TargetInfo.cpp
1005

Maybe replace the two booleans with something alike IsPassInMMXRegABI? And while at it, include NetBSD there, please.

wxiao3 updated this revision to Diff 203492.Jun 6 2019, 9:36 PM
wxiao3 edited the summary of this revision. (Show Details)
wxiao3 added a reviewer: mgorny.
wxiao3 added a reviewer: joerg.
wxiao3 added a comment.Jun 6 2019, 9:39 PM

Thanks for the suggestions!
I have updated the patch accordingly.

Ok for merge now?

rjmccall added inline comments.Jun 10 2019, 4:15 PM
lib/CodeGen/TargetInfo.cpp
1005

CGT is a member variable, so you can just query the target fresh in your isPassInMMXRegABI method. The check upfront for a 64-bit vector type should keep this well out of the fast path.

wxiao3 updated this revision to Diff 204057.Jun 11 2019, 7:06 AM

Thanks for the suggestions!
I have updated it.

Ok now?

rjmccall accepted this revision.Jun 11 2019, 9:56 AM

Minor comments, then LGTM.

lib/CodeGen/TargetInfo.cpp
1098

"The System V i386 psABI requires __m64 to be passed in MMX registers.
Clang historically had a bug where it failed to apply this rule, and some platforms
(e.g. Darwin, PS4, and FreeBSD) have opted to maintain compatibility with the old
Clang behavior, so we only apply it on platforms that have specifically requested it
(currently just Linux and NetBSD)."

1420

Indentation on the continuation line.

This revision is now accepted and ready to land.Jun 11 2019, 9:56 AM
wxiao3 updated this revision to Diff 204197.Jun 11 2019, 6:08 PM

Thanks for the comments!
Updated for landing.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 11 2019, 6:50 PM
hans added a subscriber: hans.Jun 17 2019, 6:51 AM

I tried replying on the cfe-commits email, but both Pengfei and Wei's email addresses seem to bounce, so replying here instead:

This broke Chromium for 32-bit Linux:
https://bugs.chromium.org/p/chromium/issues/detail?id=974542#c5

It's not clear what's happening yet, bet just to give a heads up.

Since this changes ABI behaviour, it would probably we worth
mentioning in docs/ReleaseNotes.rst too.

@hans

Please make sure all Chromium for 32-bit Linux libraries are following System V ABI (i.e., m64 is passed on mmx register). I suspect that there are some hand written assembly code in your libraries which is not following the ABI.

thakis added a subscriber: thakis.Jun 17 2019, 6:24 PM

@hans

Please make sure all Chromium for 32-bit Linux libraries are following System V ABI (i.e., m64 is passed on mmx register). I suspect that there are some hand written assembly code in your libraries which is not following the ABI.

That's likely true, but also not very helpful since the ABI implications here are pretty big (see comments on the chromium bug). It's also currently impossible to write an assembly function that works with both trunk clang and clang 8.0.0, which makes it difficult to update compilers independent of changing the code. clang generally tries to be abi-compatible with itself. This should probably support the existing fclang-abi-compat= flag at least (and have a release notes entry); possibly there should be a dedicated flag for this.

I have created a patch for you: https://reviews.llvm.org/D63473
Is it ok?

hans added a comment.Jun 18 2019, 1:49 AM

@hans

Please make sure all Chromium for 32-bit Linux libraries are following System V ABI (i.e., m64 is passed on mmx register). I suspect that there are some hand written assembly code in your libraries which is not following the ABI.

We still don't have the root cause, but the library in question (Skia) doesn't have much assembly code. After your patch, %st0 (which aliases with %mm0) gets clobbered if a function returns a 4 x u16 vector. Skia tries to work around this by force-inlining such functions, but we're still seeing functions where %mm0 gets used. We believe this is the cause, but I'm still trying to figure out where the remaining %mm0 uses come from.

Hey folks, I'm the Skia point of contact on this, and "luckily" the person who wrote all the code that got us into this mess. Let me cross post a couple questions I've had from the Chromium bug over here where folks might know the answer...

Now that Clang's decided to match GCC's behavior of using mm0 to pass around 8-byte vectors on x86-32, is there any way to use 8-byte vector types safely any more? I don't really have the full context of this Clang change, but is it maybe a good idea applied to too many types? I notice the change mentions m64, but here I'm using uint16_t ext_vector_type(4) exclusively, never m64 or even an 8x8 vector... can we just squint and say u16x4 and m64 aren't the same, passing m64 according to the ABI but vector extensions however we were doing it before? Or can we work out some sort of ABI that preserves st0/mm0? I think we're finding that even with forced-inlining, at -O0 we still end up getting u16x4 values stored in mm0 briefly (kind of roundabout through xmm registers and the stack once or twice too).

In short, should working with 4x u16 be safe on x86-32 and there's a bug / undefined behavior in my code leading to this mm0/st0 clobber, or is this just actually not really spec'd to work?

Hey folks, I'm the Skia point of contact on this, and "luckily" the person who wrote all the code that got us into this mess. Let me cross post a couple questions I've had from the Chromium bug over here where folks might know the answer...

Now that Clang's decided to match GCC's behavior of using mm0 to pass around 8-byte vectors on x86-32, is there any way to use 8-byte vector types safely any more? I don't really have the full context of this Clang change, but is it maybe a good idea applied to too many types? I notice the change mentions m64, but here I'm using uint16_t ext_vector_type(4) exclusively, never m64 or even an 8x8 vector... can we just squint and say u16x4 and m64 aren't the same, passing m64 according to the ABI but vector extensions however we were doing it before?

__m64 is of course defined using the compiler's vector extensions. More importantly, GCC also has those vector extensions (or at least some of them), and my understanding is that GCC is interpreting the ABI's __m64 to mean "all 8-byte vectors" (which seems quite reasonable to me), and that's what Clang needs to stay compatible with on systems where GCC is the system compiler.

Now, we could theoretically use a different ABI rule for vectors defined with Clang-specific extensions, but that seems like it would cause quite a few problems of its own.

In short, should working with 4x u16 be safe on x86-32 and there's a bug / undefined behavior in my code leading to this mm0/st0 clobber, or is this just actually not really spec'd to work?

It's always possible that there's a bug in the compiler, but the most likely thing is that you have assembly code that's not obeying the ABI in some way.

Now, we could theoretically use a different ABI rule for vectors defined with Clang-specific extensions, but that seems like it would cause quite a few problems of its own.

I think we can't reasonably impose this ABI rule on vectors defined with ext_vector_type: that makes it impossible to build portable OpenCL code for 32-bit x86, given the side-effects of introducing any use of the x86_mmx type. So that leaves us with two options: make vector_size and ext_vector_type incompatible, or revert this patch and intentionally remain ABI-incompatible with gcc. (I guess we could theoretically try to separate out a special case for OpenCL instead, but that seems even more fragile.)

Being ABI-incompatible is obviously inconvenient if you're writing code using MMX types/intrinsics, but using MMX intrinsics is sort of "at your own risk" anyway, given neither LLVM nor gcc properly manages the state of the MMX/x87 register file.

Now, we could theoretically use a different ABI rule for vectors defined with Clang-specific extensions, but that seems like it would cause quite a few problems of its own.

I think we can't reasonably impose this ABI rule on vectors defined with ext_vector_type: that makes it impossible to build portable OpenCL code for 32-bit x86, given the side-effects of introducing any use of the x86_mmx type.

Sorry, I've remained somewhat intentionally ignorant of the issues here. Are you saying that using MMX in LLVM requires source-level workarounds in some way, and so we can't lower portable code to use MMX because that code will (reasonably) lack those workarounds? If that's true, then fixing that seems like a blocker to landing this patch; it is better to be ABI-non-compliant than to produce broken code.

Are you saying that using MMX in LLVM requires source-level workarounds in some way, and so we can't lower portable code to use MMX because that code will (reasonably) lack those workarounds?

Yes.

The x86 architecture requires that a program executes an "emms" instruction between any MMX instructions, and any x87 instructions. Otherwise, the x87 instructions will produce nonsense results. LLVM, and other compilers, never insert emms automatically; this is partially historical, but also because emms can be expensive on Intel chips. Instead, the user is expected to call _mm_empty() in appropriate places.

To allow users to generate arbitrary vector IR without tripping over this, LLVM does not lower vector IR to MMX instructions; instead, it only generates MMX instructions for operations using the special type x86_mmx. If any instruction or argument has a result or operand of type x86_mmx in LLVM IR, the user must explicitly execute emms (@llvm.x86.mmx.emms() in IR, _mm_empty() in C) between that instruction, and any code that might use x87 registers. "Between" isn't really sound because emms intrinsic doesn't reliably prevent code motion of floating-point operations, but it works well enough in practice. (See also https://bugs.llvm.org/show_bug.cgi?id=35982 .)

On the clang side, without this patch, we only generate code using the x86_mmx type in a couple places: _mm_* calls, and inline asm with an MMX operand. If the user does not use either of those, there will never be any values of type x86_mmx, so there will never be any MMX instructions, and we avoid the whole mess. 64-bit vector operations get lowered to SSE2 instructions instead (or scalarized).

This patch introduces a new place where clang will generate the type x86_mmx: for call arguments and return values. This means more places where the user is required to write _mm_empty() to get correct behavior.

Ah, thank you for that explanation. That's got to be exactly what we're tripping over in Chromium / Skia.

Thank you. So it sounds like this patch needs to be reverted, and the correct version of it will have to insert these intrinsic calls in four places:

  • before translating vector arguments to MMX type before calls that pass __m64 arguments,
  • after translating MMX parameters to vector type in functions that receive __m64 parameters,
  • before translating vector results to MMX type in functions that return __m64, and
  • after translating MMX results to vector type after calls that return __m64.

Will that be sufficient to satisfy LLVM?

If we're going to insert emms instructions automatically, it doesn't really make sense to do it in the frontend; the backend could figure out the most efficient placement itself. (See lib/Target/X86/X86VZeroUpper.cpp, which implements similar logic for AVX.) The part I'd be worried about is the potential performance hit from calling emms in places where other compilers wouldn't, for code using MMX intrinsics.

If we're going to insert emms instructions automatically, it doesn't really make sense to do it in the frontend; the backend could figure out the most efficient placement itself. (See lib/Target/X86/X86VZeroUpper.cpp, which implements similar logic for AVX.) The part I'd be worried about is the potential performance hit from calling emms in places where other compilers wouldn't, for code using MMX intrinsics.

It would certainly be simpler for the frontend if the backend did this — in fact, even if the "frontend" was going to do it, I would have suggested doing it as a pass over the emitted IR rather than a special case in IRGen. Anyway, I'm open to any reasonable option; at this point, I'm just laying out the basic requirements for getting this patch back in, because the current patch is invalid given LLVM's current requirements.

I'm just laying out the basic requirements for getting this patch back in, because the current patch is invalid given LLVM's current requirements.

Yes, I'm on the same page.

Can anyone provide me some small reproducers code for the issue tripped over by Chromium / Skia?

hans added a comment.Jun 19 2019, 1:05 AM

Can anyone provide me some small reproducers code for the issue tripped over by Chromium / Skia?

Sorry, I don't have a small repro yet. I'm still working on finding out exactly what's happening in Chromium, but it's a large test. It's easy to find where the x87 state gets clobbered after your change, but I haven't found what code was depending on that state yet.

I've raised https://bugs.llvm.org/show_bug.cgi?id=42319 which suggests the creation of a EMMS insertion pass.

hans added a comment.Jun 19 2019, 4:16 AM

Can anyone provide me some small reproducers code for the issue tripped over by Chromium / Skia?

Sorry, I don't have a small repro yet. I'm still working on finding out exactly what's happening in Chromium, but it's a large test. It's easy to find where the x87 state gets clobbered after your change, but I haven't found what code was depending on that state yet.

Oh, I thought the problem was just that the registers alias, not that the whole x87 state gets messed up by mmx instructions. Here's a simple repro:

$ cat /tmp/a.c
#include <stdint.h>
#include <stdio.h>

#ifdef __clang__
typedef uint16_t __attribute__((ext_vector_type(4))) V;
#else
typedef uint16_t V __attribute__ ((vector_size (4*sizeof(uint16_t))));
#endif

V f() {
  V v = { 1,2,3,4 };
  return v;
}

double d() { return 3.14; }

int main() {
  f();
  printf("%lf\n", d());
  return 0;
}

$ bin/clang -m32 -O0 /tmp/a.c && ./a.out
-nan

Before your change, it prints 3.140000.

Chromium was previously working around this problem in gcc by force-inlining f() into main(). That doesn't work with Clang because it touches %mm0 even after inlining.

hans added a comment.Jun 19 2019, 4:31 AM

I've reverted in r363790 until a solution can be found.

RKSimon reopened this revision.Jun 19 2019, 4:39 AM
This revision is now accepted and ready to land.Jun 19 2019, 4:39 AM
RKSimon requested changes to this revision.Jun 19 2019, 4:39 AM
This revision now requires changes to proceed.Jun 19 2019, 4:39 AM
hans added a comment.Jun 19 2019, 7:49 AM

$ bin/clang -m32 -O0 /tmp/a.c && ./a.out
-nan

Before your change, it prints 3.140000.

I looked through the Intel manual to understand what's happening in detail:

When we return from f() with the new ABI, we write to the %mm0 register, and as a side effect:

(9.5.1) After each MMX instruction, the entire x87 FPU tag word is set to valid (00B).

What does that mean?

(8.1.7) "The x87 FPU uses the tag values to detect stack overflow and underflow conditions (see Section 8.5.1.1)"

(8.5.1.1) "Stack overflow — An instruction attempts to load a non-empty x87 FPU register from memory. A non-empty
register is defined as a register containing a zero (tag value of 01), a valid value (tag value of 00), or a special
value (tag value of 10).

When the x87 FPU detects stack overflow or underflow, it sets the IE flag (bit 0) and the SF flag (bit 6) in the x87
FPU status word to 1. It then sets condition-code flag C1 (bit 9) in the x87 FPU status word to 1 if stack overflow
occurred or to 0 if stack underflow occurred.
If the invalid-operation exception is masked, the x87 FPU returns the floating point, integer, or packed decimal
integer indefinite value to the destination operand, depending on the instruction being executed. This value over-
writes the destination register or memory location specified by the instruction."

Okay, so essentially any MMX instruction marks the x87 register stack as full, and when we try to store into it in d() with "fldl" we get a stack overflow, and because the exception is masked, it stores "the floating point indefinite value" into the register, which is what we end up printing.

At least I finally think I understand what's going on :-)

-O0 always inline isn't working because the frontend is emitting a store of vector type to memory then a load of x86_mmx to do the type coercion. The caller does the opposite to coerce back from mmx. This -O0 pipeline isn't capable of getting rid of these redundant store/load pairs. We might have a better chance if we just emitted bitcasts.