This is an archive of the discontinued LLVM Phabricator instance.

[llvm][ADT] Fix Any linker error with multiple compilers
ClosedPublic

Authored by sebastian-ne on Apr 21 2023, 12:46 PM.

Details

Summary

Citing the comment in the source:

Define the type id and initialize with a non-zero value.
Initializing with a zero value means the variab can end up in either the
.data or the .bss section. This can lead to multiple definition linker
errors when some object files are compiled with a compiler that puts the
variable into .data but they are linked to object files from a different
compiler that put the variable into .bss. To prevent this issue from
happening, initialize the variable with a non-zero value, which forces
it to land in .data (because .bss is zero-initialized).

Fixes https://github.com/llvm/llvm-project/issues/62270
A regression of D139974.

Diff Detail

Event Timeline

sebastian-ne created this revision.Apr 21 2023, 12:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2023, 12:46 PM
sebastian-ne requested review of this revision.Apr 21 2023, 12:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2023, 12:46 PM
jloser accepted this revision.Apr 21 2023, 12:54 PM

Thanks for the quick fix from the discussion on the issue today.

I'll be glad once we're off llvm::Any once and for all. :)

llvm/include/llvm/ADT/Any.h
128

Quibble: s/variab/variable?

This revision is now accepted and ready to land.Apr 21 2023, 12:54 PM

errrr:

This can lead to multiple definition linker errors when some object files are compiled with a compiler that puts the variable into .data but they are linked to object files from a different compiler that put the variable into .bss.

When/where/how does this happen? Would this not be an ABI bug in those compilers that needs to be fixed there?

filnet added a subscriber: filnet.EditedApr 22 2023, 4:54 AM

errrr:

This can lead to multiple definition linker errors when some object files are compiled with a compiler that puts the variable into .data but they are linked to object files from a different compiler that put the variable into .bss.

When/where/how does this happen? Would this not be an ABI bug in those compilers that needs to be fixed there?

Seems to be a grey area according to wikipedia:

In C, statically allocated objects without an explicit initializer are initialized to zero (for arithmetic types) or a null pointer (for pointer types). Implementations of C typically represent zero values and null pointer values using a bit pattern consisting solely of zero-valued bits (despite filling bss with zero is not required by the C standard, all variables in .bss are required to be individually initialized to some sort of zeroes according to Section 6.7.8 of C ISO Standard 9899:1999 or section 6.7.9 for newer standards). Hence, the BSS segment typically includes all uninitialized objects (both variables and constants) declared at file scope (i.e., outside any function) as well as uninitialized static local variables (local variables declared with the static keyword); static local constants must be initialized at declaration, however, as they do not have a separate declaration, and thus are typically not in the BSS section, though they may be implicitly or explicitly initialized to zero. An implementation may also assign statically-allocated variables and constants initialized with a value consisting solely of zero-valued bits to the BSS section.

EDIT: We ran into the issue when compiling Rust with gcc and using a LLVM compiled with clang.

sebastian-ne marked an inline comment as done.Apr 24 2023, 1:40 AM

Thanks for the quote from wikipedia!

Would this not be an ABI bug in those compilers that needs to be fixed there?

One could stand on the point that if objects from different compilers are linked, the user needs to supply -f[no-]zero-initialized-in-bss to make sure they behave the same.
As this instance of such behavior is easy enough to fix in code, I think we can do that in LLVM.

Not sure if it is justified to backport this patch in LLVM or in Rust’s LLVM fork.

Thanks for the quote from wikipedia!

Would this not be an ABI bug in those compilers that needs to be fixed there?

One could stand on the point that if objects from different compilers are linked, the user needs to supply -f[no-]zero-initialized-in-bss to make sure they behave the same.
As this instance of such behavior is easy enough to fix in code, I think we can do that in LLVM.

I don't think we should be working around ABI mismatches in compilers - the longer-term solution should be that we only use ABI compatible compilers, or use compilers in ABI-compatible ways (with the right flags, if that's what's required).

What are the specifically incompatible compilers/versions/flags?

For future reference, this patch was ported to mingw here: https://github.com/msys2/MINGW-packages/pull/16855

Thanks for the quote from wikipedia!

Would this not be an ABI bug in those compilers that needs to be fixed there?

One could stand on the point that if objects from different compilers are linked, the user needs to supply -f[no-]zero-initialized-in-bss to make sure they behave the same.
As this instance of such behavior is easy enough to fix in code, I think we can do that in LLVM.

I don't think we should be working around ABI mismatches in compilers - the longer-term solution should be that we only use ABI compatible compilers, or use compilers in ABI-compatible ways (with the right flags, if that's what's required).

What are the specifically incompatible compilers/versions/flags?

From the ticket, it seems to be the mingw clang version and mingw gcc where the problem occurs.
At the time of writing this comment, mingw-w64-gcc is at version 12.2.0 and mingw-w64-clang is at 16.0.1.

I don't think we should be working around ABI mismatches in compilers

Sure, I agree. The reason I still made this patch, is because it is only a one character change from a zero to a one :)

For future reference, this patch was ported to mingw here: https://github.com/msys2/MINGW-packages/pull/16855

Thanks for the quote from wikipedia!

Would this not be an ABI bug in those compilers that needs to be fixed there?

One could stand on the point that if objects from different compilers are linked, the user needs to supply -f[no-]zero-initialized-in-bss to make sure they behave the same.
As this instance of such behavior is easy enough to fix in code, I think we can do that in LLVM.

I don't think we should be working around ABI mismatches in compilers - the longer-term solution should be that we only use ABI compatible compilers, or use compilers in ABI-compatible ways (with the right flags, if that's what's required).

What are the specifically incompatible compilers/versions/flags?

From the ticket, it seems to be the mingw clang version and mingw gcc where the problem occurs.
At the time of writing this comment, mingw-w64-gcc is at version 12.2.0 and mingw-w64-clang is at 16.0.1.

Is there/could you file bugs on one or both of these compilers to track/address this issue (& if we're going to keep this patch - link to the bug(s) from the code)

Oh, that made me look - this commit includes a link to the bug it's fixing in the source. We don't usually do that. The bug should be linked from the commit message (which it is, thanks!) which can be found via version control, but generally not linked in the source itself (too noisy if we put bug links in the source all the time - conversely, if the code is there to /workaround/ a bug, linking that bug makes sense as we'll want to check on it repeatedly to see if the bug is addressed yet so we can remove the workaround).

I don't think we should be working around ABI mismatches in compilers

Sure, I agree. The reason I still made this patch, is because it is only a one character change from a zero to a one :)

Sure, but the explanation/ongoing cost of people trying to understand it/read the comment/etc is more significant than the one character change...

This is what happened.

Prior to 16..0.0 that one line was like follows (note the const qualifier) and everybody was happy:

template <typename T> const char Any::TypeId<T>::Id = 0;

In 1.6.0 the const qualifier was removed as follows to work around a MSVC bug:

template <typename T> char Any::TypeId<T>::Id = 0;

This caused the definition to be moved to a different segment which broke the MSYS2 build of Rust.

This fix changed that line again to:

template <typename T> char Any::TypeId<T>::Id = 1;

Which apparently moved the definition back to where it was, looping the loop, and making everybody happy again.

My take on all this is that there is no compiler bug. Where these kind of constructs end up is a bit of a grey area.
Compilers do it differently but there are options to force a behavior or another.
I did not do a deep dive in this area but I am keeping it in mind should similar issues appear in the future.

And I believe there is an ongoing effort to replace llvm::Any with std::Any.
Once that is done then this particular piece of code will become someone else's problem.

> My take on all this is that there is no compiler bug. Where these kind of constructs end up is a bit of a grey area.

Compilers do it differently but there are options to force a behavior or another.

That two compilers that are trying to be ABI compatible (if they weren't trying, even basic things wouldn't work) aren't... is a pretty good sign this is a bug. Yeah, there are usually flags you can pass that might break ABI compatibility - but the default in this instance seems to be intending to be ABI compatible and is not.