Page MenuHomePhabricator

[LLD] Search archives for non-tentative defintions.
Needs ReviewPublic

Authored by sfertile on Aug 18 2020, 7:21 AM.

Details

Summary

Work in progress patch to spur the discussion. Motivated by bfds common symbol handling in archives which is needed by FORTRAN libraries which use CommonBlock/BlockData .

When an object symbol is a tentative definition, we search archives for 'real' definitions to override the common symbol.

Diff Detail

Unit TestsFailed

TimeTest
25,960 mslinux > LLVM.Bindings/Go::go.test
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/llvm-go go=/usr/bin/go test llvm.org/llvm/bindings/go/llvm
250 mslinux > lld.ELF::common-archive-lookup.s
Script: -- : 'RUN: at line 3'; rm -rf /mnt/disks/ssd0/agent/llvm-project/build/tools/lld/test/ELF/Output/common-archive-lookup.s.tmp.dir

Event Timeline

sfertile created this revision.Aug 18 2020, 7:21 AM
Herald added a project: Restricted Project. · View Herald Transcript
sfertile requested review of this revision.Aug 18 2020, 7:21 AM
NeHuang added inline comments.Aug 18 2020, 9:48 AM
lld/ELF/InputFiles.cpp
1206

nit: remove the empty line.

lld/test/ELF/common-archive-lookup.s
2

The test cases are failing on linux and windows in phabricator unit test, are we missing -triple=powerpc64le or -triple=powerpc64 when running llvm-mc?

29

nit: definitions

29

nit: missing period.

81

nit: referenced

104

Can we remove the comments # in block definitions?

159

nit: initialized

161

nit: TEST1-LABEL:

167

nit: definition from the object file, and the definitions from the archive.

170

nit: TEST2-LABEL:

The general idea makes sense to me.
I have a set of mostly minor comments here and there.

lld/ELF/InputFiles.cpp
1239

You can probably make this static since you only call it from one place that I can see.
Same with the function below.

1245

Question:
Why do you use Expected<> here?
Won't getName() always return a valid StringRef?

Since you have this here I assume you have come across a situation where you need it.

1247

I think that by definition we cannot have both sym.isDefined() and sym.isCommon() to be both true at the same time.

bool isCommon() const { return symbolKind == CommonKind; }
bool isDefined() const { return symbolKind == DefinedKind; }

You might be able to just use

return sym.isDefined();
lld/ELF/Symbols.cpp
712

You probably don't need to save and restore the type here.
You know based on the guard that the symbol is isObject() so you can just use:

replace(other);
type = llvm::ELF::STT_OBJECT;
lld/test/ELF/common-archive-lookup.s
32

nit:
Do you need the ABI version?

MaskRay added a comment.EditedAug 18 2020, 10:04 PM

Alan Modra kindly shared with me the history
https://sourceware.org/pipermail/binutils/1999-December/002952.html See
elf_link_is_defined_archive_symbol in bfd/elflink.c, which checks whether
a common symbol is a regular definition in an archive member.
LLD's current semantics are compatible with gold, and GNU ld 20 years ago
(before https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=a3a8c91d411abe91720a2ac92b8140e1bdb41282)
(I have to blame Sun/HP ld at that time... Ian Lance Taylor said
"It's hard to know what to do with common symbols, because they do not
have clearly defined semantics. Different people have different
expectations of them." It was so unfortunate they picked such strange expectations)

Before we jump directly into a fix, can you create a FORTRAN example where this
special common symbol handling is reasonably used? (I use uppercase FORTRAN
because I think this is FORTRAN 77 legacy...) I know that it matches GNU ld, but
this is rather bizarre. I hope the additional (very high; likely 100+ lines)
complexity in ArchiveFile and LazyObjFile's symbol resolution can be justified.

Note that regular overriding common has an analog: global overriding weak.
Suppose both libweak.a and libglobal.a define a symbol which is referenced by undef.o.
Let's consider two link orders:

  • undef.o -lweak -lglobal will pick libweak.a and ignore libglobal.a if libglobal.a does not need fetching.
  • -lglobal undef.o -lweak does not fetch libglobal.a at all

So to provide a weak definition while allowing an optional strong definition, the strong definition needs to be surrounded in --whole-archive. Can we ask the user to
do so the similar thing for the regular overriding common case? :)

There are 3 directions:

  • We don't need to address it at all. However, it seems that you may raise a reasonable FORTRAN use case. Still, hope you can consider my previous global vs weak example.
  • Add a Mach-O style -c option to llvm-ar (ranlib): don't add common symbols to the archive index. See below. (LazyObjFile will be inconsistent but your user does not likely use it I guess)
  • We patch LLD.

If we end up patching LLD, we should avoid making Symbol::resolveLazy complex.
We need to think hard doing symbol resolution in batch, in
ArchiveFile/LazyObjFile::parse. For the tests, all the code is insignificant.
My example at https://sourceware.org/pipermail/binutils/2020-August/112878.html
is probably sufficient. As you can see, bitcode LazyObjFile requires special handling.
I need to think about the interaction and I may be able to send a patch.

In ELF the last paragraph of the section on STT_COMMON only mentions that a dynamic linker may change its resolution rules to prefer non-common over a common definition. I'm guessing extending this to static linker is something that some other linkers have done and there are some programs that currently depend on that behaviour.

Symbols with type STT_COMMON label uninitialized common blocks. In relocatable objects, these symbols are not allocated and must have the special section index SHN_COMMON (see below). In shared objects and executables these symbols must be allocated to some section in the defining object.

In relocatable objects, symbols with type STT_COMMON are treated just as other symbols with index SHN_COMMON. If the link-editor allocates space for the SHN_COMMON symbol in an output section of the object it is producing, it must preserve the type of the output symbol as STT_COMMON.

When the dynamic linker encounters a reference to a symbol that resolves to a definition of type STT_COMMON, it may (but is not required to) change its symbol resolution rules as follows: instead of binding the reference to the first symbol found with the given name, the dynamic linker searches for the first symbol with that name with type other than STT_COMMON. If no such symbol is found, it looks for the STT_COMMON definition of that name that has the largest size.

Until recently -fcommon was the default for C programs in Clang and GCC as well, although not for C++. If we do change then this could have an affect wider than Fortran. I can think of a couple of things:

  • Some programs will see different library members being selected, potentially changing behaviour.
  • For every common definition in the program we are doing library searches and potentially expensive fetches of objects from libraries.

Given that BFD supports this behaviour but Gold and LLD do not I suspect that the risk of breaking existing programs is small, especially as LLD tends to be used with more recent programs where -fno-common is the default. However I think we should be cautious and perhaps enable under a command line option, especially as it looks like this is a convention rather than part of a specification. If anyone has better references, then I'm happy to retract.

I've spent some time reading through the linked thread. Thanks. There is a lot to reply to so I hope I don't miss anything.

Before we jump directly into a fix, can you create a FORTRAN example where this special common symbol handling is reasonably used?

The Engineering and Scientific Subroutine Library (ESSL) uses common block and block data. One case that shows this problem is when using DRCFTD function, which calls an internal routine that uses a table of trig values. If we link the library into a shared object we get the expected behavior: the file with the block data overrides all the common blocks and the global symbol is properly initialized in the shared object. If we package the library into an archive and link against that then we end up missing the strong definition since an earlier object provides the tentative definition and the global symbol is zero initialized, producing broken runtime behavior.

Note that regular overriding common has an analog: global overriding weak.

I agree that the special handling for commons is ugly, and I really don't like the asymmetry it produces in relation to weak symbols any more then you do. If we approached the problem strictly from an ELF semantics perspective, then the consistency of golds behavior is arguably desirable. From a usability perspective though, our archive member handling breaks a common idiom in an old but still used dialect of a rather important language. The interaction of common symbols and archive member inclusion unfortunately breaks the FORTRAN language semantics. My understanding is there are lots of modern(ish) programs (not necessarily written in fortran) that link against old FORTRAN libraries and so IMO the trade off in consistency is worth it for the usability.

I don't think using --whole-archives to pull in the entire archive is a reasonable fix. With the libraries I was using, linking in whole archives for all the FORTRAN libraries failed due to multiply defined symbols. In the cases where a library is structured in a way that we can use the whole archive option it will lead to binary bloat in any program that needs to link against an old Fortran library.

Given that BFD supports this behaviour but Gold and LLD do not I suspect that the risk of breaking existing programs is small, especially as LLD tends to be used with more recent programs where -fno-common is the default. However I think we should be cautious and perhaps enable under a command line option, especially as it looks like this is a convention rather than part of a specification. If anyone has better references, then I'm happy to retract.

I agree the new behavior needs an option to enable it and disable it. Initially I considered having the Fortran driver pass the option to LLD, but after discussing it offline with @rzurob I think it might be best to enable the behavior by default. The consumer of these libraries aren't necessarily Fortran programs, and as you mentioned its likely to be lower risk because we are adopting a behavior already implemented in ld.bfd.

If we end up patching LLD, we should avoid making Symbol::resolveLazy complex.
We need to think hard doing symbol resolution in batch, in
ArchiveFile/LazyObjFile::parse. For the tests, all the code is insignificant.
My example at https://sourceware.org/pipermail/binutils/2020-August/112878.html
is probably sufficient. As you can see, bitcode LazyObjFile requires special handling.
I need to think about the interaction and I may be able to send a patch.

That is why I posted a WIP patch early. So we could figure out what the best direction is. I'm open to any suggestions.

MaskRay added a comment.EditedAug 24 2020, 4:49 PM

Given that BFD supports this behaviour but Gold and LLD do not I suspect that the risk of breaking existing programs is small, especially as LLD tends to be used with more recent programs where -fno-common is the default. However I think we should be cautious and perhaps enable under a command line option, especially as it looks like this is a convention rather than part of a specification. If anyone has better references, then I'm happy to retract.

I agree the new behavior needs an option to enable it and disable it. Initially I considered having the Fortran driver pass the option to LLD, but after discussing it offline with @rzurob I think it might be best to enable the behavior by default. The consumer of these libraries aren't necessarily Fortran programs, and as you mentioned its likely to be lower risk because we are adopting a behavior already implemented in ld.bfd.

There are 4 combinations: (ArchiveFile,LazyObjFile) x (ELF, BitcodeFile). LazyObjFile is for --start-lib/--end-lib. Do you think whether it is worth making LazyObjFile or BitcodeFile consistent with ArchiveFile x ELF behavior? If we add an option (say, --fortran-linking, the Dec 1999 binutils thread suggested but did not adopt), we will have better justification that we don't make LazyObjFile or BitcodeFile consistent.

I do worry a bit about the time complexity of the regular-overriding-common behavior, which may be a problem for my internal users (Bazel: --start-lib; we use very few ArchiveFiles) and external users (mostly ArchiveFiles).

If we use an option --fortran-linking, we can mostly get away with the time complexity problem, because most users will not pass the option and will not observe any potential performance problems. The time complexity problem can be solved by looking up symbols in batch but the overall complexity may be larger.

If we end up patching LLD, we should avoid making Symbol::resolveLazy complex.
We need to think hard doing symbol resolution in batch, in
ArchiveFile/LazyObjFile::parse. For the tests, all the code is insignificant.
My example at https://sourceware.org/pipermail/binutils/2020-August/112878.html
is probably sufficient. As you can see, bitcode LazyObjFile requires special handling.
I need to think about the interaction and I may be able to send a patch.

That is why I posted a WIP patch early. So we could figure out what the best direction is. I'm open to any suggestions.

Thanks!

Given that BFD supports this behaviour but Gold and LLD do not I suspect that the risk of breaking existing programs is small, especially as LLD tends to be used with more recent programs where -fno-common is the default. However I think we should be cautious and perhaps enable under a command line option, especially as it looks like this is a convention rather than part of a specification. If anyone has better references, then I'm happy to retract.

I agree the new behavior needs an option to enable it and disable it. Initially I considered having the Fortran driver pass the option to LLD, but after discussing it offline with @rzurob I think it might be best to enable the behavior by default. The consumer of these libraries aren't necessarily Fortran programs, and as you mentioned its likely to be lower risk because we are adopting a behavior already implemented in ld.bfd.

There are 4 combinations: (ArchiveFile,LazyObjFile) x (ELF, BitcodeFile). LazyObjFile is for --start-lib/--end-lib. Do you think whether it is worth making LazyObjFile or BitcodeFile consistent with ArchiveFile x ELF behavior? If we add an option (say, --fortran-linking, the Dec 1999 binutils thread suggested but did not adopt), we will have better justification that we don't make LazyObjFile or BitcodeFile consistent.

IMO, it is important to make it consistent.

I do worry a bit about the time complexity of the regular-overriding-common behavior, which may be a problem for my internal users (Bazel: --start-lib; we use very few ArchiveFiles) and external users (mostly ArchiveFiles).

I think there should not be a problem with time complexity, because there is only extra work to do in uncommon situations:

  1. Only if you see a common symbol, do you need to try to find additional lazy definitions of the same symbol name. This will be very rare for C code now, due to -fno-common being the default.
  2. Only if the common symbol's name is present in another archive symbol-table do you need to load the archive member to check whether that second definition is also common or not. This will be even more rare, since it will only trigger, in C code, when you have an actually-incorrect use (e.g. "int x" in a header instead of "extern int x").

If we use an option --fortran-linking, we can mostly get away with the time complexity problem, because most users will not pass the option and will not observe any potential performance problems. The time complexity problem can be solved by looking up symbols in batch but the overall complexity may be larger.

I think it's unnecessary to have an option, due to the above. C users should not generally observe performance problems with this feature, because it won't trigger for them.

sfertile updated this revision to Diff 291913.Sep 15 2020, 7:31 AM

Extended the patch so that lazy object files are handled the same as archives, and bitcode files are handled same as native objects, both in archives and as lazy object files.

MaskRay added inline comments.Sep 20 2020, 11:00 AM
lld/ELF/InputFiles.cpp
1241

range-based for

1245

if is misaligned. equals -> ==

1253

static

1260

range based for

lld/ELF/Symbols.cpp
703

Delete isObject()

A common symbol is a subset of an object symbol. (If it doesn't, it is an error)

lld/test/ELF/Inputs/blockdata.ll
1 ↗(On Diff #291913)

Use RUN: split-file for short files.

Replace double array with a simple i32.

lld/test/ELF/Inputs/commonblock.ll
6 ↗(On Diff #291913)

Keeping one function is sufficient to demonstrate the effects.

lld/test/ELF/common-archive-lookup.s
67

The test is much more complex than it needs to be. You just need one reference. All the setup (TOC, addi, .localentry) and teardown (blr) instructions should be deleted not to harm readability.

sfertile updated this revision to Diff 293830.Wed, Sep 23, 12:02 PM

Address review comments, switched test to use split-file instead of other inputs and macros, and simplified the test as much as possible.

sfertile marked 16 inline comments as done.Wed, Sep 23, 12:05 PM

We usually use x86 for generic feature tests (sorry). I am fine with ppc but the test is probably worth a comment that some FORTRAN libraries on ppc require this behavior (i.e. if you no longer require this behavior and if we need some symbol table refactoring where this behavior gets in the way, we should have the right to revisit the decision. SHN_COMMON will get more and more obsoleted and hope we don't need this particular behavior in the future ;-) )

Gold behaves like LLD and internally we have much Fortran code (x86/ppc) which does not tickle this property:)

https://sourceware.org/pipermail/binutils/2020-August/112926.html

COMMON is also one of those dark corners with multiple implementations, so I'm not at all surprised to learn that some linkers handle it differently. Most code doesn't tickle those cases, so these differences generally go unnoticed.

In any case I have to do a large run internally because we still use -fcommon :)

lld/ELF/InputFiles.cpp
1243

sym.isGlobal() should be checked along with the name, because a local symbol should be ignored.

1257

Nit: parens around == are unneeded

1263

static

lld/ELF/Symbols.cpp
693

static

lld/test/ELF/common-archive-lookup.s
2

We usually use x86 for generic features (sorry). I am fine with ppc but this is probably worth a comment that some FORTRAN libraries on ppc require this behavior (i.e. if you no longer require this behavior and if we need some symbol table refactoring where this behavior gets in the way, we should have the right to revisit the decision. SHN_COMMON will get more and more obsoleted and hope we don't need this particular behavior in the future ;-) )

Gold behaves like LLD and internally we have much Fortran code (x86/ppc) which does not tickle this property:)

https://sourceware.org/pipermail/binutils/2020-August/112926.html

COMMON is also one of those dark corners with multiple implementations, so I'm not at all surprised to learn that some linkers handle it differently. Most code doesn't tickle those cases, so these differences generally go unnoticed.

In any case I have to do a large run internally because we still use -fcommon :)

61

For newer tests we use ## for non-RUN-non-CHECK comments.

95

.comm changes st_type, so .type is redundant.

135

Don't mix different indentations. 2 may be sufficient.

sfertile updated this revision to Diff 294399.Fri, Sep 25, 12:24 PM
  • Add static on helper functions.
  • Use ## for comments in lit test.
  • Remove redundant parens
  • Ignore locals when looking for overriding symbol def.
sfertile added inline comments.Fri, Sep 25, 12:40 PM
lld/test/ELF/common-archive-lookup.s
2

In any case I have to do a large run internally because we still use -fcommon :)

Sorry for the churn. On the bright side if this change impacts any internal projects you can use it to help convince them to stop using -fcommon :).

this is probably worth a comment that some FORTRAN libraries on ppc require this behavior.

The motivating example is ESSL on PPC, but this isn't a PPC distinct problem. From what our fortran compiler devs tell me, its not unusual for heavy fortran users to have one or two ancient libraries they depend on that were written in pre-F90, and no one touches them to port to newer language dialects.

In any case I have to do a large run internally because we still use -fcommon :)

Sorry for the churn. On the bright side if this change impacts any internal projects you can use it to help convince them to stop using -fcommon :).

Indeed I have noticed some dependent behaviors. Only two problems (potential ODR violation in an openssl; a qemu not using -fno-common) with a little noise. It is likely we can fix the issues in one week.

MaskRay added inline comments.Mon, Sep 28, 2:44 PM
lld/ELF/InputFiles.h
318

name

lld/ELF/Symbols.cpp
713

dyn_cast -> cast ?