Page MenuHomePhabricator

[lld] Add archive file support to Mach-O backend
ClosedPublic

Authored by Ktwu on Apr 16 2020, 7:30 PM.

Details

Summary

With this change, basic archive files can be linked together. Input section discovery has been refactored into a function since archive files lazily resolve their symbols / the object files containing those symbols.

Depends on D77893.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
MaskRay added inline comments.May 3 2020, 5:06 PM
lld/MachO/SymbolTable.cpp
79

How does an archive symbol interact with a DylibSymbol?

lld/test/MachO/bad-archive.s
1 ↗(On Diff #261491)

You could also create invalid/ and place such tests there.

Ktwu marked 4 inline comments as done.Mon, May 4, 10:24 AM
Ktwu added inline comments.
lld/MachO/Arch/X86_64.cpp
34–37 ↗(On Diff #258224)

Sure, I'll rebase this on top of D79311 (thanks!)

lld/MachO/SymbolTable.cpp
79

By "interact" do you mean if there's a conflict -- a dylib and an archive both expose a symbol with the same name? I assume there shouldn't be issues if symbol names are unique...

@int3 you know more about dylibs; do you see potential issues?

lld/test/MachO/bad-archive.s
1 ↗(On Diff #261491)

A place to stick tests for known failures? I like that a lot!

I see now that ELF has the same (although the tests I found didn't live within it...)

int3 added inline comments.Tue, May 5, 4:51 PM
lld/MachO/SymbolTable.cpp
79

We definitely need to account for symbol name collisions. I just commented on D79114: [lld-macho] Dylib symbols should always replace undefined symbols, but basically I've just been feeding various inputs to ld64 and emulating its output, though we should probably figure out the details of its symbol table implementation sometime...

smeenai added a subscriber: pcc.Wed, May 6, 12:09 AM

It's worth noting in the commit message that we're restoring some of this functionality from @pcc and @ruiu's initial prototype.

lld/MachO/InputFiles.cpp
311–318

The ELF equivalent is InputFiles::getSections, and it returns an ArrayRef.

Returning by value wouldn't be an issue here – we're returning a local object, so return value optimization (NRVO) kicks in. It's not ideal for ObjFile::getInputSections though, as I commented above.

lld/MachO/InputFiles.h
59

Nit: add override

I'm *really* wary of this returning by value ... that seem like a bunch of unnecessary copying. The ELF equivalent to this function is InputFile::getSections (which isn't overridden); it returns an ArrayRef, and it asserts that it's only called on object or binary files (and we don't have to worry about binary files). ELF makes that work by having a global objectFiles vector (instead of just an inputFiles vector), so when it's doing the loop over all input sections, it only loops over object files.

Whether we wanna have a specific objectFiles global vector as well or just stick with inputFiles is a broader design question, but we should at least figure out a good way to make this not return by value.

73

sections should always be empty for a DylibFile, so this is confusing. Perhaps just construct and return an empty vector here? (Although that wouldn't work with the ArrayRef comment above, but in that case a comment would be good.)

lld/MachO/SymbolTable.cpp
61

We need to add a check here where if the existing symbol isa<LazySymbol>, we fetch that archive member. A consequence is that we need to store a pointer to the ArchiveFile in the LazySymbol.

We should add a test for this case, where the archive file appears on the link line before the object file, and its members should still be fetched.

79

ld64 uses the same archive file semantics as LLD ELF and COFF, where an archive member can be used to resolve undefined symbol references regardless of its position on the command line.

A defined symbol in an object file always wins out over a dylib symbol. An archive member that got included in the link is treated as a regular object file for this purpose. If the dylib symbol is seen first, an archive member defining the same symbol won't be pulled out (unless some other symbol from that member is referenced).

I believe that's the behavior we're implementing here. A LazySymbol won't replace a DylibSymbol, but if the member gets fetched, all its symbols will be added as Defined, which will override DylibSymbols.

Here's an example to clarify what I mean:

$ cat f1.c
int f() { return 1; }

$ cat f2.c
int f() { return 2; }

$ cat g.c
int f(void);
int g() { return f(); }

$ cat f2g.c
int f() { return 2; }
int g() { return f(); }

$ cat main.c
int g(void);
int main() { return g(); }

$ clang -c f1.c f2.c g.c f2g.c main.c
$ ld -dylib -o libf1.dylib f1.o
$ ar crs libf2_g.a f2.o g.o
$ ar crs libf2g.a f2g.o

$ ld libf1.dylib libf2_g.a main.o -lSystem
$ ./a.out; echo $?
1 # because libf1.dylib was first in the link line

$ ld libf2_g.a libf1.dylib main.o -lSystem
$ ./a.out; echo $?
2 # because libf2_g.a was first on the link line

$ ld libf1.dylib libf2g.a main.o -lSystem
$ ./a.out; eho $?
2 # libf2g.a(f2g.o) was included in the link cos of the reference to g, so the f from there was used
lld/MachO/SymbolTable.h
34

Nit: explicitly include Archive.h

lld/test/MachO/Inputs/archive4.s
1 ↗(On Diff #261491)

Directive doesn't match symbol name below

lld/test/MachO/archive-no-index.s
2 ↗(On Diff #261491)

Super nit: use %t.o, etc. for object file names, to make it clear they're object files. Extension-less names are usually executables, so it's easier to understand tests if we have the explicit extension.

lld/test/MachO/archive.s
16

We'll wanna add a CHECK-NOT for _lonely/_alone (depending on what you decide to call that symbol), to make sure that archive member wasn't pulled in.

Unfortunately, that depends on the order symbols are printed out by nm, so it's not completely reliable. As in, if we did have _alone in the symbol table (because we incorrectly pulled in that archive member), and nm printed it out before _bar, having a CHECK-NOT: T _alone after this line wouldn't do us any good. Having a second CHECK-NOT before the CHECK for _bar should be enough to make this work, I think.

int3 added inline comments.Wed, May 6, 2:11 AM
lld/MachO/InputFiles.cpp
311–318

Ah, sorry, I was looking at getInputSections() in OutputSections.cpp, but I see it's doing something different. And good point about the NRVO, forgot about that.

Maybe we could make this a forEachSection(callback) instead, though I do like the objectFiles idea too.

Ktwu marked 5 inline comments as done.Thu, May 7, 11:59 AM
Ktwu added inline comments.
lld/test/MachO/Inputs/archive4.s
1 ↗(On Diff #261491)

This was on purpose; I thought it'd be useful to have both a declaration and a definition that don't link up to anything.

If that's not actually useful (or if I should be explicit), I'll change it.

lld/test/MachO/archive.s
16

I can also rerun llvm-nm %t.out explicitly to make sure that the symbols don't appear anywhere, right? That way I wouldn't have to worry about ordering.

smeenai added inline comments.Thu, May 7, 3:09 PM
lld/test/MachO/Inputs/archive4.s
1 ↗(On Diff #261491)

Got it. I didn't realize that .global by itself would add a reference to that symbol in the symbol table. This looks good!

To elaborate further on what I meant, I thought that a .global would just be ignored if it named a symbol you didn't define in the file. Instead, it looks like it adds an undefined symbol with that name to the symbol table, so essentially serving as a declaration (as you put it). Today I learned :)

lld/test/MachO/archive.s
16

Yeah, that's better!

smeenai added inline comments.Thu, May 7, 3:12 PM
lld/MachO/SymbolTable.cpp
61

LLD ELF moved its symbol resolution logic from SymbolTable to Symbol last year, so it's hard to find the direct analogue for this, but https://github.com/llvm/llvm-project/blob/edf0391491e3a2076779c5da2c0dfc1829585d64/lld/ELF/Symbols.cpp#L253 is the equivalent logic there. https://github.com/llvm/llvm-project/blob/edf0391491e3a2076779c5da2c0dfc1829585d64/lld/COFF/SymbolTable.cpp#L485 is the equivalent logic in the SymbolTable in COFF.

Ktwu marked 5 inline comments as done.Fri, May 8, 4:28 PM
Ktwu added inline comments.
lld/MachO/InputFiles.cpp
311–318

What about keeping the original input sections vector within the Archive file? When loading an object file, that file's sections are cached away.

lld/MachO/InputFiles.h
59

I realize now that I can continue to keep a global sections member; when fetching an ArchiveFile, however, I can immediately fetch each child object's sections and add them to the Archive. No need to worry about a new function in that case.

73

No longer relevant since I'm just keeping the old private member.

lld/MachO/SymbolTable.cpp
79

Neat; I'll turn this research into a test.

Ktwu updated this revision to Diff 262983.Fri, May 8, 6:14 PM
  • Add a test for symbol order resolution based on @smeenai's wonderful work
  • Get rid of the input section function in favor of immediately caching Archive inputs during fetch
  • Test format cleanup: use *.o naming
  • Tried to use llvm-nm -p, test started failing, punting for now
  • Explicitly include Archive.h
int3 added inline comments.Tue, May 12, 7:45 PM
lld/MachO/Driver.cpp
114

s/mbref.getBufferIdentifier()/file->getName()/

nit: drop if curly braces

lld/MachO/InputFiles.h
93

would be nice to have a comment here about what the uint64_t represents

lld/test/MachO/symbol-order.s
3–6

nit: the Inputs folder is shared by all tests, so it'll be cleaner if we namespace them. I.e. symbol-order-g.s instead of g.s

also personally I like to run tests with multiple outputs in their own directory (i.e. mkdir -p %t would be the first run command and all files would be of the form %t/foo.o) but either way is fine

17–18

lld's tests don't typically involve executing the linked binaries (plus it wouldn't work anyway on non-OSX CI machines). -lSystem isn't portable either.

Instead of using constant literals to distinguish where a symbol came from, I'd suggest using custom section names instead. That is, you can have each object file in the archive have the same symbol name under a different section, then see which section it appears under in the final output. llvm-objdump --syms --macho --lazy-bind should cover that, plus show if a symbol is coming from the dylib instead.

ruiu added a comment.Tue, May 12, 10:04 PM

Nice. This looks like a straightforward implementation of an archive file support.

lld/MachO/InputFiles.cpp
322–324

Yes, if an input archive file is corrupted, getName() may return an error.

Looks good with all the comments addressed! Making the test not run a binary is the only real blocker at this point, I think.

lld/MachO/InputFiles.cpp
316

This is much nicer!

There's still some copying going on here, but given that it's just a bunch of pointers, the copying should be pretty fast. I think this is good enough for now, and we can think about how to make this better in a follow-up. (My cause for concern is that with subsections via symbols implemented, some object files could end up with a sizeable number of sections, so even copying all the pointers over could have a non-trivial cost.)

lld/MachO/Symbols.cpp
18

Not sure which iteration the comment was from, but it's used by SymbolTable::addUndefined above. The return value is unused though ... perhaps we could omit that if it's not gonna be useful in the future?

I'd also prefer a name like fetchMember or fetchArchiveMember, since I believe "member" is the standard terminology for a file in an archive, but that's not a super strong preference.

lld/test/MachO/Inputs/archive4.s
1 ↗(On Diff #261491)

Sorry, might not have been clear enough in my above comment. I meant that your original (where the directive and the label had different names) was fine as-is. This works too, of course.

lld/test/MachO/archive.s
4

I don't have a super strong preference about separate test files vs. inlining, and there's definitely a point at which inlining becomes unreadable, but for tests where we just need to either define or reference a symbol, the inlining is nice in that you don't have to open up another file to understand what the test is trying to do.

Note that for the input files used by this test, the only thing that should matter is defining a symbol; the contents of the symbol shouldn't matter.

10

Can you also add a test for when the archive appears on the linker command line before the object file? The results should be identical.

19

_alone doesn't exist in the current incarnation of archive4.s.

lld/test/MachO/symbol-order.s
17–18

Yeah, running a binary in a test is a no-go ... we just have to examine the output statically. I like the suggestion for having different section names.

smeenai requested changes to this revision.Wed, May 13, 2:25 PM

Requesting changes to remove from my review queue.

This revision now requires changes to proceed.Wed, May 13, 2:25 PM
Ktwu marked 7 inline comments as done.Wed, May 13, 3:49 PM
Ktwu added inline comments.
lld/MachO/Driver.cpp
114

I copied this from the wasm driver :o

If we want to follow the ELF driver, I should use path instead....

lld/MachO/InputFiles.h
93

ELF's ArchiveFile doesn't list a comment either, but I'll add a comment; these are address offsets into the archive, which are used to map whether we've loaded a child of the archive yet.

lld/test/MachO/symbol-order.s
3–6

Both ideas make sense; I didn't like the weird concatenation here at all...

int3 added inline comments.Wed, May 13, 3:49 PM
lld/test/MachO/archive.s
4

Note that for the input files used by this test, the only thing that should matter is defining a symbol; the contents of the symbol shouldn't matter.

+1. If we're going to use section names to test in symbol-order.s then it's just 3 statements that should fit onto one line too

int3 added inline comments.Wed, May 13, 3:51 PM
lld/MachO/Driver.cpp
114

file->getName() is just a wrapper around mbref.getBufferIdentifier(). Just suggesting it since it saves a few characters :p but not a big deal either way

int3 added inline comments.Wed, May 13, 3:54 PM
lld/test/MachO/archive.s
4

(and if it's a matter of making sure the resulting test executable is runnable, then all we need is a ret, the mov can be omitted)

Ktwu marked 4 inline comments as done.Wed, May 13, 3:59 PM
Ktwu marked 5 inline comments as done.Wed, May 13, 5:11 PM
Ktwu updated this revision to Diff 263893.EditedWed, May 13, 5:13 PM
  • Inlined the test files for symbol-order
  • Changed symbol-order to check sections instead of running executables
  • A couple other lil changes
  • rebased
int3 accepted this revision.EditedWed, May 13, 5:41 PM

Couple of nits but this mostly looks good. If you can make sure it's ASAN/UBSAN clean I can land it

lld/test/MachO/Inputs/archive2.s
3 ↗(On Diff #263893)

can we drop the mov commands in the archive input files even if we're not inlining them?

lld/test/MachO/archive.s
19

You could reuse the same CHECK commands instead of having a different prefix with the same checks -- that's what I did in my order file diff. But this is fine too

lld/test/MachO/symbol-order.s
7

.global takes multiple variable names, so you can have .globl f, g. (You can also omit the 'a' in .global if you'd like)

Also, I think it's more conventional (and maybe more correct? dunno) to write it as .globl f, g; .section __TEXT,test_fg; f: g: -- i.e. .globl f is a declaration and the definition determines which section it actually lands in. At least, it seems like the ELF tests don't omit the definition when defining these inputs

10

intentional comment?

22

CHECK lines are substring matches, so this could just be __TEXT,test_g g

Ktwu updated this revision to Diff 264026.Thu, May 14, 10:25 AM
Ktwu marked 8 inline comments as done.
  • Inlined the archive*.s file inputs
  • Removed unused return value
  • Retweaked the inputs for symbol-order to reflect the files that inspired them
  • Re-ran check-lld once more
lld/test/MachO/Inputs/archive2.s
3 ↗(On Diff #263893)

I'll inline and drop the mov; these files are basically identical to the symbol-order tests at this point so they might as well be inlined...

lld/test/MachO/symbol-order.s
10

Not intentionally at all; I added this when running the retweaked test against the system linker...

smeenai accepted this revision.Thu, May 14, 12:09 PM

Looks great with the leak fixed!

lld/MachO/Driver.cpp
116

So we're just leaking the memory for the Archive? That's not ideal.

LLD ELF has the unique_ptr<object::Archive> be a member of ArchiveFile and moves the pointer here into the constructor. Can we do the same?

lld/test/MachO/archive.s
19

Yeah, what I was thinking was that you create two separate output files, one for the archive second case and one for the archive first case, and then run the same FileCheck commands for both (as in with the same check prefix (or lack thereof)). That way it's immediately obvious that both of them have the same output, and the VISIBLE-NOT tests cover both cases as well.

This revision is now accepted and ready to land.Thu, May 14, 12:09 PM
Ktwu marked an inline comment as done.Thu, May 14, 12:17 PM
Ktwu added inline comments.
lld/MachO/Driver.cpp
116

I originally had it so that ArchiveFile owned the input archive, but @int3 mentioned passing around unique_ptr is weird, so I made this change and figured leaking was acceptable given the globals in the linker :(

Ktwu updated this revision to Diff 264059.Thu, May 14, 12:34 PM
Ktwu marked an inline comment as done.
  • Make archive input files own their archives again
int3 added inline comments.Thu, May 14, 12:36 PM
lld/MachO/Driver.cpp
116

I said passing around unique_ptr is weird if the callee isn't changing ownership... which it wasn't :) (I think the first iteration of the diff was actually allowing the input archive to be freed prematurely... didn't notice that the first time around)

But yeah, we should follow what LLD-ELF is doing here -- take the Archive by rvalue reference to unique_ptr, and store it in a unique_ptr member in ArchiveFile.

This revision was automatically updated to reflect the committed changes.
Ktwu marked 2 inline comments as done.Thu, May 14, 1:44 PM
Ktwu added inline comments.
lld/MachO/Driver.cpp
116

Ahhhh, you're right @int3 (I am sorry for the misattribution, my memory hasn't been keeping up-to-date with this diff's history very well).

lld/test/MachO/archive.s
19

This isn't a blocker, is it? What I've got here is more clear to me than making two output files :/

smeenai added inline comments.Thu, May 14, 2:11 PM
lld/test/MachO/archive.s
19

Not a blocker ... it's fine as-is. Would be nice to also test the VISIBLE-NOT case for both the archive first and the archive second case, but that's easy enough to do in a follow-up.

int3 added inline comments.Thu, May 14, 4:11 PM
lld/MachO/InputFiles.cpp
316

With the order file diff, we'll need to copy over all symbols as well, so I think I'll switch this to use a global objectFiles vector

int3 added inline comments.Thu, May 14, 4:36 PM
lld/MachO/InputFiles.cpp
316

Changed my mind, it's more refactoring / rebasing than I want to do right now...