This is an archive of the discontinued LLVM Phabricator instance.

[lld] Add archive file support to Mach-O backend
ClosedPublic

Authored by Ktwu on Apr 16 2020, 7:30 PM.

Details

Summary

With this change, basic archive files can be linked together. Input section discovery has been refactored into a function since archive files lazily resolve their symbols / the object files containing those symbols.

Depends on D77893.

Diff Detail

Event Timeline

Ktwu created this revision.Apr 16 2020, 7:30 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 16 2020, 7:30 PM
Ktwu edited the summary of this revision. (Show Details)Apr 16 2020, 7:31 PM
int3 added inline comments.Apr 21 2020, 2:51 AM
lld/MachO/Arch/X86_64.cpp
34–37

are these two relocations used in this diff's tests?

52–54

could just move this case to line 50 above and remove the comment

lld/MachO/Driver.cpp
107–109

We should add a test for an erroneous archive.

Would also be nice to explicitly #include Archive.h here.

Also, some notes in case there are other reviewers like me who aren't familiar with lld's error handling code:

  1. CHECK fatals lld if the given value is an error, so no need to worry about returning early
  2. The ErrorHandler.h comments say "You should never call fatal() except for reporting a corrupted input file." So this seems like an appropriate use of CHECK/fatal
lld/MachO/InputFiles.cpp
248–255

Kind of unfortunate that we have to return by value here, but I see that lld-ELF's getInputSections does something similar.

Also, did you mean to type this method as std::vector<InputSection *> ArchiveFile::getInputSections() const {? I *think* the current placement of the const is saying that the return value is const instead of saying the method is const...

259–261

When does this happen? Does it indicate input corruption?

lld/MachO/InputFiles.h
82

Not sure if LLVM convention says otherwise, but typically I don't pass unique_ptrs as arguments unless the callee is changing object ownership. Otherwise pass by reference or raw pointer

lld/MachO/SymbolTable.h
31

nit: add newline above, or remove lines 27 and 29.

Not really sure what the vertical spacing convention is in the rest of the codebase, idk if @ruiu has an opinion here

32

take by const ref

lld/MachO/Symbols.cpp
18

this method seems currently unused

lld/test/MachO/archive.s
4

Nits: I don't think having a separate Inputs/archive.s is necessary; could just include its contents below and reference it via %s.

Also, I was looking at some of the lld-ELF tests (e.g. archive-fetch.s), and it seems that we can create object files / archives with symbols but no corresponding code. So we could define those files inline too via echo '.globl _boo | llvm-mc ...

Ktwu marked 11 inline comments as done.Apr 29 2020, 11:01 PM
Ktwu added inline comments.
lld/MachO/Arch/X86_64.cpp
34–37

Yup.

lld/MachO/InputFiles.cpp
248–255

re: const, yup, thanks!

259–261

This is code copied from the initial implementation of MachO lld we inherited. However, looking at the implementation of getMember(), it looks like it represents a malformed archive (and hence a parse error).

lld/MachO/InputFiles.h
82

This is part of that lld corpus we inherited, so I have no opinion on this. I assume it's just because archive creation returns a unique pointer.

Your convention seems reasonable to me so I'm happy to adapt that here.

lld/test/MachO/archive.s
4

I'll definitely get rid of archive.s (this was back when I didn't know how Filecheck worked).

I like having explicit test files (although knowing about passing stuff straight to llvm-mc is a neat trick), so I'd prefer to keep the archive#.s if that's OK.

Ktwu updated this revision to Diff 261133.Apr 29 2020, 11:16 PM
Ktwu marked an inline comment as done.

Rebase, go through most comments (still need to add an erroneous archive test)

Ktwu updated this revision to Diff 261491.May 1 2020, 10:23 AM

Added two tests inspired by the ELF implementation -- a bad archive, and an archive with no symbol table.

MaskRay added inline comments.May 3 2020, 5:00 PM
lld/test/MachO/archive.s
4

For a definition, just write:

echo '.globl _bar; _bar:' | llvm-mc -filetype=obj -triple=x86_64-apple-darwin - -o %t2.o

For a reference:

echo '.globl _bar' | llvm-mc -filetype=obj -triple=x86_64-apple-darwin - -o %t3.o

Use applicable file extensions.

12

llvm-nm -p may be better than plain llvm-nm. We can additionally check the symbol order.

MaskRay added inline comments.May 3 2020, 5:06 PM
lld/MachO/Arch/X86_64.cpp
34–37

Don't change X86_64_RELOC_SIGNED_1 which is unrelated to the main purpose of the patch.

Use other existing relocation types.

MaskRay added inline comments.May 3 2020, 5:06 PM
lld/MachO/SymbolTable.cpp
77

How does an archive symbol interact with a DylibSymbol?

lld/test/MachO/bad-archive.s
1 ↗(On Diff #261491)

You could also create invalid/ and place such tests there.

Ktwu marked 4 inline comments as done.May 4 2020, 10:24 AM
Ktwu added inline comments.
lld/MachO/Arch/X86_64.cpp
34–37

Sure, I'll rebase this on top of D79311 (thanks!)

lld/MachO/SymbolTable.cpp
77

By "interact" do you mean if there's a conflict -- a dylib and an archive both expose a symbol with the same name? I assume there shouldn't be issues if symbol names are unique...

@int3 you know more about dylibs; do you see potential issues?

lld/test/MachO/bad-archive.s
1 ↗(On Diff #261491)

A place to stick tests for known failures? I like that a lot!

I see now that ELF has the same (although the tests I found didn't live within it...)

int3 added inline comments.May 5 2020, 4:51 PM
lld/MachO/SymbolTable.cpp
77

We definitely need to account for symbol name collisions. I just commented on D79114: [lld-macho] Dylib symbols should always replace undefined symbols, but basically I've just been feeding various inputs to ld64 and emulating its output, though we should probably figure out the details of its symbol table implementation sometime...

smeenai added a subscriber: pcc.May 6 2020, 12:09 AM

It's worth noting in the commit message that we're restoring some of this functionality from @pcc and @ruiu's initial prototype.

lld/MachO/InputFiles.cpp
248–255

The ELF equivalent is InputFiles::getSections, and it returns an ArrayRef.

Returning by value wouldn't be an issue here – we're returning a local object, so return value optimization (NRVO) kicks in. It's not ideal for ObjFile::getInputSections though, as I commented above.

lld/MachO/InputFiles.h
59

Nit: add override

I'm *really* wary of this returning by value ... that seem like a bunch of unnecessary copying. The ELF equivalent to this function is InputFile::getSections (which isn't overridden); it returns an ArrayRef, and it asserts that it's only called on object or binary files (and we don't have to worry about binary files). ELF makes that work by having a global objectFiles vector (instead of just an inputFiles vector), so when it's doing the loop over all input sections, it only loops over object files.

Whether we wanna have a specific objectFiles global vector as well or just stick with inputFiles is a broader design question, but we should at least figure out a good way to make this not return by value.

70

sections should always be empty for a DylibFile, so this is confusing. Perhaps just construct and return an empty vector here? (Although that wouldn't work with the ArrayRef comment above, but in that case a comment would be good.)

lld/MachO/SymbolTable.cpp
59

We need to add a check here where if the existing symbol isa<LazySymbol>, we fetch that archive member. A consequence is that we need to store a pointer to the ArchiveFile in the LazySymbol.

We should add a test for this case, where the archive file appears on the link line before the object file, and its members should still be fetched.

77

ld64 uses the same archive file semantics as LLD ELF and COFF, where an archive member can be used to resolve undefined symbol references regardless of its position on the command line.

A defined symbol in an object file always wins out over a dylib symbol. An archive member that got included in the link is treated as a regular object file for this purpose. If the dylib symbol is seen first, an archive member defining the same symbol won't be pulled out (unless some other symbol from that member is referenced).

I believe that's the behavior we're implementing here. A LazySymbol won't replace a DylibSymbol, but if the member gets fetched, all its symbols will be added as Defined, which will override DylibSymbols.

Here's an example to clarify what I mean:

$ cat f1.c
int f() { return 1; }

$ cat f2.c
int f() { return 2; }

$ cat g.c
int f(void);
int g() { return f(); }

$ cat f2g.c
int f() { return 2; }
int g() { return f(); }

$ cat main.c
int g(void);
int main() { return g(); }

$ clang -c f1.c f2.c g.c f2g.c main.c
$ ld -dylib -o libf1.dylib f1.o
$ ar crs libf2_g.a f2.o g.o
$ ar crs libf2g.a f2g.o

$ ld libf1.dylib libf2_g.a main.o -lSystem
$ ./a.out; echo $?
1 # because libf1.dylib was first in the link line

$ ld libf2_g.a libf1.dylib main.o -lSystem
$ ./a.out; echo $?
2 # because libf2_g.a was first on the link line

$ ld libf1.dylib libf2g.a main.o -lSystem
$ ./a.out; eho $?
2 # libf2g.a(f2g.o) was included in the link cos of the reference to g, so the f from there was used
lld/MachO/SymbolTable.h
35

Nit: explicitly include Archive.h

lld/test/MachO/Inputs/archive4.s
2

Directive doesn't match symbol name below

lld/test/MachO/archive-no-index.s
2 ↗(On Diff #261491)

Super nit: use %t.o, etc. for object file names, to make it clear they're object files. Extension-less names are usually executables, so it's easier to understand tests if we have the explicit extension.

lld/test/MachO/archive.s
16

We'll wanna add a CHECK-NOT for _lonely/_alone (depending on what you decide to call that symbol), to make sure that archive member wasn't pulled in.

Unfortunately, that depends on the order symbols are printed out by nm, so it's not completely reliable. As in, if we did have _alone in the symbol table (because we incorrectly pulled in that archive member), and nm printed it out before _bar, having a CHECK-NOT: T _alone after this line wouldn't do us any good. Having a second CHECK-NOT before the CHECK for _bar should be enough to make this work, I think.

int3 added inline comments.May 6 2020, 2:11 AM
lld/MachO/InputFiles.cpp
248–255

Ah, sorry, I was looking at getInputSections() in OutputSections.cpp, but I see it's doing something different. And good point about the NRVO, forgot about that.

Maybe we could make this a forEachSection(callback) instead, though I do like the objectFiles idea too.

Ktwu marked 5 inline comments as done.May 7 2020, 11:59 AM
Ktwu added inline comments.
lld/test/MachO/Inputs/archive4.s
2

This was on purpose; I thought it'd be useful to have both a declaration and a definition that don't link up to anything.

If that's not actually useful (or if I should be explicit), I'll change it.

lld/test/MachO/archive.s
16

I can also rerun llvm-nm %t.out explicitly to make sure that the symbols don't appear anywhere, right? That way I wouldn't have to worry about ordering.

smeenai added inline comments.May 7 2020, 3:09 PM
lld/test/MachO/Inputs/archive4.s
2

Got it. I didn't realize that .global by itself would add a reference to that symbol in the symbol table. This looks good!

To elaborate further on what I meant, I thought that a .global would just be ignored if it named a symbol you didn't define in the file. Instead, it looks like it adds an undefined symbol with that name to the symbol table, so essentially serving as a declaration (as you put it). Today I learned :)

lld/test/MachO/archive.s
16

Yeah, that's better!

smeenai added inline comments.May 7 2020, 3:12 PM
lld/MachO/SymbolTable.cpp
59

LLD ELF moved its symbol resolution logic from SymbolTable to Symbol last year, so it's hard to find the direct analogue for this, but https://github.com/llvm/llvm-project/blob/edf0391491e3a2076779c5da2c0dfc1829585d64/lld/ELF/Symbols.cpp#L253 is the equivalent logic there. https://github.com/llvm/llvm-project/blob/edf0391491e3a2076779c5da2c0dfc1829585d64/lld/COFF/SymbolTable.cpp#L485 is the equivalent logic in the SymbolTable in COFF.

Ktwu marked 5 inline comments as done.May 8 2020, 4:28 PM
Ktwu added inline comments.
lld/MachO/InputFiles.cpp
248–255

What about keeping the original input sections vector within the Archive file? When loading an object file, that file's sections are cached away.

lld/MachO/InputFiles.h
59

I realize now that I can continue to keep a global sections member; when fetching an ArchiveFile, however, I can immediately fetch each child object's sections and add them to the Archive. No need to worry about a new function in that case.

70

No longer relevant since I'm just keeping the old private member.

lld/MachO/SymbolTable.cpp
77

Neat; I'll turn this research into a test.

Ktwu updated this revision to Diff 262983.May 8 2020, 6:14 PM
  • Add a test for symbol order resolution based on @smeenai's wonderful work
  • Get rid of the input section function in favor of immediately caching Archive inputs during fetch
  • Test format cleanup: use *.o naming
  • Tried to use llvm-nm -p, test started failing, punting for now
  • Explicitly include Archive.h
int3 added inline comments.May 12 2020, 7:45 PM
lld/MachO/Driver.cpp
111

s/mbref.getBufferIdentifier()/file->getName()/

nit: drop if curly braces

lld/MachO/InputFiles.h
100

would be nice to have a comment here about what the uint64_t represents

lld/test/MachO/symbol-order.s
2–5 ↗(On Diff #262983)

nit: the Inputs folder is shared by all tests, so it'll be cleaner if we namespace them. I.e. symbol-order-g.s instead of g.s

also personally I like to run tests with multiple outputs in their own directory (i.e. mkdir -p %t would be the first run command and all files would be of the form %t/foo.o) but either way is fine

16–17 ↗(On Diff #262983)

lld's tests don't typically involve executing the linked binaries (plus it wouldn't work anyway on non-OSX CI machines). -lSystem isn't portable either.

Instead of using constant literals to distinguish where a symbol came from, I'd suggest using custom section names instead. That is, you can have each object file in the archive have the same symbol name under a different section, then see which section it appears under in the final output. llvm-objdump --syms --macho --lazy-bind should cover that, plus show if a symbol is coming from the dylib instead.

ruiu added a comment.May 12 2020, 10:04 PM

Nice. This looks like a straightforward implementation of an archive file support.

lld/MachO/InputFiles.cpp
259–261

Yes, if an input archive file is corrupted, getName() may return an error.

Looks good with all the comments addressed! Making the test not run a binary is the only real blocker at this point, I think.

lld/MachO/InputFiles.cpp
263

This is much nicer!

There's still some copying going on here, but given that it's just a bunch of pointers, the copying should be pretty fast. I think this is good enough for now, and we can think about how to make this better in a follow-up. (My cause for concern is that with subsections via symbols implemented, some object files could end up with a sizeable number of sections, so even copying all the pointers over could have a non-trivial cost.)

lld/MachO/Symbols.cpp
18

Not sure which iteration the comment was from, but it's used by SymbolTable::addUndefined above. The return value is unused though ... perhaps we could omit that if it's not gonna be useful in the future?

I'd also prefer a name like fetchMember or fetchArchiveMember, since I believe "member" is the standard terminology for a file in an archive, but that's not a super strong preference.

lld/test/MachO/Inputs/archive4.s
2

Sorry, might not have been clear enough in my above comment. I meant that your original (where the directive and the label had different names) was fine as-is. This works too, of course.

lld/test/MachO/archive.s
4

I don't have a super strong preference about separate test files vs. inlining, and there's definitely a point at which inlining becomes unreadable, but for tests where we just need to either define or reference a symbol, the inlining is nice in that you don't have to open up another file to understand what the test is trying to do.

Note that for the input files used by this test, the only thing that should matter is defining a symbol; the contents of the symbol shouldn't matter.

10

Can you also add a test for when the archive appears on the linker command line before the object file? The results should be identical.

19

_alone doesn't exist in the current incarnation of archive4.s.

lld/test/MachO/symbol-order.s
16–17 ↗(On Diff #262983)

Yeah, running a binary in a test is a no-go ... we just have to examine the output statically. I like the suggestion for having different section names.

smeenai requested changes to this revision.May 13 2020, 2:25 PM

Requesting changes to remove from my review queue.

This revision now requires changes to proceed.May 13 2020, 2:25 PM
Ktwu marked 7 inline comments as done.May 13 2020, 3:49 PM
Ktwu added inline comments.
lld/MachO/Driver.cpp
111

I copied this from the wasm driver :o

If we want to follow the ELF driver, I should use path instead....

lld/MachO/InputFiles.h
100

ELF's ArchiveFile doesn't list a comment either, but I'll add a comment; these are address offsets into the archive, which are used to map whether we've loaded a child of the archive yet.

lld/test/MachO/symbol-order.s
2–5 ↗(On Diff #262983)

Both ideas make sense; I didn't like the weird concatenation here at all...

int3 added inline comments.May 13 2020, 3:49 PM
lld/test/MachO/archive.s
4

Note that for the input files used by this test, the only thing that should matter is defining a symbol; the contents of the symbol shouldn't matter.

+1. If we're going to use section names to test in symbol-order.s then it's just 3 statements that should fit onto one line too

int3 added inline comments.May 13 2020, 3:51 PM
lld/MachO/Driver.cpp
111

file->getName() is just a wrapper around mbref.getBufferIdentifier(). Just suggesting it since it saves a few characters :p but not a big deal either way

int3 added inline comments.May 13 2020, 3:54 PM
lld/test/MachO/archive.s
4

(and if it's a matter of making sure the resulting test executable is runnable, then all we need is a ret, the mov can be omitted)

Ktwu marked 4 inline comments as done.May 13 2020, 3:59 PM
Ktwu marked 5 inline comments as done.May 13 2020, 5:11 PM
Ktwu updated this revision to Diff 263893.EditedMay 13 2020, 5:13 PM
  • Inlined the test files for symbol-order
  • Changed symbol-order to check sections instead of running executables
  • A couple other lil changes
  • rebased
int3 accepted this revision.EditedMay 13 2020, 5:41 PM

Couple of nits but this mostly looks good. If you can make sure it's ASAN/UBSAN clean I can land it

lld/test/MachO/Inputs/archive2.s
4

can we drop the mov commands in the archive input files even if we're not inlining them?

lld/test/MachO/archive.s
19

You could reuse the same CHECK commands instead of having a different prefix with the same checks -- that's what I did in my order file diff. But this is fine too

lld/test/MachO/symbol-order.s
6 ↗(On Diff #263893)

.global takes multiple variable names, so you can have .globl f, g. (You can also omit the 'a' in .global if you'd like)

Also, I think it's more conventional (and maybe more correct? dunno) to write it as .globl f, g; .section __TEXT,test_fg; f: g: -- i.e. .globl f is a declaration and the definition determines which section it actually lands in. At least, it seems like the ELF tests don't omit the definition when defining these inputs

9 ↗(On Diff #263893)

intentional comment?

21 ↗(On Diff #263893)

CHECK lines are substring matches, so this could just be __TEXT,test_g g

Ktwu updated this revision to Diff 264026.May 14 2020, 10:25 AM
Ktwu marked 8 inline comments as done.
  • Inlined the archive*.s file inputs
  • Removed unused return value
  • Retweaked the inputs for symbol-order to reflect the files that inspired them
  • Re-ran check-lld once more
lld/test/MachO/Inputs/archive2.s
4

I'll inline and drop the mov; these files are basically identical to the symbol-order tests at this point so they might as well be inlined...

lld/test/MachO/symbol-order.s
9 ↗(On Diff #263893)

Not intentionally at all; I added this when running the retweaked test against the system linker...

smeenai accepted this revision.May 14 2020, 12:09 PM

Looks great with the leak fixed!

lld/MachO/Driver.cpp
113

So we're just leaking the memory for the Archive? That's not ideal.

LLD ELF has the unique_ptr<object::Archive> be a member of ArchiveFile and moves the pointer here into the constructor. Can we do the same?

lld/test/MachO/archive.s
19

Yeah, what I was thinking was that you create two separate output files, one for the archive second case and one for the archive first case, and then run the same FileCheck commands for both (as in with the same check prefix (or lack thereof)). That way it's immediately obvious that both of them have the same output, and the VISIBLE-NOT tests cover both cases as well.

This revision is now accepted and ready to land.May 14 2020, 12:09 PM
Ktwu marked an inline comment as done.May 14 2020, 12:17 PM
Ktwu added inline comments.
lld/MachO/Driver.cpp
113

I originally had it so that ArchiveFile owned the input archive, but @int3 mentioned passing around unique_ptr is weird, so I made this change and figured leaking was acceptable given the globals in the linker :(

Ktwu updated this revision to Diff 264059.May 14 2020, 12:34 PM
Ktwu marked an inline comment as done.
  • Make archive input files own their archives again
int3 added inline comments.May 14 2020, 12:36 PM
lld/MachO/Driver.cpp
113

I said passing around unique_ptr is weird if the callee isn't changing ownership... which it wasn't :) (I think the first iteration of the diff was actually allowing the input archive to be freed prematurely... didn't notice that the first time around)

But yeah, we should follow what LLD-ELF is doing here -- take the Archive by rvalue reference to unique_ptr, and store it in a unique_ptr member in ArchiveFile.

This revision was automatically updated to reflect the committed changes.
Ktwu marked 2 inline comments as done.May 14 2020, 1:44 PM
Ktwu added inline comments.
lld/MachO/Driver.cpp
113

Ahhhh, you're right @int3 (I am sorry for the misattribution, my memory hasn't been keeping up-to-date with this diff's history very well).

lld/test/MachO/archive.s
19

This isn't a blocker, is it? What I've got here is more clear to me than making two output files :/

smeenai added inline comments.May 14 2020, 2:11 PM
lld/test/MachO/archive.s
19

Not a blocker ... it's fine as-is. Would be nice to also test the VISIBLE-NOT case for both the archive first and the archive second case, but that's easy enough to do in a follow-up.

int3 added inline comments.May 14 2020, 4:11 PM
lld/MachO/InputFiles.cpp
263

With the order file diff, we'll need to copy over all symbols as well, so I think I'll switch this to use a global objectFiles vector

int3 added inline comments.May 14 2020, 4:36 PM
lld/MachO/InputFiles.cpp
263

Changed my mind, it's more refactoring / rebasing than I want to do right now...