This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Add basic support for linking against dylibs
ClosedPublic

Authored by int3 on Mar 16 2020, 2:07 PM.

Details

Summary

This diff implements:

  • dylib loading (much of which is being restored from @pcc and @ruiu's original work)
  • The GOT_LOAD relocation, which allows us to load non-lazy dylib symbols
  • Basic bind opcode emission, which tells dyld how to populate the GOT

Depends on D75382.

Diff Detail

Event Timeline

int3 created this revision.Mar 16 2020, 2:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 16 2020, 2:07 PM
int3 retitled this revision from [lld] Add basic support for linking against dylibs to [lld-macho] Add basic support for linking against dylibs.Mar 16 2020, 2:15 PM
int3 updated this revision to Diff 250629.Mar 16 2020, 2:17 PM
int3 marked an inline comment as done.

delete extra line

smeenai added a subscriber: Ktwu.Mar 16 2020, 3:43 PM

Initial set of comments. I'm still going through the rest.

lld/MachO/Arch/X86_64.cpp
47

Nit: change the two spaces between "are" and "only" to one

lld/MachO/InputFiles.cpp
78

Universal binary support could technically be a diff by itself. I'd prefer to do so, although it's small enough that I'm not super opposed to folding it into this diff. We definitely need to add tests for it either way. llvm-lipo is complete enough to be used for the tests.

I'd also note in the commit message that some of this functionality is being restored from @pcc and @ruiu's initial commit, for completeness.

89

We should add some sort of checking here to prevent going out of bounds of the file (if you have a ridiculously large nfat_arch in a malformed file, for example). We should also add tests for the error conditions here if possible (assuming yaml2obj lets us construct malformed universal binaries that would trigger the conditions).

99

I believe ld64 warns if you give it a universal binary input file and it can't find any slices for the architecture being linked. We should do the same.

246

Is it valid to not have an LC_ID_DYLIB load command? If it is, should we figure out a fallback dylibName? If not, we should error. (Also, we should add tests for this either way if possible.)

251

Is it valid to not have a symtab? If not, we should error.

266

Unnecessary comment change? We aren't adding support for archive files yet. (CC @Ktwu, this'll need to be adjusted when archive file support is added.)

lld/test/MachO/Inputs/goodbye-dylib.yaml
2

Description for hello and goodbye seems swapped.

MaskRay added inline comments.Mar 16 2020, 9:48 PM
lld/MachO/Arch/X86_64.cpp
38

This message enhancement should be moved to the initial patch.

lld/test/MachO/Inputs/goodbye-dylib.yaml
2

Use ## for comments and # for RUN and CHECK lines.

109

Are all these fields important? I think we probably should think what fields are optional and improve yaml2obj. Otherwise the verbosity can make tests a lot more complex.

lld/test/MachO/dylink.s
4

> -> -o

8

Should obj2yaml and llvm-objdump -d tests be separated?

int3 edited the summary of this revision. (Show Details)Mar 16 2020, 10:17 PM
ruiu added inline comments.Mar 16 2020, 10:20 PM
lld/MachO/Arch/X86_64.cpp
38

Yeah, but I'm not worried about the exact order and the contents of initial patch series, as this is going to be submitted as soon as the first patch is submitted, and no one will be using the feature only with the first patch, so I think I'm fine with this.

lld/MachO/Target.h
18

Are you going to make it work only for 64-bit platforms? IIRC, both macOS and iOS are dropping 32-bit app support, so it is probably a good choice to not support 32-bit, but just to confirm.

lld/MachO/Writer.cpp
316

I wonder if you directly dyn_cast to DylibSymbol * as r.target.dyn_cast<DylibSymbol *>().

334

Can you add a comment as to what dyld info contains?

int3 updated this revision to Diff 250692.Mar 16 2020, 10:24 PM
int3 marked 9 inline comments as done.

address comments

lld/MachO/InputFiles.cpp
78

Fair enough, will split into another diff

251

Not sure why, but unlike the LC_ID_DYLIB case, trying to create a dylib without LC_SYMTAB makes yaml2obj crash. Probably some offsets / indices are off.

Looking at ld64's source, it seems that not having an LC_SYMTAB is only an error if LC_DYLD_INFO(_ONLY) is also missing. The dylib I looked at had LC_DYLD_INFO_ONLY, so I'm not sure if such a case arises in practice...

Anyway, proper handling of a missing LC_SYMTAB should probably cover the ObjectFile case as well, so I'm inclined to punt on it for now.

lld/test/MachO/Inputs/goodbye-dylib.yaml
109

@smeenai earlier asked about deleting LC_SYMTAB, and I ran into some issues with yaml2obj while trying it (see my other comment). Given that using yaml2obj is mostly a stop-gap measure until lld itself can emit dylibs, I'd prefer not to spend too much time making the test case minimal.

int3 edited the summary of this revision. (Show Details)Mar 16 2020, 10:48 PM
int3 updated this revision to Diff 250696.Mar 16 2020, 10:49 PM
int3 marked 5 inline comments as done.

add comment about LC_DYLD_INFO

int3 marked 2 inline comments as done.Mar 16 2020, 10:50 PM
int3 added inline comments.
lld/MachO/Arch/X86_64.cpp
38

no worries, wasn't hard to split out

lld/MachO/Target.h
18

I'm not sure we're 100% not going to support it, but we're definitely prioritizing 64-bit for now

lld/MachO/Writer.cpp
316

just tried, doesn't work sadly

MaskRay added inline comments.Mar 16 2020, 10:50 PM
lld/MachO/Arch/X86_64.cpp
38

I would argue that when sending patches as a series, we should make extra efforts keeping each commit clean and avoiding unneeded code move if possible...

I think the this patch is good in its current status but will hope Jez can make some final cleanups to avoid subsequent formatting/cleanup changes.

int3 updated this revision to Diff 250697.Mar 16 2020, 10:54 PM
int3 marked 4 inline comments as done.

address another comment

lld/test/MachO/dylink.s
8

not possible with the current setup; I'm using obj2yaml to get the address of the GOT, and to check that the order of the symbols references in it match the addresses printed by objdump

Harbormaster completed remote builds in B49397: Diff 250696.
ruiu added a comment.Mar 17 2020, 12:32 AM

As to the tests, we generally don't use yaml2obj to write tests (except for COFF which is historically using yaml2obj). If you are using yaml2obj just until lld itself is able to produce files used by tests, I don't think you have to spend too much time writing tests in yaml2obj. Instead, you can submit binary files for now and then replace them later.

lld/MachO/Target.h
18

Then please leave a comment here saying that we are currently supporting only 64-bit targets as macOS and iOS are deprecating 32-bit apps.

int3 added a comment.Mar 17 2020, 10:25 AM

The no-id-dylink.s test needs yaml2obj to generate an invalid dylib, but yeah I can replace the other two with binary blobs

The no-id-dylink.s test needs yaml2obj to generate an invalid dylib, but yeah I can replace the other two with binary blobs

We are trying to avoid binary blob in LLVM binary utilities + lld. @grimar has done lots of work in this area. Can you figure out how much effort it takes to implement relevant features in yaml2obj Mach-O? I hope we can avoid binary blobs early.

int3 added a comment.Mar 17 2020, 12:04 PM

I am indifferent as to whether we use binary blobs or yaml2obj for now, but I expect I'll have a diff up to make lld produce dylibs within the next week.

ruiu added a comment.Mar 17 2020, 6:48 PM

The no-id-dylink.s test needs yaml2obj to generate an invalid dylib, but yeah I can replace the other two with binary blobs

We are trying to avoid binary blob in LLVM binary utilities + lld. @grimar has done lots of work in this area. Can you figure out how much effort it takes to implement relevant features in yaml2obj Mach-O? I hope we can avoid binary blobs early.

Yeah but my point is that we are not going to keep using yaml2obj for most test cases. Once lld is able to produce binary files used in the tests, we can replace the binary blobs with an invocation of lld. So checking in binary files as a temporary measure is pretty much okay to me, and it actually look like a good engineering decision to me. At this early stage of development, I think things don't always have to be done in the "right" way, as it usually has lots of scaffolding which will be replaced or removed later.

As to the tests, we generally don't use yaml2obj to write tests (except for COFF which is historically using yaml2obj). If you are using yaml2obj just until lld itself is able to produce files used by tests, I don't think you have to spend too much time writing tests in yaml2obj. Instead, you can submit binary files for now and then replace them later.

Hi @ruiu,

I'm just doing a quick drive by comment here, since I'm not really doing much LLD work at the moment myself. Why are you opposed to yaml2obj being used to generate test inputs instead of llvm-mc? Surely yaml2obj gives more direct control over what is actually in the input? In llvm-mc you only have relatively limited control over some things like relocations.

It seems to me like yaml2obj is a better instrument to be using for the vast majority of test cases, with only a limited set using llvm-mc as a catch-all test to make sure that we don't miss some odd interaction. Of course, this is all making the assumption that mach-o yaml2obj is straightforward to use (I have a feeling it might not be).

ruiu added a comment.Mar 18 2020, 2:24 AM

As to the tests, we generally don't use yaml2obj to write tests (except for COFF which is historically using yaml2obj). If you are using yaml2obj just until lld itself is able to produce files used by tests, I don't think you have to spend too much time writing tests in yaml2obj. Instead, you can submit binary files for now and then replace them later.

Hi @ruiu,

I'm just doing a quick drive by comment here, since I'm not really doing much LLD work at the moment myself. Why are you opposed to yaml2obj being used to generate test inputs instead of llvm-mc? Surely yaml2obj gives more direct control over what is actually in the input? In llvm-mc you only have relatively limited control over some things like relocations.

It seems to me like yaml2obj is a better instrument to be using for the vast majority of test cases, with only a limited set using llvm-mc as a catch-all test to make sure that we don't miss some odd interaction. Of course, this is all making the assumption that mach-o yaml2obj is straightforward to use (I have a feeling it might not be).

You can write tests with yaml2obj too, but if you take a look at ELF, don't you think that we can write fine-grained test cases with llvm-mc too? I want tests to be realistic input files, and in that respect, test cases written in llvm-mc are better than ones in obj2yaml.

I also found that tests written in llvm-mc are easier to maintain. For example, imagine that you already have a test file containing tests for relocations and you want to add a new type of relocation. Then, you'll have to not only add a new relocation to a yaml file but also add a few extra bytes with some addend at a right bit-wise position to the .text section, which isn't easy to do by hand. So, I guess you'll end up writing a test in assembly, compile it and then convert it to yaml -- but if that's the case, we should write tests in assembly in the first place. On the other hand, I didn't find a test case that is easy to write in yaml but not in assembly. There might be some, but that shouldn't be too many. So it looks to me that llvm-mc is a better choice to write test files.

grimar added a comment.EditedMar 18 2020, 2:54 AM

As to the tests, we generally don't use yaml2obj to write tests (except for COFF which is historically using yaml2obj). If you are using yaml2obj just until lld itself is able to produce files used by tests, I don't think you have to spend too much time writing tests in yaml2obj. Instead, you can submit binary files for now and then replace them later.

Hi @ruiu,

I'm just doing a quick drive by comment here, since I'm not really doing much LLD work at the moment myself. Why are you opposed to yaml2obj being used to generate test inputs instead of llvm-mc? Surely yaml2obj gives more direct control over what is actually in the input? In llvm-mc you only have relatively limited control over some things like relocations.

It seems to me like yaml2obj is a better instrument to be using for the vast majority of test cases, with only a limited set using llvm-mc as a catch-all test to make sure that we don't miss some odd interaction. Of course, this is all making the assumption that mach-o yaml2obj is straightforward to use (I have a feeling it might not be).

You can write tests with yaml2obj too, but if you take a look at ELF, don't you think that we can write fine-grained test cases with llvm-mc too? I want tests to be realistic input files, and in that respect, test cases written in llvm-mc are better than ones in obj2yaml.

I also found that tests written in llvm-mc are easier to maintain. For example, imagine that you already have a test file containing tests for relocations and you want to add a new type of relocation. Then, you'll have to not only add a new relocation to a yaml file but also add a few extra bytes with some addend at a right bit-wise position to the .text section, which isn't easy to do by hand. So, I guess you'll end up writing a test in assembly, compile it and then convert it to yaml -- but if that's the case, we should write tests in assembly in the first place. On the other hand, I didn't find a test case that is easy to write in yaml but not in assembly. There might be some, but that shouldn't be too many. So it looks to me that llvm-mc is a better choice to write test files.

I think there is no universal approach and I agree with both of you. For example, for testing relocation relaxations I'd definetely prefer using YAML because it is important to get the particular op-codes and relocations in the inputs.
It is probaly easier to use YAMLs rather than using llvm-mc and verifying inputs with objdump or alike. For many other things, using llvm-mc is probably a bit more natural.

What I do not agree with is that commiting binaries is OK. From my one last experience after working on
yaml2obj and converting LLVM binaries, I found them are source of all kind of problems. If we can use yaml2obj for that out of box, I think we should.
If we can't then probably it is OK to use binaries temporary, but probably a bug should be reported about lacking features of yaml2obj and the
test should have a FIXME comment with a link.

ruiu added a comment.Mar 18 2020, 3:21 AM

As to the tests, we generally don't use yaml2obj to write tests (except for COFF which is historically using yaml2obj). If you are using yaml2obj just until lld itself is able to produce files used by tests, I don't think you have to spend too much time writing tests in yaml2obj. Instead, you can submit binary files for now and then replace them later.

Hi @ruiu,

I'm just doing a quick drive by comment here, since I'm not really doing much LLD work at the moment myself. Why are you opposed to yaml2obj being used to generate test inputs instead of llvm-mc? Surely yaml2obj gives more direct control over what is actually in the input? In llvm-mc you only have relatively limited control over some things like relocations.

It seems to me like yaml2obj is a better instrument to be using for the vast majority of test cases, with only a limited set using llvm-mc as a catch-all test to make sure that we don't miss some odd interaction. Of course, this is all making the assumption that mach-o yaml2obj is straightforward to use (I have a feeling it might not be).

You can write tests with yaml2obj too, but if you take a look at ELF, don't you think that we can write fine-grained test cases with llvm-mc too? I want tests to be realistic input files, and in that respect, test cases written in llvm-mc are better than ones in obj2yaml.

I also found that tests written in llvm-mc are easier to maintain. For example, imagine that you already have a test file containing tests for relocations and you want to add a new type of relocation. Then, you'll have to not only add a new relocation to a yaml file but also add a few extra bytes with some addend at a right bit-wise position to the .text section, which isn't easy to do by hand. So, I guess you'll end up writing a test in assembly, compile it and then convert it to yaml -- but if that's the case, we should write tests in assembly in the first place. On the other hand, I didn't find a test case that is easy to write in yaml but not in assembly. There might be some, but that shouldn't be too many. So it looks to me that llvm-mc is a better choice to write test files.

I think there is no universal approach and I agree with both of you. For example, for testing relocation relaxations I'd definetely prefer using YAML because it is important to get the particular op-codes and relocations in the inputs.
It is probaly easier to use YAMLs rather than using llvm-mc and verifying inputs with objdump or alike. For many other things, using llvm-mc is probably a bit more natural.

What I do not agree with is that commiting binaries is OK. From my one last experience after working on
yaml2obj and converting LLVM binaries, I found them are source of all kind of problems. If we can use yaml2obj for that out of box, I think we should.
If we can't then probably it is OK to use binaries temporary, but probably a bug should be reported about lacking features of yaml2obj and the
test should have a FIXME comment with a link.

I think that we are on the same page. I'm not recommending submitting binary files as a permanent measure. What I'm saying is that if a test file can eventually be produced by mach-o lld itself, then we should allow binary files as a temporary measure. For example, it is okay to submit a shared object file as a binary file as a temporary measure until lld is able to produce a shared library file.

I agree with the previous discussions about llvm-mc vs yaml2obj. Most of time an assembly test is easier to read. In some cases yaml2obj can be more convenient. One notable inability of llvm-mc is that it cannot create invalid tests.

As to binary blobs: I am another disliker. I hope we can avoid them if possible. This may require some contribution to yaml2obj.

  1. contribute yaml2obj 2) improve lld Mach-O add YAML tests

is better than

  1. improve lld Mach-O and add binary tests 2) contribute to yaml2obj 3) convert binary tests to YAML tests

Note that the latter can increase the repository size indefinitely with these binary blobs. I am concerned about some practice in some areas of llvm-project, in particular, XCOFF in Object, and llvm-objdump Mach-O.

int3 added a comment.Mar 18 2020, 8:01 PM

Just to be clear, none of the tests right now would benefit from additional work on yaml2obj. We just need lld-macho to produce dylibs, which I expect I'll have have a diff for by EOW. In the meantime I'll leave the tests as-is in YAML, and I'm happy to wait for the dylib-production diff to be approved before landing everything as a stack, so we don't have to worry about sullying trunk with undesired test formats :)

int3 updated this revision to Diff 251544.Mar 19 2020, 8:18 PM

alphabetize

int3 updated this revision to Diff 252809.Mar 26 2020, 5:24 AM

clang-tidy

int3 updated this revision to Diff 253051.Mar 27 2020, 1:00 AM

use objdump --bind instead of obj2yaml to get symbol addresses

int3 updated this revision to Diff 253272.Mar 27 2020, 6:24 PM

fix test

int3 added a reviewer: Ktwu.Mar 28 2020, 1:37 AM
int3 added a reviewer: gkm.
ruiu added inline comments.Mar 30 2020, 11:30 PM
lld/MachO/InputFiles.cpp
215

Can you add a brief comment here that we initialize dylibName.

223

Ditto -- we are initializing symbols.

lld/MachO/Symbols.h
79

Isn't uint32_t enough? It's not a big deal, but I'd use uint32_t for an index of GOT.

lld/MachO/SyntheticSections.h
34

So GOT has no data but dynamic relocations? If so, I'd leave a brief comment here.

36

nit: add a blank line before a label.

lld/MachO/Writer.cpp
300

Replace auto with a concrete type.

357–360

nit: we usually do

os << foo << bar
   << baz << fizz;

instead of

os << foo << bar;
os << baz << fizz;
int3 marked 8 inline comments as done.Mar 31 2020, 11:39 AM
int3 added inline comments.
lld/MachO/Writer.cpp
357–360

sgtm. Any thoughts on whether we should make a wrapper class like the ByteBuffer in ReaderWriter/MachO? I didn't see a whole lot of usage of raw_ostreams in the other lld implementations but I didn't find any abstraction layer over them either

int3 updated this revision to Diff 253940.Mar 31 2020, 11:40 AM
int3 marked an inline comment as done.

address comments

int3 updated this revision to Diff 253943.Mar 31 2020, 11:51 AM

static_cast

Harbormaster completed remote builds in B51160: Diff 253940.
int3 updated this revision to Diff 253976.Mar 31 2020, 12:56 PM
int3 edited the summary of this revision. (Show Details)
int3 removed reviewers: Ktwu, gkm.
int3 removed a subscriber: grimar.

clang-format

int3 added reviewers: Ktwu, gkm.Mar 31 2020, 1:01 PM
int3 added a subscriber: grimar.
int3 marked an inline comment as done.Mar 31 2020, 4:03 PM
int3 added inline comments.
lld/MachO/SyntheticSections.h
34

Actually, I'm not sure it's right to say that it has "no data" -- it does actually occupy space in the binary (unlike the __PAGEZERO segment, a section can't have zero filesize while having a non-zero-length address range), so I think saying it contains all zeroes is more accurate.

int3 updated this revision to Diff 254040.Mar 31 2020, 4:26 PM

rephrase comments

int3 updated this revision to Diff 254050.Mar 31 2020, 5:14 PM
int3 marked 2 inline comments as done.

forgot to address one more

ruiu added inline comments.Mar 31 2020, 9:52 PM
lld/MachO/Writer.cpp
357–360

We don't usually use raw_ostream because we usually write to the output file in two passes: in the first pass, we compute an offset for each output element, and then in the second pass, we let each output element to copy itself to the output file.

However, that two-pass technique cannot be used to create a ULEB-encoded stuff. because we don't know the exact size of each output element until we fix the file contents. Constructing an ULEB-encoded section contents are naturally sequential. So the usage of raw_ostream in for this section is fine. I don't see any problem with that.

This LGTM with the comments. @ruiu, @MaskRay, any further comments? I think it's fine leaving the tests as YAML for now, since subsequent diffs in the stack will enabling LLD to produce dylibs itself and we'll replace the tests at that point.

lld/MachO/InputFiles.cpp
225

Is it correct to just consider LC_SYMTAB for this, or should we also be consulting LC_DYSYMTAB and/or the export trie? If so, we don't have to address that in this diff, but we should add a TODO.

We should also add a TODO about handling a missing LC_SYMTAB.

lld/MachO/InputFiles.h
37

Where is this used?

lld/MachO/SyntheticSections.h
24

It's kinda interesting to me that GotSection would inherit from InputSection, since it's not an input section, of course.

LLD ELF explains this design decision like so:

// Synthetic sections are designed as input sections as opposed to
// output sections because we want to allow them to be manipulated
// using linker scripts just like other input sections from regular
// files.

For Mach-O, we don't have linker scripts, so perhaps that reasoning doesn't hold.

COFF uses Chunks instead of InputSections, since there's more synthesized data (https://lld.llvm.org/NewLLD.html#important-data-structures). I think Mach-O has a good amount of linker-synthesized data and input section processing as well, so a COFF-like design might make more sense. It's hard to say until we have more of the implementation though.

For now, I think leaving this inheritance as-is is fine, but we should add a TODO to reconsider it once we've implemented more of the linker. @ruiu, @MaskRay, what do you think?

int3 marked 2 inline comments as done.Apr 9 2020, 5:11 PM
int3 added inline comments.
lld/MachO/InputFiles.cpp
225

I *think* the export trie is only consumed by dyld. Let me see what the deal is with LC_DYSYMTAB...

lld/MachO/InputFiles.h
37

error("dylib " + path + " missing LC_ID_DYLIB load command"); in InputFiles.cpp

int3 updated this revision to Diff 256500.Apr 9 2020, 8:56 PM
int3 marked an inline comment as done.

add TODO; use reinterpret_cast

int3 added inline comments.Apr 9 2020, 8:57 PM
lld/MachO/InputFiles.cpp
225

Okay, I looked through loader.h; didn't entirely understand it, but I think LC_DYSYMTAB just contains additional info about / indirect references to symbols in LC_SYMTAB.

I'll give other reviewers a few more days to take a look.

lld/MachO/InputFiles.cpp
235

We shouldn't be doing this for symbols that are undefined in the dylib's symbol table, right? (And if so, we should add a test for that, though I don't think that'll be possible till after your follow-ups.)

Looks like LLD ELF adds both defined and undefined symbols from dylibs into the symbol table, presumably to be able to check if all undefined symbols for dylibs are satisfied when linking an executable. Whether or not we do that depends on ld64's behavior there.

MaskRay added inline comments.Apr 15 2020, 10:31 AM
lld/MachO/InputFiles.cpp
235

Looks like LLD ELF adds both defined and undefined symbols from dylibs into the symbol table, presumably to be able to check if all undefined symbols for dylibs are satisfied when linking an executable.

Yes. See D57385 and D57569.

MaskRay added inline comments.Apr 15 2020, 10:32 AM
lld/MachO/InputFiles.cpp
235

It is also used to set the exportDynamic property of a symbol. When linking an executable, the symbols used by DSOs will be added to .dynsym

int3 marked 2 inline comments as done.Apr 15 2020, 3:24 PM
int3 added inline comments.
lld/MachO/InputFiles.cpp
235

I think re-exported symbols also appear as undefined in the symbol table... I need to figure out how to distinguish them. I'll add a TODO.

lld/MachO/InputFiles.h
37

Actually, I see that we already have getName(). It just doesn't always work because we neglected to copy the strings from our input arguments before discarding them. I'll fix it and use that

int3 updated this revision to Diff 257875.Apr 15 2020, 3:39 PM

use getName(); add TODO for undefined symbols

smeenai accepted this revision.Apr 20 2020, 4:20 PM

LGTM with the comments addressed. I can commit this for you once they are.

lld/MachO/Driver.cpp
92–93

I believe these should actually be placed on the search path *after* any user-specified -L directories. At least that's what ld64 appears to do, based on the output from its -v option.

lld/MachO/InputSection.cpp
34

Nit: when one branch of an if-else has braces, all the others should too.

36

(same here)

This revision is now accepted and ready to land.Apr 20 2020, 4:20 PM
int3 marked an inline comment as done.Apr 20 2020, 10:37 PM
int3 added inline comments.
lld/MachO/Driver.cpp
92–93

Oh good catch. I'll make our -v option emit that too so we can test for it

This revision was automatically updated to reflect the committed changes.