This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Add --start-lib --end-lib
ClosedPublic

Authored by MaskRay on Jan 9 2022, 8:32 PM.

Details

Reviewers
int3
oontvoo
Group Reviewers
Restricted Project
Commits
rG0aae2bf37318: [lld-macho] Add --start-lib --end-lib
Summary

In ld.lld, when an ObjFile/BitcodeFile is read in --start-lib state, the file is
given archive semantics. --end-lib closes the previous --start-lib. A build
system can use this feature as an alternative to archives. This patch ports
the feature to lld-macho.

--start-lib and --end-lib are positional, unlike usual ld64 options.
I think the slight drawback does not matter as (a) reusing option names
make build systems convenient (b) --start-lib a.o b.o --end-lib conveys more
information than an alternative design: -objlib a.o -objlib b.o because
--start-lib makes it clear which objects are in the same conceptual archive.
This provides flexibility (c) -objlib/-filelist interaction may be weird.

Close https://github.com/llvm/llvm-project/issues/52931

Diff Detail

Event Timeline

MaskRay created this revision.Jan 9 2022, 8:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 9 2022, 8:32 PM
Herald added a subscriber: dang. · View Herald Transcript
MaskRay requested review of this revision.Jan 9 2022, 8:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 9 2022, 8:32 PM
MaskRay edited the summary of this revision. (Show Details)Jan 9 2022, 8:34 PM
MaskRay updated this revision to Diff 399143.Jan 11 2022, 5:35 PM

rebase. improve tests

MaskRay added inline comments.
lld/MachO/UnwindInfoSection.cpp
229

@oontvoo This change does not make the new test in lld/test/MachO/objc-uses-custom-personality.s work. Do you know why?

oontvoo added inline comments.Jan 12 2022, 4:25 PM
lld/MachO/Driver.cpp
250

can this be a parameter to the addFile() function instead?

Having it as a global makes it harder to introduce any kind parallelization in processing the input (which i'm working on a POC for, since lld-macho is currently proven to be too slow to link Youtube)

Also I've said this on the bug but I'm kind concerned about introducing stateful flags like this given there's no precedence for it in LD64.

lld/MachO/UnwindInfoSection.cpp
229

I think this is because this new lazyobject, as currently implemented, is slightly different from an archive in that objc symbols in archives are pulled in "eagerly" if the -ObjC flag is present. (This is a necessary condition for this test because it needs _OBJC_CLASS_$_MyTest, which would then pull in the personality symbol).

There's no golden standard to look to here given LD64 doesn't have this, but I guess we should try to make it behave similarly to archives when it comes to objc handling.

Also I've said this on the bug but I'm kind concerned about introducing stateful flags like this given there's no precedence for it in LD64.

I'd also be more supportive of doing something like -objlib instead of --start-lib. We have an analog for this already, where ELF linkers use --whole-archive --no-whole-archive to enclose archives to be linked whole, whereas ld64 has -force_load to whole-link a particular archive. I don't feel super strongly about that though.

MaskRay updated this revision to Diff 400326.Jan 15 2022, 3:09 PM
  • Support -ObjC
  • remove inLib

I keep my viewpoint that --start-lib is more suitable because of (b) (as stated in the description).
The library boundary is useful and -objlib would lose the information.

--whole-archive (-force_load), while similar, keeps the library boundary information,
so it's not the best analogy.

MaskRay marked an inline comment as done.Jan 15 2022, 3:10 PM
MaskRay added inline comments.
lld/MachO/Driver.cpp
250

Removed. The state would not make parallelization more difficult. We do not parallelize the addFile function calls but the underlying initialization.

MaskRay updated this revision to Diff 400327.Jan 15 2022, 3:13 PM
MaskRay marked an inline comment as done.

Add some /*isLazy=*/ comments

int3 added a subscriber: int3.Jan 17 2022, 10:47 PM

So I realize that the choice of class hierarchy here was chosen to follow the LLD-ELF backend's design... but I'm not the biggest fan of it for two reasons:

  1. ObjFiles and BitcodeFiles with lazy = true aren't proper instances of their classes; e.g. calling ->getDataInCode() on a lazy ObjFile isn't a valid operation. IMO it would make more sense to have a separate LazyFile instance that gets converted to ObjFile / BitcodeFile as necessary. Kind of like an ArchiveFile but with only one member.
  2. I have the opposite objection to LazyObject: it doesn't really merit the creation of a new class. Implementation-wise, it's indeed different from LazyArchive, but interface-wise, they both expose a single common operation: the fetching of their associated file. IMO it would be cleaner to have a single LazySymbol class with a fetchFile() method. It can use a PointerUnion to store a pointer to the llvm::object::Archive::Symbol or InputFile instance as necessary.

All that said, I do see the sense in not deviating from LLD-ELF's architecture. If you prefer the hierarchy as-is, I'm good with that too.

lld/MachO/Driver.cpp
350

nit: how about isa<ObjFile> and isa<BitcodeFile> since the line above is also using isa<>?

350–359

IIUC, we are evaluating lazy object files for ObjC members after doing regular symbol resolution. Archive loading (intentionally) does this in the opposite order: D108781: [lld-macho] Have -ObjC load archive members before symbol resolution. Would probably be good to follow suit

lld/MachO/InputFiles.cpp
929

nit: no need for llvm::

1619

no need for llvm::

lld/MachO/InputFiles.h
242

make this private?

  • Support -ObjC
  • remove inLib

I keep my viewpoint that --start-lib is more suitable because of (b) (as stated in the description).
The library boundary is useful and -objlib would lose the information.

--whole-archive (-force_load), while similar, keeps the library boundary information,
so it's not the best analogy.

Ok, I think i'm convinced --start-lib/--end-lib is better in this case :)

lld/MachO/Driver.cpp
350

is anything wrong with switching the previous predicate to checking the magic? (checking for the underlying type always seemed kind of iffy to me ... )

lld/MachO/InputFiles.cpp
1619

Sorry, I'm missing something here but why not simply for (const lto::InputFile::Symbol &objSym : obj->symbols()) ?
IOWs, why is llvm::enumerate needed?

lld/test/MachO/objc.s
34

OBJC-DAG instead?

MaskRay marked 5 inline comments as done.Jan 18 2022, 7:29 PM
MaskRay added inline comments.
lld/MachO/Driver.cpp
350–359

Thanks for the pointer. Unfortunately it is difficult to follow D108781 because without parse and parseLazy, we don't have symbols (to check ObjC symbols/sections).... For now I just add a test with TODO.

I think some heavy symbol table refactoring is needed to make symbols available before symbol resolution. https://github.com/llvm/llvm-project/issues/53104

lld/MachO/InputFiles.cpp
1619

llvm::foo for STL-like helpers (lower_bound, find_if, etc) has an advantage: prevent the confusing argument-dependent lookup. In other components of llvm-project llvm::foo for STL-like helpers is generally recommended.

So I am going to defend for the usage a bit...

1619

symbols[it.index()] is needed by -ObjC for ObjC symbol checking. It has another advantage: when a symbol table refactoring is done in the future, we can transfer the symbol table to the non-lazy state. See D116390 for ELF.

MaskRay updated this revision to Diff 401077.Jan 18 2022, 7:29 PM

address comments

So I realize that the choice of class hierarchy here was chosen to follow the LLD-ELF backend's design... but I'm not the biggest fan of it for two reasons:

  1. ObjFiles and BitcodeFiles with lazy = true aren't proper instances of their classes; e.g. calling ->getDataInCode() on a lazy ObjFile isn't a valid operation. IMO it would make more sense to have a separate LazyFile instance that gets converted to ObjFile / BitcodeFile as necessary. Kind of like an ArchiveFile but with only one member.
  2. I have the opposite objection to LazyObject: it doesn't really merit the creation of a new class. Implementation-wise, it's indeed different from LazyArchive, but interface-wise, they both expose a single common operation: the fetching of their associated file. IMO it would be cleaner to have a single LazySymbol class with a fetchFile() method. It can use a PointerUnion to store a pointer to the llvm::object::Archive::Symbol or InputFile instance as necessary.

All that said, I do see the sense in not deviating from LLD-ELF's architecture. If you prefer the hierarchy as-is, I'm good with that too.

A separate LazyFile class makes sharing information (symbols/sections) between the lazy and non-lazy states difficult.
(https://maskray.me/blog/2021-12-19-why-isnt-ld.lld-faster "initialization of sections" and "initialization of non-local symbols")
I think we are currently a bit far away from the goal, but I am carefully writing code not to make it harder.

int3 accepted this revision.Jan 18 2022, 8:25 PM

lgtm, thanks!

lld/MachO/InputFiles.cpp
1619

mm yeah, I have also been including llvm:: in front of LLVM functions which have STL counterparts of the same name. My reasoning was that since std::enumerate doesn't exist, llvm:: wasn't needed here... but I can also see the argument for using llvm:: for all STL-like things.

This revision is now accepted and ready to land.Jan 18 2022, 8:25 PM
oontvoo accepted this revision.Jan 19 2022, 6:41 AM

LG thanks!

This revision was automatically updated to reflect the committed changes.

Sorry I didn't see to comment on this before, but I'm also skeptical about introducing a new concept of stateful options to this command-line syntax.

(b) --start-lib a.o b.o --end-lib conveys more
information than an alternative design: -objlib a.o -objlib b.o because
--start-lib makes it clear which objects are in the same conceptual archive.

I don't think this is relevant to this linker -- what flexibility are you imagining it provides? "Conceptual archive" doesn't mean anything to a Mach-O linker, since we don't have the legacy ELF symbol lookup behavior where you need to loop around and search an archive repeatedly to find the proper files to pull into the link.

(c) -objlib/-filelist interaction may be weird.

Filelist can _only_ contain files, no options, so I don't see how this is an advantage of --start-lib? Also, the use of -filelist should be considered deprecated -- passing the args directly to Clang, via a clang @ response-file if needed, is preferable in every way.

MaskRay added a comment.EditedJan 19 2022, 11:59 AM

Sorry I didn't see to comment on this before, but I'm also skeptical about introducing a new concept of stateful options to this command-line syntax.

(b) --start-lib a.o b.o --end-lib conveys more
information than an alternative design: -objlib a.o -objlib b.o because
--start-lib makes it clear which objects are in the same conceptual archive.

I don't think this is relevant to this linker -- what flexibility are you imagining it provides? "Conceptual archive" doesn't mean anything to a Mach-O linker, since we don't have the legacy ELF symbol lookup behavior where you need to loop around and search an archive repeatedly to find the proper files to pull into the link.

If we later discover that linking x.a(a.o b.o) y.a(c.o) has different behavior than x.a(a.o) y.a(b.o c.o), with --start-lib, since we have the library boundary information, we can fix it.
With -objlib a.o -objlib b.o -objlib c.o, we can't.
I wouldn't put my bet on that Mach-O archives will never introduce such a rule (or has already such a rule but we just don't know yet), given weird options like -ObjC.

If an archive is augmented with special __.* members doing weird stuff, with --start-lib we can port it.
With -objlib, we can't.

(c) -objlib/-filelist interaction may be weird.

Filelist can _only_ contain files, no options, so I don't see how this is an advantage of --start-lib? Also, the use of -filelist should be considered deprecated -- passing the args directly to Clang, via a clang @ response-file if needed, is preferable in every way.

I used this as a very weak argument. I think it is fine to argue this in both ways. --start-lib makes it easy to use --start-lib @response.rsp --end-lib.