This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Infer machine type from input files
AbandonedPublic

Authored by int3 on Feb 24 2021, 4:51 PM.

Details

Reviewers
None
Group Reviewers
Restricted Project
Summary

I initially tried to do this by mirroring LLD-ELF's architecture of having
InputFile construction separated from parsing of the file contents. However, I
think that model is a bit more of an awkward fit for Mach-O with its .tbd files --
since one TBD file can generate multiple corresponding InputFiles, there
isn't a 1:1 mapping between unparsed buffers and InputFile objects. Moreover, I
found having files in an unparsed state & having to remember to parse them
before use was confusing in general.

As such, I've decided to have us infer the machine type by scanning the input
object files without constructing them. While this does mean that we read in one
extra page per file in the worst case, I think there should be almost no
overhead in practice, since 1. we'll usually succeed in inferring the type from
the first file we examine, 2. we're only reading in one page, and 3. this entire
logic can be bypassed by specifying -arch on the command line.

Note that we only look at object files (and not bitcode) when doing this
inference. This matches ld64's behavior, though I suppose it would be nice to
support inference from bitcode in the future.

Diff Detail

Event Timeline

int3 created this revision.Feb 24 2021, 4:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2021, 4:51 PM
int3 requested review of this revision.Feb 24 2021, 4:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2021, 4:51 PM
int3 edited the summary of this revision. (Show Details)Feb 24 2021, 4:51 PM
int3 edited the summary of this revision. (Show Details)
int3 added inline comments.Feb 24 2021, 4:52 PM
lld/test/MachO/reproduce-thin-archives.s
9

since we no longer default to x86_64, tests which do not pass in an object file will have to specify the arch manually

thakis added a subscriber: thakis.Feb 25 2021, 7:31 AM

I think there should be almost no
overhead in practice, since 1. we'll usually succeed in inferring the type from
the first file we examine, 2. we're only reading in one page, and 3. this entire
logic can be bypassed by specifying -arch on the command line.

Hm, 1 isn't true if most files are fat .o files. 2. isn't really true since it's kind of bad for the disk cache to open the first page of every obj and then walk all of them again (especially on spinning disks), and if we believe 3 is the way to go, maybe we should just always require -arch? Usually the linker is called through clang anyways, which passes it in.

int3 added a subscriber: smeenai.Feb 25 2021, 11:56 AM

Hmm good point about #1, I wasn't sure how common fat binary builds were but @smeenai pointed out that with the introduction of M1 they're going to be more likely. Re #2, I think the same issue plagues LLD-ELF, at least if my mental model of the disk cache is right: even though it doesn't mmap the input files twice, it checks the machine type by reading the first page, then goes back later and reads in the rest of the file, which results in similar disk cache effects. But you do make a good point about this not being a super useful feature given that LLD is not typically invoked directly... this just saves us some boilerplate in tests for questionable code complexity.

Sounds like we should require -arch, at least for now :)

(https://bugs.llvm.org/show_bug.cgi?id=38965 has a years-long monologue of me looking at a tangentially-related problem every now and then.)

int3 planned changes to this revision.Feb 26 2021, 9:09 AM
int3 added a comment.Mar 1 2021, 12:41 PM

I realize we should actually validate against the platform as well. ld64 likewise can infer the platform from its inputs. We could make lld require the platform to be explicitly specified, however I think sprinkling -platform_version all over our tests is going to be pretty ugly. How do you feel about aliases like %lld-x86_64 and %lld-arm64 that expand into include both an arch and a platform flag? I expect that most of our tests will be fine with using e.g. just the macOS platform.

I realize we should actually validate against the platform as well. ld64 likewise can infer the platform from its inputs. We could make lld require the platform to be explicitly specified, however I think sprinkling -platform_version all over our tests is going to be pretty ugly. How do you feel about aliases like %lld-x86_64 and %lld-arm64 that expand into include both an arch and a platform flag? I expect that most of our tests will be fine with using e.g. just the macOS platform.

Sounds fine. Maybe just %lld can expand to ... -arch x86_64 -platform_version macos ... and if a test wants something else they can add another flag, which should win over the earlier one?

int3 added a comment.Mar 2 2021, 10:48 AM

Yup, sounds good :)

oontvoo added inline comments.
lld/MachO/Driver.cpp
148

But isn't this still technically UB? Would the UBSAN builds complain?

int3 added inline comments.Mar 2 2021, 11:03 AM
lld/MachO/Driver.cpp
148

hm, why would it be UB?

In any case, per the other comments, I'm going to kill all this inferMachineType logic and just have things specified explicitly.

int3 abandoned this revision.Mar 2 2021, 1:10 PM

Replaced by D97799