This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Extend SyntheticSections to cover all segment load commands
ClosedPublic

Authored by int3 on Mar 26 2020, 5:24 AM.

Details

Summary

Previously, the special segments __PAGEZERO and __LINKEDIT were implemented as special LoadCommands. This diff implements them using special sections instead which have an isHidden() attribute. We do not emit section headers for hidden sections, but we use their addresses and file offsets to determine that of their containing segments. In addition to allowing us to share more segment-related code, this refactor is also important for the next step of emitting dylibs:

  1. dylibs don't have segments like __PAGEZERO, so we need an easy way of omitting them w/o messing up segment indices
  2. Unlike the kernel, which is happy to run an executable with out-of-order segments, dyld requires dylibs to have their segment load commands arranged in increasing address order. The refactor makes it easier to implement sorting of sections and segments.

Depends on D76252: [lld-macho] Add basic support for linking against dylibs.

Diff Detail

Event Timeline

int3 created this revision.Mar 26 2020, 5:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 26 2020, 5:24 AM
int3 edited the summary of this revision. (Show Details)Mar 26 2020, 5:37 AM
int3 edited the summary of this revision. (Show Details)Mar 26 2020, 5:47 AM
int3 updated this revision to Diff 252815.Mar 26 2020, 5:53 AM
int3 marked an inline comment as done.

update

lld/MachO/SyntheticSections.h
23–24

The names of hidden sections don't make it to the final executable; they're just for distinguishing the InputSections within lld code and so can be any unique string. But I've picked the same names as the corresponding Atoms in ld64 for easy reference.

Harbormaster completed remote builds in B50526: Diff 252813.
int3 marked 3 inline comments as done.Mar 26 2020, 11:38 AM
int3 added inline comments.
lld/MachO/OutputSegment.cpp
20–36

this is more or less what ld64 does

lld/MachO/Writer.h
17

Had to move this to the header so that MachHeaderSection's implementation in SyntheticSections.cpp could access it. That said, I'm a bit undecided on whether it really makes sense for things like MachHeaderSection to be there -- while it technically is a synthetic section, if it only gets used by one component of the linker (i.e. in Writer.cpp), maybe it should be defined there?

int3 edited the summary of this revision. (Show Details)Mar 26 2020, 12:23 PM
int3 edited the summary of this revision. (Show Details)Mar 26 2020, 12:26 PM
int3 updated this revision to Diff 253040.Mar 26 2020, 9:17 PM

add InputSection::getFileOffset()

Harbormaster completed remote builds in B50656: Diff 253041.
int3 marked an inline comment as done.Mar 27 2020, 12:47 AM
int3 added inline comments.
lld/MachO/Writer.cpp
225–234

this was copypasted from the ELF implementation; we may want to factor it out into Common/ later

int3 added a reviewer: Ktwu.Mar 28 2020, 1:38 AM
int3 edited the summary of this revision. (Show Details)Mar 30 2020, 6:03 PM
ruiu added a comment.Mar 31 2020, 12:58 AM

Overall I'd like to see more comments. All functions and classes that are not trivial should have a comment explaining what they represent or what they are supposed to do.

lld/MachO/InputSection.cpp
24

It is safer to use uint64_t to represent file offsets. At least on Linux, we actually create binaries that are larger than 4 GiB.

lld/MachO/OutputSegment.cpp
23

nit: we generally omit else after return, so

if (...)
  return ...;
if (...)
  return ...;
return ...;
lld/MachO/OutputSegment.h
38

nit: add a blank line after this line

lld/MachO/SyntheticSections.h
23–24

Actually this kind of one-shot temporary can be inlined, and in particular for these names, we don't have to have then as global constants because we are not really referring them.

32

Please add a class comment to explain what this class represents.

74

Can you expand the comment a bit? I'd explain what the bind opcode in more details.

int3 updated this revision to Diff 254045.Mar 31 2020, 5:07 PM
int3 marked 7 inline comments as done.

address comments

int3 updated this revision to Diff 254053.Mar 31 2020, 5:38 PM
int3 marked an inline comment as not done.

update

int3 updated this revision to Diff 254057.Mar 31 2020, 5:41 PM

update

lld/MachO/InputSection.cpp
24

Sections are limited to 32-bit file offsets even in 64-bit MachO binaries. Even hidden sections addressed by e.g. LC_DYLD_LOAD_INFO must be 32-bit. Segments can have 64-bit file offsets though... so I've changed fileOff in OutputSegment but kept this 32-bit.

lld/MachO/SyntheticSections.h
23–24

they're referenced in sectionOrder(), so they're not exactly one-shot

74

hm, I think the explanation of bind opcodes would make more sense in encode()... I'll elaborate there

ruiu added inline comments.Mar 31 2020, 9:42 PM
lld/MachO/InputSection.cpp
24

For PE/COFF, we are using uint64 as file offsets at least in some part of the code instead of uint32 even though PE/COFF file is limited to <4GiB, because it makes it easy to find errors when we are creating a file that is too large. If we are using 32-bit integers, we need to have an error check for integer wraparound at many places, but if we are using 64-bit integers, we can just proceed until we create an actual file and then we can do an error check at that point.

I don't know if that technique is useful to this particular function, but that's something you want to keep in mind.

int3 updated this revision to Diff 254103.Apr 1 2020, 1:05 AM
int3 marked 2 inline comments as done.

64 bit

lld/MachO/InputSection.cpp
24

Good point :)

ruiu added inline comments.Apr 1 2020, 2:42 AM
lld/MachO/OutputSegment.cpp
18

I believe the LLVM coding style recommends making functions private using static instead of surrounding a function definition with an anonymous namespace.

38

Since you have using namespace lld::macho, I guess you can omit macho::?

41

!empty()

61

Looks like you can directly assign to segRef.

lld/MachO/OutputSegment.h
35

nit: !sections.empty() is more conventional.

lld/MachO/SyntheticSections.cpp
157–159

This function makes a copy of a string which is then copied to the output buffer, which isn't very efficient. You can avoid the extra string copy by eliminating the output stream as well as SmallVector, and instead storing strings as std::vector<StringRef>.

lld/MachO/SyntheticSections.h
46–47

Thank you for the comment, that's very helpful.

64

!empty()

lld/MachO/Writer.cpp
113

!empty()

ruiu added inline comments.Apr 1 2020, 2:42 AM
lld/MachO/Writer.cpp
121

Use a concrete type instead of auto.

123–128

We usually use ArrayRef<T> instead of const std::vector<T>.

124

Expand auto.

225

Can you add a function comment?

226

You are always passing sectionOrder function as the second argument, so it looks like you can eliminate the second parameter and always using sectionOrder.

245

nit: omit else after return.

248–250

So it looks like the function returns only 4 possible values -- 0, 1, linkEditOffset, or linkEditOffset+1. Can't you use 0, 1, 2, 3 instead?

int3 marked 18 inline comments as done.Apr 1 2020, 4:06 PM
int3 added inline comments.
lld/MachO/Writer.cpp
121

The types appear in the next two lines... is it really necessary to repeat them here?

I did a quick grep, it seems that lld/ELF tends to use explicit types for pairs, but lld/COFF seems to favor auto.

226

as mentioned in the comment below, this is likely sharable with a similar function in lld/ELF... I can factor it out now to make the 2nd parameter actually useful

248–250

The idea is to make it easily extensible -- there may be other special sections that we want to order before the __LINKEDIT sections.

int3 updated this revision to Diff 254337.Apr 1 2020, 4:24 PM

address comments

int3 planned changes to this revision.Apr 1 2020, 7:48 PM
int3 marked an inline comment as done.
int3 updated this revision to Diff 254403.Apr 1 2020, 8:11 PM

update

int3 marked an inline comment as done.Apr 1 2020, 8:12 PM
int3 added inline comments.
lld/MachO/Writer.cpp
226

On second thought, I don't think that really needs to be a standalone function at all. Simplified things...

ruiu added inline comments.Apr 1 2020, 8:12 PM
lld/MachO/Writer.cpp
248–250

Got it, but I don't think you have to prepare this tiny function for future extension. You can just make it return 0,1,2,3 and then in a later change when you actually need to use linkEditOffset, you can start returning linkEditOffset and linkEditOffset+1.

Nice, I *really* like this design of representing all these special segments as synthetic sections.

@ruiu, how does this look to you now?

lld/MachO/InputSection.h
38

isHidden implies you *don't* emit the header, right? I'd clarify that in the comment.

40

Same here, since you *wouldn't* omit it if isNeeded.

lld/MachO/OutputSegment.cpp
20

Same comment here about logic based on section names, though I'm not sure if doing it differently would end up cleaner.

38

Hmm, what's up with the vec.empty() condition? Doesn't that mean numNonHiddenSections will only ever be 0 or 1, depending on if the first section addSection was called on for a particular segment was hidden or not?

lld/MachO/OutputSegment.h
35

Looks like this one still needs to be addressed?

35

ld64 commonly checks segment or section names and takes particular actions based on that name. I'm not a fan of that design, personally, since it ties a lot of logic to those names. Having named constants for the names is a big improvement already, but I'm wondering if there's any way to design this that would sidestep the issue altogether.

For example, we could have a property of an OutputSegment like alwaysNeeded, which would be set for the linkEdit segment at its point of creation. Alternatively, we could have a LinkEditSegment override of OutputSegment which overrides isNeeded. The point would be to try to represent properties of the segment as attributes or subclasses, instead of computing them based on the name.

I'm not sure how that'd end up looking in practice though, or if it's worth it in all cases. @int3, @ruiu, what do you think?

lld/MachO/SyntheticSections.cpp
157

index might be clearer than strx.

lld/MachO/SyntheticSections.h
33

Shouldn't all of these be inheriting from SyntheticSection?

108

How come you're calling this the StringPoolSection instead of the StringTableSection? I thought string table was the standard terminology.

lld/MachO/Writer.cpp
126

Nit: drop the braces.

248–250

Looks like this comment still needs a response.

lld/MachO/Writer.h
17

I think either design is justifiable ... we can just leave them where they are for now and see how things shake out when we have more of the linker implemented.

int3 updated this revision to Diff 256432.Apr 9 2020, 3:42 PM
int3 marked 9 inline comments as done.

address comments

int3 added a comment.Apr 9 2020, 3:42 PM

Sorry about the delay, got sidetracked for a bit...

lld/MachO/OutputSegment.cpp
38

This is to handle merged sections (well, aside from the fact that the rest of the code doesn't really do that yet). vec.empty() checks to see if we're creating a new output section or adding to an existing one.

lld/MachO/OutputSegment.h
35

Personally I try to avoid abstractions until they're definitively needed. At this point I'm not aware of any other segment that will benefit from alwaysNeeded

lld/MachO/SyntheticSections.cpp
157

it's named this way because its value ends up populating n_list::n_strx

lld/MachO/SyntheticSections.h
33

I haven't defined such a class yet, mostly because I'm not yet sure what it should contain that InputSection doesn't. I'm thinking we can sort out the class hierarchy after we have a proper OutputSection class set up.

108

ld64 calls it the string pool ¯\_(ツ)_/¯ I'm fine with either though

lld/MachO/Writer.cpp
248–250

You can just make it return 0,1,2,3 and then in a later change when you actually need to use linkEditOffset, you can start returning linkEditOffset and linkEditOffset+1.

I'd actually forgotten to add the symbol table / string pool sections to the ordering after rebasing, oops. So we do actually have a bunch of __LINKEDIT sections to order in this diff already.

In any case, I've refactored things so that the ordering can be specified a bit more declaratively, though it does add a bit more code... let me know what you think, I can revert to the if-else chain if you'd prefer.

int3 added a comment.Apr 9 2020, 3:45 PM
This comment was removed by int3.
int3 updated this revision to Diff 256480.Apr 9 2020, 7:11 PM
int3 marked an inline comment as done.

rebase

int3 added inline comments.Apr 9 2020, 8:58 PM
lld/MachO/Writer.cpp
248–250

I'd actually forgotten to add the symbol table / string pool sections to the ordering after rebasing

Nevermind, I've instead swapped the stack position of this diff with that of D76742, so those sections are added later...

int3 edited the summary of this revision. (Show Details)Apr 10 2020, 12:21 AM
int3 updated this revision to Diff 257897.Apr 15 2020, 4:52 PM

order section name declarations by output order

int3 marked an inline comment as done.Apr 15 2020, 9:42 PM
int3 added inline comments.
lld/MachO/Writer.cpp
248–250

I was debugging an issue locally that boiled down to sections not actually being arranged contiguously by the sort. There's currently nothing enforcing that property, aside from relying on us to correctly update the ordering list each time a new section is created. As such, I'm changing the sorting comparator to operate on segment names first followed by section names.

int3 updated this revision to Diff 257951.Apr 15 2020, 9:42 PM

update sorter

LGTM with the comments addressed. @ruiu, what do you think?

lld/MachO/OutputSegment.cpp
38

Got it. Flagging this for @Ktwu to adjust as part of the section merging, if she isn't already.

lld/MachO/SyntheticSections.cpp
85

Any reason to have this one-liner not be part of the header?

lld/MachO/SyntheticSections.h
91

I'd prefer StringTable over StringPool, since string table is the standard terminology used in the Mach-O reference.

lld/MachO/Writer.cpp
291

Hmm, I'm not sure if I found this or the if-ladder clearer. On the one hand, this is more declarative, but on the other, you have to parse the code below to understand what this is doing. It should be easy enough to change in a follow-up if we want though, so I don't care too much right now.

int3 marked 4 inline comments as done.Apr 20 2020, 10:37 PM
int3 added inline comments.
lld/MachO/SyntheticSections.cpp
85

The declaration of InStruct appears at the end of the header file, so in isn't visible at isNeeded()'s declaration.

lld/MachO/SyntheticSections.h
91

D'oh, this entire section should be added in the symtab diff. Rebasing headaches 😅

Will move it over and rename

int3 updated this revision to Diff 258903.Apr 20 2020, 10:38 PM
int3 marked 2 inline comments as done.

address comments

smeenai added inline comments.Apr 21 2020, 12:02 AM
lld/MachO/SyntheticSections.cpp
85

Sure, but we could move that order around :) (You'd just need a struct GotSection; forward declaration.)

I don't think it really matters though ... I doubt the cost of not inlining this in the writer would be significant (and LTO would be able to inline it regardless, if you really cared about perf).

This revision is now accepted and ready to land.Apr 27 2020, 12:44 PM
This revision was automatically updated to reflect the committed changes.
vitalybuka reopened this revision.Apr 28 2020, 3:21 AM
vitalybuka added a subscriber: vitalybuka.

http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/40817/steps/check-lld%20ubsan/logs/stdio

ommand Output (stderr):
--
/b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/MachO/InputSection.cpp:29:15: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:28: note: nonnull attribute specified here
    #0 0xc7f0ba in lld::macho::InputSection::writeTo(unsigned char*) /b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/MachO/InputSection.cpp:29:3
    #1 0xc85687 in writeSections /b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/MachO/Writer.cpp:449:15
    #2 0xc85687 in (anonymous namespace)::Writer::run() /b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/MachO/Writer.cpp:490:3
    #3 0xc8418c in lld::macho::writeResult() /b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/MachO/Writer.cpp:496:38
    #4 0xc79269 in lld::macho::link(llvm::ArrayRef<char const*>, bool, llvm::raw_ostream&, llvm::raw_ostream&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/MachO/Driver.cpp:170:3
    #5 0x835016 in main /b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/tools/lld/lld.cpp:163:13
    #6 0x7f441c63809a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a)
    #7 0x81e579 in _start (/b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/lld+0x81e579)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /b/sanitizer-x86_64-linux-fast/build/llvm-project/lld/MachO/InputSection.cpp:29:15 in
This revision is now accepted and ready to land.Apr 28 2020, 3:21 AM

Thanks for the report, taking a look.

(I'll revert if we can't figure this out soon, but there's a bunch of dependents so I wanna try fixing forward first.)

I have a repro, and we think we know the issue. Testing that out.

Committed a fix for one issue in rGaf40bff32db7840cfbe07278ff0c498604acc5f0, and reverted the other two commits in rGb52bc2653bbc8da852a804afbb34836bb7f1e58a and rGfbae153ca583588de73d8fae84ec262c24d09025. We'll recommit those with a fix for the UB and keep an eye on the bot. Sorry for the breakage.

lld/test/MachO/segments.s