This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Add basic symbol table output
AbandonedPublic

Authored by int3 on Mar 24 2020, 5:14 PM.

Details

Summary

This diff implements basic support for writing a symbol table.

  • Attributes are loosely supported for extern symbols and not at all for other types

Immediate future work will involve implementing section merging.

Depends on D76839: [lld-macho] Extend SyntheticSections to cover all segment load commands.

Initial version by @Ktwu, commandeered by @int3.

Diff Detail

Event Timeline

Ktwu created this revision.Mar 24 2020, 5:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 24 2020, 5:14 PM
Ktwu edited the summary of this revision. (Show Details)Mar 24 2020, 5:16 PM
int3 added a subscriber: int3.Mar 24 2020, 8:14 PM
int3 added inline comments.
lld/MachO/Writer.cpp
53

this could be constructed locally in the using method I think

lld/test/MachO/symtab_basic.s
40 ↗(On Diff #252459)

maybe put one of these in a different section like .data?

smeenai added inline comments.
lld/MachO/InputSection.h
46 ↗(On Diff #252459)

Nit: we normally use lowerCamelCase instead of snake_case.

lld/MachO/Writer.cpp
352

Hmm. This is gonna be incorrect for multiple sections of the same name, which should get merged together. Section merging is currently broken, as you pointed out.

I think we need an OutputSection class, similar to OutputSection in LLD ELF, to handle section merging. We have OutputSegments, but the sections in a segment themselves need merging too. That's also a much better place for the addr (which is the address in the *output*, not the address from the input) and n_sect fields to go.

smeenai added inline comments.Mar 25 2020, 2:28 PM
lld/MachO/Writer.cpp
390

It's n_desc, and we need a TODO here for filling this out.

397

Nit: variable naming

401

Do we need a TODO here for handling more types?

lld/test/MachO/symtab_basic.s
15 ↗(On Diff #252459)

I think we should check the values, to make sure they're written correctly. They should be consistent (they'll depend on the base address, but that should be fixed).

gkm added a subscriber: gkm.Mar 25 2020, 10:24 PM
Ktwu marked 7 inline comments as done.Mar 26 2020, 9:42 AM
Ktwu added inline comments.
lld/MachO/Writer.cpp
352

That makes sense, although I'm afraid any logic I write up for merging probably won't be correct. I'll try referencing what ELF's stuff does.

lld/test/MachO/symtab_basic.s
15 ↗(On Diff #252459)

I didn't put down a value because I didn't know, at this stage, how to tell if a value is correct or not. Is there another command I can invoke to see if these values are sane (like if readobj uses the value and hopefully doesn't crash, for example)?

40 ↗(On Diff #252459)

Yup, will try that!

int3 added inline comments.Mar 26 2020, 10:08 PM
lld/test/MachO/symtab_basic.s
15 ↗(On Diff #252459)

you could use FileCheck's numeric substitution blocks to write these values as offsets from the start of the __text section. I'm doing something like that in dylink.s in D76252.

int3 added inline comments.Mar 26 2020, 10:32 PM
lld/MachO/Writer.cpp
397

I wonder if we should just drop the n_ prefix for internal variable names, i.e. just strx instead of nStrx. Not really sure why n_ was added in the first place, but it seems like some kind of ad-hoc namespace convention? (Not that it really makes sense for struct members...) In any case it doesn't seem to add much to code readability

Ktwu updated this revision to Diff 253722.Mar 30 2020, 3:32 PM

Addressing first round of comments (minus the value checks in the test)

Ktwu updated this revision to Diff 253738.Mar 30 2020, 4:05 PM

Verify that address offsets for the text segment are reasonable

Ktwu edited the summary of this revision. (Show Details)Mar 30 2020, 4:08 PM
int3 added inline comments.Mar 30 2020, 4:35 PM
lld/MachO/Writer.cpp
395

s/contensOs/contentsOs/

int3 added inline comments.Mar 30 2020, 5:59 PM
lld/MachO/InputSection.h
49 ↗(On Diff #253738)

I think we should name this sectionIndex in accordance with lld-ELF's OutputSection::sectionIndex. (I've made that change in D76839)

lld/test/MachO/symtab_basic.s
1 ↗(On Diff #253738)

naming nits:

  1. Use hyphens instead of underscores for test names (consistent with other lld tests)
  2. The "_basic" suffix seems a bit unnecessary since this will (hopefully quickly) become a non-basic test
ruiu added inline comments.Mar 30 2020, 11:54 PM
lld/MachO/Driver.cpp
168–169 ↗(On Diff #253738)

This can now fit to a single line?

lld/MachO/Writer.cpp
381

Can you add a function comment about the structure fo the symbol table, so that readers can get a sense what this function is supposed to do without reading the code?

395

but maybe just os is fine.

Ktwu marked 6 inline comments as done.Apr 1 2020, 6:17 PM

Sorry for the delay in response, and thanks for the review so far!

lld/test/MachO/symtab_basic.s
1 ↗(On Diff #253738)

Sure; I hoped that we could have a series of tests with one sanity test, but I'll happily rename this.

Ktwu updated this revision to Diff 254391.Apr 1 2020, 6:46 PM

Address comments, rebase on D76252

ruiu added a comment.Apr 1 2020, 7:57 PM

We usually construct a section in three phases as follows:

  1. In the first phase, we add pieces of information to each section object (e.g. adding symbols to the symbol table section object), but we don't construct the actual contents at this stage. We only compute the size of each section.
  2. We fix the layout of the output file, which can be done once we know the size of each section.
  3. Finally, we let each section to write their contents directly to the output buffer.

This design has several advantages over constructing section contents early:

  • This last phase can be parallelized for each section, which significantly improves performance of the writer.
  • As we directly construct section contents to the output buffer, we can avoid an extra copy.

My understanding of this patch is that this creates section contents early. But can you avoid doing that?

Ktwu marked an inline comment as done.Apr 2 2020, 9:37 AM

My understanding of this patch is that this creates section contents early. But can you avoid doing that?

Of course!

lld/MachO/Writer.cpp
397

I think it helps map the value to the intended struct name; if I see a variable with this nPattern naming, it's easier to understand that the value is ultimately intended for use within the symbol table given the documented struct variable names.

Ktwu added a comment.Apr 6 2020, 1:34 PM

@ruiu

Since D76839 is based on this diff, it'd be easier to land this diff as-is so that D76839 doesn't have to rebase. Any qualms about landing as-is and then changing the write order in another diff?

MaskRay added a comment.EditedApr 6 2020, 2:22 PM

@ruiu

Since D76839 is based on this diff, it'd be easier to land this diff as-is so that D76839 doesn't have to rebase. Any qualms about landing as-is and then changing the write order in another diff?

For a patch series, amending an earlier patch may require the rebase of subsequent patches. I think that is just how things work. Email based reviews (e.g. [PATCH 0/5]) work this way, too. I don't think a question can be dismissed just because "landing a patch as-is can avoid rebasing future patches".

Ktwu updated this revision to Diff 255553.Apr 6 2020, 5:57 PM

Addressing request to split generating content and assigning addresses. This doesn't functionally
do anything different than before since nothing in the writer is parallelized, but in principle
this could be adapted to be so.

@ruiu, does this look good to you now?

lld/MachO/Writer.cpp
390

Idk if this is a useful reference to put in a comment, since it's a random person's GitHub copy of a document which may or may not stay alive. There's lots more official documentation of the Mach-O structures; in fact, LLVM has its own ones in http://llvm.org/doxygen/BinaryFormat_2MachO_8h_source.html#l00992. I'd say it's safe to omit this.

392

This is called after assignAddresses, so we should know our exact string table size at this point, right? We could call reserve on it with that exact size.

407

Change to nDesc

409

Nit: you don't need the parentheses.

lld/test/MachO/symtab.s
26

Nice!

32

How come you're not checking the section index for this one?

ruiu added inline comments.Apr 8 2020, 11:04 PM
lld/MachO/Writer.cpp
372

This function should be called finalize and should belong to LCSymtab class instead of Writer. nSyms and strSize member variables as well as how to compute the size of symtab are internal details of the symtab section, and Write shouldn't take care of them.

In ELF, we call function in this order:

  1. For each output section, call finalize()
  2. Call Writer::assignAddresses()
  3. For each output section, call write()

(1) finalizes the contents of the section. Each section knows how to compute its size, and the size should be computed by finalize and shouldn't change afterwards.

391

This should move to LCSymtab as LCSymtab::write(uint8_t *buf). That function should directly write symbol table section contents to a given buffer (specifically without using SmallVector or raw_svector_ostream). I think you should take a look at ELF's writer.cpp as an example how we writer a writer.

Ktwu marked 4 inline comments as done.Apr 9 2020, 11:59 AM
Ktwu added inline comments.
lld/MachO/Writer.cpp
372

Gotcha; I'll move the functionality to LCSymtab, change the function names, and make sure I don't use local buffers when writing out the contents since that's unnecessary.

390

Sure!

392

Good point.

lld/test/MachO/symtab.s
32

It differs from what the system ld emits for the section index -- this diff emits 0x2, ld emits 0x3

For lack of understanding for why that it at the moment I figured neglecting the value to check is better than checking against an anomaly.

int3 commandeered this revision.Apr 9 2020, 4:16 PM
int3 edited reviewers, added: Ktwu; removed: int3.

For a patch series, amending an earlier patch may require the rebase of subsequent patches.

I'd agree if it were a series put up by one person, but with multiple people this creates a lot of friction. @Ktwu and I chatted and we agreed that I'd commandeer this, but I hope we can be more flexible for future work.

int3 planned changes to this revision.Apr 9 2020, 4:16 PM
int3 updated this revision to Diff 256481.Apr 9 2020, 7:11 PM
int3 marked 2 inline comments as done.

address comments / rebase on top of D76839

int3 retitled this revision from [lld] Add basic symbol table output to [lld-macho] Add basic symbol table output.Apr 9 2020, 8:12 PM
int3 edited the summary of this revision. (Show Details)
int3 updated this revision to Diff 256498.Apr 9 2020, 8:23 PM

format

int3 edited the summary of this revision. (Show Details)Apr 10 2020, 12:19 AM
ruiu accepted this revision.Apr 12 2020, 8:29 PM

Much better! Thank you for making the changes, this patch looks pretty good now. LGTM

This revision is now accepted and ready to land.Apr 12 2020, 8:29 PM
int3 updated this revision to Diff 257900.Apr 15 2020, 4:59 PM

order section name declarations by output order

int3 updated this revision to Diff 257952.Apr 15 2020, 9:42 PM

only emit defined symbols for now

smeenai added inline comments.Apr 21 2020, 1:41 PM
lld/MachO/SyntheticSections.h
97–141

Should this be part of this diff instead of the previous one?

Also, the same comment about renaming this to string table everywhere applies.

This revision was automatically updated to reflect the committed changes.
smeenai reopened this revision.Apr 28 2020, 11:43 AM

Reverted in rGfbae153ca583588de73d8fae84ec262c24d09025 because of a UBSan failure.

This revision is now accepted and ready to land.Apr 28 2020, 11:43 AM