This is an archive of the discontinued LLVM Phabricator instance.

Augment lldb's symbol table with external symbols in Mach-O's dyld trie
ClosedPublic

Authored by jasonmolenda on Mar 24 2020, 11:46 PM.

Details

Summary

In addition to the traditional nlist records in Mach-O binaries, the dynamic linker dyld has a trie structure that it uses to record symbols with external linkage. It uses this structure when resolving symbols at runtime; the nlist records can be completely stripped from a binary without detriment. lldb has read the special class of reexport symbols out of the dyld trie until now.

This patch has four parts -

  1. Remove checks for an empty string table / nlist table as meaning there is no symbol table.
  1. Changes ParseTrieEntries to recognize the externally-visible symbols and add them to a second array of TrieEntries.
  1. After populating the symbol table from the nlist records, looks for any matching symbol addresses that we read from the trie, marks them as already seen so we don't add duplicated symbols in the table.
  1. Adds the trie entries that hadn't already been seen, and marks any function starts with those addresses as already-added.

There is a test case that has a variety of text and data symbols with different linkage visibility and a test case that checks that we don't have duplicate symbol table entries, and that we can still find the externally visible symbols after stripping the binary.

The patch at this point is pretty straightforward. It's easy to make mistakes in ObjectFileMachO::ParseSymtab, and in the process of writing this I think I made all of them. But I'm open to any feedback about how things might be done more clearly.

Adrian, I wasn't sure how well I'm conforming to best practices on the testsuite Makefile, where I'm compiling my main.cpp has a dylib, then making a stripped copy. This works, but if you have a chance to look at it and provide feedback, I would appreciate it.

Diff Detail

Event Timeline

jasonmolenda created this revision.Mar 24 2020, 11:46 PM

I only have a bunch of superficial comments but this seems reasonable as far as I can tell.

lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
1988

Flags(e.entry.flags).Test(EXPORT_SYMBOL_FLAGS_REEXPORT) ?

1990

Super nit-picky nit pick:

// Add symbols that are reexport symbols with a valid import name.

2012

Flags(e.entry.flags).Test(EXPORT_SYMBOL_FLAGS_REEXPORT) ?

4496

Ah I see. the normal operator< sorts by offset.

4499

Do we need llvm::stable_sort here to have reproducible behavior or does it not matter?

4500

Does for (auto &current_sym : sym) work here?

4512

GetFlags().test(MACHO_NLIST_ARM_SYMBOL_IS_THUMB) ?

Up to taste; it's more verbose but it's kind of confusing to combine && and & in the same expression.

4514

This block looks like it could be a lambda because it's the same code as above?

4546

Shows that I don't know the operater precedence rules of C very well, but without extra parenthesis this makes me uneasy :-)

4554

Wouldn't this automatically fail if function_starts_count == 0?

4556

if (!symbol_is_thumb ?

4566

High-level comment what this block is doing?

lldb/test/API/macosx/dyld-trie-symbols/Makefile
10

What does $(CFLAGS_NO_DEBUG) achieve in the *linker* invocation?

Could you get rid of this entire rule and just specify LD_EXTRAS = -dynamiclib?

17

Same question here

jasonmolenda marked 2 inline comments as done.Mar 25 2020, 11:37 PM

Thanks Adrian, great comments, incorporated most of them. The use of the block to search the trie entries when eliminating duplication between nlist/tries was a good idea, much cleaner now.

lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
4500

This idiom is used everywhere else in ParseSymtab, just for consistency I want to stick with the older style.

4566

Added. It was largely the same thing as the function starts symbol table entries - we compute which section the address is in, to set the type of the symbol in the symtab.

Update patch with Adrian's suggestions.

jasonmolenda marked 11 inline comments as done.Mar 25 2020, 11:44 PM
jasonmolenda added inline comments.
lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
4499

The dyld trie should not have multiple exported symbols with the same address, so sorting the vector by address should be fine. If that were to happen, we'd probably only clear the first one from the trie and we would eventually add the second one when we iterate over the table. I don't that's something I'll check in the code tho, it doesn't make much sense.

aprantl accepted this revision.Mar 26 2020, 9:44 AM
This revision is now accepted and ready to land.Mar 26 2020, 9:44 AM
JDevlieghere added inline comments.Mar 26 2020, 9:46 AM
lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
1926

return flags == LLDB_INVALID_ADDRESS

1990

Mega nit-picky nit pick: comment should be sentences and start with a capital.

4508

You don't need the else after a return.

4569

I guess this could be an else if

4598

This could make a great ternary operator ;-)

jingham added inline comments.
lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
1926

This sort of change has the downside that you can't break when flags == LLDB_INVALID_ADDRESS without adding a condition. That seems in this case like a condition you are likely to want to break on, and given this looks like a function that gets called a lot, it's probably good to pay the cost of a condition.

I'm not sure I'm in favor of this sort of rewrite. It just saves one line, and isn't particularly easier to read. Does it have some benefit I'm missing?

So I know the mach-o symbol table parsing code is a mess already, but it seems like this patch can be simpler if we make a std::set<lldb:addr_t> at the top of ObjectFileMachO::ParseSymtab() and every time we add a symbol that has a valid address value, add the file addr to this set. This will avoid the need to do any sorting. This std::set can be used to not add LC_FUNCTION_START entries that already have a symbol at the address, and it can be used to only add TrieEntry values that have symbols. Thoughts?

lldb/source/Plugins/ObjectFile/Mach-O/ObjectFileMachO.cpp
1924

The only bits that are used in this field are:

enum {
  EXPORT_SYMBOL_FLAGS_KIND_MASK = 0x03u,
  EXPORT_SYMBOL_FLAGS_WEAK_DEFINITION = 0x04u,
  EXPORT_SYMBOL_FLAGS_REEXPORT = 0x08u,
  EXPORT_SYMBOL_FLAGS_STUB_AND_RESOLVER = 0x10u
};

enum ExportSymbolKind {
  EXPORT_SYMBOL_FLAGS_KIND_REGULAR = 0x00u,
  EXPORT_SYMBOL_FLAGS_KIND_THREAD_LOCAL = 0x01u,
  EXPORT_SYMBOL_FLAGS_KIND_ABSOLUTE = 0x02u
};

So why not just set the highest bit and avoid clobbering all of the other flags?:

#define EXPORT_SYMBOL_FLAGS_LLDB_ALREADY_PRESENT 1ull << 63
void SymbolAlreadyPresent() { flags |= EXPORT_SYMBOL_FLAGS_LLDB_ALREADY_PRESENT; }
1926

So is "flags" initially used just after parsing, but before we mark TrieEntry values as already present? These flags mean something when we first parse them. I would rather add an extra bool entry to this structure, since they don't stay around for long.

1926
return (flags & EXPORT_SYMBOL_FLAGS_LLDB_ALREADY_PRESENT) != 0;
1965

s/resoler/resolver/

2503–2516

Maybe we should run this before ParseTrieEntries and pass text_segment_file_addr in as an argument and add this to the offset as we parse the TrieEntry objects?

4490

Might be easier to create a std::set<lldb::addr_t> at the top of the ObjectFileMachO::ParseSymtab() function. Anyone who adds a symbol, would add the file address to that set. Then we wouldn't have to sort these entries, we would just iterator through them. We must do something like this for LC_FUNCTION_STARTS already right?

So I know the mach-o symbol table parsing code is a mess already, but it seems like this patch can be simpler if we make a std::set<lldb:addr_t> at the top of ObjectFileMachO::ParseSymtab() and every time we add a symbol that has a valid address value, add the file addr to this set. This will avoid the need to do any sorting. This std::set can be used to not add LC_FUNCTION_START entries that already have a symbol at the address, and it can be used to only add TrieEntry values that have symbols. Thoughts?

That's a solid idea, let me try rewriting for that. LC_FUNCTION_STARTS already has a 'done' bool with each one to indicate whether it should be added to the symtab at the end. & yeah, I should have just added a new flag to indicate that the thing had already been seen -- for the exported symbols from the trie I wasn't use flags at all so I just clobbered it, but that's not a very nice way to do it. Let me try doing the std::set thing I think that'll remove a bunch of code across the function.

Updated the patch to incorporate Jonas and Greg's suggestions.

Most importantly, Greg's suggestion to have a set of addresses added to the symbol table; now that we've added a third symbol source, this was a much cleaner way to track which symbols should be added - process them in most-information-ful to least-information-ful order using this set to track if it has already been added.

This opened up the FunctionStarts data bool which was used to track if the func_start should be added, and I've used this (and a flag in the TrieEntry) to track if a function is thumb codegen; this further simplified a lot of code that was tracking this bit of metadata in the 0th bit of the addresses, complicating address comparisons etc - only the actual addresses are stored in these structures now.

This revision was automatically updated to reflect the committed changes.

Much cleaner!