This is an archive of the discontinued LLVM Phabricator instance.

[lldb] Introduce new SymbolFileJSON and ObjectFileJSON
ClosedPublic

Authored by JDevlieghere on Mar 2 2023, 12:57 PM.

Details

Summary

Introduce a new object and symbol file format with the goal of mapping addresses to symbol names. I'd like to think of is as an extremely simple textual syntab. The new file format is extremely simple, it contains a triple, a UUID and a list of address to symbol name mapping. JSON is used for the encoding, but that's mostly an implementation detail: any other encoding could achieve the same thing. However I did purposely pick a human readable format.

The new format is motivated by two use cases:

  1. Stripped binaries: when a binary is stripped, you lose the ability to do thing like setting symbolic breakpoints. You can keep the unstripped binary around, but if all you need is the stripped symbols then that's a lot of overhead. Instead, we could save the stripped symbols to a file and load them in the debugger when needed. I want to extend llvm-strip to have a mode where it emits this new file format.
  2. Interactive crashlogs: with interactive crashlogs, if we don't have the binary or the dSYM for a particular module, we currently show an unnamed symbol for those frames. This is a regression compared to the textual format, that has these frames pre-symbolicated. Given that this information is available in the JSON crashlog, we need a way to tell LLDB about it. With the new symbol file format, we can easily synthesize a symbol file for each of those modules and load them to symbolicate those frames.

Here's an example of the file format:

{
    "triple": "arm64-apple-macosx13.0.0",
    "uuid": "36D0CCE7-8ED2-3CA3-96B0-48C1764DA908",
    "symbols": [
        {
            "name": "main",
            "addr": 4294983568
        },
        {
            "name": "foo",
            "addr": 4294983560
        }
    ]
}

I've added a test case that illustrates the stripped binary workflow. For the interactive crashlogs, we'll need to extend the crashlog script.

Diff Detail

Event Timeline

JDevlieghere created this revision.Mar 2 2023, 12:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2023, 12:57 PM
JDevlieghere requested review of this revision.Mar 2 2023, 12:57 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: sstefan1. · View Herald Transcript
mib added a comment.Mar 2 2023, 1:22 PM

Left some comments but overall looks good to me :) Thanks for taking this !

lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp
38
48–53

Why do we do this twice ?

93–112

I don't see the point of re-parsing the file here, we could just save the parsed object as a class member.

146

Cool 😁

lldb/source/Plugins/SymbolFile/JSON/SymbolFileJSON.cpp
113

Should we increment the symID here ?

Given that json numbers are read into double, should the addresses be stored as strings to avoid any issues with address integer values that can't be represented as double (>=2**53)?

JDevlieghere marked 5 inline comments as done.Mar 2 2023, 2:26 PM
JDevlieghere added inline comments.
lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp
38

See my comment bellow why this is split up.

48–53

When we create an object file instance, we usually read the first 512 bytes to read the magic. If we haven't read anything at all, the code above reads in the file. If we've only read the first 512 bytes, we read the rest of the file here.

93–112

That doesn't work because GetModuleSpecifications is a static method.

lldb/source/Plugins/SymbolFile/JSON/SymbolFileJSON.cpp
113

The IDs are arbitrary, and if we start at zero, we'll have conflicts with the ones already in the symbol table (e.g. the lldb_unnamed_symbols for stripped binaries). We could check the size of the symtab and continue counting from there. Or we can use 0 like we did here to indicate that these values are "special". I went with the latter approach mostly because that's what SymbolFileBreakpad does too.

I like this idea of this, but I would like to see this be a bit more complete. One idea is to remove the ObjectFileJSON::Symbol definition and just use lldb_private::Symbol objects and allow these objects to construct encode and decode from JSON. Then we have the ability to re-create any symbols we need in full fidelity. But that not be the goal in your case, but I don't think it would hurt to use the lldb_private::Symbol as the object we serialize and deserialize from JSON as ObjectFileJSON::Symbol just can't reproduce the full depth of the symbols we want.

From reading this it looks like your main use case is to supply additional symbols to an existing stripped binary. Is that correct? Are you aware that each dSYM file has a full copy of the unstripped symbol table already? It removes the debug map entries, but each dSYM copies all non debug map symbols inside of it already. So the same thing from this patch can already be accomplished by stripping a dSYM file of all debug info sections and leaving the symbol table alone, and then using this this minimal dSYM file just to get the symbols.

Any idea on where the JSON file will live if it is just a companion file for another mach-o or ELF executable? Will it always be next to the mach-o executable? Will we enable a Spotlight importer to find it like we do for dSYM files?

lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp
134

Don't we want to create a Symtab from the ObjectFileJSON::Symbol vector here? Symbols are not considered debug info. Debug info for functions is supposed to be more rich than just address and name.

lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.h
93–96

It would be great to be able to also define sections somewhere. I can think of this being really handy for core file serialization where we generate a core file and then store extra metadata that describes the object file in this JSON format. It would be great to have sections for this, and we really need sections to make ObjectFileJSON be able to represent something that actually looks like a real object file. See Section.h for what we expect section to have, but we can just store uint64_t values instead of lldb_private::Address values for the start address of the range for the section.

98–101

A few things to think about:

  • some symbols have a byte size (like in ELF or mach-o symbols in the debug map that have pairs)
  • some symbol have absolute values that are not actually addresses, there is no way to represent this currently here
  • some symbols have symbol types. We could allow symbols to specify a lldb::SymbolType by name?
  • If we have symbol types other than just symbols with addresses, some might refer to a sibling symbol by ID or index. It might be good to allow symbols to have integer ids

One suggestion would be to define Symbol as:

struct Symbol {
   uint64_t value;
   std::optional<uint64_t> size;
   std::optional<uint64_t> id; 
   std::optional<SymbolType> type; 
   std::string name;
   bool value_is_address; // Set to true if "value" is an address, false if "value" is just an integer with a different meaning
 };
lldb/source/Plugins/SymbolFile/JSON/SymbolFileJSON.cpp
88

If we have no debug info i SymbolFileJSON, then there is really no need to add the symbols here, we can just avoid this whole SymbolFileJSON class and have ObjectFileJSON parse the JSON symbols and add them to the Symtab in:

void ObjectFileJSON::ParseSymtab(Symtab &symtab);

So if we are not going to add JSON debug info to the ObjectFileJSON schema, then I would vote to remove this class and just do everything from ObjectFileJSON.

113

regardless of where the code goes that converts ObjectFileJSON::Symbol to lldb_private::Symbol, I would vote that we include enough information in ObjectFileJSON::Symbol so that we don't have to make up symbols IDs, or set all symbol IDs to zero. LLDB relies on having unique symbol IDs in each lldb_private::Symbol instance.

JDevlieghere marked 4 inline comments as done.Mar 3 2023, 11:01 AM

Thanks for the feedback Greg, they're all great suggestions.

From reading this it looks like your main use case is to supply additional symbols to an existing stripped binary. Is that correct?

That's one use case, the other one being the interactive crashlogs. I went into a bit a bit more detail in the summary.

Are you aware that each dSYM file has a full copy of the unstripped symbol table already? It removes the debug map entries, but each dSYM copies all non debug map symbols inside of it already. So the same thing from this patch can already be accomplished by stripping a dSYM file of all debug info sections and leaving the symbol table alone, and then using this this minimal dSYM file just to get the symbols.

Yup and for the strip scenario I described above, we wouldn't even have to go through a dSYM, we could just have llvm-strip emit a Mach-O with only the unstripped symbol table and that should work out of the box in LLDB (similar to how you can add the unstripped binary with target symbols add). But for the crashlog use case where we only have an address and a symbol, it would be really tedious to have to build the whole symbol table. I really like the idea of a textual format for this and it's easy to read and modify. The barrier is super low and even if you had nothing but the textual output of nm you could create one of these JSON files and symbolicate your binary in LLDB.

Any idea on where the JSON file will live if it is just a companion file for another mach-o or ELF executable? Will it always be next to the mach-o executable? Will we enable a Spotlight importer to find it like we do for dSYM files?

For now I have no plans to have LLDB pick these files up automatically, but that's definitely something we could explore in the future.

JDevlieghere marked 3 inline comments as done.

Address Greg's feedback

If these files can be used as the only source of information (without a stripped executable), we really should include a serialized SectionList in the JSON that can be loaded into ObjectFileJSON. This would be very useful for easily creating unit tests.

lldb/include/lldb/Symbol/Symbol.h
23

Do we something that says "value is an address"? Or are we inferring that from the lldb::SymbolType?

lldb/test/API/macosx/symbols/TestSymbolFileJSON.py
40

we probably should test that the "id" and "section_id" fields work correctly. We should also test that we are able to make symbols with only an address and name, then add tests for symbols that each add a new optional value that can be specified to ensure we can correctly make symbols.

clayborg added inline comments.Mar 6 2023, 2:41 PM
lldb/include/lldb/Symbol/Symbol.h
352–357

Do we want to stick with JSONSymbol or just teach lldb_private::Symbol to serialize and deserialize from JSON? If we so the latter, we can create any type of symbol we need and it would be useful for easily making unit tests that could use ObjectFileJSON as the basis.

JDevlieghere marked 3 inline comments as done.Mar 6 2023, 2:47 PM
JDevlieghere added inline comments.
lldb/include/lldb/Symbol/Symbol.h
23

In the Symbolc constructor that takes a JSONSymbol, I assume the value is an address if there's no section_id and an offset otherwise. We can definitely tweak that or add a sanity check based on the symbol type.

352–357

I think we really want the JSONSymbol class. That also matches what we do for the tracing objects as well, where there's a JSON object and a "real" object. For the crashlog use case where I don't have sections in the JSON, I want to be able to reuse the existing section list from the module so I need a way to pass that in. To (de)serialize the Symbol object directly, we'd need a way to pass that in, which I don't think exists today.

lldb/test/API/macosx/symbols/TestSymbolFileJSON.py
40

Yep, I saw that working manually but I can extend the test case to include that.

clayborg added inline comments.Mar 6 2023, 3:06 PM
lldb/include/lldb/Symbol/Symbol.h
23

Remember there are some symbols that are not addresses. Think of trying to create a N_OSO symbol where the value isn't an address, but it is a time stamp integer.

Symbols in the symbol table for mach-o files can have a section ID that is valid and the value is not an offset, it is still an address. Not sure if that matters.

One way to make this clear is if we have a "section_id" then "value" is an address, and if not, then "value" is just an integer? It does require that we always specify section IDs though. Also, section IDs are not human readable, we could change "std::optional<uint64_t> section_id;" to be "std::optional<std::string> section;" and thid becomes the fully qualified section name like "TEXT.text" or "PT_LOAD[0]..text". Much more readable. And easy to dig up the section from the Module's section list.

23

Another way to do this would be to have JSONSymbol know if it has an address or just an integer value by defining this as:

std::optional<uint64_t> address;
std::optional<uint64_t> value;

The deserializer would verify that only one of these is valid and fail if both or neither were specified

352–357

Any ObjectFile can always get the section list from the Module, so if there is no serialized section list, ObjectFileJSON doesn't need to provide any sections, and it can just call:

SectionList *sections = GetModule()->GetSectionList();

Any object files within a module are asked to add their own sections if they have any extra sections. We do this with dSYM files where we add the DWARF segment to the Module's main section list. So the executable adds "PAGEZERO, __TEXT, DATA_, etc", then the dSYM object file just adds any extra sections it need with a call to:

void ObjectFile::CreateSections(SectionList &unified_section_list);

So if we have a section list already in ObjectFileJSON, we need to verify that they all match with any pre-existing sections that are already in the SectionList, and can add any extra sections if needed. If the ObjectFileJSON was the only object file, it could fully populate a Module's section list on its own.

JDevlieghere marked 3 inline comments as done.Mar 6 2023, 3:13 PM
JDevlieghere added inline comments.
lldb/include/lldb/Symbol/Symbol.h
23

Sure, that sounds reasonable.

352–357

I'm aware the data is present but I don't see how to pass that into the JSON deserialization function if we don't go through an intermediary object. As far as I can tell, there's no way to pass this additional data into fromJSON. That's why I have a ctor (Symbol(const JSONSymbol &symbol, SectionList *section_list);) that takes both the deserialized symbol and the section list. If there's a way to do that I'm happy to scrap the JSONSymbol.

clayborg added inline comments.Mar 6 2023, 3:35 PM
lldb/include/lldb/Symbol/Symbol.h
23

And we can add tests that verify we get an error back if neither or both are specified.

Implement Greg's features. I'll add a unit test for the deserialization later today.

clayborg added inline comments.Mar 7 2023, 2:45 PM
lldb/include/lldb/Symbol/Symbol.h
27

Come to think of it, we might not need the section as a name as it adds no real value unless we want to add a "std::optional<uint64_t> sect_offset;" field that could be specified. We could switch to saying "if we have a valid address member, we must be able to look it up in the section lists by address".

lldb/source/Symbol/Symbol.cpp
44–49

We probably should make this function into a static factory function that returns a llvm::Expected<Symbol> so we can return an error if the symbol has not value or address, and if we can't resolve the address in a section:

llvm::Expected<Symbol> Symbol::CreateSymbol(const JSONSymbol &symbol, SectionList *section_list);

Then we can return errors for the above cases and also for below.

55–74

This code should probably check if we have a "symbol.address" first, then always fill out the address range correctly and expect a value section_sp like the code above is doing. If we have.a symbol.value, then we don't expect a symbol to have an address, and to do this, we will in the AddressRange with no section and the "symbol.value" is the offset.

I am not sure if we need a "JSONSymbol::section" member anymore since we know if this is an address now, we will expect the section lookup to succeed and can return an error if this fails, so we can probably remove "JSONSymbol::section". WE could leave this in if we want to specify an offset directly.

lldb/test/API/macosx/symbols/TestSymbolFileJSON.py
119–120

We should never have "section" and "value" as a combination. See above discussion about possibly removing the "section" field as we might not need it if we can assume:

  • if "address" is valid, we must be able to look it up by address in the section list
  • if "value" is valid, it just gets encoded as the suggested code edit
JDevlieghere marked an inline comment as done.Mar 7 2023, 2:51 PM
JDevlieghere added inline comments.
lldb/source/Symbol/Symbol.cpp
44–49

Do you prefer to handle the error here or during deserialization? Currently these things are enforced there. I don't think it makes sense to check the same invariant twice. I'm happy to move it here though if you think that's better.

JDevlieghere marked an inline comment as done.Mar 7 2023, 3:04 PM
JDevlieghere added inline comments.
lldb/source/Symbol/Symbol.cpp
44–49

Oh I missed the case where the address doesn't belong to a section. Yeah we definitely need to diagnose that here.

  • Remove section field
  • Support raw value symbols
  • Update test

Looks good. Only questions is if we can add a C++ unit test for this file and test the new Symbol::FromJSON() and test all error conditions.

lldb/source/Symbol/Symbol.cpp
99–100

Can we unittest this function with a ObjectFileJSONTest.cpp? It would be nice to check for the errors.

  • Add unit tests for JSON deserialization
  • Add unit tests for converting JSONSymbol to Symbol
clayborg added inline comments.Mar 8 2023, 3:06 PM
lldb/include/lldb/Core/Debugger.h
594 ↗(On Diff #503432)

I am guessing these changes are in Debugger.h and Debugger.cpp are not related to this diff?

lldb/source/Core/Debugger.cpp
830–845 ↗(On Diff #503432)

I am guessing these changes are in Debugger.h and Debugger.cpp are not related to this diff?

lldb/source/Symbol/Symbol.cpp
780–781

Should this return an llvm::Expected<lldb_private::JSONSymbol> instead of a bool? Or is this fromJSON pattern used everywhere? Then we wouldn't need to fill in "path" and could return an error?

804–805

Should this return an llvm::Expected<SymbolType> instead of a bool? Or is this fromJSON pattern used everywhere? Then we wouldn't need to fill in "path" and could return an error?

lldb/unittests/Symbol/JSONSymbolTest.cpp
35

Change over to using EXPECT_THAT_EXPECTED. Repeat for all cases below.

57

Use EXPECT_THAT_EXPECTED for clarity

82
144–145

Use EXPECT_THAT_EXPECTED for all "Expected<Symbol>" error cases.

156–157

Use EXPECT_THAT_EXPECTED for all "Expected<Symbol>" error cases.

166–167

Use EXPECT_THAT_EXPECTED for all "Expected<Symbol>" error cases.

190–192

Use EXPECT_THAT_EXPECTED for all "Expected<Symbol>" error cases

JDevlieghere marked 21 inline comments as done.
lldb/include/lldb/Core/Debugger.h
594 ↗(On Diff #503432)

Yup. This got unintentionally mixed with D135631.

lldb/source/Symbol/Symbol.cpp
780–781

No, these are specializations/overloads for the JSON library in LLVM. Same below.

clayborg accepted this revision.Mar 8 2023, 5:28 PM

Thanks for the changes! LGTM. Just one missed EXPECT_THAT_EXPECTED, but accepted.

lldb/unittests/Symbol/JSONSymbolTest.cpp
145–146

Missed one EXPECT_THAT_EXPECTED. Feel free to fix and submit without approval.

This revision is now accepted and ready to land.Mar 8 2023, 5:28 PM
labath added inline comments.Mar 14 2023, 9:53 AM
lldb/source/Plugins/SymbolFile/JSON/SymbolFileJSON.cpp
113

Speaking of breakpad, have you considered using SymbolFileBreakpad directly? I think it serves pretty much the same purpose, and it also supports other, more advanced functionality (e.g. line tables and unwind info). It sounds like you don't need that now, but if you ever did, you wouldn't need to reinvent that logic...

JDevlieghere marked an inline comment as done.Mar 17 2023, 10:31 AM
JDevlieghere added inline comments.
lldb/source/Plugins/SymbolFile/JSON/SymbolFileJSON.cpp
113

I feel pretty dumb now: looking at breakpad in more detail, the PUBLIC records pretty much have everything I was looking for. For some reason I was under the impression that it was a binary format and I really wanted something human readable. I think I got it confused with minidumps.

I wasn't planning on adding anything fancy to the JSON format so I agree that if that need arises for something more advanced we should use breakpad.