
GSYM symbolication format
Needs Review, Public

Authored by clayborg on Oct 17 2018, 9:00 AM.

Details

Summary

I got approval to open source the GSYM symbolication format. This patch still needs to get testing added and switched over to use the AsmPrinter to create the GSYM files, but I wanted to post this patch in progress for the LLVM conference to allow folks to see what it is and try it out. Full details on the file format below:

GSYM Introduction

GSYM is a symbolication file format designed to be the best format to use for symbolicating addresses into function name + source file + line information. It is a binary file format designed to be mapped into one or more processes. GSYM information can be created by converting DWARF debug information or Breakpad files. GSYM information can exist as a stand alone file, or be contained in a section of an ELF or mach-o file. When embedded into ELF or mach-o files, GSYM sections can share a string table that already exists within the file.

Why use GSYM?
GSYM files are up to 7x smaller than DWARF files and up to 3x smaller than Breakpad files. The file format is designed to touch as few pages of the file as possible while doing address lookups. GSYM files can be mmap'ed into a process as shared memory allowing multiple processes on a symbolication server to share loaded GSYM pages. The file format includes inline call stack information and can help turn a single address lookup into multiple stack frames that walk the inlined call stack back to the concrete function that invoked these functions.

Converting DWARF Files to GSYM
llvm-dsymutil is available in the llvm/tools/gsym directory and has options to convert DWARF into GSYM files. llvm-dsymutil has a -dwarf option that specifies a DWARF file to convert into a GSYM file. The output file can be specified with the -out-file option.

$ llvm-dsymutil -dwarf /tmp/a.out -out-file /tmp/a.out.gsym

This command will convert a DWARF file into the GSYM file format. This allows clients that are currently symbolicating with DWARF to switch to using the GSYM file format. This tool could be used in a symbolication workflow where symbolication servers convert DWARF to GSYM on the fly and cache the results, or it could be used at build time to always produce a GSYM file. DWARF debug information is rich enough to support encoding the inline call stack information for richer and more useful symbolication backtraces.

Converting Breakpad Files to GSYM

llvm-dsymutil has a -breakpad option that specifies a Breakpad file to convert into a GSYM file. The output file can be specified with the -out-file option.

$ llvm-dsymutil -breakpad /tmp/foo.sym -out-file /tmp/foo.gsym

This allows clients currently using Breakpad to switch over to use GSYM files. This tool could be used in a symbolication workflow where symbolication servers convert Breakpad to GSYM format on the fly only when needed. Breakpad files do not contain inline call stack information, so it is advisable to use llvm-dsymutil -dwarf when possible to avoid losing this vital information.

File Format Overview
The GSYM file consists of a header, an address table, an address data offsets table, a file table, a string table and the address data for each address.

The GSYM file format when in a stand alone file is ordered as shown:

  • Header
  • Address Table
  • Address Data Offsets Table
  • File Table
  • String Table
  • Address Data

Header

#define GSYM_MAGIC 0x4753594d
#define GSYM_VERSION 1
struct Header {
  uint32_t magic;
  uint16_t version;
  uint8_t  addr_off_size;
  uint8_t  uuid_size;
  uint64_t base_address;
  uint32_t num_addrs;
  uint32_t strtab_offset;
  uint32_t strtab_size;
  uint8_t  uuid[20];
};

The magic value is set to GSYM_MAGIC and allows quick and easy detection of this file format when it is loaded. Addresses in the address table are stored as offsets from a 64 bit address found in Header.base_address. This allows the address table to contain 32, 16 or 8 bit offsets, instead of a table of full sized addresses. A smaller address table makes the file smaller and causes fewer pages to be touched during address lookups. The size of the address offsets in the address table is specified in the header in Header.addr_off_size. The header contains a UUID to ensure the GSYM file can be properly matched to the ELF or mach-o object file that created the stack trace. The header specifies the location of the string table for all strings contained in the GSYM file, or can point to an existing string table within an ELF or mach-o file.
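
As a rough illustration of the header description above, the sketch below checks the magic value and rebuilds a full address from a stored offset. The Header layout is copied from the struct above. The helper names isValidHeader and decodeAddress are hypothetical, and the sketch assumes the file was written in the host's byte order and that only 1, 2 and 4 byte offsets are valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint32_t GSYM_MAGIC = 0x4753594d;
constexpr uint16_t GSYM_VERSION = 1;

struct Header {
  uint32_t magic;
  uint16_t version;
  uint8_t addr_off_size;
  uint8_t uuid_size;
  uint64_t base_address;
  uint32_t num_addrs;
  uint32_t strtab_offset;
  uint32_t strtab_size;
  uint8_t uuid[20];
};

// Hypothetical helper: returns true if the header looks like a version 1
// GSYM header with a supported address offset size.
inline bool isValidHeader(const Header &H) {
  if (H.magic != GSYM_MAGIC || H.version != GSYM_VERSION)
    return false;
  if (H.uuid_size > sizeof(H.uuid))
    return false;
  // Assumption: 8, 16 and 32 bit offsets are the only valid encodings.
  return H.addr_off_size == 1 || H.addr_off_size == 2 || H.addr_off_size == 4;
}

// Hypothetical helper: reads the address offset at Index from the raw
// address table (host byte order) and adds Header.base_address to it.
inline uint64_t decodeAddress(const Header &H, const uint8_t *AddrTable,
                              uint32_t Index) {
  assert(isValidHeader(H) && Index < H.num_addrs);
  const uint8_t *P = AddrTable + std::size_t(Index) * H.addr_off_size;
  uint64_t Offset = 0;
  if (H.addr_off_size == 1) {
    Offset = *P;
  } else if (H.addr_off_size == 2) {
    uint16_t V;
    std::memcpy(&V, P, sizeof(V));
    Offset = V;
  } else {
    uint32_t V;
    std::memcpy(&V, P, sizeof(V));
    Offset = V;
  }
  return H.base_address + Offset;
}
```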

Address Table
The address table immediately follows the header in the file and consists of Header.num_addrs address offsets. These offsets are sorted and can be binary searched for efficient lookups. Address offsets are encoded as offsets that are Header.addr_off_size bytes in size. During address lookup, the index of the matching address offset will be the index into the address data offsets table.
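
The lookup described above can be sketched as a binary search over the sorted offsets. This is only an illustration: it assumes 32 bit offsets already loaded into a std::vector, and findAddressIndex is a hypothetical name. It returns the index of the last entry at or below the lookup address, which is then the index into the address data offsets table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: finds the index of the last address table entry whose
// address (BaseAddress + offset) is <= Addr, or nullopt if Addr is below
// every entry.
inline std::optional<uint32_t>
findAddressIndex(uint64_t BaseAddress, const std::vector<uint32_t> &AddrOffsets,
                 uint64_t Addr) {
  assert(std::is_sorted(AddrOffsets.begin(), AddrOffsets.end()));
  if (Addr < BaseAddress)
    return std::nullopt;
  const uint64_t Rel = Addr - BaseAddress;
  // First entry strictly greater than Rel; the match is the one before it.
  auto It = std::upper_bound(AddrOffsets.begin(), AddrOffsets.end(), Rel);
  if (It == AddrOffsets.begin())
    return std::nullopt;
  return static_cast<uint32_t>(It - AddrOffsets.begin() - 1);
}
```

The caller still has to compare the address against the function's size (see the address data section) before trusting the match.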

Address Data Offsets Table
The address data offsets table immediately follows the address table and consists of Header.num_addrs 32 bit file offsets: one for each address in the address table. The offsets in this table are the absolute file offset to the address data for each address in the address table. Keeping this data separate from the address table helps to reduce the number of pages that are touched when address lookups occur on a GSYM file.
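
To make the "immediately follows" wording concrete, here is a small sketch that computes where each table would start in a stand alone file. It assumes there is no alignment padding between the tables, which the description above does not state either way. TableLayout and computeLayout are hypothetical names, and the header size is passed in rather than hard-coded.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of where the first three tables start.
struct TableLayout {
  uint64_t AddrTableOffset;
  uint64_t AddrDataOffsetsTableOffset;
  uint64_t FileTableOffset;
};

// Assumption: tables are packed back to back with no padding.
inline TableLayout computeLayout(uint64_t HeaderSize, uint32_t NumAddrs,
                                 uint8_t AddrOffSize) {
  assert(AddrOffSize == 1 || AddrOffSize == 2 || AddrOffSize == 4);
  TableLayout L;
  L.AddrTableOffset = HeaderSize;
  // Header.num_addrs entries of Header.addr_off_size bytes each.
  L.AddrDataOffsetsTableOffset =
      L.AddrTableOffset + uint64_t(NumAddrs) * AddrOffSize;
  // Header.num_addrs 32 bit file offsets.
  L.FileTableOffset =
      L.AddrDataOffsetsTableOffset + uint64_t(NumAddrs) * sizeof(uint32_t);
  return L;
}
```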

File Table
The file table immediately follows the address data offsets table. The format of the FileTable is:

struct FileTable {
  uint32_t count;
  FileInfo files[];
};

The file table starts with a 32 bit count of the number of files that are used in all of the address data, followed by that number of FileInfo structures.

Each file in the file table is represented with a FileInfo structure:

struct FileInfo {
  uint32_t directory;
  uint32_t filename;
};

The FileInfo structure has the file path split into a string for the directory and a string for the filename. The directory and filename are specified as offsets into the string table. Splitting paths into directory and file base name allows GSYM to use the same string table entry for common directories.
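
A minimal sketch of how a reader might turn a FileInfo back into a path, assuming NUL-terminated strings in the string table and a "/" separator. stringAt and fullPath are hypothetical helpers, not functions from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct FileInfo {
  uint32_t directory; // String table offset of the directory.
  uint32_t filename;  // String table offset of the file base name.
};

// Hypothetical helper: returns the NUL-terminated string that starts at
// Offset in the string table.
inline std::string stringAt(const std::vector<char> &StrTab, uint32_t Offset) {
  assert(Offset < StrTab.size() && "offset outside string table");
  auto Begin = StrTab.begin() + Offset;
  auto End = std::find(Begin, StrTab.end(), '\0');
  return std::string(Begin, End);
}

// Hypothetical helper: joins directory and base name into one path.
inline std::string fullPath(const std::vector<char> &StrTab,
                            const FileInfo &F) {
  std::string Dir = stringAt(StrTab, F.directory);
  std::string Name = stringAt(StrTab, F.filename);
  if (Dir.empty())
    return Name;
  return Dir + "/" + Name;
}
```

Two files in the same directory carry the same directory offset, so the directory string is stored only once.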

String Table
The string table follows the file table in stand alone GSYM files and contains all strings for everything contained in the GSYM file. Any string data should be added to the string table and any references to strings inside GSYM information must be stored as 32 bit string table offsets into this string table.
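
For illustration only, a string table that hands out 32 bit offsets and stores each distinct string once could look like the sketch below. StringTableSketch is a hypothetical class (it is not the StringTableCreator from the patch), and reserving offset 0 for the empty string is an assumption borrowed from ELF-style string tables.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

class StringTableSketch {
public:
  // Assumption: offset 0 is reserved for the empty string.
  StringTableSketch() : Data(1, '\0') {}

  // Returns the offset of S, appending it only if it was not added before.
  uint32_t add(const std::string &S) {
    if (S.empty())
      return 0;
    auto It = Offsets.find(S);
    if (It != Offsets.end())
      return It->second;
    assert(Data.size() + S.size() + 1 <= UINT32_MAX && "table too large");
    const uint32_t Off = static_cast<uint32_t>(Data.size());
    Data.append(S);
    Data.push_back('\0');
    Offsets.emplace(S, Off);
    return Off;
  }

  // Raw bytes as they would be written to the file.
  const std::string &bytes() const { return Data; }

private:
  std::string Data;
  std::unordered_map<std::string, uint32_t> Offsets;
};
```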

Address Data
The address data is the payload that contains information about the address that is being looked up. The structure that represents this data is:

struct AddressInfo {
    uint32_t size;
    uint32_t name;
    AddressData data[];
};

It starts with a 32 bit size for the address range of the function and is followed by the 32 bit string table offset for the name of the function. The size of the address range is important to encode as it stops address lookups from matching when the address falls in padding between two functions. This is followed by an array of address data information:
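
The role of the size field can be sketched as follows: a lookup first finds the closest function start at or below the address, then rejects the match if the address lies past the end of that function. FuncRange, containsAddress and findContaining are hypothetical names, and the functions are held in a plain vector purely for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct FuncRange {
  uint64_t Start; // Function start address.
  uint32_t Size;  // AddressInfo.size for that function.
};

inline bool containsAddress(uint64_t Start, uint32_t Size, uint64_t Addr) {
  return Addr >= Start && Addr - Start < Size;
}

// Hypothetical helper: returns the index of the function containing Addr,
// or nullopt if Addr falls before all functions or in padding between them.
inline std::optional<std::size_t>
findContaining(const std::vector<FuncRange> &Funcs, uint64_t Addr) {
  assert(std::is_sorted(Funcs.begin(), Funcs.end(),
                        [](const FuncRange &A, const FuncRange &B) {
                          return A.Start < B.Start;
                        }));
  auto It = std::upper_bound(
      Funcs.begin(), Funcs.end(), Addr,
      [](uint64_t A, const FuncRange &F) { return A < F.Start; });
  if (It == Funcs.begin())
    return std::nullopt;
  --It;
  if (!containsAddress(It->Start, It->Size, Addr))
    return std::nullopt;
  return static_cast<std::size_t>(It - Funcs.begin());
}
```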

struct AddressData {
    uint32_t type;
    uint32_t length;
    uint8_t data[length];
};

The address data starts with a 32 bit type, followed by a 32 bit length, followed by an array of bytes that encode each specific kind of data.
The AddressData.type is an enumeration value:

enum class InfoType {
   EndOfList = 0u,
   LineTableInfo = 1u,
   InlineInfo = 2u
};

The AddressInfo.data[] is encoded as a vector of AddressData structs that is terminated by an AddressData struct whose type is set to InfoType::EndOfList. This allows the GSYM file format to contain arbitrary data for any address range and allows us to expand the GSYM capabilities as we find more uses for it.

InfoType::EndOfList is always the last AddressData in the AddressInfo.
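
As a sketch of how a reader could walk this type/length/data list, the code below collects every entry until it sees InfoType::EndOfList. It assumes host byte order, stops as soon as the terminating type is read (so it does not depend on whether the terminator carries a length field), and uses the hypothetical names AddressDataView and parseAddressData.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

enum class InfoType : uint32_t {
  EndOfList = 0u,
  LineTableInfo = 1u,
  InlineInfo = 2u
};

// Non-owning view of one AddressData entry inside the buffer.
struct AddressDataView {
  InfoType Type;
  uint32_t Length;
  const uint8_t *Data;
};

// Hypothetical helper: returns all entries before EndOfList, or nullopt if
// the buffer ends before a terminator is found.
inline std::optional<std::vector<AddressDataView>>
parseAddressData(const uint8_t *Buf, std::size_t Size) {
  assert(Buf != nullptr || Size == 0);
  std::vector<AddressDataView> Result;
  std::size_t Pos = 0;
  auto Read32 = [&](uint32_t &Out) {
    if (Size - Pos < sizeof(uint32_t))
      return false;
    std::memcpy(&Out, Buf + Pos, sizeof(uint32_t));
    Pos += sizeof(uint32_t);
    return true;
  };
  for (;;) {
    uint32_t Type = 0, Length = 0;
    if (!Read32(Type))
      return std::nullopt;
    if (Type == static_cast<uint32_t>(InfoType::EndOfList))
      return Result;
    if (!Read32(Length) || Size - Pos < Length)
      return std::nullopt;
    // Unknown types are kept as-is so callers can skip what they don't know.
    Result.push_back({static_cast<InfoType>(Type), Length, Buf + Pos});
    Pos += Length;
  }
}
```

Because every entry carries its own length, a reader can skip entry types it does not understand, which is what lets the format grow new AddressData kinds later.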

InfoType::LineTableInfo is a modified version of the DWARF line tables that efficiently stores line table information for each function. DWARF stores line table information for an entire source file and includes all functions. Having each function's line table encoded separately allows fewer pages to be touched when looking up the line entry for a specific address. The information is optional and can be omitted for address data that is from a symbol or label where no line table information is available.

InfoType::InlineInfo is a format that encodes inline call stacks. This information is optional and doesn't need to be included for each address. If the function has no inlined functions this data should not be included.

Event Timeline

JDevlieghere added inline comments. Nov 5 2018, 10:03 AM
include/llvm/DebugInfo/GSYM/FileTableCreator.h
27

Commented out code.

include/llvm/DebugInfo/GSYM/GsymCreator.h
48

s/guard/Guard/

include/llvm/DebugInfo/GSYM/GsymReader.h
57

llvm::Error?

lib/DebugInfo/GSYM/Breakpad.cpp
50

lowerCamelCase for methods.

lib/DebugInfo/GSYM/DwarfTransformer.cpp
162

No braces. Same for the single-line if-blocks below.

217

You could make this an early return and save some indentation.

247

No braces, same for the rest of the file.

263

Make this an early return.

495

Can you run clang-format over your patch (again)? It should've removed these double newlines.

lib/DebugInfo/GSYM/GsymReader.cpp
213

Since there's no dependency for the fall-through case I'd turn this into an early return as well, and duplicate this line.

216

StringRef

228

Remove commented-out code.

407

Looks like this code is always the same except the cast, I'd factor this out.

lib/DebugInfo/GSYM/LineTable.cpp
25

UpperCamelCase.

34

MinLineDelta, etc

55

s/offset/Offset, same for the other variable names in this file.

JDevlieghere requested changes to this revision. Nov 5 2018, 10:03 AM
This revision now requires changes to proceed. Nov 5 2018, 10:03 AM
lemo added a subscriber: zturner. Nov 5 2018, 11:56 AM

Greg, what do you have in mind as the next step? LLDB integration?

Next step, I would like to bump to version 2 to address some of Mark Mentovai's comments:

  • bump string table offset and size to 64 bit in case GSYM is a section in a large file where we share string table
  • allow UUID to be any size and only emit what we need instead of padding out to 20 bytes
  • few other things
  • possibly move number of files into header

Follow up from the community (including from me):

  • agree upon an unwinding format and add support for what we agree upon
    • create a .debug_frame to above format converter
  • any other improvements or AddressInfo types
  • LLDB integration with GSYM format
lemo added a comment. Nov 5 2018, 1:49 PM

Cool. Looking forward to the LLDB integration!

clayborg updated this revision to Diff 172650. Nov 5 2018, 1:56 PM

Fixes:

  • Take care of issues found by Jonas
  • Rename variables to be camel case
  • Remove commented out code

Thanks Greg! I came across a few more places that don't need braces but otherwise this looks good style-wise.

include/llvm/DebugInfo/GSYM/DwarfTransformer.h
38

No braces

46

No braces

include/llvm/DebugInfo/GSYM/InlineInfo.h
38

/// for Doxygen comments?

include/llvm/DebugInfo/GSYM/LineTable.h
28

You should consider using ///< so it becomes a Doxygen comment.

lib/DebugInfo/GSYM/DwarfTransformer.cpp
74

No braces

82

You could make this an early break by inverting the condition. Same for the condition below. I think that would improve the readability of this code.

110

No braces

115

No braces

128

No braces

153

No braces

168

No braces

175

No braces

391

No braces

407

Simplify this with early break.

422

No braces

444

No braces

449

No braces

483

No braces

497

No braces

lib/DebugInfo/GSYM/FileTableCreator.cpp
35

No braces

lib/DebugInfo/GSYM/FunctionInfo.cpp
26

No braces

lib/DebugInfo/GSYM/GsymReader.cpp
61

I strongly prefer wrapping this in an llvm::StringError. It's a little more verbose but keeps things consistent.

256

No braces

lib/DebugInfo/GSYM/InlineInfo.cpp
97

No braces

clayborg updated this revision to Diff 172784. Nov 6 2018, 10:04 AM

Fix all of Jonas' issues.

Friendly ping

One more nit: can you turn your comments into full sentences? Most of them are already like that but some are missing capitalization and others are lacking a full stop. I didn't do another pass but other than that it looked alright style-wise.

Unfortunately there's nothing else I can do on this since I don't feel enough ownership over the top level lib/DebugInfo folder to approve a new sub-folder. If you still can't get any traction from someone who does have that kind of ownership, one idea might be to make a post on llvm-dev@ pointing them to this review.

I have this ownership and am on the review list. This is a large patch and a lot of work to review and I'll get to it as soon as I can.

2 month ping

Hi Greg,

I've had a lot of time to review this (thanks for that) and do apologize for taking so long. I have a couple of concerns about this so bear with me and let's see where we can get:

a) This looks like it's a standalone tool that's being added to llvm, but that really doesn't involve anything coming out of llvm right now?
b) This seems to be largely a binary encoding of a breakpad file and not a new debug format?
c) What are the future plans for this code?

Mostly I'm trying to figure out what we want to do with this in llvm. It seems like something that would be good for the breakpad project mostly?

Thanks!

-eric

a) This looks like it's a standalone tool that's being added to llvm, but that really doesn't involve anything coming out of llvm right now?

It has a few tools:
dwarf2gsym
bpad2gsym

It also has all of the details for parsing and creating the format.

The dwarf2gsym uses the LLVM DWARF parser to parse and convert DWARF to GSYM format so that is a huge part of LLVM that is being used.

bpad2gsym converts the textual Breakpad format to GSYM. There are many servers out there that use very large text files in the Breakpad format (Breakpad is a Google project) for symbolication, which wastes a ton of CPU time because the file format is a big blob of text. So seeing as Breakpad and Crashpad want to adopt this format, it seemed like LLVM was a good place to put it so that it can get adopted by these Google teams. They already have a DWARF to Breakpad conversion tool out there somewhere. That tool might have its own DWARF parser, which seems like a waste when it could share the very nice LLVM DWARF parser, which keeps up with the standard better. We had Breakpad users at Facebook having to fix DWARF parsing bugs as DWARF moved to DWARF4 recently, and I was surprised to find a tool that had its own DWARF parser. So sharing of LLVM technologies seemed to make more sense seeing as Google folks own Breakpad and Crashpad.

b) This seems to be largely a binary encoding of a breakpad file and not a new debug format?

It is a more efficiently encoded symbolication format for address to information lookups (source file and line, and inline call stack). It isn't a replacement for DWARF, but it can be a complete replacement for any users of -gline-tables-only. It is designed to allow crash reporting tools and servers that are parsing millions of symbolication requests to symbolicate many orders of magnitude faster than using DWARF. It is designed to be mmap'ed and shared by one or more processes and used as is, with no setup and no sorting of the DWARF "accelerator" tables (which are random indexes). Unlike DWARF, we can mmap this in and use it much like the Apple accelerator tables. All information for each function is in a single blob of data, whereas in DWARF it is scattered across .debug_info, .debug_line, .debug_abbrev, .debug_ranges, and more sections, making symbolication very expensive (file cache and performance) when using DWARF. With DWARF, you must check .debug_aranges for the address after parsing all of .debug_aranges and sorting the random address list, or linearly search all CUs for their DW_AT_ranges. Then, if you find a CU, you parse ALL DWARF for that CU until you find the function info that is correct, then go parse the line table for the entire CU, and pull out just the bits you cared about for the function.

c) What are the future plans for this code?

A few things I can think of:

  • have compiler add the GSYM data in as a section when compiling and linking. The GSYM data can share the string table from the .debug_str or the symbol table, so this information can be added in along with DWARF to get much better symbolication performance alongside other DWARF and debug info data.
  • replace -gline-tables-only with this for better performance or symbolication
  • add unwind information to the address info to allow symbolication tools that might be doing stack backtraces in process, or external tools, to backtrace correctly when given async unwind info for a function. We can also specify whether the information is asynchronous: if it is, we can trust it to unwind first frames; if it isn't, only use it to unwind non-first frames or frames that do not follow a sigtramp frame
  • Add DWARF DIE offset info to the address info for each address to allow this to be used as a better address accelerator table. Right now DWARF .debug_aranges are just random addresses to CU offset (not DIE offset).
  • use this format more in profiling tools that might need to backtrace or gather data. We saved thousands of machines by switching to GSYM here at Facebook for symbolication and for real time CPU profiling data
  • possibly get this accepted into DWARF format as a replacement for .debug_aranges?

Mostly I'm trying to figure out what we want to do with this in llvm. It seems like something that would be good for the breakpad project mostly?

From the above stuff I hope you might be able to see where we go with this. But this format applies to anyone wanting to do very quick address to data lookups. mmap in, use the tables, better line table encoding than DWARF (we have a single file table where DWARF has one per source file), get inline call stack unwinding in cases where you want to symbolicate.

My idea behind putting it into LLVM is to allow any compiler that uses LLVM to add this accelerator table as a section in their .o files or their linked executables, or to make stand alone GSYM files for server symbolication. The dwarf2gsym conversion tool leverages the LLVM DWARF parser to convert DWARF to this format for people that aren't able to build it into their .o files or binaries at compile/link time, but I would love to see this format be able to be added to .o files and symbol files during build time.

Let me know what you think. I would be happy to meet at Google to discuss further over lunch, or have any folks come up to Facebook.

mgrang added inline comments. Feb 26 2019, 9:50 AM
include/llvm/DebugInfo/GSYM/GsymCreator.h
75

Please use range based llvm::sort instead of std::sort.

llvm::sort(Funcs);

See https://llvm.org/docs/CodingStandards.html#beware-of-non-deterministic-sorting-order-of-equal-elements for more details.
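
A related way to keep the output deterministic, independent of which sort function is called, is to give the element type a comparison that leaves no two distinct elements "equal". The sketch below uses std::sort only so that it stays self-contained outside the LLVM tree; in the patch itself the call would be llvm::sort(Funcs) as suggested above. FuncSketch and its fields are hypothetical stand-ins for the real function info type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the function info stored by the creator.
struct FuncSketch {
  uint64_t Start;
  uint32_t Size;
  std::string Name;
};

// Total order: ties on Start are broken by Size, then by Name, so the sorted
// result does not depend on the input order.
inline bool operator<(const FuncSketch &A, const FuncSketch &B) {
  return std::tie(A.Start, A.Size, A.Name) < std::tie(B.Start, B.Size, B.Name);
}

inline void sortFuncs(std::vector<FuncSketch> &Funcs) {
  std::sort(Funcs.begin(), Funcs.end());
  assert(std::is_sorted(Funcs.begin(), Funcs.end()));
}
```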

phosek added a subscriber: phosek. Apr 18 2019, 5:09 PM

We have a use case for this other than Breakpad.

Awesome, can you talk about it a bit more at the moment? Also perhaps start a review :)

-eric

This is just where I got tired today but I think I can recommend how to split this up so I could move faster and provide more useful high level review. Prior to splitting I'll keep chugging away for at least a bit each day.

  1. Only add functionality for creating GSym in memory and associated unit tests (no reading from a file).
  2. Add functionality for reading from a file. If easy enough to do we should ignore inline info in this patch to make it smaller. Add gsymutil in this change and add llvm-lit tests for gsymutil.
  3. Add functionality for writing to a file.
  4. Add functionality for reading from breakpad.
  5. Add functionality for writing to breakpad.
  6. Add inline info if it wasn't added in 2
  7. Add functionality for integration into MC

Also it isn't clear to me (again I haven't really gone through anything but the headers) what the clear methods or many of the operator overloads are for. Removing unused versions of them would be helpful.

include/llvm/DebugInfo/GSYM/Breakpad.h
21

It's more typical to use Error instead of std::error_code. Error has some sharp edges to it because it forces you to check the error.

include/llvm/DebugInfo/GSYM/DwarfTransformer.h
30

I think I'd prefer this return errors via Error and generally use its interface to allow users to report their own errors. What's the motivation behind passing a log like this?

35

I think it would be more llvm-ish to use Error here but you should be able to swap between the two.

Also I think having methods that finish constructing isn't ideal. Ideally this would happen in the constructor, but since we have to handle errors, perhaps keep a private "incomplete" constructor like this one, make these methods private, and then provide a static function that returns an Expected<DwarfTransformer> by first calling the incomplete constructor and then completing the object using these methods.
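
The suggested pattern can be sketched as below. To stay self-contained the sketch uses std::variant<T, std::string> as a stand-in for llvm::Expected<T>, and DwarfTransformerSketch, create and load are hypothetical names rather than the interface of the patch.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

class DwarfTransformerSketch {
public:
  // Static factory: runs the private "incomplete" constructor, then the
  // private completion step, and returns either the object or an error.
  static std::variant<DwarfTransformerSketch, std::string>
  create(std::string Path) {
    DwarfTransformerSketch T;
    std::string Err = T.load(std::move(Path));
    if (!Err.empty())
      return Err;
    return T;
  }

  const std::string &path() const {
    assert(!Path.empty() && "only fully constructed objects are handed out");
    return Path;
  }

private:
  DwarfTransformerSketch() = default; // Incomplete until load() succeeds.

  // Stand-in for the real loading work; returns an error message or "".
  std::string load(std::string P) {
    if (P.empty())
      return "no input file given";
    Path = std::move(P);
    return {};
  }

  std::string Path;
};
```

Callers never see a half-constructed object: they either get a usable transformer or an error they have to handle.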

43

Perhaps this is part of the reason for using these methods but it seems really inefficient to reload the file when loadDwarf needs to be called anyway. Is there a use case for that?

include/llvm/DebugInfo/GSYM/FileEntry.h
24

How do we want to handle endianness? Are we just assuming that we'll only be working on the target system? Is it always little endian?

The standard thing to do would be to use https://llvm.org/doxygen/Endian_8h_source.html and use packed_endian_specific_integral. This has been extremely successful in llvm.
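
For readers unfamiliar with the idea, fixing the on-disk byte order means every multi-byte field is read and written through helpers like the ones below, which give the same result on little and big endian hosts. This is a plain C++ sketch of the concept with hypothetical names readLE32 and writeLE32; it does not use the LLVM Endian.h types themselves.

```cpp
#include <cassert>
#include <cstdint>

// Reads a 32 bit little endian value regardless of host byte order.
inline uint32_t readLE32(const uint8_t *P) {
  assert(P != nullptr);
  return uint32_t(P[0]) | (uint32_t(P[1]) << 8) | (uint32_t(P[2]) << 16) |
         (uint32_t(P[3]) << 24);
}

// Writes a 32 bit value in little endian order regardless of host byte order.
inline void writeLE32(uint8_t *P, uint32_t V) {
  assert(P != nullptr);
  P[0] = uint8_t(V & 0xff);
  P[1] = uint8_t((V >> 8) & 0xff);
  P[2] = uint8_t((V >> 16) & 0xff);
  P[3] = uint8_t((V >> 24) & 0xff);
}
```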

47

nit: You could use llvm::hash_combine

include/llvm/DebugInfo/GSYM/FileWriter.h
23

Whether raw_ostream or something like FileOutputBuffer is used varies across LLVM. I generally prefer using FileOutputBuffer for binary output. FileOutputBuffer has the unfortunate problem that it doesn't have an abstraction that lets you choose between using MemoryBuffer or other such things that do have abstractions; otherwise this would be a fairly simple choice. Also if you want to stream output for memory reasons you might prefer to use a raw_ostream.

The general pattern I've observed for overcoming the abstraction shortcomings of FileOutputBuffer is to make 'write' methods accept a uint8_t* instead. The consequence is that you often need a reinterpret_cast.

Is there a reason an ostream was used here instead of a raw_ostream?

include/llvm/DebugInfo/GSYM/GsymCreator.h
49

Can you comment on the expected parallel use here? I saw above that you expect to use multiple threads to create the gsym data but it isn't clear to me that this sort of thing makes sense if the threads are going to be under constant contention. Will many threads be calling these methods rapidly or will each thread do a bunch of work between calls? I'd expect the former, which makes me skeptical. Have you benchmarked this?

include/llvm/DebugInfo/GSYM/GsymReader.h
65

Same comment as for the 'loadFoo' things above. I'd expect there to be a create function that returns an Expected<GsymReader>.

69

comment what Verbose does, it isn't clear to me.

include/llvm/DebugInfo/GSYM/Range.h
31–35

nit: These sorts of methods seem superfluous to me but don't worry about removing them.

include/llvm/DebugInfo/GSYM/StringTableCreator.h
21

I almost commented on this issue during the meeting. The standard way these are built is using StringTableBuilder in MC. It uses a finalization technique that makes things a bit more difficult because you have to request the indexes *after* finalization. This enables the standard string table compression technique for shared suffixes.

Switching to that might however cause your interface problems. It would be nice to switch but this might require a large architectural switch. It seems a shame that a chosen interface would overrule a filesize optimization however.
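
The trade-off mentioned here can be seen in a toy version of suffix sharing: offsets are only known once all strings have been collected, because a short string may end up pointing into the tail of a longer one. The quadratic sketch below only illustrates the idea and is not how StringTableBuilder is implemented; TailMergedTable, endsWith and buildTailMerged are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct TailMergedTable {
  std::string Bytes;                       // Final string table contents.
  std::map<std::string, uint32_t> Offsets; // Offset assigned to each string.
};

inline bool endsWith(const std::string &S, const std::string &Suffix) {
  return S.size() >= Suffix.size() &&
         S.compare(S.size() - Suffix.size(), Suffix.size(), Suffix) == 0;
}

// Places longer strings first so shorter ones can reuse their tails.
inline TailMergedTable buildTailMerged(std::vector<std::string> Strings) {
  std::sort(Strings.begin(), Strings.end(),
            [](const std::string &A, const std::string &B) {
              return A.size() != B.size() ? A.size() > B.size() : A < B;
            });
  Strings.erase(std::unique(Strings.begin(), Strings.end()), Strings.end());
  TailMergedTable T;
  std::vector<std::string> Placed;
  for (const std::string &S : Strings) {
    bool Shared = false;
    for (const std::string &P : Placed) {
      if (endsWith(P, S)) {
        // S is a suffix of P: point into the tail of P instead of storing S.
        T.Offsets[S] = T.Offsets[P] + static_cast<uint32_t>(P.size() - S.size());
        Shared = true;
        break;
      }
    }
    if (Shared)
      continue;
    assert(T.Bytes.size() <= UINT32_MAX && "table too large");
    T.Offsets[S] = static_cast<uint32_t>(T.Bytes.size());
    T.Bytes += S;
    T.Bytes.push_back('\0');
    Placed.push_back(S);
  }
  return T;
}
```

With the inputs "bar", "foobar" and "baz", "bar" is not stored on its own; it gets an offset inside "foobar". That is why a creator that hands out offsets immediately cannot use this optimization without fixups.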

clayborg marked 12 inline comments as done. Thu, Jun 6, 8:32 AM

I will work on splitting this patch as requested

Also it isn't clear to me (again I haven't really gone through anything but the headers) what the clear methods or many of the operator overloads are for. Removing unused versions of them would be helpful.

We can go through each thing in each individual patch as I submit them.

include/llvm/DebugInfo/GSYM/Breakpad.h
21

sounds good, will switch this over to llvm::Error

include/llvm/DebugInfo/GSYM/DwarfTransformer.h
30

It was just how we did things for the Facebook code which was outside of llvm since it was a command line only tool. We can have the DwarfTransformer contain a list of errors and warnings and report those after the fact. Would that be better? So maybe have DwarfTransformer have a std::vector<llvm::Error> as a member variable and possibly std::vector<std::string> for warnings?:

std::mutex ErrorsWarningsMutex; // Allow multi-threaded access to Errors and Warnings
std::vector<llvm::Error> Errors;
std::vector<std::string> Warnings;

35

llvm::Error is fine. Also fine to move things around as needed and use static creation methods.

43

Some of this complexity I believe came from the different Facebook internal sources that had sharding built in. I didn't want to add sharding (break up one GSYM file into multiple parts) in the first check-in so some of this is left over from that. Also some of the work was done by a non llvm person after I checked in my sources. As you said before, many of these functions should be private and or could be merged into a single function.

include/llvm/DebugInfo/GSYM/FileEntry.h
24

If we are going to encode into object files, we need to have the magic value in the header tell us the byte order IMHO. We really want it to be the same as the system that will use it, since we want to mmap the file into memory and use it as efficiently as possible. Not sure how well things would perform if we forced a byte order on the file using things like llvm::support::ulittle32_t.
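
A minimal sketch of what "the magic value tells us the byte order" could mean in practice: the reader loads the first 32 bits in host order and compares them against GSYM_MAGIC and its byte-swapped form. ByteOrder, byteSwap32 and detectByteOrder are hypothetical names, and how a swapped file would then be handled is left open, as it is in the comment above.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t GSYM_MAGIC = 0x4753594d;

enum class ByteOrder { Native, Swapped, Unknown };

constexpr uint32_t byteSwap32(uint32_t V) {
  return ((V & 0x000000ffu) << 24) | ((V & 0x0000ff00u) << 8) |
         ((V & 0x00ff0000u) >> 8) | ((V & 0xff000000u) >> 24);
}

// MagicAsRead is the first 32 bits of the file loaded in host byte order.
inline ByteOrder detectByteOrder(uint32_t MagicAsRead) {
  if (MagicAsRead == GSYM_MAGIC)
    return ByteOrder::Native; // Can be mmap'ed and used as is.
  if (MagicAsRead == byteSwap32(GSYM_MAGIC))
    return ByteOrder::Swapped; // Written on a host with the other byte order.
  return ByteOrder::Unknown;   // Not a GSYM file.
}
```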

include/llvm/DebugInfo/GSYM/FileWriter.h
23

No reason. Happy to switch. Just trying to avoid the ASMWriter as it requires so much of llvm (targets and more) to be loaded and made making a 32 bit or 64 bit file with a byte order much harder as you had to match it up with a target just to get those values to match.

include/llvm/DebugInfo/GSYM/GsymCreator.h
49

Doing each compile unit in DWARF separately does speed things up when we tested really large binaries here at Facebook. Happy to try it both ways to make sure. But parsing DWARF is generally grabbing a DIE, getting its address ranges, if it has valid ranges, then parse the line table entries that are only the line entries for the function itself, create the inline info, go onto next stage. The string population happens after we get the function ranges if they are valid, and during the line table parsing. We can easily cache a DWARF file index to GSYM file index if that isn't already being done since the first time we add a file from DWARF to GSYM we will need to unique the directory and basename strings, but we should only need to do that once per file index for compile unit line table file. So I believe this will work out as there is plenty of work to do between

75

will do!

include/llvm/DebugInfo/GSYM/GsymReader.h
65

will do

69

will do

include/llvm/DebugInfo/GSYM/Range.h
35

I like them in case I switch the contents to be "uint64_t Start; uint64_t Size;". Just allows less code changes if we switch this around. I know it is already a struct, so if I do this, this should be a class where "Start" and "End" are private.

include/llvm/DebugInfo/GSYM/StringTableCreator.h
21

Yeah, I believe I tried to avoid the MC layer as it required llvm targets to be available and you had to pick an architecture to match the address byte size and byte ordering. Happy to use other things from LLVM where possible and not too large of a pain. But not getting offsets right away seems like a shame and would change a lot of things to require fixups before they were emitted (all function names in FunctionInfo, directories and basenames in FileEntry objects, inline function infos, etc). And we might need a special class still if we ever emit into an existing object file because we would want to be able to reuse any strings in .debug_str or other ELF string tables.

clayborg updated this revision to Diff 203385. Thu, Jun 6, 9:19 AM

Rebase to current llvm sources.