This is an archive of the discontinued LLVM Phabricator instance.

GSYM symbolication format
AbandonedPublic

Authored by clayborg on Oct 17 2018, 9:00 AM.

Details

Summary

I got approval to open source the GSYM symbolication format. This patch still needs to get testing added and switched over to use the AsmPrinter to create the GSYM files, but I wanted to post this patch in progress for the LLVM conference to allow folks to see what it is and try it out. Full details on the file format below:

GSYM Introduction

GSYM is a symbolication file format is designed to be the best format to use for symbolicating addresses into function name + source file + line information. It is a binary file format designed to be mapped into one or more processes. GSYM information can be created by converting DWARF debug information, or Breakpad files. GSYM information can exist as a stand alone file, or be contained in ELF or mach-o files in a section. When embedded into ELF or mach-o files, GSYM sections can share a string tables that already exists within a file.

Why use GSYM?
GSYM files are up to 7x smaller than DWARF files and up to 3x smaller than Breakpad files. The file format is designed to touch as few pages of the file as possible while doing address lookups. GSYM files can be mmap'ed into a process as shared memory allowing multiple processes on a symbolication server to share loaded GSYM pages. The file format includes inline call stack information and can help turn a single address lookup into multiple stack frames that walk the inlined call stack back to the concrete function that invoked these functions.

Converting DWARF Files to GSYM
llvm-dsymutil is available in the llvm/tools/gsym directory and has options to convert DWARF into GSYM files. llvm-dsymutil has a -dwarf option that specifies a DWARF file to convert into a GSYM file. The output file can be specified with the -out-file option.

$ llvm-dsymutil -dwarf /tmp/a.out -out-file /tmp/a.out.gsym

This command will convert a DWARF file into the GSYM file format. This allows clients that are currently symbolicating with DWARF to switch to using the GSYM file format. This tool could be used in a symbolication workflow where symbolication servers convert DWARF to GSYM and cached the results on the fly, or could be used at build time to always produce a GSYM file at build time. DWARF debug information is rich enough to support encoding the inline call stack information for richer and more useful symbolication backtraces.

Converting Breakpad Files to GSYM

llvm-dsymutil has a -breakpad option that specifies a Breakpad file to convert into a GSYM file. The output file can be specified with the -out-file option.

$ llvm-dsymutil -breakpad /tmp/foo.sym -out-file /tmp/foo.gsym

This allows clients currently using breakpad to switch over to use GSYM files. This tool could be used in a symbolication workflow where symbolication servers convert breakpad to GSYM format on the fly only when needed. Breakpad files do not contain inline call stack information, so it is advisable to use llvm-dsymutil -dwarf when possible to avoid losing this vital information.

File Format Overview
The GSYM file consists of a header, address table, address info offset table and address info data for each address.

The GSYM file format when in a stand alone file is ordered as shown:

  • Header
  • Address Table
  • Address Data Offsets Table
  • File Table
  • String Table
  • Address Data

Header

#define GSYM_MAGIC 0x4753594d
#define GSYM_VERSION 1
struct Header {
  uint32_t magic;
  uint16_t version;
  uint8_t  addr_off_size;
  uint8_t  uuid_size;
  uint64_t base_address;
  uint32_t num_addrs;
  uint32_t strtab_offset;
  uint32_t strtab_size;
  uint8_t  uuid[20];
};

The magic value is set to GSYM_MAGIC and allows quick and easy detection of this file format when it is loaded. Addresses in the address table are stored as offsets from a 64 bit address found in Header.base_address. This allows the address table to contain 32, 16 or 8 bit offsets, instead of a table of full sized addresses. The file size is smaller and causes fewer pages to be touched during address lookups when the address table is smaller. The size of the address offsets in the address table is specified in the header in Header.addr_off_size. The header contains a UUID to ensure the GSYM file can be properly matched to the object ELf or mach-o file that created the stack trace. The header specifies the location of the string table for all strings contained in the GSYM file, or can point to an existing string table within a ELF or mach-o file.

Address Table
The address table immediately follows the header in the file and consists of Header.num_addrs address offsets. These offsets are sorted and can be binary searched for efficient lookups. Address offsets are encoded as offsets that are Header.addr_off_size bytes in size. During address lookup, the index of the matching address offset will be the index into the address data offsets table.

Address Data Offsets Table
The address data offsets table immediately follows the address table and consists of Header.num_addrs 32 bit file offsets: one for each address in the address table. The offsets in this table are the absolute file offset to the address data for each address in the address table. Keeping this data separate from the address table helps to reduce the number of pages that are touched when address lookups occur on a GSYM file.

File Table
The file table immediately follows the address data offsets table. The format of the FileTable is:

struct FileTable {
  uint32_t count;
  FileInfo files[];
};

The file table starts with a 32 bit count of the number of files that are used in all of the address data, followed by that number of FileInfo structures.

Each file in the file table is represented with a FileInfo structure:

struct FileInfo {
  uint32_t directory;
  uint32_t filename;
};

The FileInfo structure has the file path split into a string for the directory and a string for the filename. The directory and filename are specified as offsets into the string table. Splitting paths into directory and file base name allows GSYM to use the same string table entry for common directories.

String Table
The string table follows the file table in stand alone GSYM files and contains all strings for everything contained in the GSYM file. Any string data should be added to the string table and any references to strings inside GSYM information must be stored as 32 bit string table offsets into this string table.

Address Data
The address data is the payload that contains information about the address that is being looked up. The structure that represents this data is:

struct AddressInfo {
    uint32_t size;
    uint32_t name;
    AddressData data[];
};

It starts with a 32 bit size for the address range of the functiopn and is followed by the 32 bit string table offset for the name of the function. The size of the address range is important to encode as it stops address lookups from matching if the address is between two functions in some padding. This is followed by an array of address data information:

struct AddressData {
    uint32_t type;
    uint32_t length;
    uint8_t data[length];
};

The address data starts with a 32 bit type, followed by a 32 bit length, followed by an array of bytes that encode each specify kind of data.
The AddressData.type is an enumeration value:

enum class InfoType {
   EndOfList = 0u,
   LineTableInfo = 1u,
   InlineInfo = 2u
};

The AddressInfo.data[] is encoded as a vector of AddressData structs that is terminated by a AddressData struct whose type is set to InfoType.EndOfList. This allows the GSYM file format the contain arbitrary data for any address range and allows us to expand the GSYM capabilities as we find more uses for it.

InfoType::EndOfList is always the last AddressData in the AddressInfo.

InfoType::LineTableInfo is a modified version of the DWARF line tables that efficiently stores line table information for each function. DWARF stores line table information for an entire source file and includes all functions. Having each function's line table encoded separately allows fewer pages to be touched when looking up the line entry for a specific address. The information is optional and can be omitted fo address data that is from a symbol or label where no line table information is available.

InfoType::InlineInfo is a format that encodes inline call stacks. This information is optional and doesn't need to be included for each address. If the function has no inlined functions this data should not be included.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
zturner added inline comments.Oct 23 2018, 4:35 PM
lib/DebugInfo/GSYM/Breakpad.cpp
33–35

This should be StringRef, then you can use the starts_with member function.

37

Can you use an enum class here? Also, since this is a type and it's local to this file, it should be in an anonymous namespace.

47–51

We shouldn't have global constructors. Instead, can you make this a constexpr StringRef?

53–61

I believe we have something like this in StringExtras.h it's called llvm::hexDigitValue

64–65

You could store this as a StringRef instead of a start and end pointer. It will actually be more space efficient (pointer + length is potentially 4 bytes smaller than pointer + pointer)

71

Early exit.

102

Can you use llvm::isHexDigit here instead?

109

This should return a StringRef.

113–125

If you're storing a StringRef, then this function becomes:

// Remove leading spaces
m_pos = m_pos.ltrim();
// get all characters until the next space
StringRef word = m_pos.take_until([](char c) {return !isspace(c); });
// remove those from the stored string, and remove whitespace
m_pos = m_pos.drop_front(word.size()).ltrim();
// return the word
return word;
127–131

This just becomes return m_pos;

Also, this function should be const

132–156

All of these functions should be const. That said, I think they're all unnecessary. llvm::StringRef has a method called getAsInteger. I think we can just use that, so all of these various specialized conversion functions seem unnecessary. The user can just call GetString() which returns the rest of the line as a StringRef, then they can call getAsInteger() with that. And all of these functions can just be deleted.

159

Can we use llvm::Error here?

160–161

StringRef

lib/DebugInfo/GSYM/DwarfTransformer.cpp
168

Can you assign this to a StringRef right away so that we don't have to use strcmp related functions?

194

General comment, I don't want to point out every single instance, but there's a *lot* of auto in this code. It makes it hard to read. I think we prefer not to use so much auto in LLVM, so I'd suggest removing some of it and use explicit types. Only when the type is specified in the following code is it ok, like auto foo = llvm::make_unique<TheType>, because then you know it's a unique_ptr<TheType>, or when it's really obnoxiously long to spell out.

595–597

Can you remove all the commented out logging code?

clayborg added inline comments.Oct 26 2018, 9:58 AM
lib/DebugInfo/GSYM/README.md
7

We need to add unwind info at some point for sure. It is one of the payloads in the address info we can work on once this is in.

60

We need to be able to binary search this table for your address. If we use relative offsets, then we can't do that. The idea is to mmap this file and use the data as is with minimal setup.

84

I haven't really done much optimization on paths other that split them into directory and filename so file entries can share the strings. One thing we could do is allow strings to be specified in the string table with a length for the file table. That way we could have a long path: /a/b/c/d and refer to "/a", "/a/b", "/a/b/c" and "/a/b/c/d" using the same string. I am open to ideas here. I kept it simple to start with.

104

I think having the length defined is essential to the format. It allows you to skip any data you don't care about with knowing what it contains.

107

This is designed to be extensible and I hope to see many more types of info added in the future.

lemo added inline comments.Oct 26 2018, 11:14 AM
lib/DebugInfo/GSYM/README.md
60

Sorry if I wasn't clear, I was talking about the absolute file offsets:

The offsets in this table are the absolute file offset to the address data for each address in the address table.

These are just file pointers to the address data, so relative vs. absolute has nothing to do with the binary search in the address table, right? Also, absolute 32bit file offsets could be limiting simply because they can only address 4Gb (as section-relative offsets this is less of a concern)

84

What I had in mind is to simply provision for prefix compression - for example every string having an optional link to its prefix (which can also be prefix compressed). What do you think?

85

Can you please document the exact format used to store strings? (even if it's just to note that it uses the same format as .debug_str for example)

104

Then at least encode the len as LEB128?

clayborg added inline comments.Oct 26 2018, 2:28 PM
lib/DebugInfo/GSYM/README.md
60

Actually these are section relative offsets already. They are absolute file offsets if the GSYM is stand alone, but if the GSYM is in a section, it will be section relative. Sorry for the confusion. I will update the documentation to say this more clearly.

84

I am open to ideas on this. For paths I thought about this idea of chaining but directory names are often short and using 4 bytes for the chain might not make for big wins in gain and adds complexity. I am happy to discuss any string table optimizations that you have in mind though.

85

Will do

clayborg updated this revision to Diff 171351.Oct 26 2018, 2:34 PM

Most of Zachary's fixes are in this path. Few major differences:

  • Updated the README.md
    • be more specific about the string table encoding
    • specify address info offsets are offsets relative to the start of the GSYM header
  • Update GsymReader
    • Switched over to using BinaryStreamReader and removed DataRef class
    • Use llvm::ArrayRef instead of pointers for AddrOffsets, AddrInfoOffsets, and Files (file table)
  • Switch to DenseMap and StringMap where applicable
  • Removed use of std::shared_ptr for string tables and file tables
  • Made GsymCreator thread safe and cleaned up multi-threaded code
  • Removed segmenting code until we get all classes in the right shape. Will commit this back after first checkin once format is locked down
  • Added unit tests

+Eric Christopher for a DWARF expert's perspective

Hi Greg, this is great stuff. I’m going to take a closer look at the implementation, in the meantime here are a few high level comments:

Important requirements

These observations describe areas which would prevent or significantly limit the use of GSYM format for crash processing. I’m not suggesting that these are blockers for checking in the initial implementation although some, if we all agree to incorporate, might be harder to retrofit later on.

  • Full debug info fidelity: this is one of our key requirements for the Breakpad processor replacement. For example our developers have been asking for things like the ability to extract the arguments/locals values. It would be fine if we can use GSYM as a first level index which would enable use to pick a subset of the full debug information (having to fetch the full ELF/DWARF/PDB files would defeat the value of using GSYM)
    • CFI (must have, w/o this we won’t be able to do even basic stack walks, right?)
    • “Accelerator” indexes pointing to subsets of the real debug information
    • Encoding the full debug info (does this even make sense since we’d end up with yet another debug info format)

DWARF is good for full debug info so I see no need to reinvent here. We can put the GSYM into a section and also have any DWARF we need for full debug info. We could also embed DWARF into the address info data if needed. Many options to achieve the debug info requirements.

I see CFI as the next thing to get added to GSYM. Lets work on the format we want. for me I want the following from CFI:

  • Async unwind info so we can unwind at any PC
  • If we can get the above info, we at least know if the unwind info is only valid at call sites
  • No need to re-invent here. We can just point to existing unwind info in the file (like say in EH frame), or we can inline existing formats (standard EH frame, .debug_frame, compact unwind from ARM, etc)

For accelerator tables, this file format is an address accelerator table. For other info, we can add new InfoType enumeration values that can point into existing DWARF (.debug_info, .debug_frame, and more)

  • First class support for CodeView/PDBs: as far as I can tell, the proposed GSYM format doesn’t prevent an PDB importer (and I think it’s worth mentioning this in the description). But if we’re talking about adding support for full debug information this might need more consideration.

Yes, we should add a pdb2Dwarf converter like we do with DWARF. Very easy to add for sure.

  • Built-in versioning: I think it’s critical to allow the format to evolve in a backward compatible way.

The header has a version. We can include a version in each InfoType as well

Miscellaneous Observations, Ideas and Questions

  • First class support for sharding: for horizontal scalability reasons we’d want to pull in only the information required to process a particular minidump

I pulled the sharding (segmenting) that was in the initial checkin until we lock down all of the changes. I will re-add this as soon as the first check-in happens.

  • What do you think of using a hierarchical address data structure? (ex. B-Trees. This goes hand in had with sharding)

Anything that makes the lookups faster is fine with me. It would be easy to embed a reference to another GSYM file within a GSYM file as a new InfoType!

  • It would be nice to mention that COFF input is also an important use case.

Yeah, easy to use COFF with any of this.

  • Why not use ELF format as the top container/file format?

We can, but not required. GSYM can be a stand alone file, or a section within an ELF file. Much quicker to map in a dedicated GSYM file and then just start parsing IMHO. But we have the option for both. Really works well if we can point the GSYM string table at the .debug_str so we can share strings. The string table offset is an absolute file offset so that it can shared string tables with other sections. It can even share multiple string tables by pointing to the lower file offset for the first line table and giving a large size and spans multiple areas of the file. We might need to dump the string table offset to 64 bit.

clayborg added inline comments.Oct 26 2018, 3:10 PM
include/llvm/DebugInfo/GSYM/GsymReader.h
47

With the switch to using BinaryStreamReader this might just work now if I understand the class directly. I switched over to decoding ArrayRefs for AddrOffsets, AddrInfoOffsets and FileEntry in the GsymReader. So this format can be any endian if needed. BinaryStreamReader would make a copy and swap them right?

zturner added inline comments.Oct 26 2018, 3:26 PM
include/llvm/DebugInfo/GSYM/GsymReader.h
47

I'm not sure I understand the question about making a copy and swapping. BinaryStreamReader never makes a copy of anything (that's one of the nice things about it actually). So you'd say something like:

const Header *Header = nullptr;
BinaryStreamReader Reader(Stream);
Reader.readObject(Header);

Note that header was never declared as an object to begin with, just a pointer. So no copy ever happened. All that happened was it changed the value of your pointer to point into the buffer.

Does that answer your question?

clayborg added inline comments.Oct 26 2018, 4:29 PM
include/llvm/DebugInfo/GSYM/GsymReader.h
47

I was commenting on the ArrayRef stuff I used when decoding an array of integers for the AddrOffsets, AddrInfoOffsets and FileEntry in GsymReader::init(...). I know pointers won't ever have their contents swapped. It would be easy to unswap the header if needed. I mostly worry about the array of address offsets and that is the only performance specific bottleneck if things are byte reversed.

zturner added inline comments.Oct 26 2018, 8:24 PM
include/llvm/DebugInfo/GSYM/GsymReader.h
47

That said, I think worrying about performance in byte-reversed scenarios is kind of a YAGNI concern. Unless you're specifically planning to support big endian, we can address it if and when it comes.

That said though, reading the ArrayRef doesn't make a copy either. It's just like all the other methods in that all it's going to do is reinterpret_cast<> the bytes. So if you ask for an ArrayRef<uint32_t> it's just going to cast the bytes to a uint32_t* and put those into an ArrayRef<>. But it's not going to copy anything.

If the reason you're asking is because BinaryStreamReader takes an endianness value to its constructor, that value is only used when you call the readInteger methods. If you never call readInteger(), the endianness value you pass to the constructor is never used for anything.

120

Why does this need to be mutable?

clayborg added inline comments.Oct 27 2018, 6:59 AM
include/llvm/DebugInfo/GSYM/GsymReader.h
120

because BinaryStreamReader must use BinaryStreamReader::setOffset() before reading data because it contains the file position. Many functions should be const, but they can't be if they access any data in GSYMData.

Oh I usually would store a BinaryStreamRef in the class and just create the
BinaryStreamReader on the stack whenever you need one

That actually makes it more thread safe too since you wouldn’t have
multiple threads sharing the same offset

That actually makes it more thread safe too since you wouldn’t have multiple threads sharing the same offset

Sounds good, I will fix this.

clayborg updated this revision to Diff 171400.Oct 27 2018, 8:58 AM
  • Fixed new StringRef decoding of Breakpad files
  • Added a breakpad to GSYM unit test
  • Removed the mutable BinaryStreamReader ivar from GsymReader and create a BinaryStreamReader instance on the fly when we need to decode information
clayborg updated this revision to Diff 171402.Oct 27 2018, 9:47 AM
  • Added lookup tests to unit tests
    • test function info lookup with valid and invalid addresses
    • test that we can infer the size for zero sized symbols by taking delta between current entry and next entry
clayborg updated this revision to Diff 171431.Oct 28 2018, 9:21 AM
  • Added full InlineInfo tests
  • Cleaned up the API for parsing InlineInfo. Prior to this an offset reference was needed to be supplied when parsing, but the user need not worry about that so I cleaned up the API and make private decoding calls and made the API correct
clayborg updated this revision to Diff 171441.Oct 28 2018, 3:34 PM

Fixed an issue with removing auto that was causing a crasher when parsing DWARF

I’m glad you’re able to work on this publicly again! I’ve made a pass through just the README.md only at first, just to help nail down the format.

lib/DebugInfo/GSYM/README.md
10

llvm-gsymutil

throughout this file

43

Expectations around this field?

That is: should a v1-compliant reader reject v2 input (thereby making this field essentially just another magic number), or should v2 input be considered a strict superset of the v1 format, such that a v1 reader could process it correctly while ignoring whatever extensions v2 defines (probably via additional fields in this struct)?

For some purposes, a “header size” field may be better than a version field, because it indicates not just the version of the header structure (assuming you only grow), but also tells you where the data that follows (in this case, the address table) actually begins.

46

Define what the base_address is relative to. The load address of a particular segment or section? The lowest possible load address? The ELF or Mach-O header?

The symbol table analogue is that each address in the table is relative to a particular section (except for special cases, such as those interpreted as “absolute”), in nlist::n_sect or ElfN_Sym::st_shndx. This should be nailed down for GSYM a bit better too. This has implications on the common base_address approach.

50

The 20-byte UUID matches Mach-O practice, but ELF IDs are just a bunch of bytes. They may be 20-byte UUIDs or UUID-like, but they may also be longer or shorter.

Do you want to provide guidance for what to do in case the corresponding binary image doesn’t have a build ID at all?

54

ELF, not ELf.

(Would be easier to review if this were hard-wrapped.)

54

So valid values for addr_off_size are 8, 16, and 32, it sounds. Not 64?

Not representing as n and interpreting as 2^(n+3) or 2^(n+4)? That might give you a better layout for future placement of extension bits in this field.

76

Rather than putting this in the file table, could you move it into the header? That’d make it easier to locate data that follows (like the string table) without having to pull another page in from disk if the file table isn’t otherwise needed.

It also seems weird that the file table is the only structure that’s not sized in the header. You’ve got num_addrs and strtab_size there.

94

As much as I’d love for the world to grow up and use UTF-8 everywhere, specifying it here is maybe a little stronger than can reliably be guaranteed.

97

What’s the empty string for?

99

If your base_offset is 64 bits wide in contemplation of large files, why is strtab_offset only 32 bits?

100

“the file” = relative to what? The image header? The first section? The section loaded at the lowest address? A special magic section? The file on disk that hosts the image (which may be fat?)

113

functiopn

113

Is the name of the function encoded in decorated form?

117

If you store a number-of-address-data-elements field in AddressInfo, you can save four bytes per AddressInfo by not needing to carry the unnecessary “length” field for EndOfList AddressData entries.

121

specify → specific

130

But this also means that you may wind up expressing things redundantly. Address ranges that apply to a single conceptual LineTableInfo and a single conceptual InlineInfo may not coincide. The definition here means that you’re going to have to split everything up at each boundary. You wind up, for example, having to express the same InlineInfo twice, just because there was a break in LineTableInfo.

This becomes more important as you add other InfoTypes, such as an InfoType for CFI, as Leonard pointed out would be necessary. Same goes for an InfoType that would allow recovery of function parameters and local variables.

134

fo → for

134

Can you describe the format, or at least the modifications to DWARF?

136

Same: can you describe the format?

clayborg added inline comments.Oct 29 2018, 1:53 PM
lib/DebugInfo/GSYM/README.md
43

I was thinking that the magic and version are the only things that are fixed within this format. Anything that follows is specific to s specific version. I think it is nice to be able to change anything you want. Would be very east to later have HeaderV1, HeaderV2, etc. The idea is that the version number gets bumped and anything in the file format can change. I am open to other ideas on this, as this is just how it is currently coded.

46

Currently "base_address" is actually the min address for all addresses in the "Address Offsets Table". We can choose to define this differently, but the smaller the address offsets are in the "Address Offsets Table" the better for the size of this file. Address offsets are used for both symbols and function info with debug info (line table + inline info).

50

We can define this to just be "uint8_t uuid[];" and allow the header size to vary depending on the Header.uuid_size value if needed. Currently in LLDB, we handle any size UUID. I am find with modifying this to not have a size. This will allow headers to be shorter if needed. What do you think?

54

Address offsets can be 64 bit. I will fix the docs. We could decide to do 24 bit, 40 bit, 48 bit offsets etc if that helps with file size at some point. Not keeping a power of two gives us that option?

76

Yep, easy to move the file count into the header.

97

It makes for an easy invalid string table offset value to be zero. Also used in many DWARF string tables. If the empty string is always zero, it makes it easy to look at a string table index in your structure and know it is invalid instead of having to know that "0x12340400" is the empty string (and would differ for each string table. One byte extra only, so thought it was worth doing this like other string tables.

99

I started with a GSYM stand alone format and added the ability to be in a section later. Easy to bump the string table offset to 64 bit with a 64 bit size. By "base_offset" I believe you mean "base_address"?

100

Since this data is in a stand alone file or a section in an object file. it should be relative to the start of the file for a stand alone GSYM file, or the Image header if in ELF (Elf Header) or the Mach-o (Mach-o header in a universal file or single arch file).

113

It is currently encoded in mangled form if that was available from the debug info, or in demangled fully qualified form (calculated in DWARF when to DW_AT_linkage_name is available) or just a base name when the parent decl context is the file. Could be defined as "shortest form that is available for a fully qualified name to the function"?

117

I like the idea of having the number-of-address-data-elements in the AddressInfo and get rid o the EndOfList.

I would like to keep the length field if at all possible because you might want to skip data you don't care about. For example, if we have a AddressData for line table, inline info and for unwind info, you might want to skip to the unwind info. You wouldn't want to have to run the streaming parser for the line table and for the inline info if you just want the unwind info. So I like the length field here as it makes it easy to skip unwanted info.

130

Some addresses may be redundant, but the line table can be more efficient when it doesn't have many extra opcodes like the actual DWARF line table (stmt_begin, stmt_end, prologue_begin, prologue_end). If we merge all of the info together it encodes less efficiently. Also, you might want to "thin" out a GSYM file to contain only unwind info. If they are all separate chunks, very easy to go. All addresses are stored as offsets from the start address of the AddressInfo, so they tend to be encoded as ULEB numbers. Any lexical scopes are relative to their parent scope (the function or a another scopes). So addresses tend to be encoded very efficiently.

134

I will add that complete description. Very pared down to just file + line + address (no stmt_begin, stmt_end, prologue_end, epilogue_begin, many other things have been removed and given back to the opcode that advances line and address in 1 byte.

136

Will do.

clayborg updated this revision to Diff 171833.Oct 30 2018, 4:28 PM

Updated README.md to contain full details on how line tables and inline info are encoded.

So I believe this is in a good shape now. It matches the current GSYM format we have here at Facebook which is "version 1". I would love to get this in without any changes to the format so that it matches the current format we have been using for a year now. Then we can iterate and move to version 2 and make changes. Thoughts?

@echristo probably needs to have a say, i know he's been busy but i'll try to ping him about this.

README.md has complete file details now for anyone wanting to understand the entire file format.

README.md has complete file details now for anyone wanting to understand the entire file format.

I'll get to this in the next day or so. Coming back from vacation.

A great way to understand and see the file format is to use to convert some DWARF:

$ llvm-gsymutil -dwarf /tmp/a.out

Then dump it:

$ llvm-gsymutil -verbose /tmp/a.out.gsym

The verbose flag will dump all GSYM sections (header, address table, address offsets table, file table, string table, and all address infos. This allows you to take known input from DWARF and generate GSYM output files.

Hi Greg, I did a high level pass over your patch. Personally I'm okay with keeping the format in sync with what you have at Facebook as long as the style issues are addressed before we land this.

include/llvm/DebugInfo/GSYM/DwarfTransformer.h
35

Have you considered using llvm::Error? You can use it as a wrapper for error codes and has an assert that the error is dealt with. At some point you'll probably have to convert it to an error anyway so might as well do it at the source (imho).

37

No braces around single statement block according to the LLVM style guide.

41

Spurious ;

include/llvm/DebugInfo/GSYM/FileTableCreator.h
27

Commented out code.

include/llvm/DebugInfo/GSYM/GsymCreator.h
48

s/guard/Guard/

include/llvm/DebugInfo/GSYM/GsymReader.h
57

llvm::Error?

lib/DebugInfo/GSYM/Breakpad.cpp
50

lowerCamelCase for methods.

lib/DebugInfo/GSYM/DwarfTransformer.cpp
217

You could make this an early return and save some indentation.

JDevlieghere added inline comments.Nov 5 2018, 10:03 AM
lib/DebugInfo/GSYM/DwarfTransformer.cpp
162

No braces. Same for the single-line if-blocks below.

247

No braces, same for the rest of the file.

263

Make this an early return.

495

Can you run clang-format over you patch (again)? It should've removed these double newlines.

lib/DebugInfo/GSYM/GsymReader.cpp
213

Since there's no dependency for the fall-through case I'd turn this into an early return as well, and duplicate this line.

216

StringRef

228

Remove commented-out code.

407

Looks like this code is always the same except the cast, I'd factor this out.

lib/DebugInfo/GSYM/LineTable.cpp
25

UperCamelCase.

34

MinLineDelta, etc

55

s/offset/Offset, same for the other variable names in this file.

JDevlieghere requested changes to this revision.Nov 5 2018, 10:03 AM
This revision now requires changes to proceed.Nov 5 2018, 10:03 AM
lemo added a subscriber: zturner.Nov 5 2018, 11:56 AM

Greg, what do you have in mind as the next step? LLDB integration?

Greg, what do you have in mind as the next step? LLDB integration?

Next step, I would like to bump to version 2 to address some of Mark Mentovai's comments:

  • bump string table offset and size to 64 bit in case GSYM is a section in a large file where we share string table
  • allow UUID to be any size and only emit what we need instead of padding out to 20 bytes
  • few other things
  • possibly move number of files into header

Follow up from the community (including from me):

  • agree upon an unwinding format and add support for what we agree upon
    • create a .debug_frame to above format converter
  • any other improvements or AddressInfo types
  • LLDB integration with GSYM format
lemo added a comment.Nov 5 2018, 1:49 PM

Cool. Looking forward to the LLDB integration!

clayborg updated this revision to Diff 172650.Nov 5 2018, 1:56 PM

Fixes:

  • Take care of issues found by Jonas
  • Rename variables to be camel case
  • Remove commented out code

Thanks Greg! I came across a few more places that don't need braces but otherwise this looks good style-wise.

include/llvm/DebugInfo/GSYM/DwarfTransformer.h
38

No braces

46

No braces

include/llvm/DebugInfo/GSYM/InlineInfo.h
38

/// for Doxygen comments?

include/llvm/DebugInfo/GSYM/LineTable.h
28

You should considering using ///< so it becomes a Doxygen comment.

lib/DebugInfo/GSYM/DwarfTransformer.cpp
74

No braces

82

You could make this an early break by inverting the condition. Same for the condition below. I think that would improve the readability of this code.

110

No braces

115

No braces

128

No braces

153

No braces

168

No braces

175

No braces

391

No braces

407

Simplify this with early break.

422

No braces

444

No braces

449

No braces

483

No braces

497

No braces

lib/DebugInfo/GSYM/FileTableCreator.cpp
35

No braces

lib/DebugInfo/GSYM/FunctionInfo.cpp
26

No braces

lib/DebugInfo/GSYM/GsymReader.cpp
61

I strongly prefer wrapping this in an llvm::StringError. It's a little more verbose but keeps things consistent.

256

No braces

lib/DebugInfo/GSYM/InlineInfo.cpp
97

No braces

clayborg updated this revision to Diff 172784.Nov 6 2018, 10:04 AM

Fix all of Jonas' issues.

Friendly ping

One more nit: can you turn your comments into full sentences? Most of them are already like that but some are missing capitalization and others are lacking a full stop. I didn't do another pass but other than that it looked alright style-wise.

Unfortunately there's nothing else I can do on this since I don't feel enough ownership over the top level lib/DebugInfo folder to approve a new sub-folder. If you still can't get any traction from someone who does have that kind of ownership, one idea might be to make a post on llvm-dev@ pointing them to this review.

Unfortunately there's nothing else I can do on this since I don't feel enough ownership over the top level lib/DebugInfo folder to approve a new sub-folder. If you still can't get any traction from someone who does have that kind of ownership, one idea might be to make a post on llvm-dev@ pointing them to this review.

I have this ownership and am on the review list. This is a large patch and a lot of work to review and I'll get to it as soon as I can.

2 month ping

Hi Greg,

I've had a lot of time to review this (thanks for that) and do apologize for taking so long. I have a couple of concerns about this so bear with me and let's see where we can get:

a) This looks like it's a standalone tool that's being added to llvm, but that really doesn't involve anything coming out of llvm right now?
b) This seems to be largely a binary encoding of a breakpad file and not a new debug format?
c) What are the future plans for this code?

Mostly I'm trying to figure out what we want to do with this in llvm. It seems like something that would be good for the breakpad project mostly?

Thanks!

-eric

Hi Greg,

I've had a lot of time to review this (thanks for that) and do apologize for taking so long. I have a couple of concerns about this so bear with me and let's see where we can get:

a) This looks like it's a standalone tool that's being added to llvm, but that really doesn't involve anything coming out of llvm right now?

It has a few tools:
dwarf2gsym
bpad2gsym

It also has all of the details for parsing and creating the format.

The dwarf2gsym uses the LLVM DWARF parser to parse and convert DWARF to GSYM format so that is a huge part of LLVM that is being used.

bpad2gsym converts the textual breakpad format to GSYM. There are many servers out there that are using very large breakpad, a google project, text files for symbolication which wastes a ton of CPU time as the file format is a big blob of text. So seeing as breakpad and crashpad want to adopt this format, it seemed like LLVM was a good place to put it so that it can get adopted by these Google teams. They already have a DWARF to breakpad conversion tool out there somewhere. That tool might have its own DWARF parser, which seems like a waste to not share the very nice LLVM DWARF parser because it keeps up with the standard more. We had breakpad users at Facebook having to fix DWARF parsing bugs as DWARF moved to DWARF4 recently and I was surprised to find a tool that had its own DWARF parse. So sharing of LLVM technologies seemed to make more sense seeing as Google folks own breakpad and crashpad.

b) This seems to be largely a binary encoding of a breakpad file and not a new debug format?

It is a more efficiently encoded symbolication format for address to information (source file and line, and inline call stack). It isn't a replacement for DWARF, but it can be a complete replacement for any users of -gline-tables-only. It is designed to allow crash reporting tools and servers that are parsing millions of symbolication requests to symbolicate many orders of magnitude faster than using DWARF. It is designed to be mmap'ed shared by one or more processes and used as is (no setup, or sorting the DWARF "accelerator" tables (which are random indexes)). Unlike DWARF, we can mmap this in and use it much like the apple accelerator tables. All information for each function is in a single blob of data, where in DWARF it is scattered across .debug_info, .debug_line, .debug_abbrev, .debug_ranges, and more sections making symbolication very expensive (file cache and performance) when using DWARF. With DWARF, you must check .debug_aranges for the address after parsing all .debug_aranges and sorting the random address list, or linearly search all CUs for their DW_AT_ranges, then if you find a CU, parse ALL DWARF for that CU till you find the function info that is correct, then go parse the line table for the entire CU, and pull out just the bits you cared about for the function.

c) What are the future plans for this code?

A few things I can think of:

  • have compiler add the GSYM data in as a section when compiling and linking. The GSYM data can share the string table from the .debug_str or the symbol table, so this information can be added in along with DWARF to get much better symbolication performance alongside other DWARF and debug info data.
  • replace -gline-tables-only with this for better performance or symbolication
  • add unwind information to the address info to allow symbolication tools that might be doing stack backtraces in process, or external tools to backtrace correctly when given async unwind info for a function. WE can also specify if the information is asynchronous so we can trust it to unwind first frames, or if it isn't only unwind non first frames or non sigtramp following frames
  • Add DWARF DIE offset info to the address info for each address to allow this to be used as a better address accelerator table. Right now DWARF .debug_aranges are just random addresses to CU offset (not DIE offset).
  • use this format more in profiling tools that might need to backtrace or gather data. We saved thousands of machines by switching to GSYM here at Facebook for symbolication and for real time CPU profiling data
  • possibly get this accepted into DWARF format as a replacement for .debug_aranges?

Mostly I'm trying to figure out what we want to do with this in llvm. It seems like something that would be good for the breakpad project mostly?

From the above stuff I hope you might be able to see where we go with this. But this format applies to anyone wanting to do very quick address to data lookups. mmap in, use the tables, better line table encoding than DWARF (we have a single file table where DWARF has one per source file), get inline call stack unwinding in cases where you want to symbolicate.

My idea behind putting it into LLVM allows any compiler that uses LLVM to add this accelerator table as a section in their .o files, their linked executables, or make stand alone GSYM files for server symbolication. The dwarf2gsym conversion tool leverages the LLVM DWARF parser to convert DWARF to this format for people that aren't able to build it into their .o files or binaries at compile/link time, but I would love to see this format be able to be added to .o files and symbol files during build time.

Let me know what you think. I would be happy to meet at Google to discuss further for lunch, or have any folks come up to Facebook. Let me know what you think.

mgrang added inline comments.Feb 26 2019, 9:50 AM
include/llvm/DebugInfo/GSYM/GsymCreator.h
74

Please use range based llvm::sort instead of std::sort.

llvm::sort(Funcs);

See https://llvm.org/docs/CodingStandards.html#beware-of-non-deterministic-sorting-order-of-equal-elements for more details.

phosek added a subscriber: phosek.Apr 18 2019, 5:09 PM

We have a use case for this other than Breakpad.

We have a use case for this other than Breakpad.

Awesome, can you talk about it a bit more at the moment? Also perhaps start a review :)

-eric

This is just where I got tired today but I think I can recommend how to split this up so I could move faster and provide more useful high level review. Prior to splitting I'll keep chugging away for at least a bit each day.

  1. Only add functionality for creating GSym in memory and associated unit tests (no reading from a file).
  2. Add functionality for reading from a file. If easy enough to do we should ignore inline info in this patch to make it smaller. Add gsymutil in this change and add llvm-lit tests for gsymutil.
  3. Add functionality for writing to a file.
  4. Add functionality for reading from breakpad.
  5. Add functionality for writing to breakpad.
  6. Add inline info if it wasn't added in 2
  7. Add functionality for integration into MC

Also it isn't clear to me (again I haven't really gone though anything but the headers) what clear methods or many of the operator overloads are for. Removing unused versions of them would be helpful.

include/llvm/DebugInfo/GSYM/Breakpad.h
20

Its more typical to use Error instead of std::error_code. Error has some sharp edges to it because it forces you to check the error.

include/llvm/DebugInfo/GSYM/DwarfTransformer.h
29

I think I'd prefer this return errors via Error and generally use its interface to allow users to report their own errors. What's the motivation behind passing a log like this?

34

I think it would be more llvm-ish to use Error here but you should be able to swap between the two.

Also I think having methods that finish constructing isn't ideal. Ideally this would happen in the constructor but since we have to handle errors perhaps having a private "incomplete" constructor like this one, make these methods private, and then provide a static function that returns an Expected<DwarfTransformer> by first calling the incomplete constructor and then completing the object using these methods.

42

Perhaps this is part of the reason for using these methods but it seems really inefficient to reload the file when loadDwarf needs to be called anyway. Is there a use case for that?

include/llvm/DebugInfo/GSYM/FileEntry.h
23

How do we want to handle endianness? Are we just assuming that we'll only be working on the target system? Is it always little endian?

The standard thing to do would be to use https://llvm.org/doxygen/Endian_8h_source.html and use packed_endian_specific_integral. This has been extremely successful in llvm.

46

nit: You could use llvm::hash_combine

include/llvm/DebugInfo/GSYM/FileWriter.h
22

Weather raw_ostream or something like FileOutputBuffer is used varies across LLVM. I generally prefer using FileOutputBuffer for binary output. FileOutputBuffer has the unfortunate problem that it doesn't have an abstraction that lets you choose between using MemoryBuffer or other such things that do have abstractions otherwise this would be a fairly simple choice. Also if you want to stream output for memory reasons you might prefer to use a raw_ostream.

The general pattern I've observed for overcoming the abstraction short comings of FileOutputBuffer is to make 'write' methods accept a uint8_t* instead. The consequence is that you often need a reinterpret_cast.

Is there a reason an ostream was used here instead of a raw_ostream?

include/llvm/DebugInfo/GSYM/GsymCreator.h
48

Can you comment on the expected paralell use here? I saw above that you expect to use multiple threads to create the gsym data but it isn't clear to me that this sort of thing makes sense if the threads are going to be under constant contention. Will many threads by calling these methods rapidly or will each thread do a bunch of work between calls? I'd expect the former which makes me skeptical. Have you benchmarked this?

include/llvm/DebugInfo/GSYM/GsymReader.h
64

Some comment about the 'loadFoo' things above. I'd expect there to be a create function that returns an Expected<GsymReader>

68

comment what Verbose does, it isn't clear to me.

include/llvm/DebugInfo/GSYM/Range.h
30–34

nit: These sorts of methods seem superfluous to me but don't worry about removing them.

include/llvm/DebugInfo/GSYM/StringTableCreator.h
20

I almost commented on this issue during the meeting. The standard way these are built is using StringTableBuilder in MC. It uses a finalization technique that makes things a bit more difficult because you have to request the indexes *after* finalization. This enables the standard string table compression technique for shared suffixes.

Switching to that might however cause your interface problems. It would be nice to switch but this might require a large architectural switch. It seems a shame that a chosen interface would overrule a filesize optimization however.

clayborg marked 12 inline comments as done.Jun 6 2019, 8:32 AM

This is just where I got tired today but I think I can recommend how to split this up so I could move faster and provide more useful high level review. Prior to splitting I'll keep chugging away for at least a bit each day.

  1. Only add functionality for creating GSym in memory and associated unit tests (no reading from a file).
  2. Add functionality for reading from a file. If easy enough to do we should ignore inline info in this patch to make it smaller. Add gsymutil in this change and add llvm-lit tests for gsymutil.
  3. Add functionality for writing to a file.
  4. Add functionality for reading from breakpad.
  5. Add functionality for writing to breakpad.
  6. Add inline info if it wasn't added in 2
  7. Add functionality for integration into MC

I will work on splitting this patch as requested

Also it isn't clear to me (again I haven't really gone though anything but the headers) what clear methods or many of the operator overloads are for. Removing unused versions of them would be helpful.

We can go through each thing in each individual patch as I submit them.

include/llvm/DebugInfo/GSYM/Breakpad.h
20

sounds good, will switch this over to llvm::Error

include/llvm/DebugInfo/GSYM/DwarfTransformer.h
29

It was just how we did things for the Facebook code which was outside of llvm since it was a command line only tool. We can have the DwarfTransformer contain a list of errors and warnings and report those after the fact. Would that be better? So maybe have DwarfTransformer have a std::vector<llvm::Error> as a member variable and possibly std::vector<std::string> for warnings?:

std::mutex ErrorsWarningsMutex; // Allow mutli-threaded access to Errors and Warnings
std::vector<llvm::Error> Errors;
std::vector<std::string> Warnings;
34

llvm::Error is fine. Also fine to move things around as needed and uses static creation methods.

42

Some of this complexity I believe came from the different Facebook internal sources that had sharding built in. I didn't want to add sharding (break up one GSYM file into multiple parts) in the first check-in so some of this is left over from that. Also some of the work was done by a non llvm person after I checked in my sources. As you said before, many of these functions should be private and or could be merged into a single function.

include/llvm/DebugInfo/GSYM/FileEntry.h
23

If we are going to encode into object files, we need to have the magic value in the header tell us the byte order IMHO. We really want it to be the same as the system that will use since we want to mmap the file into memory and use it as efficiently as possible. Not sure how well things would perform if we forced an byte order on the file using things like llvm::support::ulittle32_t.

include/llvm/DebugInfo/GSYM/FileWriter.h
22

No reason. Happy to switch. Just trying to avoid the ASMWriter as it requires so much of llvm (targets and more) to be loaded and made making a 32 bit or 64 bit file with a byte order much harder as you had to match it up with a target just to get those values to match.

include/llvm/DebugInfo/GSYM/GsymCreator.h
48

Doing each compile unit in DWARF separately does speed things up when we tested really large binaries here at Facebook. Happy to try it both ways to make sure. But parsing DWARF is generally grabbing a DIE, getting its address ranges, if it has valid ranges, then parse the line table entries that are only the line entries for the function itself, create the inline info, go onto next stage. The string population happens after we get the function ranges if they are valid, and during the line table parsing. We can easily cache a DWARF file index to GSYM file index if that isn't already being done since the first time we add a file from DWARF to GSYM we will need to unique the directory and basename strings, but we should only need to do that once per file index for compile unit line table file. So I believe this will work out as there is plenty of work to do between

74

will do!

include/llvm/DebugInfo/GSYM/GsymReader.h
64

will do

68

will do

include/llvm/DebugInfo/GSYM/Range.h
34

I like them in case I switch the contents to be "uint64_t Start; uint64_t Size;". Just allows less code changes if we switch this around. I know it is already a struct, so if I do this, this should be a class where "Start" and "End" are private.

include/llvm/DebugInfo/GSYM/StringTableCreator.h
20

Yeah, I believe I tried to avoid the MC layer as is required llvm targets to be available and you had to pick an architecture to match the address byte size and byte ordering. Happy to use other things from LLVM where possible and not to large of a pain. But not getting an offsets right away seems like a shame and would change a lot of things to require fixups before they were emitted (all function names in FunctionInfo, directories and basenames in FileEntry object, inline function infos) etc. And we might need a special class still if we ever emit into an exising object file because we would want to be able to reuse and strings in .debug_str or other ELF string tables.

clayborg updated this revision to Diff 203385.Jun 6 2019, 9:19 AM

Rebase to current llvm sources.

@clayborg Do you mind rebasing this revision since you've committed some parts? arc patch D53379 currently has some conflicts. Thanks!

clayborg abandoned this revision.Oct 14 2022, 12:54 PM

This has already been broken up into several patches and submitted already using different diffs.

Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2022, 12:54 PM