I got approval to open source the GSYM symbolication format. This patch still needs tests and still needs to be switched over to use the AsmPrinter to create the GSYM files, but I wanted to post it in progress for the LLVM conference so folks can see what it is and try it out. Full details on the file format are below:
GSYM Introduction
GSYM is a symbolication file format designed to be the best format to use for symbolicating addresses into function name + source file + line information. It is a binary file format designed to be mapped into one or more processes. GSYM information can be created by converting DWARF debug information or Breakpad files. GSYM information can exist as a stand alone file, or be contained in a section of an ELF or mach-o file. When embedded into ELF or mach-o files, GSYM sections can share a string table that already exists within the file.
Why use GSYM?
GSYM files are up to 7x smaller than DWARF files and up to 3x smaller than Breakpad files. The file format is designed to touch as few pages of the file as possible while doing address lookups. GSYM files can be mmap'ed into a process as shared memory allowing multiple processes on a symbolication server to share loaded GSYM pages. The file format includes inline call stack information and can help turn a single address lookup into multiple stack frames that walk the inlined call stack back to the concrete function that invoked these functions.
Converting DWARF Files to GSYM 
llvm-dsymutil is available in the llvm/tools/gsym directory and has options to convert DWARF into GSYM files. llvm-dsymutil has a -dwarf option that specifies a DWARF file to convert into a GSYM file. The output file can be specified with the -out-file option.
$ llvm-dsymutil -dwarf /tmp/a.out -out-file /tmp/a.out.gsym
This command converts a DWARF file into the GSYM file format, which allows clients that currently symbolicate with DWARF to switch to GSYM. The tool could be used in a symbolication workflow where symbolication servers convert DWARF to GSYM on the fly and cache the results, or it could be run at build time so that a GSYM file is always produced. DWARF debug information is rich enough to encode the inline call stack information, which gives richer and more useful symbolication backtraces.
Converting Breakpad Files to GSYM
llvm-dsymutil has a -breakpad option that specifies a Breakpad file to convert into a GSYM file. The output file can be specified with the -out-file option.
$ llvm-dsymutil -breakpad /tmp/foo.sym -out-file /tmp/foo.gsym
This allows clients currently using Breakpad to switch over to GSYM files. The tool could be used in a symbolication workflow where symbolication servers convert Breakpad files to GSYM on the fly, only when needed. Breakpad files do not contain inline call stack information, so it is advisable to use llvm-dsymutil -dwarf when possible to avoid losing this vital information.
File Format Overview
A GSYM file consists of a header, an address table, an address data offsets table, a file table, a string table and the address data for each address.
The GSYM file format when in a stand alone file is ordered as shown:
- Header
- Address Table
- Address Data Offsets Table
- File Table
- String Table
- Address Data
Header
#define GSYM_MAGIC 0x4753594d
#define GSYM_VERSION 1
struct Header {
  uint32_t magic;
  uint16_t version;
  uint8_t  addr_off_size;
  uint8_t  uuid_size;
  uint64_t base_address;
  uint32_t num_addrs;
  uint32_t strtab_offset;
  uint32_t strtab_size;
  uint8_t  uuid[20];
};
The magic value is set to GSYM_MAGIC and allows quick and easy detection of this file format when it is loaded. Addresses in the address table are stored as offsets from the 64 bit address in Header.base_address. This allows the address table to contain 32, 16 or 8 bit offsets instead of full sized addresses. A smaller address table makes the file smaller and causes fewer pages to be touched during address lookups. The size in bytes of each address offset is specified in Header.addr_off_size. The header contains a UUID to ensure the GSYM file can be properly matched to the ELF or mach-o object file that produced the stack trace. The header also specifies the location of the string table for all strings contained in the GSYM file; this can point to an existing string table within an ELF or mach-o file.
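The header checks described above could be sketched roughly as follows. This is only an illustration, not code from the patch: the helper name readHeader is hypothetical, and it assumes the file is in host byte order, that the struct has no padding, and that only the 8, 16 and 32 bit offset sizes mentioned above are valid.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr uint32_t GSYM_MAGIC = 0x4753594d; // hex digits spell "GSYM" in ASCII
constexpr uint16_t GSYM_VERSION = 1;

struct Header {
  uint32_t magic;
  uint16_t version;
  uint8_t addr_off_size;
  uint8_t uuid_size;
  uint64_t base_address;
  uint32_t num_addrs;
  uint32_t strtab_offset;
  uint32_t strtab_size;
  uint8_t uuid[20];
};
static_assert(sizeof(Header) == 48, "sketch assumes no padding in Header");

// Hypothetical helper: copy a Header out of a mapped buffer and sanity-check
// it. Returns false if the buffer cannot hold a plausible GSYM header.
bool readHeader(const uint8_t *data, size_t size, Header &out) {
  if (data == nullptr || size < sizeof(Header))
    return false;
  Header h;
  std::memcpy(&h, data, sizeof(Header)); // memcpy avoids unaligned reads
  if (h.magic != GSYM_MAGIC || h.version != GSYM_VERSION)
    return false;
  // Assumption: only 1, 2 and 4 byte (8, 16 and 32 bit) offsets are accepted.
  if (h.addr_off_size != 1 && h.addr_off_size != 2 && h.addr_off_size != 4)
    return false;
  // The UUID length cannot exceed the fixed 20 byte storage.
  if (h.uuid_size > sizeof(h.uuid))
    return false;
  out = h;
  return true;
}
```

A caller would typically mmap the file, pass the mapping and its size to readHeader, and then compare the first uuid_size bytes of uuid against the UUID of the binary being symbolicated.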
Address Table
The address table immediately follows the header in the file and consists of Header.num_addrs address offsets. These offsets are sorted and can be binary searched for efficient lookups. Each address offset is Header.addr_off_size bytes in size. During address lookup, the index of the matching address offset will be the index into the address data offsets table.
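As a rough sketch of the lookup, the binary search might look like this. The function names are hypothetical, host byte order is assumed, and "matching" is interpreted here as the last entry whose address is less than or equal to the lookup address, which is an assumption rather than something the format description states.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read the i-th address offset from a table whose entries are addrOffSize
// (1, 2 or 4) bytes wide. Host byte order is an assumption of this sketch.
uint64_t readAddrOffset(const uint8_t *table, uint8_t addrOffSize, size_t i) {
  const uint8_t *p = table + i * addrOffSize;
  switch (addrOffSize) {
  case 1:
    return *p;
  case 2: {
    uint16_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
  }
  case 4: {
    uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
  }
  default:
    return UINT64_MAX; // unsupported width in this sketch
  }
}

// Hypothetical lookup: return the index of the last entry whose address is
// <= addr, or -1 if addr precedes every entry in the table.
int64_t findAddressIndex(const uint8_t *table, uint8_t addrOffSize,
                         uint32_t numAddrs, uint64_t baseAddress,
                         uint64_t addr) {
  if (addr < baseAddress)
    return -1;
  const uint64_t target = addr - baseAddress;
  size_t lo = 0, hi = numAddrs; // search the half-open range [lo, hi)
  while (lo < hi) {
    const size_t mid = lo + (hi - lo) / 2;
    if (readAddrOffset(table, addrOffSize, mid) <= target)
      lo = mid + 1;
    else
      hi = mid;
  }
  // lo is now the count of entries <= target.
  return static_cast<int64_t>(lo) - 1;
}
```

The returned index is then used unchanged to index the address data offsets table. Because the search only reads the compact offset table, it touches far fewer pages than a search over full 64 bit addresses interleaved with their data would.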
Address Data Offsets Table
The address data offsets table immediately follows the address table and consists of Header.num_addrs 32 bit file offsets: one for each address in the address table. The offsets in this table are the absolute file offset to the address data for each address in the address table. Keeping this data separate from the address table helps to reduce the number of pages that are touched when address lookups occur on a GSYM file.
File Table
The file table immediately follows the address data offsets table. The format of the FileTable is:
struct FileTable {
  uint32_t count;
  FileInfo files[];
};
The file table starts with a 32 bit count of the number of files that are used in all of the address data, followed by that number of FileInfo structures.
Each file in the file table is represented with a FileInfo structure:
struct FileInfo {
  uint32_t directory;
  uint32_t filename;
};
The FileInfo structure has the file path split into a string for the directory and a string for the filename. The directory and filename are specified as offsets into the string table. Splitting paths into directory and file base name allows GSYM to use the same string table entry for common directories.
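A sketch of how a reader might turn a FileInfo back into a full path is shown below. The helper names are hypothetical, and the sketch assumes NUL-terminated strings in the string table and a '/' path separator, neither of which is spelled out above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

struct FileInfo {
  uint32_t directory; // string table offset of the directory
  uint32_t filename;  // string table offset of the file base name
};

// Return the NUL-terminated string at `offset` in the string table, or an
// empty string if the offset is out of range. Never reads past strtabSize.
std::string getString(const char *strtab, size_t strtabSize, uint32_t offset) {
  if (offset >= strtabSize)
    return std::string();
  const char *begin = strtab + offset;
  const size_t maxLen = strtabSize - offset;
  size_t len = 0;
  while (len < maxLen && begin[len] != '\0')
    ++len;
  return std::string(begin, len);
}

// Hypothetical helper: join directory and base name into one path.
std::string fullPath(const FileInfo &fi, const char *strtab,
                     size_t strtabSize) {
  const std::string dir = getString(strtab, strtabSize, fi.directory);
  const std::string base = getString(strtab, strtabSize, fi.filename);
  if (dir.empty())
    return base;
  if (dir.back() == '/')
    return dir + base;
  return dir + "/" + base;
}
```

Two files in the same directory carry the same directory offset, so the directory string is stored once no matter how many source files live in it.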
String Table
The string table follows the file table in stand alone GSYM files and contains all strings for everything contained in the GSYM file. Any string data should be added to the string table and any references to strings inside GSYM information must be stored as 32 bit string table offsets into this string table.
Address Data
The address data is the payload that contains information about the address that is being looked up. The structure that represents this data is:
struct AddressInfo {
    uint32_t size;
    uint32_t name;
    AddressData data[];
};
It starts with a 32 bit size for the address range of the function, followed by the 32 bit string table offset for the name of the function. Encoding the size of the address range is important because it stops address lookups from matching when the address falls in padding between two functions. This is followed by an array of address data information:
struct AddressData {
    uint32_t type;
    uint32_t length;
    uint8_t data[length];
};
The address data starts with a 32 bit type, followed by a 32 bit length, followed by an array of bytes that encode each specific kind of data.
The AddressData.type is an enumeration value:
enum class InfoType {
   EndOfList = 0u,
   LineTableInfo = 1u,
   InlineInfo = 2u
};
The AddressInfo.data[] is encoded as a vector of AddressData structs that is terminated by an AddressData struct whose type is set to InfoType::EndOfList. This allows the GSYM file format to contain arbitrary data for any address range and allows us to expand the GSYM capabilities as we find more uses for it.
InfoType::EndOfList is always the last AddressData in the AddressInfo.
InfoType::LineTableInfo is a modified version of the DWARF line tables that efficiently stores line table information for each function. DWARF stores line table information for an entire source file and includes all functions. Having each function's line table encoded separately allows fewer pages to be touched when looking up the line entry for a specific address. The information is optional and can be omitted for address data that comes from a symbol or label where no line table information is available.
InfoType::InlineInfo is a format that encodes inline call stacks. This information is optional and doesn't need to be included for each address. If the function has no inlined functions this data should not be included.
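The AddressInfo payload described above could be consumed roughly as follows. This is a sketch under assumptions, not the patch's implementation: the names AddressDataRef, containsAddress and parseAddressData are hypothetical, values are read in host byte order, entries are assumed to be packed with no padding, and the EndOfList entry is assumed to still carry its 32 bit length field.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Underlying type fixed to uint32_t to match the 32 bit AddressData.type.
enum class InfoType : uint32_t {
  EndOfList = 0u,
  LineTableInfo = 1u,
  InlineInfo = 2u
};

// Non-owning view of one AddressData entry (hypothetical helper type).
struct AddressDataRef {
  InfoType type;
  const uint8_t *data;
  uint32_t length;
};

// True if addr lies in [funcStart, funcStart + size). This is the check that
// rejects addresses that land in padding between two functions.
bool containsAddress(uint64_t funcStart, uint32_t size, uint64_t addr) {
  return addr >= funcStart && addr - funcStart < size;
}

// Walk the AddressData entries that follow AddressInfo.size and .name,
// collecting them until EndOfList. Returns false if the buffer ends before a
// terminator is seen or an entry claims more bytes than remain.
bool parseAddressData(const uint8_t *p, size_t size,
                      std::vector<AddressDataRef> &out) {
  size_t pos = 0;
  while (size - pos >= 8) { // room for the type and length fields
    uint32_t type, length;
    std::memcpy(&type, p + pos, sizeof type);
    std::memcpy(&length, p + pos + 4, sizeof length);
    pos += 8;
    if (static_cast<InfoType>(type) == InfoType::EndOfList)
      return true;
    if (length > size - pos)
      return false;
    // Unknown types are kept too; a reader can skip them using the length.
    out.push_back({static_cast<InfoType>(type), p + pos, length});
    pos += length;
  }
  return false;
}
```

Because every entry carries its own length, a reader that does not understand a newer type can step over it, which is what lets the format grow without breaking existing consumers.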
It's more typical to use Error instead of std::error_code. Error has some sharp edges to it because it forces you to check the error.