Index: docs/PDB/DbiStream.rst =================================================================== --- /dev/null +++ docs/PDB/DbiStream.rst @@ -0,0 +1,3 @@ +===================================== +The PDB DBI (Debug Info) Stream +===================================== Index: docs/PDB/GlobalStream.rst =================================================================== --- /dev/null +++ docs/PDB/GlobalStream.rst @@ -0,0 +1,3 @@ +===================================== +The PDB Global Symbol Stream +===================================== Index: docs/PDB/HashStream.rst =================================================================== --- /dev/null +++ docs/PDB/HashStream.rst @@ -0,0 +1,3 @@ +===================================== +The TPI & IPI Hash Streams +===================================== Index: docs/PDB/ModiStream.rst =================================================================== --- /dev/null +++ docs/PDB/ModiStream.rst @@ -0,0 +1,3 @@ +===================================== +The Module Information Stream +===================================== Index: docs/PDB/MsfFile.rst =================================================================== --- /dev/null +++ docs/PDB/MsfFile.rst @@ -0,0 +1,117 @@ +===================================== +The MSF File Format +===================================== + +.. contents:: + :local: + +.. _msf_superblock: + +The Superblock +============== +At file offset 0 in an MSF file is the MSF *SuperBlock*, which is laid out as +follows: + +.. code-block:: c++ + + struct SuperBlock { + char FileMagic[sizeof(Magic)]; + ulittle32_t BlockSize; + ulittle32_t FreeBlockMapBlock; + ulittle32_t NumBlocks; + ulittle32_t NumDirectoryBytes; + ulittle32_t Unknown; + ulittle32_t BlockMapAddr; + }; + +- **FileMagic** - Must be equal to ``"Microsoft C/C++ MSF 7.00\r\n"`` + followed by the bytes ``1A 44 53 00 00 00``. +- **BlockSize** - The block size of the internal file system. Valid values are + 512, 1024, 2048, and 4096 bytes. 
Certain aspects of the MSF file layout vary + depending on the block size. For the purposes of LLVM, we handle only block + sizes of 4KiB, and all further discussion assumes a block size of 4KiB. +- **FreeBlockMapBlock** - The index of a block within the file, at which begins + a bitfield representing the set of all blocks within the file which are "free" + (i.e. the data within that block is not used). This bitfield is spread across + the MSF file at ``BlockSize`` intervals. + **Important**: ``FreeBlockMapBlock`` can only be ``1`` or ``2``! This field + is designed to support incremental and atomic updates of the underlying MSF + file. While writing to an MSF file, if the value of this field is ``1``, you + can write your new modified bitfield to block 2, and vice versa. Only when + you commit the file to disk do you need to swap the value in the SuperBlock + to point to the new ``FreeBlockMapBlock``. +- **NumBlocks** - The total number of blocks in the file. ``NumBlocks * BlockSize`` + should equal the size of the file on disk. +- **NumDirectoryBytes** - The size of the stream directory, in bytes. The stream + directory contains information about each stream's size and the set of blocks + that it occupies. It will be described in more detail later. +- **BlockMapAddr** - The index of a block within the MSF file. At this block is + an array of ``ulittle32_t``'s describing the layout of the Stream Directory. + For large MSF files, the stream directory (which describes the block layout of + each stream) may not fit entirely on a single block. As a result, this extra + layer of indirection is introduced, whereby this block contains the list of + blocks that the stream directory occupies, and the stream directory itself can + be stitched together accordingly. The number of ``ulittle32_t``'s in this array + is given by ``⌈NumDirectoryBytes / BlockSize⌉``. 
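
The constraints in the field list above can be sketched in code. The following is a minimal illustration, not LLVM's implementation: ``SuperBlockFields``, ``divideCeil``, ``looksValid``, ``numDirectoryBlocks`` and ``alternateFreeBlockMapBlock`` are hypothetical names introduced here, the fields are assumed to be already decoded from their little-endian on-disk form, and the ``FileMagic`` comparison is assumed to have been done separately. Treating ``NumBlocks * BlockSize`` as a strict equality and requiring ``BlockMapAddr < NumBlocks`` are assumptions drawn from the wording of the descriptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded view of the numeric SuperBlock fields (host byte
// order). The FileMagic check is assumed to have been done elsewhere.
struct SuperBlockFields {
  std::uint32_t BlockSize;
  std::uint32_t FreeBlockMapBlock;
  std::uint32_t NumBlocks;
  std::uint32_t NumDirectoryBytes;
  std::uint32_t BlockMapAddr;
};

// Ceiling division: how many blocks are needed to hold `Bytes` bytes.
inline std::uint32_t divideCeil(std::uint32_t Bytes, std::uint32_t BlockSize) {
  assert(BlockSize != 0 && "block size must be non-zero");
  return Bytes / BlockSize + (Bytes % BlockSize != 0 ? 1u : 0u);
}

// Checks the constraints stated in the field descriptions.
inline bool looksValid(const SuperBlockFields &SB, std::uint64_t FileSize) {
  // Valid block sizes are 512, 1024, 2048 and 4096 bytes.
  if (SB.BlockSize != 512 && SB.BlockSize != 1024 && SB.BlockSize != 2048 &&
      SB.BlockSize != 4096)
    return false;
  // FreeBlockMapBlock can only be 1 or 2.
  if (SB.FreeBlockMapBlock != 1 && SB.FreeBlockMapBlock != 2)
    return false;
  // Assumption: "should equal" is enforced here as a strict equality.
  if (static_cast<std::uint64_t>(SB.NumBlocks) * SB.BlockSize != FileSize)
    return false;
  // Assumption: BlockMapAddr is a block index, so it must lie in the file.
  return SB.BlockMapAddr < SB.NumBlocks;
}

// Number of ulittle32_t entries in the array stored at block BlockMapAddr.
inline std::uint32_t numDirectoryBlocks(const SuperBlockFields &SB) {
  return divideCeil(SB.NumDirectoryBytes, SB.BlockSize);
}

// While updating a file, the new free block map is written to the block that
// is not currently active; the SuperBlock is switched to it on commit.
inline std::uint32_t alternateFreeBlockMapBlock(std::uint32_t Current) {
  assert(Current == 1 || Current == 2);
  return Current == 1 ? 2u : 1u;
}
```

For instance, a 16-block file with a 4KiB block size and a 60-byte stream directory needs a single entry in the array at ``BlockMapAddr``, while a 8193-byte directory would need three.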
+ +The Stream Directory +==================== +The Stream Directory is the root of all access to the other streams in an MSF +file. Beginning at byte 0 of the stream directory is the following structure: + +.. code-block:: c++ + + struct StreamDirectory { + ulittle32_t NumStreams; + ulittle32_t StreamSizes[NumStreams]; + ulittle32_t StreamBlocks[NumStreams][]; + }; + +And this structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes. +Note that each of the last two arrays is of variable length, and in particular +that the second array is jagged. + +**Example:** Suppose a hypothetical PDB file with a 4KiB block size, and 4 +streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}. +Stream 0: ⌈1000 / 4096⌉ = 1 block +Stream 1: ⌈8000 / 4096⌉ = 2 blocks +Stream 2: ⌈16000 / 4096⌉ = 4 blocks +Stream 3: ⌈9000 / 4096⌉ = 3 blocks + +In total, 10 blocks are used. Let's see what the stream directory might look +like: + +.. code-block:: c++ + + struct StreamDirectory { + ulittle32_t NumStreams = 4; + ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000}; + ulittle32_t StreamBlocks[][] = { + {4}, + {5, 6}, + {11, 9, 7, 8}, + {10, 15, 12} + }; + }; + +In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes`` +would equal ``60``, and the block at ``SuperBlock->BlockMapAddr`` would hold an +array of one ``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``. + +Note also that the streams are discontiguous, and that part of stream 3 is in the +middle of part of stream 2. You cannot assume anything about the layout of the +blocks! + +Alignment and Block Boundaries +============================== +As may be clear by now, it is possible for a single field (whether it be a high +level record, a long string field, or even a single ``uint16``) to begin and +end in separate blocks. 
For example, if the block size is 4096 bytes, and a +``uint16`` field begins at the last byte of the current block, then it would +need to end on the first byte of the next block. Since blocks are not +necessarily contiguously laid out in the file, this means that both the consumer +and the producer of an MSF file must be prepared to split data apart +accordingly. In the aforementioned example, since fields are stored +little-endian, the low byte of the ``uint16`` would be written to the last byte +of block N, and the high byte would be written to the first byte of block N+1, +which could be tens of thousands of bytes later (or even earlier!) in the file, +depending on what the stream directory says. \ No newline at end of file Index: docs/PDB/PdbStream.rst =================================================================== --- /dev/null +++ docs/PDB/PdbStream.rst @@ -0,0 +1,3 @@ +======================================== +The PDB Info Stream (aka the PDB Stream) +======================================== Index: docs/PDB/PublicStream.rst =================================================================== --- /dev/null +++ docs/PDB/PublicStream.rst @@ -0,0 +1,3 @@ +===================================== +The PDB Public Symbol Stream +===================================== Index: docs/PDB/TpiIpiStream.rst =================================================================== --- /dev/null +++ docs/PDB/TpiIpiStream.rst @@ -0,0 +1,3 @@ +===================================== +The PDB TPI Stream +===================================== Index: docs/PDB/index.rst =================================================================== --- /dev/null +++ docs/PDB/index.rst @@ -0,0 +1,157 @@ +===================================== +The PDB File Format +===================================== + +.. contents:: + :local: + +.. _pdb_intro: + +Introduction +============ + +PDB (Program Database) is a file format invented by Microsoft that contains +debug information that can be consumed by debuggers and other tools. 
Since +officially supported APIs exist on Windows for querying debug information from +PDBs even without the user understanding the internals of the file format, a +large ecosystem of tools has been built for Windows to consume this format. In +order for Clang to be able to generate programs that can interoperate with these +tools, it is necessary for us to generate PDB files ourselves. + +At the same time, LLVM has a long history of being able to cross-compile from +any platform to any platform, and we wish for the same to be true here. So it +is necessary for us to understand the PDB file format at the byte-level so that +we can generate PDB files entirely on our own. + +This manual describes what we know about the PDB file format today: the layout +of the file, the various streams contained within, the format of individual +records within, and more. + +We would like to extend our heartfelt gratitude to Microsoft, without whom we +would not be where we are today. Much of the knowledge contained within this +manual was learned through reading code published by Microsoft on their `GitHub +repo `__. + +.. _pdb_layout: + +File Layout +=========== + +.. toctree:: + :hidden: + + MsfFile + PdbStream + TpiIpiStream + DbiStream + ModiStream + PublicStream + GlobalStream + HashStream + +.. _msf: + +The MSF Container +----------------- +A PDB file is really just a special case of an MSF (Multi-Stream Format) file. +An MSF file is actually a miniature "file system within a file". It contains +multiple streams (aka files) which can represent arbitrary data, and these +streams are divided into blocks which may not necessarily be contiguously +laid out within the file (aka fragmented). Additionally, the MSF contains a +stream directory (aka MFT) which describes how the streams (files) are laid +out within the MSF. + +For more information about the MSF container format, stream directory, and +block layout, see :doc:`MsfFile`. + +.. 
_streams: + +Streams +------- +The PDB format contains a number of streams which describe various information +such as the types, symbols, source files, and compilands (e.g. object files) +of a program, as well as some additional streams containing hash tables that are +used by debuggers and other tools to provide fast lookup of records and types +by name, and various other information about how the program was compiled such +as the specific toolchain used, and more. A summary of streams contained in a +PDB file is as follows: + ++--------------------+------------------------------+-------------------------------------------+ +| Name | Stream Index | Contents | ++====================+==============================+===========================================+ +| Old Directory | - Fixed Stream Index 0 | - Previous MSF Stream Directory | ++--------------------+------------------------------+-------------------------------------------+ +| PDB Stream | - Fixed Stream Index 1 | - Basic File Information | +| | | - Fields to match EXE to this PDB | +| | | - Named Stream References | ++--------------------+------------------------------+-------------------------------------------+ +| TPI Stream | - Fixed Stream Index 2 | - CodeView Type Records | +| | | - Reference to TPI Hash Stream | ++--------------------+------------------------------+-------------------------------------------+ +| DBI Stream | - Fixed Stream Index 3 | - Module/Compiland Information | +| | | - References to individual module streams | +| | | - References to public / global streams | +| | | - Section Contribution Information | +| | | - Source File Information | +| | | - FPO / PGO Data | ++--------------------+------------------------------+-------------------------------------------+ +| IPI Stream | - Fixed Stream Index 4 | - CodeView Type Records | +| | | - Reference to IPI Hash Stream | ++--------------------+------------------------------+-------------------------------------------+ +| /LinkInfo | - 
Referenced from PDB Stream | - Unknown | ++--------------------+------------------------------+-------------------------------------------+ +| /src/headerblock | - Referenced from PDB Stream | - Unknown | ++--------------------+------------------------------+-------------------------------------------+ +| /names | - Referenced from PDB Stream | - Unknown | ++--------------------+------------------------------+-------------------------------------------+ +| Module Info Stream | - Referenced from DBI Stream | - CodeView Symbol Records for this module | +| | - One for each compiland | - Line Number Information | ++--------------------+------------------------------+-------------------------------------------+ +| Public Stream | - Referenced from DBI Stream | - Public (Exported) Symbol Records | +| | | - Reference to Public Hash Stream | ++--------------------+------------------------------+-------------------------------------------+ +| Global Stream | - Referenced from DBI Stream | - Global Symbol Records | +| | | - Reference to Global Hash Stream | ++--------------------+------------------------------+-------------------------------------------+ +| TPI Hash Stream | - Referenced from TPI Stream | - Hash table for looking up TPI records | +| | | by name | ++--------------------+------------------------------+-------------------------------------------+ +| IPI Hash Stream | - Referenced from IPI Stream | - Hash table for looking up IPI records | +| | | by name | ++--------------------+------------------------------+-------------------------------------------+ + +More information about the structure of each of these can be found on the +following pages: + +:doc:`PdbStream` + Information about the PDB Info Stream and how it is used to match PDBs to EXEs. + +:doc:`TpiIpiStream` + Information about the TPI stream and the CodeView records contained within. 
+ +:doc:`DbiStream` + Information about the DBI stream, and relevant substreams including the Module Substreams, + source file information, and CodeView symbol records contained within. + +:doc:`ModiStream` + Information about the Module Information Stream, of which there is one for each compilation + unit, and the format of symbols contained within. + +:doc:`PublicStream` + Information about the Public Symbol Stream. + +:doc:`GlobalStream` + Information about the Global Symbol Stream. + +:doc:`HashStream` + Information about the Hash Table stream, and how it can be used to quickly look up records + by name. + +CodeView +-------- +CodeView is another format which comes into the picture. While MSF defines +the structure of the overall file, and PDB defines the set of streams that +appear within the MSF file and the format of those streams, CodeView defines +the format of **symbol and type records** that appear within specific streams. +Refer to the pages on `CodeView Symbol Records` and `CodeView Type Records` for +more information about the CodeView format. \ No newline at end of file Index: docs/index.rst =================================================================== --- docs/index.rst +++ docs/index.rst @@ -272,6 +272,7 @@ FaultMaps MIRLangRef Coroutines + PDB/index :doc:`WritingAnLLVMPass` Information on how to write LLVM transformations and analyses. @@ -390,6 +391,9 @@ :doc:`Coroutines` LLVM support for coroutines. +:doc:`The Microsoft PDB File Format ` + A detailed description of the Microsoft PDB (Program Database) file format. + Development Process Documentation =================================