This is an archive of the discontinued LLVM Phabricator instance.

docs/PDB/MsfFile.rst
25 ↗	(On Diff #128944)	Shouldn't this be BlockSize - 3? The pattern is Block 0, Block 1, Block 2, <BlockSize - 3 more blocks>
35 ↗	(On Diff #128944)	Can we call it FPM instead of FBM? We use the terms Page and Block sometimes interchangeably, but we never refer to anything other than FPM in the code.
102–105 ↗	(On Diff #128944)	I think this sentence is a bit wrong. The first two bytes of each FPM are not necessarily 1. But the first three bits of each FPM are. And not necessarily each FPM either, only the active one. I would just say "There is one bit in the FPM for every block in the file, including the super block." that should be clear enough that there are no exceptions.

colden added inline comments. Jan 8 2018, 10:26 AM

docs/PDB/MsfFile.rst
25 ↗	(On Diff #128944)	I don't think so. I may be wrong, but here was my thinking. If the second FPM0 is at index 4097, that means that there are 4097 blocks before it. BUT, if the interval size is 4096, that means that there's 1 extra block in the first interval, which I think would have to be the super block. I can probably word this better, so I'll work on it, but I think the pattern is Super Block, followed by 1+ 4096 block-intervals repeating
35 ↗	(On Diff #128944)	Yeah for sure. I'm glad you brought this up though, because the usage in the code is actually super confusing. Whenever we use the whole words, it's always FreeBlockMap (see SuperBlock struct), but whenever it's an acronym, it's FPM. Should we just pick one and rename everything to that? I'd be happy to do that as a separate change, if we can agree on which one to go with. I personally prefer FBM, because everything else refers to them as "Blocks," not "Pages." Maybe I'm just bitter because it took me about an hour to figure out what FPM stood for :)
102–105 ↗	(On Diff #128944)	So it seems like there are a couple of issues here: (1) You are correct, I should have said "bits." (2) You are correct, this is only guaranteed of the active FPM. (3) 2 vs 3: this one I'm unsure about. The MS code does indeed mark the first 3 bits as "in-use" [0], but thinking about this in conjunction with your first question, that doesn't really make sense. Because the first interval contains 4097 blocks, there must be one that isn't referenced by the FPM. My assumption would be that it's the SuperBlock (how could it possibly be "free"?). Am I missing something obvious here? The more I think about this, the more confusing it gets. [0] The comment where you described the process of it marking 3 pages as allocated: https://reviews.llvm.org/D41734#967624

I think it's the block at the beginning that is throwing you off. *every* interval consists of 4096 blocks. And *every* interval looks like this:

+-------------+-------+------------------+------------------+------+------+------+
| Block Index | 0     | 1                | 2                | 3    | ...  | 4095 |
+=============+=======+==================+==================+======+======+======+
| Meaning     | Data  | Free Block Map 1 | Free Block Map 2 | Data | Data | Data |
+-------------+-------+------------------+------------------+------+------+------+

So each interval has exactly 4096 blocks. Block 0 is only special in the first interval. It's still there in every other interval, it just contains regular data.

In D41825#970033, @zturner wrote:

So each interval has exactly 4096 blocks. Block 0 is only special in the first interval. It's still there in every other interval, it just contains regular data.

But according to the code, that can't be true. Specifically, I'm looking at llvm::msf::getFpmStreamLayout. If you look at the loop assuming that FpmBlock is 1, then it pushes index 1 as the first FPM block, adds the result of msf::getFpmIntervalLength (which always returns 4096), and pushes that. This means that the second FPM in the sequence is at index 4097, which would mean that there are 4097 blocks in front of it. Unless block at index 4096 is unused, the first interval is in fact 1 block longer.

Doesn't all of that match up exactly with the layout I posted? Paste a copy of that layout diagram in my previous comment onto the end of itself. 4096 becomes "Data", 4097 becomes "FPM1", 4098 becomes "FPM2". "Interval" is the distance between two consecutive blocks of the same FPM. It's always 4096. So FPM1 is on blocks 1, 4096 + 1, 4096*2 + 1, 4096*3 + 1, etc. The blocks immediately before that (0, 4096, 4096*2, 4096*3, etc) are just data, with the only exception being block 0, which is the super block.
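
To make the layout described above concrete, here is a minimal sketch that classifies a block index by its position within its 4096-block interval. The names `BlockKind`, `kIntervalLength`, and `classifyBlock` are hypothetical and chosen for illustration only; they are not LLVM APIs, and the sketch assumes a block size of 4096.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration; not part of LLVM.
enum class BlockKind { SuperBlock, Fpm1, Fpm2, Data };

// Assumption: BlockSize == 4096, so every interval is 4096 blocks long.
constexpr uint32_t kIntervalLength = 4096;

// Classify a block by where it falls inside its interval.
inline BlockKind classifyBlock(uint32_t BlockIndex) {
  if (BlockIndex == 0)
    return BlockKind::SuperBlock; // block 0 is special only in the first interval
  switch (BlockIndex % kIntervalLength) {
  case 1:
    return BlockKind::Fpm1; // 1, 4096 + 1, 4096*2 + 1, ...
  case 2:
    return BlockKind::Fpm2; // 2, 4096 + 2, 4096*2 + 2, ...
  default:
    return BlockKind::Data; // includes 4096, 4096*2, ... (plain data)
  }
}
```

With this, block 4096 comes out as data and block 4097 as FPM1, which is the "paste the diagram onto the end of itself" picture.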

OOOOOH it just clicked. I thought the FPMs were always the first blocks in the interval, but having a block of data first makes so much sense.

Sorry for being so thick, and thanks for breaking it down for me. I'll post an update here in a minute.

Now that I actually understand how the intervals work, I reworked a lot of what I had written. I think it's much clearer now, but I'm sure there's more nits to be had :)

Updated with my newer, better understanding of the file layout

zturner added inline comments. Jan 9 2018, 10:13 PM

llvm/docs/PDB/MsfFile.rst
97 ↗	(On Diff #128962)	This is backwards (yes it's confusing). Since it's the free block map, a value of 1 indicates that the block is free, and a value of 0 indicates that the block is not free.
107–110 ↗	(On Diff #128962)	This is also not quite true. The `k`'th bit in the `j`'th FPM block refers to the `4096j + k`'th block in the file. So, for example, consider blocks 1 and 4097. These are the first and second blocks of FPM1, respectively. To find out the allocation status of block 4097, you look at bit 1 of byte 512 (512*8 + 1 = 4097). There's actually a comment in the MSF reference implementation that indicates this was probably a design oversight, because this results in far more FPM blocks being reserved in a file than are actually necessary (by a factor of 8). But since it's already that way and everything is relying on it, and files are out there in the wild using it, and it results in negligible file bloat, it just stays that way. This is actually the whole point of the `IncludeUnusedFpmData` flag from my previous patch. If you specify `true` for it, it will give you a stream consisting of every block that could be an FPM block (i.e. the index is of the form 4096*k + 1), but if you pass `false` it truncates the extraneous ones at the end that would otherwise refer to blocks that are not even in the file.
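
Following the worked example above (block 4097 is described by bit 1 of byte 512), the lookup can be sketched as below. This is a rough illustration, not LLVM's implementation: `FpmBitLocation` and `locateFpmBit` are hypothetical names, the block size is assumed to be 4096, and the sketch treats the FPM as one contiguous bitfield laid across FPM blocks that sit 4096 blocks apart. How "bit 1" maps onto a mask within the byte is not stated here, so the sketch only reports the bit number.

```cpp
#include <cstdint>

// Assumption: "BigMsf" block size of 4096 bytes.
constexpr uint32_t kBlockSize = 4096;

// Hypothetical result type for illustration; not part of LLVM.
struct FpmBitLocation {
  uint32_t FileBlock;   // index of the FPM block within the file
  uint32_t ByteInBlock; // byte offset inside that FPM block
  uint32_t BitInByte;   // bit number within that byte (0..7)
};

// ActiveFpm is the value of SuperBlock::FreeBlockMapBlock (1 or 2).
inline FpmBitLocation locateFpmBit(uint32_t BlockIndex, uint32_t ActiveFpm) {
  uint32_t StreamByte = BlockIndex / 8;        // byte within the logical FPM stream
  uint32_t FpmChunk = StreamByte / kBlockSize; // which FPM block holds that byte
  return {ActiveFpm + FpmChunk * kBlockSize,   // FPM blocks are 4096 blocks apart
          StreamByte % kBlockSize, BlockIndex % 8};
}
```

For block 4097 with FPM1 active this yields file block 1, byte 512, bit 1, so the second FPM1 block (at index 4097) is only consulted once the block index reaches 32768.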

colden added inline comments. Jan 10 2018, 8:46 AM

llvm/docs/PDB/MsfFile.rst
97 ↗	(On Diff #128962)	Aaah so close.
107–110 ↗	(On Diff #128962)	Oh fascinating. I was wondering about that `/ 8`. I'll see if I can figure out a good way to phrase this.

nth time's the charm!

looks good, thanks!

This revision is now accepted and ready to land. Jan 10 2018, 9:51 AM

Yay, thanks!

If you wouldn't mind committing this for me, I'd really appreciate it.

Ping, could someone please submit this for me?

Closed by commit rL322404: Update MSF File Documentation. (authored by zturner). Jan 12 2018, 1:43 PM

This revision was automatically updated to reflect the committed changes.

Thank you Zach!

Revision Contents

Path: llvm/trunk/docs/PDB/MsfFile.rst
Size: 82 lines

Diff 129701

llvm/trunk/docs/PDB/MsfFile.rst

=====================================		=====================================
The MSF File Format		The MSF File Format
=====================================		=====================================

.. contents::		.. contents::
:local:		:local:

		.. _msf_layout:

		File Layout
		===========

		The MSF file format consists of the following components:

		1. :ref:`msf_superblock`
		2. :ref:`msf_freeblockmap` (also known as Free Page Map, or FPM)
		3. Data

		Each component is stored as an indexed block, the length of which is specified
		in ``SuperBlock::BlockSize``. The file consists of 1 or more iterations of the
		following pattern (sometimes referred to as an "interval"):

		1. 1 block of data
		2. Free Block Map 1 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 1)
		3. Free Block Map 2 (corresponds to ``SuperBlock::FreeBlockMapBlock`` 2)
		4. ``SuperBlock::BlockSize - 3`` blocks of data

		In the first interval, the first data block is used to store
		:ref:`msf_superblock`.

		The following diagram demonstrates the general layout of the file (\| denotes
		the end of an interval, and is for visualization purposes only):

		+-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+
		| Block Index | 0                     | 1                | 2                | 3 - 4095 | \| | 4096 | 4097 | 4098 | 4099 - 8191 | \| | ... |
		+=============+=======================+==================+==================+==========+====+======+======+======+=============+====+=====+
		| Meaning     | :ref:`msf_superblock` | Free Block Map 1 | Free Block Map 2 | Data     | \| | Data | FPM1 | FPM2 | Data        | \| | ... |
		+-------------+-----------------------+------------------+------------------+----------+----+------+------+------+-------------+----+-----+

		The file may end after any block, including immediately after an FPM1 block.

		.. note::
		  LLVM only supports 4096 byte blocks (sometimes referred to as the "BigMsf"
		  variant), so the rest of this document will assume a block size of 4096.

.. _msf_superblock:		.. _msf_superblock:

The Superblock		The Superblock
==============		==============
At file offset 0 in an MSF file is the MSF SuperBlock, which is laid out as		At file offset 0 in an MSF file is the MSF SuperBlock, which is laid out as
follows:		follows:

.. code-block:: c++		.. code-block:: c++
Show All 11 Lines
- FileMagic - Must be equal to ``"Microsoft C / C++ MSF 7.00\\r\\n"``		- FileMagic - Must be equal to ``"Microsoft C / C++ MSF 7.00\\r\\n"``
followed by the bytes ``1A 44 53 00 00 00``.		followed by the bytes ``1A 44 53 00 00 00``.
- BlockSize - The block size of the internal file system. Valid values are		- BlockSize - The block size of the internal file system. Valid values are
512, 1024, 2048, and 4096 bytes. Certain aspects of the MSF file layout vary		512, 1024, 2048, and 4096 bytes. Certain aspects of the MSF file layout vary
depending on the block sizes. For the purposes of LLVM, we handle only block		depending on the block sizes. For the purposes of LLVM, we handle only block
sizes of 4KiB, and all further discussion assumes a block size of 4KiB.		sizes of 4KiB, and all further discussion assumes a block size of 4KiB.
- FreeBlockMapBlock - The index of a block within the file, at which begins		- FreeBlockMapBlock - The index of a block within the file, at which begins
a bitfield representing the set of all blocks within the file which are "free"		a bitfield representing the set of all blocks within the file which are "free"
(i.e. the data within that block is not used). This bitfield is spread across		(i.e. the data within that block is not used). See :ref:`msf_freeblockmap` for
the MSF file at ``BlockSize`` intervals.		more information.
Important: ``FreeBlockMapBlock`` can only be ``1`` or ``2``! This field		Important: ``FreeBlockMapBlock`` can only be ``1`` or ``2``!
is designed to support incremental and atomic updates of the underlying MSF
file. While writing to an MSF file, if the value of this field is `1`, you
can write your new modified bitfield to page 2, and vice versa. Only when
you commit the file to disk do you need to swap the value in the SuperBlock
to point to the new ``FreeBlockMapBlock``.
- NumBlocks - The total number of blocks in the file. ``NumBlocks * BlockSize``		- NumBlocks - The total number of blocks in the file. ``NumBlocks * BlockSize``
should equal the size of the file on disk.		should equal the size of the file on disk.
- NumDirectoryBytes - The size of the stream directory, in bytes. The stream		- NumDirectoryBytes - The size of the stream directory, in bytes. The stream
directory contains information about each stream's size and the set of blocks		directory contains information about each stream's size and the set of blocks
that it occupies. It will be described in more detail later.		that it occupies. It will be described in more detail later.
- BlockMapAddr - The index of a block within the MSF file. At this block is		- BlockMapAddr - The index of a block within the MSF file. At this block is
an array of ``ulittle32_t``'s listing the blocks that the stream directory		an array of ``ulittle32_t``'s listing the blocks that the stream directory
resides on. For large MSF files, the stream directory (which describes the		resides on. For large MSF files, the stream directory (which describes the
block layout of each stream) may not fit entirely on a single block. As a		block layout of each stream) may not fit entirely on a single block. As a
result, this extra layer of indirection is introduced, whereby this block		result, this extra layer of indirection is introduced, whereby this block
contains the list of blocks that the stream directory occupies, and the stream		contains the list of blocks that the stream directory occupies, and the stream
directory itself can be stitched together accordingly. The number of		directory itself can be stitched together accordingly. The number of
``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``.		``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``.

		.. _msf_freeblockmap:

		The Free Block Map
		==================

		The Free Block Map (sometimes referred to as the Free Page Map, or FPM) is a
		series of blocks which contains a bit flag for every block in the file. The
		flag will be set to 0 if the block is in use, and 1 if the block is unused.

		Each file contains two FPMs, one of which is active at any given time. This
		feature is designed to support incremental and atomic updates of the underlying
		MSF file. While writing to an MSF file, if the active FPM is FPM1, you can
		write your new modified bitfield to FPM2, and vice versa. Only when you commit
		the file to disk do you need to swap the value in the SuperBlock to point to
		the new ``FreeBlockMapBlock``.

		The Free Block Maps are stored as a series of single blocks throughout the file
		at intervals of ``BlockSize``. Because each FPM block is of size ``BlockSize``
		bytes, it contains 8 times as many bits as an interval has blocks. This means
		that the first block of each FPM refers to the first 8 intervals of the file
		(the first 32768 blocks), the second block of each FPM refers to the next 8
		intervals, and so on. This results in far more FPM blocks being present than
		are required, but in order to maintain backwards compatibility the format must
		stay this way.
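
As a rough illustration of the factor-of-8 overhead described above, the sketch below compares how many blocks of one FPM are physically present in a file with how many are needed to hold one bit per block. The helper names `fpmBlocksPresent` and `fpmBlocksNeeded` are hypothetical (not LLVM APIs), and a block size of 4096 is assumed.

```cpp
#include <cstdint>

// Assumption: block size (and therefore interval length) of 4096.
constexpr uint32_t kBlockSize = 4096;

// Blocks of one FPM that exist in the file: every index of the form
// 4096*k + FpmBlock that is below NumBlocks. FpmBlock is 1 or 2.
inline uint32_t fpmBlocksPresent(uint32_t NumBlocks, uint32_t FpmBlock) {
  if (NumBlocks <= FpmBlock)
    return 0;
  return (NumBlocks - FpmBlock - 1) / kBlockSize + 1;
}

// Blocks of one FPM actually needed: one bit per block in the file,
// and each FPM block holds 8 * 4096 bits.
inline uint32_t fpmBlocksNeeded(uint32_t NumBlocks) {
  constexpr uint32_t BitsPerFpmBlock = 8 * kBlockSize;
  return (NumBlocks + BitsPerFpmBlock - 1) / BitsPerFpmBlock;
}
```

For a file of exactly 32768 blocks, eight FPM1 blocks are present (indices 1, 4097, and so on up to 28673), while a single one is enough to describe every block.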

The Stream Directory		The Stream Directory
====================		====================
The Stream Directory is the root of all access to the other streams in an MSF		The Stream Directory is the root of all access to the other streams in an MSF
file. Beginning at byte 0 of the stream directory is the following structure:		file. Beginning at byte 0 of the stream directory is the following structure:

.. code-block:: c++		.. code-block:: c++

struct StreamDirectory {		struct StreamDirectory {
ulittle32_t NumStreams;		ulittle32_t NumStreams;
ulittle32_t StreamSizes[NumStreams];		ulittle32_t StreamSizes[NumStreams];
ulittle32_t StreamBlocks[NumStreams][];		ulittle32_t StreamBlocks[NumStreams][];
};		};

And this structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes.		And this structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes.
Note that each of the last two arrays is of variable length, and in particular		Note that each of the last two arrays is of variable length, and in particular
that the second array is jagged.		that the second array is jagged.

Example: Suppose a hypothetical PDB file with a 4KiB block size, and 4		Example: Suppose a hypothetical PDB file with a 4KiB block size, and 4
streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}.		streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}.

Stream 0: ceil(1000 / 4096) = 1 block		Stream 0: ceil(1000 / 4096) = 1 block

Stream 1: ceil(8000 / 4096) = 2 blocks		Stream 1: ceil(8000 / 4096) = 2 blocks

Show All 11 Lines	struct StreamDirectory {
ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000};		ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000};
ulittle32_t StreamBlocks[][] = {		ulittle32_t StreamBlocks[][] = {
{4},		{4},
{5, 6},		{5, 6},
{11, 9, 7, 8},		{11, 9, 7, 8},
{10, 15, 12}		{10, 15, 12}
};		};
};		};

In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes``		In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes``
would equal ``60``, and ``SuperBlock->BlockMapAddr`` would be an array of one		would equal ``60``, and ``SuperBlock->BlockMapAddr`` would be an array of one
``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``.		``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``.

Note also that the streams are discontiguous, and that part of stream 3 is in the		Note also that the streams are discontiguous, and that part of stream 3 is in the
middle of part of stream 2. You cannot assume anything about the layout of the		middle of part of stream 2. You cannot assume anything about the layout of the
blocks!		blocks!
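
The arithmetic in the example above can be reproduced with a short sketch. The function names `blocksForStream` and `streamDirectoryBytes` are hypothetical and for illustration only; the sketch assumes a 4096-byte block size and that every entry in the directory is a 4-byte ``ulittle32_t``.

```cpp
#include <cstdint>
#include <vector>

// Assumption: 4096-byte blocks, as in the rest of the document.
constexpr uint32_t kBlockSize = 4096;

// Number of blocks a stream of the given size occupies: ceil(size / 4096).
inline uint32_t blocksForStream(uint32_t StreamSize) {
  return (StreamSize + kBlockSize - 1) / kBlockSize;
}

// Size in bytes of the stream directory: NumStreams, then one size per
// stream, then one block index per block of every stream.
inline uint32_t streamDirectoryBytes(const std::vector<uint32_t> &StreamSizes) {
  uint32_t Words = 1;                                 // NumStreams
  Words += static_cast<uint32_t>(StreamSizes.size()); // StreamSizes[]
  for (uint32_t Size : StreamSizes)
    Words += blocksForStream(Size);                   // StreamBlocks[][]
  return Words * 4;                                   // 4 bytes per ulittle32_t
}
```

For the four streams in the example this gives 1 + 4 + (1 + 2 + 4 + 3) = 15 words, or 60 bytes, matching ``NumDirectoryBytes`` above.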

[Docs][PDB] Update MSF File documentation (Closed, Public)