[Docs][PDB] Update MSF File documentation
ClosedPublic

Authored by colden on Jan 8 2018, 8:50 AM.

Details

Summary

As long as we have all this super useful knowledge, may as well share it.

Diff Detail

Repository
rL LLVM
colden created this revision.Jan 8 2018, 8:50 AM

Few minor nits, but thanks for doing this!

docs/PDB/MsfFile.rst
25 ↗(On Diff #128944)

Shouldn't this be BlockSize - 3? The pattern is Block 0, Block 1, Block 2, <BlockSize - 3 more blocks>

35 ↗(On Diff #128944)

Can we call it FPM instead of FBM? We use the terms Page and Block sometimes interchangeably, but we never refer to anything other than FPM in the code.

102–105 ↗(On Diff #128944)

I think this sentence is a bit wrong. The first two bytes of each FPM are not necessarily 1. But the first *three bits* of each FPM are. And not necessarily each FPM either, only the active one. I would just say "There is one bit in the FPM for every block in the file, including the super block." that should be clear enough that there are no exceptions.

colden added inline comments.Jan 8 2018, 10:26 AM
docs/PDB/MsfFile.rst
25 ↗(On Diff #128944)

I don't think so. I may be wrong, but here was my thinking.

If the second FPM0 is at index 4097, that means that there are 4097 blocks before it. BUT, if the interval size is 4096, that means that there's 1 extra block in the first interval, which *I think* would have to be the super block.

I can probably word this better, so I'll work on it, but I think the pattern is Super Block, followed by 1+ 4096 block-intervals repeating

35 ↗(On Diff #128944)

Yeah for sure.

I'm glad you brought this up though, because the the code use is actually super confusing. Whenever we use the whole words, it's always FreeBlockMap (see SuperBlock struct), but whenever its an acronym, it's FPM. Should we just pick one and rename everything to that? I'd be happy to do that as a separate change, if we can agree on which one to go with. I personally prefer FBM, because everything else refers to them as "Blocks," not "Pages."

Maybe I'm just bitter because it took me about an hour to figure out what FPM stood for :)

102–105 ↗(On Diff #128944)

So it seems like there's a couple issues here:

  1. You are correct, I should have said "bits"
  2. You are correct, this is only guaranteed of the active FPM.
  3. 2 vs 3, This one I'm unsure about. The MS code does indeed mark the first 3 bits as "in-use,"[0] but thinking about this in conjunction with your first question, that doesn't really make sense. Because the first interval contains 4097 blocks, there *must* be one that isn't referenced by the FPM. My assumption would be it's the SuperBlock (how could it possibly be "free?").

Am I missing something obvious here? The more I think about this the more confusing it gets.

[0] The comment where you described the process of it marking 3 pages as allocated: https://reviews.llvm.org/D41734#967624

I think it's the block at the beginning that is throwing you off. *every* interval consists of 4096 blocks. And *every* interval looks like this:

+-------------+-------+------------------+------------------+------+------+------+
| Block Index | 0     | 1                | 2                | 3    | ...  | 4095 |
+=============+=======+==================+==================+======+======+======+
| Meaning     | Data  | Free Block Map 1 | Free Block Map 2 | Data | Data | Data |
+-------------+-------+------------------+------------------+------+------+------+

So each interval has exactly 4096 blocks. Block 0 is only special in the first interval. It's still there in every other interval, it just contains regular data.

So each interval has exactly 4096 blocks. Block 0 is only special in the first interval. It's still there in every other interval, it just contains regular data.

But according to the code, that can't be true. Specifically, I'm looking at llvm::msf::getFpmStreamLayout. If you look at the loop assuming that FpmBlock is 1, then it pushes index 1 as the first FPM block, adds the result of msf::getFpmIntervalLength (which always returns 4096), and pushes that. This means that the second FPM in the sequence is at index 4097, which would mean that there are 4097 blocks in front of it. Unless block at index 4096 is unused, the first interval is in fact 1 block longer.

So each interval has exactly 4096 blocks. Block 0 is only special in the first interval. It's still there in every other interval, it just contains regular data.

But according to the code, that can't be true. Specifically, I'm looking at llvm::msf::getFpmStreamLayout. If you look at the loop assuming that FpmBlock is 1, then it pushes index 1 as the first FPM block, adds the result of msf::getFpmIntervalLength (which always returns 4096), and pushes that. This means that the second FPM in the sequence is at index 4097, which would mean that there are 4097 blocks in front of it. Unless block at index 4096 is unused, the first interval is in fact 1 block longer.

Doesn't all of that match up exactly with the layout I posted? Paste a copy of that layout diagram in my previous comment onto the end of itself. 4096 becomes "Data", 4097 becomes "FPM1", 4098 becomes "FPM2". "Interval" is the distance between two consecutive blocks of the same FPM. It's always 4096. So FPM1 is on blocks 1, 4096 + 1, 4096*2 + 1, 4096*3 + 1, etc. The blocks immediately before that (0, 4096, 4096*2, 4096*3, etc) are just data, with the only exception being block 0, which is the super block.

So each interval has exactly 4096 blocks. Block 0 is only special in the first interval. It's still there in every other interval, it just contains regular data.

But according to the code, that can't be true. Specifically, I'm looking at llvm::msf::getFpmStreamLayout. If you look at the loop assuming that FpmBlock is 1, then it pushes index 1 as the first FPM block, adds the result of msf::getFpmIntervalLength (which always returns 4096), and pushes that. This means that the second FPM in the sequence is at index 4097, which would mean that there are 4097 blocks in front of it. Unless block at index 4096 is unused, the first interval is in fact 1 block longer.

Doesn't all of that match up exactly with the layout I posted? Paste a copy of that layout diagram in my previous comment onto the end of itself. 4096 becomes "Data", 4097 becomes "FPM1", 4098 becomes "FPM2". "Interval" is the distance between two consecutive blocks of the same FPM. It's always 4096. So FPM1 is on blocks 1, 4096 + 1, 4096*2 + 1, 4096*3 + 1, etc. The blocks immediately before that (0, 4096, 4096*2, 4096*3, etc) are just data, with the only exception being block 0, which is the super block.

OOOOOH it just clicked. I thought the FPMs were always the first blocks in the interval, but having a block of data first makes so much sense.

Sorry for being so thick, and thanks for breaking it down for me. I'll post an update here in a minute.

colden updated this revision to Diff 128962.Jan 8 2018, 11:45 AM

Now that I actually understand how the intervals work, I reworked a lot of what I had written. I think it's much clearer now, but I'm sure there's more nits to be had :)

colden added a comment.Jan 9 2018, 6:09 PM

Updated with my newer, better understanding of the file layout

zturner added inline comments.Jan 9 2018, 10:13 PM
llvm/docs/PDB/MsfFile.rst
97 ↗(On Diff #128962)

This is backwards (yes it's confusing). Since it's the free block map, a value of 1 indicates that the block is free, and a value of 0 indicates that the value is not free.

107–110 ↗(On Diff #128962)

This is also not quite true. The k'th bit in the j'th FPM block refers to the 4096*j + k'th block in the file. So, for example, consider blocks 1 and 4097. These are the first and second blocks of FPM1, respectively. To find out the allocation status of block 4097, you look at bit 1 of byte 512 (512*8 + 1 = 4097).

There's actually a comment in the MSF reference implementation that indicates this was probably a design oversight, because this results in far more FPM blocks being reserved in a file than are actually necessary (by a factor of 8). But since it's already that way and everything is relying on it, and files are out there in the wild using it, and it results in neglible file bloat, it just stays that way.

This is actually the whole point of the IncludeUnusedFpmData flag from my previous patch. If you specify true for it, it will give you a stream consisting of every block that *could* be an FPM block (i.e. the index is of the form 4096*k + 1), but if you pass false it truncates the extraneous ones at the end that would otherwise refer to blocks that are not even in the file.

colden added inline comments.Jan 10 2018, 8:46 AM
llvm/docs/PDB/MsfFile.rst
97 ↗(On Diff #128962)

Aaah so close.

107–110 ↗(On Diff #128962)

Oh fascinating. I was wondering about that / 8. I'll see if I can figure out a good way to phrase this.

colden updated this revision to Diff 129283.Jan 10 2018, 9:04 AM

nth time's the charm!

zturner accepted this revision.Jan 10 2018, 9:51 AM

looks good, thanks!

This revision is now accepted and ready to land.Jan 10 2018, 9:51 AM

Yay, thanks!

If you wouldn't mind committing this for me, I'd really appreciate it.

Ping, could someone please submit this for me?

Closed by commit rL322404: Update MSF File Documentation. (authored by zturner, committed by ). · Explain WhyJan 12 2018, 1:43 PM
This revision was automatically updated to reflect the committed changes.

Thank you Zach!