As long as we have all this super useful knowledge, may as well share it.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Few minor nits, but thanks for doing this!
docs/PDB/MsfFile.rst | ||
---|---|---|
25 | Shouldn't this be BlockSize - 3? The pattern is Block 0, Block 1, Block 2, <BlockSize - 3 more blocks> | |
35 | Can we call it FPM instead of FBM? We use the terms Page and Block sometimes interchangeably, but we never refer to anything other than FPM in the code. | |
102–105 | I think this sentence is a bit wrong. The first two bytes of each FPM are not necessarily 1. But the first *three bits* of each FPM are. And not necessarily each FPM either, only the active one. I would just say "There is one bit in the FPM for every block in the file, including the super block." that should be clear enough that there are no exceptions. |
docs/PDB/MsfFile.rst | ||
---|---|---|
25 | I don't think so. I may be wrong, but here was my thinking. If the second FPM0 is at index 4097, that means that there are 4097 blocks before it. BUT, if the interval size is 4096, that means that there's 1 extra block in the first interval, which *I think* would have to be the super block. I can probably word this better, so I'll work on it, but I think the pattern is Super Block, followed by 1+ 4096 block-intervals repeating | |
35 | Yeah for sure. I'm glad you brought this up though, because the the code use is actually super confusing. Whenever we use the whole words, it's always FreeBlockMap (see SuperBlock struct), but whenever its an acronym, it's FPM. Should we just pick one and rename everything to that? I'd be happy to do that as a separate change, if we can agree on which one to go with. I personally prefer FBM, because everything else refers to them as "Blocks," not "Pages." Maybe I'm just bitter because it took me about an hour to figure out what FPM stood for :) | |
102–105 | So it seems like there's a couple issues here:
Am I missing something obvious here? The more I think about this the more confusing it gets. [0] The comment where you described the process of it marking 3 pages as allocated: https://reviews.llvm.org/D41734#967624 |
I think it's the block at the beginning that is throwing you off. *every* interval consists of 4096 blocks. And *every* interval looks like this:
+-------------+-------+------------------+------------------+------+------+------+ | Block Index | 0 | 1 | 2 | 3 | ... | 4095 | +=============+=======+==================+==================+======+======+======+ | Meaning | Data | Free Block Map 1 | Free Block Map 2 | Data | Data | Data | +-------------+-------+------------------+------------------+------+------+------+
So each interval has exactly 4096 blocks. Block 0 is only special in the first interval. It's still there in every other interval, it just contains regular data.
But according to the code, that can't be true. Specifically, I'm looking at llvm::msf::getFpmStreamLayout. If you look at the loop assuming that FpmBlock is 1, then it pushes index 1 as the first FPM block, adds the result of msf::getFpmIntervalLength (which always returns 4096), and pushes that. This means that the second FPM in the sequence is at index 4097, which would mean that there are 4097 blocks in front of it. Unless block at index 4096 is unused, the first interval is in fact 1 block longer.
Doesn't all of that match up exactly with the layout I posted? Paste a copy of that layout diagram in my previous comment onto the end of itself. 4096 becomes "Data", 4097 becomes "FPM1", 4098 becomes "FPM2". "Interval" is the distance between two consecutive blocks of the same FPM. It's always 4096. So FPM1 is on blocks 1, 4096 + 1, 4096*2 + 1, 4096*3 + 1, etc. The blocks immediately before that (0, 4096, 4096*2, 4096*3, etc) are just data, with the only exception being block 0, which is the super block.
OOOOOH it just clicked. I thought the FPMs were always the first blocks in the interval, but having a block of data first makes so much sense.
Sorry for being so thick, and thanks for breaking it down for me. I'll post an update here in a minute.
Now that I actually understand how the intervals work, I reworked a lot of what I had written. I think it's much clearer now, but I'm sure there's more nits to be had :)
llvm/docs/PDB/MsfFile.rst | ||
---|---|---|
97 ↗ | (On Diff #128962) | This is backwards (yes it's confusing). Since it's the free block map, a value of 1 indicates that the block is free, and a value of 0 indicates that the value is not free. |
107–110 ↗ | (On Diff #128962) | This is also not quite true. The k'th bit in the j'th FPM block refers to the 4096*j + k'th block in the file. So, for example, consider blocks 1 and 4097. These are the first and second blocks of FPM1, respectively. To find out the allocation status of block 4097, you look at bit 1 of byte 512 (512*8 + 1 = 4097). There's actually a comment in the MSF reference implementation that indicates this was probably a design oversight, because this results in far more FPM blocks being reserved in a file than are actually necessary (by a factor of 8). But since it's already that way and everything is relying on it, and files are out there in the wild using it, and it results in neglible file bloat, it just stays that way. This is actually the whole point of the IncludeUnusedFpmData flag from my previous patch. If you specify true for it, it will give you a stream consisting of every block that *could* be an FPM block (i.e. the index is of the form 4096*k + 1), but if you pass false it truncates the extraneous ones at the end that would otherwise refer to blocks that are not even in the file. |
Shouldn't this be BlockSize - 3? The pattern is Block 0, Block 1, Block 2, <BlockSize - 3 more blocks>