This is an archive of the discontinued LLVM Phabricator instance.

[pdb] Introduce an MsfBuilder class for laying out PDB files
ClosedPublic

Authored by zturner on Jul 13 2016, 11:24 AM.

Details

Summary

As we get further with writing PDB files, we run into the issue of needing to lay out the PDB file, generate a valid super block, and generate the directory and stream sizes array. Previously this was easy because we we were just reading in a PDB, and writing exactly the same directory and super block out. But now that we're trying to write only some of a particular stream, it means that the size of the stream we output is different from the size of the original stream. This in turn leads to different number of pages being used, which means a different stream directory, and possibly even the stream directory itself using a different number of blocks as before.

So we introduce an MsfBuilder class here. You create one, add a bunch of streams to it of the size you want, optionally giving it hints about what blocks you want a particular stream to occupy, and then it handles layout at the end. A bunch of unit tests are added to verify functionality.

I separated this out from a larger patch which actually updates the rest of the codebase to use this, and with my other patch I have confirmed that this class does produce a valid Msf layout.

Diff Detail

Repository
rL LLVM

Event Timeline

zturner updated this revision to Diff 63830.Jul 13 2016, 11:24 AM
zturner retitled this revision from to [pdb] Introduce an MsfBuilder class for laying out PDB files.
zturner updated this object.
zturner added reviewers: ruiu, rnk, majnemer.
zturner added a subscriber: llvm-commits.
zturner updated this revision to Diff 63834.Jul 13 2016, 11:32 AM

Update some comments.

Looks pretty good overall. I have a few suggestions/questions.

include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
17 ↗(On Diff #63834)

The inconsistent capitalization of initialisms seems unfortunate. PDB but Msf.

38 ↗(On Diff #63834)

Some comments describing these members would be nice. I'd expect something like addStream to return something like an index or handle to refer to that new stream. What's the difference between the two versions (with and without Blocks)?

lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
30 ↗(On Diff #63834)

I think you meant "addresses" (plural).

108 ↗(On Diff #63834)

Are pages and blocks the same thing? Maybe blocks are ranges of pages. It seems Idx is used for both stream indexes and page/block indexes.

143 ↗(On Diff #63834)

I think this comment would be better in the header by the declaration of the method, as it explains the difference between the two addStreams, which users of the API care about.

The funny line break also makes me wonder if you left out some words.

221 ↗(On Diff #63834)

"should was" ?

223 ↗(On Diff #63834)

Is there a reason why you're specifying (global) ::memcpy rather than std::memcpy?

lib/DebugInfo/PDB/Raw/MsfCommon.cpp
18 ↗(On Diff #63834)

::memcmp for consistency with ::memcpys you put elsewhere?

zturner added inline comments.Jul 13 2016, 3:01 PM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
38 ↗(On Diff #63834)

There's two use cases I had in mind. One is you're building a PDB from scratch. You don't care what blocks a stream lies on, you just want the builder to figure it out. That's where you use the overload with no block list.

The other use case is you're editing a PDB. You load up one from disk, make some edits to it, and write it back out. This second use case is how I'm doing the Pdb <-> Yaml round-tripping, but it will be useful down the line when we want to integrate this into the linker. In this use case, you want to change the layout of the PDB as little as possible. If a stream used to take up blocks 10, 11, and 12, it should still *try* to take up blocks 10, 11, and 12. As long as you don't make any edits to the stream, that should still be possible. But if you remove a record or add a record, the stream extents could change.

When you call this overload, the builder does all of this. If it can use exactly the same set of blocks, it will. If you add data it will keep the blocks you specified but add some new ones. If you remove data it will keep the first N blocks you specified, but free up the other ones.

lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
30 ↗(On Diff #63834)

I actually meant address singular. The "block map address" is one 32-bit integer.

108 ↗(On Diff #63834)

Yes

143 ↗(On Diff #63834)

The funny line breaks happen when you clang-format a multi-line comment where one line went past 80 characters. I always miss a few when going over a CL after clang-formatting.

amccarth added inline comments.Jul 13 2016, 3:06 PM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
38 ↗(On Diff #63834)

OK, great answer. A short version of that in the comments would probably help someone trying to figure out how to use this interface.

lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
30 ↗(On Diff #63834)

Ah, I read it as "super block address" and "block map address."

ruiu added inline comments.Jul 13 2016, 3:17 PM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
33 ↗(On Diff #63834)

What's the use case of reset()? You can create as many instances of this class as you want, so wouldn't it be easier to make it a non-reusable class?

67 ↗(On Diff #63834)

Which member contains streams' actual data?

Ahh, so this class is to constructs only the MSF header, and stream data is managed separately outside of this class. Is this correct? So it seems that you'd use this class to create a MSF directory, obtain a Layout object, and write each stream as the returned layout. Not sure how you would handle actual file writing, but this is probably just out of scope of this patch.

lib/DebugInfo/PDB/Raw/MsfCommon.cpp
22–37 ↗(On Diff #63834)

These two tests seem to do basically the same thing. You can remove the former check.

unittests/DebugInfo/PDB/MsfBuilderTest.cpp
149 ↗(On Diff #63834)

of of

zturner added inline comments.Jul 13 2016, 3:24 PM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
33 ↗(On Diff #63834)

build() calls reset() after it's done to invalidate the entire structure so you have to start over. It would be nice if you could call build(), make one or two simple changes, then call build() again, but it's a little bit tricky for technical reasons, so for now I'm just having build() invalidate it.

67 ↗(On Diff #63834)

Correct, this is only to generate in-memory objects that describe the layout of an MSF. I have some file writing in other patches but I didn't want to submit it all at once. But the short version is you create a PDBFileBuilder which will contain an MsfBuilder, then add streams to the MsfBuilder, then make some edits using the PDBFileBuilder (like setting fields from the individual streams to a different value).

Since my use case right now is editing a PDB (i.e. reading one in, making some changes, writing it back out), there's no code yet to generate one from scratch. But all the writing to disk and setting stream fields and serializing is working, so it should be easy to extend to generate a PDB from scratch.

ruiu added inline comments.Jul 13 2016, 3:29 PM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
33 ↗(On Diff #63834)

I think what I was trying to say is to move the code from reset to the ctor if needed and remove the reset member function entirely. If a user wants to create two or more MSF headers, they can create multiple instances of MsfBuilder, so reset does not seem necessary.

lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
26–27 ↗(On Diff #63834)

I think this is a bit error-prone. Why don't you merge this with the ctor?

The reason I still kept reset is because without it, subsequent calls to
build will generate invalid layouts, which seemed unintuitive and error
prone. I can fix build() to support this, but it's kind of tricky. Even if
someone can create another builder they shouldn't be able to generate a bad
layout.

About initialize(), In earlier revisions this function could fail, so it
needed to return an Error. But I removed the ability for it to fail.
Although now that I think about it, I should probably validate the block
size.

ruiu edited edge metadata.Jul 13 2016, 3:51 PM

You can return an Error on second and subsequent calls of build(), no? I don't see a reason to allow users to call build() multiple times. If they want to create multiple MSF headers, they can simply create as many MsfBuilder instances they want, and call build() only once for each instance.

Yes I can have it return an error if it's already been called.

Calling it twice is useful for writing unit tests when you want to test
that 2 slightly different configurations differ in the expected way. But it
already doesn't work since it was difficult to do so in my current unit
tests i use 2 builders which are configured the same way except for one
thing.

In any case, having it return an Error on subsequent calls to build is
fine, I can do that

ruiu added a comment.Jul 13 2016, 4:32 PM

By the way, did you figure out the format of the bitmap after the super block?

No, and this builder actually overwrites block 1 with the directory block
map by default. At some point we'll have to go through the ms code to try
to figure out block 1

zturner updated this revision to Diff 64160.Jul 15 2016, 10:32 AM
zturner edited edge metadata.

Fix suggestions

majnemer edited edge metadata.Jul 15 2016, 10:53 AM

No, and this builder actually overwrites block 1 with the directory block
map by default. At some point we'll have to go through the ms code to try
to figure out block 1

It's the free page map: https://github.com/Microsoft/microsoft-pdb/blob/master/PDB/msf/msf.cpp#L38

msf.cpp seems to contain much of the FPM manipulation code.

include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
31–32 ↗(On Diff #64160)

What do CanGrow and MinBlockCount mean?

lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
26–27 ↗(On Diff #64160)

Ditto.

zturner added inline comments.Jul 15 2016, 11:17 AM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
31–32 ↗(On Diff #64160)

MinBlockCount means "assume I will have at least this many blocks in the PDB file". The reason this is useful is because when you want to modify an existing PDB, you want to keep the layout as stable as possible. So we will be calling addStream many times in succession and using the overload that takes a block list. But if you do this without specifying a MinBlockCount, then it will think you are trying to reference blocks that are outside the bounds of the entire file. For example, if you were to write this code:

msf.initialize(4096);
msf.addStream(29083, {7, 8, 9, 12, 14, 15, 16, 17});

addStream is going to complain that the file does not have space for that many blocks. It could try to help you out and grow the file automatically, and just say "well you asked for block 17, i'll just make sure there's at least enough blocks to be able to fit 17", but this seems awfully inefficient primarily because in the use case where you'll be doing this (i.e. you are reading in an existing PDB file), the super block already tells you exactly how many blocks the file has.

Even if we don't automatically grow the file here (and the current implementation doesn't), growing the file is still useful. If you want to create a new PDB from scratch, then you will use the overload of addStream() that does not take the block list, and in this case every single call to addStream() necessitates a grow.

Basically, CanGrow dictates what happens when the FreeBlocks bit vector has no entries. If it is true, we will resize the bit vector and make new blocks. If it is false, we return an error. We could actually get by without it and just always use behave as if CanGrow = true, but it seems useful as a way to prevent programming errors in cases where you know you are not adding any records or doing anything which should cause the PDB to get larger. I'm using false here for some unit tests, for example, because the test is intended to verify the internal block allocation behavior, and having it return an error if blocks are unexpectedly allocated is a useful diagnostic.

majnemer added inline comments.Jul 15 2016, 11:27 AM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
31–32 ↗(On Diff #64160)

Can some of this get added to the header?

zturner added inline comments.Jul 15 2016, 11:28 AM
include/llvm/DebugInfo/PDB/Raw/MsfBuilder.h
31–32 ↗(On Diff #64160)

Sure

zturner updated this revision to Diff 64176.Jul 15 2016, 11:42 AM
zturner edited edge metadata.
ruiu added inline comments.Jul 15 2016, 12:54 PM
lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
27–28 ↗(On Diff #64176)

So you don't like it, Zach?

zturner added inline comments.Jul 15 2016, 1:12 PM
lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
27–28 ↗(On Diff #64176)

You mean merging it with the constructor? The problem is that it's possible to pass an invalid block size, so I didn't want to make it possible to construct an instance of an MsfBuilder with an incorrect block size. If you use the constructor, there is no way to pass the error back to the user. I could make an enum for the block size, but I wasn't convinced that was better.

ruiu added inline comments.Jul 15 2016, 1:14 PM
lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
27–28 ↗(On Diff #64176)

You could make the ctor private and instead provide a static member function (say, create) which verifies all parameters and then call the constructor. In this way you can return an error or a new instance.

zturner added inline comments.Jul 15 2016, 1:16 PM
lib/DebugInfo/PDB/Raw/MsfBuilder.cpp
27–28 ↗(On Diff #64176)

Ahh, yea actually that's not a bad idea. Thanks for the suggestion!

zturner updated this revision to Diff 64189.Jul 15 2016, 1:34 PM

Make the constructor private and change Error initialize() to static Error create()

ruiu accepted this revision.Jul 15 2016, 1:37 PM
ruiu edited edge metadata.

LGTM

This revision is now accepted and ready to land.Jul 15 2016, 1:37 PM
This revision was automatically updated to reflect the committed changes.