This is an archive of the discontinued LLVM Phabricator instance.

[pdb] Improve StreamInterface to support writing
ClosedPublic

Authored by zturner on Jun 8 2016, 4:41 PM.

Details

Summary

This adds support for writing to a StreamInterface, including a MappedBlockStream which might be discontiguous.

No code actually uses this yet, but I've added a bunch of unit tests to illustrate the idea and demonstrate that it works.

There were two unexpected complexities that arose with this patch.

First, due to the nature of how we need to allocate from a pool when a read crosses a block boundary, things get complicated when you start writing. If you just write to the underlying stream (which could be discontiguous at the point you write), anyone holding onto a pool pointer will now have a stale view of the contents.

Second, we previously assumed that any two reads from the same stream offset would have the same size (due to the nature of how streams are laid out as records that don't often change). This is a very inconvenient assumption when trying to write unit tests, where you may want to write something, read it back out, then write over it, then read it back out again. So I removed this assumption.

The result is somewhat convoluted, but the idea is this:

read:
   1. Try to read contiguously.  If the data is contiguous, return it.
   2. Otherwise, check whether there is a cached pool allocation for this
      offset that is the same size as or bigger than the request.  If so,
      return it.
   3. Otherwise, make a new pool allocation for the given offset and
      size, copy the data into it, and add it to the list of allocations for
      this offset.

write:
   1. Write all the data across the discontiguous sequence of blocks.
   2. Iterate every cached allocation, looking for any allocation that
      overlaps this request.
   3. For those that overlap, memcpy the corresponding data from the new
      write over the overlapped portion of the cached allocation.

While this is O(n) in the number of cached allocations, caching is again most
common in tests and will not be a frequent occurrence in real-world
situations, so I think it's OK (see the sketch below).
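
For illustration, here is a minimal sketch of the write-side fix-up (steps 2 and 3 above). All names here (CacheEntry, updateOverlappingCaches) are hypothetical, not the actual patch:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// One cached pool allocation, made earlier to service a read that
// crossed a block boundary.
struct CacheEntry {
  uint32_t Offset;           // Stream offset this allocation covers.
  std::vector<uint8_t> Data; // Contiguous copy of the stream bytes.
};

// After the new data has been written across the discontiguous blocks,
// walk every cached allocation and memcpy the overlapped portion so
// outstanding readers see the updated bytes.
void updateOverlappingCaches(std::vector<CacheEntry> &Cache,
                             uint32_t WriteOffset, const uint8_t *Src,
                             uint32_t WriteSize) {
  uint32_t WriteEnd = WriteOffset + WriteSize;
  for (CacheEntry &E : Cache) {
    uint32_t EntryEnd = E.Offset + static_cast<uint32_t>(E.Data.size());
    // Intersect [WriteOffset, WriteEnd) with [E.Offset, EntryEnd).
    uint32_t Lo = std::max(WriteOffset, E.Offset);
    uint32_t Hi = std::min(WriteEnd, EntryEnd);
    if (Lo >= Hi)
      continue; // No overlap with this allocation.
    std::memcpy(E.Data.data() + (Lo - E.Offset), Src + (Lo - WriteOffset),
                Hi - Lo);
  }
}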

Diff Detail

Repository
rL LLVM

Event Timeline

zturner updated this revision to Diff 60115.Jun 8 2016, 4:41 PM
zturner retitled this revision from to [pdb] Improve StreamInterface to support writing.
zturner updated this object.
zturner added reviewers: ruiu, rnk, majnemer.
zturner added a subscriber: llvm-commits.
ruiu edited edge metadata.Jun 8 2016, 5:07 PM

IIUC, this is a patch to allow users to read and write arbitrary streams at arbitrary positions, just like files on a file system (I haven't read through it yet, so correct me if I'm wrong). I wonder whether it is actually going to be needed. In most cases, we are going to create a new stream in the PDB (by copying a stream and appending some data at the end of it or in the middle of it) and then atomically update the MSF so that the old stream number refers to the new stream, no? If that's the case, we open most streams as read-only and write to streams without seeking back.

In D21157#452974, @ruiu wrote:

IIUC, this is a patch to allow users to read and write arbitrary streams at arbitrary positions, just like files on a file system (I haven't read through it yet, so correct me if I'm wrong). I wonder whether it is actually going to be needed. In most cases, we are going to create a new stream in the PDB (by copying a stream and appending some data at the end of it or in the middle of it) and then atomically update the MSF so that the old stream number refers to the new stream, no? If that's the case, we open most streams as read-only and write to streams without seeking back.

While it might not be an immediate priority for LLD, I think it will be needed there in the future, and I also think it will be very helpful for some of the tooling improvements I have in mind. For example, we might want to find out how our tools would handle a specific type of CodeView record, or how Microsoft's tools would handle a specific type of CodeView record. But it's hard to figure out how to get Microsoft's tools to generate that record type, or to get it with just the right flags that we want to test. With the YAML stuff, we will be able to write whatever YAML we want and generate a PDB that contains it. We may even want the option to take a small fragment of YAML that describes one single record and inject it into an existing PDB file.

For those kinds of operations, we will need the ability to "append" to a PDB. This means handling complications like what happens when you want to write to a PDB but there aren't enough blocks: you have to allocate a new block at the end of the file, then update the directory, and then perform the write using the newly allocated block.
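
For illustration, a hypothetical sketch of that grow-then-write step; MsfLayout and growStreamByOneBlock are made-up names, not the LLVM API:

#include <cstdint>
#include <vector>

// Minimal view of the pieces involved when a stream must grow.
struct MsfLayout {
  uint32_t NumFileBlocks;             // Total blocks in the file so far.
  std::vector<uint32_t> StreamBlocks; // Directory's block map for the stream.
};

// Allocate a new block at the end of the file, record it in the
// directory's block map for this stream, and return it so the pending
// write can spill into it.
uint32_t growStreamByOneBlock(MsfLayout &Layout) {
  uint32_t NewBlock = Layout.NumFileBlocks++; // New block at end of file.
  Layout.StreamBlocks.push_back(NewBlock);    // Update the directory.
  return NewBlock;                            // Now safe to write into it.
}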

Finally, I think it will be very helpful for testing. When we write a test to make sure we can write PDBs, the easiest way to start is to output some YAML, generate a PDB from it, dump it, and make sure the dumps of the new PDB and the old PDB look the same. In order to do that, we need to be able to maintain the structure (stream order, etc.).

While it does allow writing into the middle of a stream at an arbitrary position, that wasn't the primary motivating factor. The primary motivating factor is just to be able to treat discontiguous sequences of blocks as contiguous when writing, so that the person doing the writing doesn't have to worry about seeking around to different blocks. As long as you give it the stream map, it just works.

ruiu added a comment.Jun 8 2016, 5:46 PM

If you like the readable and writable streams, I don't have a strong opinion against it, since I don't really understand the need here.

But it is a bit concerning that the use cases you described are currently all hypothetical, and we are writing relatively complex code, particularly for cache consistency. Do we need to update the cached entries? It might be a good property to guarantee, but we may be able to just say, "when you write something to a stream, all data objects you read before are now invalid -- don't use them; read them again." It's hard for me to imagine a use case in which you read, write, and then read some data again and again from/to the same stream. Even if you need it in the future, you can add it when it is needed, no?

include/llvm/DebugInfo/CodeView/StreamWriter.h
32–33 ↗(On Diff #60115)

I'd like to name them writeUint{16,32} instead of defining them as an overloaded function. Sometimes the argument type is not obvious in a local context (use of auto makes it much harder).

In D21157#453015, @ruiu wrote:

If you like the readable and writable streams, I don't have a strong opinion against it, since I don't really understand the need here.

So imagine we want to test that our PDB writing code works. How would we do it? We aren't generating any PDBs yet. Even if we imagine that 6 months down the line LLD has full support for generating PDBs, how complete will it be? What if there's one record type that we don't generate correctly, but we don't know it because none of our tests compile the right code to exercise it?

The need is basically to enable better testing. Our support for dumping PDBs is becoming more and more complete; only a few things remain. It works by reading the PDB, building a bunch of internal data structures, and printing them. When it's time for LLD to generate PDBs, it will also probably use some internal data structures to represent the various streams and records. Since the ones we're using for reading are already well tested and work pretty well, it would be nice to reuse as much of them as possible for writing. Furthermore, by using the same code, we can start testing arbitrarily complex PDBs right now, and when it comes time for LLD to start writing PDBs, all of the code will already be tested. The YAML stuff, for example, is just a description of the PDB and its records. So we read the YAML, use the fields to fill out the structures that describe the PDB, and then use the StreamWriter to write that into an actual PDB. LLD will do almost the same thing, except that instead of coming from YAML, the "description" will come from IR metadata. Everything else will be the same: fill out the fields, write the file.

But it is a bit concerning that the use cases you described are currently all hypothetical, and we are writing relatively complex code, particularly for cache consistency. Do we need to update the cached entries? It might be a good property to guarantee, but we may be able to just say, "when you write something to a stream, all data objects you read before are now invalid -- don't use them; read them again." It's hard for me to imagine a use case in which you read, write, and then read some data again and again from/to the same stream. Even if you need it in the future, you can add it when it is needed, no?

I thought about doing it that way, but it seemed overly harsh to say that if you write even one byte to a stream, any outstanding references pointing into the stream are invalidated. Mostly I did it because it made the tests much cleaner and shorter, since I didn't have to construct brand-new objects every time just to write one integer.

zturner added inline comments.Jun 8 2016, 10:38 PM
include/llvm/DebugInfo/CodeView/StreamWriter.h
32–33 ↗(On Diff #60115)

Hmm, it turns out this makes writeEnum difficult to implement, because it relies on the overload. It can still be done with something like this:

// Function templates can't be partially specialized; use enable_if instead:

template <typename T,
          typename std::enable_if<std::is_same<
              typename std::underlying_type<T>::type, uint32_t>::value,
            int>::type = 0>
Error writeEnum(T Num) {
  return writeUInt32(static_cast<uint32_t>(Num));
}

template <typename T,
          typename std::enable_if<std::is_same<
              typename std::underlying_type<T>::type, uint16_t>::value,
            int>::type = 0>
Error writeEnum(T Num) {
  return writeUInt16(static_cast<uint16_t>(Num));
}

but I start to wonder if it's an improvement.

majnemer added inline comments.Jun 9 2016, 10:13 AM
lib/DebugInfo/PDB/Raw/MappedBlockStream.cpp
47–48 ↗(On Diff #60115)

Doesn't this mean that larger queries that happened earlier are not reused?

Let's say we start with an empty cache.
Then a read request for 4096 bytes from offset 0 happens.
Then a read request for 1024 bytes at offset 1024 happens.
Won't the second request fail to share data with the first?

zturner added inline comments.Jun 9 2016, 10:22 AM
lib/DebugInfo/PDB/Raw/MappedBlockStream.cpp
47–48 ↗(On Diff #60115)

You are correct. I thought of that, but it makes the data structure more complicated, and I didn't think it was worth it for a code path that would almost never get exercised anyway. This is mostly here so you don't shoot yourself in the foot. As Rui mentioned earlier, reading and writing the same PDB is not going to be a common thing at all, so I made it just smart enough that your program will still work if you do it, but simple enough that I didn't have to devise a complicated new data structure.

Oh, one more thing: larger queries will still be reused IF the offset of the
previous request is the same as the offset of the current request.

What doesn't work is when they overlap but start at different locations, e.g.:

Read 1: offset 0, length 1024
Read 2: offset 512, length 256

Even in that case, though, if the 256-byte read didn't cross the
discontinuity, it would be satisfied without the cache. So the probability
of running into this is very low.
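
To make the limitation concrete, here is a hedged sketch of the reuse check being described; the names are hypothetical, not the actual code. An allocation is reused only when it starts at the requested offset and is at least as large as the request:

#include <cstdint>
#include <vector>

struct CacheEntry {
  uint32_t Offset;           // Stream offset this allocation covers.
  std::vector<uint8_t> Data; // Cached copy of the stream bytes.
};

// Entry [0, 1024) is reused for a second 1024-byte read at offset 0,
// but a 256-byte read at offset 512 misses even though it overlaps.
bool canReuseAllocation(const CacheEntry &E, uint32_t Offset,
                        uint32_t Size) {
  return E.Offset == Offset && E.Data.size() >= Size;
}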

On second thought, I think I can actually make that work. I was thinking
you'd want a full-blown interval tree, but it could be done using the same
walk I'm doing for writing. Since this is still a rare case, I'm fine
adding this or not adding it, depending on what people prefer.

ruiu added a comment.Jun 9 2016, 10:36 AM

I agree with you that this is good for testing and that completeness is generally good, but what about runtime performance? Updating read buffers incurs a runtime cost and hence slows down PDB file generation. The most important thing for the linker is performance, and the current good numbers come from the fact that we just do less, rather than doing everything efficiently. If we add too much to the PDB reader/writer library beforehand, we risk the usability of the library for the linker. Aiming for completeness is not always a win.

That being said, the most difficult part of PDB reading/writing is understanding the format, so once we have a complete picture, it should be much easier to speed it up (by eliminating features or whatever) or to write another implementation. So I think it's your call. If this helps with creating PDBs and helps us understand PDB as a file format (I guess it does), it's probably a good change.

In D21157#453549, @ruiu wrote:

I agree with you that this is good for testing and that completeness is generally good, but what about runtime performance? Updating read buffers incurs a runtime cost and hence slows down PDB file generation. The most important thing for the linker is performance, and the current good numbers come from the fact that we just do less, rather than doing everything efficiently. If we add too much to the PDB reader/writer library beforehand, we risk the usability of the library for the linker. Aiming for completeness is not always a win.

That being said, the most difficult part of PDB reading/writing is understanding the format, so once we have a complete picture, it should be much easier to speed it up (by eliminating features or whatever) or to write another implementation. So I think it's your call. If this helps with creating PDBs and helps us understand PDB as a file format (I guess it does), it's probably a good change.

Yeah, I agree we will want to be careful about performance. If it becomes a problem, we can rename MappedBlockStream to ReadWritePdbStream and then make a WriteOnlyPdbStream that doesn't do any of this and has none of the caching logic at all. I think we have enough extensibility points to get the performance we need, but I guess we won't know until we try. Worst case, we end up with a second codepath for writing, but at least with all the testing we will understand the format. I'm hoping we can reuse, though.

ruiu accepted this revision.Jun 9 2016, 11:01 AM
ruiu edited edge metadata.

LGTM

include/llvm/DebugInfo/PDB/Raw/MappedBlockStream.h
55–58 ↗(On Diff #60115)

Is this different from MutableArrayRef?

This revision is now accepted and ready to land.Jun 9 2016, 11:01 AM
zturner added inline comments.Jun 9 2016, 11:03 AM
include/llvm/DebugInfo/PDB/Raw/MappedBlockStream.h
55–58 ↗(On Diff #60115)

That's a good point, I should fix that. Thanks

This revision was automatically updated to reflect the committed changes.
chapuni added inline comments.
llvm/trunk/unittests/DebugInfo/PDB/MappedBlockStreamTest.cpp
317

I am afraid this generates temporary objects. Fixed in r272457.