This is an archive of the discontinued LLVM Phabricator instance.

More strongly separate the PDB reading interfaces and PDB writing interfaces
ClosedPublic

Authored by zturner on Jul 22 2016, 11:51 AM.

Details

Summary

This is a huge patch, so instead of asking you guys to look at it in detail, I'll summarize the highlights (I also have 7 other smaller patches leading up to this which are not in yet. Point being that I tried as hard as I could to break it down into something smaller, but this is the best I could do). Mostly looking for high level feedback. Even this is long, but it's better than reading a 5,000 line patch :)

Problem: A separation started to occur between reading interfaces and writing interfaces. E.g. PDBFile, DbiStream, InfoStream, etc were the reading interfaces, and PDBFileBuilder, MsfBuilder, InfoStreamBuilder, etc were the writing interfaces. But due to the coupling of the reading interfaces and the underlying file buffer, the only way to write changes back out to disk was to do so through the reading interface. So we had functions like PDBFile::commit(), and DbiStream::commit(), even though these were supposed to be the reading interfaces. The way you would use the api is to create a PdbFileBuilder, fill out the fields, call PdbFileBuilder::build() on it and that would return you a PDBFile. Then you would call commit() on it.

So the problem was to get these commit() methods over to the builders, so that the reading interfaces would be truly read only.

Solution: Until this patch, there was no clear distinction between interfaces which could read and interfaces which could write. Everything could read and everything could write. This causes a problem though, because your backing data is not always writable to begin with. For example, when reading a PDB from disk and dumping it, we read it into a MemoryBuffer, which is immutable. So we would have to use hacks like returning an llvm::Error if the backing store is not writable. It's sort of analagous to if everything was a MutableArrayRef<>, but methods returned an error if the underlying memory was const. Ideally you want the types to dictate what operations are possible.

With that in mind, I split all the stream based types into readable streams, writable streams, and read/write streams. This is supported by the following fundamental types:

/// An abstract base class which defines an interface for reading data from an arbitrary backing store which may or may not be contiguous.
class ReadableStream

/// An abstract base class which defines an interface for writing data to an arbitrary backing store which may or may not be contiguous.
class WritableStream

/// A stream which is backed by a source which is both readable and writable.
/// This is mostly equivalent to what was formerly known as `StreamInterface`.
class ReadWriteStream : public ReadableStream, public WritableStream

These are the basis for everything. From here there are concrete implementations that are backed by contiguous in-memory byte sequences (ByteStream and MutableByteStream) which are useful when you are building your own streams, or when you have a single buffer which contains code view type data, etc.

There are concrete implementations that are backed by discontiguous block streams in an Msf file (MappedBlockStream and WritableMappedBlockStream).

There are concrete implementations that are backed by disk files (FileBufferByteStream) that are useful when you want to commit to a disk.

But it's not always useful to just have an entire stream. Sometimes you just want part of a stream. For example, maybe starting at offset 364 in a big stream is a "substream". We want some way to get a smaller view over a larger stream. For this we have ReadableStreamRef, WritableStreamRef, and ReadWriteStreamRef. StreamRefs are to Streams as ArrayRefs are to arrays. They can be trivially copied and passed by value, and provide convenience methods for dropping items, slicing, etc. But they are backed by ReadableStream, WritableStream, and ReadWriteStream, which again means they can refer to any kind of stream, whether it be a ByteStream, a MappedBlockStream, or any other kind of stream.

Finally, we have StreamReader and StreamWriter. While the underlying stream only knows how to read and write byte sequences, StreamReader and StreamWriter know how to read *types*. integers, objects, strings, etc. They wrap ReadableStreamRef and WritableStreamRef respectively, and maintain an offset into the stream where the next read or write will occur at.

these abstractions largely mirror the abstractions of before this patch, with the exception that reading and writing responsibilities are enforced by the type system.

There is one breaking change to be aware of between this patch and the previous patch. Previously, MappedBlockStream (which supported both reading and writing at the time) handled the case where you read across a block boundary by allocating from an internal pool. And if you ever requested that offset again, it would return the value allocated from the pool. Since it also supported writing, it had to handle the case where you overwrote some of the values that were previously cached from the pool. Because otherwise, if someone were to try to read that value again, they would see it was cached and return the stale value from the cache. So MappedBlockStream was smart enough to deal with this.

But now, MappedBlockStream is read-only and WritableMappedBlockStream is write-only, and just like if you make 2 unique_ptrs to the same data, bad things will happen. The only use case I had found for this advanced cache coherency was in simplicity of writing tests, so it is now up to the user to make sure he/she coordinates reading and writing appropriately and explicitly clears the read cache after any write is performed on the same backing stream.

Diff Detail

Event Timeline

zturner updated this revision to Diff 65108.Jul 22 2016, 11:51 AM
zturner retitled this revision from to More strongly separate the PDB reading interfaces and PDB writing interfaces.
zturner updated this object.
zturner added reviewers: rnk, ruiu, amccarth.
zturner added a subscriber: llvm-commits.
zturner updated this revision to Diff 65144.Jul 22 2016, 1:59 PM

The breaking change I mentioned from before is no longer an issue. In addition, this patch of makes things a little bit more sane. Previously it was possible to have a write-only interface. Now all writable interfaces are also readable. This mimics the way MutableArrayRef / ArrayRef and other similar classes in LLVM work. Because of this, WritableMappedBlockStream can both read and write, and it now has to deal with the same issue that MappedBlockStream used to have to deal with. While it adds some that complexity back, it has been working and well tested, and removes the possibility of having strange errors pop up that are difficult to figure out. (Plus it makes tests cleaner)

amccarth edited edge metadata.Jul 25 2016, 12:01 PM

At a high level, this seems reasonable to me. At an API level, I find it surprising that the write interfaces are const.

include/llvm/DebugInfo/Msf/StreamInterface.h
45

It's surprising that the writeBytes method is const. I understand why this works, but as a user of the API, it's surprising.

lib/DebugInfo/Msf/MappedBlockStream.cpp
191

Is this correct? If Size is two blocks and BytesFromFirstBlock is less than BlockSize, then we'll get two additional blocks instead of one.

majnemer added inline comments.
lib/DebugInfo/Msf/MappedBlockStream.cpp
191

Perhaps:

uint32_t NumAdditionalBlocks = (llvm::alignTo(Size, BlockSize) / BlockSize) - 1;

?

zturner added inline comments.Jul 25 2016, 12:51 PM
lib/DebugInfo/Msf/MappedBlockStream.cpp
191

Are you sure? Here's a concrete example:

// Starting from the beginning of block 2, read 2 full blocks of data.
BlockSize = 4096
Offset = 8192
Size = 8192

  // BlockNum = 8192 / 4096 = 2;
  uint32_t BlockNum = Offset / BlockSize;

  // OffsetInBlock = 8192 % 4096 = 0;
  uint32_t OffsetInBlock = Offset % BlockSize;

  // BytesFromFirstBlock = min(8192, 4096 - 0) = 4096
  uint32_t BytesFromFirstBlock = std::min(Size, BlockSize - OffsetInBlock);

  // NumAdditionalBlocks = alignTo(8192 - 4096, 4096) / 4096 = 4096 / 4096 = 1
  uint32_t NumAdditionalBlocks =
      llvm::alignTo(Size - BytesFromFirstBlock, BlockSize) / BlockSize;
amccarth added inline comments.Jul 25 2016, 2:07 PM
lib/DebugInfo/Msf/MappedBlockStream.cpp
191

I didn't understand what BytesFromFirstBlock was measuring. Now that I do, I think the original code is correct.

Just to clarify, here's a diagram:

      BytesFromFirstBlock
--------=============----------------------------------------------------------------------------------------------------------
|       x            |                    |                    |                    |         y          |                    |
-------------------------------------------------------------------------------------------------------------------------------
                     |                     NumAdditionalBlocks                                           |

BytesFromFirstBlock is therefore equal to BlockSize - OffsetInBlock, and `NumAdditionalBlocks is however many bytes are left over, rounded up to the block size, divided by the block size.

ruiu edited edge metadata.Jul 25 2016, 3:05 PM

At a high level, it looks good to me, but as I probably mentioned earlier, I have a little concern that we are having an API for PDB that is too complete at this moment. The COFF linker is 7000 lines long in total, and if we keep this pace, we'll probably end up having a PDB library which is larger than the linker. Even if we'd eventually need it, I'm more comfortable if we do as-needed basis.

This revision was automatically updated to reflect the committed changes.
lib/DebugInfo/CodeView/ModuleSubstreamVisitor.cpp