This is an archive of the discontinued LLVM Phabricator instance.

BitcodeReader: Require clients to read the block info block at most once.
ClosedPublic

Authored by pcc on Oct 26 2016, 4:24 PM.

Details

Summary

This change makes it the client's responsibility to call ReadBlockInfoBlock()
at most once. This is in preparation for a future change that will allow
there to be multiple block info blocks.

See also: http://lists.llvm.org/pipermail/llvm-dev/2016-October/106512.html

Diff Detail

Repository
rL LLVM

Event Timeline

pcc updated this revision to Diff 75962.Oct 26 2016, 4:24 PM
pcc retitled this revision from to BitcodeReader: Require clients to read the block info block at most once..
pcc updated this object.
pcc added a reviewer: mehdi_amini.
pcc added subscribers: llvm-commits, jordan_rose.

This seems a little unfortunate, because it means that you can't just serialize two full blobs of data one after the other and reuse the same cursor on both of them.

To elaborate a bit, the bitstream format is used to hold things other than LLVM bitcode (we currently use it for Swift "module" interface files—thanks for CCing me), and changing the requirements of that format seems unfortunate. (I do understand that it was never promised to be a stable container format.)

mehdi_amini edited edge metadata.Oct 27 2016, 1:07 PM

@jordan_rose: wouldn't there be a bug right now if you reuse the cursor and your second "blob" has a block info that will be skipped because the first blob was already processed?

I meant homogeneous data. With heterogeneous data you just couldn't reuse the cursor either way.

ReadBlockInfoBlock is only called if we encounter a second "block info". There is a single one per LLVM module. If you don't emit an LLVM module, why would this be an issue?

Here is the description from the header:

/// BLOCKINFO_BLOCK is used to define metadata about blocks, for example,
/// standard abbrevs that should be available to all blocks of a specified
/// ID.
BLOCKINFO_BLOCK_ID = 0,

BLOCKINFO is part of the container format, not part of an LLVM module. I'm objecting to the idea that it's better to write one BLOCKINFO for several "data blobs" (LLVM module, Swift module, whatever) than to just stick the BLOCKINFO in each data blob and be done with it. Right now we have a completely concatenative format, as long as you separate out the fixed-size magic number at the start; this would break that.

I don't feel too strongly about this, but I do think it's an important point to understand.

BLOCKINFO is part of the container format, not part of an LLVM module.

OK, I was looking at how we use it today in LLVM.

I'm objecting to the idea that it's better to write one BLOCKINFO for several "data blobs" (LLVM module, Swift module, whatever) than to just stick the BLOCKINFO in each data blob and be done with it. Right now we have a completely concatenative format, as long as you separate out the fixed-size magic number at the start; this would break that.

Can you clarify how this concatenation works? I'm puzzled because my impression is that today if you try to stick a blockinfo in each blob, it will *ignore* the second one. This how I read the code modified in this patch:

// If this is the second stream to get to the block info block, skip it.
// We expect the client to read the block info block at most once.
assert(!getBitStreamReader()->hasBlockInfoRecords());

calling:

/// Return true if we've already read and processed the block info block for
/// this Bitstream. We only process it for the first cursor that walks over
/// it.
bool hasBlockInfoRecords() const { return !BlockInfoRecords.empty(); }

Maybe you are you saying that we should support this use case? (I may have mis-read your objection from the beginning as something that this patch would break).

pcc added a comment.Oct 27 2016, 1:37 PM

Right now we have a completely concatenative format

At least in LLVM modules this isn't the case because VSTOFFSET is relative to the start of the file.

Anyway, I expect that there is more value in allowing bitcode clients to use multiple block info blocks like this than in allowing external programs to concatenate. The latter would seem to complicate the implementation of reader clients because you would now potentially have multiple conflicting interpretations of abbreviations.

That's how I read it too, but that's reasonable behavior if you have several data blobs of the same kind. Your file can be { MAGIC_NUMBER, { BLOCKINFO, DATA, DATA }, { BLOCKINFO, DATA, DATA }, …} instead of { MAGIC_NUMBER, BLOCKINFO, { DATA, DATA }, { DATA, DATA}, …}. Why is the former better for concatenation than the latter? Because you don't need to know the size of the BLOCKINFO.

It's not a huge advantage, but someone could certainly be making use of it, and this will break them. (I don't think we're currently using it but I could check.)

Again, if the two BLOCKINFO structures are different then you can't reuse the cursor today, so this is effectively no change there. I guess it helps catch reuse mistakes.

That's how I read it too, but that's reasonable behavior if you have several data blobs of the same kind. Your file can be { MAGIC_NUMBER, { BLOCKINFO, DATA, DATA }, { BLOCKINFO, DATA, DATA }, …} instead of { MAGIC_NUMBER, BLOCKINFO, { DATA, DATA }, { DATA, DATA}, …}. Why is the former better for concatenation than the latter? Because you don't need to know the size of the BLOCKINFO.

It's not a huge advantage, but someone could certainly be making use of it, and this will break them. (I don't think we're currently using it but I could check.)

If someone would rely on it, I'd rather have a mechanism to validate that the second BLOCKINFO is identical. This seems too much error prone to just ignore it.

Again, if the two BLOCKINFO structures are different then you can't reuse the cursor today, so this is effectively no change there. I guess it helps catch reuse mistakes.

Right!

llvm/lib/Bitcode/Reader/BitstreamReader.cpp
323 ↗(On Diff #75962)

Can we have an error instead of an assert?

pcc updated this revision to Diff 76103.Oct 27 2016, 2:41 PM
pcc marked an inline comment as done.
pcc edited edge metadata.

Error out instead of asserting

mehdi_amini accepted this revision.Oct 27 2016, 2:47 PM
mehdi_amini edited edge metadata.

LGTM, thanks.

This revision is now accepted and ready to land.Oct 27 2016, 2:47 PM
This revision was automatically updated to reflect the committed changes.