This is an archive of the discontinued LLVM Phabricator instance.

BitcodeReader: Require clients to read the block info block at most once.
Closed, Public

Authored by pcc on Oct 26 2016, 4:24 PM.

Details

Reviewers

Commits

rGfc0a99bfda0b: BitcodeReader: Require clients to read the block info block at most once.
rL285350: BitcodeReader: Require clients to read the block info block at most once.

Summary

This change makes it the client's responsibility to call ReadBlockInfoBlock()
at most once. This is in preparation for a future change that will allow
there to be multiple block info blocks.
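
As a rough illustration of the contract described in the summary, here is a minimal sketch of a client that reads the block info block at most once. `ToyCursor`, `walkBlocks`, and `ToyBlockInfoID` are hypothetical names for this example only; they model the idea and are not the LLVM classes or the code in this patch. The sketch assumes the client chooses to skip any later block info block, which is only one possible policy.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the block ID; the header quoted in this review
// defines BLOCKINFO_BLOCK_ID = 0.
constexpr unsigned ToyBlockInfoID = 0;

// Toy cursor (an assumption for illustration, not BitstreamCursor).
struct ToyCursor {
  bool HasBlockInfo = false; // set once the block info block has been read
  unsigned Skipped = 0;      // number of blocks skipped so far

  // Reading twice is a contract violation and is reported as an error.
  void readBlockInfoBlock() {
    if (HasBlockInfo)
      throw std::logic_error("block info block read more than once");
    HasBlockInfo = true;
  }

  void skipBlock() { ++Skipped; }
};

// Client-side walk: the client, not the cursor, decides whether to read or
// skip a block info block.
void walkBlocks(ToyCursor &C, const std::vector<unsigned> &BlockIDs) {
  for (unsigned ID : BlockIDs) {
    if (ID == ToyBlockInfoID && !C.HasBlockInfo)
      C.readBlockInfoBlock();
    else
      C.skipBlock();
  }
}
```

With block IDs `{0, 8, 0, 9}`, this client reads the first block info block and skips the other three blocks, so the cursor never sees a second read.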

Diff Detail

Build Status

Buildable 809
Build 809: arc lint + arc unit

Event Timeline

pcc updated this revision to Diff 75962. Oct 26 2016, 4:24 PM

pcc retitled this revision to "BitcodeReader: Require clients to read the block info block at most once."

pcc updated this object.

pcc added a reviewer: mehdi_amini.

pcc added subscribers: llvm-commits, jordan_rose.

This seems a little unfortunate, because it means that you can't just serialize two full blobs of data one after the other and reuse the same cursor on both of them.

To elaborate a bit, the bitstream format is used to hold things other than LLVM bitcode (we currently use it for Swift "module" interface files—thanks for CCing me), and changing the requirements of that format seems unfortunate. (I do understand that it was never promised to be a stable container format.)

@jordan_rose: wouldn't there be a bug right now if you reuse the cursor and your second "blob" has a block info that will be skipped because the first blob was already processed?

I meant homogeneous data. With heterogeneous data you just couldn't reuse the cursor either way.

ReadBlockInfoBlock is only called if we encounter a second "block info". There is a single one per LLVM module. If you don't emit an LLVM module, why would this be an issue?

Here is the description from the header:

/// BLOCKINFO_BLOCK is used to define metadata about blocks, for example,
/// standard abbrevs that should be available to all blocks of a specified
/// ID.
BLOCKINFO_BLOCK_ID = 0,

BLOCKINFO is part of the container format, not part of an LLVM module. I'm objecting to the idea that it's better to write one BLOCKINFO for several "data blobs" (LLVM module, Swift module, whatever) than to just stick the BLOCKINFO in each data blob and be done with it. Right now we have a completely concatenative format, as long as you separate out the fixed-size magic number at the start; this would break that.

I don't feel too strongly about this, but I do think it's an important point to understand.

In D26016#581319, @jordan_rose wrote:

BLOCKINFO is part of the container format, not part of an LLVM module.

OK, I was looking at how we use it today in LLVM.

I'm objecting to the idea that it's better to write one BLOCKINFO for several "data blobs" (LLVM module, Swift module, whatever) than to just stick the BLOCKINFO in each data blob and be done with it. Right now we have a completely concatenative format, as long as you separate out the fixed-size magic number at the start; this would break that.

Can you clarify how this concatenation works? I'm puzzled because my impression is that today if you try to stick a blockinfo in each blob, it will *ignore* the second one. This is how I read the code modified in this patch:

// If this is the second stream to get to the block info block, skip it.
// We expect the client to read the block info block at most once.
assert(!getBitStreamReader()->hasBlockInfoRecords());

calling:

/// Return true if we've already read and processed the block info block for
/// this Bitstream. We only process it for the first cursor that walks over
/// it.
bool hasBlockInfoRecords() const { return !BlockInfoRecords.empty(); }

Maybe you are saying that we should support this use case? (I may have misread your objection from the beginning as something that this patch would break.)

Right now we have a completely concatenative format

At least in LLVM modules this isn't the case because VSTOFFSET is relative to the start of the file.

Anyway, I expect that there is more value in allowing bitcode clients to use multiple block info blocks like this than in allowing external programs to concatenate. The latter would seem to complicate the implementation of reader clients because you would now potentially have multiple conflicting interpretations of abbreviations.
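
To illustrate the VSTOFFSET point above, here is a toy sketch of why an offset measured from the start of the file stops being valid once a blob is appended after another one. `ToyBlob` and `resolve` are hypothetical names, and the byte-offset model is an assumption for illustration; it does not reproduce the actual VSTOFFSET encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical blob: raw bytes plus a stored offset to the marker 'V',
// measured from the start of the file the blob was originally written to.
struct ToyBlob {
  std::string Bytes;
  std::size_t AbsoluteOffset;
};

// Look up the byte that a stored offset points at within a file.
char resolve(const std::string &File, std::size_t StoredOffset) {
  return File.at(StoredOffset);
}
```

If blob `A` is `{"aaV", 2}` and blob `B` is `{"bbbbV", 4}`, then in the concatenation `A.Bytes + B.Bytes` the offset stored in `B` no longer lands on `'V'`; it would have to be rebased by `A.Bytes.size()` first.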

That's how I read it too, but that's reasonable behavior if you have several data blobs of the same kind. Your file can be { MAGIC_NUMBER, { BLOCKINFO, DATA, DATA }, { BLOCKINFO, DATA, DATA }, …} instead of { MAGIC_NUMBER, BLOCKINFO, { DATA, DATA }, { DATA, DATA}, …}. Why is the former better for concatenation than the latter? Because you don't need to know the size of the BLOCKINFO.

It's not a huge advantage, but someone could certainly be making use of it, and this will break them. (I don't think we're currently using it but I could check.)

Again, if the two BLOCKINFO structures are different then you can't reuse the cursor today, so this is effectively no change there. I guess it helps catch reuse mistakes.
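
The difference between the two layouts above can be sketched with a toy byte-string model. `MagicSize`, `concatPerBlobInfo`, and `concatSharedInfo` are hypothetical names chosen for this example, and the 4-byte magic is an assumption; nothing here is part of the bitstream reader API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed size of the fixed-size magic number at the start of each file.
constexpr std::size_t MagicSize = 4;

// Layout { MAGIC, { BLOCKINFO, DATA, DATA }, { BLOCKINFO, DATA, DATA } }:
// appending a second file only requires dropping its magic number.
std::string concatPerBlobInfo(const std::string &A, const std::string &B) {
  return A + B.substr(MagicSize);
}

// Layout { MAGIC, BLOCKINFO, { DATA, DATA }, { DATA, DATA } }:
// the second file's BLOCKINFO must be dropped as well, so its size has to
// be known to the tool doing the concatenation.
std::string concatSharedInfo(const std::string &A, const std::string &B,
                             std::size_t BlockInfoSize) {
  return A + B.substr(MagicSize + BlockInfoSize);
}
```

In the first layout the concatenating tool never needs to parse anything past the magic number; in the second it has to locate the end of the BLOCKINFO block.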

In D26016#581335, @jordan_rose wrote:

That's how I read it too, but that's reasonable behavior if you have several data blobs of the same kind. Your file can be { MAGIC_NUMBER, { BLOCKINFO, DATA, DATA }, { BLOCKINFO, DATA, DATA }, …} instead of { MAGIC_NUMBER, BLOCKINFO, { DATA, DATA }, { DATA, DATA}, …}. Why is the former better for concatenation than the latter? Because you don't need to know the size of the BLOCKINFO.

It's not a huge advantage, but someone could certainly be making use of it, and this will break them. (I don't think we're currently using it but I could check.)

If someone relies on it, I'd rather have a mechanism to validate that the second BLOCKINFO is identical. It seems too error-prone to just ignore it.

Again, if the two BLOCKINFO structures are different then you can't reuse the cursor today, so this is effectively no change there. I guess it helps catch reuse mistakes.

Right!
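
The validation idea raised above (checking that a repeated BLOCKINFO is identical instead of ignoring it) could look roughly like the following. This is only a sketch of the suggestion, not something this patch implements; `BlockInfoMap`, `BlockInfoResult`, and `acceptBlockInfo` are hypothetical names, and abbreviations are reduced to plain integers.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model: block ID -> abbreviation IDs registered for that block.
using BlockInfoMap = std::map<unsigned, std::vector<unsigned>>;

enum class BlockInfoResult { Parsed, IdenticalDuplicate, Conflict };

// Accept the first BLOCKINFO; for later ones, report whether they match
// what was already recorded instead of silently skipping them.
BlockInfoResult acceptBlockInfo(BlockInfoMap &Seen,
                                const BlockInfoMap &Incoming) {
  if (Seen.empty()) {
    Seen = Incoming;
    return BlockInfoResult::Parsed;
  }
  return Seen == Incoming ? BlockInfoResult::IdenticalDuplicate
                          : BlockInfoResult::Conflict;
}
```

A reader built this way could keep reusing the cursor on `IdenticalDuplicate` and report an error on `Conflict`.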

llvm/lib/Bitcode/Reader/BitstreamReader.cpp
323	Can we have an error instead of an assert?

Error out instead of asserting

LGTM, thanks.

This revision is now accepted and ready to land. Oct 27 2016, 2:47 PM

Closed by commit rL285350: BitcodeReader: Require clients to read the block info block at most once. (authored by pcc). Oct 27 2016, 2:48 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: llvm/lib/Bitcode/Reader/BitstreamReader.cpp
Size: 5 lines

Diff 75962

llvm/lib/Bitcode/Reader/BitstreamReader.cpp

(313 preceding lines not shown) void BitstreamCursor::ReadAbbrevRecord() {
   }
 
   if (Abbv->getNumOperandInfos() == 0)
     report_fatal_error("Abbrev record with no operands");
   CurAbbrevs.push_back(Abbv);
 }
 
 bool BitstreamCursor::ReadBlockInfoBlock() {
-  // If this is the second stream to get to the block info block, skip it.
-  if (getBitStreamReader()->hasBlockInfoRecords())
-    return SkipBlock();
+  // We expect the client to read the block info block at most once.
+  assert(!getBitStreamReader()->hasBlockInfoRecords());
 
   if (EnterSubBlock(bitc::BLOCKINFO_BLOCK_ID)) return true;
 
   SmallVector<uint64_t, 64> Record;
   BitstreamReader::BlockInfo *CurBlockInfo = nullptr;
 
   // Read all the records for this module.
   while (true) {
(58 following lines not shown)

Inline comment on the assert line (marked Done):
mehdi_amini: Can we have an error instead of an assert?