This is an archive of the discontinued LLVM Phabricator instance.

Refactor bitcode reader to simplify control.
AbandonedPublic

Authored by kschimpf on Apr 1 2015, 2:32 PM.

Details

Summary

Modifies the bitcode reader such that the same logic is used for
both memory buffers and data streams. The incremental parsing
was factored into startParse, continueParse, and finishParse.
All parses (incremental or non-incremental) begin with startParse.
Then zero (or more) calls to continueParse incrementally read more
input, picking up from where the last call left off. finishParse
materializes any additional parts, based on the flags passed to startParse.

Diff Detail

Event Timeline

kschimpf updated this revision to Diff 23088.Apr 1 2015, 2:32 PM
kschimpf retitled this revision from to Refactor bitcode reader to simplify control..
kschimpf updated this object.
kschimpf edited the test plan for this revision. (Show Details)
kschimpf added reviewers: dschuff, jvoung, rafael.
kschimpf added a subscriber: Unknown Object (MLST).
jvoung edited edge metadata.Apr 1 2015, 4:49 PM

A couple nits for now, but still trying to read through and understand what's happening in the streaming and lazy cases...

lib/Bitcode/Reader/BitcodeReader.cpp
285–286
285–286

\returns

293

lowercase first letter of function name -- should probably do it for these new functions since you're touching it (but leave existing functions alone?)

404

lowercase first letter for new function (and I guess update the commit message if you do)

408

same

414

extra space in between updateParseState and (

416

In the review, I've tried to look for where ParseState gets set, and there are various ways to grep for that...

one is ParseState = X... another is updateParseState(Y, ...), or updateParseState(Z);

Is there any way you can make the number of variations smaller?

431

This is a bit weird to me... you have a bunch of function called "updateParseState" but some variants modify ParseState and some variants don't.

2795

NextUnreadBit is no longer set -- wanted to check if that is okay now (and why)?

2863

NOte -> Note

3297

I don't quite understand how "ShouldMaterializeAll = false" is supposed to work for the streaming case, if this isn't checked until after:

while (ParseState < NoMoreInput) {

  if (std::error_code EC = ContinueParse()) {
    return EC;
  }
}

How do you delay reading until materialize(GV) for streaming?

3301

This used to be early, in "getLazyBitcodeModuleImpl". This looks like it is now happening late in FinishParse after the loop until NoMoreInput.

Why is this okay now? Was the early call to "materializeForwardReferencedFunctions" actually extraneous because of the call in materialize(GV), or what?

Make sure to check the lazy case with blockaddresses for computed gotos, if there isn't already a unittest for that.

4636

no need for extra space

jvoung added inline comments.Apr 1 2015, 4:49 PM
lib/Bitcode/Reader/BitcodeReader.cpp
251

be consistent about capitalization in comments

one line starts with "parsed input" and another line starts with "Parsed input" =)

267–268

Don't need explicit anymore, though that transition was a while back so not really related to this CL.

294

Variable name is different from comment "MaterializeAll" vs "ShouldMaterializeAll" -- make them the same? I see that in the actual definition you're trying to avoid conflicting with the field name...

kschimpf updated this object.Apr 6 2015, 10:04 AM
kschimpf edited edge metadata.
kschimpf updated this revision to Diff 23431.Apr 8 2015, 11:28 AM

DId all changes except adding test that forward block address references get resolved on lazy loads.

lib/Bitcode/Reader/BitcodeReader.cpp
251

Done.

267–268

Done.

285–286

Sorry, my fault. I followed the syntax of ConstantPlaceHolder below. Changing to follow coding standards.

285–286

Done.

293

Good point. Fixing.

294

Fixing the names to be consistent. Fixing name conflict by prefixing assignments of field names with "this->".

404

Done.

408

Done.

416

I guess the first part of the problem is that there are 2 notions of parse state:

  1. The field ParseState that names the state of the parser, and
  2. NextUnreadBit which defines where to continue the parse on return (in case a function body gets parsed between calls).

I also overloaded the return value with this update. Refactoring to do less and be more clear.

431

The refactoring is a bit better now. Hopefully good enough.

2863

Done.

3297

After talking to Derek, I realized what was the issue I was missing. I'l summarize what I understand:

  1. When streaming, we want to "return" as soon as possible, without having to force all bitcode to be scanned. This reduces the cost of (potential) blocking calls to the data streamer.
  1. Control can return to the caller without having completed the parse. However, the parsed portions must be consistent (i.e. forward block address references have been resolved).

Based on this, I've modified the code to lift the materializeForwardReferencedFunctions into startParse.

3301

I agree that there should be some type of forward reference unit test to verify we can lazy evaluate these forward-referenced block addresses.

I also agree that for the use by llvm-dis, my original code worked because it eventually calls materializeAllPermanently. I think I may have been confused about the full expectation of "streamed" (or lazy) was because of this.

4636

Done.

rafael added inline comments.Apr 8 2015, 5:27 PM
include/llvm/Support/StreamingMemoryObject.h
75 ↗(On Diff #23431)

Why do you need this? The streamer will return how many bytes were read and can handle a larger request.

Also, why does it need to be part of this patch? It looks like this patch has many independent changes in it.

88 ↗(On Diff #23431)

Why the extra logic? If objectsize is known it is the same as BytesRead, no?

kschimpf added inline comments.Apr 9 2015, 9:41 AM
include/llvm/Support/StreamingMemoryObject.h
75 ↗(On Diff #23431)

This was added to handle the case of when one is parsing a wrapped bitcode file. In such cases, you do not need to do another read (which may block until it succeeds). That was the intent of this change. However, a simpler approach would be to allow the extra read, and then not set ObjectSize (below) if already set.

I will remove this change, and add the conditional assignment to ObjectSize below.

I changed it in this CL because it didn't cause a problem until I fixed that materializing a module (when streaming) didn't actually read all of the bitcode file. When that change was added, tests failed and this issue was exposed.

I will remove the changes StreamingMemoryObject.{h,cpp} and put in a separate CL.

88 ↗(On Diff #23431)

No, they aren't necessarily the same.

The problem happens when you have a wrapped bitcode file, and was not exposed until I fixed the case that we weren't reading the entire bitcode when materializing lazily. Then, a bunch of test cases failed.

When I looked into it, this is what I discovered:

  1. The wrapped bitcode was smaller than kChunkSize. Hence, the initial read set BytesRead to the size of the wrapped file on first read.
  1. The wrapper was then read, and set ObjectSize, which corresponded to 4 bytes smaller than BytesRead.

This is the reason I changed this file as I did.

jvoung added inline comments.Apr 9 2015, 2:33 PM
lib/Bitcode/Reader/BitcodeReader.cpp
259

nit: This is usually 0 or 1, but it seems unexpected for this field to be named "NumModulesParsed", and yet have the type be "bool". Rename or change type?

431

Thanks -- this is better.

For a while I was also wondering how many places need to be aware of setting the state to ParseError, but I think it's just continueParse() because most/all searching for bit position, etc. goes through that.

3178

Does this need to be cleanupOnError(EC) also?

3216

This covers two states? InsideModule and AtTopLevel? It might be more clear if you list them out so it's clear what the "break" corresponds to (AtTopLevel).

Previously, the Stream.JumpToBit(NextUnreadBit); was only needed when InsideModule... is it now needed for AtTopLevel too?

3312

"NoMorInput" -> "NoMoreInput"

It could also be that more states >= NoMoreInput are added as code evolves, but not handled here. Can the compiler accept/handle a "static_assert(ParseState < NoMoreInput, "...") to catch what happens if more states are added after NoMoreInput but not handled by this switch?

4684

Is this necessary at this point? Should that already be covered by the " // Iterate over the module, deserializing any functions that are still on disk" loop?

4684–4685

The "promise" comment from "above" is removed now, so you could update this comment.

kschimpf updated this revision to Diff 23535.Apr 9 2015, 3:35 PM

Fix issues raised by jvoung, and remove changes now in D8907.

kschimpf added inline comments.
lib/Bitcode/Reader/BitcodeReader.cpp
259

Good catch. I did meant to use size_t. Fixing.

431

That is correct and was the intent. State updates (and bit positioning) is intentionally now localized to continueParse.

The only exception is in ParseModule, which updates the state to state whether it returned without completing.

3178

Yes. Good catch. Fixing.

3216

The jumpToBit is needed because various "materialize" methods may be called between calls to continueParse. By forcing a jumpToBit to happen at all calls to continueParse, we no longer need to know where the materialize methods leave the bitcursor.

While I did not see an example of an error caused by interleaved calls to materialize, I was very suspicious that they could occur, and wanted to make sure that this would not happen. Hence, I made sure that continueParse always resets the position to where it left off.

I will fix to not use default, so that a corresponding warning will be generated if a new value is added to the enumeration.

3312

Fixed string. Also removed "default" case and made all states explicit. This will force a warning if a new state is added.

4671

Removing the comment about being after a function body. This is no longer true. A call to materializeMetadata would put us some place else in the bitcode file.

4684

In correct bitcode files, you are right. However, if the function doesn't define any function blocks, but (incorrectly) references function block addresses, this code will cause the error to be generated.

However, looking at the following instruction, this is checked anyway. Removing.

4684–4685

Done.

kschimpf updated this revision to Diff 23684.Apr 13 2015, 10:35 AM

Fix issues in diff 23431.

kschimpf updated this revision to Diff 26641.May 27 2015, 3:29 PM
  • Merge branch 'master' into readfac1
  • Fix issues raised by merge.
  • Merge branch 'master' of http://llvm.org/git/llvm into readfac1
  • Fix tests to use old-style parser.
kschimpf updated this revision to Diff 26643.May 27 2015, 3:42 PM
  • Fix nits.

Now that the issues with the streaming memory object has been fixed, I have updated this CL for review.

Note that I added a CL flag "-old-lazy-bitcode-parser". This was done to deal with a bug fixed by this CL. That is, in the old code, when you materialized a module, it didn't check if there was any additional data in the bitcode file. The new code fixes this by calling "finishParse". However, there are a couple of (bitcode binary) tests that were generated with this violation. Hence, the flag was added to fix this problem.

I'm willing to remove this flag in either (1) a later review, or (2) in a later revision. However, for this review I made the issues explicit so that the problem can be seen.

filcab added a subscriber: filcab.May 27 2015, 8:53 PM

However, there are a couple of (bitcode binary) tests that were generated with this violation. Hence, the flag was added to fix this problem.

I'm willing to remove this flag in either (1) a later review, or (2) in a later revision. However, for this review I made the issues explicit so that the problem can be seen.

Let me know which ones have bugs. They might be easy-ish to reconstruct (especially since I have some additional practice in fiddling with bc files, now (Can't promise to deal with them very quickly, though).

lib/Bitcode/Reader/BitcodeReader.cpp
289

Why the empty line?

301

http://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments
^^ This changed recently.

Omit \brief if the brief description is just a sentence. (You might want to add the '.' at the end, though)

789

Omit \brief.

803

Omit \brief.

3254

We're already at top level, no? (line 3234)
I might be missing something, but it looks like we're at top level, and saw an EndBlock. Shouldn't this be an error?

kschimpf updated this revision to Diff 26745.May 28 2015, 3:06 PM

Fixes based on feedback by Filipe.

lib/Bitcode/Reader/BitcodeReader.cpp
289

Removed.

789

Done.

803

Done.

3254

I agree that we should be at top level.

I also agree that it appears weird that we allow extra (unmathced) EndBlocks. This has been allowed by the bitcode reader/writer for years. I just wasn't willing to make the leap that I should remove this.

However, I tried removing it (and making it an error), and no tests failed. Hence, Converting this to an error.

4676

Moved the iterating of functions to before the call to finishParse. This deals with the problems I was having with tests in test/Bitcode/invalid.test (i.e. Inputs/invalid-fwdref-type-mismatch-2.bc and Inputs/invalid-load-ptr-type.bc). These two files had multiple errors (the one they intended which was inside a function body, and the one probably not intended - extraneous stuff at the end of the bitcode file).

This removes the need for the command line flag UseOldLazyBitcodeParse, and I have deleted it.

filcab edited edge metadata.EditedMay 28 2015, 3:58 PM

Hi Karl,

The files came straight from the fuzzer, so it is likely have more than one error. If, in order to support them (where support is: keep the test working and diagnosing what we want), you have to change the code in a convoluted way, I would prefer to change the test.

If the change is minimal and not a problem (doesn't impact legibility or architecture), then keeping the tests as they are is not a problem either. I just want to avoid having worse code just so we don't have to re-do some tests.

Of course, if we started crashing on the tests, that's a problem :-)

Thanks,

Filipe

(Phab butchered the comment in my email. Editing it to get a complete history in Phabricator)

kschimpf updated this revision to Diff 26923.Jun 1 2015, 1:39 PM
kschimpf edited edge metadata.
  • Fix invalid bitcode tests with more than one error.
  • Merge branch 'master' into readfac1

In addition to moving the code back in materializeModule, I fixed three tests. Two of them were fuzz tests where I "truncated" the file to the end of the module block. The other test did not have anything after a bad abbreviation definition, and the code file was incomplete. So I generated a replacement test that was well structured otherwise (i.e. only had that one error in it).

include/llvm/Bitcode/BitstreamReader.h
328 ↗(On Diff #26923)

This code fixes the state of the bit streamer when no more input is found. As a result, method AtEndOfStream now works correctly.

lib/Bitcode/Reader/BitcodeReader.cpp
3253–3255

Discovered that the Bitstream::EndBLock was "hiding" a bug int the bitstream reader when processing a data stream. That is, when using a data stream, the size is not set until after the eof is reached. Hence, when Stream.AtEndOfStream() was called above, it would return false even when at the eof. The actual problem was in FillCurWord, which did not set the bit position correctly when there was no more input.

The old code worked because the read (at eof) would return zero, and is understood as an end block. By returning success for this value, it would hide this problem.

I also improved the error message so that once can see where the reader thought the eof should be, if there is miscellaneous stuff at the end of the bit code file. This makes it easier to know where to cut a test file in such cases.

4677

Now that the eof checking is fixed, I moved this back where it was in an earlier version of this CL.

Hi Karl,

Really sorry for the delay.
LGTM on my part, as long as you add the test for the error message and do the fix.

Thank you,

Filipe

lib/Bitcode/Reader/BitcodeReader.cpp
790

Nit: Put more words on the first line.

794

Nit: If it's for docs, it's probably best to start with an uppercase letter.

3258

Thank you!

3262

errs()? Or StrBuf?
Please also add a test for this error message.

kschimpf updated this revision to Diff 27681.Jun 15 2015, 9:15 AM

Fixed issues raised by Filipe.

lib/Bitcode/Reader/BitcodeReader.cpp
790

Done.

794

Done.

3262

Good catch. I meant StrBuf, so that we can use the same API for all errors.

rafael edited edge metadata.Jun 15 2015, 12:55 PM

Applying the patch locally to take a better look.

lib/Bitcode/Reader/BitcodeReader.cpp
286–288

Please commit the pure cleanup bits first:

using \ instead of @
starting functions with lowercase names.

3263

This looks a bit much to be honest. Corrupted files are not that common and it is trivial to set a breakpoint to find the state.

3265

This is always 0 or 1. Use a boolean instead.

I got the following test failures locally:

LLVM :: Bitcode/invalid.test
LLVM :: tools/gold/invalid.ll

This CL has gotten a bit long, and hard to read (to many versions). Moved to a new CL http://reviews.llvm.org/D10518

lib/Bitcode/Reader/BitcodeReader.cpp
286–288

I assume this has already been done. The new CL doesn't have suc cases anymore.

3263

Simplified in new CL to do same as before.

3265

This was already fixed in master.

kschimpf abandoned this revision.Jun 17 2015, 1:59 PM

This CL has gotten a bit long, and hard to read (to many versions). Moved to a new CL http://reviews.llvm.org/D10518