This is an archive of the discontinued LLVM Phabricator instance.

Refactor bitcode reader to simplify control.
AbandonedPublic

Authored by kschimpf on Apr 1 2015, 2:32 PM.

Details

Summary

Modifies the bitcode reader such that the same logic is used for
both memory buffers and data streams. The incremental parsing
was factored into startParse, continueParse, and finishParse.
All parses (incremental or non-incremental) begin with startParse.
Then zero (or more) calls to continueParse incrementally read more
input, picking up from where the last call left off. finishParse
materializes any additional parts, based on the flags passed to startParse.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
jvoung added inline comments.Apr 1 2015, 4:49 PM
lib/Bitcode/Reader/BitcodeReader.cpp
234

be consistent about capitalization in comments

one line starts with "parsed input" and another line starts with "Parsed input" =)

250–251

Don't need explicit anymore, though that transition was a while back so not really related to this CL.

277

Variable name is different from comment "MaterializeAll" vs "ShouldMaterializeAll" -- make them the same? I see that in the actual definition you're trying to avoid conflicting with the field name...

kschimpf updated this object.Apr 6 2015, 10:04 AM
kschimpf edited edge metadata.
kschimpf updated this revision to Diff 23431.Apr 8 2015, 11:28 AM

DId all changes except adding test that forward block address references get resolved on lazy loads.

lib/Bitcode/Reader/BitcodeReader.cpp
234

Done.

250–251

Done.

268–269

Sorry, my fault. I followed the syntax of ConstantPlaceHolder below. Changing to follow coding standards.

268–269

Done.

276

Good point. Fixing.

277

Fixing the names to be consistent. Fixing name conflict by prefixing assignments of field names with "this->".

386

Done.

390

Done.

398

I guess the first part of the problem is that there are 2 notions of parse state:

  1. The field ParseState that names the state of the parser, and
  2. NextUnreadBit which defines where to continue the parse on return (in case a function body gets parsed between calls).

I also overloaded the return value with this update. Refactoring to do less and be more clear.

413

The refactoring is a bit better now. Hopefully good enough.

2845

Done.

3277

After talking to Derek, I realized what was the issue I was missing. I'l summarize what I understand:

  1. When streaming, we want to "return" as soon as possible, without having to force all bitcode to be scanned. This reduces the cost of (potential) blocking calls to the data streamer.
  1. Control can return to the caller without having completed the parse. However, the parsed portions must be consistent (i.e. forward block address references have been resolved).

Based on this, I've modified the code to lift the materializeForwardReferencedFunctions into startParse.

3281

I agree that there should be some type of forward reference unit test to verify we can lazy evaluate these forward-referenced block addresses.

I also agree that for the use by llvm-dis, my original code worked because it eventually calls materializeAllPermanently. I think I may have been confused about the full expectation of "streamed" (or lazy) was because of this.

4616

Done.

rafael added inline comments.Apr 8 2015, 5:27 PM
include/llvm/Support/StreamingMemoryObject.h
75 ↗(On Diff #23431)

Why do you need this? The streamer will return how many bytes were read and can handle a larger request.

Also, why does it need to be part of this patch? It looks like this patch has many independent changes in it.

88 ↗(On Diff #23431)

Why the extra logic? If objectsize is known it is the same as BytesRead, no?

kschimpf added inline comments.Apr 9 2015, 9:41 AM
include/llvm/Support/StreamingMemoryObject.h
75 ↗(On Diff #23431)

This was added to handle the case of when one is parsing a wrapped bitcode file. In such cases, you do not need to do another read (which may block until it succeeds). That was the intent of this change. However, a simpler approach would be to allow the extra read, and then not set ObjectSize (below) if already set.

I will remove this change, and add the conditional assignment to ObjectSize below.

I changed it in this CL because it didn't cause a problem until I fixed that materializing a module (when streaming) didn't actually read all of the bitcode file. When that change was added, tests failed and this issue was exposed.

I will remove the changes StreamingMemoryObject.{h,cpp} and put in a separate CL.

88 ↗(On Diff #23431)

No, they aren't necessarily the same.

The problem happens when you have a wrapped bitcode file, and was not exposed until I fixed the case that we weren't reading the entire bitcode when materializing lazily. Then, a bunch of test cases failed.

When I looked into it, this is what I discovered:

  1. The wrapped bitcode was smaller than kChunkSize. Hence, the initial read set BytesRead to the size of the wrapped file on first read.
  1. The wrapper was then read, and set ObjectSize, which corresponded to 4 bytes smaller than BytesRead.

This is the reason I changed this file as I did.

jvoung added inline comments.Apr 9 2015, 2:33 PM
lib/Bitcode/Reader/BitcodeReader.cpp
242

nit: This is usually 0 or 1, but it seems unexpected for this field to be named "NumModulesParsed", and yet have the type be "bool". Rename or change type?

413

Thanks -- this is better.

For a while I was also wondering how many places need to be aware of setting the state to ParseError, but I think it's just continueParse() because most/all searching for bit position, etc. goes through that.

3159

Does this need to be cleanupOnError(EC) also?

3197

This covers two states? InsideModule and AtTopLevel? It might be more clear if you list them out so it's clear what the "break" corresponds to (AtTopLevel).

Previously, the Stream.JumpToBit(NextUnreadBit); was only needed when InsideModule... is it now needed for AtTopLevel too?

3292

"NoMorInput" -> "NoMoreInput"

It could also be that more states >= NoMoreInput are added as code evolves, but not handled here. Can the compiler accept/handle a "static_assert(ParseState < NoMoreInput, "...") to catch what happens if more states are added after NoMoreInput but not handled by this switch?

4664

Is this necessary at this point? Should that already be covered by the " // Iterate over the module, deserializing any functions that are still on disk" loop?

4664–4665

The "promise" comment from "above" is removed now, so you could update this comment.

kschimpf updated this revision to Diff 23535.Apr 9 2015, 3:35 PM

Fix issues raised by jvoung, and remove changes now in D8907.

kschimpf added inline comments.
lib/Bitcode/Reader/BitcodeReader.cpp
242

Good catch. I did meant to use size_t. Fixing.

413

That is correct and was the intent. State updates (and bit positioning) is intentionally now localized to continueParse.

The only exception is in ParseModule, which updates the state to state whether it returned without completing.

3159

Yes. Good catch. Fixing.

3197

The jumpToBit is needed because various "materialize" methods may be called between calls to continueParse. By forcing a jumpToBit to happen at all calls to continueParse, we no longer need to know where the materialize methods leave the bitcursor.

While I did not see an example of an error caused by interleaved calls to materialize, I was very suspicious that they could occur, and wanted to make sure that this would not happen. Hence, I made sure that continueParse always resets the position to where it left off.

I will fix to not use default, so that a corresponding warning will be generated if a new value is added to the enumeration.

3292

Fixed string. Also removed "default" case and made all states explicit. This will force a warning if a new state is added.

4651

Removing the comment about being after a function body. This is no longer true. A call to materializeMetadata would put us some place else in the bitcode file.

4664

In correct bitcode files, you are right. However, if the function doesn't define any function blocks, but (incorrectly) references function block addresses, this code will cause the error to be generated.

However, looking at the following instruction, this is checked anyway. Removing.

4664–4665

Done.

kschimpf updated this revision to Diff 23684.Apr 13 2015, 10:35 AM

Fix issues in diff 23431.

kschimpf updated this revision to Diff 26641.May 27 2015, 3:29 PM
  • Merge branch 'master' into readfac1
  • Fix issues raised by merge.
  • Merge branch 'master' of http://llvm.org/git/llvm into readfac1
  • Fix tests to use old-style parser.
kschimpf updated this revision to Diff 26643.May 27 2015, 3:42 PM
  • Fix nits.

Now that the issues with the streaming memory object has been fixed, I have updated this CL for review.

Note that I added a CL flag "-old-lazy-bitcode-parser". This was done to deal with a bug fixed by this CL. That is, in the old code, when you materialized a module, it didn't check if there was any additional data in the bitcode file. The new code fixes this by calling "finishParse". However, there are a couple of (bitcode binary) tests that were generated with this violation. Hence, the flag was added to fix this problem.

I'm willing to remove this flag in either (1) a later review, or (2) in a later revision. However, for this review I made the issues explicit so that the problem can be seen.

filcab added a subscriber: filcab.May 27 2015, 8:53 PM

However, there are a couple of (bitcode binary) tests that were generated with this violation. Hence, the flag was added to fix this problem.

I'm willing to remove this flag in either (1) a later review, or (2) in a later revision. However, for this review I made the issues explicit so that the problem can be seen.

Let me know which ones have bugs. They might be easy-ish to reconstruct (especially since I have some additional practice in fiddling with bc files, now (Can't promise to deal with them very quickly, though).

lib/Bitcode/Reader/BitcodeReader.cpp
282–283

Why the empty line?

294

http://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments
^^ This changed recently.

Omit \brief if the brief description is just a sentence. (You might want to add the '.' at the end, though)

770–771

Omit \brief.

785

Omit \brief.

3227

We're already at top level, no? (line 3234)
I might be missing something, but it looks like we're at top level, and saw an EndBlock. Shouldn't this be an error?

kschimpf updated this revision to Diff 26745.May 28 2015, 3:06 PM

Fixes based on feedback by Filipe.

lib/Bitcode/Reader/BitcodeReader.cpp
282–283

Removed.

770–771

Done.

785

Done.

3227

I agree that we should be at top level.

I also agree that it appears weird that we allow extra (unmathced) EndBlocks. This has been allowed by the bitcode reader/writer for years. I just wasn't willing to make the leap that I should remove this.

However, I tried removing it (and making it an error), and no tests failed. Hence, Converting this to an error.

4656

Moved the iterating of functions to before the call to finishParse. This deals with the problems I was having with tests in test/Bitcode/invalid.test (i.e. Inputs/invalid-fwdref-type-mismatch-2.bc and Inputs/invalid-load-ptr-type.bc). These two files had multiple errors (the one they intended which was inside a function body, and the one probably not intended - extraneous stuff at the end of the bitcode file).

This removes the need for the command line flag UseOldLazyBitcodeParse, and I have deleted it.

filcab edited edge metadata.EditedMay 28 2015, 3:58 PM

Hi Karl,

The files came straight from the fuzzer, so it is likely have more than one error. If, in order to support them (where support is: keep the test working and diagnosing what we want), you have to change the code in a convoluted way, I would prefer to change the test.

If the change is minimal and not a problem (doesn't impact legibility or architecture), then keeping the tests as they are is not a problem either. I just want to avoid having worse code just so we don't have to re-do some tests.

Of course, if we started crashing on the tests, that's a problem :-)

Thanks,

Filipe

(Phab butchered the comment in my email. Editing it to get a complete history in Phabricator)

kschimpf updated this revision to Diff 26923.Jun 1 2015, 1:39 PM
kschimpf edited edge metadata.
  • Fix invalid bitcode tests with more than one error.
  • Merge branch 'master' into readfac1

In addition to moving the code back in materializeModule, I fixed three tests. Two of them were fuzz tests where I "truncated" the file to the end of the module block. The other test did not have anything after a bad abbreviation definition, and the code file was incomplete. So I generated a replacement test that was well structured otherwise (i.e. only had that one error in it).

include/llvm/Bitcode/BitstreamReader.h
328–330

This code fixes the state of the bit streamer when no more input is found. As a result, method AtEndOfStream now works correctly.

lib/Bitcode/Reader/BitcodeReader.cpp
3226–3227

Discovered that the Bitstream::EndBLock was "hiding" a bug int the bitstream reader when processing a data stream. That is, when using a data stream, the size is not set until after the eof is reached. Hence, when Stream.AtEndOfStream() was called above, it would return false even when at the eof. The actual problem was in FillCurWord, which did not set the bit position correctly when there was no more input.

The old code worked because the read (at eof) would return zero, and is understood as an end block. By returning success for this value, it would hide this problem.

I also improved the error message so that once can see where the reader thought the eof should be, if there is miscellaneous stuff at the end of the bit code file. This makes it easier to know where to cut a test file in such cases.

4657

Now that the eof checking is fixed, I moved this back where it was in an earlier version of this CL.

Hi Karl,

Really sorry for the delay.
LGTM on my part, as long as you add the test for the error message and do the fix.

Thank you,

Filipe

lib/Bitcode/Reader/BitcodeReader.cpp
771–772

Nit: Put more words on the first line.

776

Nit: If it's for docs, it's probably best to start with an uppercase letter.

3229

Thank you!

3233

errs()? Or StrBuf?
Please also add a test for this error message.

kschimpf updated this revision to Diff 27681.Jun 15 2015, 9:15 AM

Fixed issues raised by Filipe.

lib/Bitcode/Reader/BitcodeReader.cpp
771–772

Done.

776

Done.

3233

Good catch. I meant StrBuf, so that we can use the same API for all errors.

rafael edited edge metadata.Jun 15 2015, 12:55 PM

Applying the patch locally to take a better look.

lib/Bitcode/Reader/BitcodeReader.cpp
269

Please commit the pure cleanup bits first:

using \ instead of @
starting functions with lowercase names.

3234

This looks a bit much to be honest. Corrupted files are not that common and it is trivial to set a breakpoint to find the state.

3245

This is always 0 or 1. Use a boolean instead.

I got the following test failures locally:

LLVM :: Bitcode/invalid.test
LLVM :: tools/gold/invalid.ll

This CL has gotten a bit long, and hard to read (to many versions). Moved to a new CL http://reviews.llvm.org/D10518

lib/Bitcode/Reader/BitcodeReader.cpp
269

I assume this has already been done. The new CL doesn't have suc cases anymore.

3234

Simplified in new CL to do same as before.

3245

This was already fixed in master.

kschimpf abandoned this revision.Jun 17 2015, 1:59 PM

This CL has gotten a bit long, and hard to read (to many versions). Moved to a new CL http://reviews.llvm.org/D10518