This is an archive of the discontinued LLVM Phabricator instance.

Implement tool to convert bitcode to text.
AbandonedPublic

Authored by kschimpf on Jul 9 2015, 11:06 AM.

Details

Summary

Provide a new tool for LLVM bitcode files that makes it easier to test
and fuzz. That is, a simple textual form that captures the contents of
binary bitcode files.

The textual form removes all references to abbreviations in the binary
bitcode file. This is done to simplify its form.

A textual bitcode record is a sequence of (unsigned) integers,
separated by commas, and terminated with a semicolin. Each bitcode
record appears on a separate line.

The tool llvm-bcanalyzer also provides a textual version of bitcode
files. However, it doesn't provide anyway to convert the output back
to a binary bitcode file. It is also much more complex.

Besides providing a command line tool, it set up for use with lit
(LLVM Integrated Tester). Invalid bitcode tests, that don't
corresponding to abbreviations and bitstream issues, can be written as
a text file. This text file can run tests and check the results using
FileCheck.

The use of textual bitcode files (unlike the current invalid bitcode
tests) can use a single file to describe the invalid test. It also has
the advantage of being easier to port if the format of bitcode records
change. In such cases, you only need to edit the problematic textual
bitcode records that have changed since the test was added. Current
tests are in binary form, and very difficult to figure out how to
change when the bitcode format changes.

For fuzzing, the text form is easy to model using tokens (the digits,
separator ',', and terminator ';\n'). This allows (local) mutations to
easily be formed by existing fuzzers (such as afl-fuzz) without
modification. The textual-to-binary conversion methods can be used as
a post-processor of the fuzzer, or as a preprocessor to the llvm tool
being tested.

The motivation for this tool has been the success PNaCl
(www.gonacl.com) has had in testing its tools, as well as the ease of
generating invalid test cases.

Diff Detail

Event Timeline

kschimpf updated this revision to Diff 29360.Jul 9 2015, 11:06 AM
kschimpf retitled this revision from to Implement tool to convert bitcode to text..
kschimpf updated this object.
kschimpf added reviewers: jvoung, dschuff.
kschimpf added a subscriber: llvm-commits.
jvoung edited edge metadata.Jul 9 2015, 1:48 PM

Made a quick scan over (not a full review).

Would be nice if folks who are more involved with fuzzing could chime in about whether or not they agree if this tool will be useful to them =)

include/llvm/Bitcode/BitcodeConvert.h
2

Short description next to the file name?

2

BitcodeConvert might be a bit too generic of a name. What does it convert to-from?

I'm sure more experienced reviewers would have a better suggestion. TextualBitcodeConvert ?

14

semicolin -> semicolon (here and below)

17

For "easy to fuzz", I don't know if you mentioned the issue with binary form. The binary form isn't byte aligned, etc...

59

fo -> for

60

Do you have any experience for supporting ErrorRecovery mode?

lib/Bitcode/Reader/BitcodeRecordReader.cpp
1 ↗(On Diff #29360)

Wonder if these various fils should be TextualBitcode...

18 ↗(On Diff #29360)

semicolin -> semicolon in this file too

62 ↗(On Diff #29360)

Don't need to reference "pnacl" here...

133 ↗(On Diff #29360)

Clarify that if not found, it will not advance the cursor.

142 ↗(On Diff #29360)

How about skipWhitespaceLines() as an alternate name? This doesn't skip whitespace in the middle of a line.

262 ↗(On Diff #29360)

read -> Read

271 ↗(On Diff #29360)

read -> Read

288 ↗(On Diff #29360)

Could this be shared with the "other" bitcode reader? (binary to IR reader, which has two instances of this already?).

The refactoring can be a separate patch too...

369 ↗(On Diff #29360)

Doesn't this already check isBitcodeWrapper ?

lib/Bitcode/Writer/BitcodeRecordWriter.cpp
1 ↗(On Diff #29360)

BitcodeRecordWriter.cpp (capital W)

Wonder if this file should also be prefixed with Textual...

12 ↗(On Diff #29360)

Capital T for Textual seems a bit inconsistent (binary doesn't get a capital B) in the docs. So, perhaps just lower case it.

243 ↗(On Diff #29360)

If you are refactoring the reader to share code, in a separate patch, you could share this writeHeader also.

tools/Makefile
31–32

let's keep the P in PARALLEL

tools/llvm-bcconv/Makefile
12

BitReader BitWriter

tools/llvm-bcconv/llvm-bcconv.cpp
2

Update the header (it's not llvm-bcanalyzer)

24

semicolon

31

Is the motivation for making it optional, so that a fuzzer doesn't go off and start filling up the file with comments which don't affect anything?

I guess I would have hoped that a fuzzer would be smart enough to detect that such changes aren't useful (no increase in coverage), but maybe it's still helpful to make it optional...

kcc added a subscriber: kcc.Jul 9 2015, 2:19 PM

Will this work with libFuzzer out of the box?
Do you need something form libFuzzer to make it work (better|at all)?

I believe so, but haven't tried it yet. I will be on vacation next week,
but will look into this further when I get back on the 20th.

Karl

Karl Schimpf

kcc added a comment.Jul 9 2015, 2:25 PM

I believe so, but haven't tried it yet. I will be on vacation next week,
but will look into this further when I get back on the 20th.

I don't want to block you on this, but my first question is whether we need this change to fuzz using libFuzzer at all.
If I were doing a mutator for llvm bit code, I'd probably do something different.
We even have an interface for custom mutators:
https://github.com/llvm-mirror/llvm/blob/master/lib/Fuzzer/FuzzerInterface.h#L51

Another story is if we also have to support afl-fuzz, w/o adding features to afl-fuzz.

I guess I'm confused. I thought you were talking about a custom mutator.
One of the main goals is to allow it to use this API. The fuzzer would fuzz
the textual version. The custom mutator would then modify the fuzzed
bitcode into bitcode records using function readTextualBitcode, and then
reformat it to the binary form by calling writeBinaryBitcode.

This would allow the LLVM bitcode reader to work out of the box, without
having to change the code.

The tool llvm-bcconv just makes it convenient to build unit test cases for
LLVM, and to convert (assembled) llvm fuzzing test cases into the textual
form.

Karl Schimpf

If we are going to have this (not convinced we should), it should be
another library and not modify lib/Bitcode.

kschimpf updated this revision to Diff 30956.Jul 29 2015, 3:08 PM
kschimpf marked 14 inline comments as done.
kschimpf edited edge metadata.

Apply fixes from review, and add "simplified" form of bitcode.

I have added the new form "simplified" bitcode to this, based on conversations with Kostya and libFuzzer. I also moved the code into a separate library as suggested by Rafael.

include/llvm/Bitcode/BitcodeConvert.h
2

I don't like TextualBitcodeConvert because it converts back and forth between the text, binary, and (new) simplified forms.

The simplified form is based on discussions with Kostya and the way that libFuzzer works. It is intended to improve the ability of that fuzzer to mutate simplified forms.

17

Changed the discussion to try and make this more clear.

60

The main purpose for supporting error recovery is to be able to convert fuzzed (i.e. mutated) bitcode files into something that is (bitstream) readable. It is not clear if they are needed for the other forms (i.e. textual and the new variant "simplified"). However, since all three forms use the same driver, it is easier to assume that we can recover when writing all forms.

lib/Bitcode/Reader/BitcodeRecordReader.cpp
1 ↗(On Diff #29360)

Decided to separate out each of the reader/writers into separate files, and moved then to a separate subdirectory (and library).

288 ↗(On Diff #29360)

I agree that this should be factored out, however, I will do it in a separate patch.

369 ↗(On Diff #29360)

Yes. Forgot to remove TODO after fix.

lib/Bitcode/Writer/BitcodeRecordWriter.cpp
1 ↗(On Diff #29360)

Refactored the code so that each bitcode writer is in a separate source file.

243 ↗(On Diff #29360)

Agreed. Will create separate patch for this.

tools/llvm-bcconv/llvm-bcconv.cpp
31

Yes. I'm not sure that it will guarantee that the fuzzer will notice this, so I was being safe.

Please move the library outside of Bitcode. This is just another
textual representation. This patch should not change anything in
lib/Bitcode.

I'm a bit confused on naming conventions. lib/Bitcode represents multiple libraries:

lib/Bitcode/Reader
lib/Bitcode/Writer

I assume that they are contained in lib/Bitcode because they represent Bitcode, bitcode records, and the corresponding bitstreams.

Similarly, the bitcode converter reads/writes bitcode records without converting to IR, or starting with IR. Rather, it only uses the (same) bitcode records that the bitcode reader and writer use. For this reason, I created a separate library lib/Bitcode/Convert.

Are you suggesting that I should (instead) create a separate directory lib/BitcodeConvert, even though lib/Bitcode defines bitcode records as well?

Rafael wrote:

I am simply not convinced of the need for this and want it to be as
out of the way as possible.

Given that this is just used by tools/llvm-bcconv, probably the best

i> s to just move *all* the code to tools/llvm-bcconv and not have a

library at all.

I did think of putting it all in tools/llvm-bcconv. However, I would rather have a library. Part of this work is to build fuzzers for LLVM bitcode files. I already have code that fuzzes LLVM bitcode files using lib/Bitcode/Convert and lib/Fuzzer. That is my motivation for having a library.

Rafael wrote:

Why can't the fuzzer write a file?

lib/Fuzzer runs the fuzzing in the same process as the test. It doesn't create an intermediate file. Even if it did, the conversion must still take place.

The reason for this is that the what the LLVM bitstream is modeled, it is not conducive to fuzzing. That is, values are variable-rate bit encoded, based on the value. This implies that mutating a couple of bits will (almost always) make the rest of the input bitstream unreadable. The point of the fuzzed mutations is to change small portions of the input, and leave the rest alone. This is not possible with the binary form of bitcode.

Hence (as mentioned in the comments for include/llvm/Bitcode/BitcodeConvert.h), this CL defines two alternative forms: simplified and textual. Both are essentially the same, except that the textual form is human readable, by using textual digits rather than binary bytes to define numbers. These alternative forms are specifically designed to all small mutations to only effect the bitcode record (and possibly the immediately surrounding records), when the contents of a bitcode record is mutated. This makes the mutations of bitcode tractable.

If it doesn't create files, why do you need a text representation? If
you don't need the text representation, than it is llvm-bcconv that
should be deleted and this library merged into a tools/llvm-bc-fuzzer.

The summary is that this is a lot of code with unproven value. So far
filcab has found more bugs with just afl.

kcc added a comment.Aug 5 2015, 11:54 AM

afl is great, but in my experience more fuzzers lead to more bugs.
Besides, libFuzzer being in-process makes it much faster.
(I am biased, of course)
Maybe if Karl can add the actual libFuzzer glue to this patch and demonstrate a couple of bugs found this way
it will be easier to make it through.

kschimpf updated this revision to Diff 31399.Aug 5 2015, 2:38 PM

Files in llvm/test/Convert show how this code could be used.

Files in llvm/test/Convert are as follows:

File MyMain.cpp that shows how the bitcode coversion library of
http://reviews.llvm.org/D11072 can be used. The single input fed to
the generated fuzzer is a trivial program in foo.ll.

Within less than a minute I was able to generate the crashes in this
CL (files with prefix "crash"). Files with extension '.bc' are LLVM
bitcode files. Files with extension '.sbc' are the corresponding
bitcode files in simplified (binary) form. Files with extension '.err'
are the corresponding crashes caused by the '.bc' file.

Note: The code in test/Convert will be removed before committing. It was included to show how the bitcode conversion library/tool can be used.

I am all for fuzzing stuff, but I don't think I will have the time to really review this much code right now.

Kostya is the fuzzer expert and Duncan the bitcode one. If you guys are OK with this, go for it.

I just have two passing requests:

  • Please git-clang-format the patch.
  • Please don't add unnecessary features. No additional text format if it is not need. No additional library if only one program will use it. Etc.
kcc edited edge metadata.Aug 7 2015, 10:12 AM
  • Please don't add unnecessary features. No additional text format if it is not need. No additional library if only one program will use it. Etc.

Karl,
Within the Rafael's constraints, to which I tend to agree, you may want to back up to my initial proposal:
Implement a fuzzer with a custom mutator:
https://github.com/llvm-mirror/llvm/blob/master/lib/Fuzzer/FuzzerInterface.h
https://github.com/llvm-mirror/llvm/blob/master/lib/Fuzzer/test/UserSuppliedFuzzerTest.cpp

It can all be done simple and in a single file:

  1. Read and parse the bit code
  2. translate it into some simple binary form in-memory. The simpler the format is the better.
  3. let the fuzzer mutate it
  4. translate the mutated binary form back to bitcode. If the translation was successful -- run optiimzations. This step is the most important: how many of the mutated units will lead to a valid bitcode?
  5. profit

Kostya,

I agree that we should keep it simple. The text format isn't necessary, but
it does make writing unit tests much easier to maintain. Since the records
are simple lists of (textual) numbers, it is easy to update a record in
this form, should the format change.

When the code is in binary form, this is much harder to do.

However, for a first step, I agree that we should not worry about the
textual form. That can be left to be introduced later, when more concrete
examples of its utility can be shown.

I will also look at your suggested list of flow for mutators, to see how it
effects things.

I also agree that we want to keep things as simple as possible. We might be
able to put it all in a custom mutator library. I will look into that.

Finally, if possible, I will split this patch into smaller patches. I
initially presented as a single patch mainly to make it clear how the
pieces fit together.

Karl

Karl Schimpf

kcc added a comment.Aug 7 2015, 11:08 AM

When the code is in binary form, this is much harder to do.

Just to clarify:
My suggestion is that files on disk should be in one of the two standard formats: textual bit code format or binary bitcode format.
The "fuzzer-friendly" format should exist only in memory and never be written on disk.

kschimpf edited edge metadata.Aug 7 2015, 2:46 PM
kschimpf added a subscriber: chandlerc.

Met with Chandler Carruth, JF Bastien , Kostya Serebryany, and myself (Karl Schimpf) to discuss how to proceed with helping the LLVM community using fuzzing. Briefly discussed patch D11072. Then discussed the overall goals of the LLVM community with respect to fuzzing.

We discussed that there really should be 3 levels of fuzzing:

  1. Parsing .ll files to generate IR.
  2. Optimization phases of the compiler
  3. Using bitcode to represent other structured objects.

It was also noted that once these levels are more stable, one could move into more structural fuzzing (details omitted here).

As such, patch D11072 doesn’t really fit this strategy. Hence, this patch will be abandoned.

However, I will continue to work towards fuzzing LLVM to improve the code quality. My initial goals will be to apply afl-fuzz to a corpus of .ll files to find errors in parsing .ll files (i.e. run llvm-as).

I will also work with Kostya on building a fuzzer using libFuzzer to also test parsing .ll files to generate IR. To improve code coverage, the goal is to call functions that remove the bitcode writing step of llvm-as.

Once llvm-as is relatively stable, focus will then move forward to the optimization phases of the compiler.

Any help or comments from the general LLVM community to help with fuzzing is greatly appreciated.

kschimpf added a subscriber: filcab.Aug 7 2015, 2:47 PM
kschimpf abandoned this revision.Aug 28 2015, 10:49 AM