This is an archive of the discontinued LLVM Phabricator instance.

[libFuzzer] Add custom output function
Needs RevisionPublic

Authored by Manishearth on Nov 26 2019, 11:52 AM.

Details

Summary

Sometimes the input goes through a bunch of processing before it is
passed to the actual code being fuzzed. For example, you may have a
process that converts the input into a pair of UTF-8 strings. When this
is the case, it's trickier to understand what the failing input was
if libFuzzer is only outputting the byte stream.

This commit adds LLVMFuzzerCustomOutput, which can be used to print
the failing input string with custom formatting.

Event Timeline

Manishearth created this revision.Nov 26 2019, 11:52 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptNov 26 2019, 11:52 AM
Herald added subscribers: llvm-commits, Restricted Project, delcypher. · View Herald Transcript

We're hoping to use this in cargo-fuzz: https://github.com/rust-fuzz/libfuzzer-sys/pull/48

cargo-fuzz supports smoothly using non-bytestrings as inputs to fuzz targets, so you can ask for a String and the underlying library will construct one from the bytestring for you. However, this does mean that the actual underlying byte string may be harder to read, which can be annoying when debugging failures. We'd like to be able to hook into this and automatically format the output.

kcc added a comment.Nov 26 2019, 5:16 PM

I'd prefer to not overload the public interface and this use-case sounds a bit too corner-case-ish.
Am I wrong?

Also, this seems to be only important for interactive use of libFuzzer.
Can it be made outside of libFuzzer somehow?

I don't see how this is a corner case: anything fuzzing with structured data by parsing the bytestring into something more structured will benefit from this. cargo-fuzz is used by a lot of Rust programs and some of them use the automatic structured data stuff, so they'd all benefit. I wish C++ codebases had something like this so that it would be easier to write good fuzz testcases, but as of now they don't and you have to convert bytestrings into input by hand. Still, such tests could still make use of a custom output formatter.

I don't see how this can be done outside the program, there's no hook that can be run when fuzzing completes.

(I'm not sure what you mean by interactive use of libFuzzer? This is for when you link libfuzzer to a program defining LLVMTestOneInput and then run the resulting binary)

Hi, I'd just like to voice my support for this functionality as another user of cargo-fuzz.

I'm often interpreting the input bytes into a some kind of structured input in memory, and when libfuzzer finds an input that triggers a failure in my fuzz target, I'd like to be shown the structured input rather than the byte string that it was interpreted from.

Generally, if one was implementing custom mutations for a structured input, wouldn't one also want to have a custom display for that structured input? They seem to go hand in hand.

[sorry for delay, I was OOO]
So, this patch will cause LLVMFuzzerCustomOutput to be called on the reproducer input
which in turn will cause an arbitrarily large input to be printed to stderr (stdin)?
Or in fact, it will cause an arbitrary action to be performed with {Data,Size}

I can totally see how this is helpful in some cases when running libFuzzer manually, but it can also be very annoying when the reproducer is large.
In any kind of automated scenario, it should be easy to add a separate binary that prints the inputs in human readable form.

I am still not convinced that this functionality deserves an extension to public API.
Not because I don't understand the need, but because the public API is expensive to maintain when it gets large.

But if I were convinced, I would ask to change it to

size_t LLVMFuzzerHumanReadableDump(const uint8_t *Data, size_t Size, uint8_t *Output, size_t MaxOutputSize)

(with a better name and with a description of what happens if the output is > MaxOutputSize )

And then there will need to be a run-time flag (generalized from kMaxUnitSizeToPrint)
to set the maximal size of printed output.

When fuzzing with text protos we don't need this because text protos are human readable.
But we also sometimes fuzz with binary protos.
Vitaly, WDYT, would the binary proto fuzzing users want to have textual dumps of the reproducers?

I can totally see how this is helpful in some cases when running libFuzzer manually, but it can also be very annoying when the reproducer is large.
In any kind of automated scenario, it should be easy to add a separate binary that prints the inputs in human readable form.

This requires parsing the human-readable libfuzzer output though, which could change, and is also brittle

I think this is still useful for automated runs.

I am still not convinced that this functionality deserves an extension to public API.

As fitzgen said this is pretty much something that would go hand in hand with a custom mutation function, which is functionality that you already have. Furthermore, it's useful even when you don't need custom mutators (which is true for most cargo-fuzz use cases).

But if I were convinced, I would ask to change it to

size_t LLVMFuzzerHumanReadableDump(const uint8_t *Data, size_t Size, uint8_t *Output, size_t MaxOutputSize)

(with a better name and with a description of what happens if the output is > MaxOutputSize )

And then there will need to be a run-time flag (generalized from kMaxUnitSizeToPrint)
to set the maximal size of printed output.

This could work. I was trying to avoid having to do some kind of allocation dance, but this makes sense.

When fuzzing with text protos we don't need this because text protos are human readable.
But we also sometimes fuzz with binary protos.

We're fuzzing with arbitrary structured data, not just protobufs. The arbitrary data is derived from the bits in a quickcheck-esque way.

I don't think there's an equivalent structured fuzzing library for C++, but it could be written with some work, and it would find this useful too. The goal here is to make the job of fuzzing very easy: if the fuzzer can produce structured data, then you don't need to do the work of constructing that structured data from the binary yourself.

Looking more closely at the libfuzzer+protobuf stuff, what we're doing is quite similar to the protobuf thing, except we're not using protobufs since Rust lets you do custom derives which allow us to directly create a bits-to-structured data function.

But yeah, this is similar to using binary protos. The original input isn't really readable.

kcc added a comment.Dec 6 2019, 10:47 AM

I can totally see how this is helpful in some cases when running libFuzzer manually, but it can also be very annoying when the reproducer is large.
In any kind of automated scenario, it should be easy to add a separate binary that prints the inputs in human readable form.

This requires parsing the human-readable libfuzzer output though, which could change, and is also brittle

No. libFuzzer writes the input file to a predictable path which is reported in a line like

artifact_prefix='./'; Test unit written to ./crash-a40d96b9ef9c17f9012c0159faa4efb01b6daf1b

The path is controlled by two flags, exact_artifact_path and artifact_prefix so that you can write the file to a precisely specified path instead.

If you need to automate human-readable dumping of artifacts after the fact
I would use artifact_prefix=./ARTIFACT_DIR_NAME and then dump all files that appear in ARTIFACT_DIR_NAME.

You can consider just to link needed code into small binary which will printout serialized inputs from the drive. And call that e.g. in the following way:

my_fuzzer  |& my_printer

Then my_printer will match "Test unit written" and printout content of corresponding file.

This option existed before, but I don't see that it used a lot.
So having that workaround is quite simple I'd prefer to not extend API.

Thanks for the in-depth replies, @kcc and @vitalybuka.

will cause an arbitrarily large input to be printed to stderr

If someone sets this hook, then they *want* those logs, so I see this as working as expected, and not as something to be concerned about.

I suppose that if you are for some reason running someone else's fuzzer, perhaps you wouldn't want those logs? Could this (seemingly less common) use case be supported with a -disable-custom-output flag or something of the sort?

I'd prefer to not overload the public interface

I hope that this concern can be alleviated by

  1. the tiny size of this patch, which shows that it is not an invasive change and should not be a burden to maintain going forward, and
  1. the conceptual congruence it has with the existing custom mutation hook support.

Furthermore, it seems like having everyone with this use case write their own wrapper script sums to more of a total maintenance burden than accepting this patch and solving the issue for everyone once and for all.

But yes, there is no getting around the fact that this patch adds to the public interface, and if keeping that small is the absolute highest priority, then we can go back to the drawing board.

You can consider just to link needed code into small binary which will printout serialized inputs from the drive. And call that e.g. in the following way:

At least in the case of cargo-fuzz this means compiling the fuzz target under two modes, because the way libfuzzer works it has to take over as main. This is really hard to make work under cargo, which is driving all the stuff here. It's fine when you need to compile one binary per fuzz target, but compiling two from the same code requires a lot of invasive changes that are not possible for a cargo wrapper.

This option existed before, but I don't see that it used a lot.
So having that workaround is quite simple I'd prefer to not extend API.

We, at least, can't use this workaround.

kcc added a comment.Dec 6 2019, 1:31 PM

Understood.
Before agreeing with the approach I'd like to hear from more users who need this and some details about their use cases. I've pinged some users.

kcc added a comment.Dec 6 2019, 1:42 PM

BTW, may I ask you to provide some details of your Rust fuzz target examples?
(like the code of the fuzz target and the output with your patch)

Dor1s added a subscriber: Dor1s.Dec 6 2019, 1:49 PM

In Chromium we recommend fuzz target authors do something like this (https://cs.chromium.org/chromium/src/testing/libfuzzer/fuzzers/javascript_parser_proto_fuzzer.cc?l=63&rcl=f88381726e8673b289d79dde6af1b6b7f1ab063a):

if (getenv("LPM_DUMP_NATIVE_INPUT")) {
  std::cout << source_string << std::endl;
  std::cout << "module: " << source_protobuf.is_module() << std::endl;
}

It may look less clean that a separate function, but OTOH this gives more flexibility as you can inject code like this in various places or even have different env vars for different behaviors.

getenv doesn't invoke any syscalls and works fast, so I don't think this approach puts any considerable performance penalty.

As for this patch, from a user perspective I see the convenience of having a specific interface function, but as Kostya said maintaining those is pricey. We had issues in Chrome with LLVMFuzzerInitialize, for instance, observing different behaviors of the compiler on different platforms, and those issues were:

  • hard to debug;
  • made us spend quite some time on discussions whether we fix should things in the compiler or in the libFuzzer or in the Chromium code;
  • ended up with a bunch of manual code changes and updates to the user documentation (which did not make it easier to follow -- sigh).

TL;DR as a person who has to ensure that thousands of fuzz targets are building and running well across a variety of projects and platforms, I would avoid extending the interface as long as there are reasonable alternatives (see the beginning of my comment).

In D70738#1773461, @kcc wrote:

BTW, may I ask you to provide some details of your Rust fuzz target examples?
(like the code of the fuzz target and the output with your patch)

There's a short example here, but a real example where this would matter a LOT is something like

fuzz_target!(|(s, b): (String, bool) | {
    if s == "hello" && b {
        panic!("success!");
    }
});

We're actually still working on improving the behavior around this, right now it's a bit slow for silly reasons, but we'd like the above to work smoothly.

A running example I have locally is

fuzz_target!(|data: (u8, u8)| {
    if 0 == data.0 && 70 == data.1 {
        panic!("no")
    }
});

which produces the output

SUMMARY: libFuzzer: deadly signal
MS: 1 InsertByte-; base unit: 5ba93c9db0cff93f52b521d7420e43f6eda2784f
0x0,0x46,
\x00F
artifact_prefix='/home/manishearth/mozilla/sand/fuzz/fuzz-test/fuzz/artifacts/fuzz_target_1/'; Test unit written to /home/manishearth/mozilla/sand/fuzz/fuzz-test/fuzz/artifacts/fuzz_target_1/crash-beb59e97091dd2f1ada25fea1abf7a70ac3ada78
Base64: AEY=
Formatted: (0, 70)

If you want to look at our output function, it's here, the println!("Formatted: {:?}", data) is doing the actual formatting.

The problem with approaches like this is that you still need to identify the last failing use case and run it a second time. With this patch the user experience is very smooth: the failing string is formatted and printed immediately.

It feels to me that we're using libFuzzer a bit differently from y'all, with different goals on usability. The workarounds mentioned so far are very suboptimal :/

We're hoping to make fuzzing as easy as possible for random maintainers to quickly set up, as opposed to us maintaining a lot of fuzz targets written in a uniform way.

Also, to be clear, since we build libFuzzer ourselves, this patch doesn't _have_ to be upstreamed for us to be able to use it, but we'd prefer to not have to fork libFuzzer, and I strongly feel that there will be other people for whom this patch will be useful.

Dor1s added a comment.Dec 6 2019, 2:36 PM

The problem with approaches like this is that you still need to identify the last failing use case and run it a second time. With this patch the user experience is very smooth: the failing string is formatted and printed immediately.

Oh, why do you need to print it every time fuzz target hits a crash? It should be an option controlled during runtime, so that users or an automated infrastructure can easily turn it on when reproducing an individual crash.

Oh, why do you need to print it every time fuzz target hits a crash? It should be an option controlled during runtime, so that users or an automated infrastructure can easily turn it on when reproducing an individual crash.

I don't understand why you _wouldn't_ want to: why do we print three different representations of the binary fuzz data each time the target hits the crash? This would be for the same reason.

The vast majority of workflows for Rust fuzzing is to run cargo fuzz name_of_target and look at the output of libFuzzer.

Dor1s added a comment.Dec 6 2019, 3:27 PM

If you're running fuzz targets manually, then it makes sense -- you need to type one command less to print the crash input. Although it seems like cargo fuzz can do that for you as well, so it will be still a single command.

Once you start automating things and run fuzz targets at scale, you won't need to look into the input until the following is done:

  1. verify reproducibility
  2. perform deduplication
  3. minimize the input

Also note that the same crashes will be hit many times if you run fuzz targets in parallel. That's why it's better not to print the input every time a fuzz target crashes.

If you're running fuzz targets manually, then it makes sense -- you need to type one command less to print the crash input. Although it seems like cargo fuzz can do that for you as well, so it will be still a single command.

Kind of, cargo fuzz doesn't know how to format your code, only the fuzz target (written by the user) does, since cargo-fuzz is unaware of the types being used. Due to the way libFuzzer works -- it replaces your main function -- we can't add another code path that just loads the string from a specified file and formats it because no matter what libFuzzer will still be run.

Dor1s added a comment.Dec 9 2019, 7:49 AM

If you're running fuzz targets manually, then it makes sense -- you need to type one command less to print the crash input. Although it seems like cargo fuzz can do that for you as well, so it will be still a single command.

Kind of, cargo fuzz doesn't know how to format your code, only the fuzz target (written by the user) does, since cargo-fuzz is unaware of the types being used. Due to the way libFuzzer works -- it replaces your main function -- we can't add another code path that just loads the string from a specified file and formats it because no matter what libFuzzer will still be run.

Sorry, I wasn't clear. I mean if you write fuzz targets with the getenv() based solution I proposed, cargo fuzz can be extended so that it will execute the crashing input with the necessary env variable set. That way we achieve the following:

  1. a single cargo fuzz command for your users
  2. no changes to libFuzzer's public interface
  3. instead of defining a new function, users can inject the printing code anywhere under if getenv() condition

now I look at this and realize that 3) is even better than having a separate function, e.g. when you initialize a complex object or a protocol packet, you could print it right away from the LLVMFuzzerTestOneInput, rather than duplicating / refactoring out data processing code in order to support an additional printing routine.

Sorry, I wasn't clear. I mean if you write fuzz targets with the getenv() based solution I proposed, cargo fuzz can be extended so that it will execute the crashing input with the necessary env variable set. That way we achieve the following:

No, it cannot, libFuzzer sets a main function at link time, this function cannot be overridden at runtime through an environment variable. So no matter what you do, if you execute this binary, it will pick some random input and start fuzzing. There doesn't seem to be a way for the fuzz target to prevent that, the only thing you can do is compile two separate binaries, which is trickier to do in the setup we have.

  1. instead of defining a new function, users can inject the printing code anywhere under if getenv() condition

We don't want users to have to deal with this, it's supposed to be a smooth, invisible API

Sorry, I wasn't clear. I mean if you write fuzz targets with the getenv() based solution I proposed, cargo fuzz can be extended so that it will execute the crashing input with the necessary env variable set. That way we achieve the following:

It's going to get printed on each fuzz run, though, which may not be desirable.

Dor1s added a comment.Dec 9 2019, 8:23 AM

Here's a concrete example:

  1. compile https://cs.chromium.org/chromium/src/testing/libfuzzer/fuzzers/javascript_parser_proto_fuzzer.cc
  2. fuzz with the binary produced
  3. if a crash happens, execute LPM_DUMP_NATIVE_INPUT=1 ./binary <crash_input> to get the input fancy printed
  1. and 3. can be both implemented inside cargo fuzz, i.e. the user will compile only one binary and execute only one cargo fuzz command

We don't want users to have to deal with this, it's supposed to be a smooth, invisible API

Writing if getenv() statement is not harder then writing a new function. Maintaining that solution costs 0 for libFuzzer upstream and any other libFuzzer users.

Dor1s added a comment.Dec 9 2019, 1:18 PM

Another (real) example, imagine a fuzz target like this: https://cs.chromium.org/chromium/src/net/spdy/fuzzing/http2_frame_decoder_fuzzer.cc?rcl=0be62a8d95f7fa1455fce1a76f0fa5b8484d0c8c&l=34

The data is being split into multiple pieces (e.g. chunk size and chunk buffer) in a loop (fuzzing a function that decodes frames coming from the network).

To write LLVMFuzzerCustomOutput function, you would need to either duplicate that data break up logic, or introduce another function for splitting the data and passing the objects around.

With the proposed solution (which we use in Chrome / at Google), you can print pretty much anything from anywhere. The getenv result might be put in a static variable and you'd print every chunk before actually decoding it. Flexible and easy to write / read / maintain.

Note that this is a real but still simple example. We have more sophisticated fuzz targets where implementing LLVMFuzzerCustomOutput would be a considerable waste of time.

  1. if a crash happens, execute LPM_DUMP_NATIVE_INPUT=1 ./binary <crash_input> to get the input fancy printed

Ah, I didn't know the binary could be used this way. Still, it's a bit annoying to have to parse the crash lines and hope that format never changes.

Writing if getenv() statement is not harder then writing a new function. Maintaining that solution costs 0 for libFuzzer upstream and any other libFuzzer users.

I think I'd misunderstood your suggestion. It seems we'd be able to stick the getenv call at the top of the macro, I was saying that we don't want the _user_ to ever have to write this.

To write LLVMFuzzerCustomOutput function, you would need to either duplicate that data break up logic, or introduce another function for splitting the data and passing the objects around.

We're using an autogenerated trait for this anyway, so we get this for free.

kcc added a comment.Dec 10 2019, 4:42 PM

We're using an autogenerated trait for this anyway, so we get this for free.

So, your LLVMFuzzerCustomOutput is generated.
Can't you also generate LLVMFuzzerTestOneInput so that it calls LLVMFuzzerCustomOutput when e.g.

LPM_DUMP_NATIVE_INPUT=1 ./binary <crash_input>

is called?

Still, it's a bit annoying to have to parse the crash lines and hope that format never changes.

That part I am happy to resolve separately if you think it's a problem (but see my response above about the two flags, exact_artifact_path and artifact_prefix).
E.g. we have -print_stats that prints machine-readable output, we can do something like that and guarantee it doesn't change.
And, "hope that format never changes" can be solved by adding a test.

Just to reiterate: I'd love to solve the problem, I just don't think that introducing new public API is the right way.

Can't you also generate LLVMFuzzerTestOneInput so that it calls LLVMFuzzerCustomOutput when e.g.

LPM_DUMP_NATIVE_INPUT=1 ./binary <crash_input>

is called?

Yeah, I hadn't realized this was doable until earlier.

That part I am happy to resolve separately if you think it's a problem (but see my response above about the two flags, exact_artifact_path and artifact_prefix).

These could work, I was hoping to closely match libFuzzer behavior on the autogeneration here. But we could pass some flag, I guess.

E.g. we have -print_stats that prints machine-readable output, we can do something like that and guarantee it doesn't change.

We want the user to also have user-readable output, though.

Just to reiterate: I'd love to solve the problem, I just don't think that introducing new public API is the right way.

I get this, I personally think this API is small, optional and should not be an issue. It produces the output line among all the other formatted lines (base64, hex, etc), which is good UX. In general having a post-fuzz hook seems like a useful concept to me.

kcc added a comment.Dec 11 2019, 6:19 PM

E.g. we have -print_stats that prints machine-readable output, we can do something like that and guarantee it doesn't change.

We want the user to also have user-readable output, though.

-print_stats's output is both human- and machine- readable.
It's just an example. We can similarly print the artifact name in a separate line and set the output format in stone (or declare the current output to be set in stone, make sure there is a test for it, and comment the test to ensure we don't break it)

vitalybuka requested changes to this revision.Apr 8 2020, 6:05 PM
This revision now requires changes to proceed.Apr 8 2020, 6:05 PM