This is an archive of the discontinued LLVM Phabricator instance.

[compiler-rt] Move FDP to include/fuzzer/FuzzedDataProvider.h for easier use.
ClosedPublic

Authored by Dor1s on Aug 2 2019, 8:12 AM.

Details

Summary

FuzzedDataProvider is a helper class for writing fuzz targets that fuzz
multple inputs simultaneously. The header is supposed to be used for fuzzing
engine agnostic fuzz targets (i.e. the same target can be used with libFuzzer,
AFL, honggfuzz, and other engines). The common thing though is that fuzz targets
are typically compiled with clang, as it provides all sanitizers as well as
different coverage instrumentation modes. Therefore, making this FDP class a
part of the compiler-rt installation package would make it easier to develop
and distribute fuzz targets across different projects, build systems, etc.
Some context also available in https://github.com/google/oss-fuzz/pull/2547.

This CL does not delete the header from lib/fuzzer/utils directory in order to
provide the downstream users some time for a smooth migration to the new
header location.

Event Timeline

Dor1s created this revision.Aug 2 2019, 8:12 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 2 2019, 8:12 AM
Herald added subscribers: Restricted Project, delcypher, mgorny and 2 others. · View Herald Transcript

I'm not opposing, but i have a question - this is not fuzzer specific at all, right?
This is just Span on steroids - knows it's size and byte position within the buffer,
and has methods to change the position by consuming bytes; nothing more?

Dor1s added a comment.Aug 2 2019, 8:41 AM

I'm not opposing, but i have a question - this is not fuzzer specific at all, right?

Yes, see the summary above.

This is just Span on steroids - knows it's size and byte position within the buffer,
and has methods to change the position by consuming bytes; nothing more?

No, span is harmful for fuzzing, as its boundaries are not instrumented (i.e. we can miss some buffer under-/overflows). The FDP takes care of that by allocating dedicated buffers for separate inputs. Plus, it provides various other helpers like ConsumeBool or PickValueInArray to save people from writing custom tricks like data++[0] % something again and again.

It has evolved from a similar classes invented in Chrome and some other Google projects, and it did prove to be useful.

I should probably add some documentation in LLVM. As of now there is a short documentation for FDP in google/fuzzing repo: https://github.com/google/fuzzing/blob/master/docs/split-inputs.md#fuzzed-data-provider

I'm not opposing, but i have a question - this is not fuzzer specific at all, right?

Yes, see the summary above.

This is just Span on steroids - knows it's size and byte position within the buffer,
and has methods to change the position by consuming bytes; nothing more?

No, span is harmful for fuzzing, as its boundaries are not instrumented (i.e. we can miss some buffer under-/overflows). The FDP takes care of that by allocating dedicated buffers for separate inputs.

The word that is throwing me off here is "inputs".
If fuzzer gave us a buffer of 8 bytes, and we consume it as 2 consecutive 4-byte integers,
the terminology here means those are 2 separate inputs, right?
And while span would act like a light-weight view with no extra allocations,
this would allocate a *separate* buffer for each of these "inputs"?
That's it? I think this should be documented better, if it's not already.

Plus, it provides various other helpers like ConsumeBool or PickValueInArray to save people from writing custom tricks like data++[0] % something again and again.

It has evolved from a similar classes invented in Chrome and some other Google projects, and it did prove to be useful.

I should probably add some documentation in LLVM. As of now there is a short documentation for FDP in google/fuzzing repo: https://github.com/google/fuzzing/blob/master/docs/split-inputs.md#fuzzed-data-provider

Dor1s added a comment.Aug 2 2019, 9:04 AM

the terminology here means those are 2 separate inputs, right?

Yes! Can you think of a better name for that? I thought "argument" or "parameter" could work, but what if those values are used somehow else (e.g. a number of iterations) and not passed to any function explicitly, then they are not technically arguments or parameters :/ "input" sounds pretty abstract to me, but any better suggestions are totally welcome.

And while span would act like a light-weight view with no extra allocations,
this would allocate a *separate* buffer for each of these "inputs"?

I don't see a sane way to use span in your example. Most likely it'll end up with two integer variable on stack, which is what FDP would do as well.

But if you interpret the buffer like int32_t*[2] and pass the address to the target API, you may miss some bugs because of the way how instrumentation works. We had cases like this, when e.g. off-by-one bugs were not detected. Educating every fuzzer developer on that matter doesn't scale. Providing an API that solves the problem does :)

the terminology here means those are 2 separate inputs, right?

Yes! Can you think of a better name for that? I thought "argument" or "parameter" could work, but what if those values are used somehow else (e.g. a number of iterations) and not passed to any function explicitly, then they are not technically arguments or parameters :/ "input" sounds pretty abstract to me, but any better suggestions are totally welcome.

Aha, now it makes sense. Sub-buffers?

And while span would act like a light-weight view with no extra allocations,
this would allocate a *separate* buffer for each of these "inputs"?

I don't see a sane way to use span in your example. Most likely it'll end up with two integer variable on stack, which is what FDP would do as well.

But if you interpret the buffer like int32_t*[2]

That's what i meant, yes.

and pass the address to the target API, you may miss some bugs because of the way how instrumentation works. We had cases like this, when e.g. off-by-one bugs were not detected.

Yeah, that i understand first-hand (:

Educating every fuzzer developer on that matter doesn't scale. Providing an API that solves the problem does :)

Though this wrapper won't solve all the issues, this would need to be more or less consistently
used throught all the code being fuzzed, not just on the surface in just the fuzz target.
Which means it would need to be *always* available, not just when building with clang+fuzzer,
which means it would need to be located in every project's sources.
In other words, i guess not sure how putting it in clang helps, other than creating compiler lock-in.

Dor1s added a comment.Aug 2 2019, 10:04 AM

If this change gets accepted, any clang user should be able to use it via #include <fuzzer/FuzzedDataProvider.h>, rather than have it somewhere in their repo. It is not tied to -fsanitizer=fuzzer or any other compiler flags.

If this change gets accepted, any clang user should be able to use it via #include <fuzzer/FuzzedDataProvider.h>, rather than have it somewhere in their repo. It is not tied to -fsanitizer=fuzzer or any other compiler flags.

I think that misses the point, it's still a compiler lock-in - does that work with gcc?

Dor1s added a comment.Aug 2 2019, 10:10 AM

If this change gets accepted, any clang user should be able to use it via #include <fuzzer/FuzzedDataProvider.h>, rather than have it somewhere in their repo. It is not tied to -fsanitizer=fuzzer or any other compiler flags.

I think that misses the point, it's still a compiler lock-in - does that work with gcc?

I don't think gcc is a good choice for fuzzing :) That is also covered in the CL summary.

Dor1s added a comment.EditedAug 2 2019, 10:13 AM

But seriously, GCC users can just grab this header and use it -- that's the current state of the world, so this CL doesn't make life for GCC users harder.

If this change gets accepted, any clang user should be able to use it via #include <fuzzer/FuzzedDataProvider.h>, rather than have it somewhere in their repo. It is not tied to -fsanitizer=fuzzer or any other compiler flags.

I think that misses the point, it's still a compiler lock-in - does that work with gcc?

I don't think gcc is a good choice for fuzzing :)

...
Yep, i failed to convey the problem.
Any code that uses that header will get locked into compiler that provides said header,
unless they resort to bundling said header in the first place, which will be frowned upon.
Any code - at least the fuzzers, which somewhat contradicts the recommendations of oss-fuzz
for fuzz targets to be tested during *normal* builds. And really, this structure would need to
be consistently used through the code for best results,
in which case the entire codebase will get compiler-locked.

But i guess these concerns are completely insane/alien to everyone in nowadays monoculture world.

That is also covered in the CL summary.

Dor1s updated this revision to Diff 213172.Aug 2 2019, 8:36 PM

Keep the header at the old location as well for smooth migration.

Dor1s edited the summary of this revision. (Show Details)Aug 2 2019, 8:41 PM
morehouse accepted this revision.Aug 5 2019, 11:29 AM

Yep, i failed to convey the problem.
Any code that uses that header will get locked into compiler that provides said header,
unless they resort to bundling said header in the first place, which will be frowned upon.

Without this patch, any project that wants to use this header needs to add a vendored copy of FDP to their source repository. With this patch, non-clang will still have to do this, though projects that build with clang will not. To me, this seems very beneficial for every project on OSS-Fuzz (and any other project that uses libFuzzer), since they support building with clang anyway.

Any code - at least the fuzzers, which somewhat contradicts the recommendations of oss-fuzz
for fuzz targets to be tested during *normal* builds. And really, this structure would need to
be consistently used through the code for best results,
in which case the entire codebase will get compiler-locked.

Yes, the alternative is to use a vendored FDP. I agree it's ugly, but no uglier than is currently required.

I think this patch has more upside than downside. We strictly improve the case for projects using clang already, while leaving things the same for non-clang.

This revision is now accepted and ready to land.Aug 5 2019, 11:29 AM
Dor1s updated this revision to Diff 213442.Aug 5 2019, 12:53 PM

Rebase + re-run the tests locally

This revision was automatically updated to reflect the committed changes.

Without this patch, any project that wants to use this header needs to add a vendored copy of FDP to their source repository. With this patch, non-clang will still have to do this, though projects that build with clang will not.

Wasn't @lebedev.ri's point that projects might use multiple compilers? Projects that build with Clang might also build with other compilers. So what you call "non-clang" should probably be "not-exclusively-clang", and the only projects benefiting are those that build only with Clang.

It seems to me as if this header is compiler-independent, so why is it in Clang's builtin include directory (on Linux that's <prefix>/lib/clang/<version>/include) as opposed to the system include directory (on Linux, <prefix>/include)? The builtin include directory is supposed to contain headers that are "private" to Clang, because they contain non-standard C/C++ that is only guaranteed to be supported by Clang. (Basically stuff that starts with two underscores like builtins.)