This is an archive of the discontinued LLVM Phabricator instance.

[libFuzzer] Make -merge=1 to reuse coverage information from the control file.
ClosedPublic

Authored by Dor1s on Aug 12 2019, 1:06 PM.

Details

Summary

This change allows to perform corpus merging in two steps. This is useful when
the user wants to address the following two points simultaneously:

  1. Get trustworthy incremental stats for the coverage and corpus size changes when adding new corpus units.
  2. Make sure the shorter units will be preferred when two or more units give the same unique signal (equivalent to the REDUCE logic).

This solution was brainstormed together with @kcc, hopefully it looks good to
the other people too. The proposed use case scenario:

  1. We have a fuzz_target binary and existing_corpus directory.
  2. We do fuzzing and write new units into the new_corpus directory.
  3. We want to merge the new corpus into the existing corpus and satisfy the points mentioned above.
  4. We create an empty directory merged_corpus and run the first merge step:

    ./fuzz_target -merge=1 -merge_control_file=MCF ./merged_corpus ./existing_corpus

    this provides the initial stats for existing_corpus, e.g. from the output:

    MERGE-OUTER: 3 new files with 11 new features added; 11 new coverage edges
  1. We recreate merged_corpus directory and run the second merge step:

    ./fuzz_target -merge=1 -merge_control_file=MCF ./merged_corpus ./existing_corpus ./new_corpus

    this provides the final stats for the merged corpus, e.g. from the output:

    MERGE-OUTER: 6 new files with 14 new features added; 14 new coverage edges

Alternative solutions to this approach are:

A) Store precise coverage information for every unit (not only unique signal).
B) Execute the same two steps without reusing the control file.

Either of these would be suboptimal as it would impose an extra disk or CPU load
respectively, which is bad given the quadratic complexity in the worst case.

Tested on Linux, Mac, Windows.

Diff Detail

Event Timeline

Dor1s created this revision.Aug 12 2019, 1:06 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 12 2019, 1:06 PM
Herald added subscribers: Restricted Project, mgrang, delcypher. · View Herald Transcript
Dor1s updated this revision to Diff 214697.Aug 12 2019, 1:07 PM

fix a typo

Dor1s added a comment.Aug 12 2019, 1:10 PM

Hi everyone,

This CL is a proof-of-concept to start the discussion. If you approve the approach and/or suggest any improvements, I'll go ahead and polish it + update the tests. Please take a look :)

Context: https://github.com/google/clusterfuzz/pull/815#issuecomment-520087538

Harbormaster completed remote builds in B36617: Diff 214697.
Dor1s updated this revision to Diff 214705.Aug 12 2019, 2:23 PM

Actually updated the test to prove everyone (including myself) that this works.

Dor1s edited the summary of this revision. (Show Details)Aug 12 2019, 2:31 PM
Dor1s edited the summary of this revision. (Show Details)Aug 12 2019, 2:50 PM
Dor1s edited the summary of this revision. (Show Details)
Dor1s updated this revision to Diff 214714.Aug 12 2019, 2:52 PM

Update the test a bit more

Dor1s added a comment.Aug 13 2019, 3:47 PM

Friendly ping :) Feedback on the description would be the most important at this point, as I feel like I can improve the code a bit more. But if you could check out the code, that would be also great. Note there are at least two TODOs that I'll address before merging. It's still a draft, even though it works.

A few high-level thoughts/questions:

  • What do you plan to use the improved merge stats for?
  • With this change, -merge seems to no longer mean we're "merging" the input directories into the output directory, since existing files in the output can be deleted.
  • The change seems a bit complex. I wonder if there's a way to use the existing "greedy selection" approach to make things simpler. Maybe if we combine the output corpus SizedFile vector into the input corpus SizedFile vector before sorting and writing the initial control file.

Thanks for looking into making this change. It should be very useful for CF.
I'll try to take a look again tomorrow morning with fresh eyes.

lib/fuzzer/FuzzerDefs.h
23 ↗(On Diff #214714)

Nit: before <vector>.

lib/fuzzer/FuzzerDriver.cpp
499 ↗(On Diff #214714)

I'm a bit confused why WriteToOutputCorpus isn't going to truncate tescases (which i'm pretty sure we don't want).
I know your CL didn't introduce this, but do you know if this is the case?

lib/fuzzer/FuzzerFork.cpp
224 ↗(On Diff #214714)

How about doing the FilesToReplace loop inside of CrashResistantMerge instead so it doesn't need to be repeated and so that CrashResistantMerge doesn't need to be passed more arguments.

lib/fuzzer/FuzzerMerge.cpp
194 ↗(On Diff #214714)

I'm not sure this approach with signatures works the same way reduction works within a fuzzing run (ie: the way I think we probably want). I'll have to think about this more by tomorrow, since I haven't 100% thought this through and the code dealing with this during fuzzing is tricky.

As I understand it, this implementation it only replaces initial corpus elements with new ones that are smaller and are exactly the same in coverage.
However, during fuzzing corpus elements are replaced if a new corpus element is found that is smaller and covers the features for which the initial element is the smallest testcase, i.e. testcases can be replaced by smaller testcases that aren't exactly equivalent.

Example:
Unit A is the smallest element/unit covering feature X. A only covers feature X and no other feature.
Then we find Unit B which is smaller than Unit A. B covers features X and is the only (or smallest) unit that covers Y.
During fuzzing we will delete A and add B since B is now the smallest unit covering X.
Here, if A were an initial testcase (primary/output corpus) and B a new one (secondary/input corpus), I think we don't delete A.

Is my example correct?

If so then there may be a problem if there is an initial Unit C that is already covering Y, I don't think B will get added to the merged corpus, since it isn't a pure reduction of any testcase.

Dor1s marked 3 inline comments as done.Aug 14 2019, 7:31 AM

Guys, thanks a lot for the feedback! Some answers below, I'll get back to the code soon.

A few high-level thoughts/questions:

  • What do you plan to use the improved merge stats for?

In various ClusterFuzz instances we're tracking the stats for every fuzzer run. It's important as we jungle a bunch of different so-called strategies, and now we have an automated logic that auto-assigns probabilities to the strategies based on the stats. Basically, we must have a trustworthy stats w.r.t. new coverage / corpus changes, and relying on libFuzzer's merge is our best call. Otherwise, we do a lot of hacky parsing and still have problems with stats in certain cases :(

Outside of CF, I think it's a reasonable feature when a user merges two or more corpuses, they would see whether that gave them any new coverage, etc.

  • With this change, -merge seems to no longer mean we're "merging" the input directories into the output directory, since existing files in the output can be deleted.

That's true... However, the -merge is also supposed to be used for corpus minimization, and in this case it'll be doing even a better job.

  • The change seems a bit complex. I wonder if there's a way to use the existing "greedy selection" approach to make things simpler. Maybe if we combine the output corpus SizedFile vector into the input corpus SizedFile vector before sorting and writing the initial control file.

I didn't think it was possible as my perception of the REDUCE logic seemed to be wrong. Now, with your and Jonathan's comments, it feels like this should be possible to achieve through the existing greedy selection. Even if I have to loop through some files twice (not sure if that'll be needed, just speculating), that should be feasible, as it all happens in memory and doesn't invoke the target function.

lib/fuzzer/FuzzerDriver.cpp
499 ↗(On Diff #214714)

Not sure I understood what exactly you meant:

  • If you're asking about the contents of the unit (in memory), then I think it's not getting truncated here as the MaxLen has been applied earlier as well (e.g. when reading the inputs)
  • If you're asking about truncating the actual corpus units on disk, I think it doesn't happen because this function calculates the filename based on the contents (through hash), i.e. a truncated file contents would be given a different filename
lib/fuzzer/FuzzerFork.cpp
224 ↗(On Diff #214714)

Hm, sounds like an option. Thanks!

lib/fuzzer/FuzzerMerge.cpp
194 ↗(On Diff #214714)

Wow, that's a great point! I was thinking that the units should give exactly the same coverage. If that's not the case, I should be able to achieve what we need without introducing the signature stuff, and it'll likely look better :) Thanks for pointing this out!

kcc added a comment.Aug 16 2019, 12:48 PM

I would prefer to not introduce this complexity.
For periodic pruning we can use an empty dir, like you describe.
For stats, we can use the overal corpus size (in bytes and in files)

Dor1s added a comment.Aug 16 2019, 1:29 PM
In D66107#1633546, @kcc wrote:

I would prefer to not introduce this complexity.
For periodic pruning we can use an empty dir, like you describe.
For stats, we can use the overal corpus size (in bytes and in files)

Sorry, i didn't get a chance to re-write this in a better way yet.

The problem with an empty dir is that we don't have stats for the existing corpus. In order to get those, we'd need to do an extra ./fuzzer -runs=0 ... execution for the current working corpus. And of course parse the logs yet again, calculate the difference, etc.

It is not necessary in some cases, but whenever we use corpus subset strategy or an arbitrary -max_len value, we do not get the correct information about the current coverage. Value profiling strategy is another trouble maker if we continue to calculate coverage on the user side.

Dor1s updated this revision to Diff 218523.Sep 3 2019, 12:54 PM

Implement another solution brainstromed with kcc@

Dor1s retitled this revision from [libFuzzer] Improve -merge= process to account for REDUCED corpus units. to [libFuzzer] Make -merge=1 to reuse coverage information from the control file..Sep 3 2019, 1:58 PM
Dor1s edited the summary of this revision. (Show Details)
Dor1s edited the summary of this revision. (Show Details)Sep 3 2019, 1:58 PM
Dor1s added a comment.Sep 3 2019, 2:00 PM

Hey @morehouse and @metzman,

This is a new approach that @kcc and I agreed on (Kostya hasn't seen the implementation yet though).

The main idea is that we run the merge twice (without and with the new corpus). To avoid wasting cycles, coverage information is re-used when the control file is provided by the user (and it's possible to reuse it).

Please take a look.

We recreate merged_corpus directory and run the second merge step:

./fuzz_target -merge=1 -merge_control_file=MCF ./new_corpus ./existing_corpus ./new_corpus

Should this be ./fuzz_target -merge=1 -merge_control_file=MCF ./merged_corpus ./existing_corpus ./new_corpus?

Dor1s added a comment.Sep 6 2019, 6:46 AM

We recreate merged_corpus directory and run the second merge step:

./fuzz_target -merge=1 -merge_control_file=MCF ./new_corpus ./existing_corpus ./new_corpus

Should this be ./fuzz_target -merge=1 -merge_control_file=MCF ./merged_corpus ./existing_corpus ./new_corpus?

Oh, right :( my bad, fixing the description now!

Dor1s edited the summary of this revision. (Show Details)Sep 6 2019, 6:46 AM
kcc added a comment.Sep 10 2019, 12:39 PM

LGTM with one nit, also asking Matt for a second review.

lib/fuzzer/FuzzerDefs.h
196 ↗(On Diff #218523)

any reason to do this?
we had the special types for Vector/Set due to the problem which we don't have anymore
(we solved it by using a privatized STL)

Dor1s marked an inline comment as done.Sep 10 2019, 1:03 PM
Dor1s added inline comments.
lib/fuzzer/FuzzerDefs.h
196 ↗(On Diff #218523)

Not really, I just thought I would follow that pattern of pre-defining STL container types using fuzzer_allocator. I'm fine to get rid of it, if you prefer.

Dor1s marked an inline comment as done.Sep 10 2019, 3:04 PM
Dor1s added inline comments.
lib/fuzzer/FuzzerDefs.h
196 ↗(On Diff #218523)

Discussed offline. I'll remove this change after Matt's review.

morehouse accepted this revision.Sep 10 2019, 3:14 PM
morehouse added inline comments.
lib/fuzzer/FuzzerDefs.h
196 ↗(On Diff #218523)

Please remove it.

lib/fuzzer/FuzzerMerge.cpp
332 ↗(On Diff #218523)

Nit: let's omit the else for less nesting since the if short-circuits

This revision is now accepted and ready to land.Sep 10 2019, 3:14 PM
Dor1s updated this revision to Diff 219707.Sep 11 2019, 7:08 AM
Dor1s marked 5 inline comments as done.

Address review comments

Dor1s added inline comments.Sep 11 2019, 7:09 AM
lib/fuzzer/FuzzerDefs.h
196 ↗(On Diff #218523)

Done

lib/fuzzer/FuzzerMerge.cpp
332 ↗(On Diff #218523)

Done

This revision was automatically updated to reflect the committed changes.

The new test is failing for me because CHECK1 is not satisfied. Instead the line says MERGE-OUTER: 3 new files with 12 new features added; 11 new coverage edges (instead of 11 new features). I'm currently investigating what's wrong here, let me know if you have an idea.

The new test is failing for me because CHECK1 is not satisfied. Instead the line says MERGE-OUTER: 3 new files with 12 new features added; 11 new coverage edges (instead of 11 new features). I'm currently investigating what's wrong here, let me know if you have an idea.

Thanks for letting me know. I'm trying on the latest ToT now. Maybe some instrumentation bits changed while I was working on this.

Hm, doesn't fail for me, but I guess the feature detection might be platform-dependent to some extent, so I'm fine with replacing the number of the features with a regex. Do you want to upload a change, or should I?

Hm, doesn't fail for me, but I guess the feature detection might be platform-dependent to some extent, so I'm fine with replacing the number of the features with a regex. Do you want to upload a change, or should I?

I think the bots are also green, so it might be just related to how I build Clang (with libc++, for example). I'm half way through building ToT with GCC, that should give insight whether it's related to my system or my configuration.

Just curious: So the number of new features and coverage edges does not need to be the same? I also have "15 new freatures" for CHECK2.

Hm, doesn't fail for me, but I guess the feature detection might be platform-dependent to some extent, so I'm fine with replacing the number of the features with a regex. Do you want to upload a change, or should I?

I think the bots are also green, so it might be just related to how I build Clang (with libc++, for example). I'm half way through building ToT with GCC, that should give insight whether it's related to my system or my configuration.

Actually, some ARM bots got broken, so I've uploaded a fix: https://reviews.llvm.org/D67458

Just curious: So the number of new features and coverage edges does not need to be the same? I also have "15 new freatures" for CHECK2.

Yes, features include coverage edges + additional signal (e.g. value profiling), that's why the number is different and why number of features can differ between platforms.

I think the bots are also green, so it might be just related to how I build Clang (with libc++, for example). I'm half way through building ToT with GCC, that should give insight whether it's related to my system or my configuration.

Actually, some ARM bots got broken, so I've uploaded a fix: https://reviews.llvm.org/D67458

Just curious: So the number of new features and coverage edges does not need to be the same? I also have "15 new freatures" for CHECK2.

Yes, features include coverage edges + additional signal (e.g. value profiling), that's why the number is different and why number of features can differ between platforms.

Thanks for the explanation and the fix :)