This is an archive of the discontinued LLVM Phabricator instance.

Greedy set cover implementation of `Merger::Merge`
ClosedPublic

Authored by arisKoutsou on Jul 1 2021, 7:33 AM.

Details

Summary

Extend the existing single-pass algorithm for Merger::Merge with an algorithm that gives better results. This new implementation can be used with a new set_cover_merge=1 flag.

This greedy set cover implementation gives a substantially smaller final corpus (40%-80% less testcases) while preserving the same features/coverage. At the same time, the execution time penalty is not that significant (+50% for ~1M corpus files and far less for smaller corpora). These results were obtained by comparing several targets with varying size corpora.

Change Merger::CrashResistantMergeInternalStep to collect all features from each file and not just unique ones. This is needed for the set cover algorithm to work correctly. The implementation of the algorithm in Merger::SetCoverMerge uses a bitvector to store features that are covered by a file while performing the pass. Collisions while indexing the bitvector are ignored similarly to the fuzzer.

Diff Detail

Event Timeline

arisKoutsou created this revision.Jul 1 2021, 7:33 AM
arisKoutsou requested review of this revision.Jul 1 2021, 7:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 1 2021, 7:33 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
kcc added a comment.Jul 12 2021, 12:43 PM

Thanks for the change!
Indeed, the current single-pass merge is far from perfect, and it's nice to see your numbers.

Some high-level notes:

  • not sure about the flag syntax (-merge=2), I'd prefer something more descriptive. Maybe add an extra flag -set_cover_merge=1?
  • The current single-pass algorithm is also greedy (just less so :)), so maybe use a different name to distinguish the new algorithm
  • please try to follow the coding style e.g. with respect to {} around single-statement blocks.
  • please add a unit test and a .lit test

I'll let Matt make another round of review.

I understand that this merge algorithm decreases the number of inputs by taking the most feature-rich inputs first. Does this lead to larger average input sizes in the merged corpus?

And what's the effect on total corpus bytes (du -hs)?

compiler-rt/lib/fuzzer/FuzzerMerge.cpp
198

Please document the merge algorithm in a function comment here.

233

Given the multiple passes over Remaining, is this sort useful anymore?

317

Would a set be a better data structure for Remaining? Then we wouldn't need to do a linear lookup on every erase.

arisKoutsou edited the summary of this revision. (Show Details)Aug 23 2021, 7:24 AM

Changes:

  • Renamed all occurences of MergeGreedy with SetCoverMerge.
  • Added a new flag called set_cover_merge. Defaults to 0, when 1 the new set_cover_merge implementation is used to merge corpora instead of the standard merge.
  • Some code-style changes.
  • Added a unit test, based off the Merger::Merge unit test.
  • Added a lit test based on the merge.test. Results of merge and set_cover_merge are different in some cases, as expected.
  • Changed the Remaining variable in SetCoverMerge() from std::vector to a std::set.
  • Addressed some inaccuracies in the algorithm, mostly regarding the first corpus (consider features in the first corpus as already covered, just like merge=1 works).
  • Removed the initial sorting of files on size/features. Instead, for files with the same number of features, break ties by choosing the smaller one in size.
arisKoutsou updated this revision to Diff 368107.EditedAug 23 2021, 7:47 AM
arisKoutsou edited the summary of this revision. (Show Details)

Does this lead to larger average input sizes in the merged corpus?

@morehouse It will lead to a larger average size, since we are preferring files with many features (which probably are larger in many cases). On the other hand, since the final corpus size (in files) is smaller, the sum of the file sizes in bytes may end up less than with the existing merge algorithm.

For example, consider minimizing a 70K testcase corpus for a harness of the jq target. The following workflow is performed in a container (image: https://hub.docker.com/r/ariskoutsou/jq-libfuzzer) containing the updated -set_cover_merge=1 implementation along with the target harness:

root@dd166043164c:/src/harness# mkdir 0
root@dd166043164c:/src/harness# ./fuzzer_jq -detect_leaks=0 -close_fd_mask=3 -set_cover_merge=1 0 corpus
...
MERGE-OUTER: the control file has 48616105 bytes
MERGE-OUTER: consumed 34Mb (185Mb rss) to parse the control file
MERGE-OUTER: 83 new files with 3174 new features added; 1116 new coverage edges
...
root@dd166043164c:/src/harness# find 0 -ls | awk '{sum += $7; n++;} END {print sum/n;}'
1068.81
root@dd166043164c:/src/harness# du -sh 0
356K    0
root@dd166043164c:/src/harness# rm 0/*
root@dd166043164c:/src/harness# ./fuzzer_jq -detect_leaks=0 -close_fd_mask=3 -merge=1 0 corpus
...
MERGE-OUTER: the control file has 4184437 bytes
MERGE-OUTER: consumed 5Mb (94Mb rss) to parse the control file
MERGE-OUTER: 161 new files with 3174 new features added; 1293 new coverage edges
...
root@dd166043164c:/src/harness# find 0 -ls | awk '{sum += $7; n++;} END {p
rint sum/n;}'
675.296
root@dd166043164c:/src/harness# du -sh 0/
676K    0/

Here, even though the average file size is quite bigger ~1068 bytes over ~675 bytes, the total corpus size is smaller. With the set_cover_merge flag the minimized corpus size is 356K, while with merge it is 676K.

arisKoutsou updated this revision to Diff 368133.EditedAug 23 2021, 9:51 AM
  • Change .lit test input to better illustrate the scenario where 2 files cover all features.
morehouse added inline comments.Aug 25 2021, 12:45 PM
compiler-rt/lib/fuzzer/FuzzerFork.cpp
324
compiler-rt/lib/fuzzer/FuzzerMerge.cpp
358

Could this continue cause an infinite loop? i.e. when Remaining is empty

compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp
953

IIUC, this test doesn't quite do what's intended.

We choose input C because it has the most unique features, but after that A only has {0} as a unique feature while B has {4, 5}. So we do in fact choose B, but not because it is smaller.

compiler-rt/test/fuzzer/set_cover_merge.test
49

I think we would get the same results with -merge. Perhaps we should make some feature-poor inputs smaller so that -merge would pick those first, while -set_cover_merge picks the feature-rich ones.

  • Add comments for argument names when passing argument values.
  • Change AllFeatures. Calculate all unique features by considering Feature % kFeatureSetSize as the feature value.
  • Change continue to break statement in main loop. Add assertions to highlight the condition that exits the loop.
  • Remove checking for feature-less files since we are not removing features from any MergeFileInfo objects.
  • Update tests, add testcase for feature collision on the bitvector.
compiler-rt/lib/fuzzer/FuzzerMerge.cpp
358

It wouldn't cause an infinite loop when Remaining is empty because the while condition (CoveredSize != AllFeatures.size()) would be false. On the other hand, I noticed that there would be a problem if a feature had a value greater than 1 << 21, which is the size of the bitvector. In that case, there could be an infinite loop because the while condition would never become false. I am addressing this problem in the latest patch.

compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp
953

Correct, I will adjust the features correctly so that we can test breaking ties between files with equal number of features.

compiler-rt/test/fuzzer/set_cover_merge.test
49

Should I also include a test of -merge=1 in this source file to highlight the difference in behavior?

morehouse added inline comments.Sep 3 2021, 1:58 PM
compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp
953

The comment above is confusing me now. I *think* what's happening is that B is picked first (over A since B is smaller), leaving A with {0, 3} unique features and C with {1, 2, 4} unique features. Then C gets picked since it has more features.

We still end up with C and B in the set, but the comment's explanation of how this happens is wrong.

compiler-rt/test/fuzzer/set_cover_merge.test
49

Yes, please do.

  • Add -merge=1 test in set_cover_merge.test for comparison with -set_cover_merge=1.
compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp
953

As far as I can see here , the format of the control file is:

FT <current_file_index> <feature_1> <feature_2> ... <feature_n>

So, in our case, the feature sets are:

A = {3, 5, 6}
B = {4, 5, 6}
C = {1, 2, 3, 4}
D = {1}

Since the set cover algorithm chooses the set that covers the maximum number of previously uncovered features, the file that is chosen in the first iteration is C. This is because Covered is empty and C has 4 features while all the other files have less features. In the next iteration of the algorithm A's uncovered features are {5, 6} and B's uncovered features are {5, 6} so we break the tie by selecting the smallest which is B. Finally, all features are covered so we can exit.

morehouse accepted this revision.Sep 7 2021, 8:47 AM

LGTM

compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp
953

Ah, yep you're right. Thanks for explaining.

This revision is now accepted and ready to land.Sep 7 2021, 8:47 AM
  • Fix set_cover_merge.test to not produce flaky results based on file listing order.
This revision was landed with ongoing or failed builds.Sep 7 2021, 9:43 AM
This revision was automatically updated to reflect the committed changes.