Page MenuHomePhabricator

Entropic: Boosting LibFuzzer Performance
ClosedPublic

Authored by marcel on Jan 31 2020, 4:48 AM.

Details

Summary

This is collaboration between Marcel Boehme @ Monash, Australia and Valentin Manès plus Sang Kil Cha @ KAIST, South Korea.

We have made a few modifications to boost LibFuzzer performance by changing how weights are assigned to the seeds in the corpus. Essentially, seeds that reveal more "information" about globally rare features are assigned a higher weight. Our results on the Fuzzer Test Suite seem quite promising. In terms of bug finding, our Entropic patch usually finds the same errors much faster and in more runs. In terms of coverage, our version Entropic achieves the same coverage in less than half the time for the majority of subjects. For the lack of space, we shared more detailed performance results directly with @kcc. We'll publish the preprint with all the technical details as soon as it is accepted. Happy to share if you drop us an email.

There should be plenty of opportunities to optimise further. For instance, while Entropic achieves the same coverage in less than half the time, Entropic has a much lower #execs per second. We ran the perf-tool and found a few performance bottlenecks.

Thanks for open-sourcing LibFuzzer (and the entire LLVM Compiler Infrastructure)! This has been such a tremendous help to my research.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
marcel marked 3 inline comments as done.Feb 25 2020, 4:09 PM

Changes since last commit:

  • Better memory footprint
    • Global feature frequencies is a fixed size array with uint16_t-width elements: uint16_t GlobalFeatureFreqs[kFeatureSetSize] (instead of size_t-width)
    • Local feature frequencies is a sorted vector of pairs: Vector<std::pair<uint32_t,uint16_t>> FeatureFreqs (instead of unordered unordered_map<size_t,size_t>)
  • Added options to enable/disable entropic and pulled some constants out:
    • Option entropic (default: 0): Enables entropic power schedule.
    • Option considered_rare (default: 0xFF): If entropic is enabled, all features which are observed less often than the specified value are considered as rare.
    • Option topX_rarest_features (default: 100): If entropic is enabled, we keep track of the frequencies only for the Top-X least abundant features (union features that are considered as rare).
    • Option sparse_energy_updates (default: 10000): If entropic is enabled, the inverse value specifies the probability to update the corpus distribution even though no features or seeds were added or deleted. A larger value means less updates.
    • Constant kProbAgressiveSchedule (default: 80): Determines the probability to choose a more aggressive schedule (assigning zero weight to below average seeds).
  • Turned most occurrences of long double to double.
  • Better variable names and comments.
  • Removed some redundant code.
  • Fuzzer unit tests (Thanks @Valentin!)

The patch applies to LLVM-trunk. We tested these changes by repeating all our FTS experiments (20 runs).
The performance results look pretty much the same as with the other patch, which is good.

vitalybuka added inline comments.Feb 27 2020, 2:04 PM
compiler-rt/lib/fuzzer/FuzzerCorpus.h
42

it would be nice to manipulate this fields only with reasonable named methods

44

could you please initialize them?

445

It would be nice to rename variables in a such way that reader without background can understand what is going on.

453

please don't reuse variables like Y here
just declare as close as possible to first use, or even better with assignment

561

uint16_t GlobalFeatureFreqs[kFeatureSetSize] = {}; instead of memsets
it would be nice do to the same for other arrays here, but in a separate patch

compiler-rt/lib/fuzzer/FuzzerFlags.def
156

entropic -> focus_rare_features

Not sure how, it would be nice to rename sparse_energy_updates as something meaningful to libfuzzer user, to make it explain behavior change, not implementation details like now.

dgg5503 added a subscriber: dgg5503.Mar 9 2020, 1:56 PM
marcel updated this revision to Diff 250155.Mar 13 2020, 2:43 AM
marcel marked an inline comment as not done.

In this *third revision*, we accomodate Vitaly's comments:

  • Better naming (e.g., s/Sum_Y/SumIncidence/g)
  • Initialized variables on struct InputInfo
  • Locally, don't reuse variables but declare them anew.
  • Instead of memset'ing or global array, we declare uint16_t GlobalFeatureFreqs[kFeatureSetSize] = {};
  • We keep the entropic option, though. Hope this is okay.

In addition, we removed option "sparse_energy_updates" and made it a constant kSparseEnergyUpdates = 10000.

Also, good news: Entropic is coming in second @ Fuzzbench: https://www.fuzzbench.com/reports/2020-03-04/index.html

marcel marked 15 inline comments as done.Mar 18 2020, 5:45 PM

Ping

Dor1s added inline comments.Mar 18 2020, 9:59 PM
compiler-rt/lib/fuzzer/FuzzerCorpus.h
145

since this is a vector, I don't think we need to manually clear it in the destructor

196

nit: seems like log should be sufficient, as Energy is double, not long double

345

nit: I'd rather use auto or decltype(RareFeatures)::value_type to avoid type mismatch if we ever change RareFeatures definition and forget to change the type here

354

assuming this code gets executed quite often, and the order inside RareFeatures isn't important, we can avoid erase-remove and do something like:

RareFeatures[index_from_the_loop] = RareFeatures.back();
RareFeatures.resize(RareFeatures.size() - 1);

but the loop on line 269 would have to use index in the vector (from 1 to < RareFeatures.size()) instead of the iterator

feel free to ignore though, it's just a suggestion which may or may not be a good one :)

368

since we always do this, resize() in my suggestion above might not be even needed, but that's a minor thing

375

can this and the loop on the line 294 be combined?

376

please consider adding a comment why we skip inputs with zero energy

416

invert the condition to reduce indentation:

if (!II)
  return;
418

can do return after this line and avoid else with extra indentation below

418

what is 1 here? would it make sense to have a static const variable with a descriptive name, e.g. kDefaultSomething?

423

same point, what is 0? a constant with a descriptive name or (less preferred) comment would be really helpful for others who come to read the code in future

441

From the code below it seems like Energy represents entropy and the max value is 0, which we reduce depending on the actual feature frequencies. Is that correct understanding?

453

it seems like the Rand(kSparseEnergyUpdates) clause is applicable to the Entropic case only, is that correct? Do we really need it in the vanilla case?

481

could you please rewrite this in a more readable if-else form?

483

why 20? a constant with a descriptive name or a comment would be appreciated

495

so if there is at least one input that touches focus function, we will be always wasting time in this loop (starting on the line 472) and then falling back to the VanillaSchedule case? In such case I think we should just check Options.FocusFunction and use vanilla schedule if it's set, just because almost always there will be input(s) touching the focus function

compiler-rt/lib/fuzzer/FuzzerLoop.cpp
718

does it always need update, even when new coverage wasn't observed?

marcel marked 19 inline comments as done.Mar 19 2020, 5:08 AM

Preparing the next patch.

compiler-rt/lib/fuzzer/FuzzerCorpus.h
354

With the subsequent push_back (Line 292), do you mean a swap and pop_back here?

418

We are setting the frequency of Idx32 to 1. Adding a comment.

423

Zero (0) is the default value for lower_bound as binary search over a vector of pairs. In this case, any value would work.

441

Yes, we estimate the entropy over the probabilities of the features in the neighborhood of the seed. Entropy is positive. The maximum entropy is logl(GlobalNumberOfFeatures).

453

Yes. kSparseEnergyUpdates should apply only for Entropic.

compiler-rt/lib/fuzzer/FuzzerLoop.cpp
718

For II, the local feature frequencies have changed. So we schedule an update. However, it will only be updated when the distribution needs an update, and we do not set DistributionNeedsUpdate here.

marcel updated this revision to Diff 251365.Mar 19 2020, 6:38 AM
marcel marked 4 inline comments as done.

Changes in this fourth revision.

  • Error message + exit if parameters '--focus_function` and --entropic are used together.
  • Refactored a constant (kMaxMutationFactor)
  • Better code formatting
    • early return for better indentation,
    • more comments,
    • less ternaries,
vitalybuka added inline comments.Mar 20 2020, 12:43 AM
compiler-rt/lib/fuzzer/FuzzerCorpus.h
133

please make field private and add access method if it's needed outside

318

DeleteFeatureFreq -> InputInfo::DeleteFeatureFreq

340

uint32_t MostAbundantRareFeatureIdx[2] = {}
or
just
MostAbundantRareFeatureIdx1
MostAbundantRareFeatureIdx2

397

InputInfo::UpdateFeatureFrequency

448

InputInfo::UpdateEnergy

449

"long double" is still there?

compiler-rt/lib/fuzzer/FuzzerFlags.def
156

many of comments are marked as "Done" but I see no changes.

Dor1s added inline comments.Mar 23 2020, 11:29 PM
compiler-rt/lib/fuzzer/FuzzerCorpus.h
354

yes, swap and pop_back would have the same effect

441

sorry, I don't understand. Below are the code lines changing Energy value:

  II->Energy = 0.0;
  II->SumIncidence = 0;

  // Apply add-one smoothing to locally discovered features.
  for (auto F : II->FeatureFreqs) {
    size_t LocalIncidence = F.second + 1;
    Energy -= LocalIncidence * logl(LocalIncidence);
    SumIncidence += LocalIncidence;
  }

  <...>

  // Add a single locally abundant feature apply add-one smoothing.
  size_t AbdIncidence = II->NumExecutedMutations + 1;
  Energy -= AbdIncidence * logl(AbdIncidence);
  <...>

  // Normalize.
  if (SumIncidence != 0)
    Energy = (Energy / SumIncidence) + logl(SumIncidence);

  II->Energy = (double)Energy;
<...>
}

as I read this, I see that Energy should be negative in many cases?

marcel marked 11 inline comments as done.Apr 14 2020, 6:57 AM

Just completed a few tests with the revised patch in FuzzBench. Going to upload the revision soon.

compiler-rt/lib/fuzzer/FuzzerCorpus.h
318

Moved directly into the InputInfo struct.

354

Implemented your swap and pop_back idea. Cheers!

441

Sorry for the brevity. This is why II->Energy is positive. Entropy is computed as $-\sum_{i=1}^S p_i \log(p_i)$ where $p_i$ is the probability that fuzzing II generates an input that exercises feature $i$ and $S$ is the total number of rare features. We could estimate the probability $p_i$ as the proportion of generated inputs that exercise $i$, i.e., $\hat p_i = LocalIncidence_i / SumIncidence$. If you plug this proportion into the formula for entropy, you can compute entropy as $[-\sum_{i=1}^S LocalIncidence_i \log(LocalIncidence_i)] / SumIncidence + log(SumIncidence)$. While Energy is certainly negative before // Normalize., it is positive after.

Just drop me a DM. I'll send you the write up.

449

Yes, keeping maximum precision during the processing to minimize the cumulative FP arithmetic error, and downcast to double once the processing is done.

compiler-rt/lib/fuzzer/FuzzerFlags.def
156

Tried to address all comments either inline or in the summary. In this case, I wrote

  • We keep the entropic option, though. Hope this is okay.
marcel updated this revision to Diff 257332.Apr 14 2020, 7:52 AM
marcel marked 2 inline comments as done.

This is revision number 5. Thanks for all the great feedback!

  • Moved InputInfo functions (UpdateEnergy, UpdateFeatureFrequency, DeleteFeatureFreq) into InputInfo struct.
  • Made NumExecutedMutations private. Added public FuzzerCorpus::IncrementNumExecutedMutations()
  • Set kSparseEnergyUpdates to 100 (instead of 10000). Seems to give better results on some FuzzBench subjects.
  • Removed kProbAgressiveSchedule. Seems to give better results on some FuzzBench subjects.
  • Some cleanup in FuzzerCorpus::AddRareFeature(). Use swap and pop_back. Use arrary to maintain most and second-most abundant rare feature.
  • Been playing with LF's power schedule, and in preliminary experiments it seems that prioritizing faster seeds brings quite some performance gains.
  • Submitted PR to FuzzBench for evaluation: https://github.com/google/fuzzbench/pull/226

FuzzBench results for Entropic: https://www.fuzzbench.com/reports/2020-04-14/index.html

kcc added a comment.Apr 22 2020, 1:03 PM

Commenting on just to issues, not the hole patch.

compiler-rt/lib/fuzzer/FuzzerCorpus.h
36

this is new in the patch, is it?
While I completely understand why we'd want to use execution time as a signal for weights,
it makes fuzzing process non-reproducible with a given seed, which I consider pretty bad.
If we used 32- or 64- bit edge counters we could have substituted them for time, but alas, we use 8-bit ones.

70

I'm still worried about long double due to portability.
Do you actually "know" that it's critical to use long double here?

compiler-rt/lib/fuzzer/FuzzerLoop.cpp
683

for consistency, please use the C++ interface for getting current time (as elsewhere in the code).
But see above about my comment on time in gneral.

marcel marked 2 inline comments as done.Apr 22 2020, 4:40 PM
marcel added inline comments.
compiler-rt/lib/fuzzer/FuzzerCorpus.h
36

this is new in the patch, is it?

Yes. Been playing with a few smaller tweaks to boost LF performance.

While I completely understand why we'd want to use execution time as a signal for weights,
it makes fuzzing process non-reproducible with a given seed, which I consider pretty bad.

Do you mean LibFuzzer should be fully deterministic when you start it with the same seed corpus (e.g., by fixing *the* random seed)? Currently, even without this patch I've been observing quite some variance in the coverage achieved over time. Happy to take it out, though, if this messes with the LF design principles.

70

You are right. After fixing frequencies to uint16_t, this can definitely be a double.

kcc added a comment.Apr 22 2020, 5:14 PM

Please take out the time-related changes for now. If anything, extra changes make the code review process quadratic.

Yes, I expect that given the same seeds corpus and a fixed random seed (-seed=<N>) libFuzzer will produce the same mutations.
There is probably at least one case where this is not actually true today: ASLR (we use PCs to index into arrays, etc),
but other than that I'd expect LF to be deterministic.

Please also replace long double with double.
I'll try to make another review PASS ASAP.

marcel updated this revision to Diff 259448.Apr 22 2020, 7:03 PM

Revision 6: Removed time-based performance boost (would render LibFuzzer non-deterministic even if the random seed is fixed with -seed=<n>). Also, sed "s/long double/double/g".

kcc added a comment.May 1 2020, 4:58 PM

Sorry for the delay. Mostly naming/style nits left.

compiler-rt/lib/fuzzer/FuzzerCorpus.h
43

NeedsEnergyUpdate?

57

Lower
(use upper case)

107

Lower

130

I'd like to put these into a

struct EntropicOptions {
   bool Enabled;
   size_t EntropicFeatureFrequencyThreshold;
  ...
}

and pass such a struct (by value) to InputCorpus CTOR.
Of course, please match the names with those in Options

compiler-rt/lib/fuzzer/FuzzerOptions.h
47

I'd prefer if these names were more descriptive (ok if longer) and have "entropic" in it.
E.g.

Entropic
EntropicFeatureFrequencyThreshold
EntropicNumberOfRarestFeatures

of course, make the command line parameters match

marcel marked 5 inline comments as done.May 15 2020, 6:47 AM
marcel updated this revision to Diff 264229.May 15 2020, 6:59 AM

Thanks again everyone for your time to review. Learned a lot!

Revision 7. Created struct EntropicOptions which lives in the FuzzerCorpus.h header. The EntropicOptions are passed by value from FuzzerDriver (which handles FuzzerOptions) when constructing the InputCorpus. Renamed option parameters as per KCC's suggestions. As always, formatted with clang-format.

Let me know if there is anything else, I can do.

kcc added inline comments.May 15 2020, 2:14 PM
compiler-rt/lib/fuzzer/FuzzerCorpus.h
129

here and below: remove 'struct'.

compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp
595

When running 'ninja check-fuzzer' this test fails for me:

That's weird: why does the functionality change with Entropic off?

[5/9] Running Fuzzer unit tests
FAIL: LLVMFuzzer-Unittest :: ./Fuzzer-x86_64-Test/Corpus.Distribution (48 of 55)

  • TEST 'LLVMFuzzer-Unittest :: ./Fuzzer-x86_64-Test/Corpus.Distribution' FAILED ****

Note: Google Test filter = Corpus.Distribution
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Corpus
[ RUN ] Corpus.Distribution
/usr/local/google/home/kcc/llvm-project/compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp:609: Failure
Expected: (Hist[i]) > (TriesPerUnit / N / 3), actual: 0 vs 2184
/usr/local/google/home/kcc/llvm-project/compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp:609: Failure
Expected: (Hist[i]) > (TriesPerUnit / N / 3), actual: 0 vs 2184
/usr/local/google/home/kcc/llvm-project/compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp:609: Failure

1099

Will a simpler syntax work, e.g.:

{1,3},
{2,3}
marcel marked 4 inline comments as done.May 15 2020, 9:38 PM
marcel added inline comments.
compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp
595

Instead of updating the corpus distribution every time it changes (e.g., FuzzerCorpus.h#L114 and FuzzerCorpus.h#L165), entropic schedules that update by setting a flag. For efficiency, only when (and just before) a new input is chosen, the corpus distribution is actually updated. I had this update in ChooseUnitToMutate which calls ChooseUnitIdxToMutate. Now moved the call to UpdateCorpusDistribution to ChooseUnitIdxToMutate (which is used by the test case).

All 40 fuzzer unit tests pass.

marcel updated this revision to Diff 264415.May 15 2020, 9:43 PM
marcel marked an inline comment as done.

Revision 8. Moved the call to UpdateCorpusDistribution from ChooseUnitToMutate to ChooseUnitIdxToMutate and removed some nits.

make -j 32 FuzzerUnitTests && \
projects/compiler-rt/lib/fuzzer/tests/Fuzzer-x86_64-Test

returns with

[==========] 40 tests from 9 test cases ran. (408 ms total)
[  PASSED  ] 40 tests.
marcel updated this revision to Diff 264416.May 15 2020, 9:49 PM

Minor. Removed one line I used for debugging..

kcc accepted this revision.May 18 2020, 11:47 AM

Thanks for this work, and the effort to make the code better!

This revision is now accepted and ready to land.May 18 2020, 11:47 AM
kcc added a comment.May 18 2020, 12:08 PM

(let me land it)

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2020, 10:56 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript