This is an archive of the discontinued LLVM Phabricator instance.


Differential D93879

Add LLVMFuzzerAddToDictionary
Needs Review · Public

Authored by IanPudney on Dec 28 2020, 5:22 PM.

Details

Reviewers

morehouse
Dor1s
kcc

Summary

Add support for LLVMFuzzerAddToDictionary, which, when called from a fuzzer,
injects words into the user dictionary at runtime. Fuzzers can use this
when they have fuzzer-specific knowledge of words that might help;
in my use case, it adds words that satisfy regular expressions
encountered during fuzzing.
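
As a rough illustration of how a fuzz target might call the proposed function, here is a minimal sketch. The signature is taken from the diff in this revision; the stub definition, the OnInterestingToken helper and the token "BEGIN-2020" are hypothetical and exist only so the snippet builds without libFuzzer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Stand-in for the function proposed in this revision. In a real build the
// definition would come from libFuzzer; this stub only records the words so
// that the sketch compiles and runs by itself.
static std::vector<std::string> RecordedWords;
extern "C" void LLVMFuzzerAddToDictionary(const uint8_t *Data, size_t Size) {
  RecordedWords.emplace_back(reinterpret_cast<const char *>(Data), Size);
}

// Hypothetical helper: called when the target learns of a token that an
// input must contain to get past some check.
static void OnInterestingToken(const char *Token) {
  LLVMFuzzerAddToDictionary(reinterpret_cast<const uint8_t *>(Token),
                            std::strlen(Token));
}

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  // Register the token once, the first time the target runs.
  static bool Registered = false;
  if (!Registered) {
    OnInterestingToken("BEGIN-2020");
    Registered = true;
  }
  (void)Data;
  (void)Size;
  return 0;
}
```

The one-time registration guard is a choice made for the sketch; the proposed function also skips words that are already present.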

To accomplish this, MutationDispatcher no longer owns the
ManualDictionary; it takes a reference to it instead. Outside of tests,
this reference points to a global Dictionary instance that
LLVMFuzzerAddToDictionary also writes to. The logic for adding to the
ManualDictionary is pulled out of MutationDispatcher.
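
The ownership change can be pictured with a small model. This is only a sketch under simplifying assumptions: ToyDictionary, ToyDispatcher and ToyAddToDictionary are hypothetical stand-ins, not the real Dictionary or MutationDispatcher types.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for the dictionary type.
using ToyDictionary = std::vector<std::string>;

// The dispatcher borrows the dictionary instead of owning a copy, so words
// added later through another path are visible to it.
class ToyDispatcher {
public:
  explicit ToyDispatcher(ToyDictionary &Manual) : ManualDictionary(Manual) {}
  size_t NumManualWords() const { return ManualDictionary.size(); }

private:
  ToyDictionary &ManualDictionary; // Borrowed, not owned.
};

// Global instance shared by the dispatcher and the C-style entry point.
static ToyDictionary GlobalManualDictionary;

void ToyAddToDictionary(const std::string &W) {
  GlobalManualDictionary.push_back(W);
}
```

Because the dispatcher holds a reference, a word added after the dispatcher is constructed is still seen by it, which is what a runtime API needs.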

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

IanPudney requested review of this revision. (Dec 28 2020, 5:22 PM)

IanPudney created this revision.

Herald added a project: Restricted Project. (Dec 28 2020, 5:22 PM)

Herald added a subscriber: Restricted Project.

alex added a subscriber: alex. (Dec 28 2020, 5:28 PM)

Harbormaster completed remote builds in B83617: Diff 313915. (Dec 28 2020, 5:55 PM)

If this change moves forward, it will also need a test that uses LLVMFuzzerAddToDictionary. However, I'm personally not convinced by the use case.

When interesting sequences are generated during fuzzing, libFuzzer already leverages them through at least the following two mechanisms:

Inputs that trigger new coverage signals are added to the corpus, i.e. such byte sequences are present in the inputs used for future mutations.
A recommended dictionary is auto-generated and accumulates sequences that appear to be good dictionary entries.

If you run your fuzz targets on ClusterFuzz, it automatically takes care of parsing the recommended dictionary, re-testing its entries (using libFuzzer's analyze_dict feature) and preserving really useful elements for future use.

With this in mind, the CL looks to me like an attempt to re-implement some of the logic that is already being handled.

compiler-rt/lib/fuzzer/FuzzerDriver.cpp, line 927: Is this logic really needed? It looks like a footgun to me in case people misuse this API.

I'll tell you more about my use-case with Atheris. In Python, it's very common for control flow to be decided by regular expressions. However, regular expression matching is implemented inside of CPython, in the _sre.c module. This means that unless CPython itself is compiled with coverage, re.match() appears as an atomic operation that libFuzzer has no insight into.

Compiling CPython is actually pretty complex. Furthermore, because it introduces so many coverage symbols, performance ends up limited to 2000 execs/sec (on my very powerful machine) if CPython is compiled with coverage. This means the user has to jump through even more hoops to compile just the _sre module with coverage, if they want regular expression coverage support.

This change helps me solve that problem. Whenever Atheris encounters a regular expression in Python, it can insert into the dictionary a string that matches that expression. This immediately makes libFuzzer able to progress past the regular expression. It's very effective.

For more background, see https://github.com/google/atheris/issues/5.

For what it's worth, it's not clear to me that you'd get good results even if you _did_ compile _sre.c with fuzzer-no-link. The regexp engine is effectively an interpreter, which is probably the worst case for coverage-guided fuzzing -- essentially, the program counter and branches have a low correspondence with semantics. For example, when matching the regexp ab, you'd have two MATCH_CHAR opcodes, but they'd be backed by a single C function, so you wouldn't get different coverage for one matching versus the other.

This is basically the same reason why you can't fuzz Python code by instrumenting CPython -- you get full coverage of the interpreter eval loop, which tells you nothing about the coverage of Python code itself.

This is all by way of saying that the current coverage instrumentation alone will probably never be sufficient here: for code that uses regular expressions in control flow, you'll always need special handling.
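
The point about interpreters can be shown with a toy model. This is a sketch, not CPython's _sre engine: the Op struct, the branch ids and the CoverageFor helper are all hypothetical, and branch coverage is modeled as a set of integers.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model of edge coverage: each branch taken is recorded by an id.
static std::set<int> CoveredBranches;

// A single MATCH_CHAR-style opcode.
struct Op {
  char Expected;
};

// One function backs every opcode, so the same two branch ids are hit no
// matter which opcode of the program matched or failed.
static bool MatchChar(const Op &O, char C) {
  if (C == O.Expected) {
    CoveredBranches.insert(1); // "matched" branch
    return true;
  }
  CoveredBranches.insert(2); // "mismatched" branch
  return false;
}

// Runs the program over the input and returns the branches that were hit.
static std::set<int> CoverageFor(const std::vector<Op> &Program,
                                 const std::string &Input) {
  CoveredBranches.clear();
  for (size_t I = 0; I < Program.size() && I < Input.size(); ++I)
    if (!MatchChar(Program[I], Input[I]))
      break;
  return CoveredBranches;
}
```

In this model, for the pattern abc, the inputs "aXX" and "abX" hit exactly the same branches even though the second is one character closer to a full match, so coverage gives the fuzzer no hint that it made progress.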

Thanks for the context. If I understand correctly, the actual underlying goal is to pass an additional coverage signal to the fuzzing engine. If there is a way to achieve that without extending libFuzzer's API, would that suffice?

I believe that would be a better option, as it would expand the corpus, which is accumulated over time and can be reused across different runs, in contrast to the in-memory dictionary we're expanding here.

Dor1s, what do you suggest? I haven't been able to find a good way to pass this information to libFuzzer without extending the API. The best we came up with was to simulate a memcmp(), but it didn't seem to work very well.

I am reluctant to extend the public interface in ways that
a) are likely to be useful for only a few cases
b) are likely to remain libFuzzer-specific
c) duplicate existing functionality that can be used instead. I mean the existing -dict flag (though it's not exactly what you describe).

The public interface should remain maximally engine-agnostic.

Maybe you can find a solution for your specific case using an existing mechanism?
Did you try using the extra counters somehow?
https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/fuzzer/TableLookupTest.cpp
We basically need to detect a situation where the behavior is interesting and let libFuzzer know via __libfuzzer_extra_counters.
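
A minimal sketch of the extra-counters idea follows, modeled on the section attribute used in the linked TableLookupTest.cpp. The counter count, the ReportRegexMatched hook and the mapping from regex id to counter slot are assumptions made for illustration; how Atheris would assign ids is not covered in this thread.

```cpp
#include <cstddef>
#include <cstdint>

// Arbitrary size chosen for this sketch.
static const size_t kNumCounters = 64;

// Counters placed in this section are treated by libFuzzer as additional
// coverage signals. The linked test guards the attribute with __linux__,
// so the sketch does the same.
#ifdef __linux__
__attribute__((section("__libfuzzer_extra_counters")))
#endif
static uint8_t ExtraCounters[kNumCounters];

// Hypothetical hook: the target calls this when the regular expression
// identified by RegexId matched the current input.
void ReportRegexMatched(size_t RegexId) {
  ExtraCounters[RegexId % kNumCounters] = 1;
}
```

An input that makes a previously unmatched regular expression match would then set a new counter, so it would be kept in the corpus like any other input with new coverage.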

In D93879#2490294, @IanPudney wrote:

Dor1s, what do you suggest? I haven't been able to find a good way to pass this information to libFuzzer without extending the API. The best we came up with was to simulate a memcmp(), but it didn't seem to work very well.

I was thinking about __libfuzzer_extra_counters, which Kostya has just mentioned.

Revision Contents

Path                                               Size
compiler-rt/lib/fuzzer/FuzzerDriver.cpp            30 lines
compiler-rt/lib/fuzzer/FuzzerMutate.h              12 lines
compiler-rt/lib/fuzzer/FuzzerMutate.cpp            13 lines
compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp    10 lines

Diff 313915

compiler-rt/lib/fuzzer/FuzzerDriver.cpp

[... 82 lines not shown ...]
#undef FUZZER_FLAG_STRING		#undef FUZZER_FLAG_STRING
};		};

static const size_t kNumFlags =		static const size_t kNumFlags =
sizeof(FlagDescriptions) / sizeof(FlagDescriptions[0]);		sizeof(FlagDescriptions) / sizeof(FlagDescriptions[0]);

static Vector<std::string> *Inputs;		static Vector<std::string> *Inputs;
static std::string *ProgName;		static std::string *ProgName;
		static Dictionary ManualDictionary;

static void PrintHelp() {		static void PrintHelp() {
Printf("Usage:\n");		Printf("Usage:\n");
auto Prog = ProgName->c_str();		auto Prog = ProgName->c_str();
Printf("\nTo run fuzzing pass 0 or more directories.\n");		Printf("\nTo run fuzzing pass 0 or more directories.\n");
Printf("%s [-flag1=val1 [-flag2=val2 ...] ] [dir1 [dir2 ...] ]\n", Prog);		Printf("%s [-flag1=val1 [-flag2=val2 ...] ] [dir1 [dir2 ...] ]\n", Prog);

Printf("\nTo run individual tests without fuzzing pass 1 or more files:\n");		Printf("\nTo run individual tests without fuzzing pass 1 or more files:\n");
[... 700 lines not shown ...]	if (RunIndividualFiles)
return CollectDataFlow(Flags.collect_data_flow, Flags.data_flow_trace,		return CollectDataFlow(Flags.collect_data_flow, Flags.data_flow_trace,
ReadCorpora({}, *Inputs));		ReadCorpora({}, *Inputs));
else		else
return CollectDataFlow(Flags.collect_data_flow, Flags.data_flow_trace,		return CollectDataFlow(Flags.collect_data_flow, Flags.data_flow_trace,
ReadCorpora(*Inputs, {}));		ReadCorpora(*Inputs, {}));
}		}

Random Rand(Seed);		Random Rand(Seed);
auto *MD = new MutationDispatcher(Rand, Options);		auto *MD = new MutationDispatcher(Rand, Options, ManualDictionary);
auto *Corpus = new InputCorpus(Options.OutputCorpus, Entropic);		auto *Corpus = new InputCorpus(Options.OutputCorpus, Entropic);
auto F = new Fuzzer(Callback, Corpus, *MD, Options);		auto F = new Fuzzer(Callback, Corpus, *MD, Options);

for (auto &U: Dictionary)		for (auto &U: Dictionary)
if (U.size() <= Word::GetMaxSize())		if (U.size() <= Word::GetMaxSize()) {
MD->AddWordToManualDictionary(Word(U.data(), U.size()));		Word UWord(U.data(), U.size());
		ManualDictionary.push_back({UWord, std::numeric_limits<size_t>::max()});
		}

// Threads are only supported by Chrome. Don't use them with emscripten		// Threads are only supported by Chrome. Don't use them with emscripten
// for now.		// for now.
#if !LIBFUZZER_EMSCRIPTEN		#if !LIBFUZZER_EMSCRIPTEN
StartRssThread(F, Flags.rss_limit_mb);		StartRssThread(F, Flags.rss_limit_mb);
#endif // LIBFUZZER_EMSCRIPTEN		#endif // LIBFUZZER_EMSCRIPTEN

Options.HandleAbrt = Flags.handle_abrt;		Options.HandleAbrt = Flags.handle_abrt;
[... 92 lines not shown ...]
}		}

extern "C" ATTRIBUTE_INTERFACE int		extern "C" ATTRIBUTE_INTERFACE int
LLVMFuzzerRunDriver(int *argc, char ***argv,		LLVMFuzzerRunDriver(int *argc, char ***argv,
int (*UserCb)(const uint8_t *Data, size_t Size)) {		int (*UserCb)(const uint8_t *Data, size_t Size)) {
return FuzzerDriver(argc, argv, UserCb);		return FuzzerDriver(argc, argv, UserCb);
}		}

		extern "C" ATTRIBUTE_INTERFACE void
		LLVMFuzzerAddToDictionary(const uint8_t *Data, size_t Size) {
		while (Size >= Word::kMaxSize) {
		Word DataWord(Data, Word::kMaxSize);
		if (!ManualDictionary.ContainsWord(DataWord)) {
		ManualDictionary.push_back(
		{DataWord, std::numeric_limits<size_t>::max()});
		}
		Size -= Word::kMaxSize;
		Data += Word::kMaxSize;
		}

		if (Size) {
		Word DataWord(Data, Size);
		if (!ManualDictionary.ContainsWord(DataWord)) {
		ManualDictionary.push_back(
		{DataWord, std::numeric_limits<size_t>::max()});
		}
		}
		}

// Storage for global ExternalFunctions object.		// Storage for global ExternalFunctions object.
ExternalFunctions *EF = nullptr;		ExternalFunctions *EF = nullptr;

} // namespace fuzzer		} // namespace fuzzer
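
To make the splitting behavior questioned in the inline comment easier to see, here is a standalone sketch of the same chunking and duplicate check. It assumes a toy limit of 4 bytes and uses std::string in place of Word; the real limit is Word::kMaxSize, and ToySplitIntoDictionary is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Assumed limit for this sketch only; the diff uses Word::kMaxSize.
static const size_t kToyMaxWordSize = 4;

// Splits Data into pieces of at most kToyMaxWordSize bytes and appends each
// piece that is not already present, mirroring the loop in the diff.
void ToySplitIntoDictionary(std::vector<std::string> &Dict,
                            const std::string &Data) {
  for (size_t Pos = 0; Pos < Data.size(); Pos += kToyMaxWordSize) {
    std::string Piece = Data.substr(Pos, kToyMaxWordSize);
    if (std::find(Dict.begin(), Dict.end(), Piece) == Dict.end())
      Dict.push_back(Piece);
  }
}
```

With this toy limit, the input "abcdefghij" ends up as the three entries "abcd", "efgh" and "ij". An oversized word is therefore stored as unrelated fragments rather than rejected, which is the behavior the reviewer asks about.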

compiler-rt/lib/fuzzer/FuzzerMutate.h

[... 12 lines not shown ...]

#include "FuzzerDefs.h"		#include "FuzzerDefs.h"
#include "FuzzerDictionary.h"		#include "FuzzerDictionary.h"
#include "FuzzerOptions.h"		#include "FuzzerOptions.h"
#include "FuzzerRandom.h"		#include "FuzzerRandom.h"

namespace fuzzer {		namespace fuzzer {

		extern Dictionary DefaultEmptyDictionary;

class MutationDispatcher {		class MutationDispatcher {
public:		public:
MutationDispatcher(Random &Rand, const FuzzingOptions &Options);		MutationDispatcher(Random &Rand, const FuzzingOptions &Options,
		Dictionary &ManualDictionary = DefaultEmptyDictionary);
~MutationDispatcher() {}		~MutationDispatcher() {}
/// Indicate that we are about to start a new sequence of mutations.		/// Indicate that we are about to start a new sequence of mutations.
void StartMutationSequence();		void StartMutationSequence();
/// Print the current sequence of mutations. Only prints the full sequence		/// Print the current sequence of mutations. Only prints the full sequence
/// when Verbose is true.		/// when Verbose is true.
void PrintMutationSequence(bool Verbose = true);		void PrintMutationSequence(bool Verbose = true);
/// Return the current sequence of mutations.		/// Return the current sequence of mutations.
std::string MutationSequence();		std::string MutationSequence();
[... 50 lines not shown ...]	public:
/// Applies one of the default mutations. Provided as a service		/// Applies one of the default mutations. Provided as a service
/// to mutation authors.		/// to mutation authors.
size_t DefaultMutate(uint8_t *Data, size_t Size, size_t MaxSize);		size_t DefaultMutate(uint8_t *Data, size_t Size, size_t MaxSize);

/// Creates a cross-over of two pieces of Data, returns its size.		/// Creates a cross-over of two pieces of Data, returns its size.
size_t CrossOver(const uint8_t *Data1, size_t Size1, const uint8_t *Data2,		size_t CrossOver(const uint8_t *Data1, size_t Size1, const uint8_t *Data2,
size_t Size2, uint8_t *Out, size_t MaxOutSize);		size_t Size2, uint8_t *Out, size_t MaxOutSize);

void AddWordToManualDictionary(const Word &W);

void PrintRecommendedDictionary();		void PrintRecommendedDictionary();

void SetCrossOverWith(const Unit *U) { CrossOverWith = U; }		void SetCrossOverWith(const Unit *U) { CrossOverWith = U; }

Random &GetRand() { return Rand; }		Random &GetRand() { return Rand; }

private:		private:
struct Mutator {		struct Mutator {
[... 22 lines not shown ...]	DictionaryEntry MakeDictionaryEntryFromCMP(const void *Arg1, const void *Arg2,
const void *Arg1Mutation,		const void *Arg1Mutation,
const void *Arg2Mutation,		const void *Arg2Mutation,
size_t ArgSize,		size_t ArgSize,
const uint8_t *Data, size_t Size);		const uint8_t *Data, size_t Size);

Random &Rand;		Random &Rand;
const FuzzingOptions Options;		const FuzzingOptions Options;

// Dictionary provided by the user via -dict=DICT_FILE.		// Dictionary provided by the user via -dict=DICT_FILE or
Dictionary ManualDictionary;		// LLVMFuzzerAddToDictionary.
		Dictionary &ManualDictionary;
// Persistent dictionary modified by the fuzzer, consists of		// Persistent dictionary modified by the fuzzer, consists of
// entries that led to successful discoveries in the past mutations.		// entries that led to successful discoveries in the past mutations.
Dictionary PersistentAutoDictionary;		Dictionary PersistentAutoDictionary;

Vector<DictionaryEntry *> CurrentDictionaryEntrySequence;		Vector<DictionaryEntry *> CurrentDictionaryEntrySequence;

static const size_t kCmpDictionaryEntriesDequeSize = 16;		static const size_t kCmpDictionaryEntriesDequeSize = 16;
DictionaryEntry CmpDictionaryEntriesDeque[kCmpDictionaryEntriesDequeSize];		DictionaryEntry CmpDictionaryEntriesDeque[kCmpDictionaryEntriesDequeSize];
[... 17 lines not shown ...]

compiler-rt/lib/fuzzer/FuzzerMutate.cpp

[... 12 lines not shown ...]
#include "FuzzerIO.h"		#include "FuzzerIO.h"
#include "FuzzerMutate.h"		#include "FuzzerMutate.h"
#include "FuzzerOptions.h"		#include "FuzzerOptions.h"
#include "FuzzerTracePC.h"		#include "FuzzerTracePC.h"

namespace fuzzer {		namespace fuzzer {

const size_t Dictionary::kMaxDictSize;		const size_t Dictionary::kMaxDictSize;

		Dictionary DefaultEmptyDictionary;

static const size_t kMaxMutationsToPrint = 10;		static const size_t kMaxMutationsToPrint = 10;

static void PrintASCII(const Word &W, const char *PrintAfter) {		static void PrintASCII(const Word &W, const char *PrintAfter) {
PrintASCII(W.data(), W.size(), PrintAfter);		PrintASCII(W.data(), W.size(), PrintAfter);
}		}

MutationDispatcher::MutationDispatcher(Random &Rand,		MutationDispatcher::MutationDispatcher(Random &Rand,
const FuzzingOptions &Options)		const FuzzingOptions &Options,
: Rand(Rand), Options(Options) {		Dictionary &ManualDictionary)
		: Rand(Rand), Options(Options), ManualDictionary(ManualDictionary) {
DefaultMutators.insert(		DefaultMutators.insert(
DefaultMutators.begin(),		DefaultMutators.begin(),
{		{
{&MutationDispatcher::Mutate_EraseBytes, "EraseBytes"},		{&MutationDispatcher::Mutate_EraseBytes, "EraseBytes"},
{&MutationDispatcher::Mutate_InsertByte, "InsertByte"},		{&MutationDispatcher::Mutate_InsertByte, "InsertByte"},
{&MutationDispatcher::Mutate_InsertRepeatedBytes,		{&MutationDispatcher::Mutate_InsertRepeatedBytes,
"InsertRepeatedBytes"},		"InsertRepeatedBytes"},
{&MutationDispatcher::Mutate_ChangeByte, "ChangeByte"},		{&MutationDispatcher::Mutate_ChangeByte, "ChangeByte"},
[... 527 lines not shown ...]	size_t MutationDispatcher::MutateWithMask(uint8_t *Data, size_t Size,
(void)NewSize;		(void)NewSize;
// Even if NewSize < OneBits we still use all OneBits bytes.		// Even if NewSize < OneBits we still use all OneBits bytes.
for (size_t I = 0, J = 0; I < MaskedSize; I++)		for (size_t I = 0, J = 0; I < MaskedSize; I++)
if (Mask[I])		if (Mask[I])
Data[I] = T[J++];		Data[I] = T[J++];
return Size;		return Size;
}		}

void MutationDispatcher::AddWordToManualDictionary(const Word &W) {
ManualDictionary.push_back(
{W, std::numeric_limits<size_t>::max()});
}

} // namespace fuzzer		} // namespace fuzzer

compiler-rt/lib/fuzzer/tests/FuzzerUnittest.cpp

[... 10 lines not shown ...]

#include "FuzzerCorpus.h"		#include "FuzzerCorpus.h"
#include "FuzzerDictionary.h"		#include "FuzzerDictionary.h"
#include "FuzzerInternal.h"		#include "FuzzerInternal.h"
#include "FuzzerMerge.h"		#include "FuzzerMerge.h"
#include "FuzzerMutate.h"		#include "FuzzerMutate.h"
#include "FuzzerRandom.h"		#include "FuzzerRandom.h"
#include "FuzzerTracePC.h"		#include "FuzzerTracePC.h"
#include "gtest/gtest.h"		#include "gtest/gtest.h"
#include <memory>		#include <memory>
#include <set>		#include <set>
#include <sstream>		#include <sstream>

using namespace fuzzer;		using namespace fuzzer;

// For now, have LLVMFuzzerTestOneInput just to make it link.		// For now, have LLVMFuzzerTestOneInput just to make it link.
// Later we may want to make unittests that actually call LLVMFuzzerTestOneInput.		// Later we may want to make unittests that actually call LLVMFuzzerTestOneInput.
[... 384 lines not shown ...]	for (int count = 0; count < (1 << 18); ++count) {
ASSERT_EQ(NewSize, MaxSize);		ASSERT_EQ(NewSize, MaxSize);
}		}
}		}

void TestAddWordFromDictionary(Mutator M, int NumIter) {		void TestAddWordFromDictionary(Mutator M, int NumIter) {
std::unique_ptr<ExternalFunctions> t(new ExternalFunctions());		std::unique_ptr<ExternalFunctions> t(new ExternalFunctions());
fuzzer::EF = t.get();		fuzzer::EF = t.get();
Random Rand(0);		Random Rand(0);
std::unique_ptr<MutationDispatcher> MD(new MutationDispatcher(Rand, {}));		Dictionary ManualDictionary;
		std::unique_ptr<MutationDispatcher> MD(
		new MutationDispatcher(Rand, {}, ManualDictionary));
uint8_t Word1[4] = {0xAA, 0xBB, 0xCC, 0xDD};		uint8_t Word1[4] = {0xAA, 0xBB, 0xCC, 0xDD};
uint8_t Word2[3] = {0xFF, 0xEE, 0xEF};		uint8_t Word2[3] = {0xFF, 0xEE, 0xEF};
MD->AddWordToManualDictionary(Word(Word1, sizeof(Word1)));		ManualDictionary.push_back(
MD->AddWordToManualDictionary(Word(Word2, sizeof(Word2)));		{Word(Word1, 4), std::numeric_limits<size_t>::max()});
		ManualDictionary.push_back(
		{Word(Word2, 3), std::numeric_limits<size_t>::max()});
int FoundMask = 0;		int FoundMask = 0;
uint8_t CH0[7] = {0x00, 0x11, 0x22, 0xAA, 0xBB, 0xCC, 0xDD};		uint8_t CH0[7] = {0x00, 0x11, 0x22, 0xAA, 0xBB, 0xCC, 0xDD};
uint8_t CH1[7] = {0x00, 0x11, 0xAA, 0xBB, 0xCC, 0xDD, 0x22};		uint8_t CH1[7] = {0x00, 0x11, 0xAA, 0xBB, 0xCC, 0xDD, 0x22};
uint8_t CH2[7] = {0x00, 0xAA, 0xBB, 0xCC, 0xDD, 0x11, 0x22};		uint8_t CH2[7] = {0x00, 0xAA, 0xBB, 0xCC, 0xDD, 0x11, 0x22};
uint8_t CH3[7] = {0xAA, 0xBB, 0xCC, 0xDD, 0x00, 0x11, 0x22};		uint8_t CH3[7] = {0xAA, 0xBB, 0xCC, 0xDD, 0x00, 0x11, 0x22};
uint8_t CH4[6] = {0x00, 0x11, 0x22, 0xFF, 0xEE, 0xEF};		uint8_t CH4[6] = {0x00, 0x11, 0x22, 0xFF, 0xEE, 0xEF};
uint8_t CH5[6] = {0x00, 0x11, 0xFF, 0xEE, 0xEF, 0x22};		uint8_t CH5[6] = {0x00, 0x11, 0xFF, 0xEE, 0xEF, 0x22};
uint8_t CH6[6] = {0x00, 0xFF, 0xEE, 0xEF, 0x11, 0x22};		uint8_t CH6[6] = {0x00, 0xFF, 0xEE, 0xEF, 0x11, 0x22};
[... 691 lines not shown ...]