Add support for LLVMFuzzerAddToDictionary, which, if called from a fuzzer, injects words into the user dictionary at runtime. Fuzzers can use this when they have fuzzer-specific knowledge of words that might help; in my use case, it is used for adding words that will satisfy regular expressions encountered during fuzzing. To accomplish this, MutationDispatcher no longer owns the ManualDictionary; it takes a reference to it instead. Outside of tests, this reference will point to a global Dictionary instance referred to from LLVMFuzzerAddToDictionary. The logic for adding to the ManualDictionary is pulled out of MutationDispatcher.
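A rough sketch of the ownership change described above, using toy stand-ins rather than the real libFuzzer classes. ToyDictionary, ToyMutationDispatcher, GlobalManualDictionary and ToyAddToDictionary are hypothetical names, and the (pointer, size) signature of the hook is an assumption for illustration; the actual signature is whatever the patch defines.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for libFuzzer's Dictionary; not the real class.
class ToyDictionary {
 public:
  // Copies `size` bytes starting at `data` into a new dictionary word.
  void Add(const uint8_t* data, std::size_t size) {
    words_.emplace_back(reinterpret_cast<const char*>(data), size);
  }
  std::size_t size() const { return words_.size(); }
  const std::string& operator[](std::size_t i) const { return words_[i]; }

 private:
  std::vector<std::string> words_;
};

// Toy dispatcher: it borrows the manual dictionary instead of owning it,
// so words added elsewhere become visible to it without any extra plumbing.
class ToyMutationDispatcher {
 public:
  explicit ToyMutationDispatcher(ToyDictionary& manual_dict)
      : manual_dict_(manual_dict) {}
  std::size_t NumManualWords() const { return manual_dict_.size(); }

 private:
  ToyDictionary& manual_dict_;
};

// The single global instance that non-test code would share.
ToyDictionary& GlobalManualDictionary() {
  static ToyDictionary dict;
  return dict;
}

// Assumed shape of the runtime hook (hypothetical signature).
extern "C" void ToyAddToDictionary(const uint8_t* word, std::size_t size) {
  GlobalManualDictionary().Add(word, size);
}
```

In this arrangement a test can hand the dispatcher its own local dictionary, while production code passes GlobalManualDictionary().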
If this change moves forward, it will also need a test using LLVMFuzzerAddToDictionary. However, I'm personally not convinced by the use case for this.
When interesting sequences are generated during fuzzing, libFuzzer already leverages them through at least the following two mechanisms:
- inputs triggering new coverage signals are added to the corpus, i.e. such byte sequences are present in the inputs used for future mutations
- a recommended dictionary is auto-generated and accumulates sequences that appear to be good dictionary entries
If you run your fuzz targets on ClusterFuzz, it automatically takes care of parsing the recommended dictionary, re-testing its entries (using libFuzzer's analyze_dict feature) and preserving really useful elements for future use.
With this in mind, the CL looks to me like an attempt to re-implement some of the logic that is already being handled.
Is this logic really needed? It looks like a footgun to me in case people misuse this API.
I'll tell you more about my use-case with Atheris. In Python, it's very common for control flow to be decided by regular expressions. However, regular expression matching is implemented inside of CPython, in the _sre.c module. This means that unless CPython itself is compiled with coverage, re.match() appears as an atomic operation that libFuzzer has no insight into.
Compiling CPython is actually pretty complex. Furthermore, because it introduces so many coverage symbols, performance ends up limited to 2000 execs/sec (on my very powerful machine) if CPython is compiled with coverage. This means the user has to jump through even more hoops to compile just the _sre module with coverage, if they want regular expression coverage support.
This change helps me solve that problem. Whenever Atheris encounters a regular expression in Python, it can insert into the dictionary a string that matches that expression. This immediately makes libFuzzer able to progress past the regular expression. It's very effective.
For more background, see https://github.com/google/atheris/issues/5.
For what it's worth, it's not clear to me that even if you _did_ compile _sre.c with fuzzer-no-link that you'd get good results. The regexp engine is effectively an interpreter, which is probably the worst case for coverage guided fuzzing -- essentially the program counter and branches have a low correspondence with semantics. For example, trying to match the regexp ab, you'd have two MATCH_CHAR opcodes, but it'd be backed by a single C function, so you wouldn't get different coverage for one matching versus the other.
This is basically the same reason why you can't fuzz Python code by instrumenting CPython -- you get full coverage of the interpreter eval loop, which tells you nothing about the coverage of Python code itself.
This is all by way of saying that the current coverage instrumentation alone will probably never be sufficient for this. For code that uses regular expressions in control flow, you'll always need special handling.
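To make the MATCH_CHAR point concrete, here is a toy matcher. It is a deliberately simplified illustration and not CPython's _sre; Op, Insn and Match are hypothetical names. Every MatchChar opcode is handled by the same case, so matching the first character of ab and matching the second exercise the same instrumented code, and the two differ at most in how often it runs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy regexp "bytecode": the only opcode matches one literal character.
enum class Op { MatchChar };

struct Insn {
  Op op;
  char arg;
};

// Toy interpreter loop. The same `case` (and therefore the same coverage
// edges) handles every MatchChar, no matter which pattern position it is.
bool Match(const std::vector<Insn>& prog, const std::string& input) {
  std::size_t pos = 0;
  for (const Insn& insn : prog) {
    switch (insn.op) {
      case Op::MatchChar:
        if (pos >= input.size() || input[pos] != insn.arg) {
          return false;  // mismatch looks identical at any pattern position
        }
        ++pos;
        break;
    }
  }
  return true;
}
```

For the pattern ab, the program would be two MatchChar instructions, one for 'a' and one for 'b'.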
Thanks for the context. If I understand correctly, the actual underlying goal is to pass an additional coverage signal to the fuzzing engine. If there is a way to achieve that without extending libFuzzer's API, would that suffice?
I believe that would be a better option, as it would expand the corpus, which is accumulated over time and can be re-used across different runs, in contrast to the in-memory dictionary we're expanding here.
Dor1s, what do you suggest? I haven't been able to find a good way to pass this information to libFuzzer without extending the API. The best we came up with was to simulate a memcmp(), but it didn't seem to work very well.
I am reluctant to extend the public interface in ways that
a) are likely to be useful for only a few cases
b) are likely to remain libFuzzer-specific
c) already have existing functionality that can be used instead. I mean the existing -dict flag (it's not exactly what you describe, though)
The public interface should remain maximally engine-agnostic.
Maybe you can find a solution for your specific case using an existing mechanism?
Did you try using the extra counters somehow?
We basically need to detect a situation where the behavior is interesting and let LF know via __libfuzzer_extra_counters
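A minimal sketch of what that could look like, assuming a Linux build (where libFuzzer picks up the __libfuzzer_extra_counters section) and assuming the harness can observe some notion of regex progress. RecordRegexProgress, its parameters, the slot hashing and the array size are all hypothetical and only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Place the array in the extra-counters section only where it is supported.
#if defined(__linux__)
#define EXTRA_COUNTER_SECTION \
  __attribute__((section("__libfuzzer_extra_counters")))
#else
#define EXTRA_COUNTER_SECTION
#endif

// Hypothetical size; each byte acts like one additional coverage counter.
constexpr std::size_t kNumRegexCounters = 256;
EXTRA_COUNTER_SECTION static uint8_t g_regex_counters[kNumRegexCounters];

// Hypothetical hook, called by the harness after a regex match attempt.
//   regex_id:           stable identifier of the pattern (assumed available)
//   matched_prefix_len: how far the match progressed (assumed observable)
// Each distinct (pattern, progress) pair maps to a slot, so reaching a new
// level of progress can show up as a new counter value.
void RecordRegexProgress(std::size_t regex_id, std::size_t matched_prefix_len) {
  const std::size_t slot =
      (regex_id * 31 + matched_prefix_len) % kNumRegexCounters;
  if (g_regex_counters[slot] < 255) {
    ++g_regex_counters[slot];  // saturate instead of wrapping around
  }
}
```

Unlike a runtime dictionary, inputs that reach a new counter value this way would be kept in the corpus like any other input that adds coverage.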