This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/tools/clang-format/
-
tools/
-
clang-format/
-
clang-format.py
-
compiler-rt/lib/fuzzer/
-
lib/
-
fuzzer/
-
FuzzerDriver.cpp
-
FuzzerFlags.def
-
FuzzerFork.h
2
FuzzerFork.cpp

Differential D100161

Redistribute energy for Corpus
Needs ReviewPublic

Authored by gtt1995 on Apr 8 2021, 7:53 PM.

Download Raw Diff

Details

Reviewers

kcc
morehouse
charco

Summary

Divide the corpus into n parts according to size. Each job executes each corpus in turn, Job one executes the corpus with the smallest size, Job two executes the relatively larger corpus,...Job N executes the seed of the largest corpus, in turn,. i.e. each job choose some seeds from corpus 1, corpus2 ,..., corpus N, corpus1,corpus2...corpus N .....
that is, allocate more energy to the small seeds, trigger the common path in advance, and prefer to keep the small seeds.
In my experiment , It is found that the bugs rate is greatly accelerated, the cov is greatly increased (equal to the effect of entropic improvement), and the size of the newly generated interesting seeds is very small.

Diff Detail

Event Timeline

gtt1995 requested review of this revision.Apr 8 2021, 7:53 PM

gtt1995 created this revision.

Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 8 2021, 7:53 PM

Herald added subscribers: Restricted Project, cfe-commits. · View Herald Transcript

gtt1995 edited reviewers, added: kcc; removed: 0b01.Apr 8 2021, 8:08 PM

gtt1995 removed 1 blocking reviewer(s): kcc.

gtt1995 edited reviewers, added: morehouse; removed: 01alchemist.Apr 8 2021, 8:10 PM

gtt1995 added a reviewer: charco.

gtt1995 edited the summary of this revision. (Show Details)

Harbormaster completed remote builds in B97869: Diff 336298.Apr 8 2021, 8:27 PM

Thanks for the patch! Would you mind sharing the experimental data/results you obtained for this patch?

Additionally, could you submit this patch to FuzzBench for an independent evaluation?

Thanks,
Matt

In D100161#2679922, @morehouse wrote:

Thanks for the patch! Would you mind sharing the experimental data/results you obtained for this patch?

Additionally, could you submit this patch to FuzzBench for an independent evaluation?

Thanks,
Matt

Hello,How can i share the experiment data?
It seems that FuzzBench does not accept this parallel mode evaluation.

experiment.tar8 MBDownload

This is part of raw data , the object from oss-fuzz project.

In D100161#2681045, @gtt1995 wrote:

experiment.tar8 MBDownload

This is part of raw data , the object from oss-fuzz project.

The patched version data in average dir, data of libfuzzer dir is from original libfuzzer.
Thanks.

Thanks for sharing your data. Took a quick look and seems promising.

I would like to try this on FuzzBench before accepting the patch though. FuzzBench has a very nice experimental framework for evaluating changes like this.

It seems that FuzzBench does not accept this parallel mode evaluation.

I talked to @metzman who manages FuzzBench. Sounds like you're correct, FuzzBench uses only one worker process in fork mode. @metzman said we could probably run a special experiment with more workers to evaluate this patch.

Another approach that might be worth doing, is to make the patch effective even for a single worker. For example, maybe we randomly pick from a subset of the corpus for that single worker.

Also, I'm curious how the number of fork-mode workers affects efficacy. I can imagine with lots of workers that this patch could perform much worse. Specifically if we have a small number of corpus elements per worker, the crossover mutation becomes quite limited.

compiler-rt/lib/fuzzer/FuzzerFork.cpp
143	Does keeping this skew within each subset of the corpus help? Would a uniform-random approach change efficacy?
150	So we would still use the current strategy on the rare occasion when we get an `m` index past the end of `Files`. I imagine there's always some cases where the original strategy will be beneficial at least for the crossover mutation. Maybe we could make the probability of using the current strategy more explicit. Something like if (RandInt(100) < 90) // Use new strategy else // Use old strategy

Also, the descriptions states:

Divide the corpus into n parts according to size.

Is it really according to size? IIUC when there are multiple worker processes, any new coverage they have simply gets appended to Files. So Files is not necessarily sorted by size.

In D100161#2683632, @morehouse wrote:

Also, the descriptions states:

Divide the corpus into n parts according to size.

Is it really according to size? IIUC when there are multiple worker processes, any new coverage they have simply gets appended to Files. So Files is not necessarily sorted by size.

Yes, Grouping the corpus has two advantages, one is that it can reduce the repetitive work of each process, and the other is that small seeds trigger the same path in advance. And triggering new paths in advance is the key reason to improve efficiency. so 'Files' should sorted by size.

In D100161#2683622, @morehouse wrote:

Thanks for sharing your data. Took a quick look and seems promising.

I would like to try this on FuzzBench before accepting the patch though. FuzzBench has a very nice experimental framework for evaluating changes like this.

It seems that FuzzBench does not accept this parallel mode evaluation.

I talked to @metzman who manages FuzzBench. Sounds like you're correct, FuzzBench uses only one worker process in fork mode. @metzman said we could probably run a special experiment with more workers to evaluate this patch.

Another approach that might be worth doing, is to make the patch effective even for a single worker. For example, maybe we randomly pick from a subset of the corpus for that single worker.

Also, I'm curious how the number of fork-mode workers affects efficacy. I can imagine with lots of workers that this patch could perform much worse. Specifically if we have a small number of corpus elements per wOorker, the crossover mutation becomes quite limited.

OK, Thanks for your work.
There are some thoughtful tips to tell you :

The effect of Grouping corpus energy may be similar to the effect of entropic (distribute energy for single seeds)on some goals, but there are also differences. So you should enable -entropic=0 when you evalutate them . Of course , At the same time, Enable -entropic and -NumCorpus will also have a certain effect .If you are interested, you can test four groups of subjects

(1).-entropic=0,NumCorpuses=1;
(2),-entropic=1,NumCorpuses=1
(3),-entropic=0, NumCorpuses=N (i set 30, others are also possible, I think this changes with the total number of seeds, it should change dynamically,)
(4),-entropic=1,NumCorpuses=30

I set -fork=30,-NumCorpuses=30,-entropic=0 in my evaluation. But the -fork value can not be equal to the -NumCorpuses value, because each job will execute each corpus in turn from small to large.
According to 2, it should be worked well in the single core mode, Single process executes each corpus in turn from small to large. for in-process libfuzzer, frequent interaction with fs brings additional overhead. Therefore, it is still suitable for energy scheduling in parallel fuzzing when each child process maintains the same coverage bitmap in time .
The degree of parallelism depends on how long you want to get results .I think more grouping and -fork equal to -NumCorpuses will be much better. They can be regarded as: traversing all corpora in one loop.If this is not the case, the corpues that are biased to the back will not be fully tested, because the merged results of the previous jobs will be written back to fs (if there is a large seed generated by small jobs ) and will be taken out again, which will have some negative effects.
I am sorry I haven't tried it on -workers .
Can you share the official results to me？
Thanks for your work once again！

In D100161#2683632, @morehouse wrote:

Also, the descriptions states:

Divide the corpus into n parts according to size.

Is it really according to size? IIUC when there are multiple worker processes, any new coverage they have simply gets appended to Files. So Files is not necessarily sorted by size.

Sorry, I want to know 'Files' is sorted by size necessarily in the trunk version ,is it different from the https://github.com/Dor1s/libfuzzer-workshop? The shared data is collected on the https://github.com/Dor1s/libfuzzer-workshop , and the new evaluation data is still tested in the trunk version ,the data seems promising.
I have a question. When they created the task for the first time, the ’FILES‘ were sorted strictly according to the size. The newly covered seeds were added after the Files were appended, which means that the new seeds in the future may not necessarily be sorted according to the size. , Then my theory may not hold true, but from the experimental data, at the same time, the average seed size of the main corpus is very small.The cov is more, this method may give little seeds more energy from a certain angle, but I don’t seem to be able to explain this phenomenon. Can you help me?
Wouldn't it be better if FILEs were always sorted strictly according to size? Because the use of corpus grouping is equivalent to N more locations for extracting seeds in the entire corpus, so Rand->SkewTowardsLast(Files.size()) may not be appropriate, so it tends to extract newly added seeds .
How to implement File is strictly sorted according to seeds. I only came into contact with libfuzzer code last two weeks and I am not very familiar with it .
I think my method of grouping can prevent your original method from causing local optimality. Therefore, if you sort strictly according to the size, this will ensure that interesting seeds that are based on a certain sub-corpus can fall into the original sub-corpus with a high probability . Then i can design a filtering algorithm Thompson sampling or Multi-armed Bandits to find the global optimal solution .

If not sorted by size , Just a simple grouping of corpus, the effect is similar to entropic.

Maybe uniform-random approach change efficacy!

If the effect is similar to entropic, why do we need this patch as well?

In D100161#2686130, @morehouse wrote:

If the effect is similar to entropic, why do we need this patch as well?

They just have some similarities, they will be better after patching.

Hello.
Due to the time zone difference, I think our communication is a bit inefficient. Can we arrange a convenient time for you to focus on the discussion?
We use CST.
Thanks.

At this point I am not convinced this patch will provide benefit for the default use case when -entropic=1. I am hesitant to add complexity to the code for unsure benefit.

If you request a FuzzBench experiment to get some data on this, and the results look good, then I'll be willing to invest more time into reviewing this patch.

Please CC me on the FuzzBench pull request, so I can make sure we are evaluating this properly.

In D100161#2689156, @morehouse wrote:

At this point I am not convinced this patch will provide benefit for the default use case when -entropic=1. I am hesitant to add complexity to the code for unsure benefit.

If you request a FuzzBench experiment to get some data on this, and the results look good, then I'll be willing to invest more time into reviewing this patch.

Please CC me on the FuzzBench pull request, so I can make sure we are evaluating this properly.

Hello,
Fuzzbench don't accept the parallel mode testing .
I will share my complete experiment data in the future.

I redesigned the algorithm and did a complete long-term evaluation by myself, and got very good results. Whether it is -entropic=0 or 1, it performs very well, and -fork mode is now better than paralllel fuzzing mode Better performance, 
please move to D105084. There are detailed data. Thanks  a lot.

Revision Contents

Path

Size

clang/

tools/

clang-format/

clang-format.py

compiler-rt/

lib/

fuzzer/

2 lines

1 line

2 lines

27 lines

Diff 336298

clang/tools/clang-format/clang-format.py

Property	Old Value	New Value
File Mode	100644	100755

The contents of this file were not changed.

compiler-rt/lib/fuzzer/FuzzerDriver.cpp

Context not available.
	}	}

	if (Flags.fork)	if (Flags.fork)
	FuzzWithFork(F->GetMD().GetRand(), Options, Args, *Inputs, Flags.fork);	FuzzWithFork(F->GetMD().GetRand(), Options, Args, *Inputs, Flags.fork, Flags.NumCorpuses);
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - FuzzWithFork(F->GetMD().GetRand(), Options, Args, Inputs, Flags.fork, Flags.NumCorpuses); + FuzzWithFork(F->GetMD().GetRand(), Options, Args, Inputs, Flags.fork, + Flags.NumCorpuses); Lint: Pre-merge checks: clang-format: please reformat the code ``` - FuzzWithFork(F->GetMD().GetRand(), Options…

	if (Flags.merge)	if (Flags.merge)
	Merge(F, Options, Args, *Inputs, Flags.merge_control_file);	Merge(F, Options, Args, *Inputs, Flags.merge_control_file);
Context not available.

compiler-rt/lib/fuzzer/FuzzerFlags.def

Context not available.
	FUZZER_FLAG_INT(max_total_time, 0, "If positive, indicates the maximal total "	FUZZER_FLAG_INT(max_total_time, 0, "If positive, indicates the maximal total "
	"time in seconds to run the fuzzer.")	"time in seconds to run the fuzzer.")
	FUZZER_FLAG_INT(help, 0, "Print help.")	FUZZER_FLAG_INT(help, 0, "Print help.")
		FUZZER_FLAG_INT(NumCorpuses, 1, "Divide the corpus into N parts according to size.")
	FUZZER_FLAG_INT(fork, 0, "Experimental mode where fuzzing happens "	FUZZER_FLAG_INT(fork, 0, "Experimental mode where fuzzing happens "
	"in a subprocess")	"in a subprocess")
	FUZZER_FLAG_INT(ignore_timeouts, 1, "Ignore timeouts in fork mode")	FUZZER_FLAG_INT(ignore_timeouts, 1, "Ignore timeouts in fork mode")
Context not available.

compiler-rt/lib/fuzzer/FuzzerFork.h

Context not available.
	namespace fuzzer {	namespace fuzzer {
	void FuzzWithFork(Random &Rand, const FuzzingOptions &Options,	void FuzzWithFork(Random &Rand, const FuzzingOptions &Options,
	const Vector<std::string> &Args,	const Vector<std::string> &Args,
	const Vector<std::string> &CorpusDirs, int NumJobs);	const Vector<std::string> &CorpusDirs, int NumJobs, int NumCorpuses);
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - const Vector<std::string> &CorpusDirs, int NumJobs, int NumCorpuses); + const Vector<std::string> &CorpusDirs, int NumJobs, + int NumCorpuses); Lint: Pre-merge checks: clang-format: please reformat the code ``` - const Vector<std::string>…
	} // namespace fuzzer	} // namespace fuzzer

	#endif // LLVM_FUZZER_FORK_H	#endif // LLVM_FUZZER_FORK_H
Context not available.

compiler-rt/lib/fuzzer/FuzzerFork.cpp

Context not available.
	.count();	.count();
	}	}

	FuzzJob *CreateNewJob(size_t JobId) {	FuzzJob *CreateNewJob(size_t JobId, int NumCorpuses) {
	Command Cmd(Args);	Command Cmd(Args);
	Cmd.removeFlag("fork");	Cmd.removeFlag("fork");
	Cmd.removeFlag("runs");	Cmd.removeFlag("runs");
Context not available.
	std::string Seeds;	std::string Seeds;
	if (size_t CorpusSubsetSize =	if (size_t CorpusSubsetSize =
	std::min(Files.size(), (size_t)sqrt(Files.size() + 2))) {	std::min(Files.size(), (size_t)sqrt(Files.size() + 2))) {
		size_t AverageSize = Files.size()/NumCorpuses +1;
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - size_t AverageSize = Files.size()/NumCorpuses +1; + size_t AverageSize = Files.size() / NumCorpuses + 1; Lint: Pre-merge checks: clang-format: please reformat the code ``` - size_t AverageSize = Files.size()/NumCorpuses…
	auto Time1 = std::chrono::system_clock::now();	auto Time1 = std::chrono::system_clock::now();
		size_t StartIndex = ((JobId-1)%NumCorpuses) * AverageSize;
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - size_t StartIndex = ((JobId-1)%NumCorpuses) * AverageSize; - printf("\n Job %d Choose Corpus %d ",JobId,(JobId)%NumCorpuses); + size_t StartIndex = ((JobId - 1) % NumCorpuses) * AverageSize; + printf("\n Job %d Choose Corpus %d ", JobId, (JobId) % NumCorpuses); Lint: Pre-merge checks: clang-format: please reformat the code ``` - size_t StartIndex = ((JobId-1)%NumCorpuses) *…
		printf("\n Job %d Choose Corpus %d ",JobId,(JobId)%NumCorpuses);
	for (size_t i = 0; i < CorpusSubsetSize; i++) {	for (size_t i = 0; i < CorpusSubsetSize; i++) {
	auto &SF = Files[Rand->SkewTowardsLast(Files.size())];	size_t j = Rand->SkewTowardsLast(AverageSize);
		morehouseUnsubmitted Not Done Reply Inline Actions Does keeping this skew within each subset of the corpus help? Would a uniform-random approach change efficacy? morehouse: Does keeping this skew within each subset of the corpus help? Would a uniform-random approach…
	Seeds += (Seeds.empty() ? "" : ",") + SF;	size_t m = j + StartIndex;
	CollectDFT(SF);	if (m < Files.size()) {
		auto &SF = Files[m];
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - auto &SF = Files[m]; - Seeds += (Seeds.empty() ? "" : ",") + SF; - CollectDFT(SF); - } - else { - auto &SF = Files[Rand->SkewTowardsLast(Files.size())]; - Seeds += (Seeds.empty() ? "" : ",") + SF; - CollectDFT(SF); + auto &SF = Files[m]; + Seeds += (Seeds.empty() ? "" : ",") + SF; 5 diff lines are omitted. See full path. Lint: Pre-merge checks: clang-format: please reformat the code ``` - auto &SF = Files[m]…
		Seeds += (Seeds.empty() ? "" : ",") + SF;
		CollectDFT(SF);
		}
		else {
		morehouseUnsubmitted Not Done Reply Inline Actions So we would still use the current strategy on the rare occasion when we get an `m` index past the end of `Files`. I imagine there's always some cases where the original strategy will be beneficial at least for the crossover mutation. Maybe we could make the probability of using the current strategy more explicit. Something like if (RandInt(100) < 90) // Use new strategy else // Use old strategy morehouse: So we would still use the current strategy on the rare occasion when we get an `m` index past…
		auto &SF = Files[Rand->SkewTowardsLast(Files.size())];
		Seeds += (Seeds.empty() ? "" : ",") + SF;
		CollectDFT(SF);
		}
	}	}
	auto Time2 = std::chrono::system_clock::now();	auto Time2 = std::chrono::system_clock::now();
	auto DftTimeInSeconds = duration_cast<seconds>(Time2 - Time1).count();	auto DftTimeInSeconds = duration_cast<seconds>(Time2 - Time1).count();
Context not available.
	// This is just a skeleton of an experimental -fork=1 feature.	// This is just a skeleton of an experimental -fork=1 feature.
	void FuzzWithFork(Random &Rand, const FuzzingOptions &Options,	void FuzzWithFork(Random &Rand, const FuzzingOptions &Options,
	const Vector<std::string> &Args,	const Vector<std::string> &Args,
	const Vector<std::string> &CorpusDirs, int NumJobs) {	const Vector<std::string> &CorpusDirs, int NumJobs, int NumCorpuses) {
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - const Vector<std::string> &CorpusDirs, int NumJobs, int NumCorpuses) { + const Vector<std::string> &CorpusDirs, int NumJobs, + int NumCorpuses) { Lint: Pre-merge checks: clang-format: please reformat the code ``` - const Vector<std::string>…
	Printf("INFO: -fork=%d: fuzzing in separate process(s)\n", NumJobs);	Printf("INFO: -fork=%d: fuzzing in separate process(s)\n", NumJobs);

	GlobalEnv Env;	GlobalEnv Env;
Context not available.
	Vector<std::thread> Threads;	Vector<std::thread> Threads;
	for (int t = 0; t < NumJobs; t++) {	for (int t = 0; t < NumJobs; t++) {
	Threads.push_back(std::thread(WorkerThread, &FuzzQ, &MergeQ));	Threads.push_back(std::thread(WorkerThread, &FuzzQ, &MergeQ));
	FuzzQ.Push(Env.CreateNewJob(JobId++));	FuzzQ.Push(Env.CreateNewJob(JobId++, NumCorpuses));
	}	}
		//printf("\n 创建%d个jobs\n",NumJobs);
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - //printf("\n 创建%d个jobs\n",NumJobs); + // printf("\n 创建%d个jobs\n",NumJobs); Lint: Pre-merge checks: clang-format: please reformat the code ``` - //printf("\n 创建%d个jobs\n",NumJobs); + // printf…

	while (true) {	while (true) {
	std::unique_ptr<FuzzJob> Job(MergeQ.Pop());	std::unique_ptr<FuzzJob> Job(MergeQ.Pop());
Context not available.
	break;	break;
	}	}

	FuzzQ.Push(Env.CreateNewJob(JobId++));	FuzzQ.Push(Env.CreateNewJob(JobId++, NumCorpuses));
	}	}

	for (auto &T : Threads)	for (auto &T : Threads)
Context not available.