compiler-rt/lib/fuzzer/FuzzerCorpus.h
Show All 27 Lines
struct InputInfo {
  Unit U;                      // The actual input data.
  uint8_t Sha1[kSHA1NumBytes]; // Checksum.
  // Number of features that this input has and no smaller input has.
  size_t NumFeatures = 0;
  size_t Tmp = 0; // Used by ValidateFeatureSet.
  // Stats.
  size_t NumExecutedMutations = 0;
  size_t NumSuccessfullMutations = 0;
  bool MayDeleteFile = false;
kcc: this is new in the patch, is it?
While I completely understand why we'd want to use execution…

marcel: > this is new in the patch, is it?
Yes. Been playing with a few smaller tweaks to boost LF performance.
Do you mean LibFuzzer should be fully deterministic when you start it with the same seed corpus (e.g., by fixing *the* random seed)? Currently, even without this patch I've been observing quite some variance in the coverage achieved over time. Happy to take it out, though, if this messes with the LF design principles.
  bool Reduced = false;
  bool HasFocusFunction = false;
  Vector<uint32_t> UniqFeatureSet;
  Vector<uint8_t> DataFlowTraceForFocusFunction;
  // Power schedule.
vitalybuka: it would be nice to manipulate these fields only with reasonably named methods
  bool NeedsUpdate = false;
kcc: NeedsEnergyUpdate?
  double Energy = 0.0;
vitalybuka: could you please initialize them?
  size_t SumIncidence = 0;
kcc: why not just 'double'?
  Vector<std::pair<uint32_t, uint16_t>> FeatureFreqs;
kcc: Any chance to have a planar data structure here? (sorted array or some such)
Also, why size_t?
marcel: Whether or not unordered_map, see top-level comment. If unordered_map, then uint32_t for keys and values should work.
};

class InputCorpus {
  static const uint32_t kFeatureSetSize = 1 << 21;
  static const uint8_t kMaxMutationFactor = 20;
kcc: why is this public?
Also, why the g_ prefix?
marcel: `g_NumExecutedMutations` contains the total number of inputs generated. It is updated in FuzzerLoop.c in MutateAndTestOne. I can drop the g_ prefix.
kcc: yea, please drop g_.
  static const uint8_t kProbAgressiveSchedule = 80;
  static const size_t kSparseEnergyUpdates = 10000;
  bool Entropic;
  size_t ConsideredRare;
kcc: here and below: remove 'struct'.
  size_t TopXRarestFeatures;
kcc: Lower (use upper case)
kcc: I'd like to put these into a struct EntropicOptions { bool Enabled; size_t EntropicFeatureFrequencyThreshold; ... } and pass such a struct (by value) to InputCorpus CTOR.
public:
  size_t NumExecutedMutations = 0;
vitalybuka: please make field private and add access method if it's needed outside
  InputCorpus(const std::string &OutputCorpus, bool Entropic,
              size_t ConsideredRare, size_t TopXRarestFeatures)
      : Entropic(Entropic), ConsideredRare(ConsideredRare),
        TopXRarestFeatures(TopXRarestFeatures), OutputCorpus(OutputCorpus) {
    memset(InputSizesPerFeature, 0, sizeof(InputSizesPerFeature));
    memset(SmallestElementPerFeature, 0, sizeof(SmallestElementPerFeature));
  }
  ~InputCorpus() {
    for (auto II : Inputs)
      delete II;
kcc: do you need this? the following line should take care of it
kcc: I'm still worried about long double due to portability. Do you actually "know" that it's…
marcel: You are right. After fixing frequencies to `uint16_t`, this can definitely be a `double`.
  }
Dor1s: since this is a vector, I don't think we need to manually clear it in the destructor
  size_t size() const { return Inputs.size(); }
  size_t SizeInBytes() const {
    size_t Res = 0;
    for (auto II : Inputs)
      Res += II->U.size();
    return Res;
  }
  size_t NumActiveUnits() const {
    size_t Res = 0;
    for (auto II : Inputs)
      Res += !II->U.empty();
    return Res;
  }
  size_t MaxInputSize() const {
    size_t Res = 0;
    for (auto II : Inputs)
      Res = std::max(Res, II->U.size());
    return Res;
  }
  size_t NumInputsThatTouchFocusFunction() {
kcc: Please try to follow the coding style, i.e. capitalize the names: Singletons
Also, the term Singleton may be a bit too overloaded. NumFeaturesCoveredByASingleInput or some such?
    return std::count_if(Inputs.begin(), Inputs.end(), [](const InputInfo *II) {
      return II->HasFocusFunction;
    });
  }
  size_t NumInputsWithDataFlowTrace() {
    return std::count_if(Inputs.begin(), Inputs.end(), [](const InputInfo *II) {
      return !II->DataFlowTraceForFocusFunction.empty();
    });
  }
  bool empty() const { return Inputs.empty(); }
  const Unit &operator[] (size_t Idx) const { return Inputs[Idx]->U; }
  InputInfo *AddToCorpus(const Unit &U, size_t NumFeatures, bool MayDeleteFile,
                         bool HasFocusFunction,
kcc: Lower
                         const Vector<uint32_t> &FeatureSet,
                         const DataFlowTrace &DFT, const InputInfo *BaseII) {
    assert(!U.empty());
    if (FeatureDebug)
      Printf("ADD_TO_CORPUS %zd NF %zd\n", Inputs.size(), NumFeatures);
    Inputs.push_back(new InputInfo());
    InputInfo &II = *Inputs.back();
    II.U = U;
    II.NumFeatures = NumFeatures;
    II.MayDeleteFile = MayDeleteFile;
    II.UniqFeatureSet = FeatureSet;
    II.HasFocusFunction = HasFocusFunction;
    // Assign maximal energy to the new seed.
    II.Energy = RareFeatures.empty() ? 1.0 : log(RareFeatures.size());
Dor1s: nit: seems like `log` should be sufficient, as Energy is `double`, not `long double`
    II.SumIncidence = RareFeatures.size();
    II.NeedsUpdate = false;
    std::sort(II.UniqFeatureSet.begin(), II.UniqFeatureSet.end());
    ComputeSHA1(U.data(), U.size(), II.Sha1);
    auto Sha1Str = Sha1ToString(II.Sha1);
    Hashes.insert(Sha1Str);
    if (HasFocusFunction)
      if (auto V = DFT.Get(Sha1Str))
        II.DataFlowTraceForFocusFunction = *V;
    // This is a gross heuristic.
    // Ideally, when we add an element to a corpus we need to know its DFT.
    // But if we don't, we'll use the DFT of its base input.
    if (II.DataFlowTraceForFocusFunction.empty() && BaseII)
      II.DataFlowTraceForFocusFunction = BaseII->DataFlowTraceForFocusFunction;
    DistributionNeedsUpdate = true;
    PrintCorpus();
    // ValidateFeatureSet();
    return &II;
  }
  // Debug-only
  void PrintUnit(const Unit &U) {
    if (!FeatureDebug) return;
Show All 34 Lines
public:
  void Replace(InputInfo *II, const Unit &U) {
    assert(II->U.size() > U.size());
    Hashes.erase(Sha1ToString(II->Sha1));
    DeleteFile(*II);
    ComputeSHA1(U.data(), U.size(), II->Sha1);
    Hashes.insert(Sha1ToString(II->Sha1));
    II->U = U;
    II->Reduced = true;
    DistributionNeedsUpdate = true;
  }
  bool HasUnit(const Unit &U) { return Hashes.count(Hash(U)); }
  bool HasUnit(const std::string &H) { return Hashes.count(H); }
  InputInfo &ChooseUnitToMutate(Random &Rand) {
    UpdateCorpusDistribution(Rand);
    InputInfo &II = *Inputs[ChooseUnitIdxToMutate(Rand)];
    assert(!II.U.empty());
    return II;
  }
  // Returns an index of random unit from the corpus to mutate.
  size_t ChooseUnitIdxToMutate(Random &Rand) {
    size_t Idx = static_cast<size_t>(CorpusDistribution(Rand));
Show All 26 Lines
  void DeleteFile(const InputInfo &II) {
    if (!OutputCorpus.empty() && II.MayDeleteFile)
      RemoveFile(DirPlusFile(OutputCorpus, Sha1ToString(II.Sha1)));
  }
  void DeleteInput(size_t Idx) {
    InputInfo &II = *Inputs[Idx];
    DeleteFile(II);
    Unit().swap(II.U);
    II.Energy = 0.0;
    II.NeedsUpdate = false;
    DistributionNeedsUpdate = true;
    if (FeatureDebug)
      Printf("EVICTED %zd\n", Idx);
  }
  bool DeleteFeatureFreq(InputInfo *II, uint32_t Idx) {
vitalybuka: DeleteFeatureFreq -> InputInfo::DeleteFeatureFreq
marcel: Moved directly into the InputInfo struct.
    if (II->FeatureFreqs.empty())
      return false;
    // Binary search over local feature frequencies sorted by index.
    auto lower =
        std::lower_bound(II->FeatureFreqs.begin(), II->FeatureFreqs.end(),
kcc: try not to use constants like this. At the very least, use const size_t kSomething = 100; .. if (xxx.size() > kSomething). If there is any reason to play with different value, you may want to use a flag instead. Similar for 0xFF. Probably, the best here is to replace these two constants with function parameters so that this function becomes unit-testable.
marcel: We will pull these constants out as command line options.
                         std::pair<uint32_t, uint16_t>(Idx, 0));
    if (lower != II->FeatureFreqs.end() && lower->first == Idx) {
      II->FeatureFreqs.erase(lower);
      return true;
    }
    return false;
  }
  void AddRareFeature(uint32_t Idx) {
    // Maintain *at least* TopXRarestFeatures many rare features
    // and all features with a frequency below ConsideredRare.
    // Remove all other features.
    while (RareFeatures.size() > TopXRarestFeatures &&
           FreqOfMostAbundantRareFeature > ConsideredRare) {
      uint32_t ST_mostAbundantRareFeatureIdx =
vitalybuka: uint32_t MostAbundantRareFeatureIdx[2] = {} or just MostAbundantRareFeatureIdx1…
          RareFeatures[0]; // 1st most abd feature index
      uint32_t ND_mostAbundantRareFeatureIdx =
          RareFeatures[0]; // 2nd most abd feature index
      // Find most and second most abundant feature.
Dor1s: nit: I'd rather use `auto` or `decltype(RareFeatures)::value_type` to avoid type mismatch if we ever change RareFeatures definition and forget to change the type here
      for (uint32_t Idx2 : RareFeatures) {
        if (GlobalFeatureFreqs[Idx2] >=
            GlobalFeatureFreqs[ST_mostAbundantRareFeatureIdx]) {
          ND_mostAbundantRareFeatureIdx = ST_mostAbundantRareFeatureIdx;
          ST_mostAbundantRareFeatureIdx = Idx2;
        }
      }
      // Remove most abundant rare feature.
Dor1s: assuming this code gets executed quite often, and the order inside `RareFeatures` isn't important, we can avoid erase-remove and do something like:
RareFeatures[index_from_the_loop] = RareFeatures.back();
RareFeatures.resize(RareFeatures.size() - 1);
but the loop on line 269 would have to use index in the vector (from 1 to < RareFeatures.size()) instead of the iterator
feel free to ignore though, it's just a suggestion which may or may not be a good one :)
marcel: With the subsequent push_back (Line 292), do you mean a swap and pop_back here?
Dor1s: yes, `swap` and `pop_back` would have the same effect
marcel: Implemented your swap and pop_back idea. Cheers!
      RareFeatures.erase(remove(RareFeatures.begin(), RareFeatures.end(),
                                ST_mostAbundantRareFeatureIdx),
                         RareFeatures.end());
      for (auto II : Inputs) {
        if (DeleteFeatureFreq(II, ST_mostAbundantRareFeatureIdx))
          II->NeedsUpdate = true;
      }
      // Set 2nd most abundant as the new most abundant feature count.
      FreqOfMostAbundantRareFeature =
          GlobalFeatureFreqs[ND_mostAbundantRareFeatureIdx];
    }
    // Add rare feature, handle collisions, and update energy.
Dor1s: since we always do this, `resize()` in my suggestion above might not be even needed, but that's a minor thing
    RareFeatures.push_back(Idx);
    GlobalFeatureFreqs[Idx] = 0;
    for (auto II : Inputs) {
      DeleteFeatureFreq(II, Idx);
      // Apply add-one smoothing to this locally undiscovered feature.
      // Zero energy seeds will never be fuzzed and remain zero energy.
Dor1s: can this and the loop on the line 294 be combined?
      if (II->Energy > 0.0) {
Dor1s: please consider adding a comment why we skip inputs with zero energy
        II->SumIncidence += 1;
        II->Energy += logl(II->SumIncidence) / II->SumIncidence;
      }
    }
    DistributionNeedsUpdate = true;
  }
  bool AddFeature(size_t Idx, uint32_t NewSize, bool Shrink) {
    assert(NewSize);
    Idx = Idx % kFeatureSetSize;
    uint32_t OldSize = GetFeature(Idx);
    if (OldSize == 0 || (Shrink && OldSize > NewSize)) {
      if (OldSize > 0) {
        size_t OldIdx = SmallestElementPerFeature[Idx];
        InputInfo &II = *Inputs[OldIdx];
        assert(II.NumFeatures > 0);
        II.NumFeatures--;
        if (II.NumFeatures == 0)
          DeleteInput(OldIdx);
      } else {
        NumAddedFeatures++;
        if (Entropic)
          AddRareFeature((uint32_t)Idx);
      }
      NumUpdatedFeatures++;
      if (FeatureDebug)
        Printf("ADD FEATURE %zd sz %d\n", Idx, NewSize);
      SmallestElementPerFeature[Idx] = Inputs.size();
      InputSizesPerFeature[Idx] = NewSize;
      return true;
    }
    return false;
  }
  void UpdateFeatureFrequency(InputInfo *II, size_t Idx) {
kcc: This is the hotspot, right? My guess is that using a hash map here causes most of the slowdown…
marcel: Yes. This is the hot spot. However, most executions should exit in Line 337. FeaturesFreqMap is…
vitalybuka: InputInfo::UpdateFeatureFrequency
    uint32_t Idx32 = Idx % kFeatureSetSize;
    // Saturated increment.
    if (GlobalFeatureFreqs[Idx32] == 0xFFFF)
      return;
    uint16_t Freq = GlobalFeatureFreqs[Idx32]++;
    // Skip if abundant.
    if (Freq > FreqOfMostAbundantRareFeature ||
        std::find(RareFeatures.begin(), RareFeatures.end(), Idx32) ==
            RareFeatures.end())
      return;
    // Update global frequencies.
    if (Freq == FreqOfMostAbundantRareFeature)
      FreqOfMostAbundantRareFeature++;
    // Update local frequencies.
    if (!II)
Dor1s: invert the condition to reduce indentation: if (!II) return;
      return;
Dor1s: can do return after this line and avoid `else` with extra indentation below
Dor1s: what is `1` here? would it make sense to have a static const variable with a descriptive name, e.g. kDefaultSomething?
marcel: We are setting the frequency of Idx32 to 1. Adding a comment.
    // The local feature frequencies is an ordered vector of pairs.
    // If there are no local feature frequencies, push_back preserves order.
    // Set the feature frequency for feature Idx32 to 1.
    if (II->FeatureFreqs.empty()) {
      II->FeatureFreqs.push_back(std::pair<uint32_t, uint16_t>(Idx32, 1));
Dor1s: same point, what is `0`? a constant with a descriptive name or (less preferred) comment would be really helpful for others who come to read the code in future
marcel: Zero (0) is the default value for lower_bound as binary search over a vector of pairs. In this case, any value would work.
      II->NeedsUpdate = true;
      return;
    }
    // Binary search over local feature frequencies sorted by index.
    auto lower =
        std::lower_bound(II->FeatureFreqs.begin(), II->FeatureFreqs.end(),
                         std::pair<uint32_t, uint16_t>(Idx32, 0));
    // If feature Idx32 already exists, increment its frequency.
    // Otherwise, insert a new pair right after the next lower index.
    if (lower != II->FeatureFreqs.end() && lower->first == Idx32) {
      lower->second++;
    } else {
      II->FeatureFreqs.insert(lower, std::pair<uint32_t, uint16_t>(Idx32, 1));
    }
    II->NeedsUpdate = true;
  }
Dor1s: From the code below it seems like `Energy` represents entropy and the max value is 0, which we reduce depending on the actual feature frequencies. Is that correct understanding?
marcel: Yes, we estimate the entropy over the probabilities of the features in the neighborhood of the seed. Entropy is positive. The maximum entropy is logl(GlobalNumberOfFeatures).
Dor1s: sorry, I don't understand. Below are the code lines changing `Energy` value:
    II->Energy = 0.0;
    II->SumIncidence = 0;
    // Apply add-one smoothing to locally discovered features.
    for (auto F : II->FeatureFreqs) {
      size_t LocalIncidence = F.second + 1;
      Energy -= LocalIncidence * logl(LocalIncidence);
      SumIncidence += LocalIncidence;
    }
    <...>
    // Add a single locally abundant feature apply add-one smoothing.
    size_t AbdIncidence = II->NumExecutedMutations + 1;
    Energy -= AbdIncidence * logl(AbdIncidence);
    <...>
    // Normalize.
    if (SumIncidence != 0)
      Energy = (Energy / SumIncidence) + logl(SumIncidence);
    II->Energy = (double)Energy;
as I read this, I see that Energy should be negative in many cases?
marcel: Sorry for the brevity. This is why `II->Energy` is positive. Entropy is computed as $-\sum_{i=1}^S p_i \log(p_i)$, where $p_i$ is the probability that fuzzing II generates an input that exercises feature $i$, and $S$ is the total number of rare features. We could estimate the probability $p_i$ as the proportion of generated inputs that exercise $i$, i.e., $\hat p_i = LocalIncidence_i / SumIncidence$. If you plug this proportion into the formula for entropy, you can compute entropy as $[-\sum_{i=1}^S LocalIncidence_i \log(LocalIncidence_i)] / SumIncidence + \log(SumIncidence)$. While Energy is certainly negative before `// Normalize.`, it is positive after. Just drop me a DM. I'll send you the write up.
  // Assign more energy to a high-entropy seed, i.e., one that reveals more
  // information about the globally rare features in the neighborhood
  // of the seed. Since we do not know the entropy of a seed that has
vitalybuka: It would be nice to rename variables in such a way that a reader without background can understand what is going on.
  // never been executed we assign fresh seeds maximum entropy and
  // let II->Energy approach the true entropy from above.
  void UpdateEnergy(InputInfo *II, size_t GlobalNumberOfFeatures) {
vitalybuka: InputInfo::UpdateEnergy
    long double Energy = 0.0L;
vitalybuka: "long double" is still there?
marcel: Yes, keeping maximum precision during the processing to minimize the cumulative FP arithmetic error, and downcast to double once the processing is done.
    size_t SumIncidence = 0;
    II->Energy = 0.0;
    II->SumIncidence = 0;
vitalybuka: please don't reuse variables like Y here; just declare as close as possible to first use, or…
    // Apply add-one smoothing to locally discovered features.
    for (auto F : II->FeatureFreqs) {
      size_t LocalIncidence = F.second + 1;
      Energy -= LocalIncidence * logl(LocalIncidence);
      SumIncidence += LocalIncidence;
    }
    // Apply add-one smoothing to locally undiscovered features.
    // Energy -= 0; // since logl(1.0) == 0)
    SumIncidence += (GlobalNumberOfFeatures - II->FeatureFreqs.size());
    // Add a single locally abundant feature apply add-one smoothing.
    size_t AbdIncidence = II->NumExecutedMutations + 1;
    Energy -= AbdIncidence * logl(AbdIncidence);
    SumIncidence += AbdIncidence;
    // Normalize.
    if (SumIncidence != 0)
      Energy = (Energy / SumIncidence) + logl(SumIncidence);
    II->Energy = (double)Energy;
    II->SumIncidence = SumIncidence;
  }
  size_t NumFeatures() const { return NumAddedFeatures; }
  size_t NumFeatureUpdates() const { return NumUpdatedFeatures; }

private:
  static const bool FeatureDebug = false;
  size_t GetFeature(size_t Idx) const { return InputSizesPerFeature[Idx]; }
  void ValidateFeatureSet() {
    if (FeatureDebug)
      PrintFeatureSet();
    for (size_t Idx = 0; Idx < kFeatureSetSize; Idx++)
      if (GetFeature(Idx))
        Inputs[SmallestElementPerFeature[Idx]]->Tmp++;
    for (auto II: Inputs) {
      if (II->Tmp != II->NumFeatures)
        Printf("ZZZ %zd %zd\n", II->Tmp, II->NumFeatures);
      assert(II->Tmp == II->NumFeatures);
      II->Tmp = 0;
    }
  }
  // Updates the probability distribution for the units in the corpus.
kcc: a top-level comment explaining the computations would be nice
  // Must be called whenever the corpus or unit weights are changed.
  //
  // Hypothesis: inputs that maximize information about globally rare features
  // are interesting.
  void UpdateCorpusDistribution(Random &Rand) {
    // Skip update if no seeds or rare features were added/deleted.
    // Sparse updates for local change of feature frequencies,
kcc: use () just in case, replace 10000 with kSomething
don't use random(), pass (Random &Rand)…
    // i.e., randomly do not skip.
    if (!DistributionNeedsUpdate && (!Entropic || Rand(kSparseEnergyUpdates)))
Dor1s: it seems like the `Rand(kSparseEnergyUpdates)` clause is applicable to the `Entropic` case only, is that correct? Do we really need it in the vanilla case?
marcel: Yes. `kSparseEnergyUpdates` should apply only for Entropic.
      return;
    DistributionNeedsUpdate = false;
kcc: why not for (auto It : I->RareFeaturesFreqMap)?
    size_t N = Inputs.size();
    assert(N);
    Intervals.resize(N + 1);
    Weights.resize(N);
    std::iota(Intervals.begin(), Intervals.end(), 0);
    bool VanillaSchedule = true;
    if (Entropic) {
      // If aggressive, assign zero energy to seeds with below average entropy
      // and normalize.
kcc: don't use libc's rand(), pass (Random &Rand) instead
      bool AggressiveSchedule = Rand(100) < kProbAgressiveSchedule;
      double AvgEnergy = 0;
      for (auto II : Inputs) {
        if (II->NeedsUpdate && II->Energy != 0.0) {
          UpdateEnergy(II, RareFeatures.size());
          II->NeedsUpdate = false;
        }
        AvgEnergy += II->Energy * (long double)II->NumExecutedMutations /
                     NumExecutedMutations;
      }
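The AvgEnergy accumulation above is a weighted average: each seed's energy counts in proportion to its share of all executed mutations. A standalone sketch with made-up numbers (Seed here is a hypothetical stand-in for InputInfo, not the real type):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal seed record for illustration only.
struct Seed {
  double Energy;
  size_t NumExecutedMutations;
};

// Mirrors the AvgEnergy loop: energy weighted by the seed's fraction of
// the corpus-wide mutation count.
double WeightedAvgEnergy(const std::vector<Seed> &Seeds,
                         size_t TotalMutations) {
  double Avg = 0;
  for (const Seed &S : Seeds)
    Avg += S.Energy * (double)S.NumExecutedMutations / TotalMutations;
  return Avg;
}
```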
      for (size_t i = 0; i < N; i++) {
Dor1s: could you please rewrite this in a more readable if-else form?
        if (Inputs[i]->NumFeatures == 0) {
          // If the seed doesn't represent any features, assign zero energy.
Dor1s: why 20? a constant with a descriptive name or a comment would be appreciated
          Weights[i] = 0.;
        } else if (Inputs[i]->NumExecutedMutations / kMaxMutationFactor >
                   NumExecutedMutations / Inputs.size()) {
          // If the seed was fuzzed a lot more than average, assign zero energy.
          Weights[i] = 0.;
        } else if (AggressiveSchedule) {
          if (Inputs[i]->Energy < AvgEnergy) {
            // If the seed has below average entropy, assign zero energy.
            Weights[i] = 0.;
          } else {
            // Otherwise subtract the average entropy to normalize.
            Weights[i] = Inputs[i]->Energy - AvgEnergy;
          }
Dor1s: so if there is at least one input that touches the focus function, we will always be wasting time in this loop (starting on line 472) and then falling back to the VanillaSchedule case? In that case I think we should just check Options.FocusFunction and use the vanilla schedule if it's set, just because almost always there will be input(s) touching the focus function.
        } else {
          // If not aggressive, simply assign the computed energy.
          Weights[i] = Inputs[i]->Energy;
        }
        // If energy for all seeds is zero, fall back to vanilla schedule.
        if (Weights[i] > 0.0)
          VanillaSchedule = false;
      }
    }
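The if/else chain above can be read as a pure weight rule per seed. A sketch of it as a standalone function, assuming kMaxMutationFactor = 20 (the unnamed constant the review asks about); all other names are hypothetical parameters, not the real InputInfo fields:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the per-seed weight rule. kMaxMutationFactor = 20 is an
// assumed value; the real constant lives in the patch under review.
double SeedWeight(size_t NumFeatures, size_t SeedMutations, double Energy,
                  size_t TotalMutations, size_t NumSeeds,
                  bool AggressiveSchedule, double AvgEnergy) {
  const size_t kMaxMutationFactor = 20;
  if (NumFeatures == 0)
    return 0.; // No features: never schedule.
  if (SeedMutations / kMaxMutationFactor > TotalMutations / NumSeeds)
    return 0.; // Fuzzed far more than average: retire.
  if (AggressiveSchedule)
    // Below-average entropy gets zero; otherwise normalize by subtraction.
    return Energy < AvgEnergy ? 0. : Energy - AvgEnergy;
  return Energy; // Plain entropic schedule.
}
```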
    if (VanillaSchedule) {
      for (size_t i = 0; i < N; i++)
        Weights[i] = Inputs[i]->NumFeatures
                         ? (i + 1) * (Inputs[i]->HasFocusFunction ? 1000 : 1)
                         : 0.;
    }
    if (FeatureDebug) {
      for (size_t i = 0; i < N; i++)
        Printf("%zd ", Inputs[i]->NumFeatures);
      Printf("SCORE\n");
      for (size_t i = 0; i < N; i++)
        Printf("%f ", Weights[i]);
      Printf("Weights\n");
    }
    CorpusDistribution = std::piecewise_constant_distribution<double>(
        Intervals.begin(), Intervals.end(), Weights.begin());
  }
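The distribution built here maps a uniform draw to an input index with probability proportional to its weight: intervals 0..N, one constant weight per input, and the integer part of the drawn value is the index. A self-contained sketch with toy weights (not the real corpus data):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Draws NumDraws indices from a weighted set, built the same way as the
// distribution above: boundaries 0..N via iota, one weight per interval.
std::vector<int> DrawCounts(const std::vector<double> &Weights,
                            size_t NumDraws) {
  std::vector<double> Intervals(Weights.size() + 1);
  std::iota(Intervals.begin(), Intervals.end(), 0);
  std::piecewise_constant_distribution<double> Dist(
      Intervals.begin(), Intervals.end(), Weights.begin());
  std::mt19937 Gen(0); // Fixed seed keeps the sketch reproducible.
  std::vector<int> Counts(Weights.size(), 0);
  for (size_t i = 0; i < NumDraws; i++)
    Counts[(size_t)Dist(Gen)]++; // Integer part of the draw = index.
  return Counts;
}
```

A zero-weight interval has zero density, so that index is never drawn; this is what lets the schedule effectively retire seeds by assigning them weight 0.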
  std::piecewise_constant_distribution<double> CorpusDistribution;
  Vector<double> Intervals;
  Vector<double> Weights;
  std::unordered_set<std::string> Hashes;
  Vector<InputInfo*> Inputs;
  size_t NumAddedFeatures = 0;
  size_t NumUpdatedFeatures = 0;
  uint32_t InputSizesPerFeature[kFeatureSetSize];
  uint32_t SmallestElementPerFeature[kFeatureSetSize];
  bool DistributionNeedsUpdate = true;
  uint16_t FreqOfMostAbundantRareFeature = 0;
  uint16_t GlobalFeatureFreqs[kFeatureSetSize] = {};
kcc: does this have to be 8 bytes per feature?
marcel: Each entry is upper-bounded by the total number of generated inputs g_NumExecutedMutations, which likely won't fit into uint32_t. That's why I chose size_t. However, we really only need abundance information for features with an abundance below MostAbundant_RareFeature. If memory footprint is a concern, we can go down to uint16_t at the cost of an overflow check in the hot code in UpdateFeatureFrequency. See top-level comment for more details.
kcc: Yea, I'd prefer uint16_t and a saturated add. Just to save some RAM.
vitalybuka: uint16_t GlobalFeatureFreqs[kFeatureSetSize] = {}; instead of memsets; it would be nice to do…
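The "uint16_t and a saturated add" suggestion amounts to a frequency counter that clamps at its maximum instead of wrapping back to zero on overflow. A one-function sketch of that overflow check:

```cpp
#include <cassert>
#include <cstdint>

// Saturating increment for a 16-bit frequency counter: sticks at
// UINT16_MAX rather than wrapping, so a hot feature's count stays
// "very frequent" instead of suddenly looking rare again.
inline uint16_t SaturatingIncrement(uint16_t Freq) {
  return Freq == UINT16_MAX ? Freq : static_cast<uint16_t>(Freq + 1);
}
```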
  Vector<uint32_t> RareFeatures;
  std::string OutputCorpus;
};
} // namespace fuzzer
#endif // LLVM_FUZZER_CORPUS
kcc: this is new in the patch, is it? While I completely understand why we'd want to use execution time as a signal for weights, it makes the fuzzing process non-reproducible with a given seed, which I consider pretty bad. If we used 32- or 64-bit edge counters we could have substituted them for time, but alas, we use 8-bit ones.