After testing D147740 with multiple industrial projects with ~10 million FunctionSamples, no MD5 collision has been found. 
In perfect hashing, the probability of collision for N symbols over K possible hash value is 1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the probability is 3*10^-8, when N is 10 million the probability is 3*10^-6, so we are probably not going to find an actual case in real world application. (However if K is 2^32, the probability of collision is almost 1, this is indeed a problem, if anyone still use a large profile on 32-bit machine, as hash_code is tied to size_t). 
Furthermore, when a collision happens we can't do anything to recover it, unless using a multi-map, but that is significantly slower, which contradicts the purpose of optimizing the profile reader. 
One more thing, since we have been using profiles with MD5 names, and they have to be coming from non-MD5 sources, so if hash collision is to happen, it already happened when we convert a non-MD5 profile to a MD5 one, so there's no point to check for that in the reader, and this feature can be removed.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
| Time | Test | |
|---|---|---|
| 50 ms | x64 debian > Flang.Driver::pic-flags.f90 | |
| 60,030 ms | x64 debian > MLIR.Examples/standalone::test.toy | 
Event Timeline
In SampleProf.h line 1423 1428 1448, added fully qualified class name to the template template argument. Not sure if this is supposed to be the case, or is it MSVC's bug?
Copying comments from the original patch for continuation..
In D147740#4443233, @huangjd wrote:
Actually do we really care about MD5 collision? ExtBinary format already ignored MD5 collision for regular string names (and therefore regular function profiles), as only one of two functions with colliding MD5 get written to the name table (and the other is therefore lost). If we are using CS profiles, since different CS profiles have different serializations, their > hashes are distributed as expected. The most important thing is that, even if we detect a hash collision, we can't do anything about it except logging it (using a multi-map makes the reader much slower), so I think the MD5 collision check should be marked as LLVM_DEBUG. This does reduce 0.5 second out of ~30 seconds (1.67%) over the 1 GB profile read .
I don't think we care. Is the new type HashKeyMap and SampleProfileMap all for detecting and reporting collision? I'd avoid all that complexity and prefer a simple DenseMap + a SampleContext->hash_code converter and not even bother with debug prints for collision...
I will add a separate patch to remove the hash collision check. It would be noteworthy if users reports collision in actual use cases, and if that doesn't happen in a while the hash collision check can be removed. Even the probability of collision is very small for one single program but given that LLVM is used everywhere so we can't be so sure. 
The new wrapper type also serves another purpose that existing code can use emplace, find, erase, etc without changing in most cases, and I am planning to have CallTargetMap and FunctionSamplesMap using that as well.
I don't think there's much we can do still, even if there's collision report. Maybe we can revert back to use full name instead of MD5, if that turns out to be a problem. But as you mentioned we already support MD5 in profile generation.
The new wrapper type also serves another purpose that existing code can use emplace, find, erase, etc without changing in most cases,
Can you elaborate? Is that critical or just for convenience?
and I am planning to have CallTargetMap and FunctionSamplesMap using that as well.
FWIW, for CallTargetMap, we experimented with changing from StringMap (which is bad since it owns copies of strings) to DenseMap<StringRef, uint64_t>, and it regressed memory usage noticeably. The problem is DenseMap grows to 64 entries on first insertion, but CallTargetMap is usually very sparse with zero or a few entries. FunctionSamplesMap is likely sparse too.
Sorry for being late on patch review. I should have given feedback on the original patch that it's preferable to keep things simple. Unless I'm missing something, the custom HashKeyMap/SampleProfileMap feels unjustified for added complexity.
SampleProfileMap is a simple wrapper that can probably introduce interfaces to ensure the right insertion policy, the HashKeyMap does look like a little heavy weight and may deserve some simplification (i.e. with assumption that collision is a non-issue).
I don't think there's much we can do still, even if there's collision report. Maybe we can revert back to use full name instead of MD5, if that turns out to be a problem. But as you mentioned we already support MD5 in profile generation.
If a collision is rare while there is a strong demand for correctness by users, we can add some ad-hoc logic to deal with that, which is still faster than using a multimap, and much faster than using the full function name. If collision happens frequently, then we have no choice but to revert back to use full name, but this is extremely unlikely. I couldn't find any research on partial MD5 (we only use 64-bit of it) collision using ASCII strings, so that's why I am not ruling out this out of caution.
Can you elaborate? Is that critical or just for convenience?
It's to enforce OOP practice. The optimization passes see SampleProfileMap, which is supposed to map Functions to SampleProfiles. It should not care how keys are actually represented, and it's much more cleaner to write profiles.find(FuncName) than profiles.find(MD5Hash(FuncName)). The latter has another potential risk of misuse, because there already exist multiple ways to get a string's hash value: MD5Hash, llvm::hash_value, std::hash, and they are all different, we have to make sure the correct hash function is used at all times.
FWIW, for CallTargetMap, we experimented with changing from StringMap (which is bad since it owns copies of strings) to DenseMap<StringRef, uint64_t>, and it regressed memory usage noticeably. The problem is DenseMap grows to 64 entries on first insertion, but CallTargetMap is usually very sparse with zero or a few entries. FunctionSamplesMap is likely sparse too.
Yes, I am aware of that. I am planning to change it to HashKeyMap<std::map, StringRef, uint64_t> as well (or unordered_map, depends on which one is more beneficial. After that I also have a plan to use specialized data structure for CallTargetMap and FunctionSamplesMap because it's true that they have zero or one entries for almost all the cases.
I'm still not convinced that we need to detect or report collision, It's not a correctness issue when collision happens - one of the function will just lose its profile. It's going to be very rare for function name to hit collision (we're not talking about hashing the entire function which is mapping a much bigger universe into an int), and it's going to be even less likely for a collision to cause noticeable perf loss. That said, if there's concern around collision, maybe we should verify this assumption by building large code base and see whether/how often it happens as a one-off testing for this change, instead of building collision detection into the final implementation (I doubt we would actually get any report even if it happens, if the report is debug only).
Can you elaborate? Is that critical or just for convenience?
It's to enforce OOP practice. The optimization passes see SampleProfileMap, which is supposed to map Functions to SampleProfiles. It should not care how keys are actually represented, and it's much more cleaner to write profiles.find(FuncName) than profiles.find(MD5Hash(FuncName)). The latter has another potential risk of misuse, because there already exist multiple ways to get a string's hash value: MD5Hash, llvm::hash_value, std::hash, and they are all different, we have to make sure the correct hash function is used at all times.
Ok, I think this is fair - hiding hashing details behind the scene is reasonable. But if that's the intention, SampleProfileMap can be a very thin wrapper on top of existing container.
FWIW, for CallTargetMap, we experimented with changing from StringMap (which is bad since it owns copies of strings) to DenseMap<StringRef, uint64_t>, and it regressed memory usage noticeably. The problem is DenseMap grows to 64 entries on first insertion, but CallTargetMap is usually very sparse with zero or a few entries. FunctionSamplesMap is likely sparse too.
Yes, I am aware of that. I am planning to change it to HashKeyMap<std::map, StringRef, uint64_t> as well (or unordered_map, depends on which one is more beneficial. After that I also have a plan to use specialized data structure for CallTargetMap and FunctionSamplesMap because it's true that they have zero or one entries for almost all the cases.
Do you plan to change CallTargetMap to use MD5 as key as well? This will be different from SampleProfileMap in the sense that SampleProfileMap has function name as part of its value (function samples), but for CallTargetMap, its value is just a count, so if we change its key to MD5, we won't be able to recover "original" key from its value for debugging etc.
| llvm/include/llvm/ProfileData/SampleProf.h | ||
|---|---|---|
| 1308 | fixed mistake in comment | |
fixed mistake in comment