std::hash returns a 64bit hash code while previously we were using only lower 32 bits which caused hash collision for large workloads.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
llvm/lib/Transforms/IPO/SampleContextTracker.cpp | ||
---|---|---|
183–185 | This nodeHash is the blend function similar to DJB and it's assuming 32-bit length, see the shift 16. If we are fully switching to 64 bit hash, we should change the blend function too. However, does the collision come from the blend function (nodeHash), or lower part of std::hash? I guess it's the later, then I'm curious what strings actually lead to collision on the lower 32 bit? |
llvm/lib/Transforms/IPO/SampleContextTracker.cpp | ||
---|---|---|
183–185 | Good point! Is (Callsite.LineOffset << 32) | Callsite.Discriminator; a good form for 64-bit? Yes, the collision came from a colliding lower-32-bit of std::hash Same lower 32-bit: 2086825821 Name1: _ZNSt8__detail15_Hash_code_baseINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS6_PmENS_10_Select1stESt4hashIS6_ENS_18_Mod_range_hashingENS_20_Default_ranged_hashELb1EE10_M_extractEv _ Name2: ZSt12__niter_baseIPN4onnx3UseESt6vectorIS1_SaIS1_EEET_N9__gnu_cxx17__normal_iteratorIS6_T0_EE |
llvm/lib/Transforms/IPO/SampleContextTracker.cpp | ||
---|---|---|
183–185 |
I think that should be good. Though I'd be surprised if we really need 64-bit hash to avoid collision. We just want to avoid collision among all callees of a function, which is a small space. Not sure what's the hash function used by std::hash, but perhaps that is not a hash with great distribution when truncated. Can you try djbHash from llvm-project/llvm/include/llvm/Support/DJB.h to see if we can make it work without increasing to 64-bit? |
llvm/lib/Transforms/IPO/SampleContextTracker.cpp | ||
---|---|---|
183–185 | Looks like 32-bit hash isn't working well: uint64_t NameHash1 = djbHash("_ZN8facebook18storage_foundation15detail_gen_" "avx226t10x1_95cc5b80e2a6436ca08eEPKPKhPKPhm"); uint64_t NameHash2 = djbHash( "_ZSt4moveIRPN5caffe5LayerIdEEEONSt16remove_referenceIT_E4typeEOS6_"); outs() << NameHash1 << " " << NameHash2 << "\n"; output: 309151195 309151195 |
nit: the convention is to use auto when full type name is too long. uint64_t is an unambiguous case for explicit type. same for other places.