[pdb] Raise out the hash table used in NameMap to its own class. This data structure is used in other places in the PDB code, one of which we have immediate need for in the TPI stream. The data structure is the exact same as that used by the NameMap. It is not really so much of a hash table as it is a linear dictionary where items can be deleted and then later re-used. In raising this out, I found some bugs in the original implementation, particularly regarding the deleted and present bit vectors whose values were improperly being ignored upon serialization. In the Microsoft implementation, this is a template class where key and value can be arbitrary types. However, in all uses which seem important to serialize, the key and value are 32-bit integers, so to simplify things I've called this class LinearIntegerMap. We can re-generalize later if we find it necessary to do so.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
llvm/include/llvm/DebugInfo/PDB/Raw/LinearIntegerMap.h | ||
---|---|---|
8 ↗ | (On Diff #84409) | It doesn't have to be detailed, but this needs a file comment to describe what this is. |
69–70 ↗ | (On Diff #84409) | You can initialize them by = {0};, no? |
llvm/lib/DebugInfo/PDB/Raw/LinearIntegerMap.cpp | ||
22 ↗ | (On Diff #84409) | What does HAH stand for? |
32–33 ↗ | (On Diff #84409) | You are extending these vectors unconditionally. What was the point of resize(1) in the ctor? |
67–70 ↗ | (On Diff #84409) | Does this actually find largest numbers? |
153–154 ↗ | (On Diff #84409) | Remove this. |
188 ↗ | (On Diff #84409) | This needs a comment. |
llvm/lib/DebugInfo/PDB/Raw/LinearIntegerMap.cpp | ||
---|---|---|
22 ↗ | (On Diff #84409) | Left over from old code where it was more obvious. I can fix this. |
32–33 ↗ | (On Diff #84409) | You can either create one of these and manipulate it in memory then serialize it, or you can load one from a serialized format. If you create one from scratch, the capacity needs to be at least 1 so that you can write an item into it. |
67–70 ↗ | (On Diff #84409) | Yes since it's a bit vector, the values you get from iterating over it are the indices that contain set bits. It iterates over the indices which contain 1 bits, in ascending order. So the last value should be the index of the largest set bit. |
188 ↗ | (On Diff #84409) | Good point. FWIW this comes from the Microsoft code, where they want to maintain < 2/3 load factor (size / capacity ratio) |
llvm/include/llvm/DebugInfo/PDB/Raw/LinearIntegerMap.h | ||
---|---|---|
59 ↗ | (On Diff #84409) | Capacity ? |
79 ↗ | (On Diff #84409) | = default; ? |
llvm/lib/DebugInfo/PDB/Raw/LinearIntegerMap.cpp | ||
187–188 ↗ | (On Diff #84409) | Capacity ? Also, I'd add parens. |
194 ↗ | (On Diff #84409) | capacity() * 3 can overflow, should there be a std::max with capacity()? Maybe std::max(capacity() * 3 / 2 + 1, capacity() + 1) ? |
There are some intricacies of the data structure that I didn't properly handle with regards to growing and some interactions between deleted and empty entries.
AFAICT this is basically an open address hash table with lazy deletion where the hash function is the identity. I need to re-work this to properly handle rehashing and lazy deletion.
This patch is entirely re-worked. Support for Present and Deleted fields is correctly implemented, linear probing is correctly implemented, growing is correctly implemented, and some unit tests are added to verify all of this.
llvm/lib/DebugInfo/PDB/Raw/HashTable.cpp | ||
---|---|---|
189 ↗ | (On Diff #84745) | Should we assert that NewCapacity is bigger than capacity()? |
llvm/lib/DebugInfo/PDB/Raw/HashTable.cpp | ||
---|---|---|
189 ↗ | (On Diff #84745) | Forgot about this, I'm using your suggestion of std::max from the previous comment, will upload a fix and some other minor cleanups shortly. |
Fix an unchecked overflow in HashTable::grow() and fix some small typos and cleanups.