This is an archive of the discontinued LLVM Phabricator instance.

[PDB] Hash types up front when merging types instead of using StringMap
ClosedPublic

Authored by rnk on May 22 2017, 6:33 PM.

Details

Summary

First, StringMap uses llvm::HashString, which is only good for short
identifiers and really bad for large blobs of binary data like type
records. Moving to DenseMap<StringRef, TypeIndex> with some tricks for
memory allocation fixes that.

Unfortunately, that didn't buy very much performance. Profiling showed
that we spend a long time during DenseMap growth rehashing existing
entries. Also, in general, DenseMap is faster when the keys are small.
This change takes that to the logical conclusion by introducing a small
wrapper value type around a pointer to key data. The key data contains a
precomputed hash, the original record data (pointer and size), and the
type index, which is the "value" of our original map.

This reduces the time to produce llvm-as.exe and llvm-as.pdb from ~15s
on my machine to 3.5s, which is about a 4x improvement.

Diff Detail

Repository
rL LLVM

Event Timeline

rnk created this revision.May 22 2017, 6:33 PM
zturner added inline comments.May 22 2017, 8:04 PM
llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp
22 ↗(On Diff #99840)

hash_code?

23–24 ↗(On Diff #99840)

ArrayRef<uint8_t> or StringRef?

25 ↗(On Diff #99840)

TypeIndex?

53–54 ↗(On Diff #99840)

I haven't looked at the implementation of DenseMap, but is this check necessary? I would imagine you could assert that the hashes are different, otherwise why would DenseMap be calling this function?

59–60 ↗(On Diff #99840)

Is this an important consideration? Seems like it sacrifices readability

98–99 ↗(On Diff #99840)

Why not xxHash64? hash_value appears to operate on one char at a time.

rnk added a subscriber: chandlerc.May 23 2017, 11:19 AM
rnk added inline comments.
llvm/lib/DebugInfo/CodeView/TypeSerializer.cpp
23–24 ↗(On Diff #99840)

This saves 4 bytes. We know type records are always shorter than 0xFF00 bytes, so it's safe to go all the way to uint16_t as the comment suggests.

25 ↗(On Diff #99840)

Yeah, we can do that. I kind of like unsigned or uint32_t because it doesn't depend on llvm::support::ulittle32_t, which uses all this memcpy craziness.

53–54 ↗(On Diff #99840)

DenseMap takes the low bits of the hash and uses them to index into its table, so there can be hash collisions. So, if the table has 256 entries and two records hash to 0x100 and 0x200, this comparison will save us from doing a full memcmp.

59–60 ↗(On Diff #99840)

I really wanted HashedType and HashedTypePtr to be in anonymous namespaces, and this is what I had to do to accomplish that. They really aren't general purpose types that should float around in public headers.

98–99 ↗(On Diff #99840)

@chandlerc said we'll probably have to delete xxHash64, so I didn't want to use it. hash_value actually hashes 64 bytes at a time if it can. I'm pretty confident we're hitting that overload, but I could do more to check. It also worried me.

zturner accepted this revision.May 23 2017, 11:22 AM
This revision is now accepted and ready to land.May 23 2017, 11:22 AM
This revision was automatically updated to reflect the committed changes.

Drive by comment...

llvm/trunk/lib/DebugInfo/CodeView/TypeSerializer.cpp
40

This isn't suitably aligned. You should use the DenseMapInfo for some pointer type instead.