This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Use xxhash instead of SHA1 for background index file digests.
ClosedPublic

Authored by sammccall on Jul 7 2019, 8:48 PM.

Details

Summary

Currently SHA1 is about 10% of our CPU, this patch reduces it to ~1%.

xxhash is a well-defined (stable) non-cryptographic hash optimized for
fast checksums (like crc32).
Collisions shouldn't be a problem, despite the reduced length:

  • for actual file content (used to invalidate bg index shards), there are only two versions that can collide (new shard and old shard).
  • for file paths in bg index shard filenames, we would need 2^32 files with the same filename to expect a collision. Imperfect hashing may reduce this a bit but it's well beyond what's plausible.

This will invalidate shards on disk (as usual; I bumped the version),
but this time the filenames are changing so the old files will stick
around :-( So this is more expensive than the usual bump, but would be
good to land before the v9 branch when everyone will start using bg index.

Diff Detail

Repository
rL LLVM

Event Timeline

sammccall created this revision.Jul 7 2019, 8:48 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2019, 8:48 PM
kadircet accepted this revision.Jul 8 2019, 1:41 AM

LGTM.

Out of curiosity what was the use case yielded the profile? I hope it is a use case with a lot of loading from disk storage, and only a few indexing operations?
Because in that case we're mostly IO bound, but if it is the other way around and this pops up as a 10% percent cpu usage, while we're performing (CPU bound) indexing part, it definitely sounds scary and something else might be wrong.

This revision is now accepted and ready to land.Jul 8 2019, 1:41 AM

LGTM.

Out of curiosity what was the use case yielded the profile? I hope it is a use case with a lot of loading from disk storage, and only a few indexing operations?
Because in that case we're mostly IO bound, but if it is the other way around and this pops up as a 10% percent cpu usage, while we're performing (CPU bound) indexing part, it definitely sounds scary and something else might be wrong.

(As discussed offline, nothing new here)
Load is starting clangd on an index of the LLVM project itself. This is 100M of shards on disk, around 7 sec of CPU loading shards, and 8 sec of CPU building Dex structures. (On my machine, without any of my perf patches, measured using gperftools cpu profiler). That CPU usage is single-threaded. I don't think we're actually IO-bound.

This revision was automatically updated to reflect the committed changes.