Page MenuHomePhabricator

[clangd] Define a compact binary serialization fomat for symbol slab/index.
ClosedPublic

Authored by sammccall on Sep 3 2018, 4:10 AM.

Details

Summary

This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:

llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M

It's also simpler/faster to read and write. Reading llvmidx.riff is 6x faster on my machine than llvmidx.yaml (1.1 vs 6.5s).

The format is a RIFF container (chunks of (type, size, data)) with:

  • a compressed string table
  • simple binary encoding of symbols (with varints for compactness)

It can be extended to include occurrences, Dex posting lists, etc.

There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.

Alternatives considered:

  • compressed YAML or JSON: bulky and slow to load
  • llvm bitstream: confusing model and libraries are hard to use. My attempt produced slightly larger files, and the code was longer and slower.
  • protobuf or similar: would be really nice (esp for back-compat) but the dependency is a big hassle
  • ad-hoc binary format without a container: it seems clear we're going to add posting lists and occurrences here, and that they will benefit from sharing a string table. The container makes it easy to debug these pieces in isolation, and make them optional.

YAML tests are moved from symbolcollectortests to serialization tests - they're a closer fit, and we can reuse the data for testing both formats.

Diff Detail

Repository
rL LLVM

Event Timeline

sammccall created this revision.Sep 3 2018, 4:10 AM
sammccall updated this revision to Diff 163688.Sep 3 2018, 4:14 AM

Revert unused changes.

sammccall edited the summary of this revision. (Show Details)Sep 3 2018, 4:15 AM

This still needs to be synced to head to account for Symbol changes (multi-includes) but those changes are pretty obvious/mechanical.

sammccall updated this revision to Diff 163722.Sep 3 2018, 8:07 AM

Rebase with occurrences changes (but don't serialize occurrences yet).
Also incorporate multiple-include-header change, which is tricky as it makes
Symbol variable-length. Current hack is to preserve only the most-popular 50
includes, open to other ideas.

sammccall updated this revision to Diff 163743.Sep 3 2018, 11:56 AM

Fix subtle macro expansion bug!

sammccall updated this revision to Diff 163744.Sep 3 2018, 12:18 PM

Fix comment

sammccall edited the summary of this revision. (Show Details)Sep 3 2018, 12:31 PM
ioeric added a comment.Sep 4 2018, 8:24 AM

Super cool! Just a few nits.

clangd/RIFF.cpp
58 ↗(On Diff #163722)

The error message seems wrong?

clangd/index/Serialization.cpp
50 ↗(On Diff #163744)

This function could use a comment. What's the difference between this and write32?

96 ↗(On Diff #163744)

Any reason to use std::pair<const char *, size_t> instead of StringRef?

335 ↗(On Diff #163744)

assert(Data.Symbols)?

sammccall marked 2 inline comments as done.Sep 4 2018, 8:39 AM
sammccall added inline comments.
clangd/index/Serialization.cpp
50 ↗(On Diff #163744)

Added comment describing varint encoding.

96 ↗(On Diff #163744)

It's a performance hack: DenseMap<StringRef, T> does a lookup by content which requires hashing the string. We intern the strings as we gather them so there's no need to hash the string twice. Added a comment.

sammccall updated this revision to Diff 163836.Sep 4 2018, 8:39 AM
sammccall marked an inline comment as done.

address review comments

ioeric accepted this revision.Sep 4 2018, 8:45 AM

lgtm!

This revision is now accepted and ready to land.Sep 4 2018, 8:45 AM
This revision was automatically updated to reflect the committed changes.