This is an archive of the discontinued LLVM Phabricator instance.

[clangd][SymbolCollector] Introduce a cache for SymbolID generation and some cleanups
ClosedPublic

Authored by kadircet on Apr 7 2022, 1:42 AM.

Details

Reviewers
sammccall
Summary
  • Inline SymbolID hashing to header
  • Don't collect references for symbols without a SymbolID
  • Store referenced symbols, rather than separately storing decls and macros.
  • Don't defer ref collection to end of translation unit
  • Perform const_cast when updating reference counts (~0.5% saving)
  • Introduce caching for getSymbolID in SymbolCollector. (~30% saving)
  • Don't modify symbolslab if there's no definition location
  • Don't lex the whole file to deduce spelled tokens, just lex the relevant piece (~8%)

    Overall this achieves ~38% reduction in time spent inside SymbolCollector compared to baseline (on my machine :)).

    I'd expect the last optimization to affect dynamic index a lot more, I was testing with clangd-indexer on clangd subfolder of LLVM. As clangd-indexer runs indexing of whole TU at once, we indeed see almost every token from every source included in the TU (hence lexing full files vs just lexing referenced tokens are almost the same), whereas during dynamic indexing we mostly index main file symbols, but we would touch the files defining/declaring those symbols, and lex complete files for nothing, rather than just the token location.

    The last optimization is also a functional change (added test), previously we used raw tokens from syntax::tokenize, which didn't canonicalize trigraphs/newlines in identifiers, wheres Lexer::getSpelling canonicalizes them.

Diff Detail

Event Timeline

kadircet created this revision.Apr 7 2022, 1:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 7 2022, 1:42 AM
kadircet requested review of this revision.Apr 7 2022, 1:42 AM
sammccall accepted this revision.Apr 7 2022, 10:54 AM

Neat!

I think you've broken the FilesToTokenCache by moving it into the loop, so I'd expect further wins from fixing that.
(If you don't see any, something seems wrong: syntax::tokenize should be called a fair bit and should be pretty expensive!)

clang-tools-extra/clangd/index/SymbolCollector.cpp
608

there's a big block of code here that's checking if the reference was spelled or not, pull out a function?

610

Um, is this cache meant to be a member? It's pretty well guaranteed to be empty on the next line :-)

714–718

What does a non-spelled macro reference look like?

815

no need for this to be a lambda anymore, a plain loop seems fine

816–817

if you have timing set up, try making this const_cast<Symbol*>(S)->References++.

Reinserting into a symbol slab is pretty expensive I think, and this is just an awkward gap in the API.
We could fix the API or live with const_cast if it matters.

972

nit: just try_emplace(D). The rest of the arguments get forwarded to the V constructor, so you're calling an unneccesary move constructor here.

(Shouldn't matter here because SymbolID is trivial, but it can)

clang-tools-extra/clangd/index/SymbolCollector.h
180

hash_code(SymbolID) is not defined in the header, but the hash function is trivial and could be inlined everywhere. Might be worth exposing.

(not really related to this patch, but you have some setup for benchmarking at the moment...)

This revision is now accepted and ready to land.Apr 7 2022, 10:54 AM
kadircet updated this revision to Diff 421847.Apr 11 2022, 1:40 AM
  • Address comments and more cleanups
kadircet edited the summary of this revision. (Show Details)Apr 11 2022, 1:41 AM
kadircet updated this revision to Diff 421855.Apr 11 2022, 2:31 AM
kadircet marked 6 inline comments as done.
kadircet edited the summary of this revision. (Show Details)
  • Get rid of leftovers and update comments
clang-tools-extra/clangd/index/SymbolCollector.cpp
714–718

I don't fully understand it either (hence decided to keep it as-is), but my initial guess is nested macro expansions, e.g:

#define FOO(X) X
#define BAR(X) FOO(X)

BAR(int x);

I suppose we assume there's a reference to FOO at expansion of BAR here today. But I am not sure if libIndex will actually emit a macro reference for FOO here.

sammccall accepted this revision.Apr 11 2022, 2:33 AM
sammccall added inline comments.
clang-tools-extra/clangd/index/SymbolCollector.cpp
189

You've changed this from tokenizing the file with a cache.
If I'm reading your benchmark spreadsheet right, this is ~3% speedup.

I'm not sure this is significant (I imagine not), but you don't actually have to run the lexer here, since you already know what the string is going to be, it's enough to grab the buffer pointer, check that it starts with the text, check that the next character is not an identifier-continuer.

714–718

oops nevermind, you're only moving this comment from elsewhere

kadircet added inline comments.Apr 11 2022, 2:47 AM
clang-tools-extra/clangd/index/SymbolCollector.cpp
189

right, I actually did that at first, but it implies keeping the behaviour around unclean tokens broken, and there wasn't much of a win (delta was in the noise). I think because the expensive part is actually figuring out the fileid, and the lexing call can do it cheaply right now as it benefits from the single element cache.

(This is ready to land, right?)