- Inline SymbolID hashing to header
- Don't collect references for symbols without a SymbolID
- Store referenced symbols, rather than separately storing decls and macros.
- Don't defer ref collection to end of translation unit
- Perform const_cast when updating reference counts (~0.5% saving)
- Introduce caching for getSymbolID in SymbolCollector. (~30% saving)
- Don't modify symbolslab if there's no definition location
- Don't lex the whole file to deduce spelled tokens, just lex the relevant piece (~8%)
Overall this achieves ~38% reduction in time spent inside SymbolCollector compared to baseline (on my machine :)).
I'd expect the last optimization to affect dynamic index a lot more, I was testing with clangd-indexer on clangd subfolder of LLVM. As clangd-indexer runs indexing of whole TU at once, we indeed see almost every token from every source included in the TU (hence lexing full files vs just lexing referenced tokens are almost the same), whereas during dynamic indexing we mostly index main file symbols, but we would touch the files defining/declaring those symbols, and lex complete files for nothing, rather than just the token location.
The last optimization is also a functional change (added test), previously we used raw tokens from syntax::tokenize, which didn't canonicalize trigraphs/newlines in identifiers, wheres Lexer::getSpelling canonicalizes them.
Details
- Reviewers
sammccall
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Neat!
I think you've broken the FilesToTokenCache by moving it into the loop, so I'd expect further wins from fixing that.
(If you don't see any, something seems wrong: syntax::tokenize should be called a fair bit and should be pretty expensive!)
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
608 | there's a big block of code here that's checking if the reference was spelled or not, pull out a function? | |
610 | Um, is this cache meant to be a member? It's pretty well guaranteed to be empty on the next line :-) | |
714–718 | What does a non-spelled macro reference look like? | |
815 | no need for this to be a lambda anymore, a plain loop seems fine | |
816–817 | if you have timing set up, try making this const_cast<Symbol*>(S)->References++. Reinserting into a symbol slab is pretty expensive I think, and this is just an awkward gap in the API. | |
972 | nit: just try_emplace(D). The rest of the arguments get forwarded to the V constructor, so you're calling an unneccesary move constructor here. (Shouldn't matter here because SymbolID is trivial, but it can) | |
clang-tools-extra/clangd/index/SymbolCollector.h | ||
180 | hash_code(SymbolID) is not defined in the header, but the hash function is trivial and could be inlined everywhere. Might be worth exposing. (not really related to this patch, but you have some setup for benchmarking at the moment...) |
- Get rid of leftovers and update comments
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
714–718 | I don't fully understand it either (hence decided to keep it as-is), but my initial guess is nested macro expansions, e.g: #define FOO(X) X #define BAR(X) FOO(X) BAR(int x); I suppose we assume there's a reference to FOO at expansion of BAR here today. But I am not sure if libIndex will actually emit a macro reference for FOO here. |
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
189 | You've changed this from tokenizing the file with a cache. I'm not sure this is significant (I imagine not), but you don't actually have to run the lexer here, since you already know what the string is going to be, it's enough to grab the buffer pointer, check that it starts with the text, check that the next character is not an identifier-continuer. | |
714–718 | oops nevermind, you're only moving this comment from elsewhere |
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
189 | right, I actually did that at first, but it implies keeping the behaviour around unclean tokens broken, and there wasn't much of a win (delta was in the noise). I think because the expensive part is actually figuring out the fileid, and the lexing call can do it cheaply right now as it benefits from the single element cache. |
hash_code(SymbolID) is not defined in the header, but the hash function is trivial and could be inlined everywhere. Might be worth exposing.
(not really related to this patch, but you have some setup for benchmarking at the moment...)