Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
This is a first attempt at a patch for this.
There are some cases this doesn't handle, which I was thinking of handling in follow-up commits (but we could do them here if desired):
- Entities in anonymous namespaces are still not indexed, There is a comment in SymbolCollector::shouldCollectSymbol() which says "Also we skip the symbols in anonymous namespace as the qualifier names of these symbols are like foo::<anonymous>::bar, which need a special handling." It's not clear to me how that should be handled, suggestions are welcome.
- Macros in main-files are still not indexed. I found that when I tried to index them, the index would start to contain all sorts of built-in macros like __INT_FAST16_MAX__, __DBL_MIN__, etc. I assume we'll want to exclude those, but it wasn't immediately clear to me how.
I ran clangd-indexer on the clangd codebase to see what this patch does to the size of the generated index: the increase in on-disk size was only 0.45%. Presumably that will increase once we start storing entities in anonymous namespaces. (I haven't run it on all of LLVM yet as it takes a very long time to run.)
clang-tools-extra/unittests/clangd/FindSymbolsTests.cpp | ||
---|---|---|
67 ↗ | (On Diff #176309) | I found I had to do this to get many WorkspaceSymbolsTest test cases to pass. In the absence of this flag, the header files in affected test cases are indexed as C files, and later (when included via the .cpp file) as C++ files. USRs for functions are different in C mode than in C++ mode (in C++ mode the USR includes the function's signature), leading to different SymbolIDs and thus two copies of the function in the index. Having the test case parse the header file in C++ mode seems reasonable, as it matches what happens in production (when you open a .h file in a C++ project in an editor, clangd does use a compilation command that includes -xc++, or rather -xc++-header (thanks to InterpolatingCompilationDatabase?)). Not sure if there's a better way to do this - should MockCompilationDatabase be doing this automatically somehow? |
clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp | ||
533 ↗ | (On Diff #176309) | I had to simplify the test case a bit because otherwise there were 11 symbols and UnorderedElementsAre() only supports 10 :) Not sure if there's a better thing to do here. |
Thanks for taking a look at this.
Symbols from the main file might require special handling as there are various percularities with them.
E.g. it would be weird to see just a single "main" function being returned in the workspace symbols, despite having multiple of those in the codebase. There's a whole can of worms when the names of the symbols are the same, but they live inside different C++ files in the anonymous namespace.
However, it should be ok if we collect them only to expose in the workspace symbols.
I think we should take a look at collecting the symbols from the anonymous namespace. Those should be important, since this is where almost all cpp-file local symbols should live (in a well-behaved C++ codebase).
It's not clear to me how that should be handled, suggestions are welcome.
Let's just try collecting them. This would work fine with symbols that are only inside a single file, but will result in symbols with the same name from multiple files being merged together, which is bad.
I found that when I tried to index them, the index would start to contain all sorts of built-in macros like INT_FAST16_MAX, DBL_MIN, etc. I assume we'll want to exclude those, but it wasn't immediately clear to me how
Those macros should come from a buffer with the name "<built-in>". there should be a way to detect those and get rid of them.
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
312 ↗ | (On Diff #176309) | Why out parameter here? Just compute this in the calling function instead. |
365 ↗ | (On Diff #176309) | Just set CollectRef to false on symbols from the main files? |
550 ↗ | (On Diff #176309) | Please add a comment that we only collect main file symbols for workspace symbols. |
For code completion, we don't want them in the index, these kind of symbols can be offered from clang's AST (as anonymous namespace symbols typically are in main file). For workspace symbols, we want them in the index, they will be returned when users search foo::bar instead of foo::<anonymous>::bar I suppose.
So I think we can treat them as normal static symbols.
- Macros in main-files are still not indexed. I found that when I tried to index them, the index would start to contain all sorts of built-in macros like __INT_FAST16_MAX__, __DBL_MIN__, etc. I assume we'll want to exclude those, but it wasn't immediately clear to me how.
I think we should exclude builtin macros, MacroInfo provides a isBuiltinMacro function that could help you.
I ran clangd-indexer on the clangd codebase to see what this patch does to the size of the generated index: the increase in on-disk size was only 0.45%. Presumably that will increase once we start storing entities in anonymous namespaces. (I haven't run it on all of LLVM yet as it takes a very long time to run.)
AFAIK, the symbol (section) is not the largest part of index (we have 30K symbols), would be interesting to know the (index, memory) size (including symbols in anonymous namespace).
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
311 ↗ | (On Diff #176309) | The comment is stale. |
550 ↗ | (On Diff #176309) | This introduces a new dependency IsMainFileSymbol, we need to check IsMainFileSymbol to make sure we set IndexedForCodeCompletion correctly. And we need to update all callsides of isIndexedForCodeCompletion. How about
This would get rid of passing IsMainFileSymbol argument in shouldCollectSymbol, addDeclaration. The tradeoff is that we pay an extra cost of IsDeclaredInMainFile, but I think it is fine. |
clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp | ||
145 ↗ | (On Diff #176309) | I think we need add a test for static symbols. |
Thanks for the reviews!
Thanks for clarifying. I thought the comment was saying the name would require special handling.
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
550 ↗ | (On Diff #176309) | This is the only place where we call isIndexedForCodeCompletion. Given that, should we still make these changes? |
clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp | ||
533 ↗ | (On Diff #176309) | I found the solution: UnorderedElementsAreArray(). |
Do you have a suggestion for how to measure memory usage? Should I just look at the memory usage of the clangd process in top?
You can use dexp(clangd/index/dex/dexp/) to load the index data, and it will print the memory usage of the index.
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
396 ↗ | (On Diff #177819) | hmm, the isBuiltinMacro only detects the builtin macros which is not sufficient for our cases. The current way seems a bit tricky. Could you dump the DefLoc to see the whole location information of __DBL_MIN_? I think we can do this in a separate patch, just focus on the declarations in this patch, what do you think? |
544 ↗ | (On Diff #177819) | I think we can rephrase the comment more collectly -- we collect the main-file symbols, but we don't use them for code completion. |
Thanks!
So, on the clangd codebase, I got the following measurements for indexing with this patch compared to indexing without this patch:
- Size of index on disk increased by 3.2%
- Memory usage as reported by dexp increased by 3.6%
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
396 ↗ | (On Diff #177819) | I found SourceManager::isWrittenInBuiltinFile() which seems to serve this purpose well. |
! In D55185#1330328, @nridge wrote:
Thanks!
So, on the clangd codebase, I got the following measurements for indexing with this patch compared to indexing without this patch:
- Size of index on disk increased by 3.2% - Memory usage as reported by `dexp` increased by 3.6%
Thanks for the investigation. Could you please rebase your patch? I'd like to try it on the whole llvm project (since it is too slow for you to run it on whole LLVM).
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
396 ↗ | (On Diff #177819) | Can we postpone the macro case to another patch? I'm a bit nervous about macros... |
clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp | ||
345 ↗ | (On Diff #178126) | Why ForCodeCompletion is true for ff? ff is a main file symbol. |
644 ↗ | (On Diff #178126) | Since we change the purpose of the test, I think the Header can be empty. |
651 ↗ | (On Diff #178126) | Maybe add a test case where anonymous namespace is inside a named namespace, like namespace foo { namespace { class Foo {}; } } |
Rebased, thanks!
clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp | ||
---|---|---|
345 ↗ | (On Diff #178126) | This code is interpreted as a header, so none of the symbols are main-file symbols. |
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
396 ↗ | (On Diff #177819) | Moved the macro part here: https://reviews.llvm.org/D55739 |
Thanks for the update.
Got some data on your patch, the increased size is larger than we expected, it seems that we have bugs in the patch, IIUC the references number should be unchanged.
Before (76M)
./bin/dexp symbols.dex Loaded Dex from symbols.dex with estimated memory usage 277976448 bytes - number of symbols: 308624 - number of refs: 5561489
After (89M)
./bin/dexp symbols-main-symbols.dex Loaded Dex from symbols-main-symbols.dex with estimated memory usage 345706301 bytes - number of symbols: 440012 - number of refs: 6022675
I am not able to reproduce this, at least not on the clangd codebase (I didn't try all of llvm).
Just to check: when you did this test, were you indexing the exact same code both times? I ask because, when I was looking into this, I initially made the mistake of indexing the same copy of the clangd source from which I built the indexer. In that case, the number of refs was slightly different, but that's expected because the source code being indexed is different. When I tried it again indexing the same source code both times, the number of refs was unchanged.
I did double check today, I can confirmed that both of them ran at the same code base. Didn't dig into it, my suspicion is that we collect header symbols in anonymous namespace with this patch, this maybe the reason that increases references.
Thanks for checking.
Didn't dig into it, my suspicion is that we collect header symbols in anonymous namespace with this patch, this maybe the reason that increases references.
This is the only explanation that I can think of, too.
I modified the patch so we don't collect header symbols in anonymous namespaces. Could you try again with this patch, and see if it resolves the discrepancy?
I modified the patch so we don't collect header symbols in anonymous namespaces. Could you try again with this patch, and see if it resolves the discrepancy?
With the new patch, the result looks reasonable now (number of symbols: 405 K vs 309 K, dex usage: 320 MB vs 278 MB, index size: 82 MB vs 76 MB), the memory usage has 15% increase, which is a bit unfortunate, but we can optimize it.
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
327 ↗ | (On Diff #180180) | I think we can optimize this, calling isInMainFile for each reference seems wasteful. How about if (IsOnlyRef && !CollectRef) return true; bool IsMainFileSymbol = SM.isInMainFile(SM.getExpansionLoc(ND->getBeginLoc())); if (!shouldCollectSymbol(*ND, *ASTCtx, Opts, IsMainFileSymbol)) return true; if (CollectRef && !isMainFileSymbol && ...) |
clang-tools-extra/clangd/index/SymbolCollector.h | ||
27 ↗ | (On Diff #180180) | The documentation needs update. |
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
327 ↗ | (On Diff #180180) | Heh, this is actually what I did in the first version of the patch, and then Ilya suggested I modify CollectRef instead. Anyways, I changed it back now. |
Thanks! The patch looks mostly good.
I think we'd need a new SymbolFlag for main-file symbols (allowing us to filter them out internally) as we are not quite interested in this kind of symbols internally, and the increase of the symbols may blow up our internal indexer, but this can be done in another patch.
clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp | ||
---|---|---|
119 ↗ | (On Diff #180387) | Passing true to the function doesn't seem right to me. I think we might need to expose the code SM.isInMainFile(SM.getExpansionLoc(ND->getBeginLoc())); to a utility function, maybe a static method SymbolCollector::isDeclaredInMainFile. |
Sorry to jump in late here.
The piece that seems to be missing is recording that a symbol is main-file only, which seems important for ranking - public symbols should be ranked above private ones.
(In Quality.h this is the distinction between AccessibleScope::GlobalScope and AccessibleScope::FileScope).
I think we want to add a new Symbol::SymbolFlag for this, I'd suggest FileLocal or so.
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
243 ↗ | (On Diff #180753) | isWrittenInMainFile (the names of these functions are terrible - for our purpose, isWrittenInMainFile is just faster) |
clang-tools-extra/clangd/index/SymbolCollector.h | ||
46 ↗ | (On Diff #180753) | Can I request making main-file symbol collection optional here? (on by default) The cost/benefit tradeoff is different for a static index of a huge codebase, we may want to keep this off for our internal code index. |
83 ↗ | (On Diff #180753) | This doesn't make sense as part of the public SymbolCollector API. |
Ranking for what? If you mean code completion, we are already excluding main-file symbols from code completion by not setting the IndexedForCodeCompletion flag.
Addressed Sam's comments, except for the SymbolFlag one as I do not understand it yet
Thanks for the changes. LGTM, though I think we should follow up with the ranking stuff.
SymbolIndex::fuzzyFind provides ranking for non-completion searches.
This is used for workspaceSymbol at the moment and maybe other queries (the subtypes search we discussed on the mailing list?) in future.
Currently the scope isn't taken into account for non-completion queries, because historically only top-level global symbols were in the index. With members and now main-file symbols in the index, we should fix that.
I don't want to block this patch, so feel free to commit as is. I think we should follow up with:
- add the flag to Symbol to distinguish file-local symbols (please do this, but a separate patch is fine)
- set Scope accordingly in Quality.cpp, and adjust relevance scores for non-code-complete queries based on scope, in the opposide direction from CC queries (I'm happy to do this if you don't want to)
Thanks, makes sense now.
I don't want to block this patch, so feel free to commit as is. I think we should follow up with:
- add the flag to Symbol to distinguish file-local symbols (please do this, but a separate patch is fine)
I've done this in the latest update to this patch.
- set Scope accordingly in Quality.cpp, and adjust relevance scores for non-code-complete queries based on scope, in the opposide direction from CC queries (I'm happy to do this if you don't want to)
I'm happy to give this a try. I think this one makes more sense in a separate patch.
As I'm not a committer, could you commit this patch for me? Thanks!
Thanks for adding the flag!
Merge.cpp also needs to handle what happens when different copies of the symbol have different value for the flag (--> it's not file local).
I'll address that and another last comment and land the patch.
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
333 ↗ | (On Diff #181462) | just noticed this is a bit subtle: we want to know if this is *only* a main-file symbol, not just if it's ever declared in the main file. But the check works because D is the canonical decl (per the clang/Index infrastructure), which is the lexically first, and we see the headers before the main file. I'll add a comment. |
add comment about subtleties around canonical declarations and main-file check
invert sense of Symbol flag (FileLocal -> VisibleOutsideFile) so Merge does the right thing
test: add case for forward declarations (I think we do the wrong thing, fix later)
test: replace Matcher(true)/Matcher(false) with Matcher() and Not(Matcher())
The forward-declaration-in-main-file case:
In principle, if the definition is missing, I think we should treat these differently (not index them, or treat them as visible-outside file) because they're almost certainly declared "properly" elsewhere.
@ilya-biryukov convinced me offline that this isn't a big deal in practice as forward declarations in main-files (and not subsequently defining the symbol) aren't that useful.
So going to avoid patching that, as there might be unintended consequences.