This is an archive of the discontinued LLVM Phabricator instance.

[Clangd] Index main-file symbols (bug 39761)
ClosedPublic

Authored by nridge on Dec 2 2018, 5:47 PM.

Event Timeline

nridge created this revision.Dec 2 2018, 5:47 PM
nridge marked 2 inline comments as done.Dec 2 2018, 6:06 PM

This is a first attempt at a patch for this.

There are some cases this doesn't handle, which I was thinking of handling in follow-up commits (but we could do them here if desired):

  • Entities in anonymous namespaces are still not indexed, There is a comment in SymbolCollector::shouldCollectSymbol() which says "Also we skip the symbols in anonymous namespace as the qualifier names of these symbols are like foo::<anonymous>::bar, which need a special handling." It's not clear to me how that should be handled, suggestions are welcome.
  • Macros in main-files are still not indexed. I found that when I tried to index them, the index would start to contain all sorts of built-in macros like __INT_FAST16_MAX__, __DBL_MIN__, etc. I assume we'll want to exclude those, but it wasn't immediately clear to me how.

I ran clangd-indexer on the clangd codebase to see what this patch does to the size of the generated index: the increase in on-disk size was only 0.45%. Presumably that will increase once we start storing entities in anonymous namespaces. (I haven't run it on all of LLVM yet as it takes a very long time to run.)

nridge added inline comments.Dec 2 2018, 6:06 PM
clang-tools-extra/unittests/clangd/FindSymbolsTests.cpp
67

I found I had to do this to get many WorkspaceSymbolsTest test cases to pass.

In the absence of this flag, the header files in affected test cases are indexed as C files, and later (when included via the .cpp file) as C++ files. USRs for functions are different in C mode than in C++ mode (in C++ mode the USR includes the function's signature), leading to different SymbolIDs and thus two copies of the function in the index.

Having the test case parse the header file in C++ mode seems reasonable, as it matches what happens in production (when you open a .h file in a C++ project in an editor, clangd does use a compilation command that includes -xc++, or rather -xc++-header (thanks to InterpolatingCompilationDatabase?)).

Not sure if there's a better way to do this - should MockCompilationDatabase be doing this automatically somehow?

clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp
535

I had to simplify the test case a bit because otherwise there were 11 symbols and UnorderedElementsAre() only supports 10 :) Not sure if there's a better thing to do here.

Thanks for taking a look at this.

Symbols from the main file might require special handling as there are various percularities with them.
E.g. it would be weird to see just a single "main" function being returned in the workspace symbols, despite having multiple of those in the codebase. There's a whole can of worms when the names of the symbols are the same, but they live inside different C++ files in the anonymous namespace.
However, it should be ok if we collect them only to expose in the workspace symbols.
I think we should take a look at collecting the symbols from the anonymous namespace. Those should be important, since this is where almost all cpp-file local symbols should live (in a well-behaved C++ codebase).

It's not clear to me how that should be handled, suggestions are welcome.

Let's just try collecting them. This would work fine with symbols that are only inside a single file, but will result in symbols with the same name from multiple files being merged together, which is bad.

I found that when I tried to index them, the index would start to contain all sorts of built-in macros like INT_FAST16_MAX, DBL_MIN, etc. I assume we'll want to exclude those, but it wasn't immediately clear to me how

Those macros should come from a buffer with the name "<built-in>". there should be a way to detect those and get rid of them.

clang-tools-extra/clangd/index/SymbolCollector.cpp
297–298

Why out parameter here? Just compute this in the calling function instead.

352

Just set CollectRef to false on symbols from the main files?

544–545

Please add a comment that we only collect main file symbols for workspace symbols.

hokein added a subscriber: hokein.Dec 3 2018, 5:06 AM

This is a first attempt at a patch for this.

There are some cases this doesn't handle, which I was thinking of handling in follow-up commits (but we could do them here if desired):

  • Entities in anonymous namespaces are still not indexed, There is a comment in SymbolCollector::shouldCollectSymbol() which says "Also we skip the symbols in anonymous namespace as the qualifier names of these symbols are like foo::<anonymous>::bar, which need a special handling." It's not clear to me how that should be handled, suggestions are welcome.

For code completion, we don't want them in the index, these kind of symbols can be offered from clang's AST (as anonymous namespace symbols typically are in main file). For workspace symbols, we want them in the index, they will be returned when users search foo::bar instead of foo::<anonymous>::bar I suppose.

So I think we can treat them as normal static symbols.

  • Macros in main-files are still not indexed. I found that when I tried to index them, the index would start to contain all sorts of built-in macros like __INT_FAST16_MAX__, __DBL_MIN__, etc. I assume we'll want to exclude those, but it wasn't immediately clear to me how.

I think we should exclude builtin macros, MacroInfo provides a isBuiltinMacro function that could help you.

I ran clangd-indexer on the clangd codebase to see what this patch does to the size of the generated index: the increase in on-disk size was only 0.45%. Presumably that will increase once we start storing entities in anonymous namespaces. (I haven't run it on all of LLVM yet as it takes a very long time to run.)

AFAIK, the symbol (section) is not the largest part of index (we have 30K symbols), would be interesting to know the (index, memory) size (including symbols in anonymous namespace).

clang-tools-extra/clangd/index/SymbolCollector.cpp
297–298

The comment is stale.

544–545

This introduces a new dependency IsMainFileSymbol, we need to check IsMainFileSymbol to make sure we set IndexedForCodeCompletion correctly. And we need to update all callsides of isIndexedForCodeCompletion.

How about

  • expose a static method isDeclaredInMainFile(const NamedDecl, SourceManager SM) in SymbolCollector
  • use this method in shouldCollectSymbol, and isIndexedForCodeCompletion

This would get rid of passing IsMainFileSymbol argument in shouldCollectSymbol, addDeclaration. The tradeoff is that we pay an extra cost of IsDeclaredInMainFile, but I think it is fine.

clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp
146

I think we need add a test for static symbols.

simark added a subscriber: simark.Dec 3 2018, 11:54 AM
nridge marked 7 inline comments as done and an inline comment as not done.Dec 11 2018, 9:04 PM

Thanks for the reviews!

It's not clear to me how that should be handled, suggestions are welcome.

Let's just try collecting them. This would work fine with symbols that are only inside a single file, but will result in symbols with the same name from multiple files being merged together, which is bad.

Thanks for clarifying. I thought the comment was saying the name would require special handling.

clang-tools-extra/clangd/index/SymbolCollector.cpp
544–545

This is the only place where we call isIndexedForCodeCompletion.

Given that, should we still make these changes?

clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp
535

I found the solution: UnorderedElementsAreArray().

nridge updated this revision to Diff 177819.Dec 11 2018, 9:05 PM

Addressed review comments

AFAIK, the symbol (section) is not the largest part of index (we have 30K symbols), would be interesting to know the (index, memory) size (including symbols in anonymous namespace).

Do you have a suggestion for how to measure memory usage? Should I just look at the memory usage of the clangd process in top?

AFAIK, the symbol (section) is not the largest part of index (we have 30K symbols), would be interesting to know the (index, memory) size (including symbols in anonymous namespace).

Do you have a suggestion for how to measure memory usage? Should I just look at the memory usage of the clangd process in top?

You can use dexp(clangd/index/dex/dexp/) to load the index data, and it will print the memory usage of the index.

clang-tools-extra/clangd/index/SymbolCollector.cpp
396

hmm, the isBuiltinMacro only detects the builtin macros which is not sufficient for our cases.

The current way seems a bit tricky. Could you dump the DefLoc to see the whole location information of __DBL_MIN_?
We don't record Predefined information in MacroInfo. I think we can do it by checking whether the FileID is a predefined file ID (using SM.getFileID(DefLoc) == PP->getPredefinesFileID()).

I think we can do this in a separate patch, just focus on the declarations in this patch, what do you think?

544

I think we can rephrase the comment more collectly -- we collect the main-file symbols, but we don't use them for code completion.

nridge updated this revision to Diff 178126.Dec 13 2018, 1:24 PM

Address second round of review comments

nridge marked 3 inline comments as done.Dec 13 2018, 1:27 PM

AFAIK, the symbol (section) is not the largest part of index (we have 30K symbols), would be interesting to know the (index, memory) size (including symbols in anonymous namespace).

Do you have a suggestion for how to measure memory usage? Should I just look at the memory usage of the clangd process in top?

You can use dexp(clangd/index/dex/dexp/) to load the index data, and it will print the memory usage of the index.

Thanks!

So, on the clangd codebase, I got the following measurements for indexing with this patch compared to indexing without this patch:

  • Size of index on disk increased by 3.2%
  • Memory usage as reported by dexp increased by 3.6%
clang-tools-extra/clangd/index/SymbolCollector.cpp
396

I found SourceManager::isWrittenInBuiltinFile() which seems to serve this purpose well.

! In D55185#1330328, @nridge wrote:

Thanks!

So, on the clangd codebase, I got the following measurements for indexing with this patch compared to indexing without this patch:

 
- Size of index on disk increased by 3.2%
- Memory usage as reported by `dexp` increased by 3.6%

Thanks for the investigation. Could you please rebase your patch? I'd like to try it on the whole llvm project (since it is too slow for you to run it on whole LLVM).

clang-tools-extra/clangd/index/SymbolCollector.cpp
396

Can we postpone the macro case to another patch? I'm a bit nervous about macros...

clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp
345

Why ForCodeCompletion is true for ff? ff is a main file symbol.

644

Since we change the purpose of the test, I think the Header can be empty.

651

Maybe add a test case where anonymous namespace is inside a named namespace, like

namespace foo {
namespace {
class Foo {};
}
}
nridge updated this revision to Diff 178378.Dec 15 2018, 2:44 PM
nridge marked 4 inline comments as done.

Address third round of review comments

Thanks for the investigation. Could you please rebase your patch? I'd like to try it on the whole llvm project (since it is too slow for you to run it on whole LLVM).

Rebased, thanks!

clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp
345

This code is interpreted as a header, so none of the symbols are main-file symbols.

nridge marked 2 inline comments as done.Dec 15 2018, 2:55 PM
nridge added inline comments.
clang-tools-extra/clangd/index/SymbolCollector.cpp
396

Moved the macro part here: https://reviews.llvm.org/D55739

Thanks for the update.

Got some data on your patch, the increased size is larger than we expected, it seems that we have bugs in the patch, IIUC the references number should be unchanged.

Before (76M)

 ./bin/dexp symbols.dex
Loaded Dex from symbols.dex with estimated memory usage 277976448 bytes
  - number of symbols: 308624
  - number of refs: 5561489

After (89M)

 ./bin/dexp symbols-main-symbols.dex
Loaded Dex from symbols-main-symbols.dex with estimated memory usage 345706301 bytes
  - number of symbols: 440012
  - number of refs: 6022675
nridge marked an inline comment as done.Dec 21 2018, 9:17 AM

it seems that we have bugs in the patch, IIUC the references number should be unchanged.

Before (76M)

 ./bin/dexp symbols.dex
Loaded Dex from symbols.dex with estimated memory usage 277976448 bytes
  - number of symbols: 308624
  - number of refs: 5561489

After (89M)

 ./bin/dexp symbols-main-symbols.dex
Loaded Dex from symbols-main-symbols.dex with estimated memory usage 345706301 bytes
  - number of symbols: 440012
  - number of refs: 6022675

I am not able to reproduce this, at least not on the clangd codebase (I didn't try all of llvm).

Just to check: when you did this test, were you indexing the exact same code both times? I ask because, when I was looking into this, I initially made the mistake of indexing the same copy of the clangd source from which I built the indexer. In that case, the number of refs was slightly different, but that's expected because the source code being indexed is different. When I tried it again indexing the same source code both times, the number of refs was unchanged.

hokein added a comment.Jan 3 2019, 7:26 AM

it seems that we have bugs in the patch, IIUC the references number should be unchanged.

Before (76M)

 ./bin/dexp symbols.dex
Loaded Dex from symbols.dex with estimated memory usage 277976448 bytes
  - number of symbols: 308624
  - number of refs: 5561489

After (89M)

 ./bin/dexp symbols-main-symbols.dex
Loaded Dex from symbols-main-symbols.dex with estimated memory usage 345706301 bytes
  - number of symbols: 440012
  - number of refs: 6022675

I am not able to reproduce this, at least not on the clangd codebase (I didn't try all of llvm).

Just to check: when you did this test, were you indexing the exact same code both times? I ask because, when I was looking into this, I initially made the mistake of indexing the same copy of the clangd source from which I built the indexer. In that case, the number of refs was slightly different, but that's expected because the source code being indexed is different. When I tried it again indexing the same source code both times, the number of refs was unchanged.

I did double check today, I can confirmed that both of them ran at the same code base. Didn't dig into it, my suspicion is that we collect header symbols in anonymous namespace with this patch, this maybe the reason that increases references.

nridge updated this revision to Diff 180180.Jan 3 2019, 5:01 PM

Do not index symbols inside anonymous namespaces in header files.

nridge added a comment.Jan 3 2019, 5:03 PM

I did double check today, I can confirmed that both of them ran at the same code base.

Thanks for checking.

Didn't dig into it, my suspicion is that we collect header symbols in anonymous namespace with this patch, this maybe the reason that increases references.

This is the only explanation that I can think of, too.

I modified the patch so we don't collect header symbols in anonymous namespaces. Could you try again with this patch, and see if it resolves the discrepancy?

hokein added a comment.EditedJan 4 2019, 4:30 AM

I modified the patch so we don't collect header symbols in anonymous namespaces. Could you try again with this patch, and see if it resolves the discrepancy?

With the new patch, the result looks reasonable now (number of symbols: 405 K vs 309 K, dex usage: 320 MB vs 278 MB, index size: 82 MB vs 76 MB), the memory usage has 15% increase, which is a bit unfortunate, but we can optimize it.

clang-tools-extra/clangd/index/SymbolCollector.cpp
342

I think we can optimize this, calling isInMainFile for each reference seems wasteful.

How about

if (IsOnlyRef && !CollectRef)
   return true;
bool IsMainFileSymbol = SM.isInMainFile(SM.getExpansionLoc(ND->getBeginLoc()));
if (!shouldCollectSymbol(*ND, *ASTCtx, Opts, IsMainFileSymbol))
   return true;
if (CollectRef && !isMainFileSymbol && ...)
clang-tools-extra/clangd/index/SymbolCollector.h
27

The documentation needs update.

nridge marked 3 inline comments as done.Jan 5 2019, 5:51 PM
nridge added inline comments.
clang-tools-extra/clangd/index/SymbolCollector.cpp
342

Heh, this is actually what I did in the first version of the patch, and then Ilya suggested I modify CollectRef instead.

Anyways, I changed it back now.

nridge updated this revision to Diff 180387.Jan 5 2019, 5:51 PM
nridge marked an inline comment as done.

Addressed latest round of review comments

hokein added a comment.Jan 7 2019, 7:19 AM

Thanks! The patch looks mostly good.

I think we'd need a new SymbolFlag for main-file symbols (allowing us to filter them out internally) as we are not quite interested in this kind of symbols internally, and the increase of the symbols may blow up our internal indexer, but this can be done in another patch.

clang-tools-extra/unittests/clangd/SymbolCollectorTests.cpp
119

Passing true to the function doesn't seem right to me. I think we might need to expose the code SM.isInMainFile(SM.getExpansionLoc(ND->getBeginLoc())); to a utility function, maybe a static method SymbolCollector::isDeclaredInMainFile.

nridge updated this revision to Diff 180753.Jan 8 2019, 3:22 PM

Address latest review comment

nridge marked an inline comment as done.Jan 8 2019, 3:22 PM
hokein accepted this revision.Jan 9 2019, 2:54 AM

looks good. Do you have commit access?

This revision is now accepted and ready to land.Jan 9 2019, 2:54 AM
nridge added a comment.EditedJan 9 2019, 7:30 AM

Thanks very much for the reviews!

looks good. Do you have commit access?

I do not. Could you commit it for me?

Sorry to jump in late here.

The piece that seems to be missing is recording that a symbol is main-file only, which seems important for ranking - public symbols should be ranked above private ones.
(In Quality.h this is the distinction between AccessibleScope::GlobalScope and AccessibleScope::FileScope).

I think we want to add a new Symbol::SymbolFlag for this, I'd suggest FileLocal or so.

clang-tools-extra/clangd/index/SymbolCollector.cpp
264

isWrittenInMainFile

(the names of these functions are terrible - for our purpose, isWrittenInMainFile is just faster)

clang-tools-extra/clangd/index/SymbolCollector.h
45

Can I request making main-file symbol collection optional here? (on by default)

The cost/benefit tradeoff is different for a static index of a huge codebase, we may want to keep this off for our internal code index.

82

This doesn't make sense as part of the public SymbolCollector API.
I'd suggest just inlining it at its callsite.
(The test shouldn't really be calling it at all - I think the test is confused about its scope. But that's not new in this patch, so just inlining it there too seems ok for now)

The piece that seems to be missing is recording that a symbol is main-file only, which seems important for ranking - public symbols should be ranked above private ones.
(In Quality.h this is the distinction between AccessibleScope::GlobalScope and AccessibleScope::FileScope).

Ranking for what? If you mean code completion, we are already excluding main-file symbols from code completion by not setting the IndexedForCodeCompletion flag.

nridge updated this revision to Diff 181131.Jan 10 2019, 12:32 PM

Addressed Sam's comments, except for the SymbolFlag one as I do not understand it yet

sammccall accepted this revision.Jan 11 2019, 3:36 AM

Thanks for the changes. LGTM, though I think we should follow up with the ranking stuff.

The piece that seems to be missing is recording that a symbol is main-file only, which seems important for ranking - public symbols should be ranked above private ones.
(In Quality.h this is the distinction between AccessibleScope::GlobalScope and AccessibleScope::FileScope).

Ranking for what? If you mean code completion, we are already excluding main-file symbols from code completion by not setting the IndexedForCodeCompletion flag.

SymbolIndex::fuzzyFind provides ranking for non-completion searches.
This is used for workspaceSymbol at the moment and maybe other queries (the subtypes search we discussed on the mailing list?) in future.

Currently the scope isn't taken into account for non-completion queries, because historically only top-level global symbols were in the index. With members and now main-file symbols in the index, we should fix that.
I don't want to block this patch, so feel free to commit as is. I think we should follow up with:

  • add the flag to Symbol to distinguish file-local symbols (please do this, but a separate patch is fine)
  • set Scope accordingly in Quality.cpp, and adjust relevance scores for non-code-complete queries based on scope, in the opposide direction from CC queries (I'm happy to do this if you don't want to)
nridge updated this revision to Diff 181462.Jan 12 2019, 8:17 PM

Add a new symbol flag to mark main-file symbols

SymbolIndex::fuzzyFind provides ranking for non-completion searches.
This is used for workspaceSymbol at the moment and maybe other queries (the subtypes search we discussed on the mailing list?) in future.

Currently the scope isn't taken into account for non-completion queries, because historically only top-level global symbols were in the index. With members and now main-file symbols in the index, we should fix that.

Thanks, makes sense now.

I don't want to block this patch, so feel free to commit as is. I think we should follow up with:

  • add the flag to Symbol to distinguish file-local symbols (please do this, but a separate patch is fine)

I've done this in the latest update to this patch.

  • set Scope accordingly in Quality.cpp, and adjust relevance scores for non-code-complete queries based on scope, in the opposide direction from CC queries (I'm happy to do this if you don't want to)

I'm happy to give this a try. I think this one makes more sense in a separate patch.

As I'm not a committer, could you commit this patch for me? Thanks!

  • set Scope accordingly in Quality.cpp, and adjust relevance scores for non-code-complete queries based on scope, in the opposide direction from CC queries (I'm happy to do this if you don't want to)

I'm happy to give this a try. I think this one makes more sense in a separate patch.

I did this in this patch.

Thanks for adding the flag!

Merge.cpp also needs to handle what happens when different copies of the symbol have different value for the flag (--> it's not file local).

I'll address that and another last comment and land the patch.

clang-tools-extra/clangd/index/SymbolCollector.cpp
351

just noticed this is a bit subtle: we want to know if this is *only* a main-file symbol, not just if it's ever declared in the main file. But the check works because D is the canonical decl (per the clang/Index infrastructure), which is the lexically first, and we see the headers before the main file. I'll add a comment.

sammccall updated this revision to Diff 181510.Jan 14 2019, 2:04 AM

add comment about subtleties around canonical declarations and main-file check
invert sense of Symbol flag (FileLocal -> VisibleOutsideFile) so Merge does the right thing
test: add case for forward declarations (I think we do the wrong thing, fix later)
test: replace Matcher(true)/Matcher(false) with Matcher() and Not(Matcher())

This revision was automatically updated to reflect the committed changes.

The forward-declaration-in-main-file case:
In principle, if the definition is missing, I think we should treat these differently (not index them, or treat them as visible-outside file) because they're almost certainly declared "properly" elsewhere.
@ilya-biryukov convinced me offline that this isn't a big deal in practice as forward declarations in main-files (and not subsequently defining the symbol) aren't that useful.
So going to avoid patching that, as there might be unintended consequences.