This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Shard preamble symbols in dynamic index
ClosedPublic

Authored by kadircet on Apr 8 2020, 7:23 AM.

Details

Summary

This reduces memory usage by dynamic index from more than 400MB to 32MB
when all files in clang-tools-extra/clangd/*.cpp are active in clangd.

Diff Detail

Event Timeline

kadircet created this revision.Apr 8 2020, 7:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 8 2020, 7:23 AM

Can't believe we didn't do this before. Nice catch.

clang-tools-extra/clangd/index/FileIndex.cpp
103

This screams out to be shared with BackgroundIndex::update(). Is it very hard to do?

Things that are different:

  • no duplication of symbol into defining file (why?)
  • refs not gathered here (but I guess empty input -> empty output?)
  • include graph not gathered here (same)
  • this holds 3 copies of all the data at peak, other impl holds 2

I imagine factoring out a function that returns a StringMap<File> like in BackgroundIndex::Update, but File has a build() function that returns an IndexFileIn or so.
(Ooh, is depending on IndexFileIn a layering violation? Maybe we should move that into another header. Duplicating the struct is probably OK for now)

118

um... I guess that works, because Expected is unchecked in NDEBUG?
I think cantFail is clearer though.

422

nit: main-file

clang-tools-extra/clangd/index/FileIndex.h
129–130

This seems like a caveat rather than a FIXME - we're not really proposing a fix are we?

kadircet marked an inline comment as done.Apr 12 2020, 3:28 AM
kadircet added inline comments.
clang-tools-extra/clangd/index/FileIndex.cpp
103

In backgroundindex we "prefilter" symbols and only duplicate the ones coming from modified files, whereas in this one we duplicate every symbol no matter what. I didn't want to change backgroundindex into post filtering mode (I suppose that's what you mean by the last item).

as for the first bullet point, duplicating in case of preamble symbols wouldn't matter since we don't merge(rather pick one) while building the index for preamblesymbols. I suppose storing them twice shouldn't be a huge issue.

Depending to IndexFileIn in "Index.h" would be a violation, but depending on it in "FileIndex.h" is fine.

I am happy to refactor a StringMap<File> shardIndexToFiles(IndexFileIn, HintPath) into "FileIndex.h" that can be used by both backgroundindex and this, if you are OK with above mentioned regressions.

kadircet updated this revision to Diff 257003.Apr 13 2020, 9:23 AM
kadircet marked 4 inline comments as done.
  • Unify sharding logic in BackgroundIndex and FileIndex.
  • Make sure relations have valid subjects.
kadircet added inline comments.Apr 13 2020, 9:47 AM
clang-tools-extra/clangd/index/FileIndex.cpp
103

implemented a version of this that only copies data when requested. Hence the pre/post filtering argument is gone.

kadircet marked an inline comment as done.Apr 13 2020, 9:49 AM
kadircet added inline comments.
clang-tools-extra/clangd/index/SymbolCollector.cpp
286

put this before processRelations to ensure Subject of a relation is a symbol we are going to collect.

sammccall accepted this revision.Apr 13 2020, 2:42 PM

LG, API quibbles.

It would be nice to test this but the good things about it (memory usage) aren't easy to test, and the bad things (sensitivity to header parsing context) seem odd to be the primary test.
Actually I guess one nice thing we get from this is erasure of stale symbols: if we index the preamble {foo.cpp=#include "x.h", x.h=int a;}, and then {bar.cpp=#include "x.h", x.h=int b;}, then I think before this patch we see both a and b in the index, and after we see only b.
That would be a reasonable test.

Or I guess now that the sharding function is public we can test it directly.

clang-tools-extra/clangd/index/FileIndex.h
129–130

nit: header

157

nit: space before parenthetical

159

Hmm, this is a good name for a function, but we want a noun for a class. FileShardedIndex?

165

I do find it a little weird not to expose the map-structure of the vast majority of the data here.

What steers you away from just making this a function
StringMap<X> shardIndexToFile(IndexFileIn, PathRef)
where X could be File with a method to obtain the data, or the more anonymous unique_function<IndexFileIn()>?

172

Can we bundle these parallel functions into one that returns IndexFileIn?
I don't think there's much efficiency in building only a subset, is there? (If there's a lot of data in e.g. refs we're going to ignore, we can strip them from the input)

That would also mean no need to special-case mainfilecmd in the API - the only caller of that function does it in the file loop, so it would be provided for the main-file shard only.

clang-tools-extra/clangd/index/SymbolCollector.cpp
286

The code around this seems unchanged - how does this relate to the rest of the patch?
(Other than I guess you've been inspecting the data and finding bogus shards :-)

This is bit subtle, and hard to reason about since we only have one example. Indeed dangling relations are pointless. However:

  • the direction of the relation is fairly arbitrary. Should we be checking shouldCollectSymbol on the object of the relation too?
  • shouldCollectSymbols depends somewhat on the file context, and also the indexer options. In principle the subject could be indexable elsewhere but not here. Or this index could see the relation and another could see the symbol itself. Can't think of compelling examples though, just hypotheticals.

I think probably we should give this a comment (about not wanting relations or refs either), add filtering on the relation objects, and add a test (I think that should be pretty easy with inheriting from a main-file-only class?).
Maybe it's a separate patch?

This revision is now accepted and ready to land.Apr 13 2020, 2:42 PM
kadircet updated this revision to Diff 257242.Apr 14 2020, 2:35 AM
kadircet marked 7 inline comments as done.
  • Address comments
  • Merge slab generations into a single member that returns an IndexFileIn.
kadircet added inline comments.Apr 14 2020, 2:46 AM
clang-tools-extra/clangd/index/FileIndex.h
165

because the map structure is just pointers into the IndexFileIn, I wanted to make sure those pointers do not outlive the IndexFileIn.

clang-tools-extra/clangd/index/SymbolCollector.cpp
286

The code around this seems unchanged - how does this relate to the rest of the patch?
(Other than I guess you've been inspecting the data and finding bogus shards :-)

Selection of file declaring the Subject seemed arbitrary, and turned out to be broken. We were not hitting this case before because background index was performing pre filtering and skipping any relation that mapped to a "non interesting" file, hence we didn't notice it was broken. Now it become obvious as we assert on the file containing the subject symbol.

the direction of the relation is fairly arbitrary. Should we be checking shouldCollectSymbol on the object of the relation too?
shouldCollectSymbols depends somewhat on the file context, and also the indexer options. In principle the subject could be indexable elsewhere but not here. Or this index could see the relation and another could see the symbol itself. Can't think of compelling examples though, just hypotheticals.
I think probably we should give this a comment (about not wanting relations or refs either), add filtering on the relation objects, and add a test (I think that should be pretty easy with inheriting from a main-file-only class?).
Maybe it's a separate patch?

I also wanted to do this but it got hairy and as you also have suggested I decided to have a separate patch for this. Because there were other problematic stuff, for example IndexerOpts::CollectMainFileSymbols is always true.

sammccall accepted this revision.Apr 14 2020, 6:38 AM

Still LG

clang-tools-extra/clangd/index/FileIndex.h
165

I don't think it's unreasonable to have a non-owning data structure with a comment, but up to you.

kadircet updated this revision to Diff 257607.Apr 14 2020, 11:55 PM
  • Add tests for sharding logic and preamble overwrite
This revision was automatically updated to reflect the committed changes.