This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Ignore the static index refs from the dynamic index files.
ClosedPublic

Authored by ArcsinX on Dec 16 2020, 5:49 AM.

Details

Summary

This patch fixes the following problem:

  • open a file with references to the symbol Foo
  • remove all references to Foo (from the dynamic index).
  • MergedIndex::refs() result will contain positions of removed references (from the static index).

The idea of this patch is to keep a set of files which were used during index build inside the index.
Thus at processing the static index references we can check if the file of processing reference is a part of the dynamic index or not.

Diff Detail

Event Timeline

ArcsinX created this revision.Dec 16 2020, 5:49 AM
ArcsinX requested review of this revision.Dec 16 2020, 5:49 AM

TL;DR: Cool! I think we should complicate the signature of hasFile() a bit to avoid calling SwapIndex::snapshot() for each file. There's also some quadratic behavior due to calling index ops from MergeIndex when children are MergeIndexes, but N is small and I think we can solve it later.


I like the simplicity/elegance of this solution, and the fact that it avoids the scaling pitfall of trying to return a list of all the indexed files.

I am concerned about performance though. At a high level, we try to make index operations fairly coarse/batched and only do a handful of them. This was originally to allow these operations to be RPCs, but of course this assumption gets baked in in other ways too. Introducing a fine-grained operation like this causes some of the assumptions to be violated.

Concretely, SwapIndex (and ProjectAwareIndex) take a lock to obtain the index snapshot to use, and so here we do that for each call to hasFile(). Most indexes are wrapped in SwapIndex for either lazy-loading or dynamic updating.

And we hammer it reasonably hard here, MergeIndex::refs() will call it in a loop for ~100 times, and because there are multiple MergeIndexes in the index stack, we end up with duplication: the first index never gets hasFile(), the second gets it 100 times, the third 200 times, the fourth 300 times. (Currently we have at least 4 indexes in the stack).

Taking 600 locks to call refs() seems... less than ideal. For now they'll probably be uncontended as we only call refs() for interactive actions (find-references and rename), but fundamentally this applies to all requests, not just refs, so I'm not sure this will be true in the future (a fast pseudoparser index will want to consult the previous index to resolve symbols).

This all seems related to the idea that hasFile() isn't implemented for RemoteIndex - it's a pragmatic shortcut, but in principle it'd be nice to be able to implement it, and there's some minimum per-RPC overhead.


So, can we do anything about all this?

Locking once for each ref: fundamentally we want to lock once to get the snapshot, then query the snapshot for each file. This changing hasFile to return an oracle that can be cheaply queried, e.g. a signature like unique_function<bool(StringRef)> SymbolIndex::indexedFiles().

The quadratic nature of calling any index ops in MergedIndex: I guess probably not without flattening our binary-tree/linked-list of MergedIndex into an N-way merge. Back in the day, we basically had one static index which was huge, and one dynamic index which was tiny, and the no-copy aspect of MergedIndex was appealing (and it was easy to implement!). But we're probably not coming out ahead anymore vs simply committing to copy everything once. The N-way merge would have other advantages too: we could query large/slow indexes in parallel to improve latency.

Implementing a pseudo-batch form of hasFiles as an RPC for remote-index: we'd need to give up on getting answers for files one-by-one. That would be OK for an N-way merge (since we'd fetch all results before merging, we could build a list). But servers typically index *everything* under a certain directory, so fetching the list of directories and evaluating per-file queries client-side is a cheap and reasonable alternative (that's more accurate than return false!)

ArcsinX updated this revision to Diff 312440.Dec 17 2020, 3:54 AM

bool SymbolIndex::hasFile(llvm::StringRef) => llvm::unique_function<bool(llvm::StringRef)> SymbolIndex::indexedFiles()

ArcsinX added a comment.EditedDec 17 2020, 4:06 AM

Thanks for your reply. I have updated the patch (changed signature of hasFile), but remote-index scenario is not clear for me yet.

Implementing a pseudo-batch form of hasFiles as an RPC for remote-index: we'd need to give up on getting answers for files one-by-one. That would be OK for an N-way merge (since we'd fetch all results before merging, we could build a list). But servers typically index *everything* under a certain directory, so fetching the list of directories and evaluating per-file queries client-side is a cheap and reasonable alternative (that's more accurate than return false!)

Seems I need some clarifications here.
Is it should be like this?:

  • [client]->[server] Give me all directories and subdirectories
  • [client]<-[server] All directories are .....

(we need this only once)

IndexClient::indexedFiles() {
<returns function which checks if the file is inside a directory from the list (i.e. full filename starts with a directory from the list)>
}

What structure should be used for directories list? I think we can use Trie

Am I right about your suggestion?

Thanks, this should work with some slight tweaks.
Biggest question is if we really need/want to allow this function to be null.

clang-tools-extra/clangd/index/Index.h
127

Maybe to be really explicit, add the lifetime constraint: "The function must only be called while the index is alive."

127

it looks below like you allow this to return nullptr. TL;DR: I think we probably shouldn't, and just return false with a fixme in RemoteIndex, and false without a fixme in ProjectAwareIndex.

If we do this, we should:

  • explicitly call out that possibility, because it's error-prone
  • carefully define the semantics

In the implementations nullptr seems to be used in two ways:

  • to mean "always no", as in ProjectAwareIndex with no project.
  • to mean "always don't-know", as in RemoteIndex. The problem is

I don't see any reason to use nullptr for "no" instead of returning [](StringRef){return false;}.
The "don't-know" is more appealing, because the indexes don't have to lie and instead the caller handles the ambiguity. The problem is you can't be consistent about it. Given a MergedIndex where Static->indexedFiles() is non-null and Dynamic->indexFiles() is null, then the answer can be "yes" or "don't know" depending on the file, and we can't express that, so we have to lie anyway.

We could add a tristate return value for the function, but given that it's plausible we could actually implement indexedFiles() for remote index later, I think we should just return false in the "don't know" case for now, with a FIXME there.

clang-tools-extra/clangd/index/MemIndex.cpp
116

need to consumeError(Path.takeError()) here, or we'll crash in debug mode on error

clang-tools-extra/clangd/index/Merge.cpp
127

this is calling indexedFiles() once for each file we're checking, rather than once overall.

You want to move these variables to the capture list instead. (There should be no need to capture this in the end)

clang-tools-extra/clangd/index/ProjectAware.cpp
122

(this is definitely-no-files, so as mentioned above [](StringRef){return false;} seems more appropriate)

clang-tools-extra/clangd/index/dex/Dex.cpp
320

and need to consume error here

Thanks for your reply. I have updated the patch (changed signature of hasFile), but remote-index scenario is not clear for me yet.

Sorry, I didn't see this before comments.

To be clear, I'm *not* asking you to add remote support in this patch, I'm just thinking aloud at what it might be like.
The "return false" behavior is fine for now. The remote index is basically at the bottom of the stack, so it's mostly moot. And "return false" yields the old behavior anyway.

I just want to think aloud about how we can add it on :-)

Implementing a pseudo-batch form of hasFiles as an RPC for remote-index: we'd need to give up on getting answers for files one-by-one. That would be OK for an N-way merge (since we'd fetch all results before merging, we could build a list). But servers typically index *everything* under a certain directory, so fetching the list of directories and evaluating per-file queries client-side is a cheap and reasonable alternative (that's more accurate than return false!)

Seems I need some clarifications here.
Is it should be like this?:

  • [client]->[server] Give me all directories and subdirectories
  • [client]<-[server] All directories are .....

(we need this only once)

Right - we could fetch it when the connection goes up, or on every request, or even configure it entirely client-side (the client knows about the "mount point" of the index and maybe we could assume it's fully-populated). I'm a little hazy on this, @kbobyrev @kadircet would know what's best here once back from vacation.

IndexClient::indexedFiles() {
<returns function which checks if the file is inside a directory from the list (i.e. full filename starts with a directory from the list)>
}

What structure should be used for directories list? I think we can use Trie

Probably just a string or flat list - I think this is exactly one directory today, maybe a short list in future.
(We have the assumption of a single root for path-translation purposes, not sure if we've seen any need for that root to be "partly-populated" yet. If so we may just end up creating one Index instance per directory and share stubs between them.)

But again, this isn't something you need to build now, unless you're really keen :-)

ArcsinX updated this revision to Diff 312497.Dec 17 2020, 7:34 AM

Make indexedFiles() to return [](StringRef){return false;} instead of nullptr.
Add comments.
Do not call indexedFiles() at every file check for MergeIndex.
Consume error if URI::resolve() fails.

ArcsinX marked 6 inline comments as done.Dec 17 2020, 7:35 AM
sammccall accepted this revision.Dec 17 2020, 8:13 AM

Fantastic, thanks so much!

clang-tools-extra/clangd/index/Merge.cpp
129

Currently the signature says that the returned function is not const/threadsafe.
If you prefer, you can have it be unique_function<bool(StringRef) const>, then the contract is that it's const/threadsafe and you don't need mutable here.

Up to you though, I can't think of a case where we *need* it to be threadsafe.

This revision is now accepted and ready to land.Dec 17 2020, 8:13 AM
ArcsinX updated this revision to Diff 312508.Dec 17 2020, 8:14 AM

Fix comment inside IndexClient::indexedFiles()

ArcsinX updated this revision to Diff 312530.Dec 17 2020, 9:11 AM

unique_function<bool(StringRef)> indexedFiles() const => unique_function<bool(StringRef) const> indexedFiles() const
StringRef => llvm::StringRef for consistency

ArcsinX marked an inline comment as done.Dec 17 2020, 9:13 AM

Thank you for review!

ArcsinX updated this revision to Diff 312538.Dec 17 2020, 9:37 AM

StringRef => llvm::StringRef in one more place.

ArcsinX updated this revision to Diff 312742.Dec 18 2020, 3:19 AM

Fix format
Rebase

ArcsinX edited the summary of this revision. (Show Details)Dec 18 2020, 3:20 AM
This revision was landed with ongoing or failed builds.Dec 18 2020, 4:37 AM
This revision was automatically updated to reflect the committed changes.