[clangd] DexIndex implementation prototype
Needs ReviewPublic

Authored by kbobyrev on Mon, Aug 6, 8:16 AM.

Details

Summary

This patch is a proof-of-concept Dex index implementation. It has several flaws, which don't allow replacing static MemIndex yet, such as:

  • Not being able to handle queries of small size (less than 3 symbols); a way to solve this is generating trigrams of smaller size and having such incomplete trigrams in the index structure.
  • Speed measurements: while manually editing files in Vim and requesting autocompletion gives an impression that the performance is at least comparable with the current static index, having actual numbers is important because we don't want to hurt the users and roll out slow code. Eric (@ioeric) suggested that we should only replace MemIndex as soon as we have the evidence that this is not a regression in terms of performance. An approach which is likely to be successful here is to wait until we have benchmark library in the LLVM core repository, which is something I have suggested in the LLVM mailing lists, received positive feedback on and started working on. I will add a dependency as soon as the suggested patch is out for a review (currently there's at least one complication which is being addressed by https://github.com/google/benchmark/pull/649). Key performance improvements for iterators are sorting by cost and the limit iterator.
  • Quality measurements: currently, boosting iterator and two-phase lookup stage are not implemented, without these the quality is likely to be worse than the current implementation can yield. Measuring quality is tricky, but another suggestion in the offline discussion was that the drop-in replacement should only happen after Boosting iterators implementation (and subsequent query enhancement).

The proposed changes do not affect Clangd functionality or performance, DexIndex is only used in unit tests and not in production code.

Diff Detail

kbobyrev created this revision.Mon, Aug 6, 8:16 AM
kbobyrev planned changes to this revision.Mon, Aug 6, 8:19 AM

The patch is currently in preview-mode; I have to make few changes:

  • Improve testing infrastructure; one possible way would be to use exactly the same code MemIndex currently does as it is meant to be a drop-in replacement. An existing obstacle would be not handling <3 long queries, but it's not hard to fix.
  • Documenting DexIndex implementation and thinking about how to abstract out very similar code pieces shared with MemIndex. The proposed implementation is rather straightforward, but few pieces are identical to MemIndex which causes some code duplication.
kbobyrev updated this revision to Diff 159463.Tue, Aug 7, 1:26 AM

Don't resize retrieved symbols vector, simply let callback process at most MaxCandidateCount items.

kbobyrev planned changes to this revision.Tue, Aug 7, 1:26 AM

As discussed offline, incomplete trigrams (unigrams and bigrams generation) should be a blocker for this patch, because otherwise it isn't functional. Once incomplete trigrams are in, MemIndex tests can be reused for DexIndex to ensure stability.

kbobyrev updated this revision to Diff 159515.Tue, Aug 7, 8:23 AM

Continue implementing Proof of Concept Dex-based static index replacement.

This diff adds short query processing, the current solution does not utilize iterators framework (unlike the general queries) yet and is a subject to change. As discussed offline, this implementation should lean towards simplicity and usability rather then premature optimization.

The patch is still not ready for a comprehensive review yet, these are few points which should be addressed before the review is live:

  • Code duplication should be reduced as much as possible. DexIndex is likely to become way more sophisticated than MemIndex in the future and hence it does not simply inherit or reuse MemIndex, this is also a reason why (as discussed offline) code duplication in unit tests is not that bad keeping in mind that the functionality and implementation of both types of index will diverge in the future. However, it's better to abstract out as much as possible if the implementation does not become less flexible and cross-dependencies are not introduced in the process.
  • Slightly cleaning up unit tests (IndexHelpers.(h|cpp) is not a very good name for the new file used by both MemIndex and DexIndex testing framework, code duplication is also a slight concern)
kbobyrev planned changes to this revision.Tue, Aug 7, 8:23 AM
kbobyrev updated this revision to Diff 159658.Wed, Aug 8, 2:27 AM

Minor code cleanup. This is now a fully functional symbol index.

I have reflected my concerns and uncertainties in FIXMEs, please indicate if you think there's something to improve in this patch. In general, I believe it is ready for a review.

ioeric added inline comments.Wed, Aug 8, 7:01 AM
clang-tools-extra/clangd/index/MemIndex.h
45 ↗(On Diff #159658)

I think this FIXME still applies here. This can probably go away when we completely get rid of MemIndex.

clang-tools-extra/clangd/index/dex/DexIndex.cpp
32

Couldn we build the inverted index also outside of the critical section? As this blocks ongoing index requests, we should do as little work as possible in the CS.

41

nit: const auto * Sym

58

nit: Initialize to false

68

I'm not quite sure what this FIXME means. What code do you want to share between them?

But we do want to refactor the code a bit to make it easier to follow.

93

Can we avoid creating scope iterators in the first place if they are not going to be used?

99

It seems that the lack of limiting iterators can make the query very inefficient. Maybe we should implement the limiting iterators before getting the index in, in case people start using it before it's ready?

103

Could you document what the approach you are taking to handle short queries? It seems that this can be very expensive, for example, if all matching symbols are at the end of the DenseMap. Short queries happen very often, so we need to make sure index handles them efficiently.

103

Did you mean to iterate on Symbols which is sorted by quality?

clang-tools-extra/clangd/index/dex/DexIndex.h
9

I think this file could use some high-level documentation.

28

There is some assumption about Symbols (like`MemIndex::build`). Please add documentation.

32

Do we need this for this patch? If not, I'd suggest leaving it out and revisit when we actually need it (e.g. when replacing MemIndex); otherwise, I think what we want is basically a function that takes a SymbolSlab and returns std::shared_ptr<std::vector<const Symbol *>>. This can probably live in Index.h as you suggested.

48

Unlike MemIndex where this is used as an actual index, here it's simply a lookup table, IIUC? Maybe just SymbolsByID or lookupTable?

50

This can use some comments. Could be useful for people who are not familiar with inverted index.

clang-tools-extra/clangd/index/dex/Token.h
84

Please document this function.

kbobyrev updated this revision to Diff 159908.Thu, Aug 9, 6:14 AM
kbobyrev marked 15 inline comments as done.

Address a round of comments. Also put FIXMEs where appropriate for the future changes.

ioeric added inline comments.Fri, Aug 10, 3:02 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
30

Calculating quality on each comparison can also get expensive. I think we could store the quality.

38

nit: use try_emplace to save one lookup?

63

I think we should let the helpers grab the lock. Some preparatory work doesn't require lock.

FWIW, I think the separation of short and long code paths is heading in a wrong direction. And it's also pretty hard to find a very clean abstraction here. For example, there is some overlaps in both functions, and useCallback seems a bit awkward.

As all this would probably go away after D50517 anyway, I think we could try to get that patch landed and incorporate it into this patch? If you prefer to get this patch in first, please add FIXME somewhere to make it clear that the divergence is temporary.

83

nit: avoid autohere as the type of Score is not obvious here.

85

Put this FIXME on the for-loop level as iterating all symbols is the problem.

And I think the FIXME could simply be

FIXME(...): This can be very expensive. We should first filter symbols by scopes (with scope iterators).

We can leave out the details/options as they are not interesting for most readers (as they are mostly concerns about scope filtering). In general, we should try to keep comments in documentation brief to keep the code shorter and easier to read.

95

nit: if (Score)

96

The code here is trivial, so the comment seems redundant.

103

For clarity, - (*Score) * quality.

Please also comment on why we negate the number here?

176

Shouldn't Scores already be sorted for both short query and long query?

clang-tools-extra/clangd/index/dex/DexIndex.h
13

nit: we don't really need to mention MemIndex here as it's likely to be replaced soon, and the comment will outdated then.

It's okay to mention why Dex is cool though :)

61

nit: The comment about implementation details should go with the implementation. Same below.

66

Req.Query.size() < 3?

clang-tools-extra/clangd/index/dex/Token.h
84

nit: I'd try to avoid tying documentation to the current state as it can easily get outdated. If you want to include future changes, consider making it more explicit. For example:

Returns the tokens which are given symbol's characteristics. For example, ... trigrams and scopes.
FIXME: support more tokens types:
- path proxmity
- ...
88

Dependency on Index.h form Token.h seems strange. I think this function should probably belong in DexIndex.cpp? Do we expect this to be used outside of DexIndex?

kbobyrev updated this revision to Diff 160104.Fri, Aug 10, 7:24 AM
kbobyrev marked 12 inline comments as done.

Address most comments.

kbobyrev updated this revision to Diff 160146.Fri, Aug 10, 10:51 AM
kbobyrev marked 2 inline comments as done.

Store symbol qualities (so that it's not computed each time when requested which might be expensive). Use operator[] to construct the value for inverted index when key is not inserted yet.

kbobyrev planned changes to this revision.Mon, Aug 13, 11:16 AM

As discussed offline, I should update the patch to reflect changes accepted in https://reviews.llvm.org/D50517.

kbobyrev updated this revision to Diff 160576.Tue, Aug 14, 7:10 AM

Don't separate the logic for "long" and "short" queries: D50517 (rCTE339548) introduced incomplete trigrams which can be used on for "short" queries, too.