This is an archive of the discontinued LLVM Phabricator instance.

[clangd] DexIndex implementation prototype
ClosedPublic

Authored by kbobyrev on Aug 6 2018, 8:16 AM.

Details

Summary

This patch is a proof-of-concept Dex index implementation. It has several flaws, which don't allow replacing static MemIndex yet, such as:

  • Not being able to handle queries of small size (less than 3 symbols); a way to solve this is generating trigrams of smaller size and having such incomplete trigrams in the index structure.
  • Speed measurements: while manually editing files in Vim and requesting autocompletion gives an impression that the performance is at least comparable with the current static index, having actual numbers is important because we don't want to hurt the users and roll out slow code. Eric (@ioeric) suggested that we should only replace MemIndex as soon as we have the evidence that this is not a regression in terms of performance. An approach which is likely to be successful here is to wait until we have benchmark library in the LLVM core repository, which is something I have suggested in the LLVM mailing lists, received positive feedback on and started working on. I will add a dependency as soon as the suggested patch is out for a review (currently there's at least one complication which is being addressed by https://github.com/google/benchmark/pull/649). Key performance improvements for iterators are sorting by cost and the limit iterator.
  • Quality measurements: currently, boosting iterator and two-phase lookup stage are not implemented, without these the quality is likely to be worse than the current implementation can yield. Measuring quality is tricky, but another suggestion in the offline discussion was that the drop-in replacement should only happen after Boosting iterators implementation (and subsequent query enhancement).

The proposed changes do not affect Clangd functionality or performance, DexIndex is only used in unit tests and not in production code.

Diff Detail

Repository
rL LLVM

Event Timeline

kbobyrev created this revision.Aug 6 2018, 8:16 AM
kbobyrev planned changes to this revision.Aug 6 2018, 8:19 AM

The patch is currently in preview-mode; I have to make few changes:

  • Improve testing infrastructure; one possible way would be to use exactly the same code MemIndex currently does as it is meant to be a drop-in replacement. An existing obstacle would be not handling <3 long queries, but it's not hard to fix.
  • Documenting DexIndex implementation and thinking about how to abstract out very similar code pieces shared with MemIndex. The proposed implementation is rather straightforward, but few pieces are identical to MemIndex which causes some code duplication.
kbobyrev updated this revision to Diff 159463.Aug 7 2018, 1:26 AM

Don't resize retrieved symbols vector, simply let callback process at most MaxCandidateCount items.

kbobyrev planned changes to this revision.Aug 7 2018, 1:26 AM

As discussed offline, incomplete trigrams (unigrams and bigrams generation) should be a blocker for this patch, because otherwise it isn't functional. Once incomplete trigrams are in, MemIndex tests can be reused for DexIndex to ensure stability.

kbobyrev updated this revision to Diff 159515.Aug 7 2018, 8:23 AM

Continue implementing Proof of Concept Dex-based static index replacement.

This diff adds short query processing, the current solution does not utilize iterators framework (unlike the general queries) yet and is a subject to change. As discussed offline, this implementation should lean towards simplicity and usability rather then premature optimization.

The patch is still not ready for a comprehensive review yet, these are few points which should be addressed before the review is live:

  • Code duplication should be reduced as much as possible. DexIndex is likely to become way more sophisticated than MemIndex in the future and hence it does not simply inherit or reuse MemIndex, this is also a reason why (as discussed offline) code duplication in unit tests is not that bad keeping in mind that the functionality and implementation of both types of index will diverge in the future. However, it's better to abstract out as much as possible if the implementation does not become less flexible and cross-dependencies are not introduced in the process.
  • Slightly cleaning up unit tests (IndexHelpers.(h|cpp) is not a very good name for the new file used by both MemIndex and DexIndex testing framework, code duplication is also a slight concern)
kbobyrev planned changes to this revision.Aug 7 2018, 8:23 AM
kbobyrev updated this revision to Diff 159658.Aug 8 2018, 2:27 AM

Minor code cleanup. This is now a fully functional symbol index.

I have reflected my concerns and uncertainties in FIXMEs, please indicate if you think there's something to improve in this patch. In general, I believe it is ready for a review.

ioeric added inline comments.Aug 8 2018, 7:01 AM
clang-tools-extra/clangd/index/MemIndex.h
45 ↗(On Diff #159658)

I think this FIXME still applies here. This can probably go away when we completely get rid of MemIndex.

clang-tools-extra/clangd/index/dex/DexIndex.cpp
31 ↗(On Diff #159658)

Couldn we build the inverted index also outside of the critical section? As this blocks ongoing index requests, we should do as little work as possible in the CS.

40 ↗(On Diff #159658)

nit: const auto * Sym

57 ↗(On Diff #159658)

nit: Initialize to false

102 ↗(On Diff #159658)

Could you document what the approach you are taking to handle short queries? It seems that this can be very expensive, for example, if all matching symbols are at the end of the DenseMap. Short queries happen very often, so we need to make sure index handles them efficiently.

102 ↗(On Diff #159658)

Did you mean to iterate on Symbols which is sorted by quality?

clang-tools-extra/clangd/index/dex/DexIndex.h
8 ↗(On Diff #159658)

I think this file could use some high-level documentation.

27 ↗(On Diff #159658)

There is some assumption about Symbols (like`MemIndex::build`). Please add documentation.

31 ↗(On Diff #159658)

Do we need this for this patch? If not, I'd suggest leaving it out and revisit when we actually need it (e.g. when replacing MemIndex); otherwise, I think what we want is basically a function that takes a SymbolSlab and returns std::shared_ptr<std::vector<const Symbol *>>. This can probably live in Index.h as you suggested.

47 ↗(On Diff #159658)

Unlike MemIndex where this is used as an actual index, here it's simply a lookup table, IIUC? Maybe just SymbolsByID or lookupTable?

49 ↗(On Diff #159658)

This can use some comments. Could be useful for people who are not familiar with inverted index.

clang-tools-extra/clangd/index/dex/Token.h
84 ↗(On Diff #159658)

Please document this function.

ioeric added inline comments.Aug 8 2018, 7:01 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
67 ↗(On Diff #159658)

I'm not quite sure what this FIXME means. What code do you want to share between them?

But we do want to refactor the code a bit to make it easier to follow.

92 ↗(On Diff #159658)

Can we avoid creating scope iterators in the first place if they are not going to be used?

98 ↗(On Diff #159658)

It seems that the lack of limiting iterators can make the query very inefficient. Maybe we should implement the limiting iterators before getting the index in, in case people start using it before it's ready?

kbobyrev updated this revision to Diff 159908.Aug 9 2018, 6:14 AM
kbobyrev marked 15 inline comments as done.

Address a round of comments. Also put FIXMEs where appropriate for the future changes.

ioeric added inline comments.Aug 10 2018, 3:02 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
29 ↗(On Diff #159908)

Calculating quality on each comparison can also get expensive. I think we could store the quality.

37 ↗(On Diff #159908)

nit: use try_emplace to save one lookup?

62 ↗(On Diff #159908)

I think we should let the helpers grab the lock. Some preparatory work doesn't require lock.

FWIW, I think the separation of short and long code paths is heading in a wrong direction. And it's also pretty hard to find a very clean abstraction here. For example, there is some overlaps in both functions, and useCallback seems a bit awkward.

As all this would probably go away after D50517 anyway, I think we could try to get that patch landed and incorporate it into this patch? If you prefer to get this patch in first, please add FIXME somewhere to make it clear that the divergence is temporary.

82 ↗(On Diff #159908)

nit: avoid autohere as the type of Score is not obvious here.

84 ↗(On Diff #159908)

Put this FIXME on the for-loop level as iterating all symbols is the problem.

And I think the FIXME could simply be

FIXME(...): This can be very expensive. We should first filter symbols by scopes (with scope iterators).

We can leave out the details/options as they are not interesting for most readers (as they are mostly concerns about scope filtering). In general, we should try to keep comments in documentation brief to keep the code shorter and easier to read.

94 ↗(On Diff #159908)

nit: if (Score)

95 ↗(On Diff #159908)

The code here is trivial, so the comment seems redundant.

102 ↗(On Diff #159908)

For clarity, - (*Score) * quality.

Please also comment on why we negate the number here?

175 ↗(On Diff #159908)

Shouldn't Scores already be sorted for both short query and long query?

clang-tools-extra/clangd/index/dex/DexIndex.h
12 ↗(On Diff #159908)

nit: we don't really need to mention MemIndex here as it's likely to be replaced soon, and the comment will outdated then.

It's okay to mention why Dex is cool though :)

60 ↗(On Diff #159908)

nit: The comment about implementation details should go with the implementation. Same below.

65 ↗(On Diff #159908)

Req.Query.size() < 3?

clang-tools-extra/clangd/index/dex/Token.h
84 ↗(On Diff #159658)

nit: I'd try to avoid tying documentation to the current state as it can easily get outdated. If you want to include future changes, consider making it more explicit. For example:

Returns the tokens which are given symbol's characteristics. For example, ... trigrams and scopes.
FIXME: support more tokens types:
- path proxmity
- ...
88 ↗(On Diff #159908)

Dependency on Index.h form Token.h seems strange. I think this function should probably belong in DexIndex.cpp? Do we expect this to be used outside of DexIndex?

kbobyrev updated this revision to Diff 160104.Aug 10 2018, 7:24 AM
kbobyrev marked 12 inline comments as done.

Address most comments.

kbobyrev updated this revision to Diff 160146.Aug 10 2018, 10:51 AM
kbobyrev marked 2 inline comments as done.

Store symbol qualities (so that it's not computed each time when requested which might be expensive). Use operator[] to construct the value for inverted index when key is not inserted yet.

kbobyrev planned changes to this revision.Aug 13 2018, 11:16 AM

As discussed offline, I should update the patch to reflect changes accepted in https://reviews.llvm.org/D50517.

kbobyrev updated this revision to Diff 160576.Aug 14 2018, 7:10 AM

Don't separate the logic for "long" and "short" queries: D50517 (rCTE339548) introduced incomplete trigrams which can be used on for "short" queries, too.

Don't separate the logic for "long" and "short" queries: D50517 (rCTE339548) introduced incomplete trigrams which can be used on for "short" queries, too.

Have you forgotten to upload the new revision? :)

kbobyrev updated this revision to Diff 161017.Aug 16 2018, 6:28 AM

Sorry, the last diff was the old one. Should be correct now.

ioeric added inline comments.Aug 16 2018, 7:54 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
25 ↗(On Diff #161017)

Why is this enforced? fuzzyFind doesn't say anything about callback order.

Also useCallback seems to be the right abstraction to me; different requests can have different callback behaviors. I think we could simply inline the code here.

99 ↗(On Diff #161017)

Any reason we are not doing the two-stage scoring now? Retrieving while scoring with more expensive scoring seems to be diverging from the expected design. I think we can retrieve a relatively large number of symbols (e.g. a fixed large number or 100*MaxCandidateCount?) before re-scoring them with more expensive scorers (e.g. fuzzy match), as consuming only Req.MaxCandidateCount symbols from the iterator tree can easily leave out good candidates (e.g. those that would've gotten good fuzzy match scores).

110 ↗(On Diff #161017)

It seems that the trigram generation could be done outside of the critical section?

111 ↗(On Diff #161017)

I think we could pull some helper functions here to make the code a bit easier to follow e.g. createTrigramIterators(TrigramTokens), createScopeIterators(scopes)...

clang-tools-extra/clangd/index/dex/DexIndex.h
50 ↗(On Diff #161017)

Why virtual?

clang-tools-extra/unittests/clangd/TestIndexOperations.h
1 ↗(On Diff #161017)

As this file contains helper functions for testing indexes, I'd suggest calling this TestIndex.h like TestTU.h.

kbobyrev updated this revision to Diff 161202.Aug 17 2018, 3:51 AM
kbobyrev marked 6 inline comments as done.

Address a round of comments.

ioeric added inline comments.Aug 17 2018, 6:27 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
86 ↗(On Diff #161202)

createTrigramIterators and createScopeIterators use InvertedIndex, so they should be called in the critical section.

Maybe

/// ....
/// Requires: Called from a critical section of `InvertedIndex`.
std::vector<std::unique_ptr<Iterator>> createTrigramIterators(
    const llvm::DenseMap<Token, PostingList> &InvertedIndex, TrigramTokens);
102 ↗(On Diff #161202)

Maybe add a FIXME about finding the best pre-scoring retrieval threshold. I'm not sure if 100*MaxCandidateCount would work well in practice. For example, if the MaxCandidateCount is small e.g. 5, then the retrieval threshold would only be 50, which still seems to be too small.

110 ↗(On Diff #161202)

I don't think this assertion is well justified. I think we should just skip if the fuzzy match fails.

118 ↗(On Diff #161202)

I think we should use a priority queue so that we don't need to store/sort all retrieved symbols.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
367 ↗(On Diff #161202)

This seems to be testing mergeIndex(...) which is not relevant in this patch?

388 ↗(On Diff #161202)

Now that we are handling both short and long queries. I think we can address this FIXME in this patch?

540 ↗(On Diff #161202)

Again, mergeIndex(...) is not interesting here.

kbobyrev updated this revision to Diff 161235.Aug 17 2018, 6:56 AM
kbobyrev marked 7 inline comments as done.

Address another round of comments.

Almost LG! Just a few more nits.

clang-tools-extra/clangd/index/dex/DexIndex.cpp
87 ↗(On Diff #161235)

nit: move SymbolDocIDs and Top closer to where they're used.

97 ↗(On Diff #161235)

I think we should let createScopeIterator handle empty scope list case; it can return an empty list anyway.

109 ↗(On Diff #161235)

This is not a proper place to set More. It's already handled below.

112 ↗(On Diff #161235)

nit: use range-based for loop?

113 ↗(On Diff #161235)

nit: Maybe take const auto &Sym?

119 ↗(On Diff #161235)

nit: - (*Score) * ... for readability.

127 ↗(On Diff #161235)

Can we simply iterate without pop()?

150 ↗(On Diff #161235)

This assumes that createTrigramIterator and createScopeIterator are already guarded by the mutex, which is implicit. I think we can make it clearer by making these local helpers that take in InvertedIndex` with the requirement that local has been acquired.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
366 ↗(On Diff #161235)

What tests do we want? If it's related to the changes in this patch, we should add it now. Tests shouldn't be FIXME :)

kbobyrev updated this revision to Diff 161252.Aug 17 2018, 7:57 AM
kbobyrev marked 9 inline comments as done.

Address another round of comments.

ioeric added inline comments.Aug 17 2018, 8:17 AM
clang-tools-extra/clangd/index/dex/DexIndex.cpp
97 ↗(On Diff #161252)

As discussed offline, triggering assertion seems to be a pretty bad behavior. Although the trigram generation, as you suggested, always more than one token, we should try to get rid of this FIXME by introducing the true iterator as proposed here.

121 ↗(On Diff #161252)

Still, we shouldn't set More here.

clang-tools-extra/clangd/index/dex/DexIndex.h
58 ↗(On Diff #161252)

nit: add /*GUARDED_BY(Mutex)*/

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
366 ↗(On Diff #161252)

Please add tests with empty Query.

kbobyrev updated this revision to Diff 161273.Aug 17 2018, 8:55 AM
kbobyrev marked 3 inline comments as done.

Address all the comment, except the one about True iterators.

kbobyrev planned changes to this revision.Aug 17 2018, 8:55 AM

I should create another patch with True iterator to address the last comment.

clang-tools-extra/clangd/index/dex/DexIndex.cpp
97 ↗(On Diff #161235)

Yes, but it returns an iterator now and OrIterator (just like any other iterator) has to have non-empty list of children.

kbobyrev updated this revision to Diff 161432.Aug 20 2018, 1:30 AM
kbobyrev marked 2 inline comments as done.

Use TRUE iterator to ensure validity of the query processing.

ioeric accepted this revision.Aug 20 2018, 5:25 AM

LG. Last few nits and then good to go.

clang-tools-extra/clangd/index/dex/DexIndex.cpp
97 ↗(On Diff #161432)

Check !TrigramIterators.empty()?

128 ↗(On Diff #161432)

nit: if (!Score)

149 ↗(On Diff #161432)

This also needs lock.

clang-tools-extra/unittests/clangd/TestIndex.h
23 ↗(On Diff #161432)

Add comment about what SymbolID is?

25 ↗(On Diff #161432)

Please add documentation.

46 ↗(On Diff #161432)

Please add documentation.

This revision is now accepted and ready to land.Aug 20 2018, 5:25 AM
kbobyrev updated this revision to Diff 161480.Aug 20 2018, 7:39 AM
kbobyrev marked 6 inline comments as done.

Address post-LGTM comments.

This revision was automatically updated to reflect the committed changes.