[clangd] DexIndex implementation prototype

Authored by kbobyrev on Aug 6 2018, 8:16 AM.



This patch is a proof-of-concept Dex index implementation. It has several flaws, which don't allow replacing static MemIndex yet, such as:

  • Not being able to handle queries of small size (less than 3 symbols); a way to solve this is generating trigrams of smaller size and having such incomplete trigrams in the index structure.
  • Speed measurements: while manually editing files in Vim and requesting autocompletion gives an impression that the performance is at least comparable with the current static index, having actual numbers is important because we don't want to hurt the users and roll out slow code. Eric (@ioeric) suggested that we should only replace MemIndex as soon as we have the evidence that this is not a regression in terms of performance. An approach which is likely to be successful here is to wait until we have benchmark library in the LLVM core repository, which is something I have suggested in the LLVM mailing lists, received positive feedback on and started working on. I will add a dependency as soon as the suggested patch is out for a review (currently there's at least one complication which is being addressed by https://github.com/google/benchmark/pull/649). Key performance improvements for iterators are sorting by cost and the limit iterator.
  • Quality measurements: currently, boosting iterator and two-phase lookup stage are not implemented, without these the quality is likely to be worse than the current implementation can yield. Measuring quality is tricky, but another suggestion in the offline discussion was that the drop-in replacement should only happen after Boosting iterators implementation (and subsequent query enhancement).

The proposed changes do not affect Clangd functionality or performance, DexIndex is only used in unit tests and not in production code.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
ioeric added inline comments.Aug 8 2018, 7:01 AM
67 ↗(On Diff #159658)

I'm not quite sure what this FIXME means. What code do you want to share between them?

But we do want to refactor the code a bit to make it easier to follow.

92 ↗(On Diff #159658)

Can we avoid creating scope iterators in the first place if they are not going to be used?

98 ↗(On Diff #159658)

It seems that the lack of limiting iterators can make the query very inefficient. Maybe we should implement the limiting iterators before getting the index in, in case people start using it before it's ready?

kbobyrev updated this revision to Diff 159908.Aug 9 2018, 6:14 AM
kbobyrev marked 15 inline comments as done.

Address a round of comments. Also put FIXMEs where appropriate for the future changes.

ioeric added inline comments.Aug 10 2018, 3:02 AM
29 ↗(On Diff #159908)

Calculating quality on each comparison can also get expensive. I think we could store the quality.

37 ↗(On Diff #159908)

nit: use try_emplace to save one lookup?

62 ↗(On Diff #159908)

I think we should let the helpers grab the lock. Some preparatory work doesn't require lock.

FWIW, I think the separation of short and long code paths is heading in a wrong direction. And it's also pretty hard to find a very clean abstraction here. For example, there is some overlaps in both functions, and useCallback seems a bit awkward.

As all this would probably go away after D50517 anyway, I think we could try to get that patch landed and incorporate it into this patch? If you prefer to get this patch in first, please add FIXME somewhere to make it clear that the divergence is temporary.

82 ↗(On Diff #159908)

nit: avoid autohere as the type of Score is not obvious here.

84 ↗(On Diff #159908)

Put this FIXME on the for-loop level as iterating all symbols is the problem.

And I think the FIXME could simply be

FIXME(...): This can be very expensive. We should first filter symbols by scopes (with scope iterators).

We can leave out the details/options as they are not interesting for most readers (as they are mostly concerns about scope filtering). In general, we should try to keep comments in documentation brief to keep the code shorter and easier to read.

94 ↗(On Diff #159908)

nit: if (Score)

95 ↗(On Diff #159908)

The code here is trivial, so the comment seems redundant.

102 ↗(On Diff #159908)

For clarity, - (*Score) * quality.

Please also comment on why we negate the number here?

175 ↗(On Diff #159908)

Shouldn't Scores already be sorted for both short query and long query?

12 ↗(On Diff #159908)

nit: we don't really need to mention MemIndex here as it's likely to be replaced soon, and the comment will outdated then.

It's okay to mention why Dex is cool though :)

60 ↗(On Diff #159908)

nit: The comment about implementation details should go with the implementation. Same below.

65 ↗(On Diff #159908)

Req.Query.size() < 3?

84 ↗(On Diff #159658)

nit: I'd try to avoid tying documentation to the current state as it can easily get outdated. If you want to include future changes, consider making it more explicit. For example:

Returns the tokens which are given symbol's characteristics. For example, ... trigrams and scopes.
FIXME: support more tokens types:
- path proxmity
- ...
88 ↗(On Diff #159908)

Dependency on Index.h form Token.h seems strange. I think this function should probably belong in DexIndex.cpp? Do we expect this to be used outside of DexIndex?

kbobyrev updated this revision to Diff 160104.Aug 10 2018, 7:24 AM
kbobyrev marked 12 inline comments as done.

Address most comments.

kbobyrev updated this revision to Diff 160146.Aug 10 2018, 10:51 AM
kbobyrev marked 2 inline comments as done.

Store symbol qualities (so that it's not computed each time when requested which might be expensive). Use operator[] to construct the value for inverted index when key is not inserted yet.

kbobyrev planned changes to this revision.Aug 13 2018, 11:16 AM

As discussed offline, I should update the patch to reflect changes accepted in https://reviews.llvm.org/D50517.

kbobyrev updated this revision to Diff 160576.Aug 14 2018, 7:10 AM

Don't separate the logic for "long" and "short" queries: D50517 (rCTE339548) introduced incomplete trigrams which can be used on for "short" queries, too.

Don't separate the logic for "long" and "short" queries: D50517 (rCTE339548) introduced incomplete trigrams which can be used on for "short" queries, too.

Have you forgotten to upload the new revision? :)

kbobyrev updated this revision to Diff 161017.Aug 16 2018, 6:28 AM

Sorry, the last diff was the old one. Should be correct now.

ioeric added inline comments.Aug 16 2018, 7:54 AM
25 ↗(On Diff #161017)

Why is this enforced? fuzzyFind doesn't say anything about callback order.

Also useCallback seems to be the right abstraction to me; different requests can have different callback behaviors. I think we could simply inline the code here.

99 ↗(On Diff #161017)

Any reason we are not doing the two-stage scoring now? Retrieving while scoring with more expensive scoring seems to be diverging from the expected design. I think we can retrieve a relatively large number of symbols (e.g. a fixed large number or 100*MaxCandidateCount?) before re-scoring them with more expensive scorers (e.g. fuzzy match), as consuming only Req.MaxCandidateCount symbols from the iterator tree can easily leave out good candidates (e.g. those that would've gotten good fuzzy match scores).

110 ↗(On Diff #161017)

It seems that the trigram generation could be done outside of the critical section?

111 ↗(On Diff #161017)

I think we could pull some helper functions here to make the code a bit easier to follow e.g. createTrigramIterators(TrigramTokens), createScopeIterators(scopes)...

50 ↗(On Diff #161017)

Why virtual?

1 ↗(On Diff #161017)

As this file contains helper functions for testing indexes, I'd suggest calling this TestIndex.h like TestTU.h.

kbobyrev updated this revision to Diff 161202.Aug 17 2018, 3:51 AM
kbobyrev marked 6 inline comments as done.

Address a round of comments.

ioeric added inline comments.Aug 17 2018, 6:27 AM
86 ↗(On Diff #161202)

createTrigramIterators and createScopeIterators use InvertedIndex, so they should be called in the critical section.


/// ....
/// Requires: Called from a critical section of `InvertedIndex`.
std::vector<std::unique_ptr<Iterator>> createTrigramIterators(
    const llvm::DenseMap<Token, PostingList> &InvertedIndex, TrigramTokens);
102 ↗(On Diff #161202)

Maybe add a FIXME about finding the best pre-scoring retrieval threshold. I'm not sure if 100*MaxCandidateCount would work well in practice. For example, if the MaxCandidateCount is small e.g. 5, then the retrieval threshold would only be 50, which still seems to be too small.

110 ↗(On Diff #161202)

I don't think this assertion is well justified. I think we should just skip if the fuzzy match fails.

118 ↗(On Diff #161202)

I think we should use a priority queue so that we don't need to store/sort all retrieved symbols.

367 ↗(On Diff #161202)

This seems to be testing mergeIndex(...) which is not relevant in this patch?

388 ↗(On Diff #161202)

Now that we are handling both short and long queries. I think we can address this FIXME in this patch?

540 ↗(On Diff #161202)

Again, mergeIndex(...) is not interesting here.

kbobyrev updated this revision to Diff 161235.Aug 17 2018, 6:56 AM
kbobyrev marked 7 inline comments as done.

Address another round of comments.

Almost LG! Just a few more nits.

87 ↗(On Diff #161235)

nit: move SymbolDocIDs and Top closer to where they're used.

97 ↗(On Diff #161235)

I think we should let createScopeIterator handle empty scope list case; it can return an empty list anyway.

109 ↗(On Diff #161235)

This is not a proper place to set More. It's already handled below.

112 ↗(On Diff #161235)

nit: use range-based for loop?

113 ↗(On Diff #161235)

nit: Maybe take const auto &Sym?

119 ↗(On Diff #161235)

nit: - (*Score) * ... for readability.

127 ↗(On Diff #161235)

Can we simply iterate without pop()?

150 ↗(On Diff #161235)

This assumes that createTrigramIterator and createScopeIterator are already guarded by the mutex, which is implicit. I think we can make it clearer by making these local helpers that take in InvertedIndex` with the requirement that local has been acquired.

366 ↗(On Diff #161235)

What tests do we want? If it's related to the changes in this patch, we should add it now. Tests shouldn't be FIXME :)

kbobyrev updated this revision to Diff 161252.Aug 17 2018, 7:57 AM
kbobyrev marked 9 inline comments as done.

Address another round of comments.

ioeric added inline comments.Aug 17 2018, 8:17 AM
97 ↗(On Diff #161252)

As discussed offline, triggering assertion seems to be a pretty bad behavior. Although the trigram generation, as you suggested, always more than one token, we should try to get rid of this FIXME by introducing the true iterator as proposed here.

121 ↗(On Diff #161252)

Still, we shouldn't set More here.

58 ↗(On Diff #161252)

nit: add /*GUARDED_BY(Mutex)*/

366 ↗(On Diff #161252)

Please add tests with empty Query.

kbobyrev updated this revision to Diff 161273.Aug 17 2018, 8:55 AM
kbobyrev marked 3 inline comments as done.

Address all the comment, except the one about True iterators.

kbobyrev planned changes to this revision.Aug 17 2018, 8:55 AM

I should create another patch with True iterator to address the last comment.

97 ↗(On Diff #161235)

Yes, but it returns an iterator now and OrIterator (just like any other iterator) has to have non-empty list of children.

kbobyrev updated this revision to Diff 161432.Aug 20 2018, 1:30 AM
kbobyrev marked 2 inline comments as done.

Use TRUE iterator to ensure validity of the query processing.

ioeric accepted this revision.Aug 20 2018, 5:25 AM

LG. Last few nits and then good to go.

97 ↗(On Diff #161432)

Check !TrigramIterators.empty()?

128 ↗(On Diff #161432)

nit: if (!Score)

149 ↗(On Diff #161432)

This also needs lock.

23 ↗(On Diff #161432)

Add comment about what SymbolID is?

25 ↗(On Diff #161432)

Please add documentation.

46 ↗(On Diff #161432)

Please add documentation.

This revision is now accepted and ready to land.Aug 20 2018, 5:25 AM
kbobyrev updated this revision to Diff 161480.Aug 20 2018, 7:39 AM
kbobyrev marked 6 inline comments as done.

Address post-LGTM comments.

This revision was automatically updated to reflect the committed changes.