[clangd] Generate incomplete trigrams for the Dex index
ClosedPublic

Authored by kbobyrev on Aug 9 2018, 9:14 AM.

Details

Summary

This patch handles trigram generation "short" identifiers and queries. Trigram generator produces incomplete trigrams for short names so that the same query iterator API can be used to match symbols which don't have enough symbols to form a trigram and correctly handle queries which also are not sufficient for generating a full trigram.

Diff Detail

Repository
rL LLVM
kbobyrev created this revision.Aug 9 2018, 9:14 AM
kbobyrev planned changes to this revision.Aug 9 2018, 9:44 AM

This patch is in preview mode and can be useful for the discussion. It's not functional yet, but this will be changed in the future.

The upcoming changes would allow handling short queries introduced in https://reviews.llvm.org/D50337 in a more efficient manner.

@ioeric proposed to generate unigrams for the first letter of the identifier so that the index would only perform prefix match for one-letter completion requests, which I think would be a great performance improvement.

kbobyrev updated this revision to Diff 160071.Aug 10 2018, 1:48 AM

Complete the tests, finish the implementation.

One thought about prefix match suggestion: we should either make it more explicit for the index (e.g. introduce prefixMatch and dispatch fuzzyMatch to prefix matching in case query only contains one "true" symbol) or document this properly. While, as I wrote earlier, I totally support the idea of prefix matching queries of length 1 it might not align with some user expectations and it's also very implicit if we just generate tokens this way and don't mention it anywhere in the DexIndex implementation.

@ioeric, @ilya-biryukov any thoughts?

kbobyrev planned changes to this revision.EditedAug 10 2018, 2:24 AM

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

  • Query: "u"
  • Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

This might be useful for codebases where e.g. each identifier starts with some project prefix (ProjectInstruction, ProjectGraph, etc). For C++, it's better to use namespaces instead of this naming which is not really great, but I am aware of the C++ projects which actually opt for such naming convention. However, in pure C this relatively common practice, e.g. a typical piece of code for GNOME might be

struct _GtkWrapBoxPrivate
{
	GtkOrientation        orientation;
	GtkWrapAllocationMode mode;

	GtkWrapBoxSpreading   horizontal_spreading;
	GtkWrapBoxSpreading   vertical_spreading;

	guint16               vertical_spacing;
	guint16               horizontal_spacing;

	guint16               minimum_line_children;
	guint16               natural_line_children;

	GList                *children;
};

Also, this is better for macros, which can not be put into namespaces anyway and there's BENCHMARK_UNREACHABLE and so on.

I'll update the patch with the proposed solution.

Thanks for the patch!

Complete the tests, finish the implementation.

One thought about prefix match suggestion: we should either make it more explicit for the index (e.g. introduce prefixMatch and dispatch fuzzyMatch to prefix matching in case query only contains one "true" symbol) or document this properly. While, as I wrote earlier, I totally support the idea of prefix matching queries of length 1 it might not align with some user expectations and it's also very implicit if we just generate tokens this way and don't mention it anywhere in the DexIndex implementation.

@ioeric, @ilya-biryukov any thoughts?

(copied my inline comment :)
We should definitely add documentation about it. It should be pretty simple IMO. As the behavior should be easy to infer from samples, and it shouldn't be too surprising for users, I think it would be OK to consider it as implementation detail (like details in how exactly trigrams are generated) without exposing new interfaces for them.

clang-tools-extra/clangd/index/dex/Trigram.cpp
26 ↗(On Diff #160071)

It's a nice to optimization have when we run into oversized posting lists, but this is not necessarily restricted to unigram posting lists. I think the FIXME should live near the general posting list code. I think it's probably also ok to leave it out; it's hard to forget if we do run into problem in the future ;)

74 ↗(On Diff #160071)

Could this be pulled out of the loop? I think what we want is just LowercaseIdentifier[0] right?

I'd probably also pulled that into a function, as the function body is getting larger.

87 ↗(On Diff #160071)

I think we could be more restrictive on bigram generation. I think a bigram prefix of identifier and a bigram prefix of the HEAD substring should work pretty well in practice. For example, for StringStartsWith, you would have st$ and ss$ (prefix of "SSW").

WDYT?

115 ↗(On Diff #160071)

It seems to me that what we need for short queries is simply:

if (Query.empty()) {
   // return empty token
}
if (Query.size() == 1) return {Query + "$$"};
if (Query.size() == 2) return {Query + "$"};

// Longer queries...

?

clang-tools-extra/clangd/index/dex/Trigram.h
39 ↗(On Diff #160071)

Any reason why this should be exposed?

62 ↗(On Diff #160071)

The behavior should be easy to infer from samples. As long as it's not totally expected, I think it would be OK to consider treat as implementation detail (like details in how trigrams are generated).

74 ↗(On Diff #160071)

I'm not quite sure what this means. Could you elaborate?

kbobyrev updated this revision to Diff 160074.Aug 10 2018, 2:32 AM

@ilya-biryukov I have changed the approach to the one we discussed before.

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

  • Query: "u"
  • Symbols: "unique_ptr", "user", "super_user"

    Current implementation would match "unique_ptr" and "user" only. Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

I agree that this can be useful sometime, but I suspect it's relatively rare and might actually compromise ranking quality for the more common use case e.g. the first character users type is the first character of the expected identifier.

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

  • Query: "u"
  • Symbols: "unique_ptr", "user", "super_user"

    Current implementation would match "unique_ptr" and "user" only. Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

And in the case where users want to match super_user, I think it's reasonable to have users type two more characters and match it with use.

kbobyrev updated this revision to Diff 160081.Aug 10 2018, 3:12 AM
kbobyrev marked 5 inline comments as done.

Address a round of comments.

I have added few comments to get additional feedback before further changes are made.

clang-tools-extra/clangd/index/dex/Trigram.cpp
74 ↗(On Diff #160071)

Same as elsewhere, if we have __builtin_whatever the it's not actually the first symbol of the lowercase identifier.

87 ↗(On Diff #160071)

Good idea!

115 ↗(On Diff #160071)

That would mean that we expect the query to be "valid", i.e. only consist of letters and digits. My concern is about what happens if we have "u_" or something similar ("_u", "_u_", "$u$", etc) - in that case we would actually still have to identify the first valid symbol for the trigram, process the string (trim it, etc) which looks very similar to what FuzzyMatching calculateRoles does.

The current approach is rather straightforward and generic, but I can try to change it if you want. My biggest concern is fighting some corner cases and ensuring that the query is "right" on the user (index) side, which might turn out to be more code and ensuring that the "state" is valid throughout the pipeline.

clang-tools-extra/clangd/index/dex/Trigram.h
74 ↗(On Diff #160071)

Added an example and reflected in the other comment.

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

  • Query: "u"
  • Symbols: "unique_ptr", "user", "super_user"

    Current implementation would match "unique_ptr" and "user" only. Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

And in the case where users want to match super_user, I think it's reasonable to have users type two more characters and match it with use.

That would probably yield lower code completion quality for identifiers like GtkWhatever which might be very common in pure C projects and elsewhere. Also, Ilya mentioned that fuzzy matching filter would significantly increase the score of symbols which can be prefix matched and hence they would end up at the top if the quality is actually good. Another thing we can do is to boost prefix matched symbols if your concern is about them being removed after the initial filtering.

I'm personally leaning towards having unigrams for all segment starting symbols, but if you believe that it's certainly bad I can change that and in the future it will be rather trivial to switch if we decide to go backwards. What do you think?

ioeric added inline comments.Aug 10 2018, 3:53 AM
clang-tools-extra/clangd/index/dex/Trigram.cpp
74 ↗(On Diff #160071)

I would argue that I would start by typing "_" if I actually want __builtin_whatever. I'm also not sure if this is the case we should optimize for as well; __builtin symbols are already penalized in code completion ranking.

115 ↗(On Diff #160071)

It's not clear what we would want to match with "*_", except for u_ in unique_ptr (maybe).

Generally, as short queries tend to match many more symbols, I think we should try to make them more restrictive and optimize for the most common use case.

kbobyrev updated this revision to Diff 160093.Aug 10 2018, 5:31 AM
kbobyrev marked 8 inline comments as done.

Address issues we discussed with Eric.

ioeric added inline comments.Aug 10 2018, 7:26 AM
clang-tools-extra/clangd/index/dex/Trigram.cpp
33 ↗(On Diff #160093)

This is probably neater as a lambda in generateIdentifierTrigrams , e.g.

auto add = [&](std::string Chars) {
  trigrams.insert(Token(Token::Kind::Trigram, Chars));
}
...
add("abc");
95 ↗(On Diff #160093)

Do you mean *bigrams* here?

96 ↗(On Diff #160093)

Should we also check Roles[J] == Head ?

As bigram posting lists would be significantly larger than those of trigrams, I would suggest being even more restrictive. For example, for "AppleBananaCat", the most common short queries would be "ap" and "ab" (for AB).

119 ↗(On Diff #160093)

Couldn't we generate a bigram "u_$" in this case? I think we can assume prefix matching in this case, if we generate bigram "u_" for identiifers like "u_*".

In this case, we would like to match "u" only.

Why? If user types "_", I would expect it to be a meaning filter.

141 ↗(On Diff #160093)

For queries like __ or _x, I think we can generate tokens "__$" or _x$.

clang-tools-extra/clangd/index/dex/Trigram.h
49 ↗(On Diff #160093)

nit: the term "symbol" here is confusing. Do you mean "character"?

59 ↗(On Diff #160093)

I think the comment can be simplified a bit:

This also generates incomplete trigrams for short query scenarios:
  * Empty trigram: "$$$"
  * Unigram: the first character of the identifier.
  * Bigrams: a 2-char prefix of the identifier and a bigram of the first two HEAD characters (if it exists).
kbobyrev updated this revision to Diff 160133.Aug 10 2018, 10:11 AM
kbobyrev marked 7 inline comments as done.

Address issues we have discussed with Eric.

ioeric added inline comments.Aug 10 2018, 10:28 AM
clang-tools-extra/clangd/index/dex/Trigram.cpp
31 ↗(On Diff #160133)

nit: s/auto/char/

Maybe just use static instead of an anonymous namespace just for this.

83 ↗(On Diff #160133)

It's probably unclear what FoundFirstHead and FoundSecondHead are for to readers without context (i.e. we are looking first two HEADs). I think it's would be cleaner if we handle this out side of the look e.g. record the first head in the Next initialization loop above.

138 ↗(On Diff #160133)

nit: Chars is only used in the else branch?

143 ↗(On Diff #160133)

I think this can be simplified as:

std::string Chars = LowercaseQuery.substr(std::min(3, LowercaseQuery.size()));
Chars.append(END_MARKER, 3-Chars.size());
UniqueTrigrams.insert(Token(Token::Kind::Trigram, Chars));

also nit: avoid using the term "symbol" here.

167 ↗(On Diff #160133)

When would this happen? And why are we reversing the trigram and throwing it away?

clang-tools-extra/clangd/index/dex/Trigram.h
59 ↗(On Diff #160093)

nit: and the previous paragraph Special kind of trigrams ... is redundant now IMO.

ioeric added inline comments.Aug 10 2018, 10:29 AM
clang-tools-extra/clangd/index/dex/Trigram.cpp
83 ↗(On Diff #160133)

Sorry, full of typos here:

I think it would be cleaner if we handle this outside of the loop e.g. record the first head in the Next initialization loop above.

kbobyrev updated this revision to Diff 160154.Aug 10 2018, 11:16 AM
kbobyrev marked 7 inline comments as done.

Address a round of comments.

ioeric accepted this revision.Aug 10 2018, 12:30 PM

lg! Thanks for the changes!

clang-tools-extra/clangd/index/dex/Trigram.cpp
147 ↗(On Diff #160154)

nit: inline this variable? You don't need to count below as insert duplicates for you already.

This revision is now accepted and ready to land.Aug 10 2018, 12:30 PM
kbobyrev updated this revision to Diff 160302.Aug 13 2018, 1:20 AM
kbobyrev marked an inline comment as done.
kbobyrev edited the summary of this revision. (Show Details)

Address the post-LGTM comment.

This revision was automatically updated to reflect the committed changes.