This is an archive of the discontinued LLVM Phabricator instance.

Differential D50517

[clangd] Generate incomplete trigrams for the Dex index
ClosedPublic

Authored by kbobyrev on Aug 9 2018, 9:14 AM.

Details

Reviewers

ioeric
ilya-biryukov

Commits

rGff2dd9095fa6: [clangd] Generate incomplete trigrams for the Dex index
rCTE339548: [clangd] Generate incomplete trigrams for the Dex index
rL339548: [clangd] Generate incomplete trigrams for the Dex index

Summary

This patch handles trigram generation for "short" identifiers and queries. The trigram generator produces incomplete trigrams for short names so that the same query iterator API can be used to match symbols whose names don't have enough characters to form a trigram, and to correctly handle queries that are also too short to produce a full trigram.

Diff Detail

Event Timeline

kbobyrev created this revision. Aug 9 2018, 9:14 AM

Herald added subscribers: arphaman, jkorous, MaskRay. Aug 9 2018, 9:14 AM

This patch is in preview mode and can be useful for the discussion. It's not functional yet, but this will be changed in the future.

The upcoming changes would allow handling short queries introduced in https://reviews.llvm.org/D50337 in a more efficient manner.

@ioeric proposed to generate unigrams for the first letter of the identifier so that the index would only perform prefix match for one-letter completion requests, which I think would be a great performance improvement.

Complete the tests, finish the implementation.

One thought about prefix match suggestion: we should either make it more explicit for the index (e.g. introduce prefixMatch and dispatch fuzzyMatch to prefix matching in case query only contains one "true" symbol) or document this properly. While, as I wrote earlier, I totally support the idea of prefix matching queries of length 1 it might not align with some user expectations and it's also very implicit if we just generate tokens this way and don't mention it anywhere in the DexIndex implementation.

@ioeric, @ilya-biryukov any thoughts?

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"
Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

This might be useful for codebases where e.g. each identifier starts with some project prefix (ProjectInstruction, ProjectGraph, etc). For C++, namespaces are a better choice than this kind of naming, but I am aware of C++ projects which actually opt for such a naming convention. In pure C, however, this is a relatively common practice, e.g. a typical piece of code for GNOME might be

struct _GtkWrapBoxPrivate
{
	GtkOrientation        orientation;
	GtkWrapAllocationMode mode;

	GtkWrapBoxSpreading   horizontal_spreading;
	GtkWrapBoxSpreading   vertical_spreading;

	guint16               vertical_spacing;
	guint16               horizontal_spacing;

	guint16               minimum_line_children;
	guint16               natural_line_children;

	GList                *children;
};

Also, this is better for macros, which cannot be put into namespaces anyway, e.g. BENCHMARK_UNREACHABLE and so on.

I'll update the patch with the proposed solution.
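
The difference between the two strategies in the "u" example above can be sketched as follows. This is only an illustration: `matchesIdentifierPrefix` and `matchesSegmentHead` are hypothetical names, and the segment detection (a character after a delimiter, or an uppercase letter after a lowercase one) is a simplified assumption standing in for the FuzzyMatch segmentation that clangd actually uses.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// "Current" behaviour in the example: only the first character of the whole
// identifier is compared with the one-character query.
bool matchesIdentifierPrefix(const std::string &Identifier, char Query) {
  return !Identifier.empty() &&
         std::tolower(static_cast<unsigned char>(Identifier[0])) ==
             std::tolower(static_cast<unsigned char>(Query));
}

// "Proposed" behaviour: the query may match the first character of any
// segment. Segment heads are detected with a simplified rule (assumption):
// an alphanumeric character that follows a delimiter, or an uppercase letter
// that follows a lowercase one.
bool matchesSegmentHead(const std::string &Identifier, char Query) {
  const int Wanted = std::tolower(static_cast<unsigned char>(Query));
  for (std::size_t I = 0; I < Identifier.size(); ++I) {
    const unsigned char C = Identifier[I];
    if (!std::isalnum(C))
      continue; // Delimiters such as '_' never match.
    const bool AfterDelimiter =
        I == 0 || !std::isalnum(static_cast<unsigned char>(Identifier[I - 1]));
    const bool CamelHump =
        I > 0 && std::isupper(C) &&
        std::islower(static_cast<unsigned char>(Identifier[I - 1]));
    if ((AfterDelimiter || CamelHump) && std::tolower(C) == Wanted)
      return true;
  }
  return false;
}
```

With these helpers, "super_user" is rejected by the prefix rule but accepted by the segment-head rule, which is the behavioural difference under discussion.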

Thanks for the patch!

In D50517#1194955, @kbobyrev wrote:

Complete the tests, finish the implementation.

One thought about prefix match suggestion: we should either make it more explicit for the index (e.g. introduce prefixMatch and dispatch fuzzyMatch to prefix matching in case query only contains one "true" symbol) or document this properly. While, as I wrote earlier, I totally support the idea of prefix matching queries of length 1 it might not align with some user expectations and it's also very implicit if we just generate tokens this way and don't mention it anywhere in the DexIndex implementation.

@ioeric, @ilya-biryukov any thoughts?

(copied my inline comment :)
We should definitely add documentation about it. It should be pretty simple IMO. As the behavior should be easy to infer from samples, and it shouldn't be too surprising for users, I think it would be OK to consider it as implementation detail (like details in how exactly trigrams are generated) without exposing new interfaces for them.

clang-tools-extra/clangd/index/dex/Trigram.cpp
31	It's a nice optimization to have when we run into oversized posting lists, but this is not necessarily restricted to unigram posting lists. I think the FIXME should live near the general posting list code. I think it's probably also ok to leave it out; it's hard to forget if we do run into problems in the future ;)
71	Could this be pulled out of the loop? I think what we want is just `LowercaseIdentifier[0]` right? I'd probably also pull that into a function, as the function body is getting larger.
79	I think we could be more restrictive on bigram generation. I think a bigram prefix of identifier and a bigram prefix of the HEAD substring should work pretty well in practice. For example, for `StringStartsWith`, you would have `st$` and `ss$` (prefix of "SSW"). WDYT?
108	It seems to me that what we need for short queries is simply: if (Query.empty()) { // return empty token } if (Query.size() == 1) return {Query + "$$"}; if (Query.size() == 2) return {Query + "$"}; // Longer queries... ?
clang-tools-extra/clangd/index/dex/Trigram.h
39	Any reason why this should be exposed?
54	The behavior should be easy to infer from samples. As long as it's not totally unexpected, I think it would be OK to treat it as an implementation detail (like the details of how trigrams are generated).
63	I'm not quite sure what this means. Could you elaborate?

@ilya-biryukov I have changed the approach to the one we discussed before.

In D50517#1194976, @kbobyrev wrote:

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"

Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

I agree that this can be useful sometimes, but I suspect it's relatively rare and might actually compromise ranking quality for the more common use case, e.g. when the first character users type is the first character of the expected identifier.

In D50517#1194976, @kbobyrev wrote:

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"

Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

And in the case where users want to match super_user, I think it's reasonable to have users type two more characters and match it with use.

ioeric mentioned this in D50337: [clangd] DexIndex implementation prototype. Aug 10 2018, 3:02 AM

Address a round of comments.

I have added a few comments to get additional feedback before further changes are made.

clang-tools-extra/clangd/index/dex/Trigram.cpp
71	Same as elsewhere, if we have `__builtin_whatever` then it's not actually the first symbol of the lowercase identifier.
79	Good idea!
108	That would mean that we expect the query to be "valid", i.e. only consist of letters and digits. My concern is about what happens if we have `"u_"` or something similar (`"_u", "_u_", "$u$"`, etc) - in that case we would actually still have to identify the first valid symbol for the trigram, process the string (trim it, etc) which looks very similar to what FuzzyMatching `calculateRoles` does. The current approach is rather straightforward and generic, but I can try to change it if you want. My biggest concern is fighting some corner cases and ensuring that the query is "right" on the user (index) side, which might turn out to be more code and ensuring that the "state" is valid throughout the pipeline.
clang-tools-extra/clangd/index/dex/Trigram.h
63	Added an example and reflected in the other comment.

In D50517#1194990, @ioeric wrote:

In D50517#1194976, @kbobyrev wrote:

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"

Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

And in the case where users want to match super_user, I think it's reasonable to have users type two more characters and match it with use.

That would probably yield lower code completion quality for identifiers like GtkWhatever which might be very common in pure C projects and elsewhere. Also, Ilya mentioned that fuzzy matching filter would significantly increase the score of symbols which can be prefix matched and hence they would end up at the top if the quality is actually good. Another thing we can do is to boost prefix matched symbols if your concern is about them being removed after the initial filtering.

I'm personally leaning towards having unigrams for all segment starting symbols, but if you believe that it's certainly bad I can change that and in the future it will be rather trivial to switch if we decide to go backwards. What do you think?

ioeric added inline comments. Aug 10 2018, 3:53 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
71	I would argue that I would start by typing "_" if I actually want `__builtin_whatever`. I'm also not sure if this is the case we should optimize for as well; __builtin symbols are already penalized in code completion ranking.
108	It's not clear what we would want to match with "*_", except for `u_` in `unique_ptr` (maybe). Generally, as short queries tend to match many more symbols, I think we should try to make them more restrictive and optimize for the most common use case.

Address issues we discussed with Eric.

ioeric added inline comments. Aug 10 2018, 7:26 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
38	This is probably neater as a lambda in `generateIdentifierTrigrams` , e.g. auto add = [&](std::string Chars) { trigrams.insert(Token(Token::Kind::Trigram, Chars)); } ... add("abc");
79	Do you mean bigrams here?
80	Should we also check `Roles[J] == Head` ? As bigram posting lists would be significantly larger than those of trigrams, I would suggest being even more restrictive. For example, for "AppleBananaCat", the most common short queries would be "ap" and "ab" (for `AB`).
107	Couldn't we generate a bigram "u_$" in this case? I think we can assume prefix matching in this case, if we generate bigram "u_" for identifiers like "u_*". In this case, we would like to match "u" only. Why? If user types "_", I would expect it to be a meaningful filter.
123	For queries like `__` or `_x`, I think we can generate tokens "__$" or `_x$`.
clang-tools-extra/clangd/index/dex/Trigram.h
54	nit: the term "symbol" here is confusing. Do you mean "character"?
64	I think the comment can be simplified a bit: This also generates incomplete trigrams for short query scenarios: * Empty trigram: "$$$" * Unigram: the first character of the identifier. * Bigrams: a 2-char prefix of the identifier and a bigram of the first two HEAD characters (if it exists).
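
The three kinds of incomplete trigrams listed in the last comment can be sketched as below. This is a hedged illustration and not the code from the patch: `segmentHeads` and `incompleteTrigrams` are hypothetical names, plain `std::string` values stand in for `Token`, `std::set` stands in for `DenseSet`, and the head detection is a simplified substitute for `calculateRoles`. Emitting "$$$" for every identifier is also an assumption made for the sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Simplified stand-in for the FuzzyMatch segmentation (assumption): returns
// the lowercased first character of every segment of the identifier.
std::string segmentHeads(const std::string &Identifier) {
  std::string Heads;
  for (std::size_t I = 0; I < Identifier.size(); ++I) {
    const unsigned char C = Identifier[I];
    if (!std::isalnum(C))
      continue;
    const bool AfterDelimiter =
        I == 0 || !std::isalnum(static_cast<unsigned char>(Identifier[I - 1]));
    const bool CamelHump =
        I > 0 && std::isupper(C) &&
        std::islower(static_cast<unsigned char>(Identifier[I - 1]));
    if (AfterDelimiter || CamelHump)
      Heads.push_back(static_cast<char>(std::tolower(C)));
  }
  return Heads;
}

// Produces the empty trigram, the unigram, and the two bigrams described in
// the review comment, each padded with '$' to three characters.
std::set<std::string> incompleteTrigrams(const std::string &Identifier) {
  std::string Lower;
  for (unsigned char C : Identifier)
    Lower.push_back(static_cast<char>(std::tolower(C)));

  std::set<std::string> Result = {"$$$"};          // Empty trigram.
  if (!Lower.empty())
    Result.insert(std::string(1, Lower[0]) + "$$"); // Unigram: first character.
  if (Lower.size() >= 2)
    Result.insert(Lower.substr(0, 2) + "$");        // Bigram: 2-char prefix.
  const std::string Heads = segmentHeads(Identifier);
  if (Heads.size() >= 2)
    Result.insert(Heads.substr(0, 2) + "$");        // Bigram: first two HEADs.
  return Result;
}
```

For `StringStartsWith` this yields `s$$`, `st$` and `ss$` in addition to `$$$`, which agrees with the `st$` / `ss$` example given earlier in the review.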

Address issues we have discussed with Eric.

ioeric added inline comments. Aug 10 2018, 10:28 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
36	nit: s/auto/char/ Maybe just use `static` instead of an anonymous namespace just for this.
82	It's probably unclear what `FoundFirstHead` and `FoundSecondHead` are for to readers without context (i.e. we are looking first two HEADs). I think it's would be cleaner if we handle this out side of the look e.g. record the first head in the `Next` initialization loop above.
116	nit: `Chars` is only used in the else branch?
123	When would this happen? And why are we reversing the trigram and throwing it away?
124	I think this can be simplified as: std::string Chars = LowercaseQuery.substr(std::min(3, LowercaseQuery.size())); Chars.append(END_MARKER, 3-Chars.size()); UniqueTrigrams.insert(Token(Token::Kind::Trigram, Chars)); also nit: avoid using the term "symbol" here.
clang-tools-extra/clangd/index/dex/Trigram.h
64	nit: and the previous paragraph `Special kind of trigrams ...` is redundant now IMO.
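
The simplification suggested for line 124 is written as review shorthand. In standard C++, `std::string::substr` takes `(pos, count)` and the fill overload of `std::string::append` takes `(count, char)`, so a compilable version of the same idea might look like the sketch below. `shortQueryToken` is a hypothetical name, `END_MARKER` is defined locally, and a plain `std::string` is returned where the real code would build a `Token`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Padding character for incomplete trigrams; defined here only for the sketch.
constexpr char END_MARKER = '$';

// Takes at most the first three characters of an already lowercased query and
// pads the result with END_MARKER so it is always exactly three characters.
std::string shortQueryToken(const std::string &LowercaseQuery) {
  std::string Chars = LowercaseQuery.substr(
      0, std::min<std::size_t>(3, LowercaseQuery.size()));
  Chars.append(3 - Chars.size(), END_MARKER);
  return Chars;
}
```

An empty query maps to "$$$", a one-character query to a unigram such as "u$$", and a two-character query to a bigram such as "st$".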

ioeric added inline comments. Aug 10 2018, 10:29 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
82	Sorry, full of typos here: I think it would be cleaner if we handle this outside of the loop e.g. record the first head in the `Next` initialization loop above.

Address a round of comments.

lg! Thanks for the changes!

clang-tools-extra/clangd/index/dex/Trigram.cpp
130	nit: inline this variable? You don't need to `count` below as `insert` deduplicates for you already.
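
The point about `insert` can be shown with a standard container. The sketch below uses `std::set` as a stand-in for `llvm::DenseSet` (an assumption made to keep the example self-contained), and `uniqueTokens` is a hypothetical helper.

```cpp
#include <set>
#include <string>
#include <vector>

// Collects tokens into a set. insert() leaves the set unchanged when the
// element is already present, so no count() check is needed beforehand.
std::vector<std::string> uniqueTokens(const std::vector<std::string> &Tokens) {
  std::set<std::string> Unique;
  for (const std::string &T : Tokens)
    Unique.insert(T);
  return std::vector<std::string>(Unique.begin(), Unique.end());
}
```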

This revision is now accepted and ready to land. Aug 10 2018, 12:30 PM

Address the post-LGTM comment.

Closed by commit rL339548: [clangd] Generate incomplete trigrams for the Dex index (authored by omtcyfz). Aug 13 2018, 1:57 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Aug 13 2018, 1:58 AM

Revision Contents

Path	Size
clang-tools-extra/clangd/index/dex/Trigram.h	5 lines
clang-tools-extra/clangd/index/dex/Trigram.cpp	38 lines
clang-tools-extra/unittests/clangd/DexIndexTests.cpp	44 lines

Diff 159937

clang-tools-extra/clangd/index/dex/Trigram.h

	#include "Token.h"			#include "Token.h"

	#include <string>			#include <string>

	namespace clang {			namespace clang {
	namespace clangd {			namespace clangd {
	namespace dex {			namespace dex {

				/// This is used to mark unigrams and bigrams and distinct them from complete
				/// trigrams. Since '$' is not present in valid identifier names, it is safe to
				/// use it as the special symbol.
				const auto END_SYMBOL = '$';

	/// Returns list of unique fuzzy-search trigrams from unqualified symbol.			/// Returns list of unique fuzzy-search trigrams from unqualified symbol.
				ioeric: Any reason why this should be exposed?
	///			///
	/// First, given Identifier (unqualified symbol name) is segmented using			/// First, given Identifier (unqualified symbol name) is segmented using
	/// FuzzyMatch API and lowercased. After segmentation, the following technique			/// FuzzyMatch API and lowercased. After segmentation, the following technique
	/// is applied for generating trigrams: for each letter or digit in the input			/// is applied for generating trigrams: for each letter or digit in the input
	/// string the algorithms looks for the possible next and skip-1-next symbols			/// string the algorithms looks for the possible next and skip-1-next symbols
	/// which can be jumped to during fuzzy matching. Each combination of such three			/// which can be jumped to during fuzzy matching. Each combination of such three
	/// symbols is inserted into the result.			/// symbols is inserted into the result.
	///			///
	/// Trigrams can start at any character in the input. Then we can choose to move			/// Trigrams can start at any character in the input. Then we can choose to move
	/// to the next character, move to the start of the next segment, or skip over a			/// to the next character, move to the start of the next segment, or skip over a
	/// segment.			/// segment.
	///			///
	/// Note: the returned list of trigrams does not have duplicates, if any trigram			/// Note: the returned list of trigrams does not have duplicates, if any trigram
	/// belongs to more than one class it is only inserted once.			/// belongs to more than one class it is only inserted once.
	std::vector<Token> generateIdentifierTrigrams(llvm::StringRef Identifier);			std::vector<Token> generateIdentifierTrigrams(llvm::StringRef Identifier);
				ioeric: The behavior should be easy to infer from samples. As long as it's not totally unexpected, I think it would be OK to treat it as an implementation detail (like the details of how trigrams are generated).
				ioeric: nit: the term "symbol" here is confusing. Do you mean "character"?

	/// Returns list of unique fuzzy-search trigrams given a query.			/// Returns list of unique fuzzy-search trigrams given a query.
	///			///
	/// Query is segmented using FuzzyMatch API and downcasted to lowercase. Then,			/// Query is segmented using FuzzyMatch API and downcasted to lowercase. Then,
	/// the simplest trigrams - sequences of three consecutive letters and digits			/// the simplest trigrams - sequences of three consecutive letters and digits
	/// are extracted and returned after deduplication.			/// are extracted and returned after deduplication.
	std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);			std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);

	} // namespace dex			} // namespace dex
				ioeric: I'm not quite sure what this means. Could you elaborate?
				kbobyrev: Added an example and reflected in the other comment.
	} // namespace clangd			} // namespace clangd
				ioeric: I think the comment can be simplified a bit: This also generates incomplete trigrams for short query scenarios: * Empty trigram: "$$$" * Unigram: the first character of the identifier. * Bigrams: a 2-char prefix of the identifier and a bigram of the first two HEAD characters (if it exists).
				ioeric: nit: and the previous paragraph `Special kind of trigrams ...` is redundant now IMO.
	} // namespace clang			} // namespace clang

	#endif			#endif
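
The `generateQueryTrigrams` comment in this header describes the plain sliding-window extraction. A minimal sketch of that description follows, assuming `std::isalnum` as a stand-in for the FuzzyMatch role segmentation and plain strings instead of `Token`; `queryTrigrams` is a hypothetical name.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Keeps only letters and digits (lowercased), then collects every run of
// three consecutive characters. The set removes duplicates.
std::set<std::string> queryTrigrams(const std::string &Query) {
  std::string Letters;
  for (unsigned char C : Query)
    if (std::isalnum(C))
      Letters.push_back(static_cast<char>(std::tolower(C)));

  std::set<std::string> Result;
  for (std::size_t I = 0; I + 3 <= Letters.size(); ++I)
    Result.insert(Letters.substr(I, 3));
  return Result;
}
```

Under this scheme a query with fewer than three letters or digits produces no trigrams at all, which is the gap the incomplete trigrams in this patch are meant to close.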

clang-tools-extra/clangd/index/dex/Trigram.cpp

	//===--- Trigram.cpp - Trigram generation for Fuzzy Matching ----- C++ --===//			//===--- Trigram.cpp - Trigram generation for Fuzzy Matching ----- C++ --===//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "Trigram.h"			#include "Trigram.h"
	#include "../../FuzzyMatch.h"			#include "../../FuzzyMatch.h"
	#include "Token.h"			#include "Token.h"

	#include "llvm/ADT/ArrayRef.h"			#include "llvm/ADT/ArrayRef.h"
	#include "llvm/ADT/DenseSet.h"			#include "llvm/ADT/DenseSet.h"
	#include "llvm/ADT/StringExtras.h"			#include "llvm/ADT/StringExtras.h"

	#include <cctype>			#include <cctype>
	#include <queue>			#include <queue>
	#include <string>			#include <string>

	using namespace llvm;			using namespace llvm;

	namespace clang {			namespace clang {
	namespace clangd {			namespace clangd {
	namespace dex {			namespace dex {

	// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach would			// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach would
	// be generating unigrams and bigrams here, too. This would prevent symbol index			// be generating unigrams and bigrams here, too. This would prevent symbol index
	// from applying fuzzy matching on a tremendous number of symbols and allow			// from applying fuzzy matching on a tremendous number of symbols and allow
	// supplementary retrieval for short queries.			// supplementary retrieval for short queries.
	//			//
	// Short names (total segment length <3 characters) are currently ignored.			// Short names (total segment length <3 characters) are currently ignored.
				ioeric: It's a nice optimization to have when we run into oversized posting lists, but this is not necessarily restricted to unigram posting lists. I think the FIXME should live near the general posting list code. I think it's probably also ok to leave it out; it's hard to forget if we do run into problems in the future ;)
	std::vector<Token> generateIdentifierTrigrams(llvm::StringRef Identifier) {			std::vector<Token> generateIdentifierTrigrams(llvm::StringRef Identifier) {
	// Apply fuzzy matching text segmentation.			// Apply fuzzy matching text segmentation.
	std::vector<CharRole> Roles(Identifier.size());			std::vector<CharRole> Roles(Identifier.size());
	calculateRoles(Identifier,			calculateRoles(Identifier,
	llvm::makeMutableArrayRef(Roles.data(), Identifier.size()));			llvm::makeMutableArrayRef(Roles.data(), Identifier.size()));
				ioeric: nit: s/auto/char/ Maybe just use `static` instead of an anonymous namespace just for this.

	std::string LowercaseIdentifier = Identifier.lower();			std::string LowercaseIdentifier = Identifier.lower();
				ioeric: This is probably neater as a lambda in `generateIdentifierTrigrams`, e.g. auto add = [&](std::string Chars) { trigrams.insert(Token(Token::Kind::Trigram, Chars)); } ... add("abc");

	// For each character, store indices of the characters to which fuzzy matching			// For each character, store indices of the characters to which fuzzy matching
	// algorithm can jump. There are 3 possible variants:			// algorithm can jump. There are 3 possible variants:
	//			//
	// * Next Tail - next character from the same segment			// * Next Tail - next character from the same segment
	// * Next Head - front character of the next segment			// * Next Head - front character of the next segment
	// * Skip-1-Next Head - front character of the skip-1-next segment			// * Skip-1-Next Head - front character of the skip-1-next segment
	//			//
	// Next stores tuples of three indices in the presented order, if a variant is			// Next stores tuples of three indices in the presented order, if a variant is
	// not available then 0 is stored.			// not available then 0 is stored.
	std::vector<std::array<unsigned, 3>> Next(LowercaseIdentifier.size());			std::vector<std::array<unsigned, 3>> Next(LowercaseIdentifier.size());
	unsigned NextTail = 0, NextHead = 0, NextNextHead = 0;			unsigned NextTail = 0, NextHead = 0, NextNextHead = 0;
	for (int I = LowercaseIdentifier.size() - 1; I >= 0; --I) {			for (int I = LowercaseIdentifier.size() - 1; I >= 0; --I) {
	Next[I] = {{NextTail, NextHead, NextNextHead}};			Next[I] = {{NextTail, NextHead, NextNextHead}};
	NextTail = Roles[I] == Tail ? I : 0;			NextTail = Roles[I] == Tail ? I : 0;
	if (Roles[I] == Head) {			if (Roles[I] == Head) {
	NextNextHead = NextHead;			NextNextHead = NextHead;
	NextHead = I;			NextHead = I;
	}			}
	}			}

				// Iterate through valid seqneces of three characters Fuzzy Matcher can
				// process.
	DenseSet<Token> UniqueTrigrams;			DenseSet<Token> UniqueTrigrams;
	std::array<char, 4> Chars;			std::array<char, 4> Chars;
	for (size_t I = 0; I < LowercaseIdentifier.size(); ++I) {			for (size_t I = 0; I < LowercaseIdentifier.size(); ++I) {
	// Skip delimiters.			// Skip delimiters.
	if (Roles[I] != Head && Roles[I] != Tail)			if (Roles[I] != Head && Roles[I] != Tail)
	continue;			continue;

				Chars = {{LowercaseIdentifier[I], END_SYMBOL, END_SYMBOL, 0}};
				const auto Unigram = Token(Token::Kind::Trigram, Chars.data());
				if (!UniqueTrigrams.count(Unigram)) {
				ioeric: Could this be pulled out of the loop? I think what we want is just `LowercaseIdentifier[0]` right? I'd probably also pull that into a function, as the function body is getting larger.
				kbobyrev: Same as elsewhere, if we have `__builtin_whatever` then it's not actually the first symbol of the lowercase identifier.
				ioeric: I would argue that I would start by typing "_" if I actually want `__builtin_whatever`. I'm also not sure if this is the case we should optimize for as well; __builtin symbols are already penalized in code completion ranking.
				UniqueTrigrams.insert(Unigram);
				}

	for (const unsigned J : Next[I]) {			for (const unsigned J : Next[I]) {
	if (!J)			if (!J)
	continue;			continue;

				Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J], END_SYMBOL, 0}};
				ioeric: I think we could be more restrictive on bigram generation. I think a bigram prefix of identifier and a bigram prefix of the HEAD substring should work pretty well in practice. For example, for `StringStartsWith`, you would have `st$` and `ss$` (prefix of "SSW"). WDYT?
				kbobyrev: Good idea!
				ioeric: Do you mean bigrams here?
				const auto Bigram = Token(Token::Kind::Trigram, Chars.data());
				ioeric: Should we also check `Roles[J] == Head`? As bigram posting lists would be significantly larger than those of trigrams, I would suggest being even more restrictive. For example, for "AppleBananaCat", the most common short queries would be "ap" and "ab" (for `AB`).
				if (!UniqueTrigrams.count(Bigram)) {
				UniqueTrigrams.insert(Bigram);
				ioeric: It's probably unclear what `FoundFirstHead` and `FoundSecondHead` are for to readers without context (i.e. we are looking first two HEADs). I think it's would be cleaner if we handle this out side of the look e.g. record the first head in the `Next` initialization loop above.
				ioeric: Sorry, full of typos here: I think it would be cleaner if we handle this outside of the loop e.g. record the first head in the `Next` initialization loop above.
				}

	for (const unsigned K : Next[J]) {			for (const unsigned K : Next[J]) {
	if (!K)			if (!K)
	continue;			continue;
	Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J],			Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J],
	LowercaseIdentifier[K], 0}};			LowercaseIdentifier[K], 0}};
	auto Trigram = Token(Token::Kind::Trigram, Chars.data());			auto Trigram = Token(Token::Kind::Trigram, Chars.data());
	// Push unique trigrams to the result.			// Push unique trigrams to the result.
	if (!UniqueTrigrams.count(Trigram)) {			if (!UniqueTrigrams.count(Trigram)) {
	UniqueTrigrams.insert(Trigram);			UniqueTrigrams.insert(Trigram);
	}			}
	}			}
	}			}
	}			}

	std::vector<Token> Result;			std::vector<Token> Result;
	for (const auto &Trigram : UniqueTrigrams)			for (const auto &Trigram : UniqueTrigrams)
	Result.push_back(Trigram);			Result.push_back(Trigram);

	return Result;			return Result;
	}			}

	// FIXME(kbobyrev): Similarly, to generateIdentifierTrigrams, this ignores short			// FIXME(kbobyrev): Similarly, to generateIdentifierTrigrams, this ignores short
	// inputs (total segment length <3 characters).			// inputs (total segment length <3 characters).
				ioeric: Couldn't we generate a bigram "u_$" in this case? I think we can assume prefix matching in this case, if we generate bigram "u_" for identifiers like "u_*". In this case, we would like to match "u" only. Why? If user types "_", I would expect it to be a meaningful filter.
	std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {			std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
				ioeric: It seems to me that what we need for short queries is simply: if (Query.empty()) { // return empty token } if (Query.size() == 1) return {Query + "$$"}; if (Query.size() == 2) return {Query + "$"}; // Longer queries... ?
				kbobyrev: That would mean that we expect the query to be "valid", i.e. only consist of letters and digits. My concern is about what happens if we have `"u_"` or something similar (`"_u", "_u_", "$u$"`, etc) - in that case we would actually still have to identify the first valid symbol for the trigram, process the string (trim it, etc) which looks very similar to what FuzzyMatching `calculateRoles` does. The current approach is rather straightforward and generic, but I can try to change it if you want. My biggest concern is fighting some corner cases and ensuring that the query is "right" on the user (index) side, which might turn out to be more code and ensuring that the "state" is valid throughout the pipeline.
				ioeric: It's not clear what we would want to match with "*_", except for `u_` in `unique_ptr` (maybe). Generally, as short queries tend to match many more symbols, I think we should try to make them more restrictive and optimize for the most common use case.
	// Apply fuzzy matching text segmentation.			// Apply fuzzy matching text segmentation.
	std::vector<CharRole> Roles(Query.size());			std::vector<CharRole> Roles(Query.size());
	calculateRoles(Query, llvm::makeMutableArrayRef(Roles.data(), Query.size()));			calculateRoles(Query, llvm::makeMutableArrayRef(Roles.data(), Query.size()));

	std::string LowercaseQuery = Query.lower();			std::string LowercaseQuery = Query.lower();

	DenseSet<Token> UniqueTrigrams;			DenseSet<Token> UniqueTrigrams;
	std::deque<char> Chars;			std::deque<char> Chars = {END_SYMBOL, END_SYMBOL, END_SYMBOL};
				ioeric: nit: `Chars` is only used in the else branch?

	for (size_t I = 0; I < LowercaseQuery.size(); ++I) {			for (size_t I = 0; I < LowercaseQuery.size(); ++I) {
	// If current symbol is delimiter, just skip it.			// If current symbol is delimiter, just skip it.
	if (Roles[I] != Head && Roles[I] != Tail)			if (Roles[I] != Head && Roles[I] != Tail)
	continue;			continue;

	Chars.push_back(LowercaseQuery[I]);			Chars.push_back(LowercaseQuery[I]);
				ioeric: For queries like `__` or `_x`, I think we can generate tokens `__$` or `_x$`.
				ioeric: When would this happen? And why are we reversing the trigram and throwing it away?

				ioeric: I think this can be simplified as:
				  std::string Chars = LowercaseQuery.substr(std::min(3, LowercaseQuery.size()));
				  Chars.append(END_MARKER, 3-Chars.size());
				  UniqueTrigrams.insert(Token(Token::Kind::Trigram, Chars));
				also nit: avoid using the term "symbol" here.
	if (Chars.size() > 3)			if (Chars.size() > 3)
	Chars.pop_front();			Chars.pop_front();
	if (Chars.size() == 3) {
	auto Trigram =			auto Trigram =
	Token(Token::Kind::Trigram, std::string(begin(Chars), end(Chars)));			Token(Token::Kind::Trigram, std::string(begin(Chars), end(Chars)));

				ioeric: nit: inline this variable? You don't need to `count` below as `insert` deduplicates for you already.
				if (Chars.front() == END_SYMBOL)
				Trigram = Token(Token::Kind::Trigram,
				std::string(Chars.rbegin(), Chars.rend()));

	// Push unique trigrams to the result.			// Push unique trigrams to the result.
	if (!UniqueTrigrams.count(Trigram)) {			if (!UniqueTrigrams.count(Trigram)) {
	UniqueTrigrams.insert(Trigram);			UniqueTrigrams.insert(Trigram);
	}			}
	}			}
	}

	std::vector<Token> Result;			std::vector<Token> Result;
	for (const auto &Trigram : UniqueTrigrams)			for (const auto &Trigram : UniqueTrigrams)
	Result.push_back(Trigram);			Result.push_back(Trigram);

	return Result;			return Result;
	}			}

	} // namespace dex			} // namespace dex
	} // namespace clangd			} // namespace clangd
	} // namespace clang			} // namespace clang
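
The short-query handling discussed in the inline comments above can be sketched in isolation. This is a minimal, hypothetical illustration and not the code from the patch: it uses `std::string` instead of `llvm::StringRef` and `Token`, and the names `paddedQueryToken` and `EndMarker` are invented for the example. It pads a query of fewer than three characters with `$`, writing the reviewer's suggestion with `substr(0, n)` and with the count-first overload `append(count, ch)` of `std::string`. Delimiters are deliberately not skipped here, in line with the suggestion that `_x` could produce `_x$`.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the padding character used by the patch.
constexpr char EndMarker = '$';

// Returns one padded, lowercased token for a short query, for example
// "u" -> "u$$" and "Nl" -> "nl$". An empty query yields an empty string.
// Longer queries are truncated to their first three characters; a real
// implementation would generate full trigrams for them instead.
std::string paddedQueryToken(const std::string &Query) {
  if (Query.empty())
    return "";
  // Keep at most the first three characters.
  std::string Chars = Query.substr(0, std::min<std::size_t>(3, Query.size()));
  for (char &C : Chars)
    C = static_cast<char>(std::tolower(static_cast<unsigned char>(C)));
  assert(Chars.size() <= 3 && "prefix must not exceed trigram length");
  // Pad with end markers up to trigram length: append(count, character).
  Chars.append(3 - Chars.size(), EndMarker);
  return Chars;
}
```

Under these assumptions a one-character query maps to exactly one token, so an index that stores `x$$` for identifiers starting with `x` would effectively answer it with a prefix match.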

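For identifiers that form a single segment (no underscores or case changes), the updated expectations in DexIndexTests.cpp amount to every character padded to `c$$`, every adjacent pair padded to `ab$`, and every run of three adjacent characters. The sketch below reproduces only that simplified case. It is an assumption-laden illustration, not the patch's `generateIdentifierTrigrams`: it ignores the `calculateRoles` segmentation that produces cross-segment tokens such as `abd` for `abc_def`, and the names `incompleteTrigrams` and `PadChar` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical padding character, mirroring the '$' seen in the tests.
constexpr char PadChar = '$';

// Generates padded unigrams, padded bigrams and full trigrams from
// adjacent characters of a lowercased identifier. Segmentation is NOT
// modelled, so this only matches single-segment identifiers.
std::set<std::string> incompleteTrigrams(const std::string &Identifier) {
  std::string Lower;
  for (char C : Identifier)
    Lower.push_back(
        static_cast<char>(std::tolower(static_cast<unsigned char>(C))));

  std::set<std::string> Tokens; // std::set deduplicates on insert.
  for (std::size_t I = 0; I < Lower.size(); ++I) {
    Tokens.insert(std::string{Lower[I], PadChar, PadChar});
    if (I + 1 < Lower.size())
      Tokens.insert(std::string{Lower[I], Lower[I + 1], PadChar});
    if (I + 2 < Lower.size())
      Tokens.insert(Lower.substr(I, 3));
  }
  return Tokens;
}
```

Storing the tokens in a set makes a separate `count` check before `insert` unnecessary, which is the point raised in the review about deduplication.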
clang-tools-extra/unittests/clangd/DexIndexTests.cpp

trigramsAre(std::initializer_list<std::string> Trigrams) {
std::vector<Token> Tokens;		std::vector<Token> Tokens;
for (const auto &Symbols : Trigrams) {		for (const auto &Symbols : Trigrams) {
Tokens.push_back(Token(Token::Kind::Trigram, Symbols));		Tokens.push_back(Token(Token::Kind::Trigram, Symbols));
}		}
return testing::UnorderedElementsAreArray(Tokens);		return testing::UnorderedElementsAreArray(Tokens);
}		}

TEST(DexIndexTrigrams, IdentifierTrigrams) {		TEST(DexIndexTrigrams, IdentifierTrigrams) {
EXPECT_THAT(generateIdentifierTrigrams("X86"), trigramsAre({"x86"}));		EXPECT_THAT(generateIdentifierTrigrams("X86"),
		trigramsAre({"x86", "x$$", "8$$", "6$$", "x8$", "86$"}));

EXPECT_THAT(generateIdentifierTrigrams("nl"), trigramsAre({}));		EXPECT_THAT(generateIdentifierTrigrams("nl"),
		trigramsAre({"nl$", "n$$", "l$$"}));

EXPECT_THAT(generateIdentifierTrigrams("clangd"),		EXPECT_THAT(generateIdentifierTrigrams("n"), trigramsAre({"n$$"}));
trigramsAre({"cla", "lan", "ang", "ngd"}));

EXPECT_THAT(generateIdentifierTrigrams("abc_def"),
trigramsAre({"abc", "abd", "ade", "bcd", "bde", "cde", "def"}));

EXPECT_THAT(		EXPECT_THAT(
generateIdentifierTrigrams("a_b_c_d_e_"),		generateIdentifierTrigrams("clangd"),
trigramsAre({"abc", "abd", "acd", "ace", "bcd", "bce", "bde", "cde"}));		trigramsAre({"cla", "lan", "ang", "ngd", "an$", "n$$", "g$$", "cl$",
		"ng$", "d$$", "l$$", "a$$", "c$$", "gd$", "la$"}));

EXPECT_THAT(		EXPECT_THAT(generateIdentifierTrigrams("abc_def"),
generateIdentifierTrigrams("unique_ptr"),		trigramsAre({"abc", "abd", "ade", "bcd", "bde", "cde", "def",
trigramsAre({"uni", "unp", "upt", "niq", "nip", "npt", "iqu", "iqp",		"a$$", "b$$", "c$$", "d$$", "e$$", "f$$", "ab$",
"ipt", "que", "qup", "qpt", "uep", "ept", "ptr"}));		"ad$", "bc$", "bd$", "cd$", "de$", "ef$"}));

		EXPECT_THAT(generateIdentifierTrigrams("a_b_c_d_e_"),
		trigramsAre({"abc", "abd", "acd", "ace", "bcd", "bce", "bde",
		"cde", "a$$", "ab$", "ac$", "b$$", "bc$", "bd$",
		"c$$", "cd$", "ce$", "d$$", "de$", "e$$"}));

		EXPECT_THAT(generateIdentifierTrigrams("unique_ptr"),
		trigramsAre({"uni", "unp", "upt", "niq", "nip", "npt", "iqu",
		"iqp", "ipt", "que", "qup", "qpt", "uep", "ept",
		"ptr", "u$$", "un$", "up$", "n$$", "ni$", "np$",
		"i$$", "iq$", "ip$", "q$$", "qu$", "qp$", "ue$",
		"e$$", "ep$", "p$$", "pt$", "t$$", "tr$", "r$$"}));

EXPECT_THAT(generateIdentifierTrigrams("TUDecl"),		EXPECT_THAT(generateIdentifierTrigrams("TUDecl"),
trigramsAre({"tud", "tde", "ude", "dec", "ecl"}));		trigramsAre({"tud", "tde", "ude", "dec", "ecl", "t$$", "tu$",
		"td$", "u$$", "ud$", "d$$", "de$", "e$$", "ec$",
		"c$$", "cl$", "l$$"}));

EXPECT_THAT(generateIdentifierTrigrams("IsOK"),		EXPECT_THAT(generateIdentifierTrigrams("IsOK"),
trigramsAre({"iso", "iok", "sok"}));		trigramsAre({"iso", "iok", "sok", "i$$", "is$", "io$", "s$$",
		"so$", "o$$", "ok$", "k$$"}));

EXPECT_THAT(generateIdentifierTrigrams("abc_defGhij__klm"),		EXPECT_THAT(generateIdentifierTrigrams("abc_defGhij__klm"),
trigramsAre({		trigramsAre({
"abc", "abd", "abg", "ade", "adg", "adk", "agh", "agk", "bcd",		"abc", "abd", "abg", "ade", "adg", "adk", "agh", "agk", "bcd",
"bcg", "bde", "bdg", "bdk", "bgh", "bgk", "cde", "cdg", "cdk",		"bcg", "bde", "bdg", "bdk", "bgh", "bgk", "cde", "cdg", "cdk",
"cgh", "cgk", "def", "deg", "dek", "dgh", "dgk", "dkl", "efg",		"cgh", "cgk", "def", "deg", "dek", "dgh", "dgk", "dkl", "efg",
"efk", "egh", "egk", "ekl", "fgh", "fgk", "fkl", "ghi", "ghk",		"efk", "egh", "egk", "ekl", "fgh", "fgk", "fkl", "ghi", "ghk",
"gkl", "hij", "hik", "hkl", "ijk", "ikl", "jkl", "klm",		"gkl", "hij", "hik", "hkl", "ijk", "ikl", "jkl", "klm",