This is an archive of the discontinued LLVM Phabricator instance.

Differential D50517

[clangd] Generate incomplete trigrams for the Dex index
Closed · Public

Authored by kbobyrev on Aug 9 2018, 9:14 AM.

Details

Reviewers

ioeric
ilya-biryukov

Commits

rGff2dd9095fa6: [clangd] Generate incomplete trigrams for the Dex index
rCTE339548: [clangd] Generate incomplete trigrams for the Dex index
rL339548: [clangd] Generate incomplete trigrams for the Dex index

Summary

This patch handles trigram generation for "short" identifiers and queries. The trigram generator produces incomplete trigrams for short names, so the same query iterator API can match symbols that do not have enough characters to form a trigram and can correctly handle queries that are too short to produce a full trigram.

Diff Detail

Repository: rL LLVM

Event Timeline

kbobyrev created this revision. Aug 9 2018, 9:14 AM

Herald added subscribers: arphaman, jkorous, MaskRay. Aug 9 2018, 9:14 AM

This patch is in preview mode and can be useful for the discussion. It's not functional yet, but this will be changed in the future.

The upcoming changes would allow handling short queries introduced in https://reviews.llvm.org/D50337 in a more efficient manner.

@ioeric proposed to generate unigrams for the first letter of the identifier so that the index would only perform prefix match for one-letter completion requests, which I think would be a great performance improvement.

Complete the tests, finish the implementation.

One thought about prefix match suggestion: we should either make it more explicit for the index (e.g. introduce prefixMatch and dispatch fuzzyMatch to prefix matching in case query only contains one "true" symbol) or document this properly. While, as I wrote earlier, I totally support the idea of prefix matching queries of length 1 it might not align with some user expectations and it's also very implicit if we just generate tokens this way and don't mention it anywhere in the DexIndex implementation.

@ioeric, @ilya-biryukov any thoughts?

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"
Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.
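
The difference between the two behaviours can be sketched as follows. This is only an illustration: `segmentHeads`, `matchesFirstChar` and `matchesAnyHead` are hypothetical helpers that are not part of the patch, the query character is assumed to be lowercase, and the segmentation rule (split on non-alphanumeric characters and on lower-to-upper case transitions) is a simplification of what clangd's FuzzyMatch does.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: collects the lowercased first character of every
// identifier segment. A segment starts at the beginning of the name, after a
// non-alphanumeric delimiter, or at a lower-to-upper case transition.
// This is a simplification, not clangd's calculateRoles().
std::string segmentHeads(const std::string &Identifier) {
  std::string Heads;
  for (std::size_t I = 0; I < Identifier.size(); ++I) {
    unsigned char C = Identifier[I];
    if (!std::isalnum(C))
      continue;
    unsigned char Prev = I == 0 ? '_' : Identifier[I - 1];
    bool AfterDelimiter = !std::isalnum(Prev);
    bool CaseTransition = std::isupper(C) && std::islower(Prev);
    if (AfterDelimiter || CaseTransition)
      Heads.push_back(static_cast<char>(std::tolower(C)));
  }
  return Heads;
}

// "Current" behaviour in the example: only the first character is considered.
bool matchesFirstChar(const std::string &Identifier, char Query) {
  return !Identifier.empty() &&
         std::tolower(static_cast<unsigned char>(Identifier[0])) == Query;
}

// "Proposed" behaviour: any segment head may match the one-character query.
bool matchesAnyHead(const std::string &Identifier, char Query) {
  return segmentHeads(Identifier).find(Query) != std::string::npos;
}
```

With the query `'u'`, `matchesFirstChar` accepts "unique_ptr" and "user" but rejects "super_user", while `matchesAnyHead` accepts all three.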

This might be useful for codebases where each identifier starts with a project prefix (ProjectInstruction, ProjectGraph, etc.). For C++, it is better to use namespaces than this naming convention, which is not really great, but I am aware of C++ projects that opt for it. In pure C, however, this is a relatively common practice; e.g. a typical piece of code for GNOME might be

struct _GtkWrapBoxPrivate
{
	GtkOrientation        orientation;
	GtkWrapAllocationMode mode;

	GtkWrapBoxSpreading   horizontal_spreading;
	GtkWrapBoxSpreading   vertical_spreading;

	guint16               vertical_spacing;
	guint16               horizontal_spacing;

	guint16               minimum_line_children;
	guint16               natural_line_children;

	GList                *children;
};

Also, this is better for macros, which cannot be put into namespaces anyway: there's BENCHMARK_UNREACHABLE and so on.

I'll update the patch with the proposed solution.

Thanks for the patch!

In D50517#1194955, @kbobyrev wrote:

Complete the tests, finish the implementation.

One thought about prefix match suggestion: we should either make it more explicit for the index (e.g. introduce prefixMatch and dispatch fuzzyMatch to prefix matching in case query only contains one "true" symbol) or document this properly. While, as I wrote earlier, I totally support the idea of prefix matching queries of length 1 it might not align with some user expectations and it's also very implicit if we just generate tokens this way and don't mention it anywhere in the DexIndex implementation.

@ioeric, @ilya-biryukov any thoughts?

(copied my inline comment :)
We should definitely add documentation about it. It should be pretty simple IMO. As the behavior should be easy to infer from samples, and it shouldn't be too surprising for users, I think it would be OK to consider it as implementation detail (like details in how exactly trigrams are generated) without exposing new interfaces for them.

clang-tools-extra/clangd/index/dex/Trigram.cpp
26 ↗	(On Diff #160071)	It's a nice optimization to have when we run into oversized posting lists, but this is not necessarily restricted to unigram posting lists. I think the FIXME should live near the general posting list code. I think it's probably also ok to leave it out; it's hard to forget if we do run into problems in the future ;)
74 ↗	(On Diff #160071)	Could this be pulled out of the loop? I think what we want is just `LowercaseIdentifier[0]`, right? I'd probably also pull that into a function, as the function body is getting larger.
87 ↗	(On Diff #160071)	I think we could be more restrictive on bigram generation. I think a bigram prefix of identifier and a bigram prefix of the HEAD substring should work pretty well in practice. For example, for `StringStartsWith`, you would have `st$` and `ss$` (prefix of "SSW"). WDYT?
115 ↗	(On Diff #160071)	It seems to me that what we need for short queries is simply:
	if (Query.empty()) {
	  // return empty token
	}
	if (Query.size() == 1) return {Query + "$$"};
	if (Query.size() == 2) return {Query + "$"};
	// Longer queries... ?
clang-tools-extra/clangd/index/dex/Trigram.h
39 ↗	(On Diff #160071)	Any reason why this should be exposed?
62 ↗	(On Diff #160071)	The behavior should be easy to infer from samples. As long as it's not totally unexpected, I think it would be OK to treat it as an implementation detail (like the details of how trigrams are generated).
74 ↗	(On Diff #160071)	I'm not quite sure what this means. Could you elaborate?
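
The short-query suggestion in the comment on line 115 could look roughly like the sketch below. `padShortQuery` is a hypothetical name, the input is assumed to be lowercased already, and truncating longer queries to three characters is a choice made only to keep the example small; it is not what the patch does for long queries.

```cpp
#include <string>

// Hypothetical helper, not the patch's API: pads a lowercased query of fewer
// than three characters with '$' so that it has the same shape as a trigram.
// Longer queries are truncated to their first three characters here; real
// trigram extraction for them is out of scope for this sketch.
std::string padShortQuery(const std::string &LowercaseQuery) {
  constexpr char EndMarker = '$';
  // substr clamps the count, so this is safe for queries shorter than 3.
  std::string Chars = LowercaseQuery.substr(0, 3);
  Chars.append(3 - Chars.size(), EndMarker);
  return Chars;
}
```

An empty query maps to "$$$", "u" to "u$$" and "un" to "un$", which matches the token shapes discussed in this thread.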

@ilya-biryukov I have changed the approach to the one we discussed before.

In D50517#1194976, @kbobyrev wrote:

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"

Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

I agree that this can be useful sometimes, but I suspect it's relatively rare and might actually compromise ranking quality for the more common use case, i.e. when the first character users type is the first character of the expected identifier.

In D50517#1194976, @kbobyrev wrote:

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"

Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

And in the case where users want to match super_user, I think it's reasonable to have users type two more characters and match it with use.

ioeric mentioned this in D50337: [clangd] DexIndex implementation prototype. Aug 10 2018, 3:02 AM

Address a round of comments.

I have added a few comments to get additional feedback before further changes are made.

clang-tools-extra/clangd/index/dex/Trigram.cpp
74 ↗	(On Diff #160071)	Same as elsewhere, if we have `__builtin_whatever` then it's not actually the first symbol of the lowercase identifier.
87 ↗	(On Diff #160071)	Good idea!
115 ↗	(On Diff #160071)	That would mean that we expect the query to be "valid", i.e. only consist of letters and digits. My concern is about what happens if we have `"u_"` or something similar (`"_u", "_u_", "$u$"`, etc) - in that case we would actually still have to identify the first valid symbol for the trigram, process the string (trim it, etc) which looks very similar to what FuzzyMatching `calculateRoles` does. The current approach is rather straightforward and generic, but I can try to change it if you want. My biggest concern is fighting some corner cases and ensuring that the query is "right" on the user (index) side, which might turn out to be more code and ensuring that the "state" is valid throughout the pipeline.
clang-tools-extra/clangd/index/dex/Trigram.h
74 ↗	(On Diff #160071)	Added an example and reflected in the other comment.

In D50517#1194990, @ioeric wrote:

In D50517#1194976, @kbobyrev wrote:

As discussed offline with @ilya-biryukov, the better approach would be to prefix match first symbols of each distinct identifier piece instead of prefix matching (just looking at the first letter of the identifier) the whole identifier.

Example:

Query: "u"

Symbols: "unique_ptr", "user", "super_user"

Current implementation would match "unique_ptr" and "user" only.
Proposed implementation would match all three symbols, because the second piece of "super_user" starts with u.

And in the case where users want to match super_user, I think it's reasonable to have users type two more characters and match it with use.

That would probably yield lower code completion quality for identifiers like GtkWhatever which might be very common in pure C projects and elsewhere. Also, Ilya mentioned that fuzzy matching filter would significantly increase the score of symbols which can be prefix matched and hence they would end up at the top if the quality is actually good. Another thing we can do is to boost prefix matched symbols if your concern is about them being removed after the initial filtering.

I'm personally leaning towards having unigrams for all segment starting symbols, but if you believe that it's certainly bad I can change that and in the future it will be rather trivial to switch if we decide to go backwards. What do you think?

ioeric added inline comments. Aug 10 2018, 3:53 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
74 ↗	(On Diff #160071)	I would argue that I would start by typing "_" if I actually want `__builtin_whatever`. I'm also not sure if this is the case we should optimize for as well; __builtin symbols are already penalized in code completion ranking.
115 ↗	(On Diff #160071)	It's not clear what we would want to match with "*_", except for `u_` in `unique_ptr` (maybe). Generally, as short queries tend to match many more symbols, I think we should try to make them more restrictive and optimize for the most common use case.

Address issues we discussed with Eric.

ioeric added inline comments. Aug 10 2018, 7:26 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
33 ↗	(On Diff #160093)	This is probably neater as a lambda in `generateIdentifierTrigrams` , e.g. auto add = [&](std::string Chars) { trigrams.insert(Token(Token::Kind::Trigram, Chars)); } ... add("abc");
95 ↗	(On Diff #160093)	Do you mean bigrams here?
96 ↗	(On Diff #160093)	Should we also check `Roles[J] == Head` ? As bigram posting lists would be significantly larger than those of trigrams, I would suggest being even more restrictive. For example, for "AppleBananaCat", the most common short queries would be "ap" and "ab" (for `AB`).
119 ↗	(On Diff #160093)	Couldn't we generate a bigram "u_$" in this case? I think we can assume prefix matching in this case, if we generate bigram "u_" for identifiers like "u_*". In this case, we would like to match "u" only. Why? If the user types "_", I would expect it to be a meaningful filter.
141 ↗	(On Diff #160093)	For queries like `__` or `_x`, I think we can generate tokens "__$" or `_x$`.
clang-tools-extra/clangd/index/dex/Trigram.h
49 ↗	(On Diff #160093)	nit: the term "symbol" here is confusing. Do you mean "character"?
59 ↗	(On Diff #160093)	I think the comment can be simplified a bit: This also generates incomplete trigrams for short query scenarios: * Empty trigram: "$$$" * Unigram: the first character of the identifier. * Bigrams: a 2-char prefix of the identifier and a bigram of the first two HEAD characters (if it exists).
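
The incomplete trigrams listed in the last comment can be sketched in isolation as below. `incompleteTrigrams` is a hypothetical helper, `std::set` stands in for the container used in the patch, and segment heads are detected with a simplified rule (start of the name, after a non-alphanumeric character, or at a lower-to-upper case transition), so names such as "TUDecl" are segmented differently than by clangd's `calculateRoles`.

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical sketch of the incomplete trigrams described above. Head
// detection is a simplification and not clangd's calculateRoles().
std::set<std::string> incompleteTrigrams(const std::string &Identifier) {
  std::string Lower, Heads;
  for (std::size_t I = 0; I < Identifier.size(); ++I) {
    unsigned char C = Identifier[I];
    Lower.push_back(static_cast<char>(std::tolower(C)));
    if (!std::isalnum(C))
      continue;
    unsigned char Prev = I == 0 ? '_' : Identifier[I - 1];
    if (!std::isalnum(Prev) || (std::isupper(C) && std::islower(Prev)))
      Heads.push_back(Lower.back());
  }

  std::set<std::string> Result;
  // Empty trigram, present for every identifier.
  Result.insert("$$$");
  // Unigram: the first character of the identifier.
  if (!Lower.empty())
    Result.insert(std::string{Lower[0], '$', '$'});
  // Bigram: the two-character prefix of the identifier.
  if (Lower.size() >= 2)
    Result.insert(std::string{Lower[0], Lower[1], '$'});
  // Bigram: the first two HEAD characters, if there are two.
  if (Heads.size() >= 2)
    Result.insert(std::string{Heads[0], Heads[1], '$'});
  return Result;
}
```

For "StringStartsWith" this yields "st$" and "ss$" next to "s$$" and "$$$", and for "AppleBananaCat" it yields "ap$" and "ab$", as in the examples given by the reviewers.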

Address issues we have discussed with Eric.

ioeric added inline comments. Aug 10 2018, 10:28 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
31 ↗	(On Diff #160133)	nit: s/auto/char/ Maybe just use `static` instead of an anonymous namespace just for this.
83 ↗	(On Diff #160133)	It's probably unclear what `FoundFirstHead` and `FoundSecondHead` are for to readers without context (i.e. we are looking first two HEADs). I think it's would be cleaner if we handle this out side of the look e.g. record the first head in the `Next` initialization loop above.
138 ↗	(On Diff #160133)	nit: `Chars` is only used in the else branch?
143 ↗	(On Diff #160133)	I think this can be simplified as: std::string Chars = LowercaseQuery.substr(std::min(3, LowercaseQuery.size())); Chars.append(END_MARKER, 3-Chars.size()); UniqueTrigrams.insert(Token(Token::Kind::Trigram, Chars)); also nit: avoid using the term "symbol" here.
167 ↗	(On Diff #160133)	When would this happen? And why are we reversing the trigram and throwing it away?
clang-tools-extra/clangd/index/dex/Trigram.h
59 ↗	(On Diff #160093)	nit: and the previous paragraph `Special kind of trigrams ...` is redundant now IMO.

ioeric added inline comments. Aug 10 2018, 10:29 AM

clang-tools-extra/clangd/index/dex/Trigram.cpp
83 ↗	(On Diff #160133)	Sorry, full of typos here: I think it would be cleaner if we handle this outside of the loop e.g. record the first head in the `Next` initialization loop above.

Address a round of comments.

lg! Thanks for the changes!

clang-tools-extra/clangd/index/dex/Trigram.cpp
147 ↗	(On Diff #160154)	nit: inline this variable? You don't need to `count` below as `insert` deduplicates for you already.
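
The point about `insert` holds for set-like containers in general: inserting a key that is already present is a no-op. A minimal sketch, using `std::set` in place of `llvm::DenseSet` and a hypothetical `uniqueTokens` helper that is not part of the patch:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper illustrating the point about `insert`: a set ignores an
// insertion whose key is already present, so checking with `count` first is
// redundant. std::set stands in for llvm::DenseSet here, which also means the
// result comes back sorted.
std::vector<std::string> uniqueTokens(const std::vector<std::string> &Tokens) {
  std::set<std::string> Unique;
  for (const std::string &T : Tokens)
    Unique.insert(T); // No-op if T is already in the set.
  return std::vector<std::string>(Unique.begin(), Unique.end());
}
```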

This revision is now accepted and ready to land. Aug 10 2018, 12:30 PM

Address the post-LGTM comment.

Closed by commit rL339548: [clangd] Generate incomplete trigrams for the Dex index (authored by omtcyfz). Aug 13 2018, 1:57 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Aug 13 2018, 1:58 AM

Revision Contents

Path                                                          Size
clang-tools-extra/trunk/clangd/index/dex/Iterator.h           8 lines
clang-tools-extra/trunk/clangd/index/dex/Trigram.h            14 lines
clang-tools-extra/trunk/clangd/index/dex/Trigram.cpp          94 lines
clang-tools-extra/trunk/unittests/clangd/DexIndexTests.cpp    60 lines

Diff 160310

clang-tools-extra/trunk/clangd/index/dex/Iterator.h

(41 lines not shown)
 namespace dex {

 /// Symbol position in the list of all index symbols sorted by a pre-computed
 /// symbol quality.
 using DocID = uint32_t;
 /// Contains sorted sequence of DocIDs all of which belong to symbols matching
 /// certain criteria, i.e. containing a Search Token. PostingLists are values
 /// for the inverted index.
+// FIXME(kbobyrev): Posting lists for incomplete trigrams (one/two symbols) are
+// likely to be very dense and hence require special attention so that the index
+// doesn't use too much memory. Possible solution would be to construct
+// compressed posting lists which consist of ranges of DocIDs instead of
+// distinct DocIDs. A special case would be the empty query: for that case
+// TrueIterator should be implemented - an iterator which doesn't actually store
+// any PostingList within itself, but "contains" all DocIDs in range
+// [0, IndexSize).
 using PostingList = std::vector<DocID>;
 /// Immutable reference to PostingList object.
 using PostingListRef = llvm::ArrayRef<DocID>;

 /// Iterator is the interface for Query Tree node. The simplest type of Iterator
 /// is DocumentIterator which is simply a wrapper around PostingList iterator
 /// and serves as the Query Tree leaf. More sophisticated examples of iterators
 /// can manage intersection, union of the elements produced by other iterators
(96 lines not shown)
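
The compression idea mentioned in the FIXME can be illustrated with a small sketch. It is not part of the patch: `DocIDRange`, `compressPostingList` and `contains` are hypothetical names, the input is assumed to be sorted and free of duplicates, and the membership test is a linear scan for brevity.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using DocID = std::uint32_t;
// Half-open range [first, second) of consecutive DocIDs.
using DocIDRange = std::pair<DocID, DocID>;

// Hypothetical sketch of the idea in the FIXME: collapse a sorted posting list
// without duplicates into ranges of consecutive DocIDs.
std::vector<DocIDRange> compressPostingList(const std::vector<DocID> &Sorted) {
  std::vector<DocIDRange> Ranges;
  for (DocID ID : Sorted) {
    if (!Ranges.empty() && Ranges.back().second == ID)
      ++Ranges.back().second; // Extend the current run.
    else
      Ranges.emplace_back(ID, ID + 1); // Start a new run.
  }
  return Ranges;
}

// Membership test over the compressed form (linear scan for brevity).
bool contains(const std::vector<DocIDRange> &Ranges, DocID ID) {
  for (const DocIDRange &R : Ranges)
    if (R.first <= ID && ID < R.second)
      return true;
  return false;
}
```

A posting list that contains every DocID, as for the empty query, collapses to the single range [0, IndexSize), which is the case the FIXME proposes to handle with a dedicated TrueIterator.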

clang-tools-extra/trunk/clangd/index/dex/Trigram.h

(30 lines not shown)
 namespace clangd {
 namespace dex {

 /// Returns list of unique fuzzy-search trigrams from unqualified symbol.
 ///
 /// First, given Identifier (unqualified symbol name) is segmented using
 /// FuzzyMatch API and lowercased. After segmentation, the following technique
 /// is applied for generating trigrams: for each letter or digit in the input
-/// string the algorithms looks for the possible next and skip-1-next symbols
+/// string the algorithms looks for the possible next and skip-1-next characters
 /// which can be jumped to during fuzzy matching. Each combination of such three
-/// symbols is inserted into the result.
+/// characters is inserted into the result.
 ///
 /// Trigrams can start at any character in the input. Then we can choose to move
 /// to the next character, move to the start of the next segment, or skip over a
 /// segment.
 ///
+/// This also generates incomplete trigrams for short query scenarios:
+/// * Empty trigram: "$$$".
+/// * Unigram: the first character of the identifier.
+/// * Bigrams: a 2-char prefix of the identifier and a bigram of the first two
+/// HEAD characters (if they exist).
+//
 /// Note: the returned list of trigrams does not have duplicates, if any trigram
 /// belongs to more than one class it is only inserted once.
 std::vector<Token> generateIdentifierTrigrams(llvm::StringRef Identifier);

 /// Returns list of unique fuzzy-search trigrams given a query.
 ///
 /// Query is segmented using FuzzyMatch API and downcasted to lowercase. Then,
 /// the simplest trigrams - sequences of three consecutive letters and digits
 /// are extracted and returned after deduplication.
+///
+/// For short queries (less than 3 characters with Head or Tail roles in Fuzzy
+/// Matching segmentation) this returns a single trigram with the first
+/// characters (up to 3) to perfrom prefix match.
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);

 } // namespace dex
 } // namespace clangd
 } // namespace clang

 #endif
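
The jump-based generation described in the Trigram.h comment (next character, start of the next segment, or skip over a segment) can be tried out in isolation with the sketch below. It mirrors the structure of the patch but is not the patch itself: `Role`, `simpleRoles` and `fullTrigrams` are hypothetical names, `std::set` replaces `DenseSet<Token>`, and the segmentation rule is a simplified stand-in for FuzzyMatch's `calculateRoles`. Incomplete trigrams are left out on purpose.

```cpp
#include <array>
#include <cctype>
#include <set>
#include <string>
#include <vector>

enum class Role { Separator, Head, Tail };

// Simplified stand-in for FuzzyMatch segmentation (an assumption of this
// sketch): a segment starts at the beginning of the name, after a
// non-alphanumeric character, or at a lower-to-upper case transition.
std::vector<Role> simpleRoles(const std::string &Name) {
  std::vector<Role> Roles(Name.size(), Role::Separator);
  for (std::size_t I = 0; I < Name.size(); ++I) {
    unsigned char C = Name[I];
    if (!std::isalnum(C))
      continue;
    unsigned char Prev = I == 0 ? '_' : Name[I - 1];
    bool IsHead = !std::isalnum(Prev) || (std::isupper(C) && std::islower(Prev));
    Roles[I] = IsHead ? Role::Head : Role::Tail;
  }
  return Roles;
}

// Collects every three-character sequence reachable by moving to the next
// character of the same segment, the head of the next segment, or the head of
// the segment after that. Index 0 doubles as "no such jump", as in the patch.
std::set<std::string> fullTrigrams(const std::string &Name) {
  std::vector<Role> Roles = simpleRoles(Name);
  std::string Lower;
  for (unsigned char C : Name)
    Lower.push_back(static_cast<char>(std::tolower(C)));

  // Next[I] = {next tail, next head, skip-1-next head} as seen from index I.
  std::vector<std::array<unsigned, 3>> Next(Name.size());
  unsigned NextTail = 0, NextHead = 0, NextNextHead = 0;
  for (int I = static_cast<int>(Name.size()) - 1; I >= 0; --I) {
    Next[I] = {{NextTail, NextHead, NextNextHead}};
    NextTail = Roles[I] == Role::Tail ? static_cast<unsigned>(I) : 0u;
    if (Roles[I] == Role::Head) {
      NextNextHead = NextHead;
      NextHead = static_cast<unsigned>(I);
    }
  }

  std::set<std::string> Result;
  for (std::size_t I = 0; I < Name.size(); ++I) {
    if (Roles[I] == Role::Separator)
      continue;
    for (unsigned J : Next[I]) {
      if (!J)
        continue;
      for (unsigned K : Next[J]) {
        if (!K)
          continue;
        Result.insert(std::string{Lower[I], Lower[J], Lower[K]});
      }
    }
  }
  return Result;
}
```

For "abc_def" the sketch produces abc, abd, ade, bcd, bde, cde and def, the same complete trigrams that the unit test in this revision expects for that identifier.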

clang-tools-extra/trunk/clangd/index/dex/Trigram.cpp

 //===--- Trigram.cpp - Trigram generation for Fuzzy Matching ----- C++ --===//
 //
 // The LLVM Compiler Infrastructure
 //
 // This file is distributed under the University of Illinois Open Source
 // License. See LICENSE.TXT for details.
 //
 //===----------------------------------------------------------------------===//

 #include "Trigram.h"
 #include "../../FuzzyMatch.h"
 #include "Token.h"

 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/DenseSet.h"
 #include "llvm/ADT/StringExtras.h"

 #include <cctype>
 #include <queue>
 #include <string>

 using namespace llvm;

 namespace clang {
 namespace clangd {
 namespace dex {

-// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach would
-// be generating unigrams and bigrams here, too. This would prevent symbol index
-// from applying fuzzy matching on a tremendous number of symbols and allow
-// supplementary retrieval for short queries.
-//
-// Short names (total segment length <3 characters) are currently ignored.
+/// This is used to mark unigrams and bigrams and distinct them from complete
+/// trigrams. Since '$' is not present in valid identifier names, it is safe to
+/// use it as the special symbol.
+static const char END_MARKER = '$';

 std::vector<Token> generateIdentifierTrigrams(llvm::StringRef Identifier) {
   // Apply fuzzy matching text segmentation.
   std::vector<CharRole> Roles(Identifier.size());
   calculateRoles(Identifier,
                  llvm::makeMutableArrayRef(Roles.data(), Identifier.size()));

   std::string LowercaseIdentifier = Identifier.lower();

   // For each character, store indices of the characters to which fuzzy matching
   // algorithm can jump. There are 3 possible variants:
   //
   // * Next Tail - next character from the same segment
   // * Next Head - front character of the next segment
   // * Skip-1-Next Head - front character of the skip-1-next segment
   //
   // Next stores tuples of three indices in the presented order, if a variant is
   // not available then 0 is stored.
   std::vector<std::array<unsigned, 3>> Next(LowercaseIdentifier.size());
   unsigned NextTail = 0, NextHead = 0, NextNextHead = 0;
+  // Store two first HEAD characters in the identifier (if present).
+  std::deque<char> TwoHeads;
   for (int I = LowercaseIdentifier.size() - 1; I >= 0; --I) {
     Next[I] = {{NextTail, NextHead, NextNextHead}};
     NextTail = Roles[I] == Tail ? I : 0;
     if (Roles[I] == Head) {
       NextNextHead = NextHead;
       NextHead = I;
+      TwoHeads.push_front(LowercaseIdentifier[I]);
+      if (TwoHeads.size() > 2)
+        TwoHeads.pop_back();
     }
   }

   DenseSet<Token> UniqueTrigrams;
-  std::array<char, 4> Chars;
+  auto add = [&](std::string Chars) {
+    UniqueTrigrams.insert(Token(Token::Kind::Trigram, Chars));
+  };
+
+  // FIXME(kbobyrev): Instead of producing empty trigram for each identifier,
+  // just use True Iterator on the query side when the query string is empty.
+  add({{END_MARKER, END_MARKER, END_MARKER}});
+
+  if (TwoHeads.size() == 2)
+    add({{TwoHeads.front(), TwoHeads.back(), END_MARKER}});
+
+  if (!LowercaseIdentifier.empty())
+    add({{LowercaseIdentifier.front(), END_MARKER, END_MARKER}});
+
+  if (LowercaseIdentifier.size() >= 2)
+    add({{LowercaseIdentifier[0], LowercaseIdentifier[1], END_MARKER}});
+
+  if (LowercaseIdentifier.size() >= 3)
+    add({{LowercaseIdentifier[0], LowercaseIdentifier[1],
+          LowercaseIdentifier[2]}});
+
+  // Iterate through valid seqneces of three characters Fuzzy Matcher can
+  // process.
   for (size_t I = 0; I < LowercaseIdentifier.size(); ++I) {
     // Skip delimiters.
     if (Roles[I] != Head && Roles[I] != Tail)
       continue;
     for (const unsigned J : Next[I]) {
       if (!J)
         continue;
       for (const unsigned K : Next[J]) {
         if (!K)
           continue;
-        Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J],
-                  LowercaseIdentifier[K], 0}};
-        auto Trigram = Token(Token::Kind::Trigram, Chars.data());
-        // Push unique trigrams to the result.
-        if (!UniqueTrigrams.count(Trigram)) {
-          UniqueTrigrams.insert(Trigram);
-        }
+        add({{LowercaseIdentifier[I], LowercaseIdentifier[J],
+              LowercaseIdentifier[K]}});
       }
     }
   }

   std::vector<Token> Result;
   for (const auto &Trigram : UniqueTrigrams)
     Result.push_back(Trigram);

   return Result;
 }

-// FIXME(kbobyrev): Similarly, to generateIdentifierTrigrams, this ignores short
-// inputs (total segment length <3 characters).
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
   // Apply fuzzy matching text segmentation.
   std::vector<CharRole> Roles(Query.size());
   calculateRoles(Query, llvm::makeMutableArrayRef(Roles.data(), Query.size()));

+  // Additional pass is necessary to count valid identifier characters.
+  // Depending on that, this function might return incomplete trigram.
+  unsigned ValidSymbolsCount = 0;
+  for (size_t I = 0; I < Roles.size(); ++I)
+    if (Roles[I] == Head || Roles[I] == Tail)
+      ++ValidSymbolsCount;
+
   std::string LowercaseQuery = Query.lower();

   DenseSet<Token> UniqueTrigrams;
-  std::deque<char> Chars;

+  // If the number of symbols which can form fuzzy matching trigram is not
+  // sufficient, generate a single incomplete trigram for query.
+  if (ValidSymbolsCount < 3) {
+    std::string Chars = LowercaseQuery.substr(0, std::min(3UL, Query.size()));
+    Chars.append(3 - Chars.size(), END_MARKER);
+    UniqueTrigrams.insert(Token(Token::Kind::Trigram, Chars));
+  } else {
+    std::deque<char> Chars;
     for (size_t I = 0; I < LowercaseQuery.size(); ++I) {
       // If current symbol is delimiter, just skip it.
       if (Roles[I] != Head && Roles[I] != Tail)
         continue;

       Chars.push_back(LowercaseQuery[I]);

       if (Chars.size() > 3)
         Chars.pop_front();

       if (Chars.size() == 3) {
-        auto Trigram =
-            Token(Token::Kind::Trigram, std::string(begin(Chars), end(Chars)));
-        // Push unique trigrams to the result.
-        if (!UniqueTrigrams.count(Trigram)) {
-          UniqueTrigrams.insert(Trigram);
+        UniqueTrigrams.insert(
+            Token(Token::Kind::Trigram, std::string(begin(Chars), end(Chars))));
       }
     }
   }

   std::vector<Token> Result;
   for (const auto &Trigram : UniqueTrigrams)
     Result.push_back(Trigram);

   return Result;
 }

 } // namespace dex
 } // namespace clangd
 } // namespace clang
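
The query-side sliding window for queries with at least three letters or digits can be sketched without the LLVM types as follows. `slidingQueryTrigrams` is a hypothetical name, `std::set` replaces `DenseSet<Token>`, and a plain `isalnum` check is used in place of FuzzyMatch roles, which is an assumption of this sketch. Short queries, which the patch turns into a single padded token, are deliberately not handled here.

```cpp
#include <cctype>
#include <deque>
#include <set>
#include <string>

// Hypothetical sketch of query-side extraction for queries with at least three
// letters or digits: slide a window of three over the alphanumeric characters,
// skipping delimiters such as '_'. Segmentation via FuzzyMatch is replaced by a
// plain isalnum() check, which is an assumption of this sketch.
std::set<std::string> slidingQueryTrigrams(const std::string &Query) {
  std::set<std::string> Result;
  std::deque<char> Window;
  for (unsigned char C : Query) {
    if (!std::isalnum(C))
      continue;
    Window.push_back(static_cast<char>(std::tolower(C)));
    // Keep only the three most recent valid characters.
    if (Window.size() > 3)
      Window.pop_front();
    if (Window.size() == 3)
      Result.insert(std::string(Window.begin(), Window.end()));
  }
  return Result;
}
```

Unlike the identifier side, no jumps between segments are generated: for "abc_def" the window yields only abc, bcd, cde and def.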

clang-tools-extra/trunk/unittests/clangd/DexIndexTests.cpp

(265 lines not shown)
 trigramsAre(std::initializer_list<std::string> Trigrams) {
   std::vector<Token> Tokens;
   for (const auto &Symbols : Trigrams) {
     Tokens.push_back(Token(Token::Kind::Trigram, Symbols));
   }
   return testing::UnorderedElementsAreArray(Tokens);
 }

 TEST(DexIndexTrigrams, IdentifierTrigrams) {
-  EXPECT_THAT(generateIdentifierTrigrams("X86"), trigramsAre({"x86"}));
+  EXPECT_THAT(generateIdentifierTrigrams("X86"),
+              trigramsAre({"x86", "x$$", "x8$", "$$$"}));

-  EXPECT_THAT(generateIdentifierTrigrams("nl"), trigramsAre({}));
+  EXPECT_THAT(generateIdentifierTrigrams("nl"),
+              trigramsAre({"nl$", "n$$", "$$$"}));
+
+  EXPECT_THAT(generateIdentifierTrigrams("n"), trigramsAre({"n$$", "$$$"}));

   EXPECT_THAT(generateIdentifierTrigrams("clangd"),
-              trigramsAre({"cla", "lan", "ang", "ngd"}));
+              trigramsAre({"c$$", "cl$", "cla", "lan", "ang", "ngd", "$$$"}));

   EXPECT_THAT(generateIdentifierTrigrams("abc_def"),
-              trigramsAre({"abc", "abd", "ade", "bcd", "bde", "cde", "def"}));
+              trigramsAre({"a$$", "abc", "abd", "ade", "bcd", "bde", "cde",
+                           "def", "ab$", "ad$", "$$$"}));

-  EXPECT_THAT(
-      generateIdentifierTrigrams("a_b_c_d_e_"),
-      trigramsAre({"abc", "abd", "acd", "ace", "bcd", "bce", "bde", "cde"}));
+  EXPECT_THAT(generateIdentifierTrigrams("a_b_c_d_e_"),
+              trigramsAre({"a$$", "a_$", "a_b", "abc", "abd", "acd", "ace",
+                           "bcd", "bce", "bde", "cde", "ab$", "$$$"}));

-  EXPECT_THAT(
-      generateIdentifierTrigrams("unique_ptr"),
-      trigramsAre({"uni", "unp", "upt", "niq", "nip", "npt", "iqu", "iqp",
-                   "ipt", "que", "qup", "qpt", "uep", "ept", "ptr"}));
+  EXPECT_THAT(generateIdentifierTrigrams("unique_ptr"),
+              trigramsAre({"u$$", "uni", "unp", "upt", "niq", "nip", "npt",
+                           "iqu", "iqp", "ipt", "que", "qup", "qpt", "uep",
+                           "ept", "ptr", "un$", "up$", "$$$"}));

   EXPECT_THAT(generateIdentifierTrigrams("TUDecl"),
-              trigramsAre({"tud", "tde", "ude", "dec", "ecl"}));
+              trigramsAre({"t$$", "tud", "tde", "ude", "dec", "ecl", "tu$",
+                           "td$", "$$$"}));

   EXPECT_THAT(generateIdentifierTrigrams("IsOK"),
-              trigramsAre({"iso", "iok", "sok"}));
+              trigramsAre({"i$$", "iso", "iok", "sok", "is$", "io$", "$$$"}));

-  EXPECT_THAT(generateIdentifierTrigrams("abc_defGhij__klm"),
-              trigramsAre({
-                  "abc", "abd", "abg", "ade", "adg", "adk", "agh", "agk", "bcd",
-                  "bcg", "bde", "bdg", "bdk", "bgh", "bgk", "cde", "cdg", "cdk",
-                  "cgh", "cgk", "def", "deg", "dek", "dgh", "dgk", "dkl", "efg",
-                  "efk", "egh", "egk", "ekl", "fgh", "fgk", "fkl", "ghi", "ghk",
-                  "gkl", "hij", "hik", "hkl", "ijk", "ikl", "jkl", "klm",
-              }));
+  EXPECT_THAT(
+      generateIdentifierTrigrams("abc_defGhij__klm"),
+      trigramsAre({"a$$", "abc", "abd", "abg", "ade", "adg", "adk", "agh",
+                   "agk", "bcd", "bcg", "bde", "bdg", "bdk", "bgh", "bgk",
+                   "cde", "cdg", "cdk", "cgh", "cgk", "def", "deg", "dek",
+                   "dgh", "dgk", "dkl", "efg", "efk", "egh", "egk", "ekl",
+                   "fgh", "fgk", "fkl", "ghi", "ghk", "gkl", "hij", "hik",
+                   "hkl", "ijk", "ikl", "jkl", "klm", "ab$", "ad$", "$$$"}));
 }

 TEST(DexIndexTrigrams, QueryTrigrams) {
-  EXPECT_THAT(generateQueryTrigrams("X86"), trigramsAre({"x86"}));
+  EXPECT_THAT(generateQueryTrigrams("c"), trigramsAre({"c$$"}));
+  EXPECT_THAT(generateQueryTrigrams("cl"), trigramsAre({"cl$"}));
+  EXPECT_THAT(generateQueryTrigrams("cla"), trigramsAre({"cla"}));
+
+  EXPECT_THAT(generateQueryTrigrams("_"), trigramsAre({"_$$"}));
+  EXPECT_THAT(generateQueryTrigrams("__"), trigramsAre({"__$"}));
+  EXPECT_THAT(generateQueryTrigrams("___"), trigramsAre({"___"}));

-  EXPECT_THAT(generateQueryTrigrams("nl"), trigramsAre({}));
+  EXPECT_THAT(generateQueryTrigrams("X86"), trigramsAre({"x86"}));

   EXPECT_THAT(generateQueryTrigrams("clangd"),
               trigramsAre({"cla", "lan", "ang", "ngd"}));

   EXPECT_THAT(generateQueryTrigrams("abc_def"),
               trigramsAre({"abc", "bcd", "cde", "def"}));

   EXPECT_THAT(generateQueryTrigrams("a_b_c_d_e_"),
(18 lines not shown)