This is an archive of the discontinued LLVM Phabricator instance.


Differential D50700

[clangd] Generate better incomplete bigrams for the Dex index
Abandoned · Public

Authored by kbobyrev on Aug 14 2018, 4:50 AM.

Details

Reviewers: None

Summary

Currently, the query trigram generator simply yields the `u_p` trigram for the `u_p` query. This is not optimal, since the user is likely trying to match two heads with such a query (for example, `unique_ptr`). This patch addresses the issue.

Event Timeline

kbobyrev created this revision. Aug 14 2018, 4:50 AM

Herald added subscribers: arphaman, jkorous, MaskRay. Aug 14 2018, 4:50 AM

Treat leading underscores as additional signals and don't extract two heads in that case.

ioeric added inline comments. Aug 16 2018, 6:26 AM

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
324	I'm not sure if this is correct. If users have explicitly typed `_`, they are likely to want a `_` there. You mentioned in the patch summary that users might want to match two heads with this. Could you provide an example?

kbobyrev added inline comments. Aug 16 2018, 6:33 AM

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
324	The particular example I had in mind was `unique_ptr`. Effectively, if the query is `SC` then `StrCat` would be matched (because the incomplete trigram would be `sc$` and two heads from the symbol identifier would also yield `sc$`), but for `u_p`, `unique_ptr` is currently not matched. On the other hand, if there's something like `m_field` and the user types `mf` or `m_f`, it will be matched in both cases, because `m_field` yields `mf$` in the index build stage, so this change doesn't decrease code completion quality for such cases.
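
To make the examples in this comment concrete, here is a minimal standalone sketch of the query-side behaviour the revision proposes for short queries. It is not the clangd implementation: `isHead` is a hypothetical, simplified stand-in for the roles computed by `calculateRoles`, `shortQueryTrigram` is an illustrative name, and the function assumes the caller has already established that the query has fewer than three head or tail characters.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical, simplified head detection. The patch itself relies on
// calculateRoles() from FuzzyMatch; this only approximates it.
static bool isHead(const std::string &Text, std::size_t I) {
  assert(I < Text.size() && "index out of range");
  unsigned char C = Text[I];
  if (!std::isalnum(C))
    return false; // Separators such as '_' are never heads.
  if (I == 0)
    return true; // First character of the query.
  unsigned char Prev = Text[I - 1];
  if (!std::isalnum(Prev))
    return true; // First character after a separator.
  return std::isupper(C) && std::islower(Prev); // camelCase boundary.
}

// Sketch of the proposed rule for queries too short to form a full trigram:
// exactly two heads and no leading '_' yield "xy$"; everything else falls
// back to a '$'-padded prefix of the lowercased query.
std::string shortQueryTrigram(const std::string &Query) {
  const char EndMarker = '$';
  std::string Lower = Query;
  std::transform(
      Lower.begin(), Lower.end(), Lower.begin(),
      [](unsigned char C) { return static_cast<char>(std::tolower(C)); });

  std::size_t Heads = 0;
  for (std::size_t I = 0; I < Query.size(); ++I)
    if (isHead(Query, I))
      ++Heads;

  std::string Chars;
  if (Heads == 2 && Query.front() != '_') {
    // Keep only the two head characters, then pad with the end marker.
    for (std::size_t I = 0; I < Query.size(); ++I)
      if (isHead(Query, I))
        Chars += Lower[I];
    Chars += EndMarker;
  } else {
    Chars = Lower.substr(0, std::min<std::size_t>(3, Lower.size()));
    Chars.append(3 - Chars.size(), EndMarker);
  }
  return Chars;
}
```

Under these assumptions, `shortQueryTrigram("u_p")` returns `up$`, `shortQueryTrigram("m_f")` and `shortQueryTrigram("mf")` both return `mf$`, and `shortQueryTrigram("_u_p")` keeps the prefix form `_u_` because of the leading underscore.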

ioeric added inline comments. Aug 16 2018, 6:50 AM

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
324	The problem is that `u_p` can now match `upXXX` with `up$`, which might be a bit surprising for users. It seems to me that another way to handle this is to generate a `u_p` trigram for `unique_ptr`. Have we considered including separators like `_` in trigrams?
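
The concern can be illustrated with a toy posting-list model. This is only a sketch: `ToyIndex`, `addSymbol`, `lookup` and `buildToyIndex` are hypothetical names, and the index-side tokens are modelled solely on what this thread states (a two-character prefix token, plus a token built from the first two `_`-separated word starts), not on the actual clangd index code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Toy posting lists: incomplete trigram -> symbols indexed under it.
using ToyIndex = std::map<std::string, std::set<std::string>>;

// Hypothetical index-side tokens for short queries, as described in the
// thread: a two-character prefix token and, when the symbol has at least two
// '_'-separated words, a token made of the first two word starts.
void addSymbol(ToyIndex &Index, const std::string &Symbol) {
  assert(Symbol.size() >= 2 && "toy model expects at least two characters");
  std::string Lower;
  for (unsigned char C : Symbol)
    Lower += static_cast<char>(std::tolower(C));

  Index[Lower.substr(0, 2) + "$"].insert(Symbol);

  std::string Heads;
  for (std::size_t I = 0; I < Lower.size() && Heads.size() < 2; ++I)
    if (Lower[I] != '_' && (I == 0 || Lower[I - 1] == '_'))
      Heads += Lower[I];
  if (Heads.size() == 2)
    Index[Heads + "$"].insert(Symbol);
}

// Returns the symbols stored under a query token (empty set if none).
std::set<std::string> lookup(const ToyIndex &Index, const std::string &Token) {
  auto It = Index.find(Token);
  return It == Index.end() ? std::set<std::string>() : It->second;
}

// Small fixed index used for the illustration.
ToyIndex buildToyIndex() {
  ToyIndex Index;
  addSymbol(Index, "unique_ptr"); // stored under "un$" and "up$"
  addSymbol(Index, "update");     // stored under "up$" only
  addSymbol(Index, "m_field");    // stored under "m_$" and "mf$"
  return Index;
}
```

In this model, looking up the token `u_p` finds nothing, which corresponds to `unique_ptr` not being matched today. Looking up `up$`, the token the patch would generate for the `u_p` query, returns both `unique_ptr` and `update`, which is the surprising extra match described above.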

kbobyrev removed reviewers: ioeric, ilya-biryukov. Aug 17 2018, 4:00 AM

kbobyrev removed subscribers: MaskRay, jkorous, arphaman, cfe-commits.

Herald added subscribers: ioeric, ilya-biryukov. Aug 17 2018, 4:00 AM

As Eric suggested, such a change might reduce the quality of code completion. A more generic approach is yet to be investigated and requires significantly more time; we should try it after we get other parts of the index to work as expected.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
324	I agree, this might be a better solution. Still, it seems too early to optimize for a case as general as the one you suggested. I believe word bounds might help here, but we should probably try them after we have a fully functional index upstream.

Revision Contents

Path                                                    Size
clang-tools-extra/clangd/index/dex/Trigram.h            6 lines
clang-tools-extra/clangd/index/dex/Trigram.cpp          28 lines
clang-tools-extra/unittests/clangd/DexIndexTests.cpp    3 lines

Diff 160555

clang-tools-extra/clangd/index/dex/Trigram.h

(56 lines not shown)
 /// Returns list of unique fuzzy-search trigrams given a query.
 ///
 /// Query is segmented using FuzzyMatch API and downcasted to lowercase. Then,
 /// the simplest trigrams - sequences of three consecutive letters and digits
 /// are extracted and returned after deduplication.
 ///
 /// For short queries (less than 3 characters with Head or Tail roles in Fuzzy
 /// Matching segmentation) this returns a single trigram with the first
-/// characters (up to 3) to perfrom prefix match.
+/// characters (up to 3) to perfrom prefix match. However, if the query is short
+/// but it contains two HEAD symbols then the returned trigram would be an
+/// incomplete bigram with those two HEADs (unless query starts with '_' which
+/// is treated as an additional information). This would help to match
+/// "unique_ptr" and similar symbols with "u_p" query
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);

 } // namespace dex
 } // namespace clangd
 } // namespace clang

 #endif

clang-tools-extra/clangd/index/dex/Trigram.cpp

(110 lines not shown)

 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
   // Apply fuzzy matching text segmentation.
   std::vector<CharRole> Roles(Query.size());
   calculateRoles(Query, llvm::makeMutableArrayRef(Roles.data(), Query.size()));

   // Additional pass is necessary to count valid identifier characters.
   // Depending on that, this function might return incomplete trigram.
+  unsigned Heads = 0;
   unsigned ValidSymbolsCount = 0;
-  for (size_t I = 0; I < Roles.size(); ++I)
-    if (Roles[I] == Head || Roles[I] == Tail)
+  for (size_t I = 0; I < Roles.size(); ++I) {
+    if (Roles[I] == Head) {
+      ++ValidSymbolsCount;
+      ++Heads;
+    } else if (Roles[I] == Tail) {
       ++ValidSymbolsCount;
+    }
+  }

   std::string LowercaseQuery = Query.lower();

   DenseSet<Token> UniqueTrigrams;

   // If the number of symbols which can form fuzzy matching trigram is not
   // sufficient, generate a single incomplete trigram for query.
   if (ValidSymbolsCount < 3) {
-    std::string Chars =
-        LowercaseQuery.substr(0, std::min<size_t>(3UL, Query.size()));
-    Chars.append(3 - Chars.size(), END_MARKER);
+    std::string Chars;
+    // If the query is not long enough to form a trigram but contains two heads
+    // the returned trigram should be "xy$" where "x" and "y" are the heads.
+    // This might be particulary important for cases like "u_p" to match
+    // "unique_ptr" and similar symbols from the C++ Standard Library.
+    if (Heads == 2 && !Query.startswith("_")) {
+      for (size_t I = 0; I < LowercaseQuery.size(); ++I)
+        if (Roles[I] == Head)
+          Chars += LowercaseQuery[I];
+
+      Chars += END_MARKER;
+    } else {
+      Chars = LowercaseQuery.substr(0, std::min<size_t>(3UL, Query.size()));
+      Chars.append(3 - Chars.size(), END_MARKER);
+    }
     UniqueTrigrams.insert(Token(Token::Kind::Trigram, Chars));
   } else {
     std::deque<char> Chars;
     for (size_t I = 0; I < LowercaseQuery.size(); ++I) {
       // If current symbol is delimiter, just skip it.
       if (Roles[I] != Head && Roles[I] != Tail)
         continue;

(22 lines not shown)

clang-tools-extra/unittests/clangd/DexIndexTests.cpp

(315 lines not shown; enclosing function: TEST(DexIndexTrigrams, QueryTrigrams))
   EXPECT_THAT(generateQueryTrigrams("c"), trigramsAre({"c$$"}));
   EXPECT_THAT(generateQueryTrigrams("cl"), trigramsAre({"cl$"}));
   EXPECT_THAT(generateQueryTrigrams("cla"), trigramsAre({"cla"}));

   EXPECT_THAT(generateQueryTrigrams("_"), trigramsAre({"_$$"}));
   EXPECT_THAT(generateQueryTrigrams("__"), trigramsAre({"__$"}));
   EXPECT_THAT(generateQueryTrigrams("___"), trigramsAre({"___"}));

+  EXPECT_THAT(generateQueryTrigrams("u_p"), trigramsAre({"up$"}));
+  EXPECT_THAT(generateQueryTrigrams("_u_p"), trigramsAre({"_u_"}));
+
   EXPECT_THAT(generateQueryTrigrams("X86"), trigramsAre({"x86"}));

   EXPECT_THAT(generateQueryTrigrams("clangd"),
               trigramsAre({"cla", "lan", "ang", "ngd"}));

   EXPECT_THAT(generateQueryTrigrams("abc_def"),
               trigramsAre({"abc", "bcd", "cde", "def"}));

(19 lines not shown)