This is an archive of the discontinued LLVM Phabricator instance.

Differential D49417

[clangd] Implement trigram generation algorithm for new symbol index
AbandonedPublic

Authored by omtcyfz on Jul 17 2018, 1:55 AM.


Details

Reviewers: None

Summary

This patch introduces a trigram generation algorithm for the symbol index proposed in a recent design document.
RFC in the mailing list: http://lists.llvm.org/pipermail/clangd-dev/2018-July/000022.html

The trigram generation algorithm is described in detail in the proposal:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit#heading=h.903u1zon9nkj

Diff Detail

Event Timeline

omtcyfz created this revision. Jul 17 2018, 1:55 AM

Herald added subscribers: jkorous, MaskRay, mgorny. · Jul 17 2018, 1:55 AM

Fix documentation indentation in SearchAtom description.

Wipe redundant FIXMEs.

omtcyfz updated this revision to Diff 155828.Jul 17 2018, 2:36 AM

Fix unittest file name in header.

Some high-level comments before jumping into details.

clang-tools-extra/clangd/index/noctem/SearchAtom.cpp
23 ↗	(On Diff #155830)	maybe also add how short symbol names are handled in the current algorithm and what the potential solution would be.
28 ↗	(On Diff #155830)	Any reason not to do this in segmentation?
clang-tools-extra/clangd/index/noctem/SearchAtom.h
27 ↗	(On Diff #155830)	Any reason to call this `Atom` instead of `Token`? `Token` seems to be a more commonly used name for this (e.g. https://yomguithereal.github.io/mnemonist/inverted-index#tokens).
45 ↗	(On Diff #155830)	`Namespace` can be easily confused with namespaces in clang world. Maybe `Kind` or `Type`?
51 ↗	(On Diff #155830)	What is an empty atom? Why do we need it?
52 ↗	(On Diff #155830)	Why default `Type` to `Trigram`?
52 ↗	(On Diff #155830)	Please add documentation. What is the semantic of `Data`?
53 ↗	(On Diff #155830)	Should we also incorporate `Type` into hash?
53 ↗	(On Diff #155830)	I'm wondering if we should use different hashes for different token types. For example, a trigram token "xyz" can be encoded in a 4-byte int with `('x'<<16) | ('y'<<8) | 'z'`, which seems cheaper than `std::hash`.
62 ↗	(On Diff #155830)	nit: s/getData/data/ s/getType/type/
76 ↗	(On Diff #155830)	nit: I'd avoid calling these `tokens`, as they can be confused with tokens for the inverted index. How about `segments`? Similarly, `tokenize` should probably be something like `segment` or `segmentIdentifier`.
141 ↗	(On Diff #155830)	`generateSearchAtoms` is too generalized a name for this. Maybe just `trigram()`? I also think this could take a list of segments from `tokenize`.
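
The integer encoding suggested above for trigram tokens can be sketched roughly as follows. This is only an illustration under stated assumptions: `packTrigram` and `unpackTrigram` are hypothetical names that do not appear in the patch, and the sketch assumes 8-bit characters. The bytes have to be combined with bitwise OR, because AND-ing values that occupy disjoint bit ranges always yields 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the patch): packs a 3-character trigram
// into the low 24 bits of a 32-bit integer.
inline std::uint32_t packTrigram(const std::string &Trigram) {
  assert(Trigram.size() == 3 && "Trigram should contain three characters.");
  // Cast through unsigned char so that bytes >= 0x80 are not sign-extended.
  const auto A =
      static_cast<std::uint32_t>(static_cast<unsigned char>(Trigram[0]));
  const auto B =
      static_cast<std::uint32_t>(static_cast<unsigned char>(Trigram[1]));
  const auto C =
      static_cast<std::uint32_t>(static_cast<unsigned char>(Trigram[2]));
  return (A << 16) | (B << 8) | C;
}

// Hypothetical inverse of packTrigram, handy for debugging and tests.
inline std::string unpackTrigram(std::uint32_t Packed) {
  return {static_cast<char>((Packed >> 16) & 0xFF),
          static_cast<char>((Packed >> 8) & 0xFF),
          static_cast<char>(Packed & 0xFF)};
}
```

Because the mapping is a bijection on 3-byte strings, two distinct trigrams can never collide, which is not guaranteed for a general-purpose string hash.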

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better name for the symbol index to substitute "Noctem", which doesn't point to any of the project's features.

In D49417#1166538, @omtcyfz wrote:

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better name for the symbol index to substitute "Noctem", which doesn't point to any of the project's features.

Ooh, a bikeshed!
Suggestion: dex - short, pronounceable, suggests index, and is a trigram :-)

Herald added a subscriber: arphaman. · Jul 19 2018, 2:39 AM

ioeric added inline comments. Jul 19 2018, 2:58 AM

clang-tools-extra/clangd/index/noctem/SearchToken.cpp
38 ↗	(On Diff #156073)	Please also add what the current behavior is for short names. Do we just ignore them?
44 ↗	(On Diff #156073)	Could we replace `UniqueTrigrams`+`Trigrams` with a dense map from hash to token?
47 ↗	(On Diff #156073)	nit: a redundant "of" EOL.
57 ↗	(On Diff #156073)	nit: Maybe S1, S2, S3 instead of FirstSegment, ...?
68 ↗	(On Diff #156073)	This seems wrong... wouldn't this give you a concatenation of three segments?
68 ↗	(On Diff #156073)	For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can be reused) and std::move it into the constructor. Might be cheaper than creating a std::string
87 ↗	(On Diff #156073)	nit: `Position < Segment.size() - 2` seems more commonly used.
92 ↗	(On Diff #156073)	Comment for this loop?
135 ↗	(On Diff #156073)	use `trim`?
138 ↗	(On Diff #156073)	Maybe first split name on `_` and then run further upper-lower segmentation on the split segments?
clang-tools-extra/clangd/index/noctem/SearchToken.h
58 ↗	(On Diff #156073)	Any reason this has to be `operator()` instead of a `hash` method? `operator()` for hash value is not trivial on the call site.
69 ↗	(On Diff #156073)	Who's the user of this friend function? Could it just call `Token.hash()`?
80 ↗	(On Diff #156073)	nit: scope in clangd world is "foo::bar::baz::", but the global scope is "".
159 ↗	(On Diff #156073)	This seems to be the same as just `generateTrigrams(segmentIdentifier(Name))`, so I'd drop it.
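
The two-stage segmentation suggested in the comments above (split on `_` first, then split each piece on case transitions) could look roughly like the sketch below. It is not the patch's implementation: `splitOnUnderscores`, `splitOnCase` and `segment` are hypothetical names, the code uses only the standard library, and it keeps the original letter case instead of normalizing it.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Stage 1 (hypothetical helper): split on '_' and drop empty pieces, so that
// leading, trailing and repeated underscores all disappear.
inline std::vector<std::string> splitOnUnderscores(const std::string &Name) {
  std::vector<std::string> Pieces;
  std::string Current;
  for (char C : Name) {
    if (C != '_') {
      Current.push_back(C);
      continue;
    }
    if (!Current.empty())
      Pieces.push_back(Current);
    Current.clear();
  }
  if (!Current.empty())
    Pieces.push_back(Current);
  return Pieces;
}

// Stage 2 (hypothetical helper): split one underscore-free piece on case
// transitions. Digits are treated like lowercase letters.
inline void splitOnCase(const std::string &Piece,
                        std::vector<std::string> &Out) {
  auto IsUpper = [](char C) {
    return std::isupper(static_cast<unsigned char>(C)) != 0;
  };
  std::size_t Start = 0;
  for (std::size_t I = 1; I < Piece.size(); ++I) {
    const bool PrevUpper = IsUpper(Piece[I - 1]);
    const bool CurUpper = IsUpper(Piece[I]);
    const bool NextLower = I + 1 < Piece.size() && !IsUpper(Piece[I + 1]);
    // "fooBar": a lowercase -> uppercase transition starts a new segment.
    // "TUDecl": the last letter of an uppercase run starts a new segment
    // when a lowercase letter (or digit) follows it.
    if (CurUpper && (!PrevUpper || NextLower)) {
      Out.push_back(Piece.substr(Start, I - Start));
      Start = I;
    }
  }
  if (Start < Piece.size())
    Out.push_back(Piece.substr(Start));
}

// Runs both stages on an unqualified symbol name.
inline std::vector<std::string> segment(const std::string &Name) {
  std::vector<std::string> Segments;
  for (const std::string &Piece : splitOnUnderscores(Name))
    splitOnCase(Piece, Segments);
  return Segments;
}
```

Handling underscores in a separate pass keeps the case-transition logic free of special cases for leading, trailing, or repeated separators.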

(just .h files. +1 to eric's comments except where noted)

clang-tools-extra/clangd/index/noctem/SearchToken.h
2 ↗	(On Diff #156073)	nit: something went wrong here
28 ↗	(On Diff #156073)	nit: remove \brief
32 ↗	(On Diff #156073)	maybe just giving one example here, and moving the concrete semantics to Kind?
34 ↗	(On Diff #156073)	ITYM suffix; consider just "under directory"
35 ↗	(On Diff #156073)	namespace tokens don't have prefix/suffix semantics, they're exact
44 ↗	(On Diff #156073)	nit: this is a really common type, and it's namespaced. `Token` is probably fine.
46 ↗	(On Diff #156073)	doc: the fact that Kind is a namespace for data, and (on each enum value) the semantics
46 ↗	(On Diff #156073)	nit: drop `: short` as you don't seem to make use of it anywhere. Can add it later if we want to optimize in-memory structure (currently it has no effect)
48 ↗	(On Diff #156073)	in the first patch, probably just want trigram and scope (to ensure generality) and drop the rest
52 ↗	(On Diff #156073)	what is this for?
61 ↗	(On Diff #156073)	(nit: since you have the hashcode, comparing it before the data is probably faster?)
64 ↗	(On Diff #156073)	nit: first const is not useful
64 ↗	(On Diff #156073)	instead of these accessors, can we just make it a struct with const members and a constructor that ensures the hash is correctly set? That should reflect the lightweightness/lack of behavior in this object
69 ↗	(On Diff #156073)	This is the LLVM convention for hashing, see `ADT/Hashing.h`
76 ↗	(On Diff #156073)	I'd move these to the Kind enum - these shouldn't be examples but rather the canonical documentation for these
81 ↗	(On Diff #156073)	need to more fully spell this out: when is it full vs relative? is it a URI? Do directories have a trailing slash? But really, leave this out for now.
83 ↗	(On Diff #156073)	So SmallString always contains 3 pointers, i.e. 24 bytes. The very smallest SmallString that makes sense is <8>, as you'll end up padded anyway.
89 ↗	(On Diff #156073)	(remove leading \brief here and elsewhere)
89 ↗	(On Diff #156073)	Please move trigram-related stuff to `Trigrams.h`
115 ↗	(On Diff #156073)	Please don't implement a different set of segmentation rules to those that exist in FuzzyMatch. Happy to chat about a sensible API to expose and where it should live.
115 ↗	(On Diff #156073)	given your examples always return substrings, these should be stringrefs or offsets. But per above comment, we should rethink this function.
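
Two of the suggestions above (a plain struct with const members whose constructor guarantees the hash is set, and comparing the hash before the data) can be illustrated with the following sketch. `SimpleToken` is a hypothetical, standard-library-only stand-in for the patch's `Token`, and `std::hash` is used here purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for the Token class in the patch.
struct SimpleToken {
  enum class Kind { Trigram, Scope };

  // The constructor is the only place where Hash is computed, so a
  // SimpleToken can never carry a stale or missing hash.
  SimpleToken(std::string Data, Kind TokenKind)
      : Data(std::move(Data)), TokenKind(TokenKind),
        Hash(std::hash<std::string>{}(this->Data)) {}

  // Members are initialized in declaration order: Data, TokenKind, Hash.
  const std::string Data;
  const Kind TokenKind;
  const std::size_t Hash;
};

inline bool operator==(const SimpleToken &LHS, const SimpleToken &RHS) {
  // Cheap integer comparisons run first; the string comparison only happens
  // when the hashes and kinds already agree.
  return LHS.Hash == RHS.Hash && LHS.TokenKind == RHS.TokenKind &&
         LHS.Data == RHS.Data;
}
```

One trade-off of const members is that the struct is no longer assignable, which restricts some container operations; whether that matters depends on how the index stores its tokens.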

sammccall mentioned this in D49540: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 19 2018, 5:28 AM

omtcyfz added a parent revision: D49540: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 20 2018, 12:39 AM

sammccall mentioned this in rL337527: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 20 2018, 1:06 AM

sammccall mentioned this in rCTE337527: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 20 2018, 1:09 AM

Addressed most comments (aside from reusing fuzzy matching segmentation routine and making data + hash a separate structure).

Since I already submitted my next revision (https://reviews.llvm.org/D49546) through another account with a more verbose and less confusing username, I will remove everyone from this revision and recreate it under that account.

clang-tools-extra/clangd/index/noctem/SearchAtom.h
53 ↗	(On Diff #155830)	Discussed internally: we probably shouldn't since there will be collisions even inside a single token namespace when Data will be too long.
clang-tools-extra/clangd/index/noctem/SearchToken.cpp
44 ↗	(On Diff #156073)	I'm not sure how to iterate through the result then. Basically, having Trigrams ensures that these trigrams can be iterated after the function execution and inserted into the inverted index. Otherwise I should either expose a callback to add the generated trigrams or do something similar. Should I do that instead?
57 ↗	(On Diff #156073)	I think it's way less explicit and slightly more confusing. I try not to have one-letter names so I guess it's better to have full names instead.
68 ↗	(On Diff #156073)	But `std::string` would be created anyway, wouldn't it be?
87 ↗	(On Diff #156073)	If `Segment.size()` is 1 then this would be `1U - 2U`, which would be something very large.
138 ↗	(On Diff #156073)	Resolving both of these for now (though they're not really fixed) since I will rethink the algorithm given Sam's patch.
clang-tools-extra/clangd/index/noctem/SearchToken.h
52 ↗	(On Diff #156073)	Tags, for example. But yes, for now it's not very clear and probably not needed. Dropped this one.
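
The unsigned-arithmetic concern raised in the reply about the loop bound can be demonstrated with a small sketch. `countWindowsSafe` and `subtractedBound` are hypothetical helpers written for this illustration only; they contrast the `Position + 2 < Segment.size()` form with the value that `Segment.size() - 2` takes for very short segments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the 3-character windows in Segment using the
// bound that never subtracts from an unsigned size.
inline std::size_t countWindowsSafe(const std::string &Segment) {
  std::size_t Count = 0;
  for (std::size_t Position = 0; Position + 2 < Segment.size(); ++Position)
    ++Count;
  return Count;
}

// Hypothetical helper: the upper bound that `Position < Segment.size() - 2`
// would use. size() is unsigned, so the subtraction wraps around to a huge
// value whenever the segment has fewer than two characters.
inline std::size_t subtractedBound(const std::string &Segment) {
  return Segment.size() - 2;
}
```

The subtracting form is only safe when it is guarded by an explicit size check before the loop.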

omtcyfz removed a reviewer: ioeric. Jul 20 2018, 1:39 AM

omtcyfz removed a project: Restricted Project.

omtcyfz removed subscribers: arphaman, mgorny, MaskRay and 4 others.

Herald added a subscriber: ioeric. · Jul 20 2018, 1:39 AM

Abandon in favor of https://reviews.llvm.org/D49591.

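Revision Contents

Path	Size
clang-tools-extra/clangd/CMakeLists.txt	4 lines
clang-tools-extra/clangd/index/dex/Token.h	115 lines
clang-tools-extra/clangd/index/dex/Token.cpp	37 lines
clang-tools-extra/clangd/index/dex/Trigram.h	101 lines
clang-tools-extra/clangd/index/dex/Trigram.cpp	178 lines
clang-tools-extra/unittests/clangd/CMakeLists.txt	5 lines
clang-tools-extra/unittests/clangd/DexIndexTests.cpp	92 lines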

Diff 156443

clang-tools-extra/clangd/CMakeLists.txt

Show All 28 Lines	add_clang_library(clangDaemon
ProtocolHandlers.cpp		ProtocolHandlers.cpp
Quality.cpp		Quality.cpp
SourceCode.cpp		SourceCode.cpp
Threading.cpp		Threading.cpp
Trace.cpp		Trace.cpp
TUScheduler.cpp		TUScheduler.cpp
URI.cpp		URI.cpp
XRefs.cpp		XRefs.cpp

index/CanonicalIncludes.cpp		index/CanonicalIncludes.cpp
index/FileIndex.cpp		index/FileIndex.cpp
index/Index.cpp		index/Index.cpp
index/MemIndex.cpp		index/MemIndex.cpp
index/Merge.cpp		index/Merge.cpp
index/SymbolCollector.cpp		index/SymbolCollector.cpp
index/SymbolYAML.cpp		index/SymbolYAML.cpp

		index/dex/Token.cpp
		index/dex/Trigram.cpp

LINK_LIBS		LINK_LIBS
clangAST		clangAST
clangASTMatchers		clangASTMatchers
clangBasic		clangBasic
clangDriver		clangDriver
clangFormat		clangFormat
clangFrontend		clangFrontend
clangIndex		clangIndex
Show All 16 Lines

clang-tools-extra/clangd/index/dex/Token.h

This file was added.

				//===--- Token.h - Symbol Search primitive ----------------------- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// Tokens are keys for inverted index which are mapped to the
				// corresponding posting lists. Token objects represent a characteristic
				// of a symbol, which can be used to perform efficient search.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_DEX_TOKEN_H
				#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_DEX_TOKEN_H

				#include "llvm/ADT/DenseMap.h"
				#include "llvm/ADT/Hashing.h"
				#include "llvm/ADT/StringRef.h"

				#include <cstddef>
				#include <string>
				#include <vector>

				namespace clang {
				namespace clangd {
				namespace dex {

				/// Hashable Token, which represents a search token primitive, such as
				/// trigram for fuzzy search on unqualified symbol names.
				///
				/// Tokens can be used to perform more sophisticated search queries by
				/// constructing complex iterator trees.
				class Token {
				public:
				/// Kind specifies Token type which defines semantics for the internal
				/// representation (Data field), examples of such types are:
				///
				/// * Trigram for fuzzy search on unqualified symbol names.
				/// * Scope primitives, e.g. "symbol belongs to namespace foo::bar".
				/// * If the symbol represents a variable, token can be its type such as int,
				/// clang::Decl, …
				/// * For a symbol representing a function, this can be the
				/// return type.
				///
				/// Each Kind has different representation (i.e. Data field contents):
				///
				/// * Trigram: 3 bytes containing trigram characters
				/// * Scope: full scope name, such as "foo::bar::baz::" or "" (global scope)
				/// * Path: full or relative path to the directory
				/// * Type: full type name or the USR associated with this type
				///
				/// More Kinds can be added in the future.
				enum class Kind {
				Trigram,
				Scope,
				};

				Token(llvm::StringRef Data, Kind TokenKind);

				// Returns the precomputed hash.
				size_t hash() const { return Hash; }

				bool operator==(const Token &Other) const {
				return Hash == Other.Hash && TokenKind == Other.TokenKind &&
				Data == Other.Data;
				}

				llvm::StringRef data() const { return Data; }

				const Kind &kind() const { return TokenKind; }

				private:
				friend llvm::hash_code hash_value(const Token &Token) { return Token.Hash; }

				/// Representation which is unique among Token with the same Kind.
				// FIXME(kbobyrev): Put this into another structure
				std::string Data;
				/// Precomputed hash which is used as a key for inverted index.
				size_t Hash;
				Kind TokenKind;
				};

				} // namespace dex
				} // namespace clangd
				} // namespace clang

				namespace llvm {

				// Support Tokens as DenseMap keys.
				template <> struct DenseMapInfo<clang::clangd::dex::Token> {
				static inline clang::clangd::dex::Token getEmptyKey() {
				static clang::clangd::dex::Token EmptyKey(
				"EMPTYKEY", clang::clangd::dex::Token::Kind::Scope);
				return EmptyKey;
				}

				static inline clang::clangd::dex::Token getTombstoneKey() {
				static clang::clangd::dex::Token TombstoneKey(
				"TOMBSTONE_KEY", clang::clangd::dex::Token::Kind::Scope);
				return TombstoneKey;
				}

				static unsigned getHashValue(const clang::clangd::dex::Token &Tag) {
				return hash_value(Tag);
				}

				static bool isEqual(const clang::clangd::dex::Token &LHS,
				const clang::clangd::dex::Token &RHS) {
				return LHS == RHS;
				}
				};

				} // namespace llvm

				#endif
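
The header comment above describes tokens as keys of an inverted index that map to posting lists. A minimal sketch of that idea follows, under clearly stated assumptions: `TinyInvertedIndex`, `SymbolID` and `PostingList` are hypothetical names, tokens are plain strings rather than `Token` objects, and a query simply intersects the sorted posting lists of all its tokens.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using SymbolID = unsigned;                  // Hypothetical symbol identifier.
using PostingList = std::vector<SymbolID>;  // Sorted IDs of matching symbols.

// Hypothetical toy index: maps each token to the symbols that contain it.
struct TinyInvertedIndex {
  std::unordered_map<std::string, PostingList> Postings;

  // Symbols must be added in increasing ID order so that every posting list
  // stays sorted, which std::set_intersection below relies on.
  void add(SymbolID ID, const std::vector<std::string> &Tokens) {
    for (const std::string &T : Tokens) {
      PostingList &List = Postings[T];
      if (List.empty() || List.back() != ID)
        List.push_back(ID);
    }
  }

  // Returns the IDs of symbols that contain every token of the query.
  PostingList query(const std::vector<std::string> &Tokens) const {
    PostingList Result;
    bool First = true;
    for (const std::string &T : Tokens) {
      auto It = Postings.find(T);
      if (It == Postings.end())
        return {};  // An unknown token cannot match any symbol.
      if (First) {
        Result = It->second;
        First = false;
        continue;
      }
      PostingList Intersection;
      std::set_intersection(Result.begin(), Result.end(), It->second.begin(),
                            It->second.end(),
                            std::back_inserter(Intersection));
      Result = std::move(Intersection);
    }
    return Result;
  }
};
```

In a trigram-based index the query tokens would be the trigrams of the typed text, so the intersection yields only candidate symbols on which the more expensive fuzzy matching still has to run.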

clang-tools-extra/clangd/index/dex/Token.cpp

This file was added.

				//===--- Token.cpp - Symbol Search primitive --------------------- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "Token.h"

				#include <cassert>
				#include <functional>
				#include <string>

				namespace clang {
				namespace clangd {
				namespace dex {

				Token::Token(llvm::StringRef Data, Kind TokenKind)
				: Data(Data), TokenKind(TokenKind) {
				assert((TokenKind != Kind::Trigram || Data.size() == 3) &&
				"Trigram should contain three characters.");
				switch (TokenKind) {
				case Kind::Trigram:
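				// Combine the bytes with bitwise OR; AND-ing disjoint bit ranges yields 0.
				Hash = (static_cast<unsigned char>(Data[0]) << 16) |
				(static_cast<unsigned char>(Data[1]) << 8) |
				static_cast<unsigned char>(Data[2]);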
				break;
				default:
				Hash = std::hash<std::string>{}(Data);
				break;
				}
				}

				} // namespace dex
				} // namespace clangd
				} // namespace clang

clang-tools-extra/clangd/index/dex/Trigram.h

This file was added.

				//===--- Trigram.h - Trigram generation for Fuzzy Matching ------- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
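				// Trigram generation for the fuzzy-search symbol index: unqualified symbol
				// names are split into segments, from which trigram Tokens are extracted.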
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_DEX_TRIGRAM_H
				#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_DEX_TRIGRAM_H

				#include "Token.h"

				#include <string>
				#include <vector>

				namespace clang {
				namespace clangd {
				namespace dex {

				/// Splits unqualified symbol name into segments and casts letters to lowercase
				/// for trigram generation.
				///
				/// First stage of trigram generation algorithm. Given an unqualified symbol
				/// name, this outputs a sequence of string segments using the following rules:
				///
				/// * '_' is a separator. Multiple consecutive underscores are treated as a
				/// single separator. Underscores at the beginning and the end of the symbol
				/// name are skipped.
				///
				/// Examples: "unique_ptr" -> ["unique", "ptr"],
				/// "__builtin_popcount" -> ["builtin", "popcount"]
				/// "snake____case___" -> ["snake", "case"]
				///
				/// * Lowercase letter followed by an uppercase letter is a separator.
				///
				/// Examples: "kItemsCount" -> ["k", "Items", "Count"]
				///
				/// * Sequences of consecutive uppercase letters followed by a lowercase letter:
				/// the last uppercase letter is treated as the beginning of a next segment.
				///
				/// Examples: "TUDecl" -> ["TU", "Decl"]
				/// "kDaysInAWeek" -> ["k", "Days", "In", "A", "Week"]
				///
				/// Note: digits are treated as lowercase letters. Example: "X86" -> ["X86"]
				///
				/// FIXME(kbobyrev): Use the same segmentation algorithm as in fuzzy matching.
				/// FIXME(kbobyrev): Return StringRefs or Offsets.
				std::vector<std::string>
				segmentIdentifier(llvm::StringRef SymbolName);

				/// Returns list of unique fuzzy-search trigrams from unqualified symbol.
				///
				/// Runs trigram generation for fuzzy-search index on segments produced by
				/// segmentIdentifier(SymbolName);
				///
				///
				/// The motivation for trigram generation algorithm is that extracted trigrams
				/// are 3-char suffixes of paths through the fuzzy matching automaton. There are
				/// four classes of extracted trigrams:
				///
				/// * The simplest one consists of consecutive 3-char sequences of each segment.
				///
				/// Example: "trigram" -> ["tri", "rig", "igr", "gra", "ram"]
				///
				/// * The next class consists of the front characters of subsequent segments.
				///
				/// Example: ["translation", "unit", "decl"] -> ["tud"]
				///
				/// Note: skipping segments is allowed, but not more than one. For example,
				/// given ["a", "b", "c", "d", "e"] -> "ace" is allowed, but "ade" is not.
				///
				/// * Another class of trigrams consists of those with 2 characters in one
				/// segment and the front character of a subsequent segment (just as before,
				/// skipping up to one segment is allowed).
				///
				/// Example: ["ab", "c", "d", "e"] -> ["abc", "abd"]
				/// Note: similarly to the previous case, "abe" would not be allowed.
				///
				/// * The last class of trigrams is similar to the previous one: it takes one
				/// character from one segment and two front characters from the next or
				/// skip-1-next segments.
				///
				/// Example: ["a", "bc", "de", "fg"] -> ["abc", "ade"]
				/// But not "afg".
				///
				/// Note: the returned list of trigrams does not have duplicates, if any
				/// trigram belongs to more than one class it is only inserted once.
				std::vector<Token>
				generateTrigrams(const std::vector<std::string> &Segments);


				} // namespace dex
				} // namespace clangd
				} // namespace clang

				#endif
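
The four trigram classes documented in the header above can be sketched with the standard library alone. This is an illustrative approximation rather than the patch's code: `sketchTrigrams` is a hypothetical name, it returns plain strings instead of `Token` objects, and it assumes every segment is non-empty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical sketch: returns unique trigrams in first-seen order.
// Assumes that no segment is empty.
inline std::vector<std::string>
sketchTrigrams(const std::vector<std::string> &Segments) {
  std::unordered_set<std::string> Seen;
  std::vector<std::string> Trigrams;
  // Records a trigram only the first time it is produced.
  auto Add = [&](std::string Trigram) {
    if (Seen.insert(Trigram).second)
      Trigrams.push_back(std::move(Trigram));
  };
  const std::size_t N = Segments.size();

  // Front characters of three segments; at most one segment may be skipped
  // between two picks, hence the "+ 3" upper bounds.
  for (std::size_t I = 0; I < N; ++I)
    for (std::size_t J = I + 1; J < std::min(I + 3, N); ++J)
      for (std::size_t K = J + 1; K < std::min(J + 3, N); ++K)
        Add(std::string{Segments[I][0], Segments[J][0], Segments[K][0]});

  // Three consecutive characters inside a single segment.
  for (const std::string &Segment : Segments)
    for (std::size_t P = 0; P + 2 < Segment.size(); ++P)
      Add(Segment.substr(P, 3));

  // Pairs of segments that are adjacent or separated by one segment.
  for (std::size_t I = 0; I < N; ++I)
    for (std::size_t J = I + 1; J < std::min(I + 3, N); ++J) {
      const std::string &First = Segments[I];
      const std::string &Second = Segments[J];
      for (std::size_t P = 0; P < First.size(); ++P) {
        // Two consecutive characters of First plus the front of Second.
        if (P + 1 < First.size())
          Add(First.substr(P, 2) + Second.substr(0, 1));
        // One character of First plus the two front characters of Second.
        if (Second.size() > 1)
          Add(First.substr(P, 1) + Second.substr(0, 2));
      }
    }
  return Trigrams;
}
```

Keeping a set next to the output vector preserves a deterministic order while still guaranteeing that a trigram belonging to several classes is reported once.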

clang-tools-extra/clangd/index/dex/Trigram.cpp

This file was added.

				//===--- Trigram.cpp - Trigram generation for Fuzzy Matching ----- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				//
				//
				//===----------------------------------------------------------------------===//

				#include "Trigram.h"
				#include "Token.h"

				#include "llvm/ADT/DenseSet.h"

				#include <algorithm>
				#include <cctype>
				#include <string>
				#include <vector>

				using namespace llvm;

				namespace clang {
				namespace clangd {
				namespace dex {

				// FIXME(kbobyrev): Deal with short symbol names. A viable approach would
				// be generating unigrams and bigrams here, too. This would prevent symbol index
				// from applying fuzzy matching on a tremendous number of symbols and allow
				// supplementary retrieval for short queries.
				// Short names (< 3 characters) are currently ignored.
				std::vector<Token> generateTrigrams(const std::vector<std::string> &Segments) {
				llvm::DenseSet<Token> UniqueTrigrams;
				std::vector<Token> Trigrams;

				// Extract trigrams consisting of first characters of tokens sorted by token
				// positions. Trigram generator is allowed to skip 1 word between each token.
				//
				// Example: ["a", "b", "c", "d", "e"]
				//
				// would produce -> ["abc", "acd", "ace", ...] (among the others)
				//
				// but not -> ["ade"] because two tokens ("b" and "c") would be skipped in
				// this case.
				for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
				++FirstSegment) {
				for (auto SecondSegment = FirstSegment + 1;
				(SecondSegment <= FirstSegment + 2) &&
				(SecondSegment != Segments.end());
				++SecondSegment) {
				for (auto ThirdSegment = SecondSegment + 1;
				(ThirdSegment <= SecondSegment + 2) &&
				(ThirdSegment != Segments.end());
				++ThirdSegment) {
				// Combine the front characters of the three segments.
				Token Trigram(std::string{FirstSegment->front(), SecondSegment->front(),
				ThirdSegment->front()},
				Token::Kind::Trigram);
				if (!UniqueTrigrams.count(Trigram)) {
				UniqueTrigrams.insert(Trigram);
				Trigrams.push_back(Trigram);
				}
				}
				}
				}

				// Iterate through each token with a sliding window and extract trigrams
				// consisting of 3 consecutive characters.
				//
				// Example: "delete" -> ["del", "ele", "let", "ete"]
				for (const auto &Segment : Segments) {
				// Token should have at least three characters to have trigram substrings.
				if (Segment.size() < 3)
				continue;

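				for (size_t Position = 0; Position + 2 < Segment.size(); ++Position) {
				Token Trigram(Segment.substr(Position, 3), Token::Kind::Trigram);
				// Skip trigrams that have already been generated.
				if (UniqueTrigrams.insert(Trigram).second)
				Trigrams.push_back(Trigram);
				}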
				}

				// This loop generates both trigrams of the third and fourth classes. It
				// iterates through each two "subsequent" (consecutive or skip-1-next) tokens
				// and extracts trigrams out of each pair.
				for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
				++FirstSegment) {
				for (auto SecondSegment = FirstSegment + 1;
				(SecondSegment <= FirstSegment + 2) &&
				(SecondSegment != Segments.end());
				++SecondSegment) {
				for (size_t FirstSegmentIndex = 0;
				FirstSegmentIndex < FirstSegment->size(); ++FirstSegmentIndex) {
				// Extract trigrams of the third class: two consecutive characters of
				// the first token and the front character of the next or skip-1-next
				// token.
				if (FirstSegmentIndex + 1 < FirstSegment->size()) {
				Token Trigram((FirstSegment->substr(FirstSegmentIndex, 2) +
				SecondSegment->substr(0, 1)),
				Token::Kind::Trigram);
				if (!UniqueTrigrams.count(Trigram)) {
				UniqueTrigrams.insert(Trigram);
				Trigrams.push_back(Trigram);
				}
				}
				// Extract trigrams of the last class: one character of the first token
				// and the two front characters of the next or skip-1-next token.
				if (SecondSegment->size() > 1) {
				Token Trigram((FirstSegment->substr(FirstSegmentIndex, 1) +
				SecondSegment->substr(0, 2)),
				Token::Kind::Trigram);
				if (!UniqueTrigrams.count(Trigram)) {
				UniqueTrigrams.insert(Trigram);
				Trigrams.push_back(Trigram);
				}
				}
				}
				}
				}

				return Trigrams;
				}

				std::vector<std::string> segmentIdentifier(StringRef SymbolName) {
				std::vector<std::string> Segments;
				size_t SegmentStart = 0;
				// Skip underscores at the beginning, e.g. "__builtin_popcount".
				while (SegmentStart < SymbolName.size() && SymbolName[SegmentStart] == '_')
				++SegmentStart;

				for (size_t Index = SegmentStart; Index + 1 < SymbolName.size(); ++Index) {
				const char CurrentSymbol = SymbolName[Index];
				const char NextSymbol = SymbolName[Index + 1];
				// Skip sequences of underscores, e.g. "my__type".
				if (CurrentSymbol == '_' && NextSymbol == '_') {
				++SegmentStart;
				continue;
				}

				// Splits if the next symbol is underscore or if processed characters are
				// [lowercase, Uppercase] which indicates beginning of next token. Digits
				// are equivalent to lowercase symbols.
				if ((NextSymbol == '_') ||
				((islower(CurrentSymbol) || isdigit(CurrentSymbol)) &&
				isupper(NextSymbol))) {
				Segments.push_back(
				SymbolName.substr(SegmentStart, Index - SegmentStart + 1));
				SegmentStart = Index + 1;
				if (NextSymbol == '_')
				++SegmentStart;
				}

				// If there were N (> 1) consecutive uppercase letters the split should
				// generate two tokens, one of which would consist of N - 1 first uppercase
				// letters, the next token begins with the last uppercase letter.
				//
				// Example: "TUDecl" -> ["TU", "Decl"]
				if (isupper(CurrentSymbol) &&
				(islower(NextSymbol) || (isdigit(NextSymbol)))) {
				// Don't perform split if Index points to the beginning of new token,
				// otherwise "NamedDecl" would be split into ["N", "amed", "D", "ecl"]
				if (Index == SegmentStart)
				continue;
				Segments.push_back(SymbolName.substr(SegmentStart, Index - SegmentStart));
				SegmentStart = Index;
				}
				}

				if (SegmentStart < SymbolName.size())
				Segments.push_back(SymbolName.substr(SegmentStart));

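				// Apply lowercase text normalization.
				// NOTE: std::for_each discards the return value of ::tolower, so this loop
				// leaves the segments unchanged; std::transform would be needed to
				// lowercase them in place.
				for (auto &Segment : Segments)
				std::for_each(Segment.begin(), Segment.end(), ::tolower);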

				return Segments;
				}

				} // namespace dex
				} // namespace clangd
				} // namespace clang

clang-tools-extra/unittests/clangd/CMakeLists.txt

	Show All 9 Lines

	add_extra_unittest(ClangdTests			add_extra_unittest(ClangdTests
	Annotations.cpp			Annotations.cpp
	ClangdTests.cpp			ClangdTests.cpp
	ClangdUnitTests.cpp			ClangdUnitTests.cpp
	CodeCompleteTests.cpp			CodeCompleteTests.cpp
	CodeCompletionStringsTests.cpp			CodeCompletionStringsTests.cpp
	ContextTests.cpp			ContextTests.cpp
				DexIndexTests.cpp
	DraftStoreTests.cpp			DraftStoreTests.cpp
	FileIndexTests.cpp
	FileDistanceTests.cpp			FileDistanceTests.cpp
				FileIndexTests.cpp
	FindSymbolsTests.cpp			FindSymbolsTests.cpp
	FuzzyMatchTests.cpp			FuzzyMatchTests.cpp
	GlobalCompilationDatabaseTests.cpp			GlobalCompilationDatabaseTests.cpp
	HeadersTests.cpp			HeadersTests.cpp
	IndexTests.cpp			IndexTests.cpp
	QualityTests.cpp			QualityTests.cpp
	SourceCodeTests.cpp			SourceCodeTests.cpp
	SymbolCollectorTests.cpp			SymbolCollectorTests.cpp
	SyncAPI.cpp			SyncAPI.cpp
				TUSchedulerTests.cpp
	TestFS.cpp			TestFS.cpp
	TestTU.cpp			TestTU.cpp
	ThreadingTests.cpp			ThreadingTests.cpp
	TraceTests.cpp			TraceTests.cpp
	TUSchedulerTests.cpp
	URITests.cpp			URITests.cpp
	XRefsTests.cpp			XRefsTests.cpp
	)			)

	target_link_libraries(ClangdTests			target_link_libraries(ClangdTests
	PRIVATE			PRIVATE
	clangAST			clangAST
	clangBasic			clangBasic
	Show All 12 Lines

clang-tools-extra/unittests/clangd/DexIndexTests.cpp

This file was added.

				//===-- DexIndexTests.cpp ----------------------------- C++ ------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "index/dex/Token.h"
				#include "index/dex/Trigram.h"
				#include "gtest/gtest.h"

				#include <string>
				#include <vector>

				using std::string;
				using std::vector;

				namespace clang {
				namespace clangd {
				namespace dex {

				vector<Token> getTrigrams(std::initializer_list<string> Trigrams) {
				vector<Token> Result;
				for (const auto &Symbols : Trigrams) {
				Result.push_back(Token(Symbols, Token::Kind::Trigram));
				}
				return Result;
				}

				TEST(DexIndexTokens, TrigramSymbolNameTokenization) {
				EXPECT_EQ(segmentIdentifier("unique_ptr"), vector<string>({"unique", "ptr"}));

				EXPECT_EQ(segmentIdentifier("TUDecl"), vector<string>({"TU", "Decl"}));

				EXPECT_EQ(segmentIdentifier("table_name_"),
				vector<string>({"table", "name"}));

				EXPECT_EQ(segmentIdentifier("kDaysInAWeek"),
				vector<string>({"k", "Days", "In", "A", "Week"}));

				EXPECT_EQ(segmentIdentifier("AlternateUrlTableErrors"),
				vector<string>({"Alternate", "Url", "Table", "Errors"}));

				EXPECT_EQ(segmentIdentifier("IsOK"), vector<string>({"Is", "OK"}));

				EXPECT_EQ(segmentIdentifier("ABSL_FALLTHROUGH_INTENDED"),
				vector<string>({"ABSL", "FALLTHROUGH", "INTENDED"}));

				EXPECT_EQ(segmentIdentifier("SystemZ"), vector<string>({"System", "Z"}));

				EXPECT_EQ(segmentIdentifier("X86"), vector<string>({"X86"}));

				EXPECT_EQ(segmentIdentifier("ASTNodeKind"),
				vector<string>({"AST", "Node", "Kind"}));

				EXPECT_EQ(segmentIdentifier("ObjCDictionaryElement"),
				vector<string>({"Obj", "C", "Dictionary", "Element"}));

				EXPECT_EQ(segmentIdentifier("multiple__underscores___everywhere____"),
				vector<string>({"multiple", "underscores", "everywhere"}));

				EXPECT_EQ(segmentIdentifier("__cuda_builtin_threadIdx_t"),
				vector<string>({"cuda", "builtin", "thread", "Idx", "t"}));

				EXPECT_EQ(segmentIdentifier("longUPPERCASESequence"),
				vector<string>({"long", "UPPERCASE", "Sequence"}));
				}

				// FIXME(kbobyrev): Add a test for "ab_cd_ef_gh".
				TEST(DexIndexTrigrams, TrigramGeneration) {
				EXPECT_EQ(
				generateTrigrams(segmentIdentifier("a_b_c_d_e_")),
				getTrigrams({"abc", "abd", "acd", "ace", "bcd", "bce", "bde", "cde"}));

				EXPECT_EQ(generateTrigrams(segmentIdentifier("clangd")),
				getTrigrams({"cla", "lan", "ang", "ngd"}));

				EXPECT_EQ(generateTrigrams(segmentIdentifier("abc_def")),
				getTrigrams({"abc", "def", "abd", "ade", "bcd", "bde", "cde"}));

				EXPECT_EQ(generateTrigrams(segmentIdentifier("unique_ptr")),
				getTrigrams({"uni", "niq", "iqu", "que", "ptr", "unp", "upt", "nip",
				"npt", "iqp", "ipt", "qup", "qpt", "uep", "ept"}));

				EXPECT_EQ(generateTrigrams(segmentIdentifier("nl")), getTrigrams({}));
				}

				} // namespace dex
				} // namespace clangd
				} // namespace clang