This is an archive of the discontinued LLVM Phabricator instance.


Differential D49417

[clangd] Implement trigram generation algorithm for new symbol index
Abandoned · Public

Authored by omtcyfz on Jul 17 2018, 1:55 AM.

Details

Reviewers: None

Summary

This patch introduces trigram generation algorithm for the symbol index proposed in a recent design document.
RFC in the mailing list: http://lists.llvm.org/pipermail/clangd-dev/2018-July/000022.html

The trigram generation algorithm is described in detail in the proposal:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit#heading=h.903u1zon9nkj

Event Timeline

omtcyfz created this revision. Jul 17 2018, 1:55 AM

Herald added subscribers: jkorous, MaskRay, mgorny. Jul 17 2018, 1:55 AM

Fix documentation indentation in SearchAtom description.

Wipe redundant FIXMEs.

omtcyfz updated this revision to Diff 155828. Jul 17 2018, 2:36 AM

Fix unittest file name in header.

Some high-level comments before jumping into details.

clang-tools-extra/clangd/index/noctem/SearchAtom.cpp
24	maybe also add how short symbol names are handled in the current algorithm and what the potential solution would be.
29	Any reason not to do this in segmentation?
clang-tools-extra/clangd/index/noctem/SearchAtom.h
28	Any reason to call this `Atom` instead of `Token`? `Token` seems to be a more commonly used name for this (e.g. https://yomguithereal.github.io/mnemonist/inverted-index#tokens).
46	`Namespace` can be easily confused with namespaces in clang world. Maybe `Kind` or `Type`?
52	What is an empty atom? Why do we need it?
53	Why default `Type` to `Trigram`?
53	Please add documentation. What is the semantic of `Data`?
54	Should we also incorporate `Type` into hash?
54	I'm wondering if we should use different hashes for different token types. For example, a trigram token "xyz" can be encoded in a 4-byte int with `('x'<<16) & ('y'<<8) & 'z'`, which seems cheaper than `std::hash`.
63	nit: s/getData/data/ s/getType/type/
77	nit: I'd avoid calling these `tokens`, as they can be confused with tokens for the inverted index. How about `segments`? Similarly, `tokenize` should probably be something like `segment` or `segmentIdentifier`.
142	`generateSearchAtoms` is too generalized a name for this. Maybe just `trigram()`? I also think this could take a list of segments from `tokenize`.
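
The integer encoding suggested in the comment on line 54 could look roughly like the sketch below. It is only an illustration: `packTrigram` is a hypothetical helper that does not appear in the patch, and `std::string` stands in for the LLVM string types. The shifted bytes have to be combined with bitwise OR; with `&`, as written in the comment, the non-overlapping bit ranges would always produce 0.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the patch): packs a three-character
// trigram into the low 24 bits of a 32-bit integer.
// Assumes Trigram.size() == 3.
inline std::uint32_t packTrigram(const std::string &Trigram) {
  // Go through unsigned char first so that a negative `char` value is not
  // sign-extended into the upper bits.
  const auto Byte = [&](std::size_t Index) {
    return static_cast<std::uint32_t>(
        static_cast<unsigned char>(Trigram[Index]));
  };
  // Bitwise OR merges the three disjoint 8-bit ranges.
  return (Byte(0) << 16) | (Byte(1) << 8) | Byte(2);
}
```

Because distinct trigrams map to distinct integers, such a value could serve as a collision-free hash for the trigram kind, which is what makes it attractive compared to hashing a string.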

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better name for the symbol index to substitute "Noctem", which doesn't point to any of the project's features.

In D49417#1166538, @omtcyfz wrote:

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better name for the symbol index to substitute "Noctem", which doesn't point to any of the project's features.

Ooh, a bikeshed!
Suggestion: dex - short, pronounceable, suggests index, and is a trigram :-)

Herald added a subscriber: arphaman. Jul 19 2018, 2:39 AM

ioeric added inline comments. Jul 19 2018, 2:58 AM

clang-tools-extra/clangd/index/noctem/SearchToken.cpp
38 ↗	(On Diff #156073)	Please also add what the current behavior is for short names. Do we just ignore them?
44 ↗	(On Diff #156073)	Could we replace `UniqueTrigrams`+`Trigrams` with a dense map from hash to token?
47 ↗	(On Diff #156073)	nit: a redundant "of" EOL.
57 ↗	(On Diff #156073)	nit: Maybe S1, S2, S3 instead of FirstSegment, ...?
68 ↗	(On Diff #156073)	This seems wrong... wouldn't this give you a concatenation of three segments?
68 ↗	(On Diff #156073)	For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can be reused) and std::move it into the constructor. Might be cheaper than creating a std::string
87 ↗	(On Diff #156073)	nit: `Position < Segment.size() - 2` seems more commonly used.
92 ↗	(On Diff #156073)	Comment for this loop?
135 ↗	(On Diff #156073)	use `trim`?
138 ↗	(On Diff #156073)	Maybe first split name on `_` and then run further upper-lower segmentation on the split segments?
clang-tools-extra/clangd/index/noctem/SearchToken.h
58 ↗	(On Diff #156073)	Any reason this has to be `operator()` instead of a `hash` method? `operator()` for hash value is not trivial on the call site.
69 ↗	(On Diff #156073)	Who's the user of this friend function? Could it just call `Token.hash()`?
80 ↗	(On Diff #156073)	nit: scope in clangd world is "foo::bar::baz::", but the global scope is "".
159 ↗	(On Diff #156073)	This seems to be the same as just `generateTrigrams(segmentIdentifier(Name))`, so I'd drop it.

(just .h files. +1 to eric's comments except where noted)

clang-tools-extra/clangd/index/noctem/SearchToken.h
2 ↗	(On Diff #156073)	nit: something went wrong here
28 ↗	(On Diff #156073)	nit: remove \brief
32 ↗	(On Diff #156073)	maybe just giving one example here, and moving the concrete semantics to Kind?
34 ↗	(On Diff #156073)	ITYM suffix; consider just "under directory".
35 ↗	(On Diff #156073)	namespace tokens don't have prefix/suffix semantics, they're exact
44 ↗	(On Diff #156073)	nit: this is a really common type, and it's namespaced. `Token` is probably fine.
46 ↗	(On Diff #156073)	doc: the fact that Kind is a namespace for data, and (on each enum value) the semantics
46 ↗	(On Diff #156073)	nit: drop `: short` as you don't seem to make use of it anywhere. Can add it later if we want to optimize in-memory structure (currently it has no effect)
48 ↗	(On Diff #156073)	in the first patch, probably just want trigram and scope (to ensure generality) and drop the rest
52 ↗	(On Diff #156073)	what is this for?
61 ↗	(On Diff #156073)	(nit: since you have the hashcode, comparing it before the data is probably faster?)
64 ↗	(On Diff #156073)	nit: first const is not useful
64 ↗	(On Diff #156073)	instead of these accessors, can we just make it a struct with const members and a constructor that ensures the hash is correctly set? That should reflect the lightweightness/lack of behavior in this object
69 ↗	(On Diff #156073)	This is the LLVM convention for hashing, see `ADT/Hashing.h`
76 ↗	(On Diff #156073)	I'd move these to the Kind enum - these shouldn't be examples but rather the canonical documentation for these
81 ↗	(On Diff #156073)	need to more fully spell this out: when is it full vs relative? is it a URI? Do directories have a trailing slash? But really, leave this out for now.
83 ↗	(On Diff #156073)	So SmallString always contains 3 pointers, i.e. 24 bytes. The very smallest SmallString that makes sense is <8>, as you'll end up padded anyway.
89 ↗	(On Diff #156073)	(remove leading \brief here and elsewhere)
89 ↗	(On Diff #156073)	Please move trigram-related stuff to `Trigrams.h`
115 ↗	(On Diff #156073)	Please don't implement a different set of segmentation rules to those that exist in FuzzyMatch. Happy to chat about a sensible API to expose and where it should live.
115 ↗	(On Diff #156073)	given your examples always return substrings, these should be stringrefs or offsets. But per above comment, we should rethink this function.
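
Two of the suggestions above (a plain struct with const members whose constructor guarantees a correct hash, and comparing the hash before the data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from any revision of the patch: `TokenKind`, `Token`, and `computeHash` are hypothetical names, `std::string` and `std::hash` stand in for the LLVM types, and the hash mixing is an arbitrary choice.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical kinds; the review suggests starting with trigram and scope.
enum class TokenKind { Trigram, Scope };

// A lightweight value type: all members are const and the hash is computed
// exactly once, in the constructor.
struct Token {
  Token(TokenKind K, std::string D)
      : Kind(K), Data(std::move(D)), Hash(computeHash(Kind, Data)) {}

  // Comparing the precomputed hashes first rejects most unequal tokens
  // without touching the string data.
  bool operator==(const Token &Other) const {
    return Hash == Other.Hash && Kind == Other.Kind && Data == Other.Data;
  }

  const TokenKind Kind;
  const std::string Data;
  const std::size_t Hash;

private:
  // Mixes the kind into the hash of the data (arbitrary mixing scheme).
  static std::size_t computeHash(TokenKind K, const std::string &D) {
    const std::size_t Seed = std::hash<std::string>{}(D);
    return Seed ^ (std::hash<int>{}(static_cast<int>(K)) + 0x9e3779b9u +
                   (Seed << 6) + (Seed >> 2));
  }
};
```

One trade-off of const members is that the type is no longer assignable, which matters if tokens are stored in containers that reassign elements.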

sammccall mentioned this in D49540: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 19 2018, 5:28 AM

omtcyfz added a parent revision: D49540: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 20 2018, 12:39 AM

sammccall mentioned this in rL337527: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 20 2018, 1:06 AM

sammccall mentioned this in rCTE337527: [clangd] FuzzyMatch exposes an API for its word segmentation. NFC. Jul 20 2018, 1:09 AM

Addressed most comments (aside from reusing fuzzy matching segmentation routine and making data + hash a separate structure).

Since I already submitted my next revision (https://reviews.llvm.org/D49546) through another account with a more verbose and less confusing username, I will remove everyone from this revision and recreate it under the latter account.

clang-tools-extra/clangd/index/noctem/SearchAtom.h
54	Discussed internally: we probably shouldn't since there will be collisions even inside a single token namespace when Data will be too long.
clang-tools-extra/clangd/index/noctem/SearchToken.cpp
44 ↗	(On Diff #156073)	I'm not sure how to iterate through the result then. Basically, having Trigrams ensures that these trigrams can be iterated after the function execution and inserted into the inverted index. Otherwise I should either expose a callback to add the generated trigrams or do something similar. Should I do that instead?
57 ↗	(On Diff #156073)	I think it's way less explicit and slightly more confusing. I try not to have one-letter names so I guess it's better to have full names instead.
68 ↗	(On Diff #156073)	But `std::string` would be created anyway, wouldn't it be?
87 ↗	(On Diff #156073)	If `Segment.size()` is 1 then this would be `1U - 2U`, which would be something very large.
138 ↗	(On Diff #156073)	Resolving both of these for now (though they're not really fixed) since I will rethink the algorithm given Sam's patch.
clang-tools-extra/clangd/index/noctem/SearchToken.h
52 ↗	(On Diff #156073)	Tags, for example. But yes, for now it's not very clear and probably not needed. Dropped this one.
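
The reply on line 87 about `1U - 2U` can be illustrated with a small sketch. It assumes `std::string` in place of the LLVM string types, and `slidingTrigrams` is a hypothetical name used only for this example. Since `size()` is unsigned, `Segment.size() - 2` wraps around to a very large value for segments shorter than two characters, whereas `Position + 2 < Segment.size()` stays correct for every length.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns all substrings of three consecutive
// characters, or an empty list for segments shorter than three characters.
inline std::vector<std::string> slidingTrigrams(const std::string &Segment) {
  std::vector<std::string> Result;
  // Adding on the left-hand side avoids unsigned wraparound; the condition
  // `Position < Segment.size() - 2` would be true for a one-character
  // segment because `1 - 2` wraps to the maximum value of std::size_t.
  for (std::size_t Position = 0; Position + 2 < Segment.size(); ++Position)
    Result.push_back(Segment.substr(Position, 3));
  return Result;
}
```

With this loop condition, a separate `size() < 3` check before the loop is not needed for correctness, though it can still document intent.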

omtcyfz removed a reviewer: ioeric. Jul 20 2018, 1:39 AM

omtcyfz removed a project: Restricted Project.

omtcyfz removed subscribers: arphaman, mgorny, MaskRay and 4 others.

Herald added a subscriber: ioeric. Jul 20 2018, 1:39 AM

Abandon in favor of https://reviews.llvm.org/D49591.

Revision Contents

Path

Size

clang-tools-extra/

clangd/

CMakeLists.txt

3 lines

index/

noctem/

SearchAtom.h

175 lines

SearchAtom.cpp

162 lines

unittests/

clangd/

CMakeLists.txt

1 line

NoctemIndexTests.cpp

93 lines

Diff 155827

clang-tools-extra/clangd/CMakeLists.txt

(28 unchanged lines hidden)
add_clang_library(clangDaemon
  ProtocolHandlers.cpp
  Quality.cpp
  SourceCode.cpp
  Threading.cpp
  Trace.cpp
  TUScheduler.cpp
  URI.cpp
  XRefs.cpp

  index/CanonicalIncludes.cpp
  index/FileIndex.cpp
  index/Index.cpp
  index/MemIndex.cpp
  index/Merge.cpp
  index/SymbolCollector.cpp
  index/SymbolYAML.cpp

+ index/noctem/SearchAtom.cpp

  LINK_LIBS
  clangAST
  clangASTMatchers
  clangBasic
  clangDriver
  clangFormat
  clangFrontend
  clangIndex
(16 unchanged lines hidden)

clang-tools-extra/clangd/index/noctem/SearchAtom.h

This file was added.

				//===--- SearchAtom.h- Symbol Search primitive ------------------- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// SearchAtoms are keys for inverted index which are mapped to the corresponding
				// posting lists. SearchAtom objects represent a characteristic of a symbol,
				// which can be used to perform efficient search.
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_NOCTEM_TRIGRAM_H
				#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_NOCTEM_TRIGRAM_H

				#include "llvm/ADT/DenseMap.h"
				#include "llvm/ADT/SmallString.h"
				#include <vector>

				namespace clang {
				namespace clangd {
				namespace noctem {

				/// \brief Hashable SearchAtom, which represents a search token primitive.
				///
				/// The following items are examples of search atoms:
				///
				/// * Trigram for fuzzy search on unqualified symbol names.
				/// * Proximity path primitives, e.g. "symbol is defined in directory
				/// $HOME/dev/llvm or its prefix".
				/// * Scope primitives, e.g. "symbol belongs to namespace foo::bar or its
				/// prefix".
				/// * If the symbol represents a variable, token can be its type such as int,
				/// clang::Decl, …
				/// * For a symbol representing a function, this can be the
				/// return type.
				///
				/// Tokens can be used to perform more sophisticated search queries by
				/// constructing complex iterator trees.
				class SearchAtom {
				public:
				enum class Namespace : short {
				Trigram,
				Scope,
				Path,
				};

				SearchAtom() = default;
				SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
				: Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}

				// Returns precomputed hash.
				size_t operator()(const SearchAtom &T) const { return Hash; }

				bool operator==(const SearchAtom &Other) const {
				return Type == Other.Type && Data == Other.Data;
				}

				const llvm::StringRef getData() const { return Data; }

				const Namespace &getType() const { return Type; }

				private:
				friend llvm::hash_code hash_value(const SearchAtom &Atom) {
				return Atom.Hash;
				}

				llvm::SmallString<3> Data;
				size_t Hash;
				Namespace Type;
				};

				/// \brief Splits unqualified symbol name into tokens for trigram generation.
				///
				/// First stage of trigram generation algorithm. Given an unqualified symbol
				/// name, this outputs a sequence of string tokens using the following rules:
				///
				/// * '_' is a separator. Multiple consecutive underscores are treated as a
				/// single separator. Underscores at the beginning and the end of the symbol
				/// name are skipped.
				///
				/// Examples: "unique_ptr" -> ["unique", "ptr"],
				/// "__builtin_popcount" -> ["builtin", "popcount"]
				/// "snake____case___" -> ["snake", "case"]
				///
				/// * Lowercase letter followed by an uppercase letter is a separator.
				///
				/// Examples: "kItemsCount" -> ["k", "Items", "Count"]
				///
				/// * Sequences of consecutive uppercase letters followed by a lowercase letter:
				/// the last uppercase letter is treated as the beginning of a next token.
				///
				/// Examples: "TUDecl" -> ["TU", "Decl"]
				/// "kDaysInAWeek" -> ["k", "Days", "In", "A", "Week"]
				///
				/// Note: digits are treated as lowercase letters. Example: "X86" -> ["X86"]
				std::vector<llvm::SmallString<10>> tokenize(llvm::StringRef SymbolName);

				// TODO(kbobyrev): Do a better job at documenting this one.
				/// \brief Returns list of unique fuzzy-search trigrams from unqualified symbol.
				///
				/// Combines all stages of trigram generation for fuzzy-search index.
				///
				/// 0. Splits SymbolName into tokens by applying tokenize()
				/// 1. Casts all letters to lowercase.
				/// 2. Generates trigrams.
				///
				/// The motivation for trigram generation algorithm is that extracted trigrams
				/// are 3-char suffixes of paths through the fuzzy matching automaton. There are
				/// four classes of extracted trigrams:
				///
				/// * The simplest one consists of consecutive 3-char sequences of each token.
				///
				/// Example: "trigram" -> ["tri", "rig", "igr", "gra", "ram"]
				///
				/// * Next class consists of front character of subsequent tokens.
				///
				/// Example: ["translation", "unit", "decl"] -> ["tud"]
				///
				/// Note: skipping tokens is allowed, but not more than one. For example,
				/// given ["a", "b", "c", "d", "e"] -> "ace" is allowed, but "ade" is not.
				///
				/// * Another class of trigrams consists of those with 2 characters in one token
				/// and the front character of subsequent token (just as before, skipping up
				/// to one token is allowed).
				///
				/// Example: ["ab", "c", "d", "e"] -> ["abc", "abd"]
				/// Note: similarly to the previous case, "abe" would not be allowed.
				///
				/// * The last class of trigrams is similar to the previous one: it takes one
				/// character from one token and two front characters from the next or
				/// skip-1-next tokens.
				///
				/// Example: ["a", "bc", "de", "fg"] -> ["abc", "ade"]
				/// But not "afg".
				///
				/// Note: the returned list of trigrams does not have duplicates, if any
				/// trigram belongs to more than one class it is only inserted once.
				std::vector<SearchAtom> generateSearchAtoms(llvm::StringRef SymbolName);

				} // namespace noctem
				} // namespace clangd
				} // namespace clang

				namespace llvm {

				// Support SearchAtoms as DenseMap keys.
				template <> struct DenseMapInfo<clang::clangd::noctem::SearchAtom> {

				static inline clang::clangd::noctem::SearchAtom getEmptyKey() {
				static clang::clangd::noctem::SearchAtom EmptyKey("EMPTYKEY");
				return EmptyKey;
				}

				static inline clang::clangd::noctem::SearchAtom getTombstoneKey() {
				static clang::clangd::noctem::SearchAtom TombstoneKey("TOMBSTONE_KEY");
				return TombstoneKey;
				}

				static unsigned getHashValue(const clang::clangd::noctem::SearchAtom &Tag) {
				return hash_value(Tag);
				}

				static bool isEqual(const clang::clangd::noctem::SearchAtom &LHS,
				const clang::clangd::noctem::SearchAtom &RHS) {
				return LHS == RHS;
				}
				};

				} // namespace llvm

				#endif
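
The segmentation rules documented for `tokenize` in the header above can be sketched as a single pass over the identifier. This is an illustration only: `segmentIdentifier` is a hypothetical standalone function using `std::string` rather than the LLVM types, and it is neither the patch's implementation nor the FuzzyMatch segmentation that the reviewers asked to reuse.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the documented rules:
//  * '_' separates segments and never appears in the output;
//  * a lowercase letter or digit followed by an uppercase letter splits;
//  * in a run of uppercase letters followed by a lowercase letter or digit,
//    the last uppercase letter starts the next segment.
inline std::vector<std::string> segmentIdentifier(const std::string &Name) {
  const auto IsUpper = [](char C) {
    return std::isupper(static_cast<unsigned char>(C)) != 0;
  };
  const auto IsLowerOrDigit = [](char C) {
    const unsigned char U = static_cast<unsigned char>(C);
    return std::islower(U) != 0 || std::isdigit(U) != 0;
  };

  std::vector<std::string> Segments;
  std::string Current;
  // Emits the segment collected so far, if any.
  const auto Flush = [&] {
    if (!Current.empty()) {
      Segments.push_back(Current);
      Current.clear();
    }
  };

  for (std::size_t Index = 0; Index < Name.size(); ++Index) {
    const char C = Name[Index];
    if (C == '_') { // Consecutive, leading and trailing underscores collapse.
      Flush();
      continue;
    }
    if (!Current.empty()) {
      const char Previous = Current.back();
      const bool LowerToUpper = IsLowerOrDigit(Previous) && IsUpper(C);
      const bool LastOfUpperRun = IsUpper(Previous) && IsUpper(C) &&
                                  Index + 1 < Name.size() &&
                                  IsLowerOrDigit(Name[Index + 1]);
      if (LowerToUpper || LastOfUpperRun)
        Flush();
    }
    Current.push_back(C);
  }
  Flush();
  return Segments;
}
```

Under these rules `segmentIdentifier("kDaysInAWeek")` yields `k`, `Days`, `In`, `A`, `Week`, and `X86` stays a single segment because the digit follows the first character of the segment.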

clang-tools-extra/clangd/index/noctem/SearchAtom.cpp

This file was added.

				//===--- SearchAtom.cpp- Symbol Search primitive ----------------- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "SearchAtom.h"
				#include "llvm/ADT/DenseSet.h"
				#include "llvm/ADT/Twine.h"

				#include <cctype>
				#include <string>

				using namespace llvm;

				namespace clang {
				namespace clangd {
				namespace noctem {

				// FIXME(kbobyrev): Deal with short symbol names.
				std::vector<SearchAtom> generateSearchAtoms(StringRef SymbolName) {
				auto Tokens = tokenize(SymbolName);

				// Apply lowercase text normalization.
				for (auto &Token : Tokens)
				std::for_each(Token.begin(), Token.end(), ::tolower);

				llvm::DenseSet<SearchAtom> UniqueTrigrams;
				std::vector<SearchAtom> Trigrams;

				// Extract trigrams consisting of first characters of tokens sorted by of
				// token positions. Trigram generator is allowed to skip 1 word between each
				// token.
				//
				// Example: ["a", "b", "c", "d", "e"]
				//
				// would produce -> ["abc", "acd", "ace", ...] (among the others)
				//
				// but not -> ["ade"] because two tokens ("b" and "c") would be skipped in
				// this case.
				for (auto FirstToken = Tokens.begin(); FirstToken != Tokens.end();
				++FirstToken) {
				for (auto SecondToken = FirstToken + 1;
				(SecondToken <= FirstToken + 2) && (SecondToken != Tokens.end());
				++SecondToken) {
				for (auto ThirdToken = SecondToken + 1;
				(ThirdToken <= SecondToken + 2) && (ThirdToken != Tokens.end());
				++ThirdToken) {
				SearchAtom Trigram((*FirstToken + *SecondToken + *ThirdToken).str());
				if (!UniqueTrigrams.count(Trigram)) {
				UniqueTrigrams.insert(Trigram);
				Trigrams.push_back(Trigram);
				}
				}
				}
				}

				// Iterate through each token with a sliding window and extract trigrams
				// consisting of 3 consecutive characters.
				//
				// Example: "delete" -> ["del", "ele", "let", "ete"]
				for (const auto &Token : Tokens) {
				// Token should have at least three characters to have trigram substrings.
				if (Token.size() < 3)
				continue;

				for (size_t Position = 0; Position + 2 < Token.size(); ++Position)
				Trigrams.push_back(SearchAtom(Token.substr(Position, 3)));
				}

				for (auto FirstToken = Tokens.begin(); FirstToken != Tokens.end();
				++FirstToken) {
				for (auto SecondToken = FirstToken + 1;
				(SecondToken <= FirstToken + 2) && (SecondToken != Tokens.end());
				++SecondToken) {
				for (size_t FirstTokenIndex = 0; FirstTokenIndex < FirstToken->size();
				++FirstTokenIndex) {
				// Extract trigrams of the third class: two characters from the first
				// token and the front character of the next or skip-1-next token.
				if (FirstTokenIndex + 1 < FirstToken->size()) {
				SearchAtom Trigram((FirstToken->substr(FirstTokenIndex, 2) +
				SecondToken->substr(0, 1))
				.str());
				if (!UniqueTrigrams.count(Trigram)) {
				UniqueTrigrams.insert(Trigram);
				Trigrams.push_back(Trigram);
				}
				}
				// Extract trigrams of the last class: one character from the first
				// token and two front characters of the next or skip-1-next token.
				if (SecondToken->size() > 1) {
				SearchAtom Trigram((FirstToken->substr(FirstTokenIndex, 1) +
				SecondToken->substr(0, 2))
				.str());
				if (!UniqueTrigrams.count(Trigram)) {
				UniqueTrigrams.insert(Trigram);
				Trigrams.push_back(Trigram);
				}
				}
				}
				}
				}

				return Trigrams;
				}

				std::vector<SmallString<10>> tokenize(StringRef SymbolName) {
				std::vector<SmallString<10>> Tokens;
				size_t TokenStart = 0;
				// Skip underscores at the beginning, e.g. "__builtin_popcount".
				while (SymbolName[TokenStart] == '_')
				++TokenStart;

				for (size_t Index = TokenStart; Index + 1 < SymbolName.size(); ++Index) {
				const char CurrentSymbol = SymbolName[Index];
				const char NextSymbol = SymbolName[Index + 1];
				// Skip sequences of underscores, e.g. "my__type".
				if (CurrentSymbol == '_' && NextSymbol == '_') {
				++TokenStart;
				continue;
				}

				// Splits if the next symbol is underscore or if processed characters are
				// [lowercase, Uppercase] which indicates beginning of next token. Digits
				// are equivalent to lowercase symbols.
				if ((NextSymbol == '_') ||
				((islower(CurrentSymbol) || isdigit(CurrentSymbol)) &&
				isupper(NextSymbol))) {
				Tokens.push_back(SymbolName.substr(TokenStart, Index - TokenStart + 1));
				TokenStart = Index + 1;
				if (NextSymbol == '_')
				++TokenStart;
				}

				// If there were N (> 1) consecutive uppercase letter the split should
				// generate two tokens, one of which would consist of N - 1 first uppercase
				// letters, the next token begins with the last uppercase letter.
				//
				// Example: "TUDecl" -> ["TU", "Decl"]
				if (isupper(CurrentSymbol) &&
				(islower(NextSymbol) || (isdigit(NextSymbol)))) {
				// Don't perform split if Index points to the beginning of new token,
				// otherwise "NamedDecl" would be split into ["N", "amed", "D", "ecl"]
				if (Index == TokenStart)
				continue;
				Tokens.push_back(SymbolName.substr(TokenStart, Index - TokenStart));
				TokenStart = Index;
				}
				}

				if (TokenStart < SymbolName.size())
				Tokens.push_back(SymbolName.substr(TokenStart));

				return Tokens;
				}

				} // namespace noctem
				} // namespace clangd
				} // namespace clang
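
Two details of the listing above are worth a small sketch. `std::for_each` discards the return value of the callable, so passing `::tolower` to it leaves the characters unchanged, and the sliding-window loop pushes trigrams without consulting `UniqueTrigrams`. A minimal standalone version that lowercases with `std::transform` and deduplicates while preserving first-seen order might look like this. `toLower` and `uniqueLowercaseTrigrams` are hypothetical names, and standard containers stand in for the LLVM ones.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Returns a lowercase copy. std::transform writes the converted characters
// back, unlike std::for_each, which ignores the callable's result.
inline std::string toLower(std::string Text) {
  std::transform(Text.begin(), Text.end(), Text.begin(), [](unsigned char C) {
    return static_cast<char>(std::tolower(C));
  });
  return Text;
}

// Collects the 3-character substrings of every segment, lowercased and
// without duplicates, in the order in which they are first encountered.
inline std::vector<std::string>
uniqueLowercaseTrigrams(const std::vector<std::string> &Segments) {
  std::unordered_set<std::string> Seen;
  std::vector<std::string> Result;
  for (const std::string &Raw : Segments) {
    const std::string Segment = toLower(Raw);
    for (std::size_t Position = 0; Position + 2 < Segment.size(); ++Position) {
      std::string Trigram = Segment.substr(Position, 3);
      // insert() reports whether the trigram was new, so one lookup suffices.
      if (Seen.insert(Trigram).second)
        Result.push_back(std::move(Trigram));
    }
  }
  return Result;
}
```

Using the result of `insert` avoids the separate `count` followed by `insert` pattern, which performs two hash lookups per trigram.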

clang-tools-extra/unittests/clangd/CMakeLists.txt

(17 unchanged lines hidden)
add_extra_unittest(ClangdTests
  DraftStoreTests.cpp
  FileIndexTests.cpp
  FileDistanceTests.cpp
  FindSymbolsTests.cpp
  FuzzyMatchTests.cpp
  GlobalCompilationDatabaseTests.cpp
  HeadersTests.cpp
  IndexTests.cpp
+ NoctemIndexTests.cpp
  QualityTests.cpp
  SourceCodeTests.cpp
  SymbolCollectorTests.cpp
  SyncAPI.cpp
  TestFS.cpp
  TestTU.cpp
  ThreadingTests.cpp
  TraceTests.cpp
(21 unchanged lines hidden)

clang-tools-extra/unittests/clangd/NoctemIndexTests.cpp

This file was added.

				//===-- IndexTests.cpp -------------------------------- C++ ------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "index/noctem/SearchAtom.h"
				#include "llvm/ADT/SmallString.h"
				#include "gtest/gtest.h"

				namespace clang {
				namespace clangd {
				namespace noctem {

				std::vector<llvm::SmallString<10>>
				toSmallStrings(const std::vector<std::string> Strings) {
				std::vector<llvm::SmallString<10>> Result(Strings.size());
				for (size_t Index = 0; Index < Strings.size(); ++Index) {
				Result[Index] = Strings[Index];
				}
				return Result;
				}

				std::vector<SearchAtom>
				getTrigrams(std::initializer_list<std::string> Trigrams) {
				std::vector<SearchAtom> Result;
				for (const auto &Symbols : Trigrams) {
				Result.push_back(SearchAtom(Symbols));
				}
				return Result;
				}

				TEST(NoctemIndexTokens, TrigramSymbolNameTokenization) {
				EXPECT_EQ(tokenize("unique_ptr"), toSmallStrings({"unique", "ptr"}));

				EXPECT_EQ(tokenize("TUDecl"), toSmallStrings({"TU", "Decl"}));

				EXPECT_EQ(tokenize("table_name_"), toSmallStrings({"table", "name"}));

				EXPECT_EQ(tokenize("kDaysInAWeek"),
				toSmallStrings({"k", "Days", "In", "A", "Week"}));

				EXPECT_EQ(tokenize("AlternateUrlTableErrors"),
				toSmallStrings({"Alternate", "Url", "Table", "Errors"}));

				EXPECT_EQ(tokenize("IsOK"), toSmallStrings({"Is", "OK"}));

				EXPECT_EQ(tokenize("ABSL_FALLTHROUGH_INTENDED"),
				toSmallStrings({"ABSL", "FALLTHROUGH", "INTENDED"}));

				EXPECT_EQ(tokenize("SystemZ"), toSmallStrings({"System", "Z"}));

				EXPECT_EQ(tokenize("X86"), toSmallStrings({"X86"}));

				EXPECT_EQ(tokenize("ASTNodeKind"), toSmallStrings({"AST", "Node", "Kind"}));

				EXPECT_EQ(tokenize("ObjCDictionaryElement"),
				toSmallStrings({"Obj", "C", "Dictionary", "Element"}));

				EXPECT_EQ(tokenize("multiple__underscores___everywhere____"),
				toSmallStrings({"multiple", "underscores", "everywhere"}));

				EXPECT_EQ(tokenize("__cuda_builtin_threadIdx_t"),
				toSmallStrings({"cuda", "builtin", "thread", "Idx", "t"}));

				EXPECT_EQ(tokenize("longUPPERCASESequence"),
				toSmallStrings({"long", "UPPERCASE", "Sequence"}));
				}

				TEST(NoctemIndexTrigrams, TrigramGeneration) {
				EXPECT_EQ(
				generateSearchAtoms("a_b_c_d_e_"),
				getTrigrams({"abc", "abd", "acd", "ace", "bcd", "bce", "bde", "cde"}));

				EXPECT_EQ(generateSearchAtoms("clangd"),
				getTrigrams({"cla", "lan", "ang", "ngd"}));

				EXPECT_EQ(generateSearchAtoms("abc_def"),
				getTrigrams({"abc", "def", "abd", "ade", "bcd", "bde", "cde"}));

				EXPECT_EQ(generateSearchAtoms("unique_ptr"),
				getTrigrams({"uni", "niq", "iqu", "que", "ptr", "unp", "upt", "nip",
				"npt", "iqp", "ipt", "qup", "qpt", "uep", "ept"}));

				EXPECT_EQ(generateSearchAtoms("nl"), getTrigrams({}));
				}

				} // namespace noctem
				} // namespace clangd
				} // namespace clang