This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Introduce Dex symbol index search tokens
ClosedPublic

Authored by kbobyrev on Jul 20 2018, 1:41 AM.

Details

Summary

This patch introduces the core building block of the next-generation Clangd symbol index - Dex. Search tokens are the keys in the inverted index and represent a characteristic of a specific symbol: examples of search token types (Token Namespaces) are

  • Trigrams - these are essential for unqualified symbol name fuzzy search
  • Scopes for filtering the symbols by the namespace
  • Paths, e.g. these can be used to uprank symbols defined close to the edited file

This patch outlines the generic for such token namespaces, but only implements trigram generation.

The intuition behind trigram generation algorithm is that each extracted trigram is a valid sequence for Fuzzy Matcher jumps, proposed implementation utilize existing FuzzyMatcher API for segmentation and trigram extraction.

However, trigrams generation algorithm for the query string is different from the previous one: it simply yields sequences of 3 consecutive lowercased valid characters (letters, digits).

Dex RFC in the mailing list: http://lists.llvm.org/pipermail/clangd-dev/2018-July/000022.html

The trigram generation techniques are described in detail in the proposal:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit#heading=h.903u1zon9nkj

Diff Detail

Repository
rL LLVM

Event Timeline

kbobyrev created this revision.Jul 20 2018, 1:41 AM
kbobyrev planned changes to this revision.EditedJul 20 2018, 1:42 AM

The upcoming changes:

  • Use segmentation API exposed in https://reviews.llvm.org/rL337527
  • Get rid of the pre-computed hash for now as discussed internally
  • Fix the bug with whole segments concatenation
sammccall added inline comments.Jul 20 2018, 3:34 AM
clang-tools-extra/clangd/index/dex/Token.cpp
25 ↗(On Diff #156444)

specializing the hash function looks like premature optimization (or possibly pessimisation)
One the one hand, it's a perfect hash!

On the other hand, when used with a power-of-two sized hashtable (particularly 256 is a fun example), all the trigrams are going to hash into the alphanumeric buckets, and other buckets will be empty!

DenseMap is a power-of-two hashtable!

30 ↗(On Diff #156444)

llvm::hash_combine(static_cast<int>(TokenKind),Data)

clang-tools-extra/clangd/index/dex/Token.h
10 ↗(On Diff #156444)

nit: swap these sentences around? conceptual explanation first, how they're used later?

could even give an example:

// The symbol std::cout might have the tokens:
// * Scope "std::"
// * Trigram "cou"
// * Trigram "out"
// * Type "std::ostream"
28 ↗(On Diff #156444)

nit: "Hashable" isn't really important here.
nit: it doesn't *represent* a search primitive, it *is* a search primitive.

Maybe

/// A Token represents an attribute of a symbol, such as a particular
/// trigram present in the name (used for fuzzy search).
33 ↗(On Diff #156444)

As discussed offline: think we misunderstood each other about the separate struct.
I rather meant *this* class could just be a struct.

36 ↗(On Diff #156444)

don't give examples of the enum values immediately before defining the enum values :-)
Just put the comments on the values themselves, and omit the ones that aren't here yet (or leave a TODO)

47 ↗(On Diff #156444)

move these to enum values

58 ↗(On Diff #156444)

nit: probably reverse the order here since Token(TRIGRAM, "str") reads more naturally than Token("str", TRIGRAM) - context is present to understand the data

60 ↗(On Diff #156444)

can we just use the standard hash_value() function? Why this accessor?

79 ↗(On Diff #156444)

as discussed offline: this precomputed hash looks like premature optimization to me.

clang-tools-extra/clangd/index/dex/Trigram.cpp
35 ↗(On Diff #156444)

just iterate at the end and collect them? order shouldn't matter here.

37 ↗(On Diff #156444)

ah, you're also handling this by cases here.

This doesn't seem necessary.
In fact I think there's a fairly simple way to express this that works naturally with the segmentation structure produced by fuzzymatch (so doesn't need a version of segmentIdentifier to wrap it).
But it's taking me too long to work out the details, will write a followup comment or discuss offline.

46 ↗(On Diff #156444)

please use short names like I, S1, etc for iterators and loop counters.

90 ↗(On Diff #156444)

There seems to be a lot of overlap between this one and the third loop.

clang-tools-extra/clangd/index/dex/Trigram.h
10 ↗(On Diff #156444)

This would be a great place to describe *how* trigrams are used by fuzzy matching (the idea that you generate trigrams T for the identifier, trigrams Q for the query, and require Q ⊆ T)

56 ↗(On Diff #156444)

Per the examples above, segmentIdentifier doesn't case-fold its output, but all the inputs shown here are lowercase.
Is the caller expected to do that in between calling here, or does generateTrigrams case-fold as part of generation?

(please doc)

58 ↗(On Diff #156444)

seem to have two summaries here, drop or merge with previous?

64 ↗(On Diff #156444)

This is true, but I think treating these as four cases doesn't give a clear understanding of how they're related.

What about:

Trigrams can start at any character in the input. Then we can choose to move to the next character, move to the start of the next segment, or skip over a segment.

For input [abstract factory producer impl]:
* the simplest trigrams are consecutive letters from one segment (abs, bst, ..., mpl)
* but may cross two segments (abf, bpr)
* or even three (api, ypi)
sammccall added inline comments.Jul 20 2018, 4:42 AM
clang-tools-extra/clangd/index/dex/Trigram.cpp
37 ↗(On Diff #156444)

The idea is to precompute the list of "legal next chars" after each char.
This can in fact be done in one backwards pass. Then actually computing the trigram set is very easy.

generateTrigrams(StringRef Text) {
// Segment the text.
vector<CharRole> Roles(Text.size());
calculateRoles(Text, makeMutableArrayRef(Roles));

// Calculate legal next characters after each character.
vector<unsigned[3]> Next(Text.size()); // 0 entries mean "nothing"
unsigned NextTail = 0, NextSeg = 0, NextNextSeg = 0;
for (int I = Text.size() - 1; i >= 0; --I) {
  Next[I] = {NextTail, NextSeg, NextNextSeg};
  NextTail = Roles[I] == Tail ? I : 0;
  if (Roles[I] == Head) {
    NextNextHead = NextHead;
    NextHead = I;
  }
}

// Compute trigrams. They can start at head or tail chars.
DenseSet<Token> Tokens.
for (int I = 0; I < Text.size(); ++I) {
  if (Roles[I] != Head && Roles[I] != Tail) continue;
  for (unsigned J : Next[I]) {
    if (!J) continue;
    for (unsigned K : Next[K]) {
      if (!K) continue;
      char Trigram[] = {Text[I], Text[J], Text[K]};
      Tokens.emplate(TRIGRAM, Trigram);
    }
  }
}

// Return deduplicated trigrams.
std::vector<Token> TokensV;
for (const auto& E : Tokens) TokensV.push_back(E.second);
return TokensV;
kbobyrev updated this revision to Diff 156772.Jul 23 2018, 7:06 AM
kbobyrev marked 19 inline comments as done.
kbobyrev added a reviewer: ilya-biryukov.
kbobyrev removed a subscriber: ilya-biryukov.

Address all comments from the last iteration.

Latest changes:

  • Trigram generation algorithm now uses clangd FuzzyMatching API using the technique proposed by @sammccall
  • Another piece of trigram generation algorithm on query-side was introduced
  • Hash is no longer stored for each token and is computed more effectively
  • Documentation is refined and should be more consistent now
kbobyrev updated this revision to Diff 156774.Jul 23 2018, 7:07 AM

Wrong diff, should be correct now.

Looks really nice! Only major issue is the query trigrams don't look right.
Otherwise, some style nits and fixes that seem to have gotten lost.

clang-tools-extra/clangd/index/dex/Token.cpp
28 ↗(On Diff #156774)

nit: inline these trivial accessors, so they can be inlined at the callsite

clang-tools-extra/clangd/index/dex/Token.h
33 ↗(On Diff #156444)

This one doesn't seem to be done - any reason to keep the class with accessors rather than just a struct for now?

clang-tools-extra/clangd/index/dex/Trigram.cpp
32 ↗(On Diff #156774)

nit: total segment length <3 characters?
(e.g. u_p is also ignored)

38 ↗(On Diff #156774)

nit: seems clearer to call tolower inline rather than mutating the input param - "identifier" doesn't really describe its new value

39 ↗(On Diff #156774)

nit: prefer llvm::toLower, which returns char

94 ↗(On Diff #156774)

(again, toLower and consider moving to Chars.push_back() call)

105 ↗(On Diff #156774)

i.e. you don't want this

35 ↗(On Diff #156444)

this was not done
(now also in generateQueryTrigrams)

clang-tools-extra/clangd/index/dex/Trigram.h
37 ↗(On Diff #156774)

nit: using the same logic as FuzzyMatch
(callers don't know the name calculateRoles, probably)

nit: lowercased
(downcast seems a bit confusing)

49 ↗(On Diff #156774)

nit: StringRef here and below?

56 ↗(On Diff #156774)

this appears to be too restrictive?

e.g. for TUDecl you require [tud, ude, dec, ecl]

I think you just want a sliding window of three successive head/tail characters.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
75 ↗(On Diff #156774)

seems like it should be [abc, bcd, cde, def]

kbobyrev updated this revision to Diff 157020.Jul 24 2018, 6:51 AM
kbobyrev marked 12 inline comments as done.

Address all comments, refine tests.

kbobyrev added inline comments.Jul 24 2018, 6:52 AM
clang-tools-extra/clangd/index/dex/Trigram.cpp
38 ↗(On Diff #156774)

As discussed offline, I think it might be more efficient to copy apply toLower once instead of doing that up to 8 times more.

kbobyrev marked an inline comment as done.Jul 24 2018, 6:52 AM
sammccall accepted this revision.Jul 24 2018, 7:13 AM

Thanks for the polish!
Last few nits, but feel free to land once you're happy.

clang-tools-extra/clangd/index/dex/Token.cpp
21 ↗(On Diff #157020)

nit: I'd suggest inlining these trivial functions and deleting the cpp file - one less file is always nice!

clang-tools-extra/clangd/index/dex/Token.h
51 ↗(On Diff #157020)

This sentence got a bit mangled.

90 ↗(On Diff #157020)

nit: just return {clang::clangd::dex::Token::Kind::Sentinel, "EmptyKey"};
you're making a copy here anyway.

clang-tools-extra/clangd/index/dex/Trigram.cpp
56 ↗(On Diff #157020)

(just for my curiosity - why are the double braces needed rather than single?)

101 ↗(On Diff #157020)

nit: LowercaseQuery = Query.lower();
(and above)

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
22 ↗(On Diff #157020)

you never call this function from tests, fold it together into Trigrams?

31 ↗(On Diff #157020)

nit: function needs to start with a lowercase letter

if you like: trigramsAre would fit with typical matcher factory style

This revision is now accepted and ready to land.Jul 24 2018, 7:13 AM
kbobyrev updated this revision to Diff 157196.Jul 24 2018, 11:12 PM
kbobyrev marked 7 inline comments as done.

Address last round of comments.

clang-tools-extra/clangd/index/dex/Trigram.cpp
56 ↗(On Diff #157020)

Discussed offline: that's probably a bug, probably worth looking into.

kbobyrev marked an inline comment as done.Jul 24 2018, 11:12 PM
kbobyrev retitled this revision from [clangd] Introduce search Tokens and trigram generation algorithms to [clangd] Introduce Dex symbol index search tokens.Jul 25 2018, 3:07 AM
kbobyrev edited the summary of this revision. (Show Details)
kbobyrev added a subscriber: klimek.
This revision was automatically updated to reflect the committed changes.