This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Implement trigram generation algorithm for new symbol index
AbandonedPublic

Authored by omtcyfz on Jul 17 2018, 1:55 AM.

Details

Reviewers
None
Summary

This patch introduces trigram generation algorithm for the symbol index proposed in a recent design document.
RFC in the mailing list: http://lists.llvm.org/pipermail/clangd-dev/2018-July/000022.html

The trigram generation algorithm is described in detail in the proposal:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit#heading=h.903u1zon9nkj

Diff Detail

Event Timeline

omtcyfz created this revision.Jul 17 2018, 1:55 AM
omtcyfz updated this revision to Diff 155824.Jul 17 2018, 2:00 AM

Fix documentation indentation in SearchAtom description.

omtcyfz updated this revision to Diff 155827.Jul 17 2018, 2:36 AM

Wipe redundant FIXMEs.

omtcyfz updated this revision to Diff 155828.Jul 17 2018, 2:36 AM
omtcyfz updated this revision to Diff 155830.Jul 17 2018, 2:54 AM

Fix unittest file name in header.

Some high-level comments before jumping into details.

clang-tools-extra/clangd/index/noctem/SearchAtom.cpp
23 ↗(On Diff #155830)

maybe also add how short symbol names are handled in the current algorithm and what the potential solution would be.

28 ↗(On Diff #155830)

Any reason not to do this in segmentation?

clang-tools-extra/clangd/index/noctem/SearchAtom.h
27 ↗(On Diff #155830)

Any reason to call this Atom instead of Token? Token seems to be a more commonly used name for this (e.g. https://yomguithereal.github.io/mnemonist/inverted-index#tokens).

45 ↗(On Diff #155830)

Namespace can be easily confused with namespaces in clang world. Maybe Kind or Type?

51 ↗(On Diff #155830)

What is an empty atom? Why do we need it?

52 ↗(On Diff #155830)

Why default Type to Trigram?

52 ↗(On Diff #155830)

Please add documentation. What is the semantic of Data?

53 ↗(On Diff #155830)

Should we also incorporate Type into hash?

53 ↗(On Diff #155830)

I'm wondering if we should use different hashes for different token types. For example, a trigram token "xyz" can be encoded in a 4 -byte int with ('x'<<16) & ('y'<<8) & 'z', which seems cheaper than std::hash.

62 ↗(On Diff #155830)

nit:
s/getData/data/
s/getType/type/

76 ↗(On Diff #155830)

nit: I'd avoid calling these tokens, as it they can be confused with tokens for the invert index. How about segments? Similarly, tokenize should probably be something like segment or segmentIdentifier.

141 ↗(On Diff #155830)

generateSearchAtoms is too generalized a name for this. Maybe just trigram()? I also think this could take a list of segments from tokenize.

omtcyfz updated this revision to Diff 156073.EditedJul 18 2018, 8:02 AM
omtcyfz marked 11 inline comments as done.

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better for the symbol index to substitute "Noctem" which doesn't point to any project's feature.

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better for the symbol index to substitute "Noctem" which doesn't point to any project's feature.

Ooh, a bikeshed!
Suggestion: dex - short, pronouncable, suggests index, and is a trigram :-)

ioeric added inline comments.Jul 19 2018, 2:58 AM
clang-tools-extra/clangd/index/noctem/SearchToken.cpp
38 ↗(On Diff #156073)

Please also add what the current behavior is for short names. Do we just ignore them?

44 ↗(On Diff #156073)

Could we replace UniqaueTrigrams+Trigrams with a dense map from hash to token?

47 ↗(On Diff #156073)

nit: a redundant "of" EOL.

57 ↗(On Diff #156073)

nit: Maybe S1, S2, S3 instead of FirstSegment, ...?

68 ↗(On Diff #156073)

This seems wrong... wouldn't this give you a concatenation of three segments?

68 ↗(On Diff #156073)

For trigrams, it might make sense to put 3 chars into a SmallVector<3> (can be reused) and std::move it into the constructor. Might be cheaper than creating a std::string

87 ↗(On Diff #156073)

nit: Position < Segment.size() - 2 seems more commonly used.

92 ↗(On Diff #156073)

Comment for this loop?

135 ↗(On Diff #156073)

use trim?

138 ↗(On Diff #156073)

Maybe first split name on _ and them run further upper-lower segmentation on the split segments?

clang-tools-extra/clangd/index/noctem/SearchToken.h
58 ↗(On Diff #156073)

Any reason this has to be operator() instead of a hash method? operator() for hash value is not trivial on the call site.

69 ↗(On Diff #156073)

Who's the user of this friend function? Could it just call Token.hash()?

80 ↗(On Diff #156073)

nit: scope in clangd world is "foo::bar::baz::", but the global scope is "".

159 ↗(On Diff #156073)

This seems to be the same as just generateTrigrams(segmentIdentifier(Name)), so I'd drop it.

(just .h files. +1 to eric's comments except where noted)

clang-tools-extra/clangd/index/noctem/SearchToken.h
2 ↗(On Diff #156073)

nit: something went wrong here

28 ↗(On Diff #156073)

nit: remove \brief

32 ↗(On Diff #156073)

maybe just giving one example here, and moving the concrete semantics to Kind?

34 ↗(On Diff #156073)

ITYM suffix
consider just "under directory"

35 ↗(On Diff #156073)

namespace tokens don't have prefix/suffix semantics, they're exact

44 ↗(On Diff #156073)

nit: this is a really common type, and it's namespaced. Token is probably fine.

46 ↗(On Diff #156073)

doc: the fact that Kind is a namespace for data, and (on each enum value) the semantics

46 ↗(On Diff #156073)

nit: drop : short as you don't seem to make use of it anywhere. Can add it later if we want to optimize in-memory structure (currently it has no effect)

48 ↗(On Diff #156073)

in the first patch, probably just want trigram and scope (to ensure generality) and drop the rest

52 ↗(On Diff #156073)

what is this for?

61 ↗(On Diff #156073)

(nit: since you have the hashcode, comparing it before the data is probably faster?)

64 ↗(On Diff #156073)

nit: first const is not useful

64 ↗(On Diff #156073)

instead of these accessors, can we just make it a struct with const members and a constructor that ensures the hash is correctly set?

That should reflect the lightweightness/lack of behavior in this object

69 ↗(On Diff #156073)

This is the LLVM convention for hashing, see ADT/Hashing.h

76 ↗(On Diff #156073)

I'd move these to the Kind enum - these shouldn't be examples but rather the canonical documentation for these

81 ↗(On Diff #156073)

need to more fully spell this out: when is it full vs relative? is it a URI? Do directories have a trailing slash?

But really, leave this out for now.

83 ↗(On Diff #156073)

So SmallString always contains 3 pointers, i.e 24 bits. The very smallest SmallString that makes sense is <8>, as youll end up padded anyway.

89 ↗(On Diff #156073)

(remove leading \brief here and elsewhere)

89 ↗(On Diff #156073)

Please move trigram-related stuff to Trigrams.h

115 ↗(On Diff #156073)

Please don't implement a different set of segmentation rules to those that exist in FuzzyMatch.
Happy to chat about a sensible API to expose and where it should live.

115 ↗(On Diff #156073)

given your examples always return substrings, these should be stringrefs or offsets. But per above comment, we should rethink this function.

omtcyfz updated this revision to Diff 156443.Jul 20 2018, 1:38 AM
omtcyfz marked 35 inline comments as done.

Addressed most comments (aside from reusing fuzzy matching segmentation routine and making data + hash a separate structure).

Since I already submitted my next revision (https://reviews.llvm.org/D49546) through another account with more verbose and less confusing username I will remove everyone from this revision and create this revision under the latter account.

clang-tools-extra/clangd/index/noctem/SearchAtom.h
53 ↗(On Diff #155830)

Discussed internally: we probably shouldn't since there will be collisions even inside a single token namespace when Data will be too long.

clang-tools-extra/clangd/index/noctem/SearchToken.cpp
44 ↗(On Diff #156073)

I'm not sure how to iterate through the result then. Basically, having Trigrams ensures that these trigrams can be iterated after the function execution and inserted into the inverted index. Otherwise I should either expose a callback to add the generated trigrams or do something similar. Should I do that instead?

57 ↗(On Diff #156073)

I think it's way less explicit and slightly more confusing. I try not to have one-letter names so I guess it's better to have full names instead.

68 ↗(On Diff #156073)

But std::string would be created anyway, wouldn't it be?

87 ↗(On Diff #156073)

If Segment.size() is 1 then this would be 1U - 2U, which would be something very large.

138 ↗(On Diff #156073)

Resolving both of these for now (though they're not really fixed) since I will rethink the algorithm given Sam's patch.

clang-tools-extra/clangd/index/noctem/SearchToken.h
52 ↗(On Diff #156073)

Tags, for example. But yes, for now it's not very clear and probably not needed. Dropped this one.

omtcyfz removed a reviewer: ioeric.Jul 20 2018, 1:39 AM
omtcyfz removed a project: Restricted Project.
omtcyfz removed subscribers: arphaman, mgorny, MaskRay and 4 others.
omtcyfz abandoned this revision.Jul 20 2018, 1:46 AM
omtcyfz removed a subscriber: ioeric.

Abandon in favor of https://reviews.llvm.org/D49591.