This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Implement trigram generation algorithm for new symbol index
AbandonedPublic

Authored by omtcyfz on Jul 17 2018, 1:55 AM.

Details

Reviewers
None
Summary

This patch introduces trigram generation algorithm for the symbol index proposed in a recent design document.
RFC in the mailing list: http://lists.llvm.org/pipermail/clangd-dev/2018-July/000022.html

The trigram generation algorithm is described in detail in the proposal:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit#heading=h.903u1zon9nkj

Diff Detail

Event Timeline

omtcyfz created this revision.Jul 17 2018, 1:55 AM
omtcyfz updated this revision to Diff 155824.Jul 17 2018, 2:00 AM

Fix documentation indentation in SearchAtom description.

omtcyfz updated this revision to Diff 155827.Jul 17 2018, 2:36 AM

Wipe redundant FIXMEs.

omtcyfz updated this revision to Diff 155828.Jul 17 2018, 2:36 AM
omtcyfz updated this revision to Diff 155830.Jul 17 2018, 2:54 AM

Fix unittest file name in header.

Some high-level comments before jumping into details.

clang-tools-extra/clangd/index/noctem/SearchAtom.cpp
23

maybe also add how short symbol names are handled in the current algorithm and what the potential solution would be.

28

Any reason not to do this in segmentation?

clang-tools-extra/clangd/index/noctem/SearchAtom.h
27

Any reason to call this Atom instead of Token? Token seems to be a more commonly used name for this (e.g. https://yomguithereal.github.io/mnemonist/inverted-index#tokens).

45

Namespace can be easily confused with namespaces in clang world. Maybe Kind or Type?

51

What is an empty atom? Why do we need it?

52

Why default Type to Trigram?

52

Please add documentation. What is the semantic of Data?

53

Should we also incorporate Type into hash?

53

I'm wondering if we should use different hashes for different token types. For example, a trigram token "xyz" can be encoded in a 4 -byte int with ('x'<<16) & ('y'<<8) & 'z', which seems cheaper than std::hash.

62

nit:
s/getData/data/
s/getType/type/

76

nit: I'd avoid calling these tokens, as it they can be confused with tokens for the invert index. How about segments? Similarly, tokenize should probably be something like segment or segmentIdentifier.

141

generateSearchAtoms is too generalized a name for this. Maybe just trigram()? I also think this could take a list of segments from tokenize.

omtcyfz updated this revision to Diff 156073.EditedJul 18 2018, 8:02 AM
omtcyfz marked 11 inline comments as done.

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better for the symbol index to substitute "Noctem" which doesn't point to any project's feature.

Addressed all comments submitted by Eric.

As discussed internally, I should also exercise my naming skills and come up with a better for the symbol index to substitute "Noctem" which doesn't point to any project's feature.

Ooh, a bikeshed!
Suggestion: dex - short, pronouncable, suggests index, and is a trigram :-)

ioeric added inline comments.Jul 19 2018, 2:58 AM
clang-tools-extra/clangd/index/noctem/SearchToken.cpp
38 ↗(On Diff #156073)

Please also add what the current behavior is for short names. Do we just ignore them?

44 ↗(On Diff #156073)

Could we replace UniqaueTrigrams+Trigrams with a dense map from hash to token?

47 ↗(On Diff #156073)

nit: a redundant "of" EOL.

57 ↗(On Diff #156073)

nit: Maybe S1, S2, S3 instead of FirstSegment, ...?

68 ↗(On Diff #156073)

This seems wrong... wouldn't this give you a concatenation of three segments?

68 ↗(On Diff #156073)

For trigrams, it might make sense to put 3 chars into a SmallVector<3> (can be reused) and std::move it into the constructor. Might be cheaper than creating a std::string

87 ↗(On Diff #156073)

nit: Position < Segment.size() - 2 seems more commonly used.

92 ↗(On Diff #156073)

Comment for this loop?

135 ↗(On Diff #156073)

use trim?

138 ↗(On Diff #156073)

Maybe first split name on _ and them run further upper-lower segmentation on the split segments?

clang-tools-extra/clangd/index/noctem/SearchToken.h
58 ↗(On Diff #156073)

Any reason this has to be operator() instead of a hash method? operator() for hash value is not trivial on the call site.

69 ↗(On Diff #156073)

Who's the user of this friend function? Could it just call Token.hash()?

80 ↗(On Diff #156073)

nit: scope in clangd world is "foo::bar::baz::", but the global scope is "".

159 ↗(On Diff #156073)

This seems to be the same as just generateTrigrams(segmentIdentifier(Name)), so I'd drop it.

(just .h files. +1 to eric's comments except where noted)

clang-tools-extra/clangd/index/noctem/SearchToken.h
2 ↗(On Diff #156073)

nit: something went wrong here

28 ↗(On Diff #156073)

nit: remove \brief

32 ↗(On Diff #156073)

maybe just giving one example here, and moving the concrete semantics to Kind?

34 ↗(On Diff #156073)

ITYM suffix
consider just "under directory"

35 ↗(On Diff #156073)

namespace tokens don't have prefix/suffix semantics, they're exact

44 ↗(On Diff #156073)

nit: this is a really common type, and it's namespaced. Token is probably fine.

46 ↗(On Diff #156073)

doc: the fact that Kind is a namespace for data, and (on each enum value) the semantics

46 ↗(On Diff #156073)

nit: drop : short as you don't seem to make use of it anywhere. Can add it later if we want to optimize in-memory structure (currently it has no effect)

48 ↗(On Diff #156073)

in the first patch, probably just want trigram and scope (to ensure generality) and drop the rest

52 ↗(On Diff #156073)

what is this for?

61 ↗(On Diff #156073)

(nit: since you have the hashcode, comparing it before the data is probably faster?)

64 ↗(On Diff #156073)

nit: first const is not useful

64 ↗(On Diff #156073)

instead of these accessors, can we just make it a struct with const members and a constructor that ensures the hash is correctly set?

That should reflect the lightweightness/lack of behavior in this object

69 ↗(On Diff #156073)

This is the LLVM convention for hashing, see ADT/Hashing.h

76 ↗(On Diff #156073)

I'd move these to the Kind enum - these shouldn't be examples but rather the canonical documentation for these

81 ↗(On Diff #156073)

need to more fully spell this out: when is it full vs relative? is it a URI? Do directories have a trailing slash?

But really, leave this out for now.

83 ↗(On Diff #156073)

So SmallString always contains 3 pointers, i.e 24 bits. The very smallest SmallString that makes sense is <8>, as youll end up padded anyway.

89 ↗(On Diff #156073)

(remove leading \brief here and elsewhere)

89 ↗(On Diff #156073)

Please move trigram-related stuff to Trigrams.h

115 ↗(On Diff #156073)

Please don't implement a different set of segmentation rules to those that exist in FuzzyMatch.
Happy to chat about a sensible API to expose and where it should live.

115 ↗(On Diff #156073)

given your examples always return substrings, these should be stringrefs or offsets. But per above comment, we should rethink this function.

omtcyfz updated this revision to Diff 156443.Jul 20 2018, 1:38 AM
omtcyfz marked 35 inline comments as done.

Addressed most comments (aside from reusing fuzzy matching segmentation routine and making data + hash a separate structure).

Since I already submitted my next revision (https://reviews.llvm.org/D49546) through another account with more verbose and less confusing username I will remove everyone from this revision and create this revision under the latter account.

clang-tools-extra/clangd/index/noctem/SearchAtom.h
53

Discussed internally: we probably shouldn't since there will be collisions even inside a single token namespace when Data will be too long.

clang-tools-extra/clangd/index/noctem/SearchToken.cpp
44 ↗(On Diff #156073)

I'm not sure how to iterate through the result then. Basically, having Trigrams ensures that these trigrams can be iterated after the function execution and inserted into the inverted index. Otherwise I should either expose a callback to add the generated trigrams or do something similar. Should I do that instead?

57 ↗(On Diff #156073)

I think it's way less explicit and slightly more confusing. I try not to have one-letter names so I guess it's better to have full names instead.

68 ↗(On Diff #156073)

But std::string would be created anyway, wouldn't it be?

87 ↗(On Diff #156073)

If Segment.size() is 1 then this would be 1U - 2U, which would be something very large.

138 ↗(On Diff #156073)

Resolving both of these for now (though they're not really fixed) since I will rethink the algorithm given Sam's patch.

clang-tools-extra/clangd/index/noctem/SearchToken.h
52 ↗(On Diff #156073)

Tags, for example. But yes, for now it's not very clear and probably not needed. Dropped this one.

omtcyfz removed a reviewer: ioeric.Jul 20 2018, 1:39 AM
omtcyfz removed a project: Restricted Project.
omtcyfz removed subscribers: arphaman, mgorny, MaskRay and 4 others.
omtcyfz abandoned this revision.Jul 20 2018, 1:46 AM
omtcyfz removed a subscriber: ioeric.

Abandon in favor of https://reviews.llvm.org/D49591.