This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Dex Trigrams: Improve query trigram generation
ClosedPublic

Authored by kbobyrev on Nov 16 2021, 6:50 AM.

Details

Summary

These are the trigrams for queries right now:

  • "va" -> {Trigram("va")}
  • "va_" -> {} (empty)

This is suboptimal since the resulting query will discard the query information
and return all symbols, some of which will be later be scored expensively
(fuzzy matching score). This is related to
https://github.com/clangd/clangd/issues/39 but does not fix it. Accidentally,
because of that incorrect behavior, when user types "tok::va" there are no
results (the issue is that tok::kw___builtin_va_arg does not have "va" token)
but when "tok::va_" is typed, expected result (tok::kw___builtin_va_arg)
shows up by accident. This is because the dex query transformer will only
lookup symbols within the tok:: namespace. There won't be many, so the
returned results will contain symbol we need; this symbol will be filtered out
by the expensive checks and that will be displayed in the editor.

Diff Detail

Event Timeline

kbobyrev created this revision.Nov 16 2021, 6:50 AM
kbobyrev requested review of this revision.Nov 16 2021, 6:50 AM
kbobyrev updated this revision to Diff 387615.Nov 16 2021, 6:52 AM

Add a small comment.

kbobyrev updated this revision to Diff 388461.Nov 19 2021, 4:34 AM

Add improvements for identifier trigram generation and make query & id
generators consistent with each other.

sammccall added inline comments.Nov 28 2021, 3:08 PM
clang-tools-extra/clangd/index/dex/Trigram.cpp
74–94

It'd be good to describe what the semantics actually are somewhere.
As it is, this comment describes the algorithm used to generate the trigrams, and the other function refers to this one.

77

This seems overly complicated.
If a separator precedes the alphanums, then it's at the front.
Moreover its Next table seems to be set up correctly.

So I believe this all amounts to: if the first char is a separator, then we allow starting our partial trigrams at 0, first head, second head. Else just first head & second head. And we just follow the next table.

101

This looks a lot like the Next table. Can't we use that? (Second head + third head is missing, but I don't see why)

129

both "valid" and "symbols" are confusing here.

Do we actually need this variable?
Seems like we can detect the case at the end, if we generated no trigrams.

134

I'd suggest always including the separator, and then using StringRef(Letters).take_back(2). Then we don't need ValidSymbols here either.

kbobyrev updated this revision to Diff 390297.Nov 29 2021, 3:23 AM
kbobyrev marked 5 inline comments as done.

Address review comments.

kbobyrev updated this revision to Diff 390298.Nov 29 2021, 3:25 AM

Slightly improve docs.

Basically LG now, some of the changes are a bit mysterious to me though

clang-tools-extra/clangd/index/dex/Trigram.cpp
50

why the change from array to pair and 0 to NPOS?

104

Fist -> first
You're missing first separator only

I think examples might be clearer than a description of these.

124

(if the change to NPOS is just to avoid hitting Position == 0 on the first iteration, you could check it here instead)

141

Since the query trigram corresponds so closely to the query for short queries, I think it makes more sense to consider it the other way: generateIdentifierTrigrams is deciding which queries it wants to match and emulating this function.

144

nit: by prepending a separator, we form a bigram!
If a bigram can't be formed out of letters/numbers...

kbobyrev updated this revision to Diff 390627.Nov 30 2021, 2:01 AM
kbobyrev marked 3 inline comments as done.

Address review comments.

clang-tools-extra/clangd/index/dex/Trigram.cpp
50
  • Array of two elements is confusing: we had it since we had 3 elements, with 2 it doesn't make sense anymore. Also, makes it easier to decompose the items (std::tie(Tail, Head) vs size_t Tail = Next[I][0], Head = Next[I][1]).
  • NPOS is more explicit than 0 in our intent: 0 is a valid index and it feels awkward to use a valid index as NOT_FOUND. Yes, it's the one that can't appear in the Next but it's not clear from the code and requires some explanation for understanding. Why NPOS can't be found in Next is obvious.
124

It's not only for that but this was the trigger, having the Position == 0 check here makes the logic slightly confusing, I think it's better to just use NPOS instead for the reasons above.

sammccall added inline comments.Nov 30 2021, 2:14 AM
clang-tools-extra/clangd/index/dex/Trigram.cpp
50

I disagree on the style points here and they don't seem particularly related to the patch. Do they need to be included here?

Array of two elements is confusing

Well, I disagree here: the difference between array<T, 2> and pair<T, T> is homogeneous vs heterogeneous. The usage here is almost entirely homogeneous.
3 vs 2 doesn't make a difference, we could have used tuple<T, T, T>, same would apply.

Also, makes it easier to decompose the items

As far as ergonomics go, this comes at the cost of not being able to iterate over them, which we do much more often. (And AFAICT you never need NextTail decomposed.)

NPOS is not a common convention so I don't think it's more obvious.

50

you've changed unsigned -> size_t here, can you revert that please?

113

You're not treating head vs tail differently, this should just be iterating over Next as above.

kbobyrev updated this revision to Diff 390641.Nov 30 2021, 2:43 AM
kbobyrev marked 5 inline comments as done.

Address review comments.

kbobyrev updated this revision to Diff 390643.Nov 30 2021, 2:45 AM

Remove refactoring leftovers.

sammccall accepted this revision.Dec 7 2021, 5:17 AM

Sorry about the delay!

clang-tools-extra/clangd/index/dex/Trigram.cpp
46–47

the default size of this vector is 6 - if we care about the optimization we probably want to double it or so?

(Roles above is fine, it's 40)

This revision is now accepted and ready to land.Dec 7 2021, 5:17 AM
kbobyrev updated this revision to Diff 392380.Dec 7 2021, 6:45 AM
kbobyrev marked an inline comment as done.

No problem, thank you!

This revision was automatically updated to reflect the committed changes.