This is an archive of the discontinued LLVM Phabricator instance.

[cland] Dex: fix/simplify trigram generation
ClosedPublic

Authored by sammccall on Oct 2 2018, 4:16 PM.

Details

Summary
  1. Instead of a$$ for a short-query trigram, just use a
  2. Generate more short-query trigrams, e.g. "AbcDefGhi" now yields "d" and "ag". This is effectively required by LSP, having "ag" not match but "agh" match will lead to glitches due to client-side filtering.
  3. Drop leading-punctuation short-query trigrams. Nice idea, but current implementation is broken (competes with non-punctuation short query trigrams)

Diff Detail

Event Timeline

sammccall created this revision.Oct 2 2018, 4:16 PM
sammccall updated this revision to Diff 168049.Oct 2 2018, 4:19 PM

Update comment, revert unintended change

ioeric added a comment.Oct 4 2018, 3:02 AM

Generate more short-query trigrams, e.g. "AbcDefGhi" now yields "d" and "ag".

I am concerned about the impact on the size of posting lists (we can measure) and retrieval quality by adding more incomplete trigrams.

This is effectively required by LSP, having "ag" not match but "agh" match will lead to glitches due to client-side filtering.

It seems hard to make index behave the same as LSP clients, considering all the ranking signals we have; there can be other reasons that cause having "ag" not match but "agh" match. And it seems reasonable to assume that you would get better symbols as you type more characters (i.e. stronger relevance signal).

Drop leading-punctuation short-query trigrams. Nice idea, but current implementation is broken (competes with non-punctuation short query trigrams).

Could you elaborate how this is broken? We should probably fix it instead of removing it. __ not matching __some_macro sounds like a regression.

unittests/clangd/DexTests.cpp
399

nit: remove?

sammccall marked an inline comment as done.Oct 4 2018, 4:16 AM

TL;DR: i'm no longer convinced of my conclusions for short-query, look at the proposal below.

Generate more short-query trigrams, e.g. "AbcDefGhi" now yields "d" and "ag".

I am concerned about the impact on the size of posting lists (we can measure) and retrieval quality by adding more incomplete trigrams.

There is indeed an impact here. Before: 22309480, After: 22531144.
About 1/3 of this is posting lists, so 1% to overall, 3% on posting lists.
I haven't measured quality (don't have a good way to do that for dex currently). It's true that the new results we're admitting are probably worse as they don't start with the right letter.

This is effectively required by LSP, having "ag" not match but "agh" match will lead to glitches due to client-side filtering.

It seems hard to make index behave the same as LSP clients, considering all the ranking signals we have; there can be other reasons that cause having "ag" not match but "agh" match.

No, there's only one such reason: we truncated the result list (the symbol matched, but it wasn't one of the best N).
This reason is accounted for by LSP: client-side filtering only occurs when there was no truncation.

And it seems reasonable to assume that you would get better symbols as you type more characters (i.e. stronger relevance signal).

The problem is LSP clients are free to assume that the result list is complete (unless marked as incomplete) and therefore will never retrieve the better symbols.

Drop leading-punctuation short-query trigrams. Nice idea, but current implementation is broken (competes with non-punctuation short query trigrams).

Could you elaborate how this is broken? We should probably fix it instead of removing it. __ not matching __some_macro sounds like a regression.

fuzzyFind considers _ptr to match unique_ptr, so it needs to consider _ to match it too.

We will still match __some_macro, but the posting lists will match ~everything and then postfilter. It may make sense to do optimizations or tweak overall trigram generation for this case, but the way it's currently done seems unprincipled and a little broken.


OK, so after some thought on the tradeoffs here, what about this alternative design:

  • for query length <3, we support really restrictive matches: exact prefix, or first char + next head char. We
  • we get around the LSP restrictions by always marking such result sets as incomplete
sammccall updated this revision to Diff 168280.Oct 4 2018, 6:04 AM

Use similar but better-defined rules for short trigram matches.
Modify Dex to account for the matches not being exhaustive.

Unfortunately the test needs D52796, which depends on this patch.

ioeric accepted this revision.Oct 4 2018, 6:54 AM

The problem is LSP clients are free to assume that the result list is complete (unless marked as incomplete) and therefore will never retrieve the better symbols.

Good point. Thanks for the explanation!

clangd/index/dex/Dex.cpp
222

should this still be a vlog?

clangd/index/dex/Trigram.h
47

is b still generated?

This revision is now accepted and ready to land.Oct 4 2018, 6:54 AM
This revision was automatically updated to reflect the committed changes.
sammccall marked 2 inline comments as done.