- Instead of a$$ for a short-query trigram, just use a
- Generate more short-query trigrams, e.g. "AbcDefGhi" now yields "d" and "ag". This is effectively required by LSP, having "ag" not match but "agh" match will lead to glitches due to client-side filtering.
- Drop leading-punctuation short-query trigrams. Nice idea, but current implementation is broken (competes with non-punctuation short query trigrams)
Details
Diff Detail
- Repository
- rCTE Clang Tools Extra
- Build Status
Buildable 23454 Build 23453: arc lint + arc unit
Event Timeline
Generate more short-query trigrams, e.g. "AbcDefGhi" now yields "d" and "ag".
I am concerned about the impact on the size of posting lists (we can measure) and retrieval quality by adding more incomplete trigrams.
This is effectively required by LSP, having "ag" not match but "agh" match will lead to glitches due to client-side filtering.
It seems hard to make index behave the same as LSP clients, considering all the ranking signals we have; there can be other reasons that cause having "ag" not match but "agh" match. And it seems reasonable to assume that you would get better symbols as you type more characters (i.e. stronger relevance signal).
Drop leading-punctuation short-query trigrams. Nice idea, but current implementation is broken (competes with non-punctuation short query trigrams).
Could you elaborate how this is broken? We should probably fix it instead of removing it. __ not matching __some_macro sounds like a regression.
unittests/clangd/DexTests.cpp | ||
---|---|---|
365 | nit: remove? |
TL;DR: i'm no longer convinced of my conclusions for short-query, look at the proposal below.
There is indeed an impact here. Before: 22309480, After: 22531144.
About 1/3 of this is posting lists, so 1% to overall, 3% on posting lists.
I haven't measured quality (don't have a good way to do that for dex currently). It's true that the new results we're admitting are probably worse as they don't start with the right letter.
This is effectively required by LSP, having "ag" not match but "agh" match will lead to glitches due to client-side filtering.
It seems hard to make index behave the same as LSP clients, considering all the ranking signals we have; there can be other reasons that cause having "ag" not match but "agh" match.
No, there's only one such reason: we truncated the result list (the symbol matched, but it wasn't one of the best N).
This reason is accounted for by LSP: client-side filtering only occurs when there was no truncation.
And it seems reasonable to assume that you would get better symbols as you type more characters (i.e. stronger relevance signal).
The problem is LSP clients are free to assume that the result list is complete (unless marked as incomplete) and therefore will never retrieve the better symbols.
Drop leading-punctuation short-query trigrams. Nice idea, but current implementation is broken (competes with non-punctuation short query trigrams).
Could you elaborate how this is broken? We should probably fix it instead of removing it. __ not matching __some_macro sounds like a regression.
fuzzyFind considers _ptr to match unique_ptr, so it needs to consider _ to match it too.
We will still match __some_macro, but the posting lists will match ~everything and then postfilter. It may make sense to do optimizations or tweak overall trigram generation for this case, but the way it's currently done seems unprincipled and a little broken.
OK, so after some thought on the tradeoffs here, what about this alternative design:
- for query length <3, we support really restrictive matches: exact prefix, or first char + next head char. We
- we get around the LSP restrictions by always marking such result sets as incomplete
Use similar but better-defined rules for short trigram matches.
Modify Dex to account for the matches not being exhaustive.
Unfortunately the test needs D52796, which depends on this patch.
The problem is LSP clients are free to assume that the result list is complete (unless marked as incomplete) and therefore will never retrieve the better symbols.
Good point. Thanks for the explanation!
clangd/index/dex/Dex.cpp | ||
---|---|---|
222 | should this still be a vlog? | |
clangd/index/dex/Trigram.h | ||
47 | is b still generated? |
should this still be a vlog?