This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Tune the fuzzy-matching algorithm
ClosedPublic

Authored by ilya-biryukov on Mar 13 2019, 7:41 AM.

Details

Summary

To reduce the gap between prefix and initialism matches.

The motivation is producing better scoring in one particular example,
but the change does not seem to cause large regressions in other cases.

The example is matching 'up' against 'unique_ptr' and 'upper_bound'.
Before the change, we had:

  • "[u]nique_[p]tr" with a score of 0.3,
  • "[up]per_bound" with a score of 1.0.

A 3x difference meant that symbol quality signals were almost always ignored
and 'upper_bound' was always ranked higher.

However, intuitively, the match scores should be very close for the two.

After the change we have the following scores:

  • "[u]nique_[p]tr" with a score of 0.75,
  • "[up]per_bound" with a score of 1.0.

Event Timeline

ilya-biryukov created this revision. Mar 13 2019, 7:41 AM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 13 2019, 7:41 AM

I'll collect the metrics and post them here too

Here are some initial measurements with the current metrics that we have. We seem to regress significantly in some cases. This is expected, since we only check prefix matches there.
@ioeric is working on including non-prefix matches (more concretely, initialism matches) in the metrics; we'll update when we get those numbers too.

==================================================================================================
                                        OVERALL (excl. CROSS_NAMESPACE)
==================================================================================================
  Total measurements: 109166 (+0)
  Average latency (ms): 219.219589233 (-7)
  All measurements:
	MRR: 69.12 (-0.93)	Top-1: 59.54% (-1.20%)	Top-5: 81.21% (-0.48%)	Top-100: 96.06% (-0.10%)
  Full identifiers:
	MRR: 97.68 (-0.54)	Top-1: 96.62% (-0.96%)	Top-5: 98.96% (-0.07%)	Top-100: 99.15% (-0.01%)
  Filter length 0-5:
	MRR:      29.11 (+0.00)		62.02 (+0.41)		70.71 (-1.45)		73.41 (-1.18)		75.44 (-1.71)		78.78 (-2.50)
	Top-1:    17.49% (+0.00%)		49.19% (+0.50%)		59.52% (-1.51%)		62.47% (-1.41%)		65.27% (-2.11%)		69.25% (-3.53%)
	Top-5:    42.45% (+0.00%)		78.99% (+0.39%)		85.48% (-1.01%)		87.21% (-0.87%)		88.26% (-1.05%)		90.83% (-0.94%)
	Top-100:  84.62% (+0.00%)		96.75% (+0.15%)		98.03% (-0.16%)		98.22% (-0.23%)		98.29% (-0.31%)		98.47% (-0.23%)
==================================================================================================
                                        DEFAULT
==================================================================================================
  Total measurements: 52054 (+0)
  Average latency (ms): 251.10333252 (-8)
  All measurements:
	MRR: 61.26 (-1.13)	Top-1: 51.13% (-1.31%)	Top-5: 74.10% (-0.68%)	Top-100: 92.60% (-0.21%)
  Full identifiers:
	MRR: 96.31 (-0.24)	Top-1: 95.08% (-0.33%)	Top-5: 97.93% (-0.09%)	Top-100: 98.30% (-0.01%)
  Filter length 0-5:
	MRR:      16.78 (+0.00)		48.89 (+0.32)		62.07 (-2.05)		66.09 (-1.67)		69.65 (-2.45)		72.49 (-2.23)
	Top-1:    9.06% (+0.00%)		34.63% (+0.25%)		49.11% (-2.03%)		53.07% (-1.99%)		58.01% (-2.83%)		61.95% (-2.74%)
	Top-5:    23.57% (+0.00%)		68.10% (+0.52%)		79.78% (-1.35%)		82.73% (-1.26%)		84.63% (-1.66%)		86.08% (-1.19%)
	Top-100:  70.73% (+0.00%)		93.67% (+0.29%)		96.41% (-0.31%)		96.77% (-0.47%)		96.92% (-0.63%)		97.07% (-0.47%)
==================================================================================================
                                        EXPLICIT_MEMBER_ACCESS
==================================================================================================
  Total measurements: 30470 (+0)
  Average latency (ms): 114.196846008 (-5)
  All measurements:
	MRR: 67.95 (-1.05)	Top-1: 58.04% (-1.53%)	Top-5: 80.28% (-0.43%)	Top-100: 98.64% (-0.01%)
  Full identifiers:
	MRR: 98.19 (-1.45)	Top-1: 96.69% (-2.75%)	Top-5: 99.83% (-0.07%)	Top-100: 99.89% (+0.00%)
  Filter length 0-5:
	MRR:      27.70 (+0.00)		62.41 (+0.78)		68.45 (-1.11)		70.82 (-0.71)		72.00 (-1.10)		78.54 (-4.29)
	Top-1:    16.41% (+0.00%)		49.21% (+1.05%)		57.13% (-1.02%)		60.02% (-0.71%)		61.31% (-1.43%)		67.65% (-6.68%)
	Top-5:    39.39% (+0.00%)		80.05% (+0.52%)		83.26% (-1.18%)		84.73% (-0.69%)		85.38% (-0.69%)		92.24% (-1.05%)
	Top-100:  94.25% (+0.00%)		99.21% (+0.02%)		99.20% (-0.04%)		99.24% (-0.02%)		99.26% (-0.02%)		99.71% (+0.00%)
==================================================================================================
                                        WANT_LOCAL
==================================================================================================
  Total measurements: 26642 (+0)
  Average latency (ms): 277.036773682 (-7)
  All measurements:
	MRR: 85.82 (-0.40)	Top-1: 77.69% (-0.60%)	Top-5: 96.18% (-0.13%)	Top-100: 99.88% (+0.00%)
  Full identifiers:
	MRR: 99.63 (-0.11)	Top-1: 99.37% (-0.21%)	Top-5: 99.91% (-0.02%)	Top-100: 99.93% (+0.00%)
  Filter length 0-5:
	MRR:      53.07 (+0.00)		86.94 (+0.19)		90.42 (-0.64)		91.19 (-0.74)		91.51 (-0.94)		92.59 (-0.81)
	Top-1:    34.01% (+0.00%)		77.25% (+0.34%)		82.89% (-1.01%)		84.30% (-1.06%)		84.99% (-1.45%)		86.95% (-1.23%)
	Top-5:    80.13% (+0.00%)		98.80% (+0.00%)		99.32% (-0.15%)		99.16% (-0.33%)		99.21% (-0.24%)		99.23% (-0.23%)
	Top-100:  99.68% (+0.00%)		99.93% (+0.00%)		99.92% (+0.00%)		99.92% (+0.00%)		99.91% (+0.00%)		99.90% (+0.00%)
==================================================================================================
                                        CROSS_NAMESPACE
==================================================================================================
  Total measurements: 18030 (+0)
  Average latency (ms): 268.525360107 (-6)
  All measurements:
	MRR: 32.61 (-2.11)	Top-1: 25.29% (-1.85%)	Top-5: 40.29% (-2.63%)	Top-100: 72.80% (-3.47%)
  Full identifiers:
	MRR: 79.96 (-1.83)	Top-1: 73.22% (-1.77%)	Top-5: 88.01% (-2.14%)	Top-100: 98.71% (+0.00%)
  Filter length 0-5:
	MRR:      1.21 (+0.00)		11.40 (+0.28)		23.19 (-2.98)		29.33 (-3.42)		39.34 (-3.80)		46.81 (-3.58)
	Top-1:    0.59% (+0.00%)		5.75% (+0.04%)		14.16% (-2.45%)		19.95% (-2.53%)		29.11% (-3.62%)		36.17% (-3.05%)
	Top-5:    1.44% (+0.00%)		16.38% (+0.41%)		31.15% (-4.28%)		39.71% (-4.67%)		50.27% (-4.08%)		59.36% (-4.25%)
	Top-100:  9.44% (+0.00%)		63.63% (+0.96%)		85.02% (-2.19%)		82.54% (-8.21%)		86.24% (-8.50%)		88.93% (-7.92%)
  • Adjust the tweaks, bring back removed heuristics.

Here are the stats for the latest version. We now get a significant bump for initialisms and a much lower hit in non-initialism cases.

==================================================================================================
                                        OVERALL (excl. CROSS_NAMESPACE and INITIALISMS)
==================================================================================================
  Total measurements: 108483 (+0)
  Average latency (ms): 223.570343018 (-5)
  All measurements:
	MRR: 69.72 (-0.31)	Top-1: 60.32% (-0.40%)	Top-5: 81.56% (-0.13%)	Top-100: 96.12% (-0.03%)
  Full identifiers:
	MRR: 97.72 (-0.48)	Top-1: 96.69% (-0.88%)	Top-5: 98.97% (-0.04%)	Top-100: 99.15% (+0.00%)
  Filter length 0-5:
	MRR:      29.15 (-0.00)		62.03 (+0.42)		71.61 (-0.50)		74.13 (-0.43)		76.52 (-0.60)		80.52 (-0.74)
	Top-1:    17.52% (+0.00%)		49.21% (+0.50%)		60.51% (-0.46%)		63.35% (-0.48%)		66.59% (-0.74%)		71.83% (-0.93%)
	Top-5:    42.53% (-0.01%)		78.96% (+0.40%)		86.18% (-0.29%)		87.76% (-0.31%)		88.92% (-0.38%)		91.36% (-0.39%)
	Top-100:  84.57% (-0.01%)		96.74% (+0.15%)		98.15% (-0.03%)		98.31% (-0.13%)		98.43% (-0.16%)		98.60% (-0.08%)
==================================================================================================
                                        INITIALISMS
==================================================================================================
  Total measurements: 16489 (+0)
  Average latency (ms): 207.854690552 (-5)
  All measurements:
	MRR: 82.31 (+3.81)	Top-1: 74.67% (+4.40%)	Top-5: 91.76% (+2.93%)	Top-100: 98.41% (+0.28%)
  Initialism length 2-4:
	MRR:      80.43 (+2.49)		85.02 (+3.48)		86.64 (+13.05)
	Top-1:    71.77% (+3.06%)		78.81% (+3.58%)		81.47% (+15.15%)
	Top-5:    91.22% (+1.60%)		92.54% (+3.27%)		92.94% (+10.37%)
	Top-100:  98.34% (+0.19%)		98.57% (+0.26%)		98.40% (+0.86%)
==================================================================================================
                                        DEFAULT
==================================================================================================
  Total measurements: 51805 (+0)
  Average latency (ms): 256.839630127 (-3)
  All measurements:
	MRR: 61.93 (-0.44)	Top-1: 51.91% (-0.52%)	Top-5: 74.55% (-0.21%)	Top-100: 92.72% (-0.07%)
  Full identifiers:
	MRR: 96.38 (-0.15)	Top-1: 95.18% (-0.20%)	Top-5: 97.96% (-0.05%)	Top-100: 98.30% (+0.00%)
  Filter length 0-5:
	MRR:      16.76 (-0.01)		48.89 (+0.32)		63.24 (-0.85)		67.15 (-0.58)		71.30 (-0.78)		73.47 (-1.25)
	Top-1:    9.05% (+0.00%)		34.65% (+0.26%)		50.29% (-0.82%)		54.37% (-0.67%)		59.95% (-0.89%)		63.15% (-1.56%)
	Top-5:    23.55% (-0.01%)		68.04% (+0.51%)		80.76% (-0.33%)		83.50% (-0.46%)		85.67% (-0.61%)		86.59% (-0.67%)
	Top-100:  70.69% (-0.01%)		93.65% (+0.31%)		96.64% (-0.06%)		96.96% (-0.27%)		97.20% (-0.34%)		97.35% (-0.17%)
==================================================================================================
                                        EXPLICIT_MEMBER_ACCESS
==================================================================================================
  Total measurements: 30162 (+0)
  Average latency (ms): 113.366485596 (-9)
  All measurements:
	MRR: 68.79 (-0.18)	Top-1: 59.23% (-0.28%)	Top-5: 80.68% (-0.05%)	Top-100: 98.64% (+0.00%)
  Full identifiers:
	MRR: 98.20 (-1.43)	Top-1: 96.72% (-2.71%)	Top-5: 99.82% (-0.07%)	Top-100: 99.89% (+0.00%)
  Filter length 0-5:
	MRR:      27.83 (-0.00)		62.41 (+0.79)		69.19 (-0.27)		71.31 (-0.13)		72.93 (-0.07)		82.67 (-0.11)
	Top-1:    16.48% (+0.00%)		49.23% (+1.06%)		57.90% (-0.09%)		60.58% (+0.00%)		62.56% (-0.02%)		74.12% (-0.13%)
	Top-5:    39.63% (+0.00%)		80.07% (+0.55%)		83.96% (-0.47%)		85.22% (-0.19%)		85.93% (-0.12%)		93.20% (-0.08%)
	Top-100:  94.21% (+0.00%)		99.21% (+0.02%)		99.24% (+0.00%)		99.26% (+0.00%)		99.28% (+0.00%)		99.71% (+0.00%)
==================================================================================================
                                        WANT_LOCAL
==================================================================================================
  Total measurements: 26516 (+0)
  Average latency (ms): 283.928375244 (-2)
  All measurements:
	MRR: 86.00 (-0.22)	Top-1: 77.98% (-0.32%)	Top-5: 96.24% (-0.07%)	Top-100: 99.88% (+0.00%)
  Full identifiers:
	MRR: 99.66 (-0.08)	Top-1: 99.42% (-0.16%)	Top-5: 99.93% (+0.00%)	Top-100: 99.93% (+0.00%)
  Filter length 0-5:
	MRR:      53.11 (+0.00)		86.95 (+0.19)		90.98 (-0.09)		91.46 (-0.46)		91.57 (-0.87)		92.96 (-0.43)
	Top-1:    34.05% (+0.00%)		77.27% (+0.35%)		83.75% (-0.15%)		84.66% (-0.68%)		85.12% (-1.31%)		87.60% (-0.57%)
	Top-5:    80.17% (+0.00%)		98.79% (+0.00%)		99.47% (+0.00%)		99.34% (-0.14%)		99.21% (-0.24%)		99.30% (-0.17%)
	Top-100:  99.67% (+0.00%)		99.93% (+0.00%)		99.92% (+0.00%)		99.92% (+0.00%)		99.91% (+0.00%)		99.90% (+0.00%)
==================================================================================================
                                        CROSS_NAMESPACE
==================================================================================================
  Total measurements: 17928 (+0)
  Average latency (ms): 272.283966064 (0)
  All measurements:
	MRR: 33.81 (-1.01)	Top-1: 26.33% (-0.90%)	Top-5: 41.87% (-1.17%)	Top-100: 74.37% (-1.88%)
  Full identifiers:
	MRR: 80.65 (-1.27)	Top-1: 73.96% (-1.22%)	Top-5: 88.87% (-1.37%)	Top-100: 98.70% (+0.00%)
  Filter length 0-5:
	MRR:      1.21 (-0.00)		11.45 (+0.28)		25.79 (-0.45)		30.97 (-1.92)		41.27 (-2.02)		48.48 (-2.01)
	Top-1:    0.59% (+0.00%)		5.79% (+0.04%)		16.00% (-0.71%)		21.13% (-1.49%)		31.06% (-1.78%)		37.93% (-1.38%)
	Top-5:    1.45% (+0.00%)		16.51% (+0.45%)		35.07% (-0.49%)		42.18% (-2.31%)		52.22% (-2.33%)		61.26% (-2.54%)
	Top-100:  9.46% (-0.04%)		63.65% (+0.89%)		87.10% (+0.00%)		85.64% (-5.05%)		89.84% (-4.87%)		91.67% (-5.16%)
==================================================================================================
                                        WITH EXPECTED_TYPE
==================================================================================================
  Total measurements: 51958 (+0)
  Average latency (ms): 224.306686401 (-6)
  All measurements:
	MRR: 71.24 (-0.02)	Top-1: 62.54% (-0.08%)	Top-5: 82.20% (+0.10%)	Top-100: 94.85% (-0.10%)
  Full identifiers:
	MRR: 94.81 (-1.30)	Top-1: 92.51% (-2.15%)	Top-5: 97.59% (-0.42%)	Top-100: 98.93% (+0.00%)
  Filter length 0-5:
	MRR:      35.97 (-0.00)		62.80 (+0.82)		75.03 (+0.17)		74.65 (+0.00)		76.16 (+0.83)		78.62 (-0.86)
	Top-1:    24.23% (+0.00%)		52.13% (+1.04%)		65.95% (+0.22%)		65.33% (-0.04%)		67.13% (+1.05%)		70.13% (-0.90%)
	Top-5:    50.87% (+0.00%)		76.24% (+0.71%)		86.68% (+0.22%)		86.46% (+0.31%)		87.31% (+0.50%)		89.11% (-0.85%)
	Top-100:  79.22% (-0.01%)		94.35% (+0.29%)		97.77% (+0.08%)		97.11% (-0.39%)		97.57% (-0.44%)		98.16% (-0.32%)
ilya-biryukov edited the summary of this revision. (Show Details) Mar 14 2019, 5:36 AM
ilya-biryukov edited the summary of this revision. (Show Details)
ioeric added inline comments. Mar 14 2019, 9:11 AM
clang-tools-extra/clangd/FuzzyMatch.cpp
74

could you explain this change?

270–271

If the word is "_something", do we still want to penalize skipping the first char?

275

iiuc, there is no penalty by default because consecutive matches will gain a bonus. This is not obvious here. Maybe add a comment? Or consider applying a penalty for non-consecutive matches and no bonus for consecutive ones.

288

could you explain the intention of this change? Is it related to the other changes in the patch? The original looks reasonable to me.

clang-tools-extra/unittests/clangd/FuzzyMatchTests.cpp
248

could you add a case for the "onmes" pattern?

ilya-biryukov marked 5 inline comments as done. Mar 15 2019, 2:43 AM
ilya-biryukov added inline comments.
clang-tools-extra/clangd/FuzzyMatch.cpp
74

We give an extra bonus point for a consecutive match now, so the maximum bonus is now higher.
This constant is used for score normalization to bring the final score into the [0,1] range. (IIUC, there's a special case where the score can be 2.0, in the case of a substring match, but the normalization is still applied to per-char scores.)

270–271

Yes, and it should actually play out nicely in practice. Consider two possibilities:

  1. The pattern starts with _. In that case we want the items starting with _ to be ranked higher; penalizing other items achieves this.
  2. The pattern does not start with _. In that case all the items starting with _ will get a penalty and we will prefer items starting with the first character of the pattern. That also looks desirable.

This penalty in some sense replaces the prefix match boost we had before.
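A toy model of the ordering argument above (this is not clangd's scorer; the numbers are arbitrary and only chosen to show that a fixed skip-first-char penalty produces the desired order in both cases):

#include <iostream>
#include <string>

// Toy scorer: a fixed penalty whenever the word's first character is not the
// pattern's first character (i.e. the match skips the head of the word).
float toyScore(const std::string &Pattern, const std::string &Word) {
  float Score = 1.0f; // base score for an otherwise equal match
  if (!Pattern.empty() && !Word.empty() && Pattern[0] != Word[0])
    Score -= 0.3f;    // penalty for skipping the first char
  return Score;
}

int main() {
  // Case 1: pattern "_f" -- "_foo" (1.0) outranks "foo" (0.7).
  std::cout << toyScore("_f", "_foo") << " vs " << toyScore("_f", "foo") << "\n";
  // Case 2: pattern "f" -- "foo" (1.0) outranks "_foo" (0.7).
  std::cout << toyScore("f", "foo") << " vs " << toyScore("f", "_foo") << "\n";
}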

275

Added a comment. Adding a bonus instead of a penalty seems to produce smaller discrepancies between match and no-match cases for small patterns.

That is, a non-consecutive match of the next segment now takes a smaller hit in the final score.
For example, a match of p in [u]nique_[p]tr now gets a score of 2 out of 4 with no penalties attached; previously it would get a score of 2 out of 3 but a penalty of 2 for the non-consecutive match, which just turned out to be too much.
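Back-of-the-envelope arithmetic for the numbers above, assuming the final score is the sum of per-character scores divided by (max per-char bonus x pattern length); this is only a rough check, not the exact FuzzyMatch.cpp formula.

#include <cstdio>

int main() {
  // New scheme: max per-char bonus is 4 (extra point for a consecutive match).
  // For pattern 'up' vs unique_ptr: 'u' scores the maximum, 'p' scores 2 with
  // no penalty, which lines up with the 0.75 quoted in the summary.
  std::printf("new total: %.2f\n", (4.0f + 2.0f) / (2 * 4.0f)); // 0.75
  // Old scheme: max per-char bonus was 3; 'p' scored 2 but paid a penalty of 2
  // for the non-consecutive match, wiping out its contribution entirely
  // (additional penalties presumably account for the reported old total of 0.3).
  std::printf("old 'p' contribution: %.2f\n", (2.0f - 2.0f) / 3.0f); // 0.0
}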

288

The motivation is keeping a bonus for matching the head of a segment even in the case of single-case patterns that lack segmentation signals.
Previously, p from up would not get this bonus when matching unique_[p]tr, even though intuitively it should, as we do want to encourage those matches.

  • Added comments
  • Added a test case for 'onmes'
ioeric added inline comments. Mar 15 2019, 3:38 AM
clang-tools-extra/clangd/FuzzyMatch.cpp
273

nit: move this comment onto its own line now that it's taking two?

288

so, iiuc, when IsPatSingleCase is true, it means that any character in the pattern can potentially be a head? could you add a comment for this?

We also had a stricter condition for Pat[P] == Word[W], but now 'ut' would get a bonus for [u]nique_p[t]r. Is this intended?

ioeric accepted this revision. Mar 15 2019, 3:39 AM

(The result looks great)

This revision is now accepted and ready to land. Mar 15 2019, 3:39 AM
ilya-biryukov marked 3 inline comments as done.

Address comments:

  • Shortened the comment to fit it into a single line.
  • Added a comment about single-case patterns.
clang-tools-extra/clangd/FuzzyMatch.cpp
273

Shortened it to fit into a single line.

288

so, iiuc, when IsPatSingleCase is true, it means that any character in the pattern can potentially be a head? could you add a comment for this?

Exactly, we don't have any segmentation signals in the pattern, and we don't want small score drops each time a segment head is matched; e.g. there's no reason matching c should get a smaller bonus than r when matching My[StrC]at with strc.
Added a comment.

We also had a stricter condition for Pat[P] == Word[W], but now 'ut' would get a bonus for [u]nique_p[t]r. Is this intended?

This one is tricky and I'm less sure of it. I think the important bit is to give this bonus to t in the case of [u]nique_[pt]r.
Otherwise we're making a match of the non-head character worse for no reason. I think the real problem in the previous implementation was its preference for prefix matches (the P == W condition), which gave an unfair advantage to [up]per_bound over [u]nique_[p]tr.

Matches in the middle of a segment are harshly penalized in the next line, so this small bonus does not seem to matter in practice for the case you mentioned. Maybe there's a better way to do this.

  • A better name for the new test case
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald Transcript · Mar 15 2019, 6:59 AM

@ilya-biryukov could you please give details about the quality metric you are using and some description of the posted measurements?

Hi Jan,

Sure! And sorry for posting these metrics for a while (we had other patches mentioning them) without a proper explanation.

We simulate a bunch of completions at random points in random files from our internal codebase. We assume the desired completion item is the one written in the code.
Intuitively, the higher it's ranked, the better. In an attempt to measure this, we compute the following metrics:

  • MRR - mean reciprocal rank of the searched element, averaged over all measured completions.
  • Top-N - percentage of completions where the searched element is among the first N items (see the sketch below).
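As a rough sketch of how these two numbers are computed from the rank of the searched element in each simulated completion (the ranks and the 0-100 scaling below are assumptions based on how the tables read; this is not the actual evaluation tool):

#include <cstdio>
#include <vector>

int main() {
  // 1-based rank of the searched element in each simulated completion;
  // 0 means the element did not show up in the result list at all.
  std::vector<int> Ranks = {1, 3, 2, 0, 120};

  double ReciprocalSum = 0;
  int Top1 = 0, Top5 = 0, Top100 = 0;
  for (int R : Ranks) {
    if (R <= 0)
      continue;
    ReciprocalSum += 1.0 / R;
    Top1 += (R <= 1);
    Top5 += (R <= 5);
    Top100 += (R <= 100);
  }
  double N = Ranks.size();
  // Reported scaled to 0-100, matching the tables above.
  std::printf("MRR: %.2f  Top-1: %.2f%%  Top-5: %.2f%%  Top-100: %.2f%%\n",
              100 * ReciprocalSum / N, 100.0 * Top1 / N, 100.0 * Top5 / N,
              100.0 * Top100 / N);
}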

We also independently calculate those metrics for interesting groups of completions:

  • OVERALL. All completions.
  • INITIALISMS. Completions where the query (what the user typed) matches the first characters of the segments in the desired completion item, e.g. SI or SIC for SomeInterestingClass.
  • EXPLICIT_MEMBER_ACCESS. Desired completion item is a class member and the completion is in a member access expression, e.g. vector().^push_back().
  • WANT_LOCAL. Desired completion item is in the same file as the completion itself.
  • CROSS_NAMESPACE. The simulated completion removes the namespace prefix in addition to the identifier, e.g. we expect to complete std::vector, not just vector.
  • WITH EXPECTED_TYPE. Only completions in a context where the expected type is available, e.g. int* a = ^.

For each of the picked positions in a file, we try to complete prefixes of the desired completion item of length up to 5 as well as the full identifier (except for initialisms; more on them below).
E.g. for the following source code:

int test() {
  std::vector<int> vec;
  vec.^push_back(10); // say, the simulation runs here
}

We would run the simulation for the following completions: vec.^, vec.p^, vec.pu^, vec.pus^, vec.push^, and vec.push_^.
You can see the breakdown of the metrics for each of the prefix lengths in each of the completion groups.
Individual metrics for a fixed length of the prefix are written in the Filter length 0-5 sections.
We also try completion with the full identifier (e.g. vec.push_back^), metrics for those are written in the Full identifiers section.
Aggregated metrics for all completions in a group are written in the All measurements section.

The "initialisms" groups is special, for those we use first chars of the segments inside the desired completion item rather than the prefix, e.g. vec.p^, vec.pb^.

@ioeric is the author of these completion metrics and evaluation tools. Eric, please feel free to correct me if I got something wrong or missed something.