This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang-tools-extra/
-
clangd/
7/12
FuzzyMatch.cpp
-
unittests/clangd/
-
clangd/
1/1
FuzzyMatchTests.cpp

Differential D59300

[clangd] Tune the fuzzy-matching algorithm
ClosedPublic

Authored by ilya-biryukov on Mar 13 2019, 7:41 AM.

Download Raw Diff

Details

Reviewers

ioeric

Commits

rG373bee85c2b3: [clangd] Tune the fuzzy-matching algorithm
rCTE356261: [clangd] Tune the fuzzy-matching algorithm
rL356261: [clangd] Tune the fuzzy-matching algorithm

Summary

To reduce the gap between prefix and initialism matches.

The motivation is producing better scoring in one particular example,
but the change does not seem to cause large regressions in other cases.

The examples is matching 'up' against 'unique_ptr' and 'upper_bound'.
Before the change, we had:

"[u]nique_[p]tr" with a score of 0.3,
"[up]per_bound" with a score of 1.0.

A 3x difference meant that symbol quality signals were almost always ignored
and 'upper_bound' was always ranked higher.

However, intuitively, the match scores should be very close for the two.

After the change we have the following scores:

"[u]nique_[p]tr" with a score of 0.75,
"[up]per_bound" with a score of 1.0.

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 29202
Build 29201: arc lint + arc unit

Event Timeline

ilya-biryukov created this revision.Mar 13 2019, 7:41 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 13 2019, 7:41 AM

Herald added subscribers: kadircet, arphaman, jkorous, MaskRay. · View Herald Transcript

Harbormaster completed remote builds in B29079: Diff 190415.Mar 13 2019, 7:42 AM

I'll collect the metrics and post them here too

Here are some initial measurements with the current metrics that we have. We seem to regress significantly in some cases. This is expected, since we only check the prefix matches there.
@ioeric is working on including the non-prefix (more concretely, initialism matches) into the metrics, we'll update when we get those metrics too.

==================================================================================================
                                        OVERALL (excl. CROSS_NAMESPACE)
==================================================================================================
  Total measurements: 109166 (+0)
  Average latency (ms): 219.219589233 (-7)
  All measurements:
	MRR: 69.12 (-0.93)	Top-1: 59.54% (-1.20%)	Top-5: 81.21% (-0.48%)	Top-100: 96.06% (-0.10%)
  Full identifiers:
	MRR: 97.68 (-0.54)	Top-1: 96.62% (-0.96%)	Top-5: 98.96% (-0.07%)	Top-100: 99.15% (-0.01%)
  Filter length 0-5:
	MRR:      29.11 (+0.00)		62.02 (+0.41)		70.71 (-1.45)		73.41 (-1.18)		75.44 (-1.71)		78.78 (-2.50)
	Top-1:    17.49% (+0.00%)		49.19% (+0.50%)		59.52% (-1.51%)		62.47% (-1.41%)		65.27% (-2.11%)		69.25% (-3.53%)
	Top-5:    42.45% (+0.00%)		78.99% (+0.39%)		85.48% (-1.01%)		87.21% (-0.87%)		88.26% (-1.05%)		90.83% (-0.94%)
	Top-100:  84.62% (+0.00%)		96.75% (+0.15%)		98.03% (-0.16%)		98.22% (-0.23%)		98.29% (-0.31%)		98.47% (-0.23%)
==================================================================================================
                                        DEFAULT
==================================================================================================
  Total measurements: 52054 (+0)
  Average latency (ms): 251.10333252 (-8)
  All measurements:
	MRR: 61.26 (-1.13)	Top-1: 51.13% (-1.31%)	Top-5: 74.10% (-0.68%)	Top-100: 92.60% (-0.21%)
  Full identifiers:
	MRR: 96.31 (-0.24)	Top-1: 95.08% (-0.33%)	Top-5: 97.93% (-0.09%)	Top-100: 98.30% (-0.01%)
  Filter length 0-5:
	MRR:      16.78 (+0.00)		48.89 (+0.32)		62.07 (-2.05)		66.09 (-1.67)		69.65 (-2.45)		72.49 (-2.23)
	Top-1:    9.06% (+0.00%)		34.63% (+0.25%)		49.11% (-2.03%)		53.07% (-1.99%)		58.01% (-2.83%)		61.95% (-2.74%)
	Top-5:    23.57% (+0.00%)		68.10% (+0.52%)		79.78% (-1.35%)		82.73% (-1.26%)		84.63% (-1.66%)		86.08% (-1.19%)
	Top-100:  70.73% (+0.00%)		93.67% (+0.29%)		96.41% (-0.31%)		96.77% (-0.47%)		96.92% (-0.63%)		97.07% (-0.47%)
==================================================================================================
                                        EXPLICIT_MEMBER_ACCESS
==================================================================================================
  Total measurements: 30470 (+0)
  Average latency (ms): 114.196846008 (-5)
  All measurements:
	MRR: 67.95 (-1.05)	Top-1: 58.04% (-1.53%)	Top-5: 80.28% (-0.43%)	Top-100: 98.64% (-0.01%)
  Full identifiers:
	MRR: 98.19 (-1.45)	Top-1: 96.69% (-2.75%)	Top-5: 99.83% (-0.07%)	Top-100: 99.89% (+0.00%)
  Filter length 0-5:
	MRR:      27.70 (+0.00)		62.41 (+0.78)		68.45 (-1.11)		70.82 (-0.71)		72.00 (-1.10)		78.54 (-4.29)
	Top-1:    16.41% (+0.00%)		49.21% (+1.05%)		57.13% (-1.02%)		60.02% (-0.71%)		61.31% (-1.43%)		67.65% (-6.68%)
	Top-5:    39.39% (+0.00%)		80.05% (+0.52%)		83.26% (-1.18%)		84.73% (-0.69%)		85.38% (-0.69%)		92.24% (-1.05%)
	Top-100:  94.25% (+0.00%)		99.21% (+0.02%)		99.20% (-0.04%)		99.24% (-0.02%)		99.26% (-0.02%)		99.71% (+0.00%)
==================================================================================================
                                        WANT_LOCAL
==================================================================================================
  Total measurements: 26642 (+0)
  Average latency (ms): 277.036773682 (-7)
  All measurements:
	MRR: 85.82 (-0.40)	Top-1: 77.69% (-0.60%)	Top-5: 96.18% (-0.13%)	Top-100: 99.88% (+0.00%)
  Full identifiers:
	MRR: 99.63 (-0.11)	Top-1: 99.37% (-0.21%)	Top-5: 99.91% (-0.02%)	Top-100: 99.93% (+0.00%)
  Filter length 0-5:
	MRR:      53.07 (+0.00)		86.94 (+0.19)		90.42 (-0.64)		91.19 (-0.74)		91.51 (-0.94)		92.59 (-0.81)
	Top-1:    34.01% (+0.00%)		77.25% (+0.34%)		82.89% (-1.01%)		84.30% (-1.06%)		84.99% (-1.45%)		86.95% (-1.23%)
	Top-5:    80.13% (+0.00%)		98.80% (+0.00%)		99.32% (-0.15%)		99.16% (-0.33%)		99.21% (-0.24%)		99.23% (-0.23%)
	Top-100:  99.68% (+0.00%)		99.93% (+0.00%)		99.92% (+0.00%)		99.92% (+0.00%)		99.91% (+0.00%)		99.90% (+0.00%)
==================================================================================================
                                        CROSS_NAMESPACE
==================================================================================================
  Total measurements: 18030 (+0)
  Average latency (ms): 268.525360107 (-6)
  All measurements:
	MRR: 32.61 (-2.11)	Top-1: 25.29% (-1.85%)	Top-5: 40.29% (-2.63%)	Top-100: 72.80% (-3.47%)
  Full identifiers:
	MRR: 79.96 (-1.83)	Top-1: 73.22% (-1.77%)	Top-5: 88.01% (-2.14%)	Top-100: 98.71% (+0.00%)
  Filter length 0-5:
	MRR:      1.21 (+0.00)		11.40 (+0.28)		23.19 (-2.98)		29.33 (-3.42)		39.34 (-3.80)		46.81 (-3.58)
	Top-1:    0.59% (+0.00%)		5.75% (+0.04%)		14.16% (-2.45%)		19.95% (-2.53%)		29.11% (-3.62%)		36.17% (-3.05%)
	Top-5:    1.44% (+0.00%)		16.38% (+0.41%)		31.15% (-4.28%)		39.71% (-4.67%)		50.27% (-4.08%)		59.36% (-4.25%)
	Top-100:  9.44% (+0.00%)		63.63% (+0.96%)		85.02% (-2.19%)		82.54% (-8.21%)		86.24% (-8.50%)		88.93% (-7.92%)

Adjust the tweaks, bring back removed heuristics.

Harbormaster completed remote builds in B29140: Diff 190604.Mar 14 2019, 5:20 AM

Here are the stats for the latest version. We now get a significant bump for initialisms and a much lower hit in non-initialism cases.

==================================================================================================
                                        OVERALL (excl. CROSS_NAMESPACE and INITIALISMS)
==================================================================================================
  Total measurements: 108483 (+0)
  Average latency (ms): 223.570343018 (-5)
  All measurements:
	MRR: 69.72 (-0.31)	Top-1: 60.32% (-0.40%)	Top-5: 81.56% (-0.13%)	Top-100: 96.12% (-0.03%)
  Full identifiers:
	MRR: 97.72 (-0.48)	Top-1: 96.69% (-0.88%)	Top-5: 98.97% (-0.04%)	Top-100: 99.15% (+0.00%)
  Filter length 0-5:
	MRR:      29.15 (-0.00)		62.03 (+0.42)		71.61 (-0.50)		74.13 (-0.43)		76.52 (-0.60)		80.52 (-0.74)
	Top-1:    17.52% (+0.00%)		49.21% (+0.50%)		60.51% (-0.46%)		63.35% (-0.48%)		66.59% (-0.74%)		71.83% (-0.93%)
	Top-5:    42.53% (-0.01%)		78.96% (+0.40%)		86.18% (-0.29%)		87.76% (-0.31%)		88.92% (-0.38%)		91.36% (-0.39%)
	Top-100:  84.57% (-0.01%)		96.74% (+0.15%)		98.15% (-0.03%)		98.31% (-0.13%)		98.43% (-0.16%)		98.60% (-0.08%)
==================================================================================================
                                        INITIALISMS
==================================================================================================
  Total measurements: 16489 (+0)
  Average latency (ms): 207.854690552 (-5)
  All measurements:
	MRR: 82.31 (+3.81)	Top-1: 74.67% (+4.40%)	Top-5: 91.76% (+2.93%)	Top-100: 98.41% (+0.28%)
  Initialism length 2-4:
	MRR:      80.43 (+2.49)		85.02 (+3.48)		86.64 (+13.05)
	Top-1:    71.77% (+3.06%)		78.81% (+3.58%)		81.47% (+15.15%)
	Top-5:    91.22% (+1.60%)		92.54% (+3.27%)		92.94% (+10.37%)
	Top-100:  98.34% (+0.19%)		98.57% (+0.26%)		98.40% (+0.86%)
==================================================================================================
                                        DEFAULT
==================================================================================================
  Total measurements: 51805 (+0)
  Average latency (ms): 256.839630127 (-3)
  All measurements:
	MRR: 61.93 (-0.44)	Top-1: 51.91% (-0.52%)	Top-5: 74.55% (-0.21%)	Top-100: 92.72% (-0.07%)
  Full identifiers:
	MRR: 96.38 (-0.15)	Top-1: 95.18% (-0.20%)	Top-5: 97.96% (-0.05%)	Top-100: 98.30% (+0.00%)
  Filter length 0-5:
	MRR:      16.76 (-0.01)		48.89 (+0.32)		63.24 (-0.85)		67.15 (-0.58)		71.30 (-0.78)		73.47 (-1.25)
	Top-1:    9.05% (+0.00%)		34.65% (+0.26%)		50.29% (-0.82%)		54.37% (-0.67%)		59.95% (-0.89%)		63.15% (-1.56%)
	Top-5:    23.55% (-0.01%)		68.04% (+0.51%)		80.76% (-0.33%)		83.50% (-0.46%)		85.67% (-0.61%)		86.59% (-0.67%)
	Top-100:  70.69% (-0.01%)		93.65% (+0.31%)		96.64% (-0.06%)		96.96% (-0.27%)		97.20% (-0.34%)		97.35% (-0.17%)
==================================================================================================
                                        EXPLICIT_MEMBER_ACCESS
==================================================================================================
  Total measurements: 30162 (+0)
  Average latency (ms): 113.366485596 (-9)
  All measurements:
	MRR: 68.79 (-0.18)	Top-1: 59.23% (-0.28%)	Top-5: 80.68% (-0.05%)	Top-100: 98.64% (+0.00%)
  Full identifiers:
	MRR: 98.20 (-1.43)	Top-1: 96.72% (-2.71%)	Top-5: 99.82% (-0.07%)	Top-100: 99.89% (+0.00%)
  Filter length 0-5:
	MRR:      27.83 (-0.00)		62.41 (+0.79)		69.19 (-0.27)		71.31 (-0.13)		72.93 (-0.07)		82.67 (-0.11)
	Top-1:    16.48% (+0.00%)		49.23% (+1.06%)		57.90% (-0.09%)		60.58% (+0.00%)		62.56% (-0.02%)		74.12% (-0.13%)
	Top-5:    39.63% (+0.00%)		80.07% (+0.55%)		83.96% (-0.47%)		85.22% (-0.19%)		85.93% (-0.12%)		93.20% (-0.08%)
	Top-100:  94.21% (+0.00%)		99.21% (+0.02%)		99.24% (+0.00%)		99.26% (+0.00%)		99.28% (+0.00%)		99.71% (+0.00%)
==================================================================================================
                                        WANT_LOCAL
==================================================================================================
  Total measurements: 26516 (+0)
  Average latency (ms): 283.928375244 (-2)
  All measurements:
	MRR: 86.00 (-0.22)	Top-1: 77.98% (-0.32%)	Top-5: 96.24% (-0.07%)	Top-100: 99.88% (+0.00%)
  Full identifiers:
	MRR: 99.66 (-0.08)	Top-1: 99.42% (-0.16%)	Top-5: 99.93% (+0.00%)	Top-100: 99.93% (+0.00%)
  Filter length 0-5:
	MRR:      53.11 (+0.00)		86.95 (+0.19)		90.98 (-0.09)		91.46 (-0.46)		91.57 (-0.87)		92.96 (-0.43)
	Top-1:    34.05% (+0.00%)		77.27% (+0.35%)		83.75% (-0.15%)		84.66% (-0.68%)		85.12% (-1.31%)		87.60% (-0.57%)
	Top-5:    80.17% (+0.00%)		98.79% (+0.00%)		99.47% (+0.00%)		99.34% (-0.14%)		99.21% (-0.24%)		99.30% (-0.17%)
	Top-100:  99.67% (+0.00%)		99.93% (+0.00%)		99.92% (+0.00%)		99.92% (+0.00%)		99.91% (+0.00%)		99.90% (+0.00%)
==================================================================================================
                                        CROSS_NAMESPACE
==================================================================================================
  Total measurements: 17928 (+0)
  Average latency (ms): 272.283966064 (0)
  All measurements:
	MRR: 33.81 (-1.01)	Top-1: 26.33% (-0.90%)	Top-5: 41.87% (-1.17%)	Top-100: 74.37% (-1.88%)
  Full identifiers:
	MRR: 80.65 (-1.27)	Top-1: 73.96% (-1.22%)	Top-5: 88.87% (-1.37%)	Top-100: 98.70% (+0.00%)
  Filter length 0-5:
	MRR:      1.21 (-0.00)		11.45 (+0.28)		25.79 (-0.45)		30.97 (-1.92)		41.27 (-2.02)		48.48 (-2.01)
	Top-1:    0.59% (+0.00%)		5.79% (+0.04%)		16.00% (-0.71%)		21.13% (-1.49%)		31.06% (-1.78%)		37.93% (-1.38%)
	Top-5:    1.45% (+0.00%)		16.51% (+0.45%)		35.07% (-0.49%)		42.18% (-2.31%)		52.22% (-2.33%)		61.26% (-2.54%)
	Top-100:  9.46% (-0.04%)		63.65% (+0.89%)		87.10% (+0.00%)		85.64% (-5.05%)		89.84% (-4.87%)		91.67% (-5.16%)
==================================================================================================
                                        WITH EXPECTED_TYPE
==================================================================================================
  Total measurements: 51958 (+0)
  Average latency (ms): 224.306686401 (-6)
  All measurements:
	MRR: 71.24 (-0.02)	Top-1: 62.54% (-0.08%)	Top-5: 82.20% (+0.10%)	Top-100: 94.85% (-0.10%)
  Full identifiers:
	MRR: 94.81 (-1.30)	Top-1: 92.51% (-2.15%)	Top-5: 97.59% (-0.42%)	Top-100: 98.93% (+0.00%)
  Filter length 0-5:
	MRR:      35.97 (-0.00)		62.80 (+0.82)		75.03 (+0.17)		74.65 (+0.00)		76.16 (+0.83)		78.62 (-0.86)
	Top-1:    24.23% (+0.00%)		52.13% (+1.04%)		65.95% (+0.22%)		65.33% (-0.04%)		67.13% (+1.05%)		70.13% (-0.90%)
	Top-5:    50.87% (+0.00%)		76.24% (+0.71%)		86.68% (+0.22%)		86.46% (+0.31%)		87.31% (+0.50%)		89.11% (-0.85%)
	Top-100:  79.22% (-0.01%)		94.35% (+0.29%)		97.77% (+0.08%)		97.11% (-0.39%)		97.57% (-0.44%)		98.16% (-0.32%)

ilya-biryukov edited the summary of this revision. (Show Details)Mar 14 2019, 5:36 AM

ilya-biryukov edited the summary of this revision. (Show Details)

ioeric added inline comments.Mar 14 2019, 9:11 AM

clang-tools-extra/clangd/FuzzyMatch.cpp
74	could you explain this change?
270–271	If the word is "_something", do we still want to penalize skipping first char?
275	iiuc, there is no penalty by default because consecutive matches will gain bonus. This is not trivial here. Maybe add a comment? Or consider apply penalty for non-consecutive matches and no bonus for consecutive ones.
288	could you explain the intention of this change? Is it relevant to other changes in the patch? The original looks reasonable to me.
clang-tools-extra/unittests/clangd/FuzzyMatchTests.cpp
248	could you add a case for "onmes" pattern?

ilya-biryukov marked 5 inline comments as done.Mar 15 2019, 2:43 AM

ilya-biryukov added inline comments.

clang-tools-extra/clangd/FuzzyMatch.cpp
74	We give an extra bonus point for a consecutive match now, so the maximum bonus is now higher. This constant is used for score normalization to bring the final score into `[0,1]` range (IIUC, there's a special case when the score can be `2.0`, in case of a substring match, but the normalization is still applied to per-char scores)
270–271	Yes and it should actually play out nicely in practice. Consider two possibilities: the pattern starts with `_`. In that case we want the items starting with `_` to be ranked higher, penalizing other items achieves this. the pattern does not start with `_`. In that case all the items, starting with `_` will get a penalty and we will prefer items starting with the first character of the pattern. Also looks desirable. This penalty in some sense replaces the `prefix match` boost we had before.
275	Added a comment. Adding a bonus instead of a penalty seems to produce smaller discrepancies between match and no-match cases for small patterns. I.e. the non-consecutive match of a next segment gets a smaller hit in the final score. I.e. a match of `p` in `[u]nique_[p]tr` gets a score of `2 out of 4` and no penalties associated with it now, previously it would get a score of `2 out of 3` but a penalty of `2` for a non-consecutive match, which just turned out to be too much.
288	The motivation is keeping a bonus for matching the head of a segment even in case of single-case patterns that lack segmentation signals. Previously, `p` from `up` would not get this bonus when matching `unique_[p]tr` even though intuitively it should as we do want to encourage those matches.

Added comments
Add a test case for 'onmes'

Harbormaster completed remote builds in B29202: Diff 190796.Mar 15 2019, 2:44 AM

ioeric added inline comments.Mar 15 2019, 3:38 AM

clang-tools-extra/clangd/FuzzyMatch.cpp
273	nit: move this comment into into own line now that it's taking two?
288	so, iiuc, when `IsPatSingleCase` is true, it means that any character in the pattern can potentially be a head? could you add comment for this? We also had a stricter condition for `Pat[P] == Word[W]`, but now `ut` would get a bonus for `[u]nique_p[t]r`. Is this intended?

(The result looks great)

This revision is now accepted and ready to land.Mar 15 2019, 3:39 AM

Address comments:

Shorten the comment to fit it into a single line.
Added a comment about single-case patterns

clang-tools-extra/clangd/FuzzyMatch.cpp
273	Shrinked it to fit into a single line.
288	so, iiuc, when IsPatSingleCase is true, it means that any character in the pattern can potentially be a head? could you add comment for this? Exactly, we don't have any segmentation signals in a pattern and we don't want small drops in scores on each match a segment head, e.g. there's no reason matching `c` should have smaller bonus than `r` when matching `My[StrC]at` with `strc`. Added a comment. We also had a stricter condition for Pat[P] == Word[W], but now ut would get a bonus for [u]nique_p[t]r. Is this intended? This one is tricky and I'm less sure of it. I think the important bit is to give this bonus to `t` in case of `[u]nique_[pt]r`. Otherwise we're making a match of the non-head character worse for no reason. I think the real problem in the previous implementation was its preference of prefix matches (the `P == W` condition), which gave an unfair advantage to `[up]per_bound` over `[u]nique_[p]tr`. Matches in the middle of a segment are harshly penalized in the next line, so this small bonus does not seem to matter in practice for the case you mentioned. Maybe there's a better way to do this.

Harbormaster completed remote builds in B29204: Diff 190803.Mar 15 2019, 4:13 AM

A better name for the new test case

Harbormaster completed remote builds in B29205: Diff 190804.Mar 15 2019, 4:15 AM

Closed by commit rL356261: [clangd] Tune the fuzzy-matching algorithm (authored by ibiryukov). · Explain WhyMar 15 2019, 6:59 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. · View Herald TranscriptMar 15 2019, 6:59 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

@ilya-biryukov could you please give details about the quality metric you are using and some description of posted measurements?

Hi Jan,

Sure! And sorry for posting these metrics for a while (we had other patches mentioning them) without proper explanation.

We simulate a bunch of completions at random points in random files from our internal codebase. We assume the desired completion item is the one written in the code.
Intuitively, the higher it's ranked the better. In an attempt to measure this, we compute the following metrics:

MRR
Top-N - percentage of completions where the searched element is among the first n items.

We also independently calculate those metrics for interesting groups of completions:

OVERALL. All completions.
INITIALISMS. Completions with query (what the user typed) matching first characters of each segment in the desired completion item, e.g. SI or SIC for SomeInterestingClass.
EXPLICIT_MEMBER_ACCESS. Desired completion item is a class member and the completion is in a member access expression, e.g. vector().^push_back().
WANT_LOCAL. Desired completion item is in the same file as the completion itself.
CROSS_NAMESPACE. Simulated completion removes the namespace prefix, in addition to the identifier, e.g. we expect to complete std::vector not just vector.
WITH EXPECTED_TYPE. Only completions in a context where expected type is available, e.g. int* a = ^.

For each of the picked positions in a file, we try to complete a prefix of the desired completion item of length up to 5 and the full identifier (except initialisms, more on them below).
E.g. for the following source code:

int test() {
  std::vector<int> vec;
  vec.^push_back(10); // say, simulation runs here
}

We would try run simulation for the following completions: vec.^, vec.p^, vec.pu^, vec.pus^, vec.push^ and vec.push_^.
You can see the breakdown of the metrics for each of the prefix lengths in each of the completion groups.
Individual metrics for a fixed length of the prefix are written in the Filter length 0-5 sections.
We also try completion with the full identifier (e.g. vec.push_back^), metrics for those are written in the Full identifiers section.
Aggregated metrics for all completions in a group are written in the All measurements section.

The "initialisms" groups is special, for those we use first chars of the segments inside the desired completion item rather than the prefix, e.g. vec.p^, vec.pb^.

@ioeric is the author of these completion metrics and evaluation tools. Eric, please feel free to correct me if I got something wrong or missed something.

Revision Contents

Path

Size

clang-tools-extra/

clangd/

FuzzyMatch.cpp

28 lines

unittests/

clangd/

FuzzyMatchTests.cpp

11 lines

Diff 190796

clang-tools-extra/clangd/FuzzyMatch.cpp

Show First 20 Lines • Show All 65 Lines • ▼ Show 20 Lines
constexpr int FuzzyMatcher::MaxWord;		constexpr int FuzzyMatcher::MaxWord;

static char lower(char C) { return C >= 'A' && C <= 'Z' ? C + ('a' - 'A') : C; }		static char lower(char C) { return C >= 'A' && C <= 'Z' ? C + ('a' - 'A') : C; }
// A "negative infinity" score that won't overflow.		// A "negative infinity" score that won't overflow.
// We use this to mark unreachable states and forbidden solutions.		// We use this to mark unreachable states and forbidden solutions.
// Score field is 15 bits wide, min value is -2^14, we use half of that.		// Score field is 15 bits wide, min value is -2^14, we use half of that.
static constexpr int AwfulScore = -(1 << 13);		static constexpr int AwfulScore = -(1 << 13);
static bool isAwful(int S) { return S < AwfulScore / 2; }		static bool isAwful(int S) { return S < AwfulScore / 2; }
static constexpr int PerfectBonus = 3; // Perfect per-pattern-char score.		static constexpr int PerfectBonus = 4; // Perfect per-pattern-char score.
		ioericUnsubmitted Not Done Reply Inline Actions could you explain this change? ioeric: could you explain this change?
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions We give an extra bonus point for a consecutive match now, so the maximum bonus is now higher. This constant is used for score normalization to bring the final score into `[0,1]` range (IIUC, there's a special case when the score can be `2.0`, in case of a substring match, but the normalization is still applied to per-char scores) ilya-biryukov: We give an extra bonus point for a consecutive match now, so the maximum bonus is now higher.

FuzzyMatcher::FuzzyMatcher(llvm::StringRef Pattern)		FuzzyMatcher::FuzzyMatcher(llvm::StringRef Pattern)
: PatN(std::min<int>(MaxPat, Pattern.size())),		: PatN(std::min<int>(MaxPat, Pattern.size())),
ScoreScale(PatN ? float{1} / (PerfectBonus * PatN) : 0), WordN(0) {		ScoreScale(PatN ? float{1} / (PerfectBonus * PatN) : 0), WordN(0) {
std::copy(Pattern.begin(), Pattern.begin() + PatN, Pat);		std::copy(Pattern.begin(), Pattern.begin() + PatN, Pat);
for (int I = 0; I < PatN; ++I)		for (int I = 0; I < PatN; ++I)
LowPat[I] = lower(Pat[I]);		LowPat[I] = lower(Pat[I]);
Scores[0][0][Miss] = {0, Miss};		Scores[0][0][Miss] = {0, Miss};
▲ Show 20 Lines • Show All 179 Lines • ▼ Show 20 Lines	if (Last == Miss) {
if (WordRole[W] == Tail &&		if (WordRole[W] == Tail &&
(Word[W] == LowWord[W] \|\| !(WordTypeSet & 1 << Lower)))		(Word[W] == LowWord[W] \|\| !(WordTypeSet & 1 << Lower)))
return false;		return false;
}		}
return true;		return true;
}		}

int FuzzyMatcher::skipPenalty(int W, Action Last) const {		int FuzzyMatcher::skipPenalty(int W, Action Last) const {
int S = 0;		if (W == 0) // Skipping the first character.
		return 3;
		ioericUnsubmitted Not Done Reply Inline Actions If the word is "_something", do we still want to penalize skipping first char? ioeric: If the word is "_something", do we still want to penalize skipping first char?
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions Yes and it should actually play out nicely in practice. Consider two possibilities: the pattern starts with `_`. In that case we want the items starting with `_` to be ranked higher, penalizing other items achieves this. the pattern does not start with `_`. In that case all the items, starting with `_` will get a penalty and we will prefer items starting with the first character of the pattern. Also looks desirable. This penalty in some sense replaces the `prefix match` boost we had before. ilya-biryukov: Yes and it should actually play out nicely in practice. Consider two possibilities: 1. the…
if (WordRole[W] == Head) // Skipping a segment.		if (WordRole[W] == Head) // Skipping a segment.
S += 1;		return 1; // We want to keep it lower than the bonus for a consecutive
		ioericUnsubmitted Done Reply Inline Actions nit: move this comment into into own line now that it's taking two? ioeric: nit: move this comment into into own line now that it's taking two?
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions Shrinked it to fit into a single line. ilya-biryukov: Shrinked it to fit into a single line.
if (Last == Match) // Non-consecutive match.		// match.
S += 2; // We'd rather skip a segment than split our match.		// Instead of penalizing non-consecutive matches, we give a bonus to a
		ioericUnsubmitted Not Done Reply Inline Actions iiuc, there is no penalty by default because consecutive matches will gain bonus. This is not trivial here. Maybe add a comment? Or consider apply penalty for non-consecutive matches and no bonus for consecutive ones. ioeric: iiuc, there is no penalty by default because consecutive matches will gain bonus. This is not…
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions Added a comment. Adding a bonus instead of a penalty seems to produce smaller discrepancies between match and no-match cases for small patterns. I.e. the non-consecutive match of a next segment gets a smaller hit in the final score. I.e. a match of `p` in `[u]nique_[p]tr` gets a score of `2 out of 4` and no penalties associated with it now, previously it would get a score of `2 out of 3` but a penalty of `2` for a non-consecutive match, which just turned out to be too much. ilya-biryukov: Added a comment. Adding a bonus instead of a penalty seems to produce smaller discrepancies…
return S;		// consecutive match in matchBonus. This produces a better score distribution
		// than penalties in case of small patterns, e.g. 'up' for 'unique_ptr'.
		return 0;
}		}

int FuzzyMatcher::matchBonus(int P, int W, Action Last) const {		int FuzzyMatcher::matchBonus(int P, int W, Action Last) const {
assert(LowPat[P] == LowWord[W]);		assert(LowPat[P] == LowWord[W]);
int S = 1;		int S = 1;
// Bonus: pattern so far is a (case-insensitive) prefix of the word.		bool IsPatSingleCase =
if (P == W) // We can't skip pattern characters, so we must have matched all.		(PatTypeSet == 1 << Lower) \|\| (PatTypeSet == 1 << Upper);
++S;
// Bonus: case matches, or a Head in the pattern aligns with one in the word.		// Bonus: case matches, or a Head in the pattern aligns with one in the word.
if ((Pat[P] == Word[W] && ((PatTypeSet & 1 << Upper) \|\| P == W)) \|\|		if (Pat[P] == Word[W] \|\|
(PatRole[P] == Head && WordRole[W] == Head))		(WordRole[W] == Head && (IsPatSingleCase \|\| PatRole[P] == Head)))
		ioericUnsubmitted Not Done Reply Inline Actions could you explain the intention of this change? Is it relevant to other changes in the patch? The original looks reasonable to me. ioeric: could you explain the intention of this change? Is it relevant to other changes in the patch?
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions The motivation is keeping a bonus for matching the head of a segment even in case of single-case patterns that lack segmentation signals. Previously, `p` from `up` would not get this bonus when matching `unique_[p]tr` even though intuitively it should as we do want to encourage those matches. ilya-biryukov: The motivation is keeping a bonus for matching the head of a segment even in case of single…
		ioericUnsubmitted Not Done Reply Inline Actions so, iiuc, when `IsPatSingleCase` is true, it means that any character in the pattern can potentially be a head? could you add comment for this? We also had a stricter condition for `Pat[P] == Word[W]`, but now `ut` would get a bonus for `[u]nique_p[t]r`. Is this intended? ioeric: so, iiuc, when `IsPatSingleCase` is true, it means that any character in the pattern can…
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions so, iiuc, when IsPatSingleCase is true, it means that any character in the pattern can potentially be a head? could you add comment for this? Exactly, we don't have any segmentation signals in a pattern and we don't want small drops in scores on each match a segment head, e.g. there's no reason matching `c` should have smaller bonus than `r` when matching `My[StrC]at` with `strc`. Added a comment. We also had a stricter condition for Pat[P] == Word[W], but now ut would get a bonus for [u]nique_p[t]r. Is this intended? This one is tricky and I'm less sure of it. I think the important bit is to give this bonus to `t` in case of `[u]nique_[pt]r`. Otherwise we're making a match of the non-head character worse for no reason. I think the real problem in the previous implementation was its preference of prefix matches (the `P == W` condition), which gave an unfair advantage to `[up]per_bound` over `[u]nique_[p]tr`. Matches in the middle of a segment are harshly penalized in the next line, so this small bonus does not seem to matter in practice for the case you mentioned. Maybe there's a better way to do this. ilya-biryukov: > so, iiuc, when IsPatSingleCase is true, it means that any character in the pattern can…
++S;		++S;
		// Bonus: a consecutive match. First character match also gets a bonus to
		// ensure prefix final match score normalizes to 1.0.
		if (W == 0 \|\| Last == Match)
		S += 2;
// Penalty: matching inside a segment (and previous char wasn't matched).		// Penalty: matching inside a segment (and previous char wasn't matched).
if (WordRole[W] == Tail && P && Last == Miss)		if (WordRole[W] == Tail && P && Last == Miss)
S -= 3;		S -= 3;
// Penalty: a Head in the pattern matches in the middle of a word segment.		// Penalty: a Head in the pattern matches in the middle of a word segment.
if (PatRole[P] == Head && WordRole[W] == Tail)		if (PatRole[P] == Head && WordRole[W] == Tail)
--S;		--S;
// Penalty: matching the first pattern character in the middle of a segment.		// Penalty: matching the first pattern character in the middle of a segment.
if (P == 0 && WordRole[W] == Tail)		if (P == 0 && WordRole[W] == Tail)
▲ Show 20 Lines • Show All 99 Lines • Show Last 20 Lines

clang-tools-extra/unittests/clangd/FuzzyMatchTests.cpp

//===-- FuzzyMatchTests.cpp - String fuzzy matcher tests --------- C++ --===//		//===-- FuzzyMatchTests.cpp - String fuzzy matcher tests --------- C++ --===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "FuzzyMatch.h"		#include "FuzzyMatch.h"

#include "llvm/ADT/StringExtras.h"		#include "llvm/ADT/StringExtras.h"
		#include "gmock/gmock-matchers.h"
#include "gmock/gmock.h"		#include "gmock/gmock.h"
#include "gtest/gtest.h"		#include "gtest/gtest.h"

namespace clang {		namespace clang {
namespace clangd {		namespace clangd {
namespace {		namespace {
using testing::Not;		using testing::Not;

▲ Show 20 Lines • Show All 220 Lines • ▼ Show 20 Lines	testing::Matcher<llvm::StringRef> ranks(T... RankedStrings) {
return testing::MakeMatcher<llvm::StringRef>(		return testing::MakeMatcher<llvm::StringRef>(
new RankMatcher{ExpectedMatch(RankedStrings)...});		new RankMatcher{ExpectedMatch(RankedStrings)...});
}		}

TEST(FuzzyMatch, Ranking) {		TEST(FuzzyMatch, Ranking) {
EXPECT_THAT("cons",		EXPECT_THAT("cons",
ranks("[cons]ole", "[Cons]ole", "ArrayBuffer[Cons]tructor"));		ranks("[cons]ole", "[Cons]ole", "ArrayBuffer[Cons]tructor"));
EXPECT_THAT("foo", ranks("[foo]", "[Foo]"));		EXPECT_THAT("foo", ranks("[foo]", "[Foo]"));
EXPECT_THAT("onMes",		EXPECT_THAT("onMes",
ioericUnsubmitted Done Reply Inline Actions could you add a case for "onmes" pattern? ioeric: could you add a case for "onmes" pattern?
ranks("[onMes]sage", "[onmes]sage", "[on]This[M]ega[Es]capes"));		ranks("[onMes]sage", "[onmes]sage", "[on]This[M]ega[Es]capes"));
		EXPECT_THAT("onmes",
		ranks("[onmes]sage", "[onMes]sage", "[on]This[M]ega[Es]capes"));
EXPECT_THAT("CC", ranks("[C]amel[C]ase", "[c]amel[C]ase"));		EXPECT_THAT("CC", ranks("[C]amel[C]ase", "[c]amel[C]ase"));
EXPECT_THAT("cC", ranks("[c]amel[C]ase", "[C]amel[C]ase"));		EXPECT_THAT("cC", ranks("[c]amel[C]ase", "[C]amel[C]ase"));
EXPECT_THAT("p", ranks("[p]", "[p]arse", "[p]osix", "[p]afdsa", "[p]ath"));		EXPECT_THAT("p", ranks("[p]", "[p]arse", "[p]osix", "[p]afdsa", "[p]ath"));
EXPECT_THAT("pa", ranks("[pa]rse", "[pa]th", "[pa]fdsa"));		EXPECT_THAT("pa", ranks("[pa]rse", "[pa]th", "[pa]fdsa"));
EXPECT_THAT("log", ranks("[log]", "Scroll[Log]icalPosition"));		EXPECT_THAT("log", ranks("[log]", "Scroll[Log]icalPosition"));
EXPECT_THAT("e", ranks("[e]lse", "Abstract[E]lement"));		EXPECT_THAT("e", ranks("[e]lse", "Abstract[E]lement"));
EXPECT_THAT("workbench.sideb",		EXPECT_THAT("workbench.sideb",
ranks("[workbench.sideB]ar.location",		ranks("[workbench.sideB]ar.location",
"[workbench.]editor.default[SideB]ySideLayout"));		"[workbench.]editor.default[SideB]ySideLayout"));
EXPECT_THAT("editor.r", ranks("[editor.r]enderControlCharacter",		EXPECT_THAT("editor.r", ranks("[editor.r]enderControlCharacter",
"[editor.]overview[R]ulerlanes",		"[editor.]overview[R]ulerlanes",
"diff[Editor.r]enderSideBySide"));		"diff[Editor.r]enderSideBySide"));
EXPECT_THAT("-mo", ranks("[-mo]z-columns", "[-]ms-ime-[mo]de"));		EXPECT_THAT("-mo", ranks("[-mo]z-columns", "[-]ms-ime-[mo]de"));
EXPECT_THAT("convertModelPosition",		EXPECT_THAT("convertModelPosition",
ranks("[convertModelPosition]ToViewPosition",		ranks("[convertModelPosition]ToViewPosition",
"[convert]ViewTo[ModelPosition]"));		"[convert]ViewTo[ModelPosition]"));
EXPECT_THAT("is", ranks("[is]ValidViewletId", "[i]mport [s]tatement"));		EXPECT_THAT("is", ranks("[is]ValidViewletId", "[i]mport [s]tatement"));
EXPECT_THAT("strcpy", ranks("[strcpy]", "[strcpy]_s"));		EXPECT_THAT("strcpy", ranks("[strcpy]", "[strcpy]_s"));
}		}

// Verify some bounds so we know scores fall in the right range.		// Verify some bounds so we know scores fall in the right range.
// Testing exact scores is fragile, so we prefer Ranking tests.		// Testing exact scores is fragile, so we prefer Ranking tests.
TEST(FuzzyMatch, Scoring) {		TEST(FuzzyMatch, Scoring) {
EXPECT_THAT("abs", matches("[a]w[B]xYz[S]", 0.f));		EXPECT_THAT("abs", matches("[a]w[B]xYz[S]", 7.f / 12.f));
EXPECT_THAT("abs", matches("[abs]l", 1.f));		EXPECT_THAT("abs", matches("[abs]l", 1.f));
EXPECT_THAT("abs", matches("[abs]", 2.f));		EXPECT_THAT("abs", matches("[abs]", 2.f));
EXPECT_THAT("Abs", matches("[abs]", 2.f));		EXPECT_THAT("Abs", matches("[abs]", 2.f));
}		}

		TEST(FuzzyMatch, InitialismAndSegment) {
		// We want these scores to be roughly the same.
		EXPECT_THAT("up", matches("[u]nique_[p]tr", 3.f / 4.f));
		EXPECT_THAT("up", matches("[up]per_bound", 1.f));
		}

// Returns pretty-printed segmentation of Text.		// Returns pretty-printed segmentation of Text.
// e.g. std::basic_string --> +-- +---- +-----		// e.g. std::basic_string --> +-- +---- +-----
std::string segment(llvm::StringRef Text) {		std::string segment(llvm::StringRef Text) {
std::vector<CharRole> Roles(Text.size());		std::vector<CharRole> Roles(Text.size());
calculateRoles(Text, Roles);		calculateRoles(Text, Roles);
std::string Printed;		std::string Printed;
for (unsigned I = 0; I < Text.size(); ++I)		for (unsigned I = 0; I < Text.size(); ++I)
Printed.push_back("?-+ "[static_cast<unsigned>(Roles[I])]);		Printed.push_back("?-+ "[static_cast<unsigned>(Roles[I])]);
Show All 18 Lines