This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Fuzzy match scorer
ClosedPublic

Authored by sammccall on Nov 14 2017, 5:37 PM.

Details

Summary

This will be used for rescoring code completion results based on partial
identifiers.
Short-term use:

  • we want to limit the number of code completion results returned to improve performance of global completion. The scorer will be used to rerank the results to return when the user has applied a filter.

Long-term use case:

  • ranking of completion results from in-memory index
  • merging of completion results from multiple sources (merging usually works best when done at the component-score level, rescoring the fuzzy-match quality avoids different backends needing to have comparable scores)

Diff Detail

Repository
rL LLVM

Event Timeline

sammccall created this revision.Nov 14 2017, 5:37 PM
sammccall updated this revision to Diff 122956.Nov 14 2017, 5:37 PM

clang-format

Harbormaster completed remote builds in B12191: Diff 122956.

Trim memory usage and add comments.

ioeric added a subscriber: ioeric.Nov 15 2017, 4:25 AM
klimek added a subscriber: klimek.Nov 23 2017, 1:41 AM
klimek added inline comments.
clangd/FuzzyMatch.cpp
69 ↗(On Diff #122986)

Why .5?

88 ↗(On Diff #122986)

Why 2 * NPat?

92 ↗(On Diff #122986)

Do you mean "part of the Head or Tail"?
Also, explain that these are the CharRoles. A reader reads this first, and will search for what CharRole means in the code later. CharRole is defined in a different file, without comments, so figuring out how that all relates is super hard :)

101–103 ↗(On Diff #122986)

I think this is the only place where the roles are hinted at. Explain what roles mean and what we need them for.

108 ↗(On Diff #122986)

I'd spell out the numbers, as they are important (here and for CharRole).

110 ↗(On Diff #122986)

Finding bugs in these will be hard :)

120 ↗(On Diff #122986)

Can you expand in the comment why this works for utf-8?

137 ↗(On Diff #122986)

The body of this needs more comments on what it does. I can slowly figure it out by doing bit math, but it should be spelled out what's expected to be in each value at each point.

207 ↗(On Diff #122986)

Perhaps add assert(LPat[P] == LWord[W]);

212 ↗(On Diff #122986)

Why does P == W imply that?

215 ↗(On Diff #122986)

This is the first time the term "asserted word break" shows up, perhaps explain this when explaining the roles.

218 ↗(On Diff #122986)

The previous what didn't match?

clangd/FuzzyMatch.h
31 ↗(On Diff #122986)

Document that patterns larger than MaxPat will be silently cut.

39 ↗(On Diff #122986)

I find most of the abbreviations here non-intuitive, and thus needing comments (Pat I get is for pattern :)
N - what does it mean? Number of characters in the pattern? I'd use Length instead.
LPat and LWord (no idea what L could stand for).

50 ↗(On Diff #122986)

I'd use a StringRef instead, and call the storage *Storage or something.

60 ↗(On Diff #122986)

Comment that this is not actually used inside the algorithm, just for debugging.

sammccall updated this revision to Diff 124096.Nov 23 2017, 8:54 AM
sammccall marked 15 inline comments as done.

Addressing review comments and generally improving comments.

Thanks for the review, and sorry for the subtlety of the code and sparse comments.
It should be a little better now, please let me know which parts aren't clear enough.

clangd/FuzzyMatch.cpp
69 ↗(On Diff #122986)

The .5 and the 2 are the same thing. Extracted a constant with a comment.

92 ↗(On Diff #122986)

Rewrote this section and added more comments. CharRole is defined here now.

Each character in a segment that isn't the Head is the Tail. It's a bit of a stretch, but it's short and evocative and (now) explained with examples.

110 ↗(On Diff #122986)

Ack :-(

120 ↗(On Diff #122986)

Done. It doesn't really "work", so much as we just give up...

212 ↗(On Diff #122986)

Every pattern character must match in order, so a match with P < W is impossible, and P == W means the match is perfect so far.
(Also explained in the comment)

215 ↗(On Diff #122986)

Rephrased the comment here to use the familiar terminology.

clangd/FuzzyMatch.h
39 ↗(On Diff #122986)

Comments added throughout.

N - what does it mean? Number of characters in the pattern? I'd use Length instead.

WordLength is too long for something so common I think, these each have >20 refs. Changed NWord -> WordN which I think reads better - Word and WordN form a nice pair.

(N rather than L for length because of confusion with Lower)

Changed LWord to LowWord etc.

50 ↗(On Diff #122986)

What's the goal here?

I have a couple of objections to this:

  • if you actually use StringRef[] to access the data, now you've got a gratuitous indirection everywhere
  • For LPat/LWord too? Now we have *more* members than in the first place, and two ways to write each bounds check.

If the intent is to clean up the places where I construct StringRef(Word, NWord) explicitly, adding StringRef word() would certainly make sense.

inspirer added inline comments.
clangd/FuzzyMatch.cpp
254 ↗(On Diff #124096)

You need a third boolean dimension in your DP table for this condition to work - "matches".

Consider matching "Abde" against "AbdDe". The result should be [Ab]d[De] and not [Abd]D[e]. While evaluating Abd against AbdD, you will have to choose between two ways to represent the match and no matter what you choose, scoring in this line will not know whether your previous char matched, since you merged two branches and kept only one of them.

This scoring works OK-ish since you check "if (Diag >= Left)" above and so you Matched table is full of trues, but you matches will gravitate towards the ends of the candidate string if you decide to show them in the UI.

ilya-biryukov added inline comments.Nov 29 2017, 3:08 AM
clangd/FuzzyMatch.cpp
118 ↗(On Diff #124096)

I'm not sure if we care, but maybe we should treat +, - and other symbols that could be in operator names (e.g., operator +) differently for C++.
Could also make sense for other languages with overloaded operators.

244 ↗(On Diff #124096)

Maybe use P == 0 instead? It's equivalent, but a bit easier to read if you think of P as an offset.
Totally subjective, though, it's fine to have it either way.

245 ↗(On Diff #124096)

Does it mean I will get no matches in the following situation?

Items = [printf, scanf]
Pattern = f

It may be a bit confusing, since I do have a match, even though is terrible and it's ok to put those items very low in the list.
A more real example is:
Items = [fprintf, fscanf]
Pattern = print

Would fprintf match in that case? I think it should.

Another important one:
Items = [istream, ostream]
Pattern = stream

clangd/FuzzyMatch.h
31 ↗(On Diff #124096)

Maybe move the truncating logic into the clients? The users of this code are certainly better suited to report warnings/reject requests that are too large.

53 ↗(On Diff #124096)

Maybe we could split the data we store into two sections:

  1. Pattern-specific data. Initialized on construction, never changed later.
  2. Per-match data. Initialized per match() call.

Otherwise it is somewhat hard to certify whether everything is being initialized properly.

sammccall marked 5 inline comments as done.
  • added more VSCode tests, and made test assert matched characters. This uncovered algorithm problems
  • cache now includes "did previous character match" in the key (scoring depends on this, so we gave incorrect results)
  • added a penalty for non-consecutive matches
  • first character matching inside a segment downgraded from a ban to a penalty This allows [stream] to match "istream"
  • don't award case bonuses if the query is all lowercase. This helps matches like [ccm] -> [c]ode[C]ompletec[m] compete with [c]odeComplete[cm]

Thanks @ilya-biryukov, @inspirer, @klimek for the helpful comments!

I've addressed hopefully the most important and added more rigorous testing.
Sorry for the large delta, the most invasive change was of course adding the extra dimension to the scoring table. (Which fixed a bunch of problems)

clangd/FuzzyMatch.cpp
118 ↗(On Diff #124096)

You might be right, but in the absence of concrete problems I think treating them as punctuation is actually the most conservative thing to do.

E.g. matching [op=] against "operator=" gets big penalties if we treat '=' as Lower, and treating it as Upper seems likely to have other weird effects... Punctuation/separators are treated pretty neutrally.

245 ↗(On Diff #124096)

Done. VSCode will filter these out, but I agree these are important and don't seem to cause problems.

254 ↗(On Diff #124096)

Thank you for this! Fixed. The naming around Scores/ScoreInfo is a bit clumsy, happy to take suggestions :-(

I've also made all our tests assert the exact characters matched. We don't have an API or need this feature, but it makes the tests detect a lot more misbehavior that's hard to capture otherwise.

clangd/FuzzyMatch.h
53 ↗(On Diff #124096)

This hides the parallels between the Pattern and Word data, I'm not sure I like it better overall.

I've added a comment describing this split, reordered some variables, and renamed IsSubstring to WordContainsPattern, which I think clarifies this a bit. WDYT?

ilya-biryukov accepted this revision.Dec 1 2017, 8:07 AM

LGTM.

clangd/FuzzyMatch.h
53 ↗(On Diff #124096)

I'd prefer grouping the fields by their lifetime in that case, because it makes certifying that everything was properly initialized easier. Which is especially a big deal when changing code to avoid silly initialization-related bugs.
Grouping by meaning also makes lots of sense, of course, but logical relations are only hard to grasp when reading the code and don't usually cause subtle bugs when rewriting the code. And proper comments allow to reintroduce those logical parallels.

But that could be accounted to my personal preference, so feel free to leave the code as is. Just wanted to clarify my point a bit more.

This revision is now accepted and ready to land.Dec 1 2017, 8:07 AM
sammccall marked an inline comment as done.Dec 1 2017, 9:08 AM

I'd broken the scoring scale with the last few tweaks:

  • The harsh pattern-split penalty was driving too many decent matches to 0 score
  • The case-insensitive change resulted in some perfect prefix matches not getting perfect scores

Added tweaks to address these. Match quality is now 0-3, with default being 1.
Happy to make followup changes, but this seems unlikely to be controversial :-)

clangd/FuzzyMatch.h
53 ↗(On Diff #124096)

Makes sense. I've split the fields as you suggest, it also reads well.

This revision was automatically updated to reflect the committed changes.
sammccall marked an inline comment as done.