This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Simplify fuzzy matcher (sequence alignment) by removing some condition checks.
Needs ReviewPublic

Authored by MaskRay on Mar 20 2018, 5:06 PM.

Details

Reviewers
sammccall
Summary
  • Remove initialization of Scores[][][] in ctor. This is feasible if we see Scores[I][J][] when I > J as uninitialized.
  • Move if (PatN > 0) check to calculateRoles and remove some length checks.
  • The length checks can be removed because the dynamic programming is now regularized. (Scores[0][][] took trailing penalty into account while other Scores[*][][] did not before)
  • Rename matchBonus and skipPenalty to {match,miss}Score. This eliminates negative forms in the dynamic programming.

Event Timeline

MaskRay created this revision.Mar 20 2018, 5:06 PM
MaskRay added a comment.EditedMar 20 2018, 5:09 PM

Brings some goodies from https://github.com/cquery-project/cquery/blob/master/src/fuzzy_match.cc (what I plagiarized from clangd)

I would also like to learn how you run tests.

% ninja tools/clang/tools/extra/unittests/clangd/ClangdTests
% tools/clang/tools/extra/unittests/clangd/ClangdTests --gtest_filter='Fuzzy*'

How do you debug?

gdb --args tools/clang/tools/extra/unittests/clangd/ClangdTests --gtest_filter='Fuzzy*'

Or lldb?

Can you elaborate on what this patch is improving, and how?
There are some stylistic changes, and also what look like subtle logic changes, and some rearrangement of the algorithm - to what end?

Canonical way to run all tests is ninja check-clang-tools, the way you suggested is the right thing for rapid iteration with unit tests.
Personally, I don't use a debugger.

clangd/FuzzyMatch.cpp
93

this looks like a behavior change - why?

243–244

adding the penalty unconditionally seems like a behavior change, why?

clangd/FuzzyMatch.h
61–62

FWIW, I don't think this is an improvement - I think the clarity of purpose in names is more useful than having consistent signs in this case.

(Sorry if I sound gruff - I'm sure there are improvements to be had here. But since the code is a bit dense (my fault) I have trouble inferring them from the deltas.

MaskRay updated this revision to Diff 139313.Mar 21 2018, 9:54 AM

Update summary

MaskRay edited the summary of this revision. (Show Details)Mar 21 2018, 9:55 AM
MaskRay updated this revision to Diff 139314.Mar 21 2018, 9:56 AM
MaskRay edited the summary of this revision. (Show Details)

Update summary

MaskRay edited the summary of this revision. (Show Details)Mar 21 2018, 9:56 AM
MaskRay added inline comments.Mar 21 2018, 10:01 AM
clangd/FuzzyMatch.cpp
93

This is a behavior change. Instead of choosing between Match/Miss in the last position, we enumerate the last matching position in Word.

This saves if (P < PatN - 1) { check in the main loop at the cost of a for loop here (use sites of ending values)

243–244

Because now we use a different method to calculate the final value. I believe this makes the loop simpler.

This was not regular because

Scores[0][W + 1][Miss] = {Scores[0][W][Miss].Score + missScore(W, Miss),

Miss};

This unconditionally added a trailing penalty but the main loop did not.

MaskRay added inline comments.Mar 21 2018, 10:04 AM
clangd/FuzzyMatch.h
61–62

Keep matchBonus but rename skipPenalty to missPenalty ?

MaskRay added inline comments.Mar 21 2018, 10:07 AM
clangd/FuzzyMatch.cpp
98

I also don't understand why it clamps the value to zero here. Negative values are also meaningful to me. Given that perfectBonus is only 3 it is very easy to get a negative value here.

MaskRay added inline comments.Mar 21 2018, 12:34 PM
clangd/FuzzyMatch.h
61–62

Also note in the original scheme, the skip score does not need to be negative. Because Scores[PatN][WordN][] is used and each path takes the same number of positions (WordN). We can add an offset to all positional scores to make all of them non-negative. In the new scheme it does make sense to make them negative, as each path has now different number of positions.

Friendly ping......

Sorry for the delay, it took me a while to understand exactly what everything is doing.
If I understand right, there's actually no functional change (to match logic or scoring) being proposed here.
But some nice fixes indeed!

Most of the comments are readability nits. The existing code is pretty dense and hard to follow (my fault, thanks for picking through it!) so I might have misunderstood some things.

clangd/FuzzyMatch.cpp
93

Ah, I see - the case where we match only part of the word is handled up here now.
(I think you mean this is not a behavior change? The result is the same AFAICS)

That does make more sense, but it's pretty subtle.
Can you add a comment like
// The pattern doesn't have to match the whole word (but the whole pattern must match).

97

I'd prefer to keep this - the empty pattern case is very common, and buildGraph() isn't trivially cheap in this case.

98

An important part of the contract of match() is that it returns a value in [0,1].
We rely on this range to combine this with other scoring signals - we multiply this by a quality signal in code completion.
(Currently the quality signal is just derived from Sema, but the global index will provide the number of uses).

It would be possible to use a different squash function here, but I found max(kFloor,x) worked well for the examples I looked at - anything <= some floor value was "not really a useful match at all", and most of the variance below the floor seemed to be noise to me.
(Then I tuned the bonuses/penalties so the floor was at zero)

176–177

Why this change? Previously this check was dynamic at the callsite in the constructor (which is cold), and omitted in the call to init() which is relatively hot.

Generally, here we expect the constructor to be called once per request and match() to be called thousands of times, so it's ok to do some wasteful initialization/work in the constructor, but we should avoid it on the match() path.

209

similarly this one.
(ideally we wouldn't do the work above, it's just there to make dumpLast work I think)

232

why this change?
this has also been moved from the cheaper constructor to the more expensive per-match call. (also the diagonal assignment added in the next loop)

Also, shouldn't [0][0][Match] be AwfulScore?

327

nit: I -> P, move increment to the increment expression of the for loop?

340

W is the right name in this file for a variable iterating over word indices, please don't change this.
The new variable above could be EndW or so?

340

As far as I can see, this loop is setting A[W+1:...] = Miss and populating A[0...W] with the exsting logic.
I think this would be clearer as two loops, currently there's a lot of conditionals around Last that obscure what's actually happening.

340

You've shifted P (and the old W, new I) by 1. This does reduce the number of +1 and -1 in this function, but it's inconsistent with how these are used elsewhere: P should be the index into Pat of the character that we're considering.

365

now that the end of the word could be anywhere, it might be nice to add an arrowhead `^' below the table pointing at it :) up to you

BTW if you're interested in this stuff in clangd, there's some greener-field related stuff too:
Our goal is to be able to do project-wide fuzzy-find navigation and code completion with the global symbol index.
The index implementation in upstream clangd is naive at the moment (iterates over every symbol) but could be made efficient even with the fuzzy find functionality. (Inside google there's a service that does this, but we can't use that code upstream). It's an interesting problem and could be useful for cquery.

We should chat offline if this is at all interesting, even if you don't want to work on it - I'd like to hear more about cquery too :)

MaskRay added inline comments.Mar 29 2018, 8:58 AM
clangd/FuzzyMatch.cpp
232

"The more expensive per-match call" is just two value assignments.

I have removed the expensive table initialization in the constructor.

[0][0][*] can be any value.

327

I -> P.

move increment to the increment expression of the for loop?

Not sure about the coding standard here, but if you insist I'll have to change it as you are the reviewer. If the loop variable was an iterator, for (It I = std::next(...); I != E; ++I) would be uglier than for (It I = ...; ++I != E; )

340

I don't understand the rationale not to use the shifted indices. The code actually use Scores[P][W][*] to mean the optimal match of the first P characters of the pattern with the first W characters of the word, not the position of the character.

On the other hand, C++ reverse iterators use the shifted one for for (I = rend(); I != rbegin(); ++I). The shifted one makes ending condition check easier.

MaskRay updated this revision to Diff 140265.Mar 29 2018, 9:03 AM
MaskRay edited the summary of this revision. (Show Details)

Address comments

MaskRay updated this revision to Diff 140268.Mar 29 2018, 9:08 AM

Add comment

// Find the optimal prefix of Word to match Pattern.
MaskRay added inline comments.Mar 29 2018, 9:09 AM
clangd/FuzzyMatch.cpp
93

Added

// Find the optimal prefix of Word to match Pattern.

I meant this is a behavior change but it makes the first row and the rest rows of the score table more consistent.

MaskRay added inline comments.Mar 29 2018, 9:12 AM
clangd/FuzzyMatch.cpp
98

We could try other criteria in the future. I believe the current one can be improved because negative scores may be returned but the scoring shouldn't return 0 for all the cases.

sammccall added inline comments.Mar 29 2018, 11:09 AM
clangd/FuzzyMatch.cpp
93

That comment really doesn't capture what's significant about this line - it's the *policy*, rather than the mechanism, that needs highlighting here.

(Re: behavior change - I *think* there's no inputs for which we produce a different match result/score because of this patch, right?)

98

Sure, we can try other things, and to gather more data.
(To be clear though - with the data I *did* look at, including the scores <0 did not add more information, only noise)

232

"The more expensive per-match call" is just two value assignments.

Oops, sorry - by "more expensive" I mean "called thousands of times more often".

I have removed the expensive table initialization in the constructor.

I don't want to be rude, but I asked why you changed this, and you didn't answer. Unless there's a strong reason, I'd prefer to revert this change, as I find this harder to reason about.
(Roughly: in the old version of the code, any data that didn't need to change for the life of the object was initialized in the constructor. That way I didn't need to worry what was performance-critical and what wasn't - match() only did what was strictly necessary).

[0][0][*] can be any value.

Can you please explain why?

327

Uglier is subjective, but side-effects in the condition of a for-loop is sufficiently unusual and surprising that I'd prefer to avoid it in both cases.

340

I don't understand the rationale not to use the shifted indices

The rationale is entirely consistency with the surrounding code. The consistency helps avoid off-by-one errors when similar loops have different conventions.

In this file, when looping over word or pattern dimensions, P and W respectively are used for loop variables, and can be interpreted as indices into Pat/Word.
Here the interpretation would be "did we match or miss character Word[W]"

clangd/FuzzyMatch.h
61–62

missPenalty SGTM.

(I don't see any particular reason to avoid negative numbers here - it has a natural interpretation: a positive increment means the match is better than if that action wasn't taken, negative means it's worse, etc)

MaskRay updated this revision to Diff 140297.Mar 29 2018, 11:13 AM

missScore -> missPenalty

MaskRay marked 4 inline comments as done.Mar 29 2018, 11:19 AM
MaskRay added inline comments.
clangd/FuzzyMatch.cpp
232

Oops, sorry - by "more expensive" I mean "called thousands of times more often".

It is true that Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss}; is the cost incurred for each word.

But it is not full table initialization, it is just two variable assignments. And we will assign to other values of the first row Scores[0][*][*] in the following loop. The old scatters the table construction to two places, the constructor and this dynamic programming site.

MaskRay added inline comments.Mar 29 2018, 11:25 AM
clangd/FuzzyMatch.cpp
232

[0][0][*] can be any value.

Can you please explain why?

Scores[0][0][*] is the initial value which will be propagated to all other values in the table.
The relative difference of pairwise values in the table is a constant whatever initial value is chosen.

If you ignore the max clamp you used later, the initial value does not matter.

MaskRay added inline comments.Mar 29 2018, 11:30 AM
clangd/FuzzyMatch.cpp
340

Scores[P][W][*] is interpreted as how good it is if we align the first P characters of the pattern with the first W characters of the word.

Note the code uses number of characters instead of the position.

Here the new interpretation would be "what we should do for the last character of the first W characters"

MaskRay updated this revision to Diff 140306.Mar 29 2018, 11:57 AM

nit picking

MaskRay added inline comments.Mar 29 2018, 11:59 AM
clangd/FuzzyMatch.cpp
93

Added your comment.

The behavior change is regarding the values of Scores (there is no longer different interpretation for the last character of Pattern) how the final value is chosen (Scores[P][W][*] -> Scores[P][*][Match]) . There is no noticeable change in the user viewpoint.

MaskRay added a comment.EditedMar 29 2018, 12:11 PM

There are two viewpoints: position-central and cell-central.

In buildGraph, the nested loop

for (int P = 0; P < PatN; ++P) {
    // Scores[P + 1][P][Miss] = Scores[P + 1][P][Match] = {AwfulScore, Miss};
    for (int W = P; W < WordN; ++W) {

can be interpreted as: we are calculating the cell Scores[P+1][W+1][*], using the characters Pattern[P] and Word[W]. This is a position-central viewpoint. But note the cell we are calculating is Scores[P+1][W+1]. If we rephrase it as:

for (int P = 1; P <= PatN; ++P) {
    for (int W = P + 1; W <= WordN; ++W) {     // Since you like this form (though I don't)

It makes it more clear that we are using the cell indices.
(we are calculating the cell Scores[P][W][*], using the characters Pattern[P-1] and Word[W-1])

The former interpretation is preferred because half closed intervals generally reduce the number of +1 -1 offsets.

In the table dumping stage, I find this cell-central viewpoint

for (int I = W; i > 0; i--)

better than the position-central viewpoint (with my former dynamic programming experience, I have solved numerous problems like this)

for (int I = W - 1; I >= 0; --I)

because we are tracing through the optimal path of the dynamic programming *cells*. We are also tracing the W P positions, but the former interpretation gets rid of some +1 -1.

MaskRay updated this revision to Diff 140311.Mar 29 2018, 12:17 PM

Don't rename anything. I just want to have this revision reviewed as soon as possible

MaskRay added inline comments.Mar 29 2018, 12:18 PM
clangd/FuzzyMatch.cpp
209

This is very cheap and dumpLast has checked the emptiness so there is no need to duplicate the work here.

Thanks for your work on this patch!
I think there a several useful improvements here, which I'd love to see landed. Particularly the logic around matches that end early is much better.

There are also places that change existing design decisions in ways that don't seem to be clear improvements: doing extra work (albeit minor) at match() time, and using different naming conventions for indexes in dumpLast. I see you have strong feelings about these and I do think I understand your arguments, but don't agree as discussed above.

Where should we go from here? One option is to land the pieces we agree on. But it's your patch, and if you'd like an opinion from another clangd maintainer that's fine too; happy to help with that.

clangd/FuzzyMatch.cpp
232

Is it not possible that we'll choose a best path starting at Scores[0][0][Match]?
This is invalid, and previously that was signaled by giving that cell AwfulScore, which ensures any path that emerges from it isAwful.

Glad you took another look. I don't want to yield, let's find another reviewer :)

clangd/FuzzyMatch.cpp
232

I think Scores[0][0][Match] is as valid as Scores[0][0][Miss]. The argument is that a Miss state should ensure a skipped position in Text. When there is zero position, it cannot be Missed. A dynamic programming algorithm does not necessarily have only one valid initial state (thinking about the constant term in an indefinite integral). I will choose the one which makes more sense if such initial state exists or if there is one that simplifies the case. In this case treating both of them as valid simplifies the code.

Glad you took another look. I don't want to yield, let's find another reviewer :)

OK - the people with the most context on this particular code are ilya-biryukov and klimek (but klimek is out this week).
Others who write/review clangd stuff include hokein, malaperle, simark, bkramer. (ioeric is out for a few weeks).

I can give ilya or someone context on the specific changes we're looking at if that's useful.