- Remove initialization of Scores[][][] in ctor. This is feasible if we see Scores[I][J][] when I > J as uninitialized.
- Move if (PatN > 0) check to calculateRoles and remove some length checks.
- The length checks can be removed because the dynamic programming is now regularized. (Scores[0][][] took trailing penalty into account while other Scores[*][][] did not before)
- Rename matchBonus and skipPenalty to {match,miss}Score. This eliminates negative forms in the dynamic programming.
Details
- Reviewers
sammccall
Diff Detail
- Repository
- rCTE Clang Tools Extra
- Build Status
Buildable 16549 Build 16549: arc lint + arc unit
Event Timeline
Brings some goodies from https://github.com/cquery-project/cquery/blob/master/src/fuzzy_match.cc (what I plagiarized from clangd)
I would also like to learn how you run tests.
% ninja tools/clang/tools/extra/unittests/clangd/ClangdTests
% tools/clang/tools/extra/unittests/clangd/ClangdTests --gtest_filter='Fuzzy*'
How do you debug?
gdb --args tools/clang/tools/extra/unittests/clangd/ClangdTests --gtest_filter='Fuzzy*'
Or lldb?
Can you elaborate on what this patch is improving, and how?
There are some stylistic changes, and also what look like subtle logic changes, and some rearrangement of the algorithm - to what end?
Canonical way to run all tests is ninja check-clang-tools, the way you suggested is the right thing for rapid iteration with unit tests.
Personally, I don't use a debugger.
clangd/FuzzyMatch.cpp | ||
---|---|---|
93 | this looks like a behavior change - why? | |
242–243 | adding the penalty unconditionally seems like a behavior change, why? | |
clangd/FuzzyMatch.h | ||
61–62 | FWIW, I don't think this is an improvement - I think the clarity of purpose in names is more useful than having consistent signs in this case. |
(Sorry if I sound gruff - I'm sure there are improvements to be had here. But since the code is a bit dense (my fault) I have trouble inferring them from the deltas.
clangd/FuzzyMatch.cpp | ||
---|---|---|
93 | This is a behavior change. Instead of choosing between Match/Miss in the last position, we enumerate the last matching position in Word. This saves if (P < PatN - 1) { check in the main loop at the cost of a for loop here (use sites of ending values) | |
242–243 | Because now we use a different method to calculate the final value. I believe this makes the loop simpler. This was not regular because Scores[0][W + 1][Miss] = {Scores[0][W][Miss].Score + missScore(W, Miss), Miss}; This unconditionally added a trailing penalty but the main loop did not. |
clangd/FuzzyMatch.h | ||
---|---|---|
61–62 | Keep matchBonus but rename skipPenalty to missPenalty ? |
clangd/FuzzyMatch.cpp | ||
---|---|---|
97 | I also don't understand why it clamps the value to zero here. Negative values are also meaningful to me. Given that perfectBonus is only 3 it is very easy to get a negative value here. |
clangd/FuzzyMatch.h | ||
---|---|---|
61–62 | Also note in the original scheme, the skip score does not need to be negative. Because Scores[PatN][WordN][] is used and each path takes the same number of positions (WordN). We can add an offset to all positional scores to make all of them non-negative. In the new scheme it does make sense to make them negative, as each path has now different number of positions. |
Sorry for the delay, it took me a while to understand exactly what everything is doing.
If I understand right, there's actually no functional change (to match logic or scoring) being proposed here.
But some nice fixes indeed!
Most of the comments are readability nits. The existing code is pretty dense and hard to follow (my fault, thanks for picking through it!) so I might have misunderstood some things.
clangd/FuzzyMatch.cpp | ||
---|---|---|
93 | Ah, I see - the case where we match only part of the word is handled up here now. That does make more sense, but it's pretty subtle. | |
97 | I'd prefer to keep this - the empty pattern case is very common, and buildGraph() isn't trivially cheap in this case. | |
97 | An important part of the contract of match() is that it returns a value in [0,1]. It would be possible to use a different squash function here, but I found max(kFloor,x) worked well for the examples I looked at - anything <= some floor value was "not really a useful match at all", and most of the variance below the floor seemed to be noise to me. | |
175–176 | Why this change? Previously this check was dynamic at the callsite in the constructor (which is cold), and omitted in the call to init() which is relatively hot. Generally, here we expect the constructor to be called once per request and match() to be called thousands of times, so it's ok to do some wasteful initialization/work in the constructor, but we should avoid it on the match() path. | |
209 | similarly this one. | |
231 | why this change? Also, shouldn't [0][0][Match] be AwfulScore? | |
326 | nit: I -> P, move increment to the increment expression of the for loop? | |
339 | W is the right name in this file for a variable iterating over word indices, please don't change this. | |
339 | As far as I can see, this loop is setting A[W+1:...] = Miss and populating A[0...W] with the exsting logic. | |
339 | You've shifted P (and the old W, new I) by 1. This does reduce the number of +1 and -1 in this function, but it's inconsistent with how these are used elsewhere: P should be the index into Pat of the character that we're considering. | |
364 | now that the end of the word could be anywhere, it might be nice to add an arrowhead `^' below the table pointing at it :) up to you |
BTW if you're interested in this stuff in clangd, there's some greener-field related stuff too:
Our goal is to be able to do project-wide fuzzy-find navigation and code completion with the global symbol index.
The index implementation in upstream clangd is naive at the moment (iterates over every symbol) but could be made efficient even with the fuzzy find functionality. (Inside google there's a service that does this, but we can't use that code upstream). It's an interesting problem and could be useful for cquery.
We should chat offline if this is at all interesting, even if you don't want to work on it - I'd like to hear more about cquery too :)
clangd/FuzzyMatch.cpp | ||
---|---|---|
231 | "The more expensive per-match call" is just two value assignments. I have removed the expensive table initialization in the constructor. [0][0][*] can be any value. | |
326 | I -> P.
Not sure about the coding standard here, but if you insist I'll have to change it as you are the reviewer. If the loop variable was an iterator, for (It I = std::next(...); I != E; ++I) would be uglier than for (It I = ...; ++I != E; ) | |
339 | I don't understand the rationale not to use the shifted indices. The code actually use Scores[P][W][*] to mean the optimal match of the first P characters of the pattern with the first W characters of the word, not the position of the character. On the other hand, C++ reverse iterators use the shifted one for for (I = rend(); I != rbegin(); ++I). The shifted one makes ending condition check easier. |
clangd/FuzzyMatch.cpp | ||
---|---|---|
93 | Added // Find the optimal prefix of Word to match Pattern. I meant this is a behavior change but it makes the first row and the rest rows of the score table more consistent. |
clangd/FuzzyMatch.cpp | ||
---|---|---|
97 | We could try other criteria in the future. I believe the current one can be improved because negative scores may be returned but the scoring shouldn't return 0 for all the cases. |
clangd/FuzzyMatch.cpp | ||
---|---|---|
93 | That comment really doesn't capture what's significant about this line - it's the *policy*, rather than the mechanism, that needs highlighting here. (Re: behavior change - I *think* there's no inputs for which we produce a different match result/score because of this patch, right?) | |
97 | Sure, we can try other things, and to gather more data. | |
231 |
Oops, sorry - by "more expensive" I mean "called thousands of times more often".
I don't want to be rude, but I asked why you changed this, and you didn't answer. Unless there's a strong reason, I'd prefer to revert this change, as I find this harder to reason about.
Can you please explain why? | |
326 | Uglier is subjective, but side-effects in the condition of a for-loop is sufficiently unusual and surprising that I'd prefer to avoid it in both cases. | |
339 |
The rationale is entirely consistency with the surrounding code. The consistency helps avoid off-by-one errors when similar loops have different conventions. In this file, when looping over word or pattern dimensions, P and W respectively are used for loop variables, and can be interpreted as indices into Pat/Word. | |
clangd/FuzzyMatch.h | ||
61–62 | missPenalty SGTM. (I don't see any particular reason to avoid negative numbers here - it has a natural interpretation: a positive increment means the match is better than if that action wasn't taken, negative means it's worse, etc) |
clangd/FuzzyMatch.cpp | ||
---|---|---|
231 |
It is true that Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss}; is the cost incurred for each word. But it is not full table initialization, it is just two variable assignments. And we will assign to other values of the first row Scores[0][*][*] in the following loop. The old scatters the table construction to two places, the constructor and this dynamic programming site. |
clangd/FuzzyMatch.cpp | ||
---|---|---|
231 |
Can you please explain why? Scores[0][0][*] is the initial value which will be propagated to all other values in the table. If you ignore the max clamp you used later, the initial value does not matter. |
clangd/FuzzyMatch.cpp | ||
---|---|---|
339 | Scores[P][W][*] is interpreted as how good it is if we align the first P characters of the pattern with the first W characters of the word. Note the code uses number of characters instead of the position. Here the new interpretation would be "what we should do for the last character of the first W characters" |
clangd/FuzzyMatch.cpp | ||
---|---|---|
93 | Added your comment. The behavior change is regarding the values of Scores (there is no longer different interpretation for the last character of Pattern) how the final value is chosen (Scores[P][W][*] -> Scores[P][*][Match]) . There is no noticeable change in the user viewpoint. |
There are two viewpoints: position-central and cell-central.
In buildGraph, the nested loop
for (int P = 0; P < PatN; ++P) { // Scores[P + 1][P][Miss] = Scores[P + 1][P][Match] = {AwfulScore, Miss}; for (int W = P; W < WordN; ++W) {
can be interpreted as: we are calculating the cell Scores[P+1][W+1][*], using the characters Pattern[P] and Word[W]. This is a position-central viewpoint. But note the cell we are calculating is Scores[P+1][W+1]. If we rephrase it as:
for (int P = 1; P <= PatN; ++P) { for (int W = P + 1; W <= WordN; ++W) { // Since you like this form (though I don't)
It makes it more clear that we are using the cell indices.
(we are calculating the cell Scores[P][W][*], using the characters Pattern[P-1] and Word[W-1])
The former interpretation is preferred because half closed intervals generally reduce the number of +1 -1 offsets.
In the table dumping stage, I find this cell-central viewpoint
for (int I = W; i > 0; i--)
better than the position-central viewpoint (with my former dynamic programming experience, I have solved numerous problems like this)
for (int I = W - 1; I >= 0; --I)
because we are tracing through the optimal path of the dynamic programming *cells*. We are also tracing the W P positions, but the former interpretation gets rid of some +1 -1.
Don't rename anything. I just want to have this revision reviewed as soon as possible
clangd/FuzzyMatch.cpp | ||
---|---|---|
209 | This is very cheap and dumpLast has checked the emptiness so there is no need to duplicate the work here. |
Thanks for your work on this patch!
I think there a several useful improvements here, which I'd love to see landed. Particularly the logic around matches that end early is much better.
There are also places that change existing design decisions in ways that don't seem to be clear improvements: doing extra work (albeit minor) at match() time, and using different naming conventions for indexes in dumpLast. I see you have strong feelings about these and I do think I understand your arguments, but don't agree as discussed above.
Where should we go from here? One option is to land the pieces we agree on. But it's your patch, and if you'd like an opinion from another clangd maintainer that's fine too; happy to help with that.
clangd/FuzzyMatch.cpp | ||
---|---|---|
231 | Is it not possible that we'll choose a best path starting at Scores[0][0][Match]? |
Glad you took another look. I don't want to yield, let's find another reviewer :)
clangd/FuzzyMatch.cpp | ||
---|---|---|
231 | I think Scores[0][0][Match] is as valid as Scores[0][0][Miss]. The argument is that a Miss state should ensure a skipped position in Text. When there is zero position, it cannot be Missed. A dynamic programming algorithm does not necessarily have only one valid initial state (thinking about the constant term in an indefinite integral). I will choose the one which makes more sense if such initial state exists or if there is one that simplifies the case. In this case treating both of them as valid simplifies the code. |
OK - the people with the most context on this particular code are ilya-biryukov and klimek (but klimek is out this week).
Others who write/review clangd stuff include hokein, malaperle, simark, bkramer. (ioeric is out for a few weeks).
I can give ilya or someone context on the specific changes we're looking at if that's useful.
FWIW, I don't think this is an improvement - I think the clarity of purpose in names is more useful than having consistent signs in this case.