This is an archive of the discontinued LLVM Phabricator instance.

Differential D44720

[clangd] Simplify fuzzy matcher (sequence alignment) by removing some condition checks.
Needs Review · Public

Authored by MaskRay on Mar 20 2018, 5:06 PM.


Details

Reviewers

sammccall

Summary

Remove the initialization of Scores[][][] in the constructor. This is feasible if we treat Scores[I][J][] with I > J as uninitialized.
Move the if (PatN > 0) check into calculateRoles and remove some length checks.
The length checks can be removed because the dynamic programming is now regular. (Previously, Scores[0][][] took the trailing penalty into account while the other rows Scores[*][][] did not.)
Rename matchBonus and skipPenalty to {match,miss}Score. This eliminates negated forms in the dynamic programming.

Diff Detail

Repository

rCTE Clang Tools Extra

Build Status

Buildable 16555
Build 16555: arc lint + arc unit

Event Timeline

MaskRay created this revision. Mar 20 2018, 5:06 PM

Herald added subscribers: cfe-commits, ioeric, jkorous-apple and 2 others. Mar 20 2018, 5:06 PM

Harbormaster completed remote builds in B16295: Diff 139230. Mar 20 2018, 5:06 PM

Brings some goodies from https://github.com/cquery-project/cquery/blob/master/src/fuzzy_match.cc (what I plagiarized from clangd)

I would also like to learn how you run tests.

% ninja tools/clang/tools/extra/unittests/clangd/ClangdTests
% tools/clang/tools/extra/unittests/clangd/ClangdTests --gtest_filter='Fuzzy*'

How do you debug?

gdb --args tools/clang/tools/extra/unittests/clangd/ClangdTests --gtest_filter='Fuzzy*'

Or lldb?

Can you elaborate on what this patch is improving, and how?
There are some stylistic changes, and also what look like subtle logic changes, and some rearrangement of the algorithm - to what end?

Canonical way to run all tests is ninja check-clang-tools, the way you suggested is the right thing for rapid iteration with unit tests.
Personally, I don't use a debugger.

clangd/FuzzyMatch.cpp
93	this looks like a behavior change - why?
243–244	adding the penalty unconditionally seems like a behavior change, why?
clangd/FuzzyMatch.h
62	FWIW, I don't think this is an improvement - I think the clarity of purpose in names is more useful than having consistent signs in this case.

(Sorry if I sound gruff - I'm sure there are improvements to be had here. But since the code is a bit dense (my fault), I have trouble inferring them from the deltas.)

Update summary

Harbormaster completed remote builds in B16314: Diff 139313. Mar 21 2018, 9:55 AM

MaskRay edited the summary of this revision. Mar 21 2018, 9:55 AM

Update summary

Harbormaster completed remote builds in B16315: Diff 139314. Mar 21 2018, 9:56 AM

MaskRay edited the summary of this revision. Mar 21 2018, 9:56 AM

MaskRay added inline comments. Mar 21 2018, 10:01 AM

clangd/FuzzyMatch.cpp
93	This is a behavior change. Instead of choosing between `Match/Miss` in the last position, we enumerate the last matching position in `Word`. This saves the `if (P < PatN - 1) {` check in the main loop at the cost of a for loop here (at the use sites of the ending values).
243–244	Because we now use a different method to calculate the final value, which I believe makes the loop simpler. The old code was not regular because `Scores[0][W + 1][Miss] = {Scores[0][W][Miss].Score + missScore(W, Miss), Miss};` unconditionally added a trailing penalty but the main loop did not.
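
The scheme described in these two comments can be sketched as a toy alignment. This is a minimal illustration, not clangd's implementation: `bestPrefixScore`, `extend`, `kMatchBonus` and `kMissPenalty` are hypothetical names with made-up values, and the real matcher uses fixed-size tables and role-aware bonuses. The point shown is the final loop, which takes the best `Match` cell over all end positions in the word, so misses after the last matched character are never charged.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy scores; clangd's real bonuses and penalties differ.
constexpr int kAwful = -(1 << 13);
constexpr int kMatchBonus = 2;
constexpr int kMissPenalty = -1;

bool isAwful(int S) { return S < kAwful / 2; }

// Adds Delta to a score, keeping unreachable states unreachable.
int extend(int S, int Delta) { return isAwful(S) ? kAwful : S + Delta; }

// Best score for matching all of Pat against some prefix of Word.
// Returns an awful score if Pat is not a subsequence of Word.
int bestPrefixScore(const std::string &Pat, const std::string &Word) {
  enum { Miss = 0, Match = 1 };
  const int PatN = static_cast<int>(Pat.size());
  const int WordN = static_cast<int>(Word.size());
  // Mirrors init(), which rejects patterns longer than the word.
  assert(PatN <= WordN);

  // S[P][W][A]: best score aligning the first P pattern characters with the
  // first W word characters, where A is the action taken on Word[W - 1].
  using Cell = std::array<int, 2>;
  const Cell Unreachable = {kAwful, kAwful};
  std::vector<std::vector<Cell>> S(PatN + 1,
                                   std::vector<Cell>(WordN + 1, Unreachable));
  S[0][0] = {0, 0};
  for (int W = 0; W < WordN; ++W)
    S[0][W + 1][Miss] =
        extend(std::max(S[0][W][Miss], S[0][W][Match]), kMissPenalty);

  for (int P = 0; P < PatN; ++P)
    for (int W = P; W < WordN; ++W) {
      int PrevSameRow = std::max(S[P + 1][W][Miss], S[P + 1][W][Match]);
      S[P + 1][W + 1][Miss] = extend(PrevSameRow, kMissPenalty);
      if (Pat[P] == Word[W]) {
        int PrevDiagonal = std::max(S[P][W][Miss], S[P][W][Match]);
        S[P + 1][W + 1][Match] = extend(PrevDiagonal, kMatchBonus);
      }
    }

  // Enumerate the last matching position instead of reading S[PatN][WordN].
  int Best = kAwful;
  for (int W = PatN; W <= WordN; ++W)
    Best = std::max(Best, S[PatN][W][Match]);
  return Best;
}
```

With these toy values, `bestPrefixScore("ab", "ab")` and `bestPrefixScore("ab", "abxxxx")` give the same score, because trailing characters of the word are not part of the chosen prefix.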

MaskRay added inline comments. Mar 21 2018, 10:04 AM

clangd/FuzzyMatch.h
62	Keep `matchBonus` but rename `skipPenalty` to `missPenalty` ?

MaskRay added inline comments. Mar 21 2018, 10:07 AM

clangd/FuzzyMatch.cpp
98	I also don't understand why it clamps the value to zero here. Negative values are also meaningful to me. Given that perfectBonus is only 3 it is very easy to get a negative value here.

MaskRay added inline comments. Mar 21 2018, 12:34 PM

clangd/FuzzyMatch.h
62	Also note that in the original scheme, the skip score does not need to be negative. Because Scores[PatN][WordN][] is used, each path takes the same number of positions (WordN), so we can add an offset to all positional scores to make all of them non-negative. In the new scheme it does make sense to make them negative, as each path now has a different number of positions.
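
The offset argument can be written out as a short derivation. This is only a sketch: the symbols $s_w$ (the positional score at word position $w$), $c$ (a constant offset) and $E$ (the word position where a path ends) are notation introduced here, not names from the patch.

```latex
% Old scheme: every path covers all WordN positions, so a constant offset c
% shifts every path by the same amount and the ranking is unchanged.
\[
  \sum_{w=1}^{\mathrm{WordN}} (s_w + c)
  \;=\; \sum_{w=1}^{\mathrm{WordN}} s_w \;+\; c \cdot \mathrm{WordN}
\]
% New scheme: a path ends at its last matched position E, with
% PatN <= E <= WordN, so the shift depends on the path.
\[
  \sum_{w=1}^{E} (s_w + c)
  \;=\; \sum_{w=1}^{E} s_w \;+\; c \cdot E
\]
```

Since $c \cdot E$ differs between paths that end at different positions, an offset would change which path wins in the new scheme, which is why the sign of the miss score matters there.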

Friendly ping......

Sorry for the delay, it took me a while to understand exactly what everything is doing.
If I understand right, there's actually no functional change (to match logic or scoring) being proposed here.
But some nice fixes indeed!

Most of the comments are readability nits. The existing code is pretty dense and hard to follow (my fault, thanks for picking through it!) so I might have misunderstood some things.

clangd/FuzzyMatch.cpp
93	Ah, I see - the case where we match only part of the word is handled up here now. (I think you mean this is not a behavior change? The result is the same AFAICS) That does make more sense, but it's pretty subtle. Can you add a comment like `// The pattern doesn't have to match the whole word (but the whole pattern must match).`
97	I'd prefer to keep this - the empty pattern case is very common, and buildGraph() isn't trivially cheap in this case.
98	An important part of the contract of `match()` is that it returns a value in `[0,1]`. We rely on this range to combine this with other scoring signals - we multiply this by a quality signal in code completion. (Currently the quality signal is just derived from Sema, but the global index will provide the number of uses). It would be possible to use a different squash function here, but I found max(kFloor,x) worked well for the examples I looked at - anything <= some floor value was "not really a useful match at all", and most of the variance below the floor seemed to be noise to me. (Then I tuned the bonuses/penalties so the floor was at zero)
176–177	Why this change? Previously this check was dynamic at the callsite in the constructor (which is cold), and omitted in the call to init() which is relatively hot. Generally, here we expect the constructor to be called once per request and match() to be called thousands of times, so it's ok to do some wasteful initialization/work in the constructor, but we should avoid it on the match() path.
209	similarly this one. (ideally we wouldn't do the work above, it's just there to make dumpLast work I think)
232	why this change? this has also been moved from the cheaper constructor to the more expensive per-match call. (also the diagonal assignment added in the next loop) Also, shouldn't [0][0][Match] be AwfulScore?
327	nit: I -> P, move increment to the increment expression of the for loop?
340	W is the right name in this file for a variable iterating over word indices, please don't change this. The new variable above could be EndW or so?
340	As far as I can see, this loop is setting `A[W+1:...] = Miss` and populating `A[0...W]` with the existing logic. I think this would be clearer as two loops; currently there are a lot of conditionals around Last that obscure what's actually happening.
340	You've shifted P (and the old W, new I) by 1. This does reduce the number of +1 and -1 in this function, but it's inconsistent with how these are used elsewhere: P should be the index into Pat of the character that we're considering.
365	Now that the end of the word could be anywhere, it might be nice to add an arrowhead `^` below the table pointing at it :) Up to you.

BTW if you're interested in this stuff in clangd, there's some greener-field related stuff too:
Our goal is to be able to do project-wide fuzzy-find navigation and code completion with the global symbol index.
The index implementation in upstream clangd is naive at the moment (iterates over every symbol) but could be made efficient even with the fuzzy find functionality. (Inside google there's a service that does this, but we can't use that code upstream). It's an interesting problem and could be useful for cquery.

We should chat offline if this is at all interesting, even if you don't want to work on it - I'd like to hear more about cquery too :)

MaskRay added inline comments. Mar 29 2018, 8:58 AM

clangd/FuzzyMatch.cpp
232	"The more expensive per-match call" is just two value assignments. I have removed the expensive table initialization in the constructor. [0][0][*] can be any value.
327	> I -> P. move increment to the increment expression of the for loop?
	Not sure about the coding standard here, but if you insist I'll change it, as you are the reviewer. If the loop variable were an iterator, `for (It I = std::next(...); I != E; ++I)` would be uglier than `for (It I = ...; ++I != E; )`
340	I don't understand the rationale not to use the shifted indices. The code actually uses `Scores[P][W][*]` to mean the optimal match of the first `P` characters of the pattern with the first `W` characters of the word, not the position of the character. Similarly, C++ reverse iterators use the shifted one for `for (I = rbegin(); I != rend(); ++I)`. The shifted one makes the ending condition check easier.

Address comments

Harbormaster completed remote builds in B16543: Diff 140265. Mar 29 2018, 9:03 AM

Add comment

// Find the optimal prefix of Word to match Pattern.

Harbormaster completed remote builds in B16544: Diff 140268. Mar 29 2018, 9:08 AM

MaskRay added inline comments. Mar 29 2018, 9:09 AM

clangd/FuzzyMatch.cpp
93	Added `// Find the optimal prefix of Word to match Pattern.` I meant this is a behavior change, but it makes the first row and the remaining rows of the score table more consistent.

MaskRay added inline comments. Mar 29 2018, 9:12 AM

clangd/FuzzyMatch.cpp
98	We could try other criteria in the future. I believe the current one can be improved because negative scores may be returned but the scoring shouldn't return 0 for all the cases.
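
For reference, the clamp under discussion at line 98 can be isolated into a small helper. This is a sketch under assumptions: `normalizeScore` and `kPerfectBonus` are hypothetical names standing in for the `ScoreScale` and `PerfectBonus` expression in the diff, and the empty-pattern case is left to the caller.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kPerfectBonus = 3; // Same value as PerfectBonus in the diff.

// Maps a raw alignment score onto [0, 1]: scores below the floor (0) become
// 0, and scores above the nominal maximum (kPerfectBonus * PatN) become 1.
float normalizeScore(int Best, int PatN) {
  assert(PatN > 0 && "the empty pattern is handled separately");
  const float ScoreScale = 1.0f / (kPerfectBonus * PatN);
  return ScoreScale * std::min(kPerfectBonus * PatN, std::max(0, Best));
}
```

Under this helper every negative raw score maps to the same value, 0, which is the property the two participants disagree about: one side treats the variance below the floor as noise, the other as potentially meaningful.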

sammccall added inline comments. Mar 29 2018, 11:09 AM

clangd/FuzzyMatch.cpp
93	That comment really doesn't capture what's significant about this line - it's the policy, rather than the mechanism, that needs highlighting here. (Re: behavior change - I think there are no inputs for which we produce a different match result/score because of this patch, right?)
98	Sure, we can try other things, and gather more data. (To be clear though - with the data I did look at, including the scores <0 did not add more information, only noise)
232	> "The more expensive per-match call" is just two value assignments.
	Oops, sorry - by "more expensive" I mean "called thousands of times more often".
	> I have removed the expensive table initialization in the constructor.
	I don't want to be rude, but I asked why you changed this, and you didn't answer. Unless there's a strong reason, I'd prefer to revert this change, as I find this harder to reason about. (Roughly: in the old version of the code, any data that didn't need to change for the life of the object was initialized in the constructor. That way I didn't need to worry what was performance-critical and what wasn't - match() only did what was strictly necessary).
	> [0][0][*] can be any value.
	Can you please explain why?
327	Uglier is subjective, but side effects in the condition of a for-loop are sufficiently unusual and surprising that I'd prefer to avoid them in both cases.
340	> I don't understand the rationale not to use the shifted indices
	The rationale is entirely consistency with the surrounding code. The consistency helps avoid off-by-one errors when similar loops have different conventions. In this file, when looping over word or pattern dimensions, P and W respectively are used for loop variables, and can be interpreted as indices into Pat/Word. Here the interpretation would be "did we match or miss character Word[W]"
clangd/FuzzyMatch.h
62	missPenalty SGTM. (I don't see any particular reason to avoid negative numbers here - it has a natural interpretation: a positive increment means the match is better than if that action wasn't taken, negative means it's worse, etc)

missScore -> missPenalty

MaskRay marked 4 inline comments as done. Mar 29 2018, 11:19 AM

MaskRay added inline comments.

clangd/FuzzyMatch.cpp
232	> Oops, sorry - by "more expensive" I mean "called thousands of times more often".
	It is true that `Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss};` is a cost incurred for each word. But it is not a full table initialization, just two variable assignments, and the following loop assigns the other values of the first row `Scores[0][][]` anyway. The old code scatters the table construction across two places: the constructor and this dynamic programming site.

MaskRay added inline comments. Mar 29 2018, 11:25 AM

clangd/FuzzyMatch.cpp
232	> [0][0][*] can be any value. Can you please explain why?
	`Scores[0][0][*]` is the initial value, which is propagated to all other values in the table. The difference between any pair of values in the table is constant whatever initial value is chosen. If you ignore the max clamp you used later, the initial value does not matter.
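
The propagation claim can be checked on a toy alignment with a single seed cell. This is a hedged sketch, not the patch's code: `alignScore`, `kUnreachable` and the +2 / -1 scores are hypothetical, and the toy collapses the Match/Miss pair into one value per cell, so it does not address the separate question of whether `[0][0][Match]` should be reachable at all.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <string>
#include <vector>

constexpr int kUnreachable = std::numeric_limits<int>::min();

// Toy alignment: F[P][W] is the best score for matching the first P pattern
// characters within the first W word characters, starting from F[0][0] = Seed.
// A match adds 2, a skipped word character adds -1 (made-up values).
int alignScore(const std::string &Pat, const std::string &Word, int Seed) {
  assert(Seed != kUnreachable);
  const int PatN = static_cast<int>(Pat.size());
  const int WordN = static_cast<int>(Word.size());
  std::vector<std::vector<int>> F(PatN + 1,
                                  std::vector<int>(WordN + 1, kUnreachable));
  F[0][0] = Seed;
  for (int W = 0; W < WordN; ++W)
    F[0][W + 1] = F[0][W] - 1;
  for (int P = 0; P < PatN; ++P)
    for (int W = P; W < WordN; ++W) {
      int Best = kUnreachable;
      if (F[P + 1][W] != kUnreachable) // Skip Word[W].
        Best = F[P + 1][W] - 1;
      if (Pat[P] == Word[W] && F[P][W] != kUnreachable) // Match Word[W].
        Best = std::max(Best, F[P][W] + 2);
      F[P + 1][W + 1] = Best;
    }
  return F[PatN][WordN];
}
```

Every reachable cell is the seed plus a sum of increments along some path, so changing the seed from 0 to 10 raises each reachable score by exactly 10 and leaves the choice of best path untouched. A later clamp such as `max(0, Best)` is not shift-invariant, which matches the caveat in the comment.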

MaskRay added inline comments. Mar 29 2018, 11:30 AM

clangd/FuzzyMatch.cpp
340	`Scores[P][W][*]` is interpreted as how good it is if we align the first `P` characters of the pattern with the first `W` characters of the word. Note the code uses `number of characters` instead of the position. Here the new interpretation would be "what we should do for the last character of the first W characters"

nit picking

Harbormaster completed remote builds in B16554: Diff 140306. Mar 29 2018, 11:57 AM

MaskRay added inline comments. Mar 29 2018, 11:59 AM

clangd/FuzzyMatch.cpp
93	Added your comment. The behavior change concerns the values of `Scores` (there is no longer a different interpretation for the last character of Pattern) and how the final value is chosen (Scores[P][W][] -> Scores[P][][Match]). There is no noticeable change from the user's viewpoint.

There are two viewpoints: position-central and cell-central.

In buildGraph, the nested loop

for (int P = 0; P < PatN; ++P) {
    // Scores[P + 1][P][Miss] = Scores[P + 1][P][Match] = {AwfulScore, Miss};
    for (int W = P; W < WordN; ++W) {

can be interpreted as: we are calculating the cell Scores[P+1][W+1][*], using the characters Pattern[P] and Word[W]. This is a position-central viewpoint. But note the cell we are calculating is Scores[P+1][W+1]. If we rephrase it as:

for (int P = 1; P <= PatN; ++P) {
    for (int W = P; W <= WordN; ++W) {     // Since you like this form (though I don't)

It makes it more clear that we are using the cell indices.
(we are calculating the cell Scores[P][W][*], using the characters Pattern[P-1] and Word[W-1])

The former interpretation is preferred because half closed intervals generally reduce the number of +1 -1 offsets.

In the table dumping stage, I find this cell-central viewpoint

for (int I = W; I > 0; I--)

better than the position-central viewpoint (with my former dynamic programming experience, I have solved numerous problems like this)

for (int I = W - 1; I >= 0; --I)

because we are tracing through the optimal path of the dynamic programming *cells*. We are also tracing the W P positions, but the former interpretation gets rid of some +1 -1.
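
The two viewpoints can be compared by listing the cells each loop shape visits. This is an illustrative sketch with hypothetical helper names (`positionCentral`, `cellCentral`); it assumes the cell-central inner loop starts at `W = P`, which is the bound that covers the same cells as the position-central form.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Cell = std::pair<int, int>; // (pattern cell index, word cell index)

// Position-central: P and W index characters Pat[P] and Word[W]; the cell
// being filled is Scores[P + 1][W + 1].
std::vector<Cell> positionCentral(int PatN, int WordN) {
  assert(PatN >= 0 && WordN >= 0);
  std::vector<Cell> Cells;
  for (int P = 0; P < PatN; ++P)
    for (int W = P; W < WordN; ++W)
      Cells.emplace_back(P + 1, W + 1);
  return Cells;
}

// Cell-central: P and W index the cell Scores[P][W] directly; the characters
// consulted are Pat[P - 1] and Word[W - 1].
std::vector<Cell> cellCentral(int PatN, int WordN) {
  assert(PatN >= 0 && WordN >= 0);
  std::vector<Cell> Cells;
  for (int P = 1; P <= PatN; ++P)
    for (int W = P; W <= WordN; ++W)
      Cells.emplace_back(P, W);
  return Cells;
}
```

Both functions return the same cells in the same order, so the disagreement in the thread is about which reading is less error-prone next to the surrounding code, not about what gets computed.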

Don't rename anything. I just want to have this revision reviewed as soon as possible

Harbormaster completed remote builds in B16555: Diff 140311. Mar 29 2018, 12:17 PM

MaskRay added inline comments. Mar 29 2018, 12:18 PM

clangd/FuzzyMatch.cpp
209	This is very cheap and dumpLast has checked the emptiness so there is no need to duplicate the work here.

Thanks for your work on this patch!
I think there are several useful improvements here, which I'd love to see landed. In particular, the logic around matches that end early is much better.

There are also places that change existing design decisions in ways that don't seem to be clear improvements: doing extra work (albeit minor) at match() time, and using different naming conventions for indexes in dumpLast. I see you have strong feelings about these and I do think I understand your arguments, but don't agree as discussed above.

Where should we go from here? One option is to land the pieces we agree on. But it's your patch, and if you'd like an opinion from another clangd maintainer that's fine too; happy to help with that.

clangd/FuzzyMatch.cpp
232	Is it not possible that we'll choose a best path starting at Scores[0][0][Match]? This is invalid, and previously that was signaled by giving that cell AwfulScore, which ensures any path that emerges from it isAwful.
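
The sentinel behaviour described here can be illustrated with the constants from the diff. This is a sketch under assumptions: `AssumedMaxPat` is a hypothetical stand-in because the value of `MaxPat` is not shown on this page, `bestCaseFromAwful` is a made-up helper, and the bound assumes no pattern character earns more than `PerfectBonus`, which the source comments say is not a strict limit.

```cpp
#include <cassert>

constexpr int AwfulScore = -(1 << 13);
constexpr int PerfectBonus = 3;
constexpr int AssumedMaxPat = 63; // Hypothetical; MaxPat is not shown here.

bool isAwful(int S) { return S < AwfulScore / 2; }

// Highest score a path could reach if it started from an awful cell and then
// earned PerfectBonus for each of PatN pattern characters.
int bestCaseFromAwful(int PatN) {
  assert(PatN >= 0 && PatN <= AssumedMaxPat);
  return AwfulScore + PerfectBonus * PatN;
}
```

Because `isAwful` tests against half of `AwfulScore`, a path that starts from an awful cell stays awful even in its best case under these assumptions, so giving an invalid starting cell `AwfulScore` is enough to rule out every path through it.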

Glad you took another look. I don't want to yield, let's find another reviewer :)

clangd/FuzzyMatch.cpp
232	I think `Scores[0][0][Match]` is as valid as `Scores[0][0][Miss]`. The argument is that a `Miss` state should ensure a skipped position in Text. When there is zero position, it cannot be `Miss`ed. A dynamic programming algorithm does not necessarily have only one valid initial state (thinking about the constant term in an indefinite integral). I will choose the one which makes more sense if such initial state exists or if there is one that simplifies the case. In this case treating both of them as valid simplifies the code.

In D44720#1055997, @MaskRay wrote:

Glad you took another look. I don't want to yield, let's find another reviewer :)

OK - the people with the most context on this particular code are ilya-biryukov and klimek (but klimek is out this week).
Others who write/review clangd stuff include hokein, malaperle, simark, bkramer. (ioeric is out for a few weeks).

I can give ilya or someone context on the specific changes we're looking at if that's useful.

Revision Contents

Path	Size
clangd/FuzzyMatch.h	2 lines
clangd/FuzzyMatch.cpp	93 lines

Diff 140311

clangd/FuzzyMatch.h

[52 lines not shown]
private:
// - GCC 4.8 complains not all values fit if the type is unsigned		// - GCC 4.8 complains not all values fit if the type is unsigned
using Action = bool;		using Action = bool;
constexpr static Action Miss = false, Match = true;		constexpr static Action Miss = false, Match = true;

bool init(llvm::StringRef Word);		bool init(llvm::StringRef Word);
void buildGraph();		void buildGraph();
void calculateRoles(const char *Text, CharRole *Out, int &Types, int N);		void calculateRoles(const char *Text, CharRole *Out, int &Types, int N);
bool allowMatch(int P, int W) const;		bool allowMatch(int P, int W) const;
int skipPenalty(int W, Action Last) const;		int missPenalty(int W, Action Last) const;
int matchBonus(int P, int W, Action Last) const;		int matchBonus(int P, int W, Action Last) const;

// Pattern data is initialized by the constructor, then constant.		// Pattern data is initialized by the constructor, then constant.
char Pat[MaxPat]; // Pattern data		char Pat[MaxPat]; // Pattern data
int PatN; // Length		int PatN; // Length
char LowPat[MaxPat]; // Pattern in lowercase		char LowPat[MaxPat]; // Pattern in lowercase
CharRole PatRole[MaxPat]; // Pattern segmentation info		CharRole PatRole[MaxPat]; // Pattern segmentation info
int PatTypeSet; // Bitmask of 1<<CharType		int PatTypeSet; // Bitmask of 1<<CharType
float ScoreScale; // Normalizes scores for the pattern length.		float ScoreScale; // Normalizes scores for the pattern length.
[23 lines not shown]

clangd/FuzzyMatch.cpp

[52 lines not shown]
//		//
// This algorithm was inspired by VS code's client-side filtering, and aims		// This algorithm was inspired by VS code's client-side filtering, and aims
// to be mostly-compatible.		// to be mostly-compatible.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "FuzzyMatch.h"		#include "FuzzyMatch.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
		#include "llvm/ADT/StringExtras.h"
#include "llvm/Support/Format.h"		#include "llvm/Support/Format.h"

namespace clang {		namespace clang {
namespace clangd {		namespace clangd {
using namespace llvm;		using namespace llvm;

constexpr int FuzzyMatcher::MaxPat;		constexpr int FuzzyMatcher::MaxPat;
constexpr int FuzzyMatcher::MaxWord;		constexpr int FuzzyMatcher::MaxWord;

static char lower(char C) { return C >= 'A' && C <= 'Z' ? C + ('a' - 'A') : C; }
// A "negative infinity" score that won't overflow.		// A "negative infinity" score that won't overflow.
// We use this to mark unreachable states and forbidden solutions.		// We use this to mark unreachable states and forbidden solutions.
// Score field is 15 bits wide, min value is -2^14, we use half of that.		// Score field is 15 bits wide, min value is -2^14, we use half of that.
static constexpr int AwfulScore = -(1 << 13);		static constexpr int AwfulScore = -(1 << 13);
static bool isAwful(int S) { return S < AwfulScore / 2; }		static bool isAwful(int S) { return S < AwfulScore / 2; }
static constexpr int PerfectBonus = 3; // Perfect per-pattern-char score.		static constexpr int PerfectBonus = 3; // Perfect per-pattern-char score.

FuzzyMatcher::FuzzyMatcher(StringRef Pattern)		FuzzyMatcher::FuzzyMatcher(StringRef Pattern)
: PatN(std::min<int>(MaxPat, Pattern.size())),		: PatN(std::min<int>(MaxPat, Pattern.size())),
ScoreScale(PatN ? float{1} / (PerfectBonus * PatN) : 0), WordN(0) {		ScoreScale(PatN ? float{1} / (PerfectBonus * PatN) : 0), WordN(0) {
std::copy(Pattern.begin(), Pattern.begin() + PatN, Pat);		std::copy(Pattern.begin(), Pattern.begin() + PatN, Pat);
for (int I = 0; I < PatN; ++I)		for (int I = 0; I < PatN; ++I)
LowPat[I] = lower(Pat[I]);		LowPat[I] = toLower(Pat[I]);
Scores[0][0][Miss] = {0, Miss};
Scores[0][0][Match] = {AwfulScore, Miss};
for (int P = 0; P <= PatN; ++P)
for (int W = 0; W < P; ++W)
for (Action A : {Miss, Match})
Scores[P][W][A] = {AwfulScore, Miss};
if (PatN > 0)
calculateRoles(Pat, PatRole, PatTypeSet, PatN);		calculateRoles(Pat, PatRole, PatTypeSet, PatN);
}		}

Optional<float> FuzzyMatcher::match(StringRef Word) {		Optional<float> FuzzyMatcher::match(StringRef Word) {
if (!(WordContainsPattern = init(Word)))		if (!(WordContainsPattern = init(Word)))
return None;		return None;
if (!PatN)
return 1;
buildGraph();		buildGraph();
auto Best = std::max(Scores[PatN][WordN][Miss].Score,		// The pattern doesn't have to match the whole word (but the whole pattern
Scores[PatN][WordN][Match].Score);		// must match). Find the optimal prefix of Word to match Pattern.
		int Best = AwfulScore;
		for (int I = PatN; I <= WordN; I++)
		Best = std::max(Best, Scores[PatN][I][Match].Score);
if (isAwful(Best))		if (isAwful(Best))
return None;		return None;
return ScoreScale * std::min(PerfectBonus * PatN, std::max<int>(0, Best));		return ScoreScale * std::min(PerfectBonus * PatN, std::max<int>(0, Best));
}		}

// Segmentation of words and patterns.		// Segmentation of words and patterns.
// A name like "fooBar_baz" consists of several parts foo, bar, baz.		// A name like "fooBar_baz" consists of several parts foo, bar, baz.
// Aligning segmentation of word and pattern improves the fuzzy-match.		// Aligning segmentation of word and pattern improves the fuzzy-match.
// For example: [lol] matches "LaughingOutLoud" better than "LionPopulation"		// For example: [lol] matches "LaughingOutLoud" better than "LionPopulation"
//		//
// First we classify each character into types (uppercase, lowercase, etc).		// First we classify each character into types (uppercase, lowercase, etc).
[61 lines not shown]
constexpr static uint8_t CharRoles[] = {
// clang-format on		// clang-format on
};		};

template <typename T> static T packedLookup(const uint8_t *Data, int I) {		template <typename T> static T packedLookup(const uint8_t *Data, int I) {
return static_cast<T>((Data[I >> 2] >> ((I & 3) * 2)) & 3);		return static_cast<T>((Data[I >> 2] >> ((I & 3) * 2)) & 3);
}		}
void FuzzyMatcher::calculateRoles(const char *Text, CharRole *Out, int &TypeSet,		void FuzzyMatcher::calculateRoles(const char *Text, CharRole *Out, int &TypeSet,
int N) {		int N) {
assert(N > 0);		if (!N)
		return;
CharType Type = packedLookup<CharType>(CharTypes, Text[0]);		CharType Type = packedLookup<CharType>(CharTypes, Text[0]);
TypeSet = 1 << Type;		TypeSet = 1 << Type;
// Types holds a sliding window of (Prev, Curr, Next) types.		// Types holds a sliding window of (Prev, Curr, Next) types.
// Initial value is (Empty, Empty, type of Text[0]).		// Initial value is (Empty, Empty, type of Text[0]).
int Types = Type;		int Types = Type;
// Rotate slides in the type of the next character.		// Rotate slides in the type of the next character.
auto Rotate = [&](CharType T) { Types = ((Types << 2) | T) & 0x3f; };		auto Rotate = [&](CharType T) { Types = ((Types << 2) | T) & 0x3f; };
for (int I = 0; I < N - 1; ++I) {		for (int I = 0; I < N - 1; ++I) {
[10 lines not shown]

// Sets up the data structures matching Word.		// Sets up the data structures matching Word.
// Returns false if we can cheaply determine that no match is possible.		// Returns false if we can cheaply determine that no match is possible.
bool FuzzyMatcher::init(StringRef NewWord) {		bool FuzzyMatcher::init(StringRef NewWord) {
WordN = std::min<int>(MaxWord, NewWord.size());		WordN = std::min<int>(MaxWord, NewWord.size());
if (PatN > WordN)		if (PatN > WordN)
return false;		return false;
std::copy(NewWord.begin(), NewWord.begin() + WordN, Word);		std::copy(NewWord.begin(), NewWord.begin() + WordN, Word);
if (PatN == 0)
return true;
for (int I = 0; I < WordN; ++I)		for (int I = 0; I < WordN; ++I)
LowWord[I] = lower(Word[I]);		LowWord[I] = toLower(Word[I]);

// Cheap subsequence check.		// Cheap subsequence check.
for (int W = 0, P = 0; P != PatN; ++W) {		for (int W = 0, P = 0; P != PatN; ++W) {
if (W == WordN)		if (W == WordN)
return false;		return false;
if (LowWord[W] == LowPat[P])		if (LowWord[W] == LowPat[P])
++P;		++P;
}		}
[10 unchanged lines not shown]
// Unlike other tables, indices range from 0 to N inclusive		// Unlike other tables, indices range from 0 to N inclusive
// Matched = whether we chose to match Word[W] with Pat[P] or not.		// Matched = whether we chose to match Word[W] with Pat[P] or not.
//		//
// Points are mostly assigned to matched characters, with 1 being a good score		// Points are mostly assigned to matched characters, with 1 being a good score
// and 3 being a great one. So we treat the score range as [0, 3 * PatN].		// and 3 being a great one. So we treat the score range as [0, 3 * PatN].
// This range is not strict: we can apply larger bonuses/penalties, or penalize		// This range is not strict: we can apply larger bonuses/penalties, or penalize
// non-matched characters.		// non-matched characters.
void FuzzyMatcher::buildGraph() {		void FuzzyMatcher::buildGraph() {
		Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss};
sammccall: Why this change? This has also been moved from the cheaper constructor to the more expensive per-match call (as has the diagonal assignment added in the next loop). Also, shouldn't [0][0][Match] be AwfulScore?
MaskRay: "The more expensive per-match call" is just two value assignments. I have removed the expensive table initialization in the constructor. [0][0][] can be any value.
sammccall:
> "The more expensive per-match call" is just two value assignments.
Oops, sorry - by "more expensive" I mean "called thousands of times more often".
> I have removed the expensive table initialization in the constructor.
I don't want to be rude, but I asked why you changed this, and you didn't answer. Unless there's a strong reason, I'd prefer to revert this change, as I find this harder to reason about. (Roughly: in the old version of the code, any data that didn't need to change for the life of the object was initialized in the constructor. That way I didn't need to worry what was performance-critical and what wasn't - match() only did what was strictly necessary.)
> [0][0][] can be any value.
Can you please explain why?
MaskRay:
> Oops, sorry - by "more expensive" I mean "called thousands of times more often".
It is true that `Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss};` is a cost incurred for each word. But it is not full table initialization, it is just two variable assignments. And we will assign the other values of the first row `Scores[0][][]` in the following loop. The old code scatters the table construction across two places, the constructor and this dynamic programming site.
MaskRay:
> [0][0][] can be any value. Can you please explain why?
`Scores[0][0][]` is the initial value which will be propagated to all other values in the table. The relative difference of pairwise values in the table is constant whatever initial value is chosen. If you ignore the max clamp you used later, the initial value does not matter.
sammccall: Is it not possible that we'll choose a best path starting at Scores[0][0][Match]? This is invalid, and previously that was signaled by giving that cell AwfulScore, which ensures any path that emerges from it isAwful.
MaskRay: I think `Scores[0][0][Match]` is as valid as `Scores[0][0][Miss]`. The argument is that a `Miss` state should ensure a skipped position in Text. When there is zero position, it cannot be `Miss`ed. A dynamic programming algorithm does not necessarily have only one valid initial state (think about the constant term in an indefinite integral). I will choose the one which makes more sense if such an initial state exists, or one that simplifies the case. In this case treating both of them as valid simplifies the code.
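The "initial value does not matter" claim in this thread can be checked on a toy alignment table. The sketch below is an assumption-laden stand-in, not clangd's scorer: `toyBestScore` is a hypothetical function that awards +1 per matched character and -1 per skipped word character, uses a single score per cell, and does not model the Match/Miss split or the AwfulScore clamp that the reviewer is asking about.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy DP: S[P][W] is the best score aligning the first P pattern characters
// with the first W word characters. Init seeds the top-left cell.
int toyBestScore(const std::string &Pat, const std::string &Word, int Init) {
  const int Unreachable = -1000000; // Hypothetical sentinel for "no alignment".
  const int PatN = static_cast<int>(Pat.size());
  const int WordN = static_cast<int>(Word.size());
  std::vector<std::vector<int>> S(PatN + 1,
                                  std::vector<int>(WordN + 1, Unreachable));
  S[0][0] = Init;
  for (int W = 0; W < WordN; ++W)
    S[0][W + 1] = S[0][W] - 1; // Skipping a word character costs 1.
  for (int P = 0; P < PatN; ++P) {
    for (int W = P; W < WordN; ++W) {
      // Skip Word[W], staying on the same pattern row.
      int Skip = S[P + 1][W] == Unreachable ? Unreachable : S[P + 1][W] - 1;
      // Match Pat[P] with Word[W], coming from the diagonal cell.
      int Take = (Pat[P] == Word[W] && S[P][W] != Unreachable)
                     ? S[P][W] + 1
                     : Unreachable;
      S[P + 1][W + 1] = std::max(Skip, Take);
    }
  }
  return S[PatN][WordN];
}
```

For reachable cells, changing `Init` shifts every score by the same amount: `toyBestScore("fb", "foobar", 0)` is -2 and `toyBestScore("fb", "foobar", 7)` is 5. Whether the same holds once sentinel cells and clamping are involved is exactly the point under discussion, and this sketch does not settle it.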
for (int W = 0; W < WordN; ++W) {		for (int W = 0; W < WordN; ++W) {
Scores[0][W + 1][Miss] = {Scores[0][W][Miss].Score - skipPenalty(W, Miss),		Scores[0][W + 1][Miss] = {Scores[0][W][Miss].Score - missPenalty(W, Miss),
Miss};		Miss};
Scores[0][W + 1][Match] = {AwfulScore, Miss};		Scores[0][W + 1][Match] = {AwfulScore, Miss};
}		}
for (int P = 0; P < PatN; ++P) {		for (int P = 0; P < PatN; ++P) {
		Scores[P + 1][P][Miss] = Scores[P + 1][P][Match] = {AwfulScore, Miss};
for (int W = P; W < WordN; ++W) {		for (int W = P; W < WordN; ++W) {
auto &Score = Scores[P + 1][W + 1], &PreMiss = Scores[P + 1][W];		auto &Score = Scores[P + 1][W + 1], &PreMiss = Scores[P + 1][W];

auto MatchMissScore = PreMiss[Match].Score;		auto MatchMissScore = PreMiss[Match].Score - missPenalty(W, Match);
auto MissMissScore = PreMiss[Miss].Score;		auto MissMissScore = PreMiss[Miss].Score - missPenalty(W, Miss);
sammccall: Adding the penalty unconditionally seems like a behavior change, why?
MaskRay: Because now we use a different method to calculate the final value. I believe this makes the loop simpler. This was not regular because of `Scores[0][W + 1][Miss] = {Scores[0][W][Miss].Score + missScore(W, Miss), Miss};` which unconditionally added a trailing penalty, while the main loop did not.
if (P < PatN - 1) { // Skipping trailing characters is always free.
MatchMissScore -= skipPenalty(W, Match);
MissMissScore -= skipPenalty(W, Miss);
}
Score[Miss] = (MatchMissScore > MissMissScore)		Score[Miss] = (MatchMissScore > MissMissScore)
? ScoreInfo{MatchMissScore, Match}		? ScoreInfo{MatchMissScore, Match}
: ScoreInfo{MissMissScore, Miss};		: ScoreInfo{MissMissScore, Miss};

if (!allowMatch(P, W)) {		if (!allowMatch(P, W)) {
Score[Match] = {AwfulScore, Miss};		Score[Match] = {AwfulScore, Miss};
} else {		} else {
auto &PreMatch = Scores[P][W];		auto &PreMatch = Scores[P][W];
[17 unchanged lines not shown]	bool FuzzyMatcher::allowMatch(int P, int W) const {
// We're banning matches outright, so conservatively accept some other cases		// We're banning matches outright, so conservatively accept some other cases
// where our segmentation might be wrong:		// where our segmentation might be wrong:
// - allow matching B in ABCDef (but not in NDEBUG)		// - allow matching B in ABCDef (but not in NDEBUG)
// - we'd like to accept print in sprintf, but too many false positives		// - we'd like to accept print in sprintf, but too many false positives
return WordRole[W] != Tail ||		return WordRole[W] != Tail ||
(Word[W] != LowWord[W] && WordTypeSet & 1 << Lower);		(Word[W] != LowWord[W] && WordTypeSet & 1 << Lower);
}		}

int FuzzyMatcher::skipPenalty(int W, Action Last) const {		int FuzzyMatcher::missPenalty(int W, Action Last) const {
int S = 0;		int S = 0;
if (WordRole[W] == Head) // Skipping a segment.		if (WordRole[W] == Head) // Skipping a segment.
S += 1;		++S;
if (Last == Match) // Non-consecutive match.		if (Last == Match) // Non-consecutive match.
S += 2; // We'd rather skip a segment than split our match.		S += 2; // We'd rather skip a segment than split our match.
return S;		return S;
}		}

int FuzzyMatcher::matchBonus(int P, int W, Action Last) const {		int FuzzyMatcher::matchBonus(int P, int W, Action Last) const {
assert(LowPat[P] == LowWord[W]);		assert(LowPat[P] == LowWord[W]);
int S = 1;		int S = 1;
[27 unchanged lines not shown]	llvm::SmallString<256> FuzzyMatcher::dumpLast(llvm::raw_ostream &OS) const {
}		}
if (WordN == 0) {		if (WordN == 0) {
OS << "Word is empty: no match.\n";		OS << "Word is empty: no match.\n";
return Result;		return Result;
}		}
if (!WordContainsPattern) {		if (!WordContainsPattern) {
OS << "Substring check failed.\n";		OS << "Substring check failed.\n";
return Result;		return Result;
} else if (isAwful(std::max(Scores[PatN][WordN][Match].Score,
Scores[PatN][WordN][Miss].Score))) {
OS << "Substring check passed, but all matches are forbidden\n";
}		}
		int W = PatN;
		for (int P = PatN; ++P <= WordN; )
sammccall: nit: I -> P, move increment to the increment expression of the for loop?
MaskRay: I -> P.
> move increment to the increment expression of the for loop?
Not sure about the coding standard here, but if you insist I'll have to change it as you are the reviewer. If the loop variable was an iterator, `for (It I = std::next(...); I != E; ++I)` would be uglier than `for (It I = ...; ++I != E; )`
sammccall: Uglier is subjective, but side-effects in the condition of a for-loop are sufficiently unusual and surprising that I'd prefer to avoid them in both cases.
		if (Scores[PatN][P][Match].Score > Scores[PatN][W][Match].Score)
		W = P;
		if (isAwful(Scores[PatN][W][Match].Score))
		OS << "Substring check passed, but all matches are forbidden\n";
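For comparison, the two loop shapes debated above can be placed side by side. This is a sketch with hypothetical names: `Row` stands in for the `Scores[PatN][*][Match].Score` row, and both functions assume `Row` has `WordN + 1` entries with `PatN <= WordN`.

```cpp
#include <vector>

// Shape used in the patch: the increment is a side effect of the condition.
int bestEndIncrementInCondition(const std::vector<int> &Row, int PatN) {
  const int WordN = static_cast<int>(Row.size()) - 1;
  int W = PatN;
  for (int P = PatN; ++P <= WordN;)
    if (Row[P] > Row[W])
      W = P;
  return W;
}

// Shape requested in review: the increment sits in the increment expression.
int bestEndConventional(const std::vector<int> &Row, int PatN) {
  const int WordN = static_cast<int>(Row.size()) - 1;
  int BestW = PatN;
  for (int W = PatN + 1; W <= WordN; ++W)
    if (Row[W] > Row[BestW])
      BestW = W;
  return BestW;
}
```

Both visit the indices `PatN + 1` through `WordN` and keep the first maximum, so the choice between them is about readability rather than behavior.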
if (!(PatTypeSet & 1 << Upper))		if (!(PatTypeSet & 1 << Upper))
OS << "Lowercase query, so scoring ignores case\n";		OS << "Lowercase query, so scoring ignores case\n";

// Traverse Matched table backwards to reconstruct the Pattern/Word mapping.		// Traverse Matched table backwards to reconstruct the Pattern/Word mapping.
// The Score table has cumulative scores, subtracting along this path gives		// The Score table has cumulative scores, subtracting along this path gives
// us the per-letter scores.		// us the per-letter scores.
Action Last =
(Scores[PatN][WordN][Match].Score > Scores[PatN][WordN][Miss].Score)
? Match
: Miss;
int S[MaxWord];		int S[MaxWord];
Action A[MaxWord];		Action A[MaxWord + 1];
for (int W = WordN - 1, P = PatN - 1; W >= 0; --W) {		{
sammccall: W is the right name in this file for a variable iterating over word indices, please don't change this. The new variable above could be EndW or so?
sammccall: As far as I can see, this loop is setting `A[W+1:...] = Miss` and populating `A[0...W]` with the existing logic. I think this would be clearer as two loops; currently there are a lot of conditionals around Last that obscure what's actually happening.
sammccall: You've shifted P (and the old W, new I) by 1. This does reduce the number of +1 and -1 in this function, but it's inconsistent with how these are used elsewhere: P should be the index into Pat of the character that we're considering.
MaskRay: I don't understand the rationale not to use the shifted indices. The code actually uses `Scores[P][W][*]` to mean the optimal match of the first `P` characters of the pattern with the first `W` characters of the word, not the position of the character. On the other hand, C++ reverse iterators use the shifted one for `for (I = rend(); I != rbegin(); ++I)`. The shifted one makes the ending condition check easier.
sammccall:
> I don't understand the rationale not to use the shifted indices
The rationale is entirely consistency with the surrounding code. The consistency helps avoid off-by-one errors when similar loops have different conventions. In this file, when looping over word or pattern dimensions, P and W respectively are used for loop variables, and can be interpreted as indices into Pat/Word. Here the interpretation would be "did we match or miss character Word[W]".
MaskRay: `Scores[P][W][*]` is interpreted as how good it is if we align the first `P` characters of the pattern with the first `W` characters of the word. Note the code uses the number of characters instead of the position. Here the new interpretation would be "what we should do for the last character of the first W characters".
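The two indexing conventions in this thread can be compared on a one-dimensional simplification. The sketch below is hypothetical and not taken from the patch: `Cum` plays the role of a cumulative score row with `N + 1` entries (indexed by prefix length), and both functions recover the per-character scores by walking backwards.

```cpp
#include <vector>

// Reviewer's convention: W is an index into the word, so the cells touched
// are Cum[W + 1] and Cum[W].
std::vector<int> perCharByIndex(const std::vector<int> &Cum) {
  if (Cum.empty())
    return {};
  const int N = static_cast<int>(Cum.size()) - 1;
  std::vector<int> S(N);
  for (int W = N - 1; W >= 0; --W)
    S[W] = Cum[W + 1] - Cum[W];
  return S;
}

// Author's convention: I is a prefix length, so the character it refers to
// is Word[I - 1] and the cells touched are Cum[I] and Cum[I - 1].
std::vector<int> perCharByPrefixLength(const std::vector<int> &Cum) {
  if (Cum.empty())
    return {};
  const int N = static_cast<int>(Cum.size()) - 1;
  std::vector<int> S(N);
  for (int I = N; I > 0; --I)
    S[I - 1] = Cum[I] - Cum[I - 1];
  return S;
}
```

The results are identical; the difference is only where the `+ 1` and `- 1` appear, which is why the disagreement is about consistency with the rest of the file rather than correctness.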
A[W] = Last;		int P = PatN;
const auto &Cell = Scores[P + 1][W + 1][Last];		A[WordN] = Miss;
		for (int I = WordN; I > W; I--) {
		A[I - 1] = Miss;
		S[I - 1] = Scores[P][I][Miss].Score - Scores[P][I - 1][Miss].Score;
		}
		Action Last = Match;
		for (int I = W; I > 0; I--) {
		A[I - 1] = Last;
		const auto &Cell = Scores[P][I][Last];
if (Last == Match)		if (Last == Match)
--P;		--P;
const auto &Prev = Scores[P + 1][W][Cell.Prev];		const auto &Prev = Scores[P][I - 1][Cell.Prev];
S[W] = Cell.Score - Prev.Score;		S[I - 1] = Cell.Score - Prev.Score;
Last = Cell.Prev;		Last = Cell.Prev;
}		}
		}
for (int I = 0; I < WordN; ++I) {		for (int I = 0; I < WordN; ++I) {
if (A[I] == Match && (I == 0 || A[I - 1] == Miss))		if (A[I] == Match && (I == 0 || A[I - 1] == Miss))
Result.push_back('[');		Result.push_back('[');
if (A[I] == Miss && I > 0 && A[I - 1] == Match)
Result.push_back(']');
Result.push_back(Word[I]);		Result.push_back(Word[I]);
}		if (A[I] == Match && A[I + 1] == Miss)
if (A[WordN - 1] == Match)
Result.push_back(']');		Result.push_back(']');
		}

sammccall: Now that the end of the word could be anywhere, it might be nice to add an arrowhead `^' below the table pointing at it :) Up to you.
for (char C : StringRef(Word, WordN))		for (char C : StringRef(Word, WordN))
OS << " " << C << " ";		OS << " " << C << " ";
OS << "\n";		OS << "\n";
for (int I = 0, J = 0; I < WordN; I++)		for (int I = 0, J = 0; I < WordN; I++)
OS << " " << (A[I] == Match ? Pat[J++] : ' ') << " ";		OS << " " << (A[I] == Match ? Pat[J++] : ' ') << " ";
OS << "\n";		OS << "\n";
for (int I = 0; I < WordN; I++)		for (int I = 0; I < WordN; I++)
OS << format("%2d ", S[I]);		OS << format("%2d ", S[I]);
[13 unchanged lines not shown]	llvm::SmallString<256> FuzzyMatcher::dumpLast(llvm::raw_ostream &OS) const {
for (char C : StringRef(Word, WordN))		for (char C : StringRef(Word, WordN))
OS << " " << C << " ";		OS << " " << C << " ";
OS << "\n";		OS << "\n";
OS << "-+----" << std::string(WordN * 4, '-') << "\n";		OS << "-+----" << std::string(WordN * 4, '-') << "\n";
for (int I = 0; I <= PatN; ++I) {		for (int I = 0; I <= PatN; ++I) {
for (Action A : {Miss, Match}) {		for (Action A : {Miss, Match}) {
OS << ((I && A == Miss) ? Pat[I - 1] : ' ') << "|";		OS << ((I && A == Miss) ? Pat[I - 1] : ' ') << "|";
for (int J = 0; J <= WordN; ++J) {		for (int J = 0; J <= WordN; ++J) {
if (!isAwful(Scores[I][J][A].Score))		if (I <= J && !isAwful(Scores[I][J][A].Score))
OS << format("%3d%c", Scores[I][J][A].Score,		OS << format("%3d%c", Scores[I][J][A].Score,
Scores[I][J][A].Prev == Match ? '*' : ' ');		Scores[I][J][A].Prev == Match ? '*' : ' ');
else		else
OS << " ";		OS << " ";
}		}
OS << "\n";		OS << "\n";
}		}
}		}

return Result;		return Result;
}		}

} // namespace clangd		} // namespace clangd
} // namespace clang		} // namespace clang