This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Boost completion score according to file proximity.
ClosedPublic

Authored by ioeric on Jun 8 2018, 3:21 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

ioeric created this revision.Jun 8 2018, 3:21 AM
ioeric updated this revision to Diff 150477.Jun 8 2018, 3:24 AM
  • Rebase again...

Can you benchmark this? I'm nervous about the URI stuff in the hot path.
Timing CodeCompleteFlow::measureResults before/after with index enabled seems like a reasonable test.
(But you might want to make this apply to sema first too for realistic numbers?)

clangd/Quality.cpp
203 ↗(On Diff #150477)

how do we know proximitypath is set at this point?
Better to copy the symbol URL path I think :-(

208 ↗(On Diff #150477)

Why U->toString() rather than ->body()?

215 ↗(On Diff #150477)

proximity path needs to be set here too

clangd/Quality.h
74 ↗(On Diff #150477)

It seems OK to have ProximityPath or ProximityScore, but we shouldn't have both: drop proximityscore and calculate it during evaluate()?

Oops, couple more comments.
But the big things I think are:

  • what's the performance impact of doing all this work (including the URI stuff) inside the scoring loop?
  • what's the most useful formula for the proximity score
clangd/Quality.cpp
171 ↗(On Diff #150477)

This doesn't look quite right to me.
We can tune the details later, but in practice this seems like it's *very* hard to get zero proximity, which is our neutral score - you need to be 18 directories away?

FWIW, fozzie appears to give an additive boost proportional to 5-up, where up is the number of directories from the context you have to traverse up from the context to get to a parent of the symbol. (There's no penalty for down-traversals probably for implementation reasons, this should be smaller than the up-traversal penalty I think)

clangd/Quality.h
74 ↗(On Diff #150477)

what's the plan for associated-header? should this be a smallvector<2>?

ioeric added a comment.Jun 8 2018, 5:59 AM

Can you benchmark this? I'm nervous about the URI stuff in the hot path.
Timing CodeCompleteFlow::measureResults before/after with index enabled seems like a reasonable test.
(But you might want to make this apply to sema first too for realistic numbers?)

Sure! I had some numbers but they are on some paper that I don't access to right now... will collect some new figures (with URI manipulations in sema).

clangd/Quality.cpp
171 ↗(On Diff #150477)

The numbers are guessed... definitely happy to tune.

We can tune the details later, but in practice this seems like it's *very* hard to get zero proximity, which is our neutral score - you need to be 18 directories away?

It's 18 directories away if one file is in an ancestor directories of the other (i.e. only traverse up or down). If you need to traverse up and down, the penalty for each directory is 0.1, which takes 10 directories (up+down, so
5 up in average). I think it's useful to make this distinction because I think it's more likely for a file to use a header if it's in the file path.

I'm not sure if we should use zero as the neutral score. For example, if a codebase has deep directory structure, most scores are probably going to be small; conversely, most scores would be relatively big. I think relative scores are more useful.

(There's no penalty for down-traversals probably for implementation reasons, this should be smaller than the up-traversal penalty I think)

Why do you think down-traversal should take less penalty?

208 ↗(On Diff #150477)

Because both URIs need to be parsed in order to use body(). Here we don't parse SymURI.

215 ↗(On Diff #150477)

Alternatively, I wonder if we could give sema result a fixed proximity score as they are symbols that are already included?

clangd/Quality.h
74 ↗(On Diff #150477)

Just want to make sure I understand. We would copy the symbol URI to use in merge right?

74 ↗(On Diff #150477)

I think it should be easy to change this to vector when it's actually needed?

ioeric added a comment.Jun 8 2018, 7:44 AM

Here are some numbers by completing "clang::^" 40 times (with result limit 1000 instead of 100).

Timing in CodeCompleteFlow::measureResults

Before: Avg: 1811 us Med: 1792 us 
After: Avg: 2714 us Med: 2689 us

As a reference, a full CodeCompleteFlow (with 1000 candidates) takes ~70 ms (using LLVM's yaml index).

So, with the current limit of 100 results, the increase for measureResults should be roughly 0.18ms -> 0.27ms, which I think is reasonable.

ioeric updated this revision to Diff 150923.Jun 12 2018, 3:33 AM
  • Merge branch 'uri' into proximity
  • Addressed review comments.
ioeric updated this revision to Diff 150924.Jun 12 2018, 3:35 AM
  • Cleanup comment a bit.

PTAL

clangd/Quality.cpp
215 ↗(On Diff #150477)

As discussed offline, sema symbols now have a fixed proximity score (not entirely sure about the value though).

clangd/Quality.h
74 ↗(On Diff #150477)

Changed to vector anyway...

74 ↗(On Diff #150477)

Experimented with this a bit (removing ProximityScore). As we print the proximity score for debugging, we would still want to keep the store. Alternatively, I made the proximity paths a parameter of merge as we only use them for index result anyway.

Sorry for the delay on this change. There's a bunch of complexity in this problem that I haven't seen how to slice through:

  1. the signals needed seem like a weird fit for the Symbol*Signals structs for some reason (maybe my misdesign)
  2. the inconsistency between how we do this for Sema and for Index results has... only slightly good reasons
  3. the URI vs filename thing is awkward
  4. with all this, the actual scoring still seems ad-hoc and is missing important parts (main header, transitive includes)

Not all your fault that the code reflects this, the problem is tangly. But it's hard for me to reason about APIs or performance or layering.

Looking at the last point (scoring model) because it seems the most tractable. I think this is basically an edit distance problem?
(We can call the result "proximity", start at one, and multiply by <1, or call it "distance" and start at 0 and add penalties, but it's equivalent).

  • we're computing distances between files (glossing over URI-space vs file-space)
  • the roots are the main file, and maybe the matching header
  • edits take us from a filepath to a related filepath:
    • from a file to a header it #includes
    • from a file to its parent directory
    • from a parent directory to a child directory
    • from a parent directory to a file in it
  • the distance is the smallest sum-of-penalties for any path leading from the root to the symbol

What do you think of this model?


If the model seems reasonable, then it suggests an approach of building a one-per-query data structure that computes the needed edit-distance recursively, memoizing results. SymbolRelevanceResults could store the symbol path and a pointer to the edit-distance machine, and for debugging the machine would know how to describe its configuration. URI/path mapping wouldn't be a performance concern (I think) if the memoization captured it.

Let's chat offline?

clangd/Quality.cpp
171 ↗(On Diff #150477)

If you need to traverse up and down, the penalty for each directory is 0.1, which takes 10 directories (up+down, so 5 up in average).

I think you've halved twice there - it still seems to be 10, which is a lot.

I'm not sure if we should use zero as the neutral score.

Well, zero is currently the neutral score, and this patch doesn't change it :-)
I think starting at 1 for the current file and *multiplying* by p<1 to apply penalties should give a reasonable 0-1 score that's relatively sane even for codebases of different sizes. Happy to have a different model, but you need to explain/implement how it combines with other signals.

Why do you think down-traversal should take less penalty?

Intuitively, because subprojects are more closely related than superprojects. But this didn't occur to me until someone mentioned it, we should check with Matei and Alexander.

ioeric updated this revision to Diff 151169.Jun 13 2018, 8:03 AM
  • Introduced a one-per-query structure for relevance signals; use multiplication for proximity; simplify tests a bit; separate index and sema proximity scores.

Sorry for the delay on this change. There's a bunch of complexity in this problem that I haven't seen how to slice through:

  1. the signals needed seem like a weird fit for the Symbol*Signals structs for some reason (maybe my misdesign)

According to offline discussion, I added a structure SymbolRelevanceContext that captures per-query signals like proximity paths. Not sure about the name though.

  1. the inconsistency between how we do this for Sema and for Index results has... only slightly good reasons

The proximity scores for index and sema are now explicitly separated to make it easier to understand and debug.

  1. the URI vs filename thing is awkward
  2. with all this, the actual scoring still seems ad-hoc and is missing important parts (main header, transitive includes)

Not all your fault that the code reflects this, the problem is tangly. But it's hard for me to reason about APIs or performance or layering.

Looking at the last point (scoring model) because it seems the most tractable. I think this is basically an edit distance problem?
(We can call the result "proximity", start at one, and multiply by <1, or call it "distance" and start at 0 and add penalties, but it's equivalent).

  • we're computing distances between files (glossing over URI-space vs file-space)
  • the roots are the main file, and maybe the matching header
  • edits take us from a filepath to a related filepath:
    • from a file to a header it #includes
    • from a file to its parent directory
    • from a parent directory to a child directory
    • from a parent directory to a file in it
  • the distance is the smallest sum-of-penalties for any path leading from the root to the symbol

What do you think of this model?


If the model seems reasonable, then it suggests an approach of building a one-per-query data structure that computes the needed edit-distance recursively, memoizing results. SymbolRelevanceResults could store the symbol path and a pointer to the edit-distance machine, and for debugging the machine would know how to describe its configuration. URI/path mapping wouldn't be a performance concern (I think) if the memoization captured it.

I like how this model addresses the proximity for src/ and include/ setup. I think we could start with something simple and iterate, although I agree that we should strike for a design that would be easy to replace the proximity algorithm in the future.

Let's chat offline?

clangd/Quality.cpp
171 ↗(On Diff #150477)

I think you've halved twice there - it still seems to be 10, which is a lot.

OK you are right. It would behave badly when there are many ups and only one down.

I think starting at 1 for the current file and *multiplying* by p<1 to apply penalties should give a reasonable 0-1 score that's relatively sane even for codebases of different sizes.

Sounds good. Picked p=0.7 which seems to give reasonable scores.

ioeric updated this revision to Diff 151172.Jun 13 2018, 8:07 AM
  • Rebased.

Thanks, this looks much clearer/more modular/more extensible to me!
A couple of notes on the abstractions before digging into details again.

clangd/Quality.h
71 ↗(On Diff #151172)

This is ambiguously a couple of different (and good!) things:

  • an encapsulation of the proximitypaths state and logic
  • a grouping together of the "query-dependent, symbol-invariant" inputs to the relevance calculation.

There's a place for both of these, but I'd argue for separating them (and only doing the second in this patch). Reasons:

  • the former doesn't need to be in this file if it gets complex (FuzzyMatch.h is a similar case), while the latter does
  • easier to understand/name if this hierarchy is expressed explicitly
  • I suspect we may want the context to be a separate struct, passed to SymbolRelevanceSignals::evaluate(), rather than a member of SymbolRelevanceSignals. That would add more churn than needs to be in this patch though.

If this makes sense to you then I think this class looks great but should be called something specific like FileProximityMatcher.

91 ↗(On Diff #151172)

One of the simplifying assumptions in the model is that all signals are optional - can we make Context a pointer = nullptr and drop the constructor?

ioeric updated this revision to Diff 151321.Jun 14 2018, 3:35 AM
ioeric marked 2 inline comments as done.
  • addressed review comments.
ioeric updated this revision to Diff 151322.Jun 14 2018, 3:36 AM
  • Rebase...
ioeric added inline comments.Jun 14 2018, 3:39 AM
clangd/Quality.h
71 ↗(On Diff #151172)

Sounds good. Thanks for the explanation!

Thanks, just details now!

clangd/Quality.cpp
185 ↗(On Diff #151322)

why is this a special case?

  • /x/a/b vs /x/a/c is 1 up + 1 down --> 0.59
  • /a/b vs /a/c is 1 up + 1 down --> 0.59
  • /b vs /c is unrelated --> 0

I don't doubt the assertion that these are unrelated paths, but I'm not sure fixing just this case is an improvement overall. (In a perfect world, we'd define the algorithm so that this case yields 0 without a discontinuity)

216 ↗(On Diff #151322)

For composability, you could consider styling more tersely e.g. as ProximityPaths{/path/to/file}, and in the RelevanceSignals operator<< including it like other fields, yielding:

== Symbol relevance: 0.8 ==
 Name match: 0.7
 File proximity matcher: ProximityPaths{/path/to/file}
 ...
clangd/Quality.h
78 ↗(On Diff #151322)

Should mention the semantics of the score, maybe via the other extreme: when the SymbolURI exactly matches a proximity path, score is 1.

98 ↗(On Diff #151322)

This is redundant with (IndexSymbolURI, FileProximityMatch) I think, and will only be correctly set if FileProximityMatch is set before calling merge(Symbol).

Can we defer calculation until evaluate()?
(If you want to dump this intermediate score, you can recompute it in operator<<, I'm not sure it's necessary).

unittests/clangd/TestFS.cpp
66 ↗(On Diff #151322)

These helpers would be more coherent if this used the same test root as above - any reason we can't do that?

Then this comment could just be "unittest: is a scheme that refers to files relative to testRoot()"

107 ↗(On Diff #151322)

This is really surprising to me - is this the common pattern for registries?
(i.e. we don't have something more declarative like bazel's cc_library.alwayslink)?

If so, can we move the declaration to TestFS.h and give a usage example, so the consuming libraries don't have to repeat the decl?

ioeric updated this revision to Diff 151358.Jun 14 2018, 8:47 AM
ioeric marked 6 inline comments as done.
  • addressed review comments
clangd/Quality.cpp
185 ↗(On Diff #151322)

The intuition is that when we hit the root, it's very likely that we are switching projects. But we could leave this out of the patch and evaluate whether this is an improvement later.

clangd/Quality.h
98 ↗(On Diff #151322)

Done.

(If you want to dump this intermediate score, you can recompute it in operator<<, I'm not sure it's necessary).

I think the proximity score would be useful for debugging, no?

unittests/clangd/TestFS.cpp
66 ↗(On Diff #151322)

Good idea.

107 ↗(On Diff #151322)

yeah... this pattern is also used in tooling::CompilationDatabase (e.g. https://github.com/llvm-mirror/clang/blob/master/lib/Tooling/CompilationDatabase.cpp#L398), and I'm not aware of a good way to deal without alwayslink.

If so, can we move the declaration to TestFS.h and give a usage example, so the consuming libraries don't have to repeat the decl?

Done.

sammccall accepted this revision.Jun 15 2018, 12:41 AM

Thanks! Just nits

clangd/Quality.cpp
304 ↗(On Diff #151358)

No camel case here, just words

305 ↗(On Diff #151358)

Could this just be inlined like the others?

Index proximity: 0.5 (ProximityRoots{foo/bar.h})

clangd/Quality.h
71 ↗(On Diff #151358)

nit: can we forward declare this here and move it down (e.g. above TopN) to keep the signals at the top?
(I suspect it'll end up in another header eventually)

99 ↗(On Diff #151358)

nit: just SymbolURI (signals should be conceptually source-independent, may be missing)

101 ↗(On Diff #151358)

can you add a FIXME to unify with index proximity score? signals should be source-independent

unittests/clangd/TestFS.h
60 ↗(On Diff #151358)

document the unittest: scheme here?

This revision is now accepted and ready to land.Jun 15 2018, 12:41 AM
ioeric updated this revision to Diff 151469.Jun 15 2018, 2:00 AM
ioeric marked 6 inline comments as done.
  • addressed review comments and rebase.

Thanks for the review!

This revision was automatically updated to reflect the committed changes.