Also move unittest: URI scheme to TestFS so that it can be shared by
different tests.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Can you benchmark this? I'm nervous about the URI stuff in the hot path.
Timing CodeCompleteFlow::measureResults before/after with index enabled seems like a reasonable test.
(But you might want to make this apply to sema first too for realistic numbers?)
clangd/Quality.cpp | ||
---|---|---|
203 ↗ | (On Diff #150477) | how do we know proximitypath is set at this point? |
208 ↗ | (On Diff #150477) | Why U->toString() rather than ->body()? |
215 ↗ | (On Diff #150477) | proximity path needs to be set here too |
clangd/Quality.h | ||
74 ↗ | (On Diff #150477) | It seems OK to have ProximityPath or ProximityScore, but we shouldn't have both: drop proximityscore and calculate it during evaluate()? |
Oops, couple more comments.
But the big things I think are:
- what's the performance impact of doing all this work (including the URI stuff) inside the scoring loop?
- what's the most useful formula for the proximity score
clangd/Quality.cpp | ||
---|---|---|
171 ↗ | (On Diff #150477) | This doesn't look quite right to me. FWIW, fozzie appears to give an additive boost proportional to 5-up, where up is the number of directories from the context you have to traverse up from the context to get to a parent of the symbol. (There's no penalty for down-traversals probably for implementation reasons, this should be smaller than the up-traversal penalty I think) |
clangd/Quality.h | ||
74 ↗ | (On Diff #150477) | what's the plan for associated-header? should this be a smallvector<2>? |
Sure! I had some numbers but they are on some paper that I don't access to right now... will collect some new figures (with URI manipulations in sema).
clangd/Quality.cpp | ||
---|---|---|
171 ↗ | (On Diff #150477) | The numbers are guessed... definitely happy to tune.
It's 18 directories away if one file is in an ancestor directories of the other (i.e. only traverse up or down). If you need to traverse up and down, the penalty for each directory is 0.1, which takes 10 directories (up+down, so I'm not sure if we should use zero as the neutral score. For example, if a codebase has deep directory structure, most scores are probably going to be small; conversely, most scores would be relatively big. I think relative scores are more useful.
Why do you think down-traversal should take less penalty? |
208 ↗ | (On Diff #150477) | Because both URIs need to be parsed in order to use body(). Here we don't parse SymURI. |
215 ↗ | (On Diff #150477) | Alternatively, I wonder if we could give sema result a fixed proximity score as they are symbols that are already included? |
clangd/Quality.h | ||
74 ↗ | (On Diff #150477) | Just want to make sure I understand. We would copy the symbol URI to use in merge right? |
74 ↗ | (On Diff #150477) | I think it should be easy to change this to vector when it's actually needed? |
Here are some numbers by completing "clang::^" 40 times (with result limit 1000 instead of 100).
Timing in CodeCompleteFlow::measureResults
Before: Avg: 1811 us Med: 1792 us After: Avg: 2714 us Med: 2689 us
As a reference, a full CodeCompleteFlow (with 1000 candidates) takes ~70 ms (using LLVM's yaml index).
So, with the current limit of 100 results, the increase for measureResults should be roughly 0.18ms -> 0.27ms, which I think is reasonable.
PTAL
clangd/Quality.cpp | ||
---|---|---|
215 ↗ | (On Diff #150477) | As discussed offline, sema symbols now have a fixed proximity score (not entirely sure about the value though). |
clangd/Quality.h | ||
74 ↗ | (On Diff #150477) | Changed to vector anyway... |
74 ↗ | (On Diff #150477) | Experimented with this a bit (removing ProximityScore). As we print the proximity score for debugging, we would still want to keep the store. Alternatively, I made the proximity paths a parameter of merge as we only use them for index result anyway. |
Sorry for the delay on this change. There's a bunch of complexity in this problem that I haven't seen how to slice through:
- the signals needed seem like a weird fit for the Symbol*Signals structs for some reason (maybe my misdesign)
- the inconsistency between how we do this for Sema and for Index results has... only slightly good reasons
- the URI vs filename thing is awkward
- with all this, the actual scoring still seems ad-hoc and is missing important parts (main header, transitive includes)
Not all your fault that the code reflects this, the problem is tangly. But it's hard for me to reason about APIs or performance or layering.
Looking at the last point (scoring model) because it seems the most tractable. I think this is basically an edit distance problem?
(We can call the result "proximity", start at one, and multiply by <1, or call it "distance" and start at 0 and add penalties, but it's equivalent).
- we're computing distances between files (glossing over URI-space vs file-space)
- the roots are the main file, and maybe the matching header
- edits take us from a filepath to a related filepath:
- from a file to a header it #includes
- from a file to its parent directory
- from a parent directory to a child directory
- from a parent directory to a file in it
- the distance is the smallest sum-of-penalties for any path leading from the root to the symbol
What do you think of this model?
If the model seems reasonable, then it suggests an approach of building a one-per-query data structure that computes the needed edit-distance recursively, memoizing results. SymbolRelevanceResults could store the symbol path and a pointer to the edit-distance machine, and for debugging the machine would know how to describe its configuration. URI/path mapping wouldn't be a performance concern (I think) if the memoization captured it.
Let's chat offline?
clangd/Quality.cpp | ||
---|---|---|
171 ↗ | (On Diff #150477) |
I think you've halved twice there - it still seems to be 10, which is a lot.
Well, zero is currently the neutral score, and this patch doesn't change it :-)
Intuitively, because subprojects are more closely related than superprojects. But this didn't occur to me until someone mentioned it, we should check with Matei and Alexander. |
- Introduced a one-per-query structure for relevance signals; use multiplication for proximity; simplify tests a bit; separate index and sema proximity scores.
According to offline discussion, I added a structure SymbolRelevanceContext that captures per-query signals like proximity paths. Not sure about the name though.
- the inconsistency between how we do this for Sema and for Index results has... only slightly good reasons
The proximity scores for index and sema are now explicitly separated to make it easier to understand and debug.
- the URI vs filename thing is awkward
- with all this, the actual scoring still seems ad-hoc and is missing important parts (main header, transitive includes)
Not all your fault that the code reflects this, the problem is tangly. But it's hard for me to reason about APIs or performance or layering.
Looking at the last point (scoring model) because it seems the most tractable. I think this is basically an edit distance problem?
(We can call the result "proximity", start at one, and multiply by <1, or call it "distance" and start at 0 and add penalties, but it's equivalent).
- we're computing distances between files (glossing over URI-space vs file-space)
- the roots are the main file, and maybe the matching header
- edits take us from a filepath to a related filepath:
- from a file to a header it #includes
- from a file to its parent directory
- from a parent directory to a child directory
- from a parent directory to a file in it
- the distance is the smallest sum-of-penalties for any path leading from the root to the symbol
What do you think of this model?
If the model seems reasonable, then it suggests an approach of building a one-per-query data structure that computes the needed edit-distance recursively, memoizing results. SymbolRelevanceResults could store the symbol path and a pointer to the edit-distance machine, and for debugging the machine would know how to describe its configuration. URI/path mapping wouldn't be a performance concern (I think) if the memoization captured it.
I like how this model addresses the proximity for src/ and include/ setup. I think we could start with something simple and iterate, although I agree that we should strike for a design that would be easy to replace the proximity algorithm in the future.
Let's chat offline?
clangd/Quality.cpp | ||
---|---|---|
171 ↗ | (On Diff #150477) |
OK you are right. It would behave badly when there are many ups and only one down.
Sounds good. Picked p=0.7 which seems to give reasonable scores. |
Thanks, this looks much clearer/more modular/more extensible to me!
A couple of notes on the abstractions before digging into details again.
clangd/Quality.h | ||
---|---|---|
71 ↗ | (On Diff #151172) | This is ambiguously a couple of different (and good!) things:
There's a place for both of these, but I'd argue for separating them (and only doing the second in this patch). Reasons:
If this makes sense to you then I think this class looks great but should be called something specific like FileProximityMatcher. |
91 ↗ | (On Diff #151172) | One of the simplifying assumptions in the model is that all signals are optional - can we make Context a pointer = nullptr and drop the constructor? |
clangd/Quality.h | ||
---|---|---|
71 ↗ | (On Diff #151172) | Sounds good. Thanks for the explanation! |
Thanks, just details now!
clangd/Quality.cpp | ||
---|---|---|
185 ↗ | (On Diff #151322) | why is this a special case?
I don't doubt the assertion that these are unrelated paths, but I'm not sure fixing just this case is an improvement overall. (In a perfect world, we'd define the algorithm so that this case yields 0 without a discontinuity) |
216 ↗ | (On Diff #151322) | For composability, you could consider styling more tersely e.g. as ProximityPaths{/path/to/file}, and in the RelevanceSignals operator<< including it like other fields, yielding: == Symbol relevance: 0.8 == Name match: 0.7 File proximity matcher: ProximityPaths{/path/to/file} ... |
clangd/Quality.h | ||
78 ↗ | (On Diff #151322) | Should mention the semantics of the score, maybe via the other extreme: when the SymbolURI exactly matches a proximity path, score is 1. |
98 ↗ | (On Diff #151322) | This is redundant with (IndexSymbolURI, FileProximityMatch) I think, and will only be correctly set if FileProximityMatch is set before calling merge(Symbol). Can we defer calculation until evaluate()? |
unittests/clangd/TestFS.cpp | ||
66 ↗ | (On Diff #151322) | These helpers would be more coherent if this used the same test root as above - any reason we can't do that? Then this comment could just be "unittest: is a scheme that refers to files relative to testRoot()" |
107 ↗ | (On Diff #151322) | This is really surprising to me - is this the common pattern for registries? If so, can we move the declaration to TestFS.h and give a usage example, so the consuming libraries don't have to repeat the decl? |
- addressed review comments
clangd/Quality.cpp | ||
---|---|---|
185 ↗ | (On Diff #151322) | The intuition is that when we hit the root, it's very likely that we are switching projects. But we could leave this out of the patch and evaluate whether this is an improvement later. |
clangd/Quality.h | ||
98 ↗ | (On Diff #151322) | Done.
I think the proximity score would be useful for debugging, no? |
unittests/clangd/TestFS.cpp | ||
66 ↗ | (On Diff #151322) | Good idea. |
107 ↗ | (On Diff #151322) | yeah... this pattern is also used in tooling::CompilationDatabase (e.g. https://github.com/llvm-mirror/clang/blob/master/lib/Tooling/CompilationDatabase.cpp#L398), and I'm not aware of a good way to deal without alwayslink.
Done. |
Thanks! Just nits
clangd/Quality.cpp | ||
---|---|---|
304 ↗ | (On Diff #151358) | No camel case here, just words |
305 ↗ | (On Diff #151358) | Could this just be inlined like the others? Index proximity: 0.5 (ProximityRoots{foo/bar.h}) |
clangd/Quality.h | ||
71 ↗ | (On Diff #151358) | nit: can we forward declare this here and move it down (e.g. above TopN) to keep the signals at the top? |
99 ↗ | (On Diff #151358) | nit: just SymbolURI (signals should be conceptually source-independent, may be missing) |
101 ↗ | (On Diff #151358) | can you add a FIXME to unify with index proximity score? signals should be source-independent |
unittests/clangd/TestFS.h | ||
60 ↗ | (On Diff #151358) | document the unittest: scheme here? |