This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Boost scores for decls from current file in completion
ClosedPublic

Authored by ilya-biryukov on May 16 2018, 7:09 AM.

Event Timeline

ilya-biryukov created this revision.May 16 2018, 7:09 AM
ioeric added inline comments.May 16 2018, 7:32 AM
clangd/Quality.cpp
34

Could you explain why AllDeclsInMainFile is necessary? I think it might still be fine to boost score for symbols with a declaration in the main file even if it has redecls in other files (e.g. function fwd in headers).

sammccall added inline comments.May 16 2018, 9:26 AM
clangd/AST.h
32 ↗(On Diff #147083)

Do you expect this to be reused?
If so, unit test? Otherwise this seems small enough to move where it's used.

clangd/Quality.cpp
34

Agree. I think the better signal is (any) decl in main file.

unittests/clangd/CodeCompleteTests.cpp
965 ↗(On Diff #147083)

This should really be unit tested in QualityTests. I think instead rather than in addition - I'd like to get away from code completion ranking tests for every quality signal, it's really indirect and hard to maintain/review.

966 ↗(On Diff #147083)

(in QualityTests, we should be able to continue using TestTU)

ilya-biryukov added inline comments.May 17 2018, 4:17 AM
clangd/Quality.cpp
34

My intuition was that it does not make any sense to functions if there definitions are in the same cpp file, because it does not give any useful signals on whether those should be preferably called in the same file.
Also, some defs may not be visible to the compiler at the point of completion and, therefore, won't get a boost, even if they are in the same file. This is inconsistent.

E.g.,

// === foo.h
class Foo {
  int foo();
  int bar();
  int baz();
  int test();
};

int glob_foo();
int glob_bar();
int glob_baz();

// === foo.cpp
#include "foo.h"
static some_local_func() {}

Foo::foo() {
};

int ::glob_foo() {
}

Foo::test() {
   ^
   // out of all functions declared in headers, only foo and global_foo will get a boost here. That does not make much sense, since:
   // - glob_bar() and Foo::bar() are also declared in the same file, but compiler hasn't seen them yet, so they won't get a boost.
   // - glob_baz() and Foo::baz() are not declared in the same file, but they are so close to `bar` in the header, 
   //   and it doesn't seem to make sense to rank them differently.
}

Foo::bar() {
}

int ::glob_bar() {
}

Why upranking decls only in the current file is still useful?

  • Those are usually helpers that are very relevant to the file. Especially true for small files.
  • As a side-effect, we'll uprank local vars and parameters (they can't have decls in other files :-)), that seems useful. Maybe such implicit effects are not desirable, though.

I should've definitely documented this better. If we decide this change is useful, I'll add more docs.

  • Move tests to QualityTests.cpp
  • Fix findDecl() assertion on redecls of the same thing
  • Fix file comment of TestTU.cpp
ilya-biryukov marked 2 inline comments as done.May 17 2018, 5:47 AM
ilya-biryukov added inline comments.May 17 2018, 5:50 AM
clangd/AST.h
32 ↗(On Diff #147083)

I'd say we could reuse it, yes. Even though the function is pretty straightforward, there are details that should be consistent across the whole codebase (i.e. which locations to check, spelling, expansion?)
I'll add unittests.

ioeric added inline comments.May 17 2018, 7:14 AM
clangd/Quality.cpp
34

My intuition was that it does not make any sense to functions if there definitions are in the same cpp file, because it does not give any useful signals on whether those should be preferably called in the same file.

Not sure if I understand this. Could you explain why?

Also, some defs may not be visible to the compiler at the point of completion and, therefore, won't get a boost, even if they are in the same file. This is inconsistent.

I think this is somewhat expected as that symbol might not be visible to me e.g. a helper function defined after me.

out of all functions declared in headers, only foo and global_foo will get a boost here. That does not make much sense, since: ....

I think you made a very good point here: for a .cc main file, symbols declared/defined in the main header (e.g. x.h for x.cc) are also interesting and should get a boost, so instead of only boosting symbols that are only declared in the main file, I would suggest also boost symbols in its corresponding header. We would need some heuristics to identify main header though (clang-format uses base file name sans extension which seems to work pretty well).

Thanks for the example! I think it's very useful for discussions.

ioeric added inline comments.May 17 2018, 7:19 AM
clangd/Quality.cpp
34

I think this is somewhat expected as that symbol might not be visible to me e.g. a helper function defined after me.

By "me", I mean I am a decl ;)

ilya-biryukov added inline comments.May 17 2018, 7:31 AM
clangd/Quality.cpp
34

My intuition was that it does not make any sense to functions if there definitions are in the same cpp file, because it does not give any useful signals on whether those should be preferably called in the same file.

Not sure if I understand this. Could you explain why?

If I'm implementing a class, relevance of all class methods for me is probably the same, no matter if I have a definition for some of those in the current file or not.
Probably the same point you make later about boosting symbols in the matching header.

I think this is somewhat expected as that symbol might not be visible to me e.g. a helper function defined after me.

A helper function defined later only in the current file is not visible in completion and this is (arguably) fine, since you can't refer to it anymore.
The function defined in the header, though, is already visible, and ranking it differently based on whether it has another declaration later in the current file is weird (e.g. Foo::bar and glob_bar from example above).

symbols declared/defined in the main header (e.g. x.h for x.cc) are also interesting and should get a boost

That sounds like a great idea, will try to implement it. We also have a logic in clangd to find the matching header for a source file, I'll maybe also try to unify relevant parts with clang-format.

  • Move helpers tha switch header and source into separate files.
  • Also uprank items coming from the matching headers

I've added an initial version of testing for the matching header and wanted to get feedback before proceeding further with tests and other changes.

A few things that bug me so far:

  • We need to match headers of items from the index, not only from the Sema results.
  • Symbols store their paths as URIs ⇒ we need to parse them in order to apply heuristics. We could avoid that by writing a version of header-matching that also works on URIs, but that would mean more complexity.
  • Merging quality signals from headers now requires an extra paramater: name of the source file. I wonder if it's worth extracting builders for symbol qualities into separate classes to keep the current nice signatures, i.e. merge(Symbol& IndexResult).
  • How should we match the header with the main file? Our options are:
    • (proposed by Eric) use main file regex from clang-format for that. I'm not entirely sure it suits us well, since the regex is designed to work on paths inside #include directive, but we're getting ours from the Symbols and Sema AST Decls. Moreover, it means we're gonna read .clang-format files to get that style.
    • Come up with our own heuristics. There is a similar place in ClangdServer that matches a header with source and back. We could extend those heuristics to also allow figuring out whether the paths are matching header/source. I chose this option for initial implementation, since it's less work and it seems easier to switch to clang-format's regex later.

I've added an initial version of testing for the matching header and wanted to get feedback before proceeding further with tests and other changes.

A few things that bug me so far:

  • We need to match headers of items from the index, not only from the Sema results.

This sounds reasonable.

  • Symbols store their paths as URIs ⇒ we need to parse them in order to apply heuristics. We could avoid that by writing a version of header-matching that also works on URIs, but that would mean more complexity.

Yeah, this is a bit unfortunate. It's probably OK to parse URIs; they are not that expensive after all (we might want to do some measurement though).

  • Merging quality signals from headers now requires an extra paramater: name of the source file. I wonder if it's worth extracting builders for symbol qualities into separate classes to keep the current nice signatures, i.e. merge(Symbol& IndexResult).

I'm not very familiar with SymbolQualitySignals. But if we decide to use main file name as a signal, it might make sense to pass it in through the constructor?

  • How should we match the header with the main file? Our options are:
    • (proposed by Eric) use main file regex from clang-format for that. I'm not entirely sure it suits us well, since the regex is designed to work on paths inside #include directive, but we're getting ours from the Symbols and Sema AST Decls. Moreover, it means we're gonna read .clang-format files to get that style.

I think the ".clang-format problem" is not specific to the header matching here. We would eventually need proper format style support in clangd anyway, as clangd provides formatting features (e.g. reformat and include insertion).

  • Come up with our own heuristics. There is a similar place in ClangdServer that matches a header with source and back. We could extend those heuristics to also allow figuring out whether the paths are matching header/source. I chose this option for initial implementation, since it's less work and it seems easier to switch to clang-format's regex later.

Not against this option. Just want to point out that the heuristics would not work for test files (that include tested main headers) with the current matching.

clangd/MatchingHeader.cpp
44 ↗(On Diff #148006)

Why equals_lower?

clangd/MatchingHeader.h
1 ↗(On Diff #148006)

I wonder if we could merge this into Headers.h

24 ↗(On Diff #148006)

We might want to briefly explain what a matching header is and what the heuristics are.

clangd/Quality.cpp
24–25

Could we merge these two flags into something like IsDeclaredInMainHeaderOrFile?

36

I wonder if it's still necessary to check all redecls. Would it be sufficient to check D, if D is the decl we referencing to?

  • Symbols store their paths as URIs ⇒ we need to parse them in order to apply heuristics. We could avoid that by writing a version of header-matching that also works on URIs, but that would mean more complexity.

Yeah, this is a bit unfortunate. It's probably OK to parse URIs; they are not that expensive after all (we might want to do some measurement though).

Yeah. I really wish we had microbenchmarks for things like completion.

  • Merging quality signals from headers now requires an extra paramater: name of the source file. I wonder if it's worth extracting builders for symbol qualities into separate classes to keep the current nice signatures, i.e. merge(Symbol& IndexResult).

I'm not very familiar with SymbolQualitySignals. But if we decide to use main file name as a signal, it might make sense to pass it in through the constructor?

That's possible, but that essentially turns SymbolQualitySignals from a data class to a stateful builder.

  • How should we match the header with the main file? Our options are:
    • (proposed by Eric) use main file regex from clang-format for that. I'm not entirely sure it suits us well, since the regex is designed to work on paths inside #include directive, but we're getting ours from the Symbols and Sema AST Decls. Moreover, it means we're gonna read .clang-format files to get that style.

I think the ".clang-format problem" is not specific to the header matching here. We would eventually need proper format style support in clangd anyway, as clangd provides formatting features (e.g. reformat and include insertion).

Yeah, IncludeMainRegex does look like a useful setting from clang-format. And maybe using clang-format settings is a good idea there. I'm just a bit weary of adding this logic in this change along with the completion changes.
So I'd go with a simple heuristic for starters to solve a problem at hand. Happy to improve it to use clang-format regex with a follow-up change if everyone agrees that's a good idea. I mostly feel uneasy about the added complexity to the current change.

clangd/MatchingHeader.cpp
44 ↗(On Diff #148006)

A former-windows-developer habbit. I don't think it hurts in any way if we do equals_lower here, we'll merely work in those strange cases where the extensions are not lower-case.
This is also consistent with isHeaderFile/isSourceFile (moved from ClangdServer)

clangd/MatchingHeader.h
1 ↗(On Diff #148006)

Thanks, I totally forgot we have Headers.h. Will move the helpers there.

24 ↗(On Diff #148006)

My idea was to not elaborate before we agree on what those heuristics should be.
Given the heuristics are so simple and there are suggestions to change them, documenting the current behavior seems like a bad idea. I'd rather keep them a black box for now.
Does that make sense? Maybe add a comment that the heuristics are likely to change, so the users shouldn't rely on them too much?

clangd/Quality.cpp
24–25

The two flags are built differently.
Any decl in the matching header gives a boost. Otherwise, all decls should be in the main file to get a boost.
The second one is built differently for the reasons outlined in the previous comments on why it might not be the best idea to boost completion items that have one of the inside the current file:

  • It gives inconsistent ranking for different completion points (before/after declaration)
  • The fact that a function has definition in the current file does not necessarily mean I want to call it more often than the others.
36

I don't think it's sufficient. How could we compute the flags that this function returns by looking at just a single decl?

After an offline chat: we decided to boost sema results that have any decls in the main file. Even though it introduces some unwanted inconsistencies in some cases (e.g. see the comment), most of us agree that's a very simple implementation that does boost useful things, including stuff from association header.
To get better results, we need to design and build the real proximity scoring that handles associated headers and other things.

  • Simplify the change by boosting any Decls that have a redecl in the current file
sammccall accepted this revision.May 24 2018, 9:04 AM

Nice, simple and will admit refinements later.

Just test nits and a trivial organizational thing.

clangd/Quality.cpp
22

nit: per coding style use static for functions
(I'm not sure it's a great rule, but since the ns only has this function...)

clangd/Quality.h
52

this belongs in SymbolRelevanceSignals rather than this struct. In principle, SymbolQualitySignals should all be stuff we can store in a global index (which is the point of the split). I should probably add a comment to that effect :-)

You could be a little more specific on the semantics, e.g. "Proximity between the best declaration and the query location. [0-1] score where 1 is closest."

unittests/clangd/QualityTests.cpp
125

please add a test for proximity here

154

consider merging into SymbolRelevanceSignalsExtraction test, which tests the same entrypoint.

If not, move up next to that one.

162

this is not needed, the #include is implicit in TestTU

(and so you don't need to specify HeaderFilename either)

188

this tests end-to-end, but the other tests verify input -> signals and signal -> score separately.

I'd prefer to keep (only) doing that, for consistency and because it's important we know/assert precisely what each half does so we can actually debug.

unittests/clangd/TestTU.cpp
80 ↗(On Diff #148381)

well, I definitely wanted to flag this as an error (for the tests where this function was introduced).

Actually I think this is wrong for your test anyway - don't you want to test that no matter which decl the code completion result refers to, you get the same boost?

I'd suggest adding a findDecls() function that returns a vector<NamedDecl*>, and implementing findDecl() in terms of it. In the test_func_in_header_and_cpp test, you could loop over findDecls() and run the assertions each time.

(I guess findDecls() should assert that the returned vector is non-empty? Slightly weird but might catch accidentally vacuous tests)

This revision is now accepted and ready to land.May 24 2018, 9:04 AM
ilya-biryukov marked 5 inline comments as done.
  • Address review comments
ilya-biryukov added inline comments.Jun 4 2018, 7:54 AM
clangd/Quality.h
52

Hah, I totally missed the SymbolRelevanceSignals. Thanks, moved it there and added a comment per suggestions.

unittests/clangd/QualityTests.cpp
154

Merged them together. It is now a bit more verbose. In case you have more suggestions on how to properly test this, I'm happy to address them in a follow-up change.

162

Done.

I did not expect the TestTU to do that, actually. Is this implicit #include something we want?
Maybe we should just have a default convention for naming the headers instead and the default Code value that only includes the header?
E.g. say that a file test_header.h is provided by TestTU and let the tests specify #include "test_header.h" if needed. WDYT?

unittests/clangd/TestTU.cpp
80 ↗(On Diff #148381)

This function got reimplemented in one of the recent changes and works for header decls too now.
I actually think it's fine if it returns one of the overladed Decls in case there are multiple overloads (i.e. multiple Decls in the AST) for the same function. It does assert for multiple overloads of the same thing currently, which is probably what we want.

This revision was automatically updated to reflect the committed changes.

oops, just saw this is closed - will fix these nits as I'm touching this file soon

clangd/Quality.h
68

(nit: trailing .)

unittests/clangd/QualityTests.cpp
139

nit: can you EXPECT_GT and reverse the order? (default is on the right in all tests above, even the positive signals for symbolquality)

sammccall added inline comments.Jun 4 2018, 8:10 AM
unittests/clangd/QualityTests.cpp
162

Missed this - definitely making sure there's a #include in the cc file that lines up to the name/location of the header file is one of the main things I wanted this helper to do. Maybe the documentation can be more clear?

ilya-biryukov added inline comments.Jun 5 2018, 3:29 AM
unittests/clangd/QualityTests.cpp
139

I always use < and <= in all the code I write. That gives the "natural" order of the arguments. (argument to the left is less)
So it's completely opposite for me: I find the Default.evaluate() on the left side to be more readable.

Feel free to change it, though. Just wanted to point out this style can be more or less readable based on the personal preferences.

162

SG, I'd definitely mention that the file is included and the way it's done, i.e. via -include flag passed to clang.