Provides facilities to model the C++ conversion rules without the AST.
The introduced representation can be stored in the index and used to
implement type-based ranking improvements for index-based completions.
Details
Diff Detail
- Repository
- rCTE Clang Tools Extra
- Build Status
Buildable 22858 Build 22858: arc lint + arc unit
Event Timeline
The implementation might look a bit scary, please feel free to ask for comments/clarifications!
clangd/ExpectedTypes.h | ||
---|---|---|
119 | I assume this will be controversial. Happy to discuss/change. |
This seems very clever, but extremely complicated - you've implemented much of C++'s conversion logic, it's not clear to me which parts are actually necessary to completion quality.
(Honestly this applies to expected types overall - it seems intuitively likely that it's a good signal, it seems less obvious that it pulls its weight if it can't be made simple).
From the outside it seems much of it is YAGNI, and if we do then we need to build it up slowly with an eye for maintainability.
Can we start with expected type boosting (no conversions) as previously discussed, and later measure which other parts make a difference? (I think we'll need/want the simple model anyway, for this to work with Dex and other token-based indexes).
clangd/ExpectedTypes.h | ||
---|---|---|
66 | While a hash of a string might be a reasonable choice in the long term, I worry about debuggability. (With SymbolID we can just look up the symbol). You could make the hashing an implementation detail of the index, and have the APIs speak in terms of opaque strings. But that forces the index to be able to report the full opaque string of each returned symbol (for scoring), so the index now has to have a lookup table... messy. Another fun thing about this representation is that you're storing 20 bytes of data (+ overhead) for common types like "void" where we could get away with one. | |
66 | in the *short run* I'd suggest just printing the type name and using that as the representation. | |
68 | this represents a type (in the c++ sense), not a conversion, right? | |
69 | "convertible (using equality)" is confusing. It sounds like "this is actually an equivalence class of types" but I think that's not true, because it's not symmetric. Isn't the model here just "SType is a serializable token representing a type. They can be compared for equality." | |
72 | Is this a placeholder name? It's not clear what it means. Suggest OpaqueType or ExpectedType | |
81 | can we separate "get the representative set of types for R" from "encode them as SType"? (I think at least the former should be a non-member function BTW, to keep clear that SType itself isn't aware of any clever folding or whatnot) | |
82 | coupling to CompletionResult seems premature here, can we stick to passing getExpectedType() until we know that abstraction needs to be broken? | |
91 | I don't understand the scale here. If better conversions get higher numbers, what number does "no conversion" get? this all seems like YAGNI. Will a set do for now? | |
213 | why is implementing one of these directions not enough? It should probably be: So it seems the minimal set you need for handling pointer to base is Type getRepresentative(Type) and set<Type> getRepresentativesAfterConversion(Type) or so... | |
213 | names are unclear: is collectConvertibleFrom(T) the convertible-from types for T (i.e the types T is convertible from), or the types that are convertible from T? |
+1 to a simpler model.
As chatted offline, I think the return type can be split into multiple orthogonal signals. For example, const T & can be split into 3 independent signals {const, type T, reference}. I think this can make the reasoning of boosting/scoring easier for both index and code completion. Agree with Sam that we should start with something simple (e.g. type matching without conversing) and land basic components to make further evaluation possible.
This seems very clever, but extremely complicated - you've implemented much of C++'s conversion logic, it's not clear to me which parts are actually necessary to completion quality.
Clearly the model that supports C++ conversions is something that will improve code completion quality.
I do agree it's not trivial, but would argue we at least want:
- qualification conversions (i.e. adding const)
- user-defined conversions (e.g. operator bool is commonly useful think)
- derived-to-base conversions (Derived* should convert to Base*)
Without those, we don't support a bunch of useful cases.
As chatted offline, I think the return type can be split into multiple orthogonal signals. For example, const T & can be split into 3 independent signals {const, type T, reference}. I think this can make the reasoning of boosting/scoring easier for both index and code completion. Agree with Sam that we should start with something simple (e.g. type matching without conversing) and land basic components to make further evaluation possible.
Yeah, I do keep it in mind and I think it's a great idea. E.g., we can put all numeric types into one equivalence class and get rid of all numeric conversions.
That adds some complexity to the interface, though, I wanted to measure how the trivial solution (enumerate all types) works. To make sure we actually can't get away without it.
clangd/ExpectedTypes.h | ||
---|---|---|
68 | It's an "expression" with an extra data with some extra data (whether the user conversion was applied to get this expression) | |
82 | There's some useful logic that is tied to completion results, e.g. to extract function return type CompletionResult. | |
213 | Derived-to-base and user conversions. We can't enumerate all derived classes for some type, so instead need to enumerate all bases when adding a symbol to the index. |
Happy to speculate about what might work here, but I strongly believe the path forward here is to build the simplest version of this feature, without conversions, and try to avoid complicated conversion logic if we can get most of the benefit in simpler ways.
It's not clear that will be significant. This isn't hard to measure, so I'm not sure why we should guess. And I'm not sure why it all has to go in the first patch.
I do agree it's not trivial, but would argue we at least want:
- qualification conversions (i.e. adding const)
Another approach here is just always dropping const. (And refs, and so on). This will create some false positives, but maybe they don't hurt much. This handles some true cases too, like invoking copy constructors.
- user-defined conversions (e.g. operator bool is commonly useful think)
My guess is you're not going to measure a difference here, bool has lots of false positives and others are rare.
- derived-to-base conversions (Derived* should convert to Base*)
Yes, probably. If this ends up being the only "chain" we have to follow, we're probably in good shape complexity-wise.
I've run the measurements on a highly simplified vs the original complicated model and got roughly the same results wrt to ranking improvements, so sending a new version of the patch with highly simplified mode for the type representation.
I believe there are still gains to be had from a more thorough treatment of C++ conversions, but there is definitely much to do in other areas that should provide better ground for seeing the actual improvements with the more complicated model.
In any case, starting with something simple is definitely a better ground. Thanks for initial review and suggestions!
And please take a look at the new version, it is significantly simpler and should be pretty easy to review :-)
What is the goal for doing this without the AST? Is the goal to not have to keep the AST and save memory?
clangd/ExpectedTypes.cpp | ||
---|---|---|
28 | maybe add a comment what ValueDecl covers roughly? E.g. functions, classes, variables etc. | |
41 | IIUC, we also encode the qualifiers into the final representation? If so, have you considered the underlying type without qualifiers? It seems to me this might be too restrictive for type-based boosting. For code completion ranking, I think type qualifiers (const etc) can be separate signals. | |
clangd/ExpectedTypes.h | ||
11 | We might want to formalize what "convertible" means here. E.g. does it cover conversion between base and derived class? Does it cover double <-> int conversion? | |
30 | The name seems opaque ;) Why is it opaque? | |
38 | why "preferred type"? maybe add a comment? | |
41 | What is the raw representation? A hash or the type name or USR? |
@ioeric, thanks for the review round!
Answering the most important comments, will shortly send changes to actually address the rest.
clangd/ExpectedTypes.cpp | ||
---|---|---|
41 | This function's responsibility is to encode the type. There is code to strip the qualifiers from the types in toEquivClass. | |
clangd/ExpectedTypes.h | ||
11 | I want to leave it vague for now. Convertible means whatever we think is good for code completion ranking. Happy to add the docs, but they'll probably get outdated on every change. Reading the code is actually simpler to get what's going on at this point. | |
38 | That's the terminology that clang uses for completion's context type. Will add a comment, thanks! | |
41 | A string representation of the usr, but users shouldn't rely on it. |
clangd/ExpectedTypes.cpp | ||
---|---|---|
13 | returning QualType vs Type*? It seems we strip all qualifiers, seems clearest for the return type to reflect that. | |
16 | Maybe we want Ctx.getUnqualifiedArrayType here or (more likely?) do array-to-pointer decay? | |
17 | wow, "enumeral" might be my favorite c++-made-up word, displacing "emplace"... | |
26 | nit: dyn_cast_or_null below instead? | |
31 | nit: is canonicalization necessary here? you do it in toEquivClass | |
34 | nit: I'd put the special case in the if() block, but up to you | |
38 | dropping references seems redundant here, as you do it again later | |
47 | I think ultimately we may want to replace this with a custom walker:
not something for this patch, but maybe a FIXME? | |
72 | can you reuse fromPreferredType for the rest? | |
clangd/ExpectedTypes.h | ||
33 | Does this need to be a separate class rather than using std::string?
I suspect at least initially producing a somewhat readable std::string a la USRGeneration would be enough. | |
unittests/clangd/ExpectedTypeTest.cpp | ||
80 | note that if you think it's useful you can To.dump(*L->stream()) | |
93 | I really like the declarative equivalence-class setup of the tests. A couple of suggestions:
I'd consider actually building up the equivalence classes map<OpaqueType, set</*decl*/string>> and writing a MATCHER_P2(ClassesAre, /*vector<set<string>>*/Classes, /*ParsedAST*/AST, "classes are " + testing::PrintToString(Classes)) That way the actual and expected equivalence classes will be dumped on failure, and you can still grab the decls/types from the AST to dump their details. |
clangd/ExpectedTypes.cpp | ||
---|---|---|
13 | Done. That produces a bit more trouble at the callsites, so not sure if it's an improvement overall. | |
16 | Added array-to-pointer decays, they should improve ranking when assigning from an array to a pointer, which is nice. | |
17 | ¯\_(ツ)_/¯ | |
31 | It was not important, removed it. | |
47 | USRs actually seems like a pretty good fit here. I'm not sure dropping attributes for internal types would make a big difference in the scoring and not sure how big of a problem the strings are, would be nice to actually learn it's a problem (in memory consumption, memory alloc rates, etc) before changing this. It's definitely possible to do that, of course, we have a room to change the encoding whenever we want, but would avoid adding a FIXME and committing to this approach in the initial patch. | |
clangd/ExpectedTypes.h | ||
11 | Added a clarification that we want "convertible for the purpose of code completion". | |
30 | Removed the "opaque" from the comment, hopefully this causes less confusion. | |
33 | Would still want to keep it as a marker type just for the sake of indicating what we return and documentation purposes. There's still an option to go strings with rawStr() if needed. | |
41 | Clarified that we leave a room for ourselves to change the encoding we use. | |
unittests/clangd/ExpectedTypeTest.cpp | ||
80 | From the personal experience, looking at the string representation is usually to figure out what's wrong and dumping wouldn't actually help. | |
93 | Thanks, this approach works most of the time. |
clangd/ExpectedTypes.h | ||
---|---|---|
16 | I think this largely rehashes the second sentence of the above para. I'd suggest this one focus more closely on what our model *is*: We define an encoding of AST types as opaque strings, which can be stored in the index. Similar types (such as `string` and `const string&`) are folded together, forming equivalence classes with the same encoding. | |
19 | ("stable" might suggest across versions) | |
33 | For documentation purposes, using OpaqueType = std::string or so seems like a reasonable compromise? This is very heavyweight for the amount of typesafety we get. | |
43 | I'd suggest just fromType, exposing this as the primary method, and then on fromCompletionResult document why it's different. Having the names suggest the underlying structure (that fromType is "more fundamental") aids understanding, and doesn't really feel like we're painting ourselves into a corner. Alternately, fromCompletionContext and fromCompletionResult would be more clearly symmetrical. | |
50 | nit: if you keep this class, call this raw() for consistency with symbolid?( | |
60 | any reason to put this in the header? | |
unittests/clangd/ExpectedTypeTest.cpp | ||
30 | This seems fine as a fixture, but I'd merge with the subclass - tests should be easy to read! | |
52 | "convertible to" is a problematic description for a couple of reasons:
As discussed offline/below, I think the best remedy here is just to drop this matcher - it's only used in one test that can now live with something much simpler. | |
108 | nit: any reason this takes Decl*s instead of strings? would be a bit terser not to wrap the args in decl() | |
111 | I think we could simplify by only testing the type encodings/equiv classes here, and relying on the function -> return type conversion happening elsewhere. | |
143 | Ooh, we should avoid folding bool with other integer types I think! You hardly ever want to pass a bool where an int is expected. (The reverse int -> bool is somewhat common, but no more than pointer -> bool... type equivalence isn't the right hammer to solve that case). | |
174 | I think this test is a bit too high-level - there are big abstractions between the test code and the code under test (which is pretty simple). I'd suggest just OpaqueType::fromCompletionResult(ASTCtx(), decl("returns_int")), OpaqueType::fromExpectedType(ASTCtx(), decl("int_"));` (If you think there's something worth testing for the pointer case, I'd do that instead rather than as well) |
clangd/ExpectedTypes.h | ||
---|---|---|
33 | As discussed offline, kept the class with an expectation that we'll use the fixed-size representation at some point. Added a comment that it can be viewed as a strong typedef to string for now. | |
43 | Done. Using fromType now. | |
60 | It uses a private constructor of the class, so it seems natural for it to be a private static function. | |
unittests/clangd/ExpectedTypeTest.cpp | ||
52 | Done. It was needed only for one test, testing it diretly now. | |
143 | Fair point, changed this. Bool requires a whole different handling anyway, e.g. I definitely want my pointers to be boosted in if conditions. | |
174 | Done. There is still a helper variable per case (I think it improves the readability a little), but otherwise the test is more straightforward now. |
We might want to formalize what "convertible" means here. E.g. does it cover conversion between base and derived class? Does it cover double <-> int conversion?