Also add new modifier for differentiating between built-in and user-
provided operators.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
For reference, earlier work along these lines which never landed: https://reviews.llvm.org/D119077
A couple of high-level thoughts on this:
- Based on the discussion in https://github.com/clangd/clangd/issues/1115, I believe highlighting of built-in operators should be out of scope for semantic highlighting, at least in the default mode; client-side highlighting should be sufficient for these, similar to strings and literals.
- An alternative to assigning (user-provided) operators a new token kind would be to assign them the same token kind as the entity they invoke (i.e. function or method). Both approaches have their advantages:
- If we use the function/method kinds, then uses of user-provided operators will be highlighted differently from built-in operators even when using a default / standard theme that doesn't know about clangd-specific token types.
- If we use a dedicated operator kind, users can configure different styles for operators vs. function/methods (and they may want different styles given that syntactically the two look quite different).
One way to get the best of both worlds could be to use the function/method kinds in combination with an operator modifier. That would color overloaded operators out of the box while also allowing users to customize the style based on the presence of the modifier. What do you think about this approach?
The problem with "operator" as a modifier is that if and when the augmentsSyntaxTokens feature is implemented, we'll need an operator token kind anyway (and having both seems weird).
The modifier semantics could also be switched around, btw, so that instead of "userProvided" we could have "builtIn" (which might also apply to other token kinds). I don't have a strong opinion on this.
Should I perhaps add the augmentsSyntaxTokens option and rebase this patch to make it its first user?
I'm not sure if "built-in" (as opposed to overloaded) is the right distinction here: whether a == b can be highlighted client-side seems unrelated to whether the type is int or string.
But I agree we only need to provide this (by default) if it's something that can't be determined by lexing.
An (IMO) useful distinction that can't be found by the lexer is the use of * as a declarator (int *x) vs an operator (return *x). These are both potentially "built-in". https://github.com/helix-editor/helix/pull/4278 has a fuller example. (In practice, clients are going to highlight declarator-stars as operators on the client-side, so to make this work we have to highlight them as something else rather than just mark operators. But this patch looks like the right direction).
- An alternative to assigning (user-provided) operators a new token kind would be to assign them the same token kind as the entity they invoke (i.e. function or method). Both approaches have their advantages:
- If we use the function/method kinds, then uses of user-provided operators will be highlighted differently from built-in operators even when using a default / standard theme that doesn't know about clangd-specific token types.
- If we use a dedicated operator kind, users can configure different styles for operators vs. function/methods (and they may want different styles given that syntactically the two look quite different).
One way to get the best of both worlds could be to use the function/method kinds in combination with an operator modifier. That would color overloaded operators out of the box while also allowing users to customize the style based on the presence of the modifier. What do you think about this approach?
I think since LSP specifies an operator SemanticTokenType, we should use it unless there's a *really* strong reason not to.
A well-behaved editor will provide a themeable color for (client-side) highlighting of operators, and also map LSP's operator SemanticTokenType onto that color by default. If we decide to do our own different thing, we'll break that happy path.
clang-tools-extra/clangd/SemanticHighlighting.h | ||
---|---|---|
78 | Can you give some background here or on the bug tracker about what kind of distinction you're trying to draw here and why it's important? | |
78 | as well as being jargony, "user-provided" has a specific technical meaning that I don't think you intend here. For example, auto operator<=>(const S&) const = default is not user-provided (defaulted on first declaration). https://eel.is/c++draft/dcl.fct.def.default#5 If we need this and can't get away with reusing defaultLibrary (which would include std::) then maybe we should add something like builtin which seems quite reusable. | |
78 | Since we often can't say whether an operator is user-provided or not (in templates), we should consider what we want the highlighting to be in these cases. |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
---|---|---|
78 |
I use "userProvided" here in the sense of "not built-in" or "overloaded". I do not cling to the term, and have also suggested the opposite default of using "builtin" in another comment. | |
78 |
True, I have not considered this.
Intuitively, it seems we should be conservative and not claim the operator is overloaded unless we know it is. So "built-in" might then mean "we can't prove it's not a built-in". It's probably not worth to introduce a modifier CouldBeEitherWay just to explicitly express ambiguity ;) |
My bad, I completely overlooked that LSP has a standard token type for this! (I think maybe because I assumed D119077 would have used it had it existed.) In light of that, please disregard my comment about making operator a modifier.
clang-tools-extra/clangd/SemanticHighlighting.h | ||
---|---|---|
78 |
In my mind, "go-to-definition on this operator symbol will take me to a function declaration/definition" is a good match for "I want this colored differently". (Which would imply treating dependent operator calls where we can't figure out an overloaded operator target even heuristically, as "built-in".) |
Should I perhaps add the augmentsSyntaxTokens option and rebase this patch to make it its first user?
So I went ahead and did this. Opinions?
I failed to appreciate the implication of this the first time I read it: this is in fact a semantic distinction. Given A * B;, the * could be a declarator or an operator depending on whether A resolves to a type name or a variable name; at the lexer level, it's just tok::star in both cases.
The current patch will produce an operator token in the operator case but not the declarator case, thereby achieving an effect that client-side highlighting couldn't. As such, I guess it does make sense for this to be a semantic token (i.e. produced even with with augmentsSyntaxTokens=true) even in the built-in case?
(I envisioned the implementation of augmentsSyntaxTokens=false to be "loop over the lexer tokens and turn a subset of them into additional tokens to return over LSP". The fact that that would have meant producing an "operator" token even for tok::star which is actually a declarator, is what clued me in to the fact that "operators" are a semantic category.)
Regarding AugmentsSyntaxTokens:
- in general this seems like a nice thing to support, though I'm not aware of "big" !AugmentsSyntaxTokens clients so it's not urgent
- I think the right design for that is to (conditionally) run the lexer and mark only tokens that haven't been marked yet. If the lexer doesn't provide enough info, then it's maybe not "syntactic".
- I think it's OK to "accidentally" produce some "syntactic" tokens in semantic-only mode - we shouldn't go out of our way to avoid doing so.
Since &, &&, * aren't always operators, and this is not syntactically distinguishable, I think we should just always emit all operator tokens, even in the standard AugmentsSyntaxTokens mode, and even for tokens that can only be operators. (In practice many other "obvious" operators aren't always either: ~ (destructor), ^ (objc blocks), = (lambda capture), # (directive vs pasting), +- (objc method specifiers)...)
This means AugmentSyntaxTokens doesn't belong in this patch, I think - as Nathan said we want to emit these tokens regardless. (Really sorry for the churn...)
clang-tools-extra/clangd/SemanticHighlighting.h | ||
---|---|---|
78 |
This was one of the biggest questions I had about this patch - just hoping it doesn't get missed. | |
78 |
This feels a bit circular, if we agree we're not going to introduce a CouldBeEitherWay then why is "built-in" a more conservative claim than "overloaded"? I'm inclined towards builtin as a modifier because I think for language entities as a whole (types, functions etc, not just operators) it's the exception. It also seems easier to name and define.
This mostly makes sense to me, but:
I think my internal question is more like "is this a trivial arithmetic shift, or something potentially complicated"? And I think depending on template resolution is "potentially complicated". (Maybe trivial in the end, but so might be an overloaded operator) |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
---|---|---|
78 |
The documentation for BinaryOperator says: |
clang-tools-extra/clangd/SemanticHighlighting.h | ||
---|---|---|
78 | Can you reduce this to an example where clang -fsyntax-only -Xclang -ast-dump doesn't print what you expect? (if this isn't possible, then it might be a bug in the patch) |
clang-tools-extra/clangd/SemanticHighlighting.h | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
78 | $ cat test.cpp void run(typename T::A t1, typename T::A t2) { return t1 == t2; } } TranslationUnitDecl 0x55986dab0dd8 <<invalid sloc>> <invalid sloc>
`-ClassTemplateDecl 0x55986daf6750 <test.cpp:1:1, line:3:1> line:1:28 MyTemplate |-TemplateTypeParmDecl 0x55986daf6600 <col:10, col:19> col:19 typename depth 0 index 0 T `-CXXRecordDecl 0x55986daf66c0 <col:22, line:3:1> line:1:28 class MyTemplate definition |-DefinitionData empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param | |-MoveConstructor exists simple trivial needs_implicit | |-CopyAssignment simple trivial has_const_param needs_implicit implicit_has_const_param | |-MoveAssignment exists simple trivial needs_implicit | `-Destructor simple irrelevant trivial needs_implicit |-CXXRecordDecl 0x55986daf6990 <col:22, col:28> col:28 implicit class MyTemplate `-CXXMethodDecl 0x55986daf6d18 <line:2:5, col:73> col:10 compare 'bool (typename T::A, typename T::A)' |-ParmVarDecl 0x55986daf6b10 <col:18, col:32> col:32 referenced t1 'typename T::A' |-ParmVarDecl 0x55986daf6bd0 <col:36, col:50> col:50 referenced t2 'typename T::A' `-CompoundStmt 0x55986daf6e50 <col:54, col:73> `-ReturnStmt 0x55986daf6e40 <col:56, col:69> `-BinaryOperator 0x55986daf6e20 <col:63, col:69> '<dependent type>' '==' |-DeclRefExpr 0x55986daf6de0 <col:63> 'typename T::A' lvalue ParmVar 0x55986daf6b10 't1' 'typename T::A' `-DeclRefExpr 0x55986daf6e00 <col:69> 'typename T::A' lvalue ParmVar 0x55986daf6bd0 't2' 'typename T::A' |
For my part, I still need to understand why we want the builtin/UserModified modifier. (The operator highlight kind seems obvious to me).
From earlier comments:
Can you give some background here or on the bug tracker about what kind of distinction you're trying to draw here and why it's important?
(Most clients are never going to benefit from nonstandard modifiers so they should be pretty compelling)This was one of the biggest questions I had about this patch - just hoping it doesn't get missed.
clang-tools-extra/clangd/SemanticHighlighting.h | ||
---|---|---|
78 |
It indeed looks like the documentation is wrong :-( Based on that, your implementation looks right to me. |
We make this distinction in our client. The reasoning is explained here: https://codereview.qt-project.org/c/qt-creator/qt-creator/+/220587
I think Nathan made the same point earlier, i.e. it's helpful to see that an operator is (potentially) overloaded and you could follow the symbol.
Yeah, basically, if I'm looking at a + b, and the + is actually a syntactically disguised function call, I'd like to know about it, i.e. I'd like to be able to make my editor color that + differently than the case where a and b are e.g. ints.
I do think that there are two potential routes to get there:
- Only produce a semantic token in the UserProvided case. The Builtin case doesn't get an operator semantic token at all, hence there's no need for a modifier. As a user, I configure my editor to color all operator semantic tokens prominently.
- Produce a semantic token in both the Builtin and UserProvided cases, but with the modifier to distinguish between them (i.e. the current patch). As a user, I configure my editor to color operator tokens with the UserProvided modifier prominently.
The advantage of (2) over (1) is that built-in operators get a semantic token to distinguish them from declarators (i.e. what Sam brought up in this comment starting at "An (IMO) useful distinction that can't be found by the lexer ...").
The advantage of (1) over (2) is that the primary intended effect (overloaded operators are highlighted prominently) can be achieved with less configuration (no custom modifier).
If we care more about how things look out of the box (i.e. with a theme that provides colors only for the tokens/modifiers in the LSP spec), perhaps we should prefer (1) over (2), in spite of the declarator/operator issue?
I don't have a strong preference between these options. (But I do have a strong preference for being able to make overloaded operators look different from built-in operators, whether that difference is modifier vs. no modifier, or semantic token vs. no semantic token at all.)
While we're waiting for Sam to weigh in on the modifier question, a couple of comments about the details of the patch:
- Does the implementation handle explicit operator calls, e.g. a.operator+(b)? If so, it would be nice to add some test coverage.
- I see the patch handles new and delete expression but I couldn't spot any test cases for them, could you add some? (Apologies if I've overlooked it.)
Thanks for the explanation! It makes sense. And given this motivation I don't think inverting to "builtin" is a good idea.
The impact is a bit limited because it's not a standard modifier, but it doesn't add much complexity, so LGTM.
I think the naming is an issue: "user-provided" is an unusual term with a very specific and unusual meaning for functions. I'd prefer the more generic "user-defined", if you don't mind changing it. (Also not perfect, but I can't think of a really good name).
Something related that might be interesting (and visible in more editors): highlighting the non-operator uses of * or other tokens. (This patch largely won't affect them, as they'll probably be client-side highlighted as operators if we don't say anything). Filed https://github.com/clangd/clangd/issues/1421. (Unfortunately I'm not really able to do clangd feature work these days)
This patch causes a clangd crash on dependent code (e.g. running on ASTMatchers.h), fixed in https://github.com/llvm/llvm-project/commit/bcb457c68e20120f0bdd0a59e4b4ce90b8121310.
Can you give some background here or on the bug tracker about what kind of distinction you're trying to draw here and why it's important?
(Most clients are never going to benefit from nonstandard modifiers so they should be pretty compelling)