This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add support for semantic token type "operator"
ClosedPublic

Authored by ckandeler on Oct 24 2022, 5:53 AM.

Details

Summary

Also add new modifier for differentiating between built-in and user-
provided operators.

Diff Detail

Event Timeline

ckandeler created this revision.Oct 24 2022, 5:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 24 2022, 5:53 AM
ckandeler requested review of this revision.Oct 24 2022, 5:53 AM
nridge added a subscriber: nridge.Oct 29 2022, 5:48 PM

For reference, earlier work along these lines which never landed: https://reviews.llvm.org/D119077

A couple of high-level thoughts on this:

  1. Based on the discussion in https://github.com/clangd/clangd/issues/1115, I believe highlighting of built-in operators should be out of scope for semantic highlighting, at least in the default mode; client-side highlighting should be sufficient for these, similar to strings and literals.
  2. An alternative to assigning (user-provided) operators a new token kind would be to assign them the same token kind as the entity they invoke (i.e. function or method). Both approaches have their advantages:
    • If we use the function/method kinds, then uses of user-provided operators will be highlighted differently from built-in operators even when using a default / standard theme that doesn't know about clangd-specific token types.
    • If we use a dedicated operator kind, users can configure different styles for operators vs. function/methods (and they may want different styles given that syntactically the two look quite different).

      One way to get the best of both worlds could be to use the function/method kinds in combination with an operator modifier. That would color overloaded operators out of the box while also allowing users to customize the style based on the presence of the modifier. What do you think about this approach?

A couple of high-level thoughts on this:

  1. Based on the discussion in https://github.com/clangd/clangd/issues/1115, I believe highlighting of built-in operators should be out of scope for semantic highlighting, at least in the default mode; client-side highlighting should be sufficient for these, similar to strings and literals.
  2. An alternative to assigning (user-provided) operators a new token kind would be to assign them the same token kind as the entity they invoke (i.e. function or method). Both approaches have their advantages:
    • If we use the function/method kinds, then uses of user-provided operators will be highlighted differently from built-in operators even when using a default / standard theme that doesn't know about clangd-specific token types.
    • If we use a dedicated operator kind, users can configure different styles for operators vs. function/methods (and they may want different styles given that syntactically the two look quite different).

      One way to get the best of both worlds could be to use the function/method kinds in combination with an operator modifier. That would color overloaded operators out of the box while also allowing users to customize the style based on the presence of the modifier. What do you think about this approach?

The problem with "operator" as a modifier is that if and when the augmentsSyntaxTokens feature is implemented, we'll need an operator token kind anyway (and having both seems weird).
The modifier semantics could also be switched around, btw, so that instead of "userProvided" we could have "builtIn" (which might also apply to other token kinds). I don't have a strong opinion on this.
Should I perhaps add the augmentsSyntaxTokens option and rebase this patch to make it its first user?

I believe highlighting of built-in operators should be out of scope for semantic highlighting, at least in the default mode; client-side highlighting should be sufficient for these, similar to strings and literals.

I'm not sure if "built-in" (as opposed to overloaded) is the right distinction here: whether a == b can be highlighted client-side seems unrelated to whether the type is int or string.
But I agree we only need to provide this (by default) if it's something that can't be determined by lexing.

An (IMO) useful distinction that can't be found by the lexer is the use of * as a declarator (int *x) vs an operator (return *x). These are both potentially "built-in". https://github.com/helix-editor/helix/pull/4278 has a fuller example. (In practice, clients are going to highlight declarator-stars as operators on the client-side, so to make this work we have to highlight them as something else rather than just mark operators. But this patch looks like the right direction).

  1. An alternative to assigning (user-provided) operators a new token kind would be to assign them the same token kind as the entity they invoke (i.e. function or method). Both approaches have their advantages:
    • If we use the function/method kinds, then uses of user-provided operators will be highlighted differently from built-in operators even when using a default / standard theme that doesn't know about clangd-specific token types.
    • If we use a dedicated operator kind, users can configure different styles for operators vs. function/methods (and they may want different styles given that syntactically the two look quite different).

      One way to get the best of both worlds could be to use the function/method kinds in combination with an operator modifier. That would color overloaded operators out of the box while also allowing users to customize the style based on the presence of the modifier. What do you think about this approach?

I think since LSP specifies an operator SemanticTokenType, we should use it unless there's a *really* strong reason not to.
A well-behaved editor will provide a themeable color for (client-side) highlighting of operators, and also map LSP's operator SemanticTokenType onto that color by default. If we decide to do our own different thing, we'll break that happy path.

clang-tools-extra/clangd/SemanticHighlighting.h
78

Can you give some background here or on the bug tracker about what kind of distinction you're trying to draw here and why it's important?
(Most clients are never going to benefit from nonstandard modifiers so they should be pretty compelling)

78

as well as being jargony, "user-provided" has a specific technical meaning that I don't think you intend here. For example, auto operator<=>(const S&) const = default is not user-provided (defaulted on first declaration). https://eel.is/c++draft/dcl.fct.def.default#5

If we need this and can't get away with reusing defaultLibrary (which would include std::) then maybe we should add something like builtin which seems quite reusable.

78

Since we often can't say whether an operator is user-provided or not (in templates), we should consider what we want the highlighting to be in these cases.
(If templates should be highlighted as built-in, we should prefer a modifier like UserProvided, if they should be highlighted as user-provided, we should prefer a modifier like Builtin)

ckandeler added inline comments.Nov 21 2022, 1:52 AM
clang-tools-extra/clangd/SemanticHighlighting.h
78

as well as being jargony, "user-provided" has a specific technical meaning that I don't think you intend here. For example, auto operator<=>(const S&) const = default is not user-provided (defaulted on first declaration). https://eel.is/c++draft/dcl.fct.def.default#5

If we need this and can't get away with reusing defaultLibrary (which would include std::) then maybe we should add something like builtin which seems quite reusable.

I use "userProvided" here in the sense of "not built-in" or "overloaded". I do not cling to the term, and have also suggested the opposite default of using "builtin" in another comment.

78

Since we often can't say whether an operator is user-provided or not (in templates), we should consider what we want the highlighting to be in these cases.

True, I have not considered this.

(If templates should be highlighted as built-in, we should prefer a modifier like UserProvided, if they should be highlighted as user-provided, we should prefer a modifier like Builtin)

Intuitively, it seems we should be conservative and not claim the operator is overloaded unless we know it is. So "built-in" might then mean "we can't prove it's not a built-in". It's probably not worth to introduce a modifier CouldBeEitherWay just to explicitly express ambiguity ;)

I think since LSP specifies an operator SemanticTokenType, we should use it unless there's a *really* strong reason not to.

My bad, I completely overlooked that LSP has a standard token type for this! (I think maybe because I assumed D119077 would have used it had it existed.) In light of that, please disregard my comment about making operator a modifier.

nridge added inline comments.Nov 21 2022, 2:08 PM
clang-tools-extra/clangd/SemanticHighlighting.h
78

Since we often can't say whether an operator is user-provided or not (in templates), we should consider what we want the highlighting to be in these cases.
(If templates should be highlighted as built-in, we should prefer a modifier like UserProvided, if they should be highlighted as user-provided, we should prefer a modifier like Builtin)

In my mind, "go-to-definition on this operator symbol will take me to a function declaration/definition" is a good match for "I want this colored differently". (Which would imply treating dependent operator calls where we can't figure out an overloaded operator target even heuristically, as "built-in".)

ckandeler updated this revision to Diff 477178.Nov 22 2022, 6:49 AM

Make highlighting of built-in operators conditional

Should I perhaps add the augmentsSyntaxTokens option and rebase this patch to make it its first user?

So I went ahead and did this. Opinions?

nridge added a comment.EditedNov 25 2022, 1:19 AM

An (IMO) useful distinction that can't be found by the lexer is the use of * as a declarator (int *x) vs an operator (return *x).

I failed to appreciate the implication of this the first time I read it: this is in fact a semantic distinction. Given A * B;, the * could be a declarator or an operator depending on whether A resolves to a type name or a variable name; at the lexer level, it's just tok::star in both cases.

The current patch will produce an operator token in the operator case but not the declarator case, thereby achieving an effect that client-side highlighting couldn't. As such, I guess it does make sense for this to be a semantic token (i.e. produced even with with augmentsSyntaxTokens=true) even in the built-in case?

(I envisioned the implementation of augmentsSyntaxTokens=false to be "loop over the lexer tokens and turn a subset of them into additional tokens to return over LSP". The fact that that would have meant producing an "operator" token even for tok::star which is actually a declarator, is what clued me in to the fact that "operators" are a semantic category.)

The current patch will produce an operator token in the operator case but not the declarator case, thereby achieving an effect that client-side highlighting couldn't. As such, I guess it does make sense for this to be a semantic token (i.e. produced even with with augmentsSyntaxTokens=true) even in the built-in case?

You mean just for that one case or in general?

Regarding AugmentsSyntaxTokens:

  • in general this seems like a nice thing to support, though I'm not aware of "big" !AugmentsSyntaxTokens clients so it's not urgent
  • I think the right design for that is to (conditionally) run the lexer and mark only tokens that haven't been marked yet. If the lexer doesn't provide enough info, then it's maybe not "syntactic".
  • I think it's OK to "accidentally" produce some "syntactic" tokens in semantic-only mode - we shouldn't go out of our way to avoid doing so.

Since &, &&, * aren't always operators, and this is not syntactically distinguishable, I think we should just always emit all operator tokens, even in the standard AugmentsSyntaxTokens mode, and even for tokens that can only be operators. (In practice many other "obvious" operators aren't always either: ~ (destructor), ^ (objc blocks), = (lambda capture), # (directive vs pasting), +- (objc method specifiers)...)

This means AugmentSyntaxTokens doesn't belong in this patch, I think - as Nathan said we want to emit these tokens regardless. (Really sorry for the churn...)

clang-tools-extra/clangd/SemanticHighlighting.h
78

Can you give some background here or on the bug tracker about what kind of distinction you're trying to draw here and why it's important?
(Most clients are never going to benefit from nonstandard modifiers so they should be pretty compelling)

This was one of the biggest questions I had about this patch - just hoping it doesn't get missed.

78

Intuitively, it seems we should be conservative and not claim the operator is overloaded unless we know it is.

This feels a bit circular, if we agree we're not going to introduce a CouldBeEitherWay then why is "built-in" a more conservative claim than "overloaded"?

I'm inclined towards builtin as a modifier because I think for language entities as a whole (types, functions etc, not just operators) it's the exception. It also seems easier to name and define.

In my mind, "go-to-definition on this operator symbol will take me to a function declaration/definition" is a good match for "I want this colored differently".

This mostly makes sense to me, but:

  • I don't think we should actually run all the heuristics logic
  • if there's probably a definition available but we can't resolve it due to templates, I'd still like to know something's up

I think my internal question is more like "is this a trivial arithmetic shift, or something potentially complicated"? And I think depending on template resolution is "potentially complicated". (Maybe trivial in the end, but so might be an overloaded operator)

ckandeler updated this revision to Diff 477919.Nov 25 2022, 4:41 AM

Added template example.

clang-tools-extra/clangd/SemanticHighlighting.h
78

Intuitively, it seems we should be conservative and not claim the operator is overloaded unless we know it is.

This feels a bit circular, if we agree we're not going to introduce a CouldBeEitherWay then why is "built-in" a more conservative claim than "overloaded"?

I'm inclined towards builtin as a modifier because I think for language entities as a whole (types, functions etc, not just operators) it's the exception. It also seems easier to name and define.

In my mind, "go-to-definition on this operator symbol will take me to a function declaration/definition" is a good match for "I want this colored differently".

This mostly makes sense to me, but:

  • I don't think we should actually run all the heuristics logic
  • if there's probably a definition available but we can't resolve it due to templates, I'd still like to know something's up

I think my internal question is more like "is this a trivial arithmetic shift, or something potentially complicated"? And I think depending on template resolution is "potentially complicated". (Maybe trivial in the end, but so might be an overloaded operator)

The documentation for BinaryOperator says:
/ Within a C++ template, whether BinaryOperator or CXXOperatorCallExpr is
/ used to store an expression "x + y" depends on the subexpressions
/ for x and y. If neither x or y is type-dependent, and the "+"
/ operator resolves to a built-in operation, BinaryOperator will be
/ used to express the computation (x and y may still be
/ value-dependent). If either x or y is type-dependent, or if the
/ "+" resolves to an overloaded operator, CXXOperatorCallExpr will
/ be used to express the computation.
With the patch as-is (possibly with the UserProvided/BuiltIn switch) , this should result exactly in what you want (if I understand you correctly). However, it does not match my observation: I always see the normal operator types. Am I missing something or is the documentation wrong?

sammccall added inline comments.Nov 25 2022, 4:55 AM
clang-tools-extra/clangd/SemanticHighlighting.h
78

Can you reduce this to an example where clang -fsyntax-only -Xclang -ast-dump doesn't print what you expect?

(if this isn't possible, then it might be a bug in the patch)

ckandeler added inline comments.Nov 25 2022, 5:07 AM
clang-tools-extra/clangd/SemanticHighlighting.h
78

$ cat test.cpp
template<typename T> class MyTemplate {

void run(typename T::A t1, typename T::A t2) { return t1 == t2; }

}
$ clang -fsyntax-only -Xclang -ast-dump test.cpp

TranslationUnitDecl 0x55986dab0dd8 <<invalid sloc>> <invalid sloc>

-TypedefDecl 0x55986dab1640 <<invalid sloc>> <invalid sloc> implicit int128_t 'int128'
`-BuiltinType 0x55986dab13a0 '__int128'
-TypedefDecl 0x55986dab16b0 <<invalid sloc>> <invalid sloc> implicit uint128_t 'unsigned int128'
`-BuiltinType 0x55986dab13c0 'unsigned __int128'
-TypedefDecl 0x55986dab1a28 <<invalid sloc>> <invalid sloc> implicit NSConstantString 'NSConstantString_tag'
`-RecordType 0x55986dab17a0 '__NSConstantString_tag'
`-CXXRecord 0x55986dab1708 '__NSConstantString_tag'
-TypedefDecl 0x55986dab1ac0 <<invalid sloc>> <invalid sloc> implicit __builtin_ms_va_list 'char *'
`-PointerType 0x55986dab1a80 'char *'
`-BuiltinType 0x55986dab0e80 'char'
-TypedefDecl 0x55986daf65a8 <<invalid sloc>> <invalid sloc> implicit builtin_va_list 'va_list_tag[1]'
`-ConstantArrayType 0x55986daf6550 '__va_list_tag[1]' 1
`-RecordType 0x55986dab1bb0 '__va_list_tag'
`-CXXRecord 0x55986dab1b18 '__va_list_tag'

`-ClassTemplateDecl 0x55986daf6750 <test.cpp:1:1, line:3:1> line:1:28 MyTemplate

|-TemplateTypeParmDecl 0x55986daf6600 <col:10, col:19> col:19 typename depth 0 index 0 T
`-CXXRecordDecl 0x55986daf66c0 <col:22, line:3:1> line:1:28 class MyTemplate definition
  |-DefinitionData empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
  | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
  | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveConstructor exists simple trivial needs_implicit
  | |-CopyAssignment simple trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveAssignment exists simple trivial needs_implicit
  | `-Destructor simple irrelevant trivial needs_implicit
  |-CXXRecordDecl 0x55986daf6990 <col:22, col:28> col:28 implicit class MyTemplate
  `-CXXMethodDecl 0x55986daf6d18 <line:2:5, col:73> col:10 compare 'bool (typename T::A, typename T::A)'
    |-ParmVarDecl 0x55986daf6b10 <col:18, col:32> col:32 referenced t1 'typename T::A'
    |-ParmVarDecl 0x55986daf6bd0 <col:36, col:50> col:50 referenced t2 'typename T::A'
    `-CompoundStmt 0x55986daf6e50 <col:54, col:73>
      `-ReturnStmt 0x55986daf6e40 <col:56, col:69>
        `-BinaryOperator 0x55986daf6e20 <col:63, col:69> '<dependent type>' '=='
          |-DeclRefExpr 0x55986daf6de0 <col:63> 'typename T::A' lvalue ParmVar 0x55986daf6b10 't1' 'typename T::A'
          `-DeclRefExpr 0x55986daf6e00 <col:69> 'typename T::A' lvalue ParmVar 0x55986daf6bd0 't2' 'typename T::A'

Using operators on expressions with dependent types now considered "complex",

Now seems to do what we want. Should I switch the modifier around?

For my part, I still need to understand why we want the builtin/UserModified modifier. (The operator highlight kind seems obvious to me).

From earlier comments:

Can you give some background here or on the bug tracker about what kind of distinction you're trying to draw here and why it's important?
(Most clients are never going to benefit from nonstandard modifiers so they should be pretty compelling)

This was one of the biggest questions I had about this patch - just hoping it doesn't get missed.

clang-tools-extra/clangd/SemanticHighlighting.h
78

The documentation for BinaryOperator says:

It indeed looks like the documentation is wrong :-( Based on that, your implementation looks right to me.

For my part, I still need to understand why we want the builtin/UserModified modifier. (The operator highlight kind seems obvious to me).

We make this distinction in our client. The reasoning is explained here: https://codereview.qt-project.org/c/qt-creator/qt-creator/+/220587
I think Nathan made the same point earlier, i.e. it's helpful to see that an operator is (potentially) overloaded and you could follow the symbol.

nridge added a comment.EditedDec 8 2022, 2:03 AM

For my part, I still need to understand why we want the builtin/UserModified modifier. (The operator highlight kind seems obvious to me).

We make this distinction in our client. The reasoning is explained here: https://codereview.qt-project.org/c/qt-creator/qt-creator/+/220587
I think Nathan made the same point earlier, i.e. it's helpful to see that an operator is (potentially) overloaded and you could follow the symbol.

Yeah, basically, if I'm looking at a + b, and the + is actually a syntactically disguised function call, I'd like to know about it, i.e. I'd like to be able to make my editor color that + differently than the case where a and b are e.g. ints.

I do think that there are two potential routes to get there:

  1. Only produce a semantic token in the UserProvided case. The Builtin case doesn't get an operator semantic token at all, hence there's no need for a modifier. As a user, I configure my editor to color all operator semantic tokens prominently.
  2. Produce a semantic token in both the Builtin and UserProvided cases, but with the modifier to distinguish between them (i.e. the current patch). As a user, I configure my editor to color operator tokens with the UserProvided modifier prominently.

The advantage of (2) over (1) is that built-in operators get a semantic token to distinguish them from declarators (i.e. what Sam brought up in this comment starting at "An (IMO) useful distinction that can't be found by the lexer ...").

The advantage of (1) over (2) is that the primary intended effect (overloaded operators are highlighted prominently) can be achieved with less configuration (no custom modifier).

If we care more about how things look out of the box (i.e. with a theme that provides colors only for the tokens/modifiers in the LSP spec), perhaps we should prefer (1) over (2), in spite of the declarator/operator issue?

I don't have a strong preference between these options. (But I do have a strong preference for being able to make overloaded operators look different from built-in operators, whether that difference is modifier vs. no modifier, or semantic token vs. no semantic token at all.)

While we're waiting for Sam to weigh in on the modifier question, a couple of comments about the details of the patch:

  • Does the implementation handle explicit operator calls, e.g. a.operator+(b)? If so, it would be nice to add some test coverage.
  • I see the patch handles new and delete expression but I couldn't spot any test cases for them, could you add some? (Apologies if I've overlooked it.)
ckandeler updated this revision to Diff 482055.Dec 12 2022, 3:21 AM

Added more test cases.

  • Does the implementation handle explicit operator calls, e.g. a.operator+(b)? If so, it would be nice to add some test coverage.

It does now.

sammccall accepted this revision.Dec 12 2022, 4:18 AM

Thanks for the explanation! It makes sense. And given this motivation I don't think inverting to "builtin" is a good idea.
The impact is a bit limited because it's not a standard modifier, but it doesn't add much complexity, so LGTM.

I think the naming is an issue: "user-provided" is an unusual term with a very specific and unusual meaning for functions. I'd prefer the more generic "user-defined", if you don't mind changing it. (Also not perfect, but I can't think of a really good name).

Something related that might be interesting (and visible in more editors): highlighting the non-operator uses of * or other tokens. (This patch largely won't affect them, as they'll probably be client-side highlighted as operators if we don't say anything). Filed https://github.com/clangd/clangd/issues/1421. (Unfortunately I'm not really able to do clangd feature work these days)

This revision is now accepted and ready to land.Dec 12 2022, 4:18 AM
ckandeler updated this revision to Diff 482103.Dec 12 2022, 6:37 AM

Renamed modifier.

This revision was automatically updated to reflect the committed changes.

Thanks, Christian, for your work on this!

hokein added a subscriber: hokein.Dec 13 2022, 3:22 AM

This patch causes a clangd crash on dependent code (e.g. running on ASTMatchers.h), fixed in https://github.com/llvm/llvm-project/commit/bcb457c68e20120f0bdd0a59e4b4ce90b8121310.

This patch causes a clangd crash on dependent code (e.g. running on ASTMatchers.h), fixed in https://github.com/llvm/llvm-project/commit/bcb457c68e20120f0bdd0a59e4b4ce90b8121310.

Thanks.