This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add semantic token for angle brackets
ClosedPublic

Authored by ckandeler on Dec 13 2022, 2:45 AM.

Details

Summary

This is needed for clients that would like to visualize matching
opening and closing angle brackets, which can be valuable in non-trivial
template declarations or instantiations.
It is not possible to do this with simple lexing, as the tokens
could also refer to operators.

Diff Detail

Event Timeline

ckandeler created this revision.Dec 13 2022, 2:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2022, 2:45 AM
ckandeler requested review of this revision.Dec 13 2022, 2:45 AM
nridge added a subscriber: nridge.Dec 18 2022, 1:02 AM
nridge added a comment.Jan 6 2023, 1:23 AM

I'm happy to review the implementation, but I would first appreciate some guidance from @sammccall or @kadircet about whether we should add these semantic token kinds in the first place.

They would be a non-standard extension to the LSP token kinds, so I think the use case for them should be fairly compelling.

It's true that there is an ambiguity between < and > as operators, vs. template arg/param list delimiters, but, at least in terms of user understanding of code, my sense is that the highlighting of the preceding token should be sufficient to disambiguate -- i.e. it would be some sort of type name in the template case, vs. a variable / literal / punctuation ending an expression in the operator case.

This is needed for clients that would like to visualize matching opening and closing angle brackets, which can be valuable in non-trivial template declarations or instantiations.

For this use case, could an editor make use of the recently added operator tokens (https://reviews.llvm.org/D136594) instead, inferring that a < token which does not have an operator semantic token is a template delimiter?

It's true that there is an ambiguity between < and > as operators, vs. template arg/param list delimiters, but, at least in terms of user understanding of code, my sense is that the highlighting of the preceding token should be sufficient to disambiguate -- i.e. it would be some sort of type name in the template case, vs. a variable / literal / punctuation ending an expression in the operator case.

We used to do this sort of heuristic in our old libclang-based implementation, and it turned out to be rather messy, with a surprising amount of exceptions having to be added.

This is needed for clients that would like to visualize matching opening and closing angle brackets, which can be valuable in non-trivial template declarations or instantiations.

For this use case, could an editor make use of the recently added operator tokens (https://reviews.llvm.org/D136594) instead, inferring that a < token which does not have an operator semantic token is a template delimiter?

I have a suspicion that this will lead to false positives for invalid code.

It's true that there is an ambiguity between < and > as operators, vs. template arg/param list delimiters, but, at least in terms of user understanding of code, my sense is that the highlighting of the preceding token should be sufficient to disambiguate -- i.e. it would be some sort of type name in the template case, vs. a variable / literal / punctuation ending an expression in the operator case.

We used to do this sort of heuristic in our old libclang-based implementation, and it turned out to be rather messy, with a surprising amount of exceptions having to be added.

To clarify, I'm not suggesting that any client-side logic use this heuristic, only that it's probably good enough for a human reader to disambiguate without needing to assign angle brackets and comparison operators different colors.

This is needed for clients that would like to visualize matching opening and closing angle brackets, which can be valuable in non-trivial template declarations or instantiations.

For this use case, could an editor make use of the recently added operator tokens (https://reviews.llvm.org/D136594) instead, inferring that a < token which does not have an operator semantic token is a template delimiter?

I have a suspicion that this will lead to false positives for invalid code.

Ah, interesting. I guess in the case of invalid code (no/malformed AST) you wouldn't get AngleBracket tokens either, so this gives you a three-way signal (Operator vs. AngleBracket vs. no semantic token) that's richer than just the two-way signal of Operator vs. no Operator.

It's true that there is an ambiguity between < and > as operators, vs. template arg/param list delimiters, but, at least in terms of user understanding of code, my sense is that the highlighting of the preceding token should be sufficient to disambiguate -- i.e. it would be some sort of type name in the template case, vs. a variable / literal / punctuation ending an expression in the operator case.

We used to do this sort of heuristic in our old libclang-based implementation, and it turned out to be rather messy, with a surprising amount of exceptions having to be added.

To clarify, I'm not suggesting that any client-side logic use this heuristic, only that it's probably good enough for a human reader to disambiguate without needing to assign angle brackets and comparison operators different colors.

Note that we use this information to *animate* the matching tokens, i.e. when the cursor is on one of them, both it and its counterpart get a special highlighting. That's why it's so important that the language server guarantees they always come in pairs.

Sorry for missing this one, apparently I've already started writing an answer but got context switching and forgot about it.

This is needed for clients that would like to visualize matching
opening and closing angle brackets, which can be valuable in non-trivial
template declarations or instantiations.

I am not sure how the clients will make use of this information to visualize "matching" angle brackets. all the opening and closing brackets are likely to be visualized in the same way (respectively), no?
This will only make sure we have a distinction between << and >> used as operators vs </> and the rest of the work to highlight matching braces needs to be specially done by the editor.
Since we've the distinction of "operator" now as @nridge pointed out, I would rather want editors to use that information to skip brackets that have the operator highlighting.
I am not sure what concrete cases you've (it would help if you could provide some examples) for the incomplete code argument, because it goes both ways, i.e. I don't think the false positives you get in operators vs template argument lists won't be any different (e.g. foo<int> x, where foo is not defined won't have right highlighting).

Note that we use this information to *animate* the matching tokens, i.e. when the cursor is on one of them, both it and its counterpart get a special highlighting. That's why it's so important that the language server guarantees they always come in pairs.

Oh I see the argument here (somehow missed comment, probably because i had a stale window lying around), this totally makes sense. but I am still feeling uneasy due to implementation complexity (it's embarrassing but dealing with >> is quite problematic, still, with little value :/) + the UX we'll have in other editors by default.
Especially as this comes as two different HighlightingKinds and they're likely to get colored differently, and having your matching brackets in different colors is quite annoying.

I feel like editors can still have a quite good behaviour with existing operator highlights, as one can build an extra layer on top to provide "matching bracket" support, and whenever the data is broken (e.g. non-matching brackets) just keep using the results from last "successful" run.

clang-tools-extra/clangd/SemanticHighlighting.cpp
409

there can be weird cases like (same goes for -1 below):

foo>\
>

can you add tests to make sure we're handling these properly (or not producing braces for them if it complicates the logic too much, i don't think there'll be many cases where they pop-up).

clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp
930

could you also add tests for >> and >>> to make sure they're handled correctly

Especially as this comes as two different HighlightingKinds and they're likely to get colored differently, and having your matching brackets in different colors is quite annoying.

We can easily check the actual character at the given position in the client, so I could just merge the two highlighting kinds.

This will only make sure we have a distinction between << and >> used as operators vs </>

Note that << and >> are not the only (or even the primary, in my mind) issue: there is an ambiguity between </> as template argument list delimiters vs. comparison operators.

For example:

template <bool>
class a {};

constexpr int b = 2;
constexpr int c = 3;

a<b>c> e;

Here, if the cursor is next to the <, a naive client-side angle-bracket matcher might highlight the first >, when the correct one to highlight would be the second one.

nridge added a comment.EditedJan 16 2023, 11:35 AM

Sorry, that was a bad example as it's not actually valid. Here's a slightly modified one:

template <bool,int>
class a {};

constexpr int b = 2;
constexpr int c = 3;
constexpr int d = 4;

a<b<c,d> e;

The parser knows the first < opens a template-argument-list because the name before it, a, resolves to a template-name, and that the second < is a comparison operator because the name before that, b, does not.

But a naive client-side matcher might incorrectly pick the second < as the opening angle-bracket to match the >.

ckandeler marked 2 inline comments as done.
ckandeler retitled this revision from [clangd] Add semantic tokens for angle brackets to [clangd] Add semantic token for angle brackets.

Added test cases, merged the two HighlightingKinds.

We can easily check the actual character at the given position in the client, so I could just merge the two highlighting kinds.

Thanks! Note that it might not be as easy at the face of templates, eg:

#define LESS <
template LESS typename T > class A {};

Note that << and >> are not the only (or even the primary, in my mind) issue: there is an ambiguity between </> as template argument list delimiters vs. comparison operators.

Right, I was trying to imply all operator uses (e.g. including comparisons).


Currently my only high level comment is renaming the new kind to just bracket, apart from that i think the idea LG. LMK if you're going to take a look at the implementation @nridge, otherwise I am happy to do that as well.

clang-tools-extra/clangd/SemanticHighlighting.h
54

actually let's name these as Bracket rather than AngleBracket, as we might want to increase the coverage further in the future (and a use case such as yours can still look at the textual tokens to match relevant brackets).

clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp
939

i don't think this is enough of a test case, we probably need a bunch other like:

foo >\
>>

foo >\
 >>

i think the firstone is going to highlight \ for example.

ckandeler updated this revision to Diff 490457.Jan 19 2023, 4:18 AM
ckandeler marked 2 inline comments as done.

Made token name more generic, added test cases.

We can easily check the actual character at the given position in the client, so I could just merge the two highlighting kinds.

Thanks! Note that it might not be as easy at the face of templates, eg:

#define LESS <
template LESS typename T > class A {};

At least this one doesn't seem to be an issue; added test case.

clang-tools-extra/clangd/SemanticHighlighting.h
54

Yes, I was going to suggest this myself.

nridge added a comment.EditedJan 24 2023, 12:37 AM

I played around with the patch a little, and found some cases where semantic tokens should be present but aren't:

template <typename T> class S {
public:
  template <typename U> class Nested;
};

// explicit specialization
// parameter list is missing semantic tokens
template <>
class S<int> {};

// partial specialization
// parameter and argument lists are missing semantic tokens
template <typename T>
class S<T*> {};

// nested template definition
// outer parameter list is missing semantic tokens
template <typename T>
template <typename U>
class S<T>::Nested {};

// nested template specialization
// both parameter lists are missing semantic tokens
template <>
template <>
class S<float>::Nested<float> {};

template <typename T> void foo();

void bar() {
  // function call with explicit template arguments
  // argument list is missing semantic tokens
  foo<int>();
}

template <typename T> constexpr int V = 42;

// variable template reference
// argument list is missing semantic tokens
constexpr int Y = V<char>;

// variable template specialization
// parameter and argument lists are missing semantic tokens
template <>
constexpr int V<int> = 5;

// variable template partial specialization
// parameter and argument lists are missing semantic tokens
template <typename T>
constexpr int V<T*> = 6;

template <typename T>
concept C = true;

// constrained template
// concept's argument list is missing semantic tokens
template <typename T> requires C<T>
class Z {};
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp
939

It's easy to overlook this, could you please add a comment similar to:

// Semantic tokens for outer argument list are deliberately omitted to
// avoid having to handle line-continuations

(feel free to reword to better express the reason)

One other thought that has occurred to me is that as we add more semantic tolens for punctuation, the test cases in GetsCorrectTokens become harder to read.

What would you think about omitting punctuation tokens in the GetsCorrectTokens test cases (both Operator and Bracket, and any future ones), similarly to the way we already omit scope modifiers like _classScope, and have another test case like GetsCorrectPunctuationTokens where we put specifically test cases that test for punctuation tokens?

ckandeler updated this revision to Diff 491718.Jan 24 2023, 5:01 AM
ckandeler marked an inline comment as done.

Added test cases from review comment and made them pass.

Thanks for the test cases!
All fixed, except:

variable template specialization
parameter and argument lists are missing semantic tokens
template <>
constexpr int V<int> = 5;

Argument list fixed. I didn't manage to make the parameter list work.

One other thought that has occurred to me is that as we add more semantic tolens for punctuation, the test cases in GetsCorrectTokens become harder to read.

What would you think about omitting punctuation tokens in the GetsCorrectTokens test cases (both Operator and Bracket, and any future ones), similarly to the way we already omit scope modifiers like _classScope, and have another test case like GetsCorrectPunctuationTokens where we put specifically test cases that test for punctuation tokens?

I suppose that's an option, but I don't have a strong opinion on this.

Thanks for the test cases!
All fixed, except:

variable template specialization
parameter and argument lists are missing semantic tokens
template <>
constexpr int V<int> = 5;

Argument list fixed. I didn't manage to make the parameter list work.

Thanks! That one remaining case seems to be a bug in the AST, I investigated it a bit and posted a fix at D142692. It should make your code work without further modification (the missing template parameter list will start showing up in VarTemplateSpecializationDecl::getNumTemplateParameterLists()).

Function calls are still missing some cases:

template <typename> void free();

template <typename T>
struct A {
  template <typename> void mem();
};

void foo() {
  A<int> a;
  a.mem<int>();  // <--
}

template<typename T>
void bar() {
    free<T>();  // <--

    A<int> a;
    a.mem<T>();  // <--

    A<T> b;
    b.template mem<T>();  // <--
}

Two more cases I found:

template <typename, typename>
concept C = true;

template <C<int> A>  // <-- (inner pair)
class B {};

This is a TypeConstraint, but RecursiveASTVisitor is lacking a Visit() method for it, so I think you'll need to override TraverseTypeConstraint() instead (example)

template <class T>
class A {
   template <class U> void foo(U a) { }
   template<> void foo(int a) { }  // <--
};

This one is ClassScopeFunctionSpecializationDecl::getTemplateArgsAsWritten()

clang-tools-extra/clangd/SemanticHighlighting.cpp
631

I would suggest moving this loop into VisitTagDecl(), as TagDecl is the base class of ClassTemplateSpecializationDecl that declares getNumTemplateParameterLists(). That way, we can be sure every derived class is handled.

648

Similarly, I would suggest moving this loop into VisitDeclaratorDecl.

(The args loop can stay here.)

nridge added inline comments.Jan 27 2023, 12:56 AM
clang-tools-extra/clangd/SemanticHighlighting.cpp
631

(Here's an example of a TagDecl which is not a ClassTemplateSpecializationDecl but has a template parameter list:

template <typename>
class A {
  enum class E;
};

template <typename T> // <--
enum class A<T>::E {X, Y, Z};

)

Function calls are still missing some cases:

I would have expected these to be handled by VisitDeclRefExpr(), but they aren't. Any idea?

I would have expected these to be handled by VisitDeclRefExpr(), but they aren't. Any idea?

There are two other expression types, DependentScopeDeclRefExpr and CXXDependentScopeMemberExpr, which are used to model decl-ref-exprs where the referent decl cannot be resolved until instantiation.

I would have expected these to be handled by VisitDeclRefExpr(), but they aren't. Any idea?

There are two other expression types, DependentScopeDeclRefExpr and CXXDependentScopeMemberExpr, which are used to model decl-ref-exprs where the referent decl cannot be resolved until instantiation.

Oh and the non-dependent a.mem<int> one is MemberExpr

I figured I might as well look through the AST API for classes with getLAngleLoc/getRAngleLoc methods.

It looks like we've got almost all of them (including the ones mentioned in recent comments) except:

  • OverloadExpr
  • DependentTemplateSpecializationTypeLoc
  • AutoTypeLoc
ckandeler updated this revision to Diff 492670.Jan 27 2023, 1:58 AM
ckandeler marked 3 inline comments as done.

Handled more cases.

I figured I might as well look through the AST API for classes with getLAngleLoc/getRAngleLoc methods.

It looks like we've got almost all of them (including the ones mentioned in recent comments) except:

  • OverloadExpr
  • DependentTemplateSpecializationTypeLoc
  • AutoTypeLoc

Adding AutoTypeLoc broke tons of tests, so I left it out for now.

ckandeler added inline comments.Jan 27 2023, 2:00 AM
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp
1026

Note that C2 does not get a token here. Maybe we can fix that along the way?

Adding AutoTypeLoc broke tons of tests, so I left it out for now.

Poked at this a bit, it looks like AutoTypeLoc.getLAngleLoc() is only safe to access if it isConstrained(). This is poor API (it would be better to return an invalid SourceLocation rather than garbage in the !isConstrained() case) but for now we can check isConstrained() at our call site.

nridge added inline comments.Jan 30 2023, 12:12 AM
clang-tools-extra/clangd/unittests/SemanticHighlightingTests.cpp
1026

That token should come via findExplicitReferences rather than CollectExtraHighlightings, so I put the fix in a separate patch: D142871

template <class T>
class A {
   template <class U> void foo(U a) { }
   template<> void foo(int a) { }  // <--
};

This one is ClassScopeFunctionSpecializationDecl::getTemplateArgsAsWritten()

I got this testcase slightly wrong, the template<> is handled via VisitDeclaratorDecl, but it can have arguments as well:

template <class T>
class A {
   template <class U> void foo(U a) { }
   template<> void foo<int>(int a) { }  // note the <int>
};

which is what requires handling ClassScopeFunctionSpecializationDecl::getTemplateArgsAsWritten()

I did find one other edge case (last one, I promise!)

template <class T> void foo(T);
template <class T> class A {
  friend void foo<>(T);  // <--
};

This one is FunctionDecl::getDependentSpecializationInfo()->getLAngleLoc()

ckandeler updated this revision to Diff 493247.Jan 30 2023, 2:18 AM

Support more cases as per review comments.

Thanks for the updates!

LMK if you're going to take a look at the implementation @nridge, otherwise I am happy to do that as well.

The implementation looks good to me now. (I admit I don't have a high level of confidence in my understanding of the SourceLocation stuff in the implementation of addAngleBracketTokens(), but it sounds like you've looked at that).

I'll leave final approval to you @kadircet in case you have any remaining comments.

kadircet added inline comments.Jan 31 2023, 5:55 AM
clang-tools-extra/clangd/SemanticHighlighting.cpp
382

i am actually having a hard time following the logic here. it feels like what you really want is map > back to its file location, possibly to a >> token, and deal with second case specially.
so maybe something like:

// RLoc might be pointing at a virtual buffer when it's part of a `>>` token.
RLoc = SM.getFileLoc(RLoc);
// Make sure token is part of the main file.
RLoc = getHighlightableSpellingToken(RLoc);
if(!RLoc.isValid())
  return;

const auto *RTok = TB.spelledTokenAt(RLoc);
// Handle `>>`. RLoc is always pointing at the right location, just change
// the end to be offset by 1.
// We'll either point at the beginning of `>>`, hence get a proper spelled
// or point in the middle of `>>` hence get no spelled tok.
if (!RTok || RTok->kind() == tok::greatergreater) {
  Position Begin = sourceLocToPosition(SourceMgr, RLoc);
  Position End = sourceLocToPosition(SourceMgr, RLoc.getLocWithOffset(1));
  addToken(*LRange, HighlightingKind::Bracket);
  addToken({Begin, End}, HighlightingKind::Bracket);
  return;
}

// Easy case, we have the `>` token directly available.
if (RTok->kind() == tok::greater) {
  if (auto RRange = getRangeForSourceLocation(RLoc)) {
    addToken(*LRange, HighlightingKind::Bracket);
    addToken(*RRange, HighlightingKind::Bracket);
  }
  return;
}

This should also make sure you're handling line continuations properly.

ckandeler updated this revision to Diff 493594.Jan 31 2023, 6:38 AM
ckandeler marked an inline comment as done.

Improved implementation as per review comment.

clang-tools-extra/clangd/SemanticHighlighting.cpp
382

Indeed, thanks.

kadircet accepted this revision.Jan 31 2023, 7:18 AM

but it sounds like you've looked at that

well i did now :P


thanks a lot to you both, let's ship it!

clang-tools-extra/clangd/SemanticHighlighting.cpp
408

nit: i'd actually keep the return; here to make sure we don't fall into handling of other cases implicitly in the future.

This revision is now accepted and ready to land.Jan 31 2023, 7:18 AM
ckandeler updated this revision to Diff 493610.Jan 31 2023, 7:27 AM

Incorporated review comment.

This revision was automatically updated to reflect the committed changes.