This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add a textual fallback for go-to-definition
ClosedPublic

Authored by nridge on Jan 16 2020, 1:32 PM.

Details

Summary

This facilitates performing go-to-definition in contexts where AST-based resolution does not work, such as comments, string literals, preprocessor disabled regions, and macro definitions, based on textual lookup in the index.

Partially fixes https://github.com/clangd/clangd/issues/241

Diff Detail

Event Timeline

nridge created this revision.Jan 16 2020, 1:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 16 2020, 1:32 PM

Unit tests: pass. 61850 tests passed, 0 failed and 781 were skipped.

clang-tidy: unknown.

clang-format: pass.

Build artifacts: diff.json, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

nridge marked an inline comment as done.Jan 21 2020, 8:40 AM
nridge added inline comments.
clang-tools-extra/clangd/XRefs.cpp
196

(There should be a return here, will fix locally.)

I've tried this out locally and it's fun! As suspected on the bug though, IMO it's far from accurate enough. Examples from clangd/Compiler.cpp:

  • it triggers on almost every word, even words that plainly don't refer to any decl like format [[lazily]], in case vlog is off. This means that e.g. (in VSCode) the underline on ctrl-hover gives no/misleading signal. It also means that missing your target now jumps you somewhere random instead of doing nothing.
  • when it works properly, the correct result usually mixed with incorrect results (e.g. createInvocationFromCommandLine sets [[DisableFree]]).
  • it doesn't work for some symbols - ones that are not indexable (e.g. RemappedFileBuffers will handle the lifetime of the [[Buffer]] pointer, gives a variety of wrong results)

So while I want to stress this is really cool, it doesn't feel reliable on any dimension: you can't trust clangd on whether the word is an actual reference, you can't trust any particular result, and you can't trust the correct result is in the set.

Some suggestions:

  • only trigger when there's *some* positive signal for the word.
    • Markup like quotes/backticks/brackets/\p
    • weird case like lowerCamel, UpperCamel, CAPS, mid-sentence Capitalization, under_scores.
    • use of the word as a token in nearby code (very close if very short, anywhere in file if longer)
    • (maybe you want to support ns::Qualifiers?)
  • post-filter aggressively - only return exact name matches (I think including case).
  • call fuzzyFind directly and set ProximityPath as well as the enclosing scopes from lexing. For extra strictness consider AnyScope=false
  • if you get more than 3 results, and none from current file, maybe don't return anything, as confidence is too low. Or try a stricter query...
  • handle the most common case of non-indexable symbols (local symbols) by running the query against the closest occurrence of the token in code.
nridge planned changes to this revision.EditedJan 21 2020, 12:32 PM

Thanks for taking a look!

  • it triggers on almost every word, even words that plainly don't refer to any decl like format [[lazily]], in case vlog is off. This means that e.g. (in VSCode) the underline on ctrl-hover gives no/misleading signal. It also means that missing your target now jumps you somewhere random instead of doing nothing.

Heh, I didn't realize VSCode had this feature. I do agree that it changes the tradeoffs a bit, as it means go-to-definition can be invoked in a context where there isn't an explicit signal from the user that they think there's a target there.

The other points you make are completely fair too. I will revise and take your suggestions into account.

I'll aim to start by factoring in enough of your suggestions to reduce the noise to an acceptable level for an initial landing, and leave some of the others for follow-up enhancements.

nridge updated this revision to Diff 247529.Mar 1 2020, 4:36 PM

Address some review comments

nridge added a comment.Mar 1 2020, 4:47 PM

I've addressed some of the review comments, with a view to getting something minimal we can land, and improve on in follow-up changes.

Mostly, I focused on the suggestions which reduce the number of results. I've left other suggestions which increase the number of results (e.g. handling non-indexed symbols) for follow-ups.

  • only trigger when there's *some* positive signal for the word.
    • Markup like quotes/backticks/brackets/\p
    • weird case like lowerCamel, UpperCamel, CAPS, mid-sentence Capitalization, under_scores.
    • use of the word as a token in nearby code (very close if very short, anywhere in file if longer)
    • (maybe you want to support ns::Qualifiers?)

I currently handle lowerCamel, UpperCamel, CAPS, and under_scores. I've left the others as follow-ups.

  • post-filter aggressively - only return exact name matches (I think including case).

Done.

  • call fuzzyFind directly and set ProximityPath

Done.

as well as the enclosing scopes from lexing. For extra strictness consider AnyScope=false

I haven't done this yet, do you think it's important for an initial landing?

If so, could you mention what API you had in mind for determining "enclosing scopes from lexing"?

I had in mind using something like SelectionTree and collecting any RecordDecls or NamespaceDecls on the path from the common ancestor to the TU, but that's technically not "from lexing", so perhaps you have something else in mind.

  • if you get more than 3 results, and none from current file, maybe don't return anything, as confidence is too low. Or try a stricter query...

I implemented this, but my testing shows this causes a lot of results for class names to be excluded. The reason appears to be that fuzzyFind() returns the class and each of its constructors as distinct results, so if a class has more than two constructors, we'll have more than 3 results (and typically the class is declared in a different file).

Should we try to handle this case specifically (collapse a class name and its construtors to a single result), or should we reconsider this filtering criterion? It's not exactly clear to me what sort of bad behaviour it's intended to weed out.

  • handle the most common case of non-indexable symbols (local symbols) by running the query against the closest occurrence of the token in code.

I've left this as a follow-up.

Thanks! The scope looks good to me now, on to implementation details.
I'm being a bit picky on the behaivor because go-to-def is a heavily-used feature, many users won't be expecting what we're doing here, and we can't reasonably expect them to understand the failure modes.
So, let's try hard not to fail :-)

This reminds me: it's not completely obvious what set of "act on symbol under the cursor" things this should (eventually) apply to.
I think not having e.g. find-references work makes sense - user should navigate to a "real" occurrence to resolve the ambiguity, and things like code actions are right out.
However having textDocument/hover work when we have high confidence in results would be really cool.
Obviously nothing in scope for this patch, but it seems worth writing this down somewhere, precisely because we shouldn't do it now.

I currently handle lowerCamel, UpperCamel, CAPS, and under_scores. I've left the others as follow-ups.

(sorry for shifting goalposts, I think CAPS may be too broad. Left a comment inline)

  • if you get more than 3 results, and none from current file, maybe don't return anything, as confidence is too low. Or try a stricter query...

I implemented this, but my testing shows this causes a lot of results for class names to be excluded. The reason appears to be that fuzzyFind() returns the class and each of its constructors as distinct results, so if a class has more than two constructors, we'll have more than 3 results (and typically the class is declared in a different file).

I think we should just drop constructor results, they'll always have this problem.
(There are other cases but this is the biggest).

  • handle the most common case of non-indexable symbols (local symbols) by running the query against the closest occurrence of the token in code.

I've left this as a follow-up.

Makes sense. I think this there's not a lot of new complexity here, we have the major pieces (getWordAtPosition, TokenBuffer, SelectionTree, targetDecl, index) but integration is definitely substantial.

I'd suggest we go down that path before adding complexity for the indexed-based path though, because I suspect it's going to handle many of the practical situations where the index-based approach needs a lot of help (and vice-versa).

clang-tools-extra/clangd/SourceCode.cpp
313 ↗(On Diff #247529)

@kadircet is working on getting rid of this function because creating raw lexers is is wasteful and not actually very powerful. Mostly we're moving to syntax::TokenBuffer, which records actual lexed tokens, but that doesn't apply here.

The examples in the tests seem like they'd be covered by something *really* simple, like enclosing identifier chars:

unsigned Begin, End;
for (Begin = Offset; Begin > 0 && isIdentifierBody(Code[Begin-1]); --BeginEnd) {}
for (End = Offset; End < Code.size() && isIdentifierBody(Code[End]); ++End) {}
return Code.slice(Begin, End);

(Lexer::isIdentifierBodyChar requires langopts but just passes through DollarIdents to isIdentifierBody, and I don't think we care much about identifiers with $ in them.)

If we really want to do something more subtle here, we should check it in SourceCodeTests.

clang-tools-extra/clangd/SourceCode.h
93 ↗(On Diff #247529)

consider moving the isLikelyToBeIdentifier check inside. The current API is pretty general and it's not clear yet what (else) it's good for so it's nice to direct towards intended usage.

Also doing the identifier check inside this function is more convenient when it relies on markers outside the identifier range (like doxygen \p or backtick-quoted identifiers)

That said, you may still want to return the range when it's not a likely identifier, with a signature like StringRef getWordAtPosition(bool *LikelyIdentifier = nullptr). I'm thinking of the future case where the caller wants to find a nearby matching token and resolve it - resolving belongs in the caller so there's not much point having this function duplicate the check.

93 ↗(On Diff #247529)

This doesn't use the SourceManager-structure of the file, so the natural signature would be StringRef getWordAtPosition(StringRef Code, unsigned Offset).

(what are the practical cases where langopts is relevant?)

clang-tools-extra/clangd/XRefs.cpp
184

nit: mention snake_case, MACRO_CASE?

188

nit: can you mention this catches lowerCamel and UpperCamel

189

nit: prefer llvm::isUppercase to avoid locales

189

this will fire for initialisms like HTTP.
I think we want to require *both* upper and lowercase letters.

208

I think this is dead - we're just sorting by score.

212

this function should have a high-level comment describing the strategy and the limitations (e.g. idea of extending it to resolve nearby matching tokens).

A name like locateSymbolNamedTextuallyAt would better describe what this does, rather than what its caller does.

I would strongly consider exposing this function publicly for the detailed tests, and only smoke-testing it through locateSymbolAt. Having to break the AST in tests or otherwise rely on the "primary" logic not working is brittle and hard to verify.

226

FWIW the API for this is visibleNamespaces() from SourceCode.cpp.
(No enclosing classes, but I suspect we can live without them once we have a nearby-tokens solution too)

229

If we're bailing out on >3, I think this limit should be aiming to detect when there's >3, and avoid fetching way too much data, but *not* trying to avoid noise.
(I'd suggest 10 or so)

234

This seems dead, you're requiring exact matches, these will always have the same score.

236

This is an interesting signal, I think there are two sensible ways to go about it:

  • assume results in this file are more likely accurate than those in other files. In this case we should at minimum be using this in ranking, but really we should just drop all cross-file results if we have an in-file one.
  • don't rely on index for main-file cases, and rely on "find nearby matching token and resolve it instead". That can easily handled cases defined/referenced in the main-file with sufficient accuracy, including non-indexed symbols. So here we can assume this signal is always false, and drop it.
237

BTW I think the answer for constructors is just to drop all constructor results here.

(This also affects template specializations which I think we can not worry about, and virtual method hierarchies which are more painful but I also wouldn't try to fix now)

238

I'm not sure why we're using SymbolToLocation here:

  • Main file URI check: the Symbol has URIs. They need to be canonicalized to file URIs before comparison. This allows checking both decl and def location.
  • PreferredDeclaration and Definition can be more easily set directly from the Symbol
244

I wouldn't bother qualifying this as "for now". Any code is subject to change in the future, but requiring an exact name match for index-based results seems more like a design decision than a fixme.

270

I don't think this should be logged, particularly by default - it doesn't really indicate anything other than we should have a "look up symbol by name" API

(ok, actually I think this is just dead code because we've already checked name above)

clang-tools-extra/clangd/unittests/XRefsTests.cpp
599

#ifdef'd out code is another interesting motivation worth testing.

I'm playing with a prototype of the token-based approach, a couple of follow-ups from that.

I've split out functions to handle file/macro/ast from locateSymbolAt in e7de00cf974a4e30d4900518ae8473a117efbd6c - hopefully an easy merge, you're adding another one.

I think having this trigger where the identifier is an actual token in the program is a surprisingly invasive change and runs a strong risk of confusing users (who can't distinguish these textual heuristics from normal go-to-def behaviour, and rely on its accuracy), and we shouldn't do it without a lot more testing.
I think the way to implement this is to call getMacroArgExpandedLocation on the start of the "token" we found, and feed the result into TokenBuffer::expandedTokens(SourceRange). If we get an empty list back, then the parser didn't see this token and we're good to proceed without any overlap with the strict AST-based options.
This will leave comments, strings, and #ifdef'd sections should work fine, but not dependent or broken code. (Many cases of broken code can be fixed using RecoveryExpr which is finally going to land)

sammccall added inline comments.Mar 2 2020, 10:10 AM
clang-tools-extra/clangd/SourceCode.cpp
313 ↗(On Diff #247529)

Mostly we're moving to syntax::TokenBuffer, which records actual lexed tokens, but that doesn't apply here.

Oops, this isn't true - token buffer's expanded token stream has "real" tokens, but the spelled token streams use the raw lexer. You can just use spelledIdentifierTouching(), I think.

sammccall added inline comments.Mar 2 2020, 10:12 AM
clang-tools-extra/clangd/SourceCode.cpp
313 ↗(On Diff #247529)

You can just use spelledIdentifierTouching(), I think.

Sorry disregard this, obviously it doesn't work in comments etc. Need more coffee...

nridge added a comment.EditedMar 2 2020, 12:08 PM

Thanks for all the comments Sam! I'll have a detailed look tomorrow, but I wanted to follow up on this:

I think having this trigger where the identifier is an actual token in the program is a surprisingly invasive change and runs a strong risk of confusing users (who can't distinguish these textual heuristics from normal go-to-def behaviour, and rely on its accuracy), and we shouldn't do it without a lot more testing.

The "dependent code" use case is a pretty important one in my eyes.

In one of the codebases I work on, we have a fair amount of code like this:

template <typename T>
void foo(T t) {
   // ...
   t.someUniqueMethodName();
   // ...
   t.someOtherUniqueMethodName();
   // ...
}

The code is in practice only instantiated with a handful of types for T (often just two). (But we don't have a way to express this in the code at this time.) Being able to invoke go-to-definition at e.g. someUniqueMethodName and get the definition sites of the corresponding handful of methods, as opposed to nothing at all, is something I'd really like to get working. I'm open to suggestions to how we can test this better, or scope the behaviour more narrowly to avoid other unintended results for real tokens.

The "dependent code" use case is a pretty important one in my eyes.

In one of the codebases I work on, we have a fair amount of code like this:

Yep, fair enough. And I don't think that this is so bad for say DependentDeclRefExpr, where we're already doing heuristic stuff (and the user can reasonably understand that we might).
I'm more concerned that it might trigger at arbitrary times, like say on [[^noreturn]] void abort();.

But we can distinguish these cases! SelectionTree recognizes DependentDeclRefExpr and friends even if targetDecl can't resolve them. So I think we can use a whitelist: the AST part of locateSymbol reports the type of node that owned TouchedIdentifier, and if it's one of the types we want to use textual fallback for, then we go ahead with the fallback code (in addition to the cases I mentioned where the touched word doesn't turn out to be a real identifier).
We can even try to glean more info, e.g. if it's a CXXDependentScopeMemberExpr then we can filter out non-member index results.

(Tactically I think it makes sense to add the basic fallback logic, and follow up with the dependent-code entrypoints, but up to you)

nridge marked 6 inline comments as done.Mar 3 2020, 2:49 PM

I'm posting some partial responses because I have some questions (really just one, about fuzzy matching).

In general the comments seem reasonable and I plan to address all of them.

(I've marked some comments as done because I've addressed them locally. I'm not uploading a revised patch yet because it wouldn't be very interesting.)

clang-tools-extra/clangd/XRefs.cpp
212

I would strongly consider exposing this function publicly for the detailed tests, and only smoke-testing it through locateSymbolAt. Having to break the AST in tests or otherwise rely on the "primary" logic not working is brittle and hard to verify.

I was going to push back against this, but I ended up convincing myself that your suggestion is better :)

For the record, the consideration that convinced me was:

Suppose in the future we add fancier AST-based logic that handles a case like T().foo() (for example, by surveying types for which T is actually substituted, and offering foo() inside those types). If all we're testing is "navigation works for this case" rather than "navigation works for this case via the AST-based mechanism", we could regress the AST logic but have our test still pass because the testcase is simple enough that the text-based navigation fallback (that we're adding here) works as well.

238

Well the Symbol has SymbolLocations and we need protocol Locations, so we have to use something to convert them. Other places that perform such conversion use symbolToLocation(), so I reused it.

But you're right that symbolToLocation() also has some "pick the definition or the declaration" logic which is less appropriate here. I can factor out the SymbolLocation --> Location conversion logic from symbolToLocation(), and just use that here.

244

Do we want to rule out the possibility of handling typos in an identifier name in a comment (in cases where we have high confidence in the match, e.g. a long / unique name, small edit distance, only one potential match) in the future?

This is also relevant to whether we want to keep the FuzzyMatcher or not.

nridge updated this revision to Diff 248630.Mar 5 2020, 4:44 PM
nridge marked 16 inline comments as done.

Rebase onto D75479 and address most review comments

Comments remaining to be addressed:

  • revising the tests to exercise locateSymbolNamedTextuallyAt() directly
  • comments related to fuzzy matching (I have an outstanding question about that)

Handling of dependent code has been deferred to a follow-up change

nridge marked an inline comment as done.Mar 5 2020, 4:50 PM

This reminds me: it's not completely obvious what set of "act on symbol under the cursor" things this should (eventually) apply to.
I think not having e.g. find-references work makes sense - user should navigate to a "real" occurrence to resolve the ambiguity, and things like code actions are right out.
However having textDocument/hover work when we have high confidence in results would be really cool.
Obviously nothing in scope for this patch, but it seems worth writing this down somewhere, precisely because we shouldn't do it now.

Agreed. Filed https://github.com/clangd/clangd/issues/303 for hover.

(Tactically I think it makes sense to add the basic fallback logic, and follow up with the dependent-code entrypoints, but up to you)

Yep, will handle dependent code in a folow-up patch.

clang-tools-extra/clangd/SourceCode.h
93 ↗(On Diff #247529)

Now that I'm using wordTouching() from D75479, I think this comment no longer applies?

clang-tools-extra/clangd/XRefs.cpp
212

Renamed and comment added. I still need to revise the tests.

226

Thanks, that's convenient!

Out of curiosity, though: is the reason to prefer this lexer-based approach over hit-testing the query location against NamespaceDecls in the AST, mainly for performance?

228

It occured to me that I don't think we can do AnyScope=false if we want to handle dependent member cases like T().uniqueMethodName(). The members we want to find in such a case will often be both in a different file (so nearby-tokens won't handle them) and not in any visible scope.

236

Since you've implemented "find nearby matching token and resolve it", I went with the second approach.

nridge marked an inline comment as done.Mar 5 2020, 4:53 PM
nridge added inline comments.
clang-tools-extra/clangd/XRefs.cpp
489

Sorry this location-setting code is so messy. All my attempts to make it more concise have been thwarted by llvm::Expected's very restrictive API.

I'd like to sync up briefly on https://github.com/clangd/clangd/issues/241 so we know where we want to end up.

I think this is in good shape and certainly doesn't need a bigger scope, just want to be able to reason about how things will fit together.

clang-tools-extra/clangd/SourceCode.h
93 ↗(On Diff #247529)

I think the reasons still apply - D75479 doesn't need to check likelihood (it considers actual use as identifier evidence enough) so I didn't include it there, but we should eventually merge these more thoroughly I think. No need to do that until we actually want to implement different heuristics though.

clang-tools-extra/clangd/XRefs.cpp
226

Well, it was written for fallback code completion when we have no AST at all :-)

Gathering from the AST should be better, though it's not quite as simple as hit-testing (you also have to find using namespace). But this exists today, which is a feature!

244

No idea whether typo-correction is a good idea in principle - tradeoff between current false negatives and false positives+compute.

However neither FuzzyMatcher nor the existing index implementations support/can easily support real typo correction, and it seems implausible to me we'd add it for this feature.

Compare to e.g:

  • allowing case-insensitive match in some cases: fooBar vs FooBar is a plausible "typo". This is easy to implement.
  • correct the typo where we see the fixed version used as an identifier in this file (and not the original). Excludes some cases, but drives false-positives way down, and easy to implement.

I don't think we need to rule things out, but I'm uncertain enough about the approach to think that putting comments, fuzzymatcher etc here speculatively isn't worth it.

489

Ugh, don't get me started on Error/Expected :-(

I'd love to get rid of it somehow, but it seems like we'd inevitably just end up with the new thing + Error/Expected + error_code/ErrorOr + return-a-bool, and I'm not sure it'd be better.

(If you have more energy than me, I'd enthusiastically +1 an llvm-dev proposal to drop the clever checks from llvm::Error, and I know some others who would...)

nridge marked an inline comment as done.Mar 9 2020, 8:00 AM
nridge added inline comments.
clang-tools-extra/clangd/XRefs.cpp
244

Perhaps I'm unclear on the distinction between fuzzy matching and typo correction. Are they not both a matter of comparing a candidate string against a test string, and considering it a match if the they are "close enough" according to some metric (with the metric potentially being a simple edit distance in the case of typo correction)?

I've started to update the patch to be in line with the direction discussed in the issue.

@sammccall, how would you like to proceed logistically:

  • Do you plan to land (a possibly modified version of) D75479?
  • Or should I combine that patch into this one?
sammccall accepted this revision.Mar 11 2020, 8:44 AM

I've started to update the patch to be in line with the direction discussed in the issue.

@sammccall, how would you like to proceed logistically:

  • Do you plan to land (a possibly modified version of) D75479?
  • Or should I combine that patch into this one?

This patch looks good, I wouldn't bother redesigning anything, we should iterate instead.

You should go ahead, and I'll merge, and then we should work towards enabling dependent code use cases etc. SG?

clang-tools-extra/clangd/FindSymbols.h
25

nit: these names are vague and echo the type signature, maybe indexToLSPLocation?

26

nit: HintPath should be TUPath, the decision to use some other path as a TU path can only be made in the caller (needs context).
(Same is true for symbolToLocation, I'm not sure when that became public)

clang-tools-extra/clangd/XRefs.cpp
486

(The fuzzy matcher and topN are still here - I think we don't need them, right? With only up-to-3 results, std::sort seems more obvious)

488

maybe bail out early (on unusable/too many) instead of doing all the score computations first?
fuzzyFind(..., {

// bail out if it's a constructor or name doesn't match
if (Results.size() >= 3) {
  TooMany = true;
  return;
}
// add result

});

This revision is now accepted and ready to land.Mar 11 2020, 8:44 AM
nridge updated this revision to Diff 250018.Mar 12 2020, 12:12 PM
nridge marked 9 inline comments as done.
  • Remove fuzzy matching
  • Rebase to apply to head, taking only the parts from D75479 that I need for index-based lookup (such as wordTouching())
  • Revise tests so they exercise locateSymbolNamedTextuallyAt() directly, except for one smoke test
  • Add tests that verify that we do not trigger on dependent or broken code
nridge marked 2 inline comments as done.Mar 12 2020, 12:16 PM
nridge added inline comments.
clang-tools-extra/clangd/XRefs.cpp
441

Oh whoops, this assumption is another dependency on findNearbyIdentifier()

nridge edited the summary of this revision. (Show Details)Mar 12 2020, 12:23 PM
nridge updated this revision to Diff 250031.Mar 12 2020, 12:56 PM

Tweak a comment

nridge marked an inline comment as done.Mar 12 2020, 12:58 PM
nridge added inline comments.
clang-tools-extra/clangd/XRefs.cpp
441

For now, I just had it restrict to 3 results in general (even if they're in the same file).

Once findNearbyIdentifier() lands, the behaviour will automatically become what we intended.

I should mention that in my local usage, I've found the restriction on no more than 3 results (even if they're not in the current file) to be somewhat limiting. For example, a comment can easily reference the name of a function which has more than 3 overloads.

But we can start by landing this, and consider relaxing the limit (either in general, or in specific cases such as the overload set case) in follow-ups.

(Also just to clarify: while I said on Discord that I already implemented exclusion of string literals, I actually ended up deferring that part to a follow-up because it wasn't working as I expected.)

This revision was automatically updated to reflect the committed changes.