This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Indexing of standard library
ClosedPublic

Authored by sammccall on Dec 7 2021, 3:27 AM.

Details

Summary

This provides a nice "warm start" with all headers indexed, not just
those included so far.

The standard library is indexed after a preamble is parsed, using that
file's configuration. The result is pushed into the dynamic index.
If we later see a higher language version, we reindex it.

It's configurable as Index.StandardLibrary, off by default for now.

Based on D105177 by @kuhnel

Fixes https://github.com/clangd/clangd/issues/618

Diff Detail

Event Timeline

sammccall created this revision.Dec 7 2021, 3:27 AM
sammccall requested review of this revision.Dec 7 2021, 3:27 AM
sammccall updated this revision to Diff 392582.Dec 7 2021, 4:04 PM

remove parts split into other patches

sammccall updated this revision to Diff 392605.Dec 7 2021, 5:40 PM
sammccall retitled this revision from [clangd] WIP various stdlib indexing stuff to [clangd] Indexing of standard library.
sammccall edited the summary of this revision. (Show Details)
sammccall added a subscriber: kuhnel.

Tests, polish, comments

nridge added a subscriber: nridge.Dec 8 2021, 9:54 PM

thanks LG, mostly nits and a couple of questions

clang-tools-extra/clangd/ClangdServer.cpp
91

I suppose this should be rare hence won't bite us in practice, but might worth having a comment mentioning this creates tasks with no barriers.

clang-tools-extra/clangd/index/StdLib.cpp
45

llvm_uncreachable instead?

50

nit: this feels a little bit hard to read, what about:

if(2b) return 2b;
if(20) return 20;
...
return 98;
57

same here

93

maybe move second half of this comment into buildUmbrella ?

122

s/our our/to our/

132

drop static, and move into previous anonymous namespace ?

143

s/Name/Header/ ?

161

StandardHeaders are always in verbatim format, but are we sure if that holds for the IncludeHeader ?

I suppose that should be the case since we map known path suffixes back to a verbatim umbrella header, I just wonder how exhaustive the list is. (it might be nice to manually check no implementation detail headers falls out of cracks)

178

seems like debugging artifact, either drop or put in #ifndef NDEBUG

205

why log if we think we can't hit this case?

217

why are we doing this exactly? once we override the same file multiple times, I believe the last one takes precedence. it's definitely safer to clear the remapped files here rather than relying on that fact, but I am afraid of losing other mappings, possibly set outside of clangd's knowledge (e.g. through flags like -remap-file?)

234

what does the location refer to here? I think we should also stress the fact that even when indexing the same file, we have a good chance of seeing different symbols due to PP directives (and different std versions)

238

s/containers/include graph

257

i'd make this part of the next log

300

maybe move this check to the top

314

why do we resolve the symlinks ?

324

any reason for going with Noncached version? (clangd doesn't set one up today, but not relying on that would be nice if we don't have a particular concern here)

330

s/bust/must

clang-tools-extra/clangd/index/StdLib.h
67

maybe drop the optinal and bail out in indexing when Paths are empty ?

69

s/a built/built an/

sammccall updated this revision to Diff 419259.Mar 30 2022, 1:57 PM
sammccall marked 15 inline comments as done.

address review comments

Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2022, 1:57 PM

(sorry about the long delay, would still love to merge this)

clang-tools-extra/clangd/index/StdLib.cpp
93

I think that removes the context of why we're #including them.

161

Right, I looked at these manually and the headers (and symbols) we were dropping seemed reasonable.

Note that we're not filtering symbols in this loop, just building a list of blessed files to add to the StdLibLocation.

So we only drop symbols that:

  • aren't recognized by the indexer as being in the standard library
  • aren't in the standard library directory
  • don't share a file with anything recognized as being in the standard library
178

This is useful for debugging, and fits well with the other dlog()s in this file, I'd like to check it in.

I was going to call you paranoid, being sure this would get compiled out.
but indeed not: https://godbolt.org/z/hKWhro3Mv (gcc manages it)

Added #ifndef NDEBUG

205

Because I'm not certain, and would much rather get a user bug report with this log line in it than without!

(Assert because I'd most rather find out before release of course)

217

We map dirty buffers ourselves. Conceivably, it may be part of the standard library itself that was remapped to some dirty buffer content!

We're reusing the CompilerInvocation from building a preamble, where we remap the main-file buffer to the dirty contents. (We do this in prepareCompilerInstance, but the PPOptions are shared).
This isn't part of the "compilerinvocation-as-proxy-for-build-flags" that we're trying to index.

I don't think it's a realistic possibility that anyone would rely on -Xclang -remap-file to find the standard library (note that it's a cc1 flag, not a public one...).

234

what does the location refer to here?

It refers to the StdLibLocation Loc, made that explicit.

I think we should also stress the fact that even when indexing the same file, we have a good chance of seeing different symbols due to PP directives (and different std versions)

Different than what? Do you mean "why might different calls to indexStandardLibrary see different symbols" from the same file?

257

Can you say why? I generally like one thought per line. Scanning vertically through familiar lines, it's easy to miss something unfamiliar tacked onto the end. This message should be rare, and log lines aren't precious.

(I reordered them, which seems a bit more natural)

314

Oops, because I read the documentation of getCanonicalPath() instead of the implementation, and they diverged in https://github.com/llvm/llvm-project/commit/dd67793c0c549638cd93cad1142d324b979ba682 :-D

Ultimately we're forming URIs to lexically compare with the ones emitted by the indexer, which uses getCanonicalPath(). Today getCanonicalPath() wants a FileEntry and we don't have one, but I think there's no fundamental reason for that, I can fix it.

(I'll do it as a separate patch, for now it's just calling getCanonicalPath with the wrong arguments)

324

Because the cached version is more complicated with no benefits:

  • any entanglement between the IO of the preamble indexing and that of the translation unit that happened to trigger it seems like a complicated idea, that's worth understanding before doing
  • as you say, we don't actually install a statcache, so there's no concrete benefit
  • in fact there exists no caching implementation of FileSystemStatCache, so the idea that we might be able to implement that interface and gain benefits is *extremely* speculative
clang-tools-extra/clangd/index/StdLib.h
67

Why? This would definitely be using an empty vector as a sentinel value:

  • 2 paths -> index
  • 1 path -> index
  • 0 paths -> don't index

And it's not as if "probe for a standard library" is the main point of this function so the interpretation of the return value is obvious - that's only one of three criteria.

None seems to be a clearer way to communicate this than {}, and performance doesn't seem to be an issue here.

sammccall updated this revision to Diff 419262.Mar 30 2022, 2:07 PM

Revert StdLibLocation to realpath, document why

sammccall updated this revision to Diff 419265.Mar 30 2022, 2:13 PM

revert to previous version of realpath code

clang-tools-extra/clangd/index/StdLib.cpp
314

Actually, nevermind, the code is correct and I'd just forgotten why :-) Added a comment to StdLibLocation.

getCanonicalPath() does actually resolve symlinks and so on: it asks the FileManager for the directory entry and grabs the its "canonical name" which is just FS->getRealPath behind a cache.
So the strings are going to match the indexer after all.

It'd be possible to modify getCanonicalPath() so we can call it here, but I don't think it helps. Calling it with (path, filemanager) is inconvenient for the (many) existing callsites, so it'd have to be a new overload just for this case. And the FileManager caching we'd gain doesn't matter here.
I can still do it if you like, though.

(Also, relevant to your interests, realpath is probably taking care of case mismatches too!)

This had a "LG" comment above... want to take another pass?
(Not urgent, just checking)

kadircet accepted this revision.May 16 2022, 8:42 AM

sorry for the long turn around here, LGTM. let's ship it!

clang-tools-extra/clangd/index/StdLib.cpp
234

Different than what? Do you mean "why might different calls to indexStandardLibrary see different symbols" from the same file?

yes, i meant compared to a previous runs. but i don't think it's as relevant here. i believe i was thinking about caching indexing status across runs and using that cache to implement filefilter, so that we don't index the same file twice (as we normally do in bgindex).

257

i was rather implying to add it as a (in)complete field into the current log line you have. usually when clangd is printing lots of logs across threads it might be hard to correlate these. hence having them printed as a single log would help.

314

So the strings are going to match the indexer after all.

thanks, this makes sense.

It'd be possible to modify getCanonicalPath() so we can call it here, but I don't think it helps. Calling it with (path, filemanager) is inconvenient for the (many) existing callsites, so it'd have to be a new overload just for this case. And the FileManager caching we'd gain doesn't matter here.
I can still do it if you like, though.

No need. We can take a look at that if the logic is likely to change (or get more complicated) in the future.

clang-tools-extra/clangd/index/StdLib.h
67

okay, makes sense.

This revision is now accepted and ready to land.May 16 2022, 8:42 AM
sammccall marked 8 inline comments as done.May 17 2022, 1:11 AM
sammccall updated this revision to Diff 430062.May 17 2022, 7:50 AM

Address comments
Add end-to-end test
Move ownership of AsyncTaskRunner to allow blockUntilIdle() in test
Fix bugs caught by end-to-end-test

This revision was landed with ongoing or failed builds.May 17 2022, 7:51 AM
This revision was automatically updated to reflect the committed changes.
sammccall reopened this revision.May 17 2022, 11:11 AM
This revision is now accepted and ready to land.May 17 2022, 11:11 AM

fix HasSubstr matcher type issue

Hmm, the test keeps crashing on the GN bot: http://45.33.8.238/win/58316/step_9.txt
Unfortunately the stacktrace is not symbolized, and I'm not seeing this elsewhere (e.g. premerge bot).

@thakis, any idea why unittests no longer manage to symbolize stack traces on crash on the windows bot? I believe this used to work...

Hmm, the test keeps crashing on the GN bot: http://45.33.8.238/win/58316/step_9.txt
Unfortunately the stacktrace is not symbolized, and I'm not seeing this elsewhere (e.g. premerge bot).

@thakis, any idea why unittests no longer manage to symbolize stack traces on crash on the windows bot? I believe this used to work...

I do not know. Maybe related to the "run many unit tests in a single process" lit change from a month ago?

Anyways, looks like this relanded and broke tests yet again. Maybe find a win box before relanding the next time?

Hmm, the test keeps crashing on the GN bot: http://45.33.8.238/win/58316/step_9.txt
Unfortunately the stacktrace is not symbolized, and I'm not seeing this elsewhere (e.g. premerge bot).

@thakis, any idea why unittests no longer manage to symbolize stack traces on crash on the windows bot? I believe this used to work...

I do not know. Maybe related to the "run many unit tests in a single process" lit change from a month ago?

I suspected that, and verified locally that:

Anyways, looks like this relanded and broke tests yet again. Maybe find a win box before relanding the next time?

I found one, but the test doesn't crash (nor on the premerge bots).

I'll revert again, but I have no idea how to proceed. Only this bot and llvm-avr-linux show the failure, and neither of them have a working symbolizer.

Hmm, the test keeps crashing on the GN bot: http://45.33.8.238/win/58316/step_9.txt
Unfortunately the stacktrace is not symbolized, and I'm not seeing this elsewhere (e.g. premerge bot).

@thakis, any idea why unittests no longer manage to symbolize stack traces on crash on the windows bot? I believe this used to work...

I do not know. Maybe related to the "run many unit tests in a single process" lit change from a month ago?

I suspected that, and verified locally that:

(My bot neither has llvm-symbolizer on path, nor sets LLVM_SYMBOLIZER_PATH fwiw.)

Anyways, looks like this relanded and broke tests yet again. Maybe find a win box before relanding the next time?

I found one, but the test doesn't crash (nor on the premerge bots).

I'll revert again, but I have no idea how to proceed. Only this bot and llvm-avr-linux show the failure, and neither of them have a working symbolizer.

I wouldn't be super surprised if this is related to windows and delayed template parsing. I haven't found any bots on llvm's official waterfall that run ClangdTests.exe – maybe that's why there aren't more bots finding this?

I'd recommend building and running the test on a win box and see if it repros locally.

I managed to get a stack trace from a bot (by leaving the broken commit up for longer this time).

Hmm, the test keeps crashing on the GN bot: http://45.33.8.238/win/58316/step_9.txt
Unfortunately the stacktrace is not symbolized, and I'm not seeing this elsewhere (e.g. premerge bot).

@thakis, any idea why unittests no longer manage to symbolize stack traces on crash on the windows bot? I believe this used to work...

I do not know. Maybe related to the "run many unit tests in a single process" lit change from a month ago?

I suspected that, and verified locally that:

(My bot neither has llvm-symbolizer on path, nor sets LLVM_SYMBOLIZER_PATH fwiw.)

Is this something you could add? I'd much rather revert quickly when after seeing a problem than wait for the slower official bots to catch it, but the stacktrace is pretty key.

Anyways, looks like this relanded and broke tests yet again. Maybe find a win box before relanding the next time?

I found one, but the test doesn't crash (nor on the premerge bots).

I'll revert again, but I have no idea how to proceed. Only this bot and llvm-avr-linux show the failure, and neither of them have a working symbolizer.

I wouldn't be super surprised if this is related to windows and delayed template parsing.

It seems possible. The other other failure I'd seen was the avr-linux bot (not windows).
However it definitely passes in some configurations: both the premerge tests and my local build.

I haven't found any bots on llvm's official waterfall that run ClangdTests.exe – maybe that's why there aren't more bots finding this?

clang-x64-windows-msvc does, but it's slow.

I'd recommend building and running the test on a win box and see if it repros locally.

Again, I have done this, and cannot reproduce (on 19.29.30038.1).

Now I have some idea where the crash is I can try some blind fixes, though.

bnbarham added inline comments.
clang-tools-extra/clangd/ClangdServer.cpp
1004–1006

@sammccall shouldn't we also be waiting for this to finish when ClangdServer is destroyed? IIUC right now the both FileIndex itself (stored in ClangdServer) and the actual UpdateIndexCallbacks (stored in TUScheduler) can be freed while indexStdlib is running asynchronously, resulting in a use-after-free on eg. FIndex->updatePreamble(std::move(IF)). I was confused as to why this wasn't happening in the tests, but these lines would explain it 😅

Adding a IndexTasks->wait() to ~ClangdServer fixes the crash I'm seeing in the sourcekit-lsp tests (see https://github.com/apple/llvm-project/pull/5837), though ideally we (sourcekit-lsp) wouldn't be running any indexing at all. As far as I can tell there's no way to turn off dynamic indexing now though, except for StandardLibrary indexing through the config file (but not from clangd args)?

sammccall added inline comments.Dec 21 2022, 10:09 AM
clang-tools-extra/clangd/ClangdServer.cpp
1004–1006

Thanks for flagging this!

We *almost* have the sequencing we need in ~ClangdServer:

  • when we fall off the end of ~ClangdServer it destroys all its members
  • ClangdServer::IndexTasks is declared after FIndex, so is destroyed first
  • ~AsyncTaskRunner calls wait()

But the task we schedule on IndexTasks captures a ref to UpdateIndexCallbacks, which is owned by the TUScheduler, which we explicitly destroy at the beginning of ~ClangdServer.

However I think your patch is *also* not quite correct: we can wait for the tasks to be empty, but then the TUScheduler could fill it up again before we destroy TUScheduler.

Options include adding an explicit stop() to TUScheduler, changing TUScheduler to not (exclusively) own UpdateIndexCallbacks, or have the task not capture the callbacks by reference.
I'll try the latter first, which seems least invasive.


ideally we (sourcekit-lsp) wouldn't be running any indexing at all. As far as I can tell there's no way to turn off dynamic indexing now though, except for StandardLibrary indexing through the config file (but not from clangd args)?

Clangd won't provide any top-level/namespace-level completions at all without dynamic index (index of preambles), and various other features won't work (docs on hover, include-fixer, type/call-hierarchy). We dropped support for disabling this at some point, as it didn't really seem usable and made features more complex if we tried to accommodate it. At a technical level it would be possible to disable I think, but I'd be really surprised if completion worked well, or if a language server without completion was useful.

StandardLibrary indexing through the config file (but not from clangd args)

We've tried to move away from flags for options that are interesting to users, as config files are more flexible, more forgiving on errors, and allow different settings per-project in a consistent way. (We don't own the editors, so cross-editor consistency is important to being able to support users at all...)

I can see how requiring config to be materialized on disk could be inconvenient for IDEs though. I think we could add a general-purpose --config-inline=<YAML/JSON goes here> flag, and/or the ability to set config over LSP (this can be dynamic, accordingly bigger design space that might be hard to get right).

bnbarham added inline comments.Dec 21 2022, 10:36 AM
clang-tools-extra/clangd/ClangdServer.cpp
1004–1006

Ah, I didn't actually check AsyncTaskRunner. Makes sense it would wait though :). Thanks for looking into this in detail!

or have the task not capture the callbacks by reference. I'll try the latter first, which seems least invasive.

This + moving FIndex after IndexTasks seems reasonable to me.

Clangd won't provide any top-level/namespace-level completions at all without dynamic index (index of preambles), and various other features won't work (docs on hover, include-fixer, type/call-hierarchy).

That's good to know - I assume this extends to indexing the stdlib as well, ie. the stdlib would be missing from top/namespace level completion if not indexed? Does the dynamic index grow with every opened file, or is it just the currently opened file? If disabling breaks everything it's not something we'd want to do, we just don't need it for find refs/etc.

sammccall added inline comments.Dec 21 2022, 11:02 AM
clang-tools-extra/clangd/ClangdServer.cpp
1004–1006

This + moving FIndex after IndexTasks seems reasonable to me.

I sent D140486. FIndex should be before IndexTasks in order to outlive it, unless I'm missing something.

I assume this extends to indexing the stdlib as well, ie. the stdlib would be missing from top/namespace level completion if not indexed?

Any parts of the standard library that you haven't (transitively) included from an open file would be missing. In particular, if you start up clangd, open a blank file, and type std:: you'll get nothing.

Does the dynamic index grow with every opened file, or is it just the currently opened file?

It contains symbols for all the transitively included headers of every file you've had open. So it does grow for each opened file, but in practice usually only by a little bit: if you've opened three files usually >90% of the headers visible from the fourth file are already in the index.

bnbarham added inline comments.Dec 21 2022, 11:09 AM
clang-tools-extra/clangd/ClangdServer.cpp
1004–1006

I sent D140486. FIndex should be before IndexTasks in order to outlive it, unless I'm missing something.

Thanks! No, that's right. And it is already so 👍.

And thanks for the extra information too.