This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Implemented indexing of standard library
AcceptedPublic

Authored by kuhnel on Jun 30 2021, 1:00 AM.

Details

Reviewers
sammccall
Summary

This is only the indexing part, it is NOT wired up to the
rest of ClandgdServer.

This is a step towards an implementation for
https://github.com/clangd/clangd/issues/618

Diff Detail

Event Timeline

kuhnel created this revision.Jun 30 2021, 1:00 AM
kuhnel requested review of this revision.Jun 30 2021, 1:00 AM
kuhnel updated this revision to Diff 355466.Jun 30 2021, 1:01 AM

forgot header comment

kuhnel updated this revision to Diff 355478.Jun 30 2021, 1:54 AM

using EXPECT_THAT_EXPECTED

nridge added a subscriber: nridge.Jul 6 2021, 11:04 PM
kuhnel edited the summary of this revision. (Show Details)Aug 4 2021, 12:18 AM

Sorry about the delay here, and welcome back!

The core design looks sensible to me - we talked about making this part of the background index (which could reuse more code) but we don't actually want the data mixed together.

Main points in the implementation are:

  • simplify the exposed interface
  • i don't think we need to mess with VFS at all actually
  • we should think a little about which index data we actually want to keep/use

Next design questions seem to be about lifetime/triggering:

  • how many configurations of stdlib index to have
  • when do we build the indexes, and who owns them
  • what logic governs whether/which stdlib index is triggered, and where do we put it
clang-tools-extra/clangd/index/StdLib.cpp
30

nit: use char[] or llvm::stringliteral for constants to avoid global constructors/destructors

(see "static initialization order fiasco" and related destructor issues)

33

This generates the same result every time, if there's any chance we'll call it multiple times (i think there is) maybe rather:

llvm::StringRef getIncludeHeader() {
  // construct on first use, never destroy
  static std::string *once = []{
     ...
     return new std::string(std::move(result));
  };
  return *once;
};
36

std::set is a poor (wasteful) data structure.

here a simple reasonably-efficient thing would be
vector<StringLiteral> Headers = { ... };
llvm::sort(Headers);
llvm::unique(Headers);

Same big-O, still simple, smaller-faster in practice.

The slightly silly thing is doing this at runtime at all, but that would be a fair amount of work to avoid.

82

this is a no-op

83

/ may not be what we want on windows right?

85

this is not used, no need to set it

clang-tools-extra/clangd/index/StdLib.h
21

whether you want to use a class for implementation or not, the *interface* exposed by the header file seems like it's just a function (or possibly several if you want to test some details).

25

nit: i'd call this "entry point" or something, because "std lib header file" already means something.

"umbrella header" is a common term for an entry point header designed to include a lot of other headers

clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp
33

I don't think this is a very useful test - it's mostly asserting implementation details, and if the FS is broken the e2e will fail in a really obvious way.

sammccall added inline comments.Aug 5 2021, 5:15 AM
clang-tools-extra/clangd/index/StdLib.cpp
31

this path should probably use native conventions: "/stdlibheaders.cpp" isn't a good filename on windows.

Since it doesn't really need to exist, maybe #ifdef _WIN32 like we do for testRoot in unittests/TestFS would be nice. Such a virtualRoot() could go in FS.h if you like!

Would be nice to have "virtual" in the path so it's clear in e.g. error messages that it's not a real path.

58

This isn't actually threadsafe though?
Calls to viewImpl() are supposed to return filesystems with independent state (workdir).

Maybe it doesn't actually bite us here but I don't see much cost to actually building the FS on demand. (The threadsafeFS should own the string and use a non-copying buffer)

66

this contains only the umbrella headers, and doesn't seem to be overlaid on the real filesystem. So how does this work?

88

no need for the -o either

98

CreateMemBuffer instead to avoid expressing this impossible error condition?

101

hang on, if we're providing the overridden buffer to prepareCompilerInstance (like we do for edited files in clangd that may or may not be saved on disk yet) then why do we need the VFS at all?

109

I don't think storing refs originating in the standard library is profitable overall.
Functionally I can think of cases it makes better & worse, but it's not key to our core use cases and it's the majority of index size IIRC.

110

similarly we don't need relations for this index i believe

111

we don't actually use the include graph afaics

113

This comment is copied from background index, we don't need it everywhere and it's less relevant here (we're not really compiling arbitrary broken code)

126

This never gets used, why are we setting it?

130

This should mention the standard library

133

we could filter Symbols to only include those in our list, rather than all the private implementation cruft. (I expect cruft is the majority by weight)
Not in this patch, but maybe a fixme?

134

this tracer has the wrong lifetime and won't measure the actual indexing time, hoist it to the top?

142

nit: librarx->library

145

this should probably be Dex instead of MemIndex unless it's tiny

clang-tools-extra/clangd/index/StdLib.h
7

This is where an overview of the feature would go :-)

19

the interface here will probably want to grow to include some sort of enum/config struct for the language/version/whatever we decide to support (at minimum I'd think C vs C++). Fine to just hardcode c++ for now, maybe leave a comment.

clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp
44

nit: llvm/clangd don't generally use this auto-everywhere style, rather StandardLibraryIndex Sli.
And this would be SLI rather than Sli.

47

I'm not sure we're testing what we want to be testing here.

In real life, the symbols are not going to be in the entrypoint, but a file included from it.
And clangd's indexer *does* treat the main file differently from others.

It's pretty awkward to actually fix but maybe worth a comment.

55

not sure why we're setting a filter here - we don't need to test that fuzzyfind works

56

AnyScope and Limit are not needed

58
#include "TestIndex.h"
MATCHER_P(Named, N, "") { return arg.Name == N; }
...
EXPECT_THAT(match(*Index, FuzzyFindRequest{}), ElementsAre(Named("myfunc"), Named("otherfunc")));

EXPECT_THAT/ElementsAre give much better error messages (e.g. if there are extra elements, it'll tell you what they are)
If you don't care if there are extra elements, just drop the assertion on size

kuhnel updated this revision to Diff 365181.Aug 9 2021, 7:35 AM
kuhnel marked 29 inline comments as done.
kuhnel edited the summary of this revision. (Show Details)

addressed code review comments

kuhnel added a comment.Aug 9 2021, 7:36 AM

Main points in the implementation are:

  • simplify the exposed interface

Good point, I added a new function indexStandardLibrary() as external interface.

  • i don't think we need to mess with VFS at all actually

Yes, removed that.

  • we should think a little about which index data we actually want to keep/use

I removed Refs, Relations and Graph as per your comment. However I have to admit, I don't know what they are and how they are used.
What's a good place to look at so that I learn what they do?

Next design questions seem to be about lifetime/triggering:

  • how many configurations of stdlib index to have
  • when do we build the indexes, and who owns them
  • what logic governs whether/which stdlib index is triggered, and where do we put it

While you're out, I'll try to set up something so that I can do some manual tests with a real standard library.

clang-tools-extra/clangd/index/StdLib.cpp
98

Not sure what you mean with CreateMemBuffer.
I replaced the if with an assert as this should not fail.

101

We don't need the VFS at all, you're right. Handing over a buffer is good enough.

clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp
47

So you would prefer that we change HeaderMock to only contain #include and then create the mock headers as virtual files in the MockFS?

56

AnyScope is actually needed, otherwise the result is empty.

kuhnel updated this revision to Diff 365370.Aug 10 2021, 12:28 AM

tried to fix Windows build failure

kuhnel updated this revision to Diff 365415.Aug 10 2021, 3:50 AM

fixed a couple of bugs

  • wrong usage of llvm::unique
  • wrong usage of static pointer
sammccall added inline comments.Aug 17 2021, 6:15 AM
clang-tools-extra/clangd/FS.h
77 ↗(On Diff #365415)

"node" doesn't mean anything here.

79 ↗(On Diff #365415)

this should be defined out-of-line unless it's performance-critical for some reason.

Conditional compilation in inline bodies is a magnet for ODR violations. _WIN32 is probably fine but no reason to scare the reader :-)

80 ↗(On Diff #365415)

This is a (drive-)relative path, we have various places we need absolute paths and may want to reuse this there. Does C:\virtual work?

80 ↗(On Diff #365415)

as mentioned this should ideally include some clue that it's a virtual path

clang-tools-extra/clangd/index/StdLib.cpp
41

llvm::sys::path::append

41

don't use .hpp and rely on the driver picking the right language & version, this won't generalize. Instead, insert the flags "-xc++-header" and "-std=c++14" based on the StandardLibraryVersion

49

the static variable must be a string* rather than string to avoid global destructors.

73

I'm not sure what this means, I don't think there's anything better to do here.

87

why false? the default (true) is what clang's parser needs

98

pass the default nullptr instead of a no-op lambda, it allows the indexer to skip work

111

this assertion message doesn't say anything vs the assertion itself, either drop it or say why instead

clang-tools-extra/clangd/index/StdLib.h
35

nit: enum class to avoid polluting namespace.

llvm style capitalizes variables: CXX14

35

nit: "variant" rather than version to avoid confusion with language version?
(since this will cover c also)

37

The comment should be aimed at users of this module, not implementers of it :-)
This is the main API comment...

38

I think I agree with your comment elsewhere that it's sufficient to return unique_ptr, indicate it might be null, and log errors.

40

if introducing the enum i'd remove the default, this policy is best expressed at the call site

43

This class doesn't need to be public and I don't think it needs to exist.

generateIncludeHeader is a pure function of StandardLibraryVersion, it can be a free function.

This leaves only indexHeaders, which can also be a free function of StandardLibraryVersion and TFS.
(VirtualUmbrellaHeaderFileName is purely transient state, and isn't used by any tests)

61

as mentioned, umbrella header

65

this is gone

nridge added inline comments.Aug 17 2021, 12:34 PM
clang-tools-extra/clangd/index/StdLib.cpp
73

One could imagine picking a source file from the project's CDB, and using its flags to parse the standard library.

That could be relevant for macros that affect the way standard library headers are parsed (like _GLIBCXX_DEBUG perhaps?)

I removed Refs, Relations and Graph as per your comment. However I have to admit, I don't know what they are and how they are used.
What's a good place to look at so that I learn what they do?

Sorry about missing this.
The best place is the SymbolIndex interface - Symbols corresponds to the fuzzyFind/lookup methods, refs and relations each have their own methods. You can somewhat guess from the methods/data structures what these are used for, but clangd can answer your question! e.g. find-references on refs() will show you it is used in the find-references implementation :-) As well as rename and others.

Include graph is a bit of a special case, it's basically just used for partitioning, incrementally updating, and loading the background index IIRC.

clang-tools-extra/clangd/index/StdLib.cpp
73

Oh, that makes sense. I'd still probably not do this, given:

  • for projects with a CDB, we'll probably bg-index most of the stdlib in that configuration soon anyway.
  • until we see evidence otherwise, my guess is differences are pretty minor. (Honestly if it were easy to just ship a prebuilt index for code completion, I would be tempted.)
  • it adds some constraints on design/layering/sequencing etc
  • it makes questions of how many configurations to build/when to reuse vs invalidate more complicated
kuhnel updated this revision to Diff 367523.Aug 19 2021, 9:13 AM
kuhnel marked 18 inline comments as done.

addressed review comments, has use-after-free problem

kuhnel added inline comments.Aug 19 2021, 9:16 AM
clang-tools-extra/clangd/index/StdLib.cpp
73

Yes, my question was: can we get the real compile command form the file in which we're querying the standard library index and then extract the (relevant) compiler argument from that.

That might also help in guessing the current language variant.

clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp
52

@sammccall I seem to be running into a use-after-free problem here. Debugging the whole thing shows that Index is pointing to an invalid address. So the problem is somewhere between returning the unique_ptr from indexUmbrellaHeaders(...) and assigning it to the Index variable.

Can you please take a look and give me a hint how to fix this?

nridge added inline comments.Aug 19 2021, 11:08 AM
clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp
52

I think your issue may be that Dex doesn't actually take ownership of the slabs that get passed to it; the slabs need to outlive it.

Dex has another constructor which allows it to also take ownership, and a Dex::build() helper function to call it -- you probably want to be using that.

kuhnel marked an inline comment as done.Aug 20 2021, 1:23 AM
kuhnel added inline comments.
clang-tools-extra/clangd/unittests/StdLibIndexTests.cpp
52

Awesome, thx @nridge ! That fixed the use-after-free!
I was searching in the wrong place the whole time...

kuhnel updated this revision to Diff 367737.Aug 20 2021, 2:01 AM
kuhnel marked an inline comment as done.

addressed code review comments

also fixed use-after-free

sammccall accepted this revision.Aug 20 2021, 2:26 AM

Remainder is just nits, looks good!

clang-tools-extra/clangd/FS.cpp
126 ↗(On Diff #367737)

"/virtual/"

clang-tools-extra/clangd/index/StdLib.cpp
80

nit: the unhandled case can't dynamically happen. This should probably be a switch, and the default: case should be llvm_unreachable("Unhandled language variant")

105

tandard -> Standard

118

missing space after colon

123

elog("Standard Library Index: {0}", std::move(Err));

clang-tools-extra/clangd/index/StdLib.h
35

The somehow/magically is by looking at the clang::LangOptions of the file being parsed :-)
This can happen if/when we move the triggering into TUScheduler which obtains and parses the command line flags.

Concretely I'd expect to resolve this FIXME by adding some function like Optional<StandardLibraryVariant> chooseStandardLibrary(const LangOptions&)

If this makes sense to you, you might want to make the comment a bit less hand-wavy

36

librarVariant ->libraryVariant

39

I'd say rather "this index allows completion of standard library symbols whose headers have not been included yet".

The current text implies that we'd turn this off once we see #include <vector>, thus breaking completion of e.g unordered_map. I don't think we want to.

41

Sorry, I think I wasn't clear: the language variant should still be a parameter here, it just shouldn't have a default value. Instead the caller should pass it explicitly, this makes it obvious at the callsite that there's an imperfect assumption being made.

46

nit: umbrellaHeader singular. (The umbrella is the file mapped to HeaderSources, the headers it includes are not umbrella headers for our purposes)

47

enums are passed by value, not const reference
(and below)

49

nit: capitalization

This revision is now accepted and ready to land.Aug 20 2021, 2:26 AM
kuhnel updated this revision to Diff 367787.Aug 20 2021, 7:07 AM

fixing windows build

What is the status of this -- is it ready to be merged?

What is the status of this -- is it ready to be merged?

This works as far as it goes, but it needs someone to wire it up completely: build these indexes somewhere that's less blocking than the main thread, determine the right one to attach dynamically based on the file language, etc.

The original plan was that Christian would do this as a followup but that's not likely to happen. Meanwhile many of the usual suspects are a bit backed up. We can definitely land this if you or anyone might want to finish it...