This is an archive of the discontinued LLVM Phabricator instance.

[clangd] collect symbol #include & insert #include in global code completion.
ClosedPublic

Authored by ioeric on Jan 29 2018, 5:28 AM.

Details

Summary

o Collect suitable #include paths for index symbols. This also does smart mapping
for STL symbols and IWYU pragma (code borrowed from include-fixer).
o For global code completion, add a command for inserting new #include in each code
completion item.

Event Timeline

ioeric created this revision.Jan 29 2018, 5:28 AM
ioeric edited the summary of this revision. (Show Details)Jan 29 2018, 5:28 AM

TL;DR:

  • I think we can expose fewer types/abstractions in the header mapping, and put it all in one header. If we want some functionality to be global-index only, we should make it an option to the symbolcollector, not stash the logic somewhere else I think.
  • regex seems like a big hammer, but we can revisit if this is on the hot path
  • the overridden path should probably be an "include string" rather than a URI
  • lots of naming nits as usual :-)
  • I haven't reviewed the actual preprocessor/include-path madness yet :-)
clangd/ClangdLSPServer.cpp
200

The caveat from apply fix also applies here. Maybe we should combine the tail of this case and the tail of apply-fix (the part from WorkspaceEdit)? They're simple (and wrong) now, but will need the same changes.

clangd/ClangdServer.cpp
365

(I still want to look at this bit... on monday :-)

clangd/CodeComplete.cpp
255–256

contents is unused

290–291

not sure where this is shown, but maybe add "if needed"?

297

LSP says we should use additionalTextEdits instead.
I think we have OK reasons, but you probably want to mention it here.

852

What do you think about (redundantly) passing filename to the constructor, and stashing it in a member, instead of passing to all the methods here?
I think main purpose of having this class is not having the *primary* data flow obscured by all the various context variables.

If you don't like that, we could also pass the SemaCCInput struct to the constructor itself. The distinction between constrcutor params and run() params is pretty artificial.

(as noted above, I think/hope Contents is unused in any case)

clangd/Protocol.cpp
371

nit: drop braces consistent with the other code around here

clangd/Protocol.h
380

I would call this "header". At least in the current implementation, it's not a URI, and we we really don't want anyone trying to parse it.

381

I don't think this comment applies here?

400

nit: remove blank line (and below). These form a list.

412

Hmm, you could consider struct Command : public ExecuteCommandParams { std::string title; }

I guess it's a hack, but this repetition seems a little silly. You could then reuse some of the JSON stuff too.

clangd/global-symbol-builder/PragmaCommentHandler.h
1 ↗(On Diff #131782)

can this also be put in HeaderMap.h?

28 ↗(On Diff #131782)

I'm not sure this class needs to be exposed, can we just expose a function unique_ptr<CommentHandler> collectHeaderMaps(HeaderMapCollector*)?

clangd/global-symbol-builder/STLPostfixHeaderMap.h
1 ↗(On Diff #131782)

STL isn't really a name we want to be using :-)

Why isn't this in HeaderMapCollector.h?

clangd/index/HeaderMapCollector.h
9 ↗(On Diff #131782)

context! what are header mappings, why do we use them (example)

23 ↗(On Diff #131782)

I'm not sure "header mapping" is as descriptive as it should be, and "collector':

  • we're mapping *to a canonical* representation
  • the context is "how should we spell this in a #include"

"Collector" doesn't really describe the class - it both records and applies rules.

Maybe CanonicalIncludes?

25 ↗(On Diff #131782)

these public typedefs and the explicit constructor seem leaky. Maybe just expose a public function addStandardLibraryMapping(CanonicalIncludes*)?

31 ↗(On Diff #131782)

nit: just addMapping?

41 ↗(On Diff #131782)

nit: just mapHeader?

41 ↗(On Diff #131782)

what's the input? absolute path? working-directory-relative? include-path relative?

Similarly, what's the output?

(This also relates to addHeaderMapping, so might belong in the class desc)

48 ↗(On Diff #131782)

This comment and the mutable are scary. We don't need to be threadsafe I think?

50 ↗(On Diff #131782)

all the rules we have so far are suffix matches. It seems a little weird to allow arbitrary regexes (and prevents fast lookup by sorting the reversed strings).

But it's a small amount of code and I guess the table will stay small. Maybe a FIXME that performance could be improved here?

clangd/index/Index.h
154

I think we concluded this shouldn't be a URI but a literal include string. If absent, we should fall back to CanonicalDeclaration.

The reason is that we choose to interpret the target of IWYU directives as a literal string to include. Alternatively we could choose to resolve it somehow and store the URI to the resolved target, but that would be more work.

clangd/index/SymbolCollector.cpp
134

shouldn't this just be return SK != Namespace, to avoid duplication? Since we only get here if we're indexing this symbol?

sammccall added inline comments.Feb 5 2018, 5:08 AM
clangd/ClangdServer.cpp
372

Pull out a function to compute ToInclude from a uri/abspath?

379

Need high-level comments explaining what we're doing to determine the right include path.

387

temporary logs?

399

explain why? (this has implications for direct vs transitive includes I think)

401

hmm, why are we actually going to run the compiler/preprocessor?
It looks like we just get HeaderMapping out - can we "just" call ApplyHeaderSearchOptions instead?

419

do we handle the case that the suggestion is already included?
(including the case where it's included by a different name)

422

leftover?

444

maybe this fixme should just be to have more consistent style support in clangdserver?

ioeric updated this revision to Diff 133021.Feb 6 2018, 8:55 AM
ioeric marked 10 inline comments as done.
  • Merge with origin/master
  • Renamed and moved a bunch files according to review comments.
  • Merge remote-tracking branch 'origin/master' into include
  • Half way
  • Merge with uri changes.
  • merged with origin/master
  • Merge with origin/master
  • s/HeaderUri/IncludeURI/
ioeric added a comment.Feb 6 2018, 8:57 AM

Thanks for the comments! I addressed comments in the symbol collect part (I think?). Will add tests in a followup patch.

clangd/index/HeaderMapCollector.h
48 ↗(On Diff #131782)

mutable was here because of llvm::Regex. Matching again a llvm::Regex is not a constant operation, but we would still like the lookup function to be const.

clangd/index/Index.h
154

In our latest offline discussion, we concluded that this would be either URI of a file (in .inc case) or a literal string that is suitable to be inserted. I left resolving IWYU targets as a FIXME in this patch as it is more work and doesn't seem to cause any trouble at this point. Will revisit when it actually becomes a problem.

clangd/index/SymbolCollector.cpp
134

I think it'd be easier to justify the behavior with a white list. There are other symbols like NamespaceAlias etc, which we might or might not be collecting.

ioeric added a comment.Feb 6 2018, 8:58 AM
This comment was removed by ioeric.
ioeric updated this revision to Diff 133202.Feb 7 2018, 5:26 AM
  • Added tests for IncludeURI and CanonicalIncludes and minor cleanup.
ioeric updated this revision to Diff 133203.Feb 7 2018, 5:30 AM
  • Merge with origin/master
ioeric updated this revision to Diff 133411.Feb 8 2018, 6:20 AM
ioeric marked 13 inline comments as done.
  • Added tests for all components; more cleanup; s/IncludeURI/IncludeHeader/
ioeric added a comment.Feb 8 2018, 6:20 AM

Thanks for the comments!

Sorry that I didn't clean up the code before sending out the prototype. I planned to deal with code structure and style issues after getting some early feedback, but I think the patch is ready for review now.

clangd/ClangdServer.cpp
399

Since we only need predecessor for HeaderSearch and don't really care about the actual code, I set this to false in the hope of speeding up the code. But in the latest revision, I simply use an empty file (as we only care about header search), so this option is no longer necessary.

401

I couldn't find an easy way to use ApplyHeaderSearchOptions... It requires an instance of HeaderSearch, which needs a preprocessor and a bunch of other objects (SourceManager, Target etc). And these objects are mostly initialized in BeginSourceFile. We could probably pull out the code that only initializes up to proprocessor, but this is not very trivial :(

419

Yes and no. The current implementation only does textual matching against existing #includes in the current file and inserts the header if no same header was found. This complies with IWYU. But we are not handling the case where the same header is included by different names. I added a FIXME for this.

clangd/CodeComplete.cpp
852

Passing filename to the constructor sounds good. Thanks!

ioeric edited the summary of this revision. (Show Details)Feb 8 2018, 6:20 AM
ioeric retitled this revision from [clangd] Prototype: collect symbol #include & insert #include in global code completion. to [clangd] collect symbol #include & insert #include in global code completion..

Some comments on the insert side, which looks pretty good. I'll take another look at indexing tomorrow.

clangd/ClangdServer.cpp
365

This has a fair amount of logic (well, plumbing :-D) and uses relatively little from ClangdServer.
Can we move shortenIncludePath to a separate file and pass in the FS and command?

I'd suggest maybe Headers.h - The formatting part of insertInclude could also fit there, as well as probably the logic from switchSourceHeader I think (the latter not in this patch of course).

This won't really help much with testing, I just find it gets hard to navigate files that have lots of details of unrelated features. ClangdUnit and ClangdServer both have this potential to grow without bound, though they're in reasonable shape now. Interested what you think!

367

maybe drop "via clang::HeaderSearch" (it's doc'd in the implementation) and add an example?

It might be worth explicitly stating that the result is an #include string (with quotes) - you just call it a "path" here.

"shortest" makes sense as a first implementation, but we may want to document something more like "best" - I know there are codebases we care about where file-relative paths are banned. Also I suspect we don't give "../../../../Foo.h" even if it's shortest :-)

376

This comment helps a lot. The subtext is: HeaderSearch is hard to construct directly, so we're doing this weird dance.
I think this is worth calling out even louder - when I read this sort of code I tend to take a *long* time to work out why the code seems to be doing unrelated work.

417

consume the optional IsSystem and use it to quote appropriately?

419

I can't see the FIXME? (There's one in the header, but it doesn't seem to really cover this case)

So this doesn't seem so hard: we can pass the file content, turn off recursive PP, and add PP callbacks to capture the names of each included file (the include-scanner patch does this).
I'm not sure it's worth deferring, at least we should fix it soon before we lose context.

But up to you, I'd suggest putting the fixme where we expect to fix it.

unittests/clangd/ClangdTests.cpp
849

Similarly, it'd be nice to pull these tests out into a test file parallel to the header.
(As with the other tests, often it's easiest to actually test through the ClangdServer interface - this is mostly just for navigation)

ioeric updated this revision to Diff 133595.Feb 9 2018, 6:00 AM
ioeric marked 5 inline comments as done.
  • Addressed review comments.
ioeric added a comment.Feb 9 2018, 6:05 AM

Thanks! PTAL

clangd/ClangdServer.cpp
365

Done.

I didn't move the formatting code, as half of the code is pulling out the style, which we might want to share/change depending on other clangd logics that might use style.

419

Since you prefer, I have included the change in the patch. I wanted to get to this as soon as this patch is landed.

unittests/clangd/ClangdTests.cpp
849

Done. I left a simple test for replacements.

ioeric updated this revision to Diff 133599.Feb 9 2018, 6:15 AM
  • fix a leftover bug

Insertion side LGTM, feel free to split and land.
Sorry I need to take off and will need to get to indexing on monday :(

clangd/ClangdServer.cpp
365

I'd still think pulling out Expected<tooling::Replacement> insertInclude(File, Code, Header, VFS, Style) would be worthwhile here - the formatting isn't a lot of code, but it's a bit magic, plus the quote handling... it's a bit of code. It'd make it more obvious what the interactions with ClangdServer's state are. But up to you, we can always do this later.

367

"shortest" makes sense as a first implementation, but we may want to document something more like "best" - I know there are codebases we care about where file-relative paths are banned. Also I suspect we don't give "../../../../Foo.h" even if it's shortest :-)

I think this part wasn't addressed

clangd/Headers.cpp
75

comment is outdated now

115

can you include the Header in this log message? (and possibly File, but that might add more noise than signal)

clangd/Headers.h
31

header-already-included is not an error condition.

Suggest returning llvm::Expected<Optional<String>>, or returning "" for this case.

ioeric updated this revision to Diff 133635.Feb 9 2018, 9:30 AM
ioeric marked 3 inline comments as done.
  • Addressed review comments.
clangd/ClangdServer.cpp
365

So it's just 2 lines for the replacement magic (+1 for comment) now after removing some redundant code.

I like shortenIncludePath better because it's more self-contained and easier to write tests against, and insertInclude doesn't seem to carry much more weight while we would need to handle replacements logic which has been tested in the tests.

367

I added a comment for this in the header. But I might be misunderstanding you suggestion. Did you mean we need a better name for the function?

Insertion still LG (couple of nits, inline).

For indexing, my biggest questions:

  • I worry CanonicalIncludes doesn't get enough information to make good decisions - passing the include stack in some form may be better
  • CanonicalIncludes has slightly weird patterns of reads/writes - most writes are permanent and a few are transient, and it's not totally obvious how it works in a multithreading context (though your code is correct). I'm not sure whether this is worth fixing.
clangd/ClangdServer.cpp
367

I think the function name is fine as a shorthand, just that the comment is a bit overspecified and possibly inaccurate compared to e.g. "Determines the preferred way to #include a file, taking into account the search path. Usually this will prefer a shorter representation like 'Foo/Bar.h' over a longer one like 'Baz/include/Foo/Bar.h'".

I don't think the ../../../Foo.h case is worth explicitly calling out - I just meant it as an example of why we *don't* want an overly-specific contract here.

clangd/CodeComplete.cpp
290

whether

301

user-facing. unwrap?

clangd/clients/clangd-vscode/package.json
1 ↗(On Diff #133635)

phab says you have ws-only changes in this file, which you might want to revert

clangd/global-symbol-builder/CMakeLists.txt
1 ↗(On Diff #133635)

(and here - ws-only changes?)

clangd/global-symbol-builder/GlobalSymbolBuilderMain.cpp
99

Do we create one of these per TU or per thread? The former is "clean" but seems potentially wasteful (compiling all those system header regexes for each TU). The latter is "fast" but potentially non-hermetic (can't think of a triggering case though).

Maybe we should have a split between transient mappings (IWYU) and permanent ones?

clangd/index/CanonicalIncludes.cpp
52

I think this is

if (!Text.consume_front(IWYUPragma))
return false;

73

"STL" :-)

75

these aren't standard library - deserves a comment?

in general it looks like there's a bunch of standard library stuff, also posix stuff, and some compiler intrinsics.
If this is the case, maybe "system headers" is a better descriptor than "standard library".

Can we document approximately which standard libraries, which compiler extensions, and other standards (posix, but I guess windows one day) are included?

clangd/index/CanonicalIncludes.h
11

Can we split out the main ideas a bit? I think these are: a) what include mapping is, b) IWYU pragmas, c) standard library.
We should also probably call out the relationship with the stuff in clangd/Headers.h.

e.g.

At indexing time, we decide which file to #included for a symbol.
Usually this is the file with the canonical decl, but there are exceptions:

- private headers may have pragmas pointing to the matching public header.
  (These are "IWYU" pragmas, named after the include-what-you-use tool).
- the standard library is implemented in many files, without any pragmas. 
  We have a lookup table for common standard library implementations.
  libstdc++ puts char_traits in bits/char_traits.h, but we #include <string>.

The insert-time logic in clang/Headers.h chooses how to spell the
filename we emit here; this depends on the search path at the insertion site.

(I think .inc files are conceptually similar and should also be handled and mentioned here, commented below)

33

nit: \brief needed?

I'm not sure this class actually collects anything - that's the handler returned by collectIWYUHeaderMaps.

Maybe "Maps a definition location onto an #include file, based on a set of filename rules."?

45

Maps all files matching \p RE to \p CanonicalPath?

51

Just this comment is probably enough for the whole function...

53

So I'm a bit concerned this is too narrow an interface, and we really want to deal with SourceLocation here which would give us the include stack.

Evidence #1: .inc handling really is the same job, but because this class has a single-file interface, we have to push it into the caller.
Evidence #2: I was thinking about a more... principled way of determining system headers. One option would be to have a whitelist of public headers, and walk up the include stack to see if you hit one. (I'm not 100% sure this would work, but still...) This isn't implementable with the current interface.

One compromise would be to pass in a stack<StringRef> or something. Efficiency doesn't really matter because the caller can cache based on the top element.

57

do we need both tables? it seems the system headers (which are regex-based) will typically outnumber the IWYU pragmas by a bunch.
And if we care about performance at all, we can cache the results of mapHeader?

68

Explicitly mention that the mappings are registered with *Includes?

(interaction isn't totally obvious here because Includes has read and write methods)

clangd/index/SymbolCollector.h
46

const?

ioeric updated this revision to Diff 133895.Feb 12 2018, 10:11 AM
ioeric marked 15 inline comments as done.
  • Addressed some review comments.
clangd/ClangdServer.cpp
367

I see. Thanks!

clangd/global-symbol-builder/GlobalSymbolBuilderMain.cpp
99

This is created per TU now. In an earlier revision, this was one-per-program because we statically constructed a regex map and passed the map into the CanonicalIncludes via the constructor, like we do in include-fixer (https://github.com/llvm-mirror/clang-tools-extra/blob/master/include-fixer/find-all-symbols/STLPostfixHeaderMap.cpp#L15). But with the current design, I didn't do this because one-per-TU approach seems to be cleaner, and the regex construction time seems to be relatively small comparing to the time spent on actually compiling a TU.

If we really want to squeeze performance here, we would probably need either an interface that takes a static regex mapping or one that takes pre-computed llvm::Regex. But I'm not really sure if it would be worth it since this is not a performance critical code path.

clangd/index/CanonicalIncludes.cpp
75

Switched both function and variable to "system headers" and updated the documentation.

clangd/index/CanonicalIncludes.h
11

Thanks for the suggestion!

We should also probably call out the relationship with the stuff in clangd/Headers.h.

I'm a bit inclined to not call out the relationship here as this doesn't seem to be a better place than the code where they interact, and the doc could easily be outdated if the interaction is changed from other libraries.

53

Evidence #1: .inc handling really is the same job, but because this class has a single-file interface, we have to push it into the caller.

I think this would depend on how you define the scope of this class. .inc handling is a subtle case that I'm a bit hesitated to build into the interface here.

Evidence #2: ....

This is actually very similar to how the hardcoded mapping was generated. I had a tool that examined include stacks for a library (e.g. STL) and applied a similar heuristic - treat the last header in the include stack within the library boundary as the "exporting" public header for a leaf include header, if there is no other public header that has shorter distance to that include. For example, if we see a stack like stl/bits/internal.h -> stl/bits/another.h -> stl/public_1.h -> user_code.cc, we say public_1.h exports bits/internal.h and add a mapping from bits/internal.h$ to public_1.h. But if we see another (shorter) include stack like stl/bits/internal.h -> stl/public_2.h -> user_code.cc, we say stl/public_2.h exports stl/bits/internal.h. This heuristic works well for many cases. However, this may produce wrong mappings when an internal header is used by multiple public headers. For example, if we have two include stacks with the same length e.g. bits/internal.h -> public_1.h -> user.cc and bits/inernal.h -> public_2.h -> user.cc, the result mapping would depend on the order in which we see these two stacks; thus, we had to do some manual adjustment to make sure bits/internal.h is mapped to the correct header according to the standard.

I am happy to discuss better solution here. But within the scope of this patch, I'd prefer to stick to interfaces that work well for the current working solution instead of designing for potential future solutions. I should be easy to iterate on the interfaces as these interfaces aren't going to be widely used in clangd after all. WDYT?

57

Makes sense.

ioeric updated this revision to Diff 134394.Feb 15 2018, 2:42 AM
  • Merged with origin/master
sammccall accepted this revision.Feb 16 2018, 12:42 AM

LG apart from the .inc handling (happy to chat more)

clangd/index/CanonicalIncludes.h
53

I think this would depend on how you define the scope of this class. .inc handling is a subtle case that I'm a bit hesitated to build into the interface here.

Sure it's subtle, but it's clearly in the scope of determining what the canonical header is for a symbol, which is the job of this class. We wouldn't be building it into the interface - on the contrary, the *current* proposed interface codifies *not* handling .inc files.

But you're right that we should check in something that handles most cases.

My preference would be to drop .inc from this patch until we can incorporate it into the design, but I'm also OK with a FIXME to move it.

unittests/clangd/HeadersTests.cpp
30

This test would be clearer to me if you removed this helper and just did

FS.Files["sub/bar.h"] = ...

in the test.

Can we change buildTestFS in TestFS.cpp to call getVirtualTestFilePath on relative paths to allow this?

(I can do this as a followup if you like, but it seems like a trivial change)

unittests/clangd/SymbolCollectorTests.cpp
50

nit: this is only used once, as Not(HasIncludeHeader()). Just use IncludeHeader("")?
The slight difference in detail handling doesn't seem to matter (I'm not even sure if it's exactly the right assertion)

590

Took me a while to understand this test, and still not sure I get it. Maybe "class string" here?

This revision is now accepted and ready to land.Feb 16 2018, 12:42 AM
sammccall added inline comments.Feb 16 2018, 1:45 AM
unittests/clangd/HeadersTests.cpp
30

I thought better of that change to TestFS, but did some renames in r325326.

So this would be FS.Files[testPath("sub/bar.h")) = ... which still seems more transparent - up to you.

ioeric updated this revision to Diff 134600.Feb 16 2018, 5:56 AM
ioeric marked 3 inline comments as done.
  • Merged with origin/master
  • Addressed review comments; removed .inc handling.

Thank you for reviewing this!

clangd/index/CanonicalIncludes.h
53

Okay, I removed .inc handling from this patch ;)

unittests/clangd/HeadersTests.cpp
30

Thanks a lot!

This revision was automatically updated to reflect the committed changes.
This revision was automatically updated to reflect the committed changes.

Hi @ioeric:

Just to let you know that your submission seems to break a Windows test.

 FAIL: Extra Tools Unit Tests :: clangd/Checking/./ClangdTests.exe/SymbolCollectorTest.IWYUPragma (15474 of 36599)

C:\xxx\llvm\tools\clang\tools\extra\unittests\clangd\SymbolCollectorTests.cpp(620): error : Value of: Symbols [C:\xxx\build\check-all.vcxproj]

  Expected: has 1 element and that element (q name "Foo") and ((decl u r i "file:///C%3a/clangd-test/symbol.h") and (include header "\"the/good/header.h\""))

    Actual: { Foo }, where the following matchers don't match any elements:

  matcher #0: (q name "Foo") and ((decl u r i "file:///C%3a/clangd-test/symbol.h") and (include header "\"the/good/header.h\""))

  and where the following elements don't match any matchers:

  element #0: Foo

  [  FAILED  ] SymbolCollectorTest.IWYUPragma (113 ms)

  [----------] 1 test from SymbolCollectorTest (113 ms total)

Thanks