
[clangd] collect symbol #include & insert #include in global code completion.

Authored by ioeric on Jan 29 2018, 5:28 AM.



o Collect suitable #include paths for index symbols. This also does smart mapping
for STL symbols and IWYU pragma (code borrowed from include-fixer).
o For global code completion, add a command for inserting new #include in each code
completion item.

ioeric updated this revision to Diff 133203.Feb 7 2018, 5:30 AM
  • Merge with origin/master
ioeric updated this revision to Diff 133411.Feb 8 2018, 6:20 AM
ioeric marked 13 inline comments as done.
  • Added tests for all components; more cleanup; s/IncludeURI/IncludeHeader/
ioeric added a comment.Feb 8 2018, 6:20 AM

Thanks for the comments!

Sorry that I didn't clean up the code before sending out the prototype. I planned to deal with code structure and style issues after getting some early feedback, but I think the patch is ready for review now.


Since we only need the preprocessor for HeaderSearch and don't really care about the actual code, I set this to false in the hope of speeding up the code. But in the latest revision, I simply use an empty file (as we only care about header search), so this option is no longer necessary.


I couldn't find an easy way to use ApplyHeaderSearchOptions... It requires an instance of HeaderSearch, which needs a preprocessor and a bunch of other objects (SourceManager, Target etc). And these objects are mostly initialized in BeginSourceFile. We could probably pull out the code that only initializes up to the preprocessor, but this is not trivial :(


Yes and no. The current implementation only does textual matching against existing #includes in the current file and inserts the header if no same header was found. This complies with IWYU. But we are not handling the case where the same header is included by different names. I added a FIXME for this.
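
A minimal sketch of the purely textual check described above, written against the standard library only. The helper names `spelledInclude` and `alreadyIncluded` are hypothetical and are not taken from the patch; the sketch also shows the limitation mentioned here, since the same header spelled two different ways is treated as two headers.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper (not clangd's code): if Line is an #include directive,
// returns the header as spelled, including its quotes or angle brackets.
std::optional<std::string> spelledInclude(const std::string &Line) {
  std::size_t Pos = Line.find_first_not_of(" \t");
  if (Pos == std::string::npos || Line[Pos] != '#')
    return std::nullopt;
  Pos = Line.find_first_not_of(" \t", Pos + 1);
  if (Pos == std::string::npos || Line.compare(Pos, 7, "include") != 0)
    return std::nullopt;
  Pos = Line.find_first_of("\"<", Pos + 7);
  if (Pos == std::string::npos)
    return std::nullopt;
  const char Close = Line[Pos] == '<' ? '>' : '"';
  const std::size_t End = Line.find(Close, Pos + 1);
  if (End == std::string::npos)
    return std::nullopt;
  return Line.substr(Pos, End - Pos + 1);
}

// True if Code already contains an #include spelled exactly like Header.
// Purely textual: "foo.h" and "./foo.h" count as different headers.
bool alreadyIncluded(const std::string &Code, const std::string &Header) {
  std::istringstream Lines(Code);
  std::string Line;
  while (std::getline(Lines, Line)) {
    std::optional<std::string> Spelled = spelledInclude(Line);
    if (Spelled && *Spelled == Header)
      return true;
  }
  return false;
}
```

With this sketch, `alreadyIncluded(Code, "\"foo.h\"")` is true only when the file contains that exact spelling, which is why the different-names case still needs a FIXME.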


Passing filename to the constructor sounds good. Thanks!

ioeric edited the summary of this revision. Feb 8 2018, 6:20 AM
ioeric retitled this revision from [clangd] Prototype: collect symbol #include & insert #include in global code completion. to [clangd] collect symbol #include & insert #include in global code completion..

Some comments on the insert side, which looks pretty good. I'll take another look at indexing tomorrow.


This has a fair amount of logic (well, plumbing :-D) and uses relatively little from ClangdServer.
Can we move shortenIncludePath to a separate file and pass in the FS and command?

I'd suggest maybe Headers.h - The formatting part of insertInclude could also fit there, as well as probably the logic from switchSourceHeader I think (the latter not in this patch of course).

This won't really help much with testing, I just find it gets hard to navigate files that have lots of details of unrelated features. ClangdUnit and ClangdServer both have this potential to grow without bound, though they're in reasonable shape now. Interested what you think!


maybe drop "via clang::HeaderSearch" (it's doc'd in the implementation) and add an example?

It might be worth explicitly stating that the result is an #include string (with quotes) - you just call it a "path" here.

"shortest" makes sense as a first implementation, but we may want to document something more like "best" - I know there are codebases we care about where file-relative paths are banned. Also I suspect we don't give "../../../../Foo.h" even if it's shortest :-)


This comment helps a lot. The subtext is: HeaderSearch is hard to construct directly, so we're doing this weird dance.
I think this is worth calling out even louder - when I read this sort of code I tend to take a *long* time to work out why the code seems to be doing unrelated work.


consume the optional IsSystem and use it to quote appropriately?


I can't see the FIXME? (There's one in the header, but it doesn't seem to really cover this case)

So this doesn't seem so hard: we can pass the file content, turn off recursive PP, and add PP callbacks to capture the names of each included file (the include-scanner patch does this).
I'm not sure it's worth deferring, at least we should fix it soon before we lose context.

But up to you, I'd suggest putting the fixme where we expect to fix it.


Similarly, it'd be nice to pull these tests out into a test file parallel to the header.
(As with the other tests, often it's easiest to actually test through the ClangdServer interface - this is mostly just for navigation)

ioeric updated this revision to Diff 133595.Feb 9 2018, 6:00 AM
ioeric marked 5 inline comments as done.
  • Addressed review comments.
ioeric added a comment.Feb 9 2018, 6:05 AM

Thanks! PTAL



I didn't move the formatting code, as half of the code is pulling out the style, which we might want to share/change depending on other clangd logic that might use style.


Since you prefer it, I have included the change in this patch. I had planned to get to this as soon as this patch landed.


Done. I left a simple test for replacements.

ioeric updated this revision to Diff 133599.Feb 9 2018, 6:15 AM
  • fix a leftover bug

Insertion side LGTM, feel free to split and land.
Sorry I need to take off and will need to get to indexing on monday :(


I'd still think pulling out Expected<tooling::Replacement> insertInclude(File, Code, Header, VFS, Style) would be worthwhile here - the formatting isn't a lot of code, but it's a bit magic, plus the quote handling... it's a bit of code. It'd make it more obvious what the interactions with ClangdServer's state are. But up to you, we can always do this later.


"shortest" makes sense as a first implementation, but we may want to document something more like "best" - I know there are codebases we care about where file-relative paths are banned. Also I suspect we don't give "../../../../Foo.h" even if it's shortest :-)

I think this part wasn't addressed


comment is outdated now


can you include the Header in this log message? (and possibly File, but that might add more noise than signal)


header-already-included is not an error condition.

Suggest returning llvm::Expected<Optional<String>>, or returning "" for this case.
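
To illustrate the suggested shape, here is a hedged standard-library sketch in which `std::variant` stands in for `llvm::Expected` and `std::optional` for `llvm::Optional`. The names `InsertError`, `InsertResult` and `includeToInsert` are hypothetical and exist only for this example.

```cpp
#include <optional>
#include <string>
#include <variant>

// Hypothetical error type for this sketch.
struct InsertError {
  std::string Message;
};

// Stand-in for llvm::Expected<llvm::Optional<std::string>>:
// an error, or an optional piece of text to insert.
using InsertResult = std::variant<InsertError, std::optional<std::string>>;

InsertResult includeToInsert(const std::string &Header, bool AlreadyIncluded) {
  if (Header.empty())
    return InsertError{"empty header"}; // A real failure.
  if (AlreadyIncluded)
    return std::optional<std::string>(); // Success with nothing to insert.
  return std::optional<std::string>("#include " + Header + "\n");
}
```

The caller can then distinguish "failed" from "nothing to do" without inspecting an error message.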

ioeric updated this revision to Diff 133635.Feb 9 2018, 9:30 AM
ioeric marked 3 inline comments as done.
  • Addressed review comments.

So it's just 2 lines for the replacement magic (+1 for comment) now after removing some redundant code.

I like shortenIncludePath better because it's more self-contained and easier to write tests against, and insertInclude doesn't seem to carry much more weight while we would need to handle replacements logic which has been tested in the tests.


I added a comment for this in the header. But I might be misunderstanding your suggestion. Did you mean we need a better name for the function?

Insertion still LG (couple of nits, inline).

For indexing, my biggest questions:

  • I worry CanonicalIncludes doesn't get enough information to make good decisions - passing the include stack in some form may be better
  • CanonicalIncludes has slightly weird patterns of reads/writes - most writes are permanent and a few are transient, and it's not totally obvious how it works in a multithreading context (though your code is correct). I'm not sure whether this is worth fixing.

I think the function name is fine as a shorthand, just that the comment is a bit overspecified and possibly inaccurate compared to e.g. "Determines the preferred way to #include a file, taking into account the search path. Usually this will prefer a shorter representation like 'Foo/Bar.h' over a longer one like 'Baz/include/Foo/Bar.h'".

I don't think the ../../../Foo.h case is worth explicitly calling out - I just meant it as an example of why we *don't* want an overly-specific contract here.
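
A rough sketch of the "prefer a shorter representation" behaviour from the suggested comment. This is an assumption-laden stand-in: the patch delegates to clang::HeaderSearch, whereas the hypothetical `preferredInclude` below just strips the longest matching search directory and always uses double quotes.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical sketch, not the patch's implementation. SearchDirs are
// absolute directories written without a trailing slash.
std::string preferredInclude(const std::string &HeaderPath,
                             const std::vector<std::string> &SearchDirs) {
  std::optional<std::string> Best;
  for (const std::string &Dir : SearchDirs) {
    // Dir must be a whole-directory prefix of HeaderPath.
    if (Dir.empty() || HeaderPath.size() <= Dir.size() + 1 ||
        HeaderPath.compare(0, Dir.size(), Dir) != 0 ||
        HeaderPath[Dir.size()] != '/')
      continue;
    std::string Rel = HeaderPath.substr(Dir.size() + 1);
    if (!Best || Rel.size() < Best->size())
      Best = Rel;
  }
  // Fall back to the full path when no search directory contains the header.
  return "\"" + Best.value_or(HeaderPath) + "\"";
}
```

Given search directories `/src` and `/src/Baz/include`, the header `/src/Baz/include/Foo/Bar.h` is spelled `"Foo/Bar.h"` rather than `"Baz/include/Foo/Bar.h"`. A "best" policy, as opposed to "shortest", would replace the length comparison.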




user-facing. unwrap?

1 ↗(On Diff #133635)

phab says you have ws-only changes in this file, which you might want to revert

1 ↗(On Diff #133635)

(and here - ws-only changes?)


Do we create one of these per TU or per thread? The former is "clean" but seems potentially wasteful (compiling all those system header regexes for each TU). The latter is "fast" but potentially non-hermetic (can't think of a triggering case though).

Maybe we should have a split between transient mappings (IWYU) and permanent ones?


I think this is

  if (!Text.consume_front(IWYUPragma))
    return false;
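
For readers without LLVM at hand, the same pattern can be sketched with `std::string_view`. `consumeFront` mimics `llvm::StringRef::consume_front`; `privateIncludeTarget` and the exact prefix string are assumptions made for this example, based on the include-what-you-use `private, include` pragma form.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Stand-in for llvm::StringRef::consume_front: if Text starts with Prefix,
// drops it and returns true; otherwise leaves Text untouched.
bool consumeFront(std::string_view &Text, std::string_view Prefix) {
  if (Text.substr(0, Prefix.size()) != Prefix)
    return false;
  Text.remove_prefix(Prefix.size());
  return true;
}

// Hypothetical helper: returns the header named by a
// "// IWYU pragma: private, include ..." comment, if Comment is one.
std::optional<std::string> privateIncludeTarget(std::string_view Comment) {
  constexpr std::string_view IWYUPragma = "// IWYU pragma: private, include ";
  if (!consumeFront(Comment, IWYUPragma))
    return std::nullopt;
  return std::string(Comment); // Whatever follows the pragma prefix.
}
```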


"STL" :-)


these aren't standard library - deserves a comment?

in general it looks like there's a bunch of standard library stuff, also posix stuff, and some compiler intrinsics.
If this is the case, maybe "system headers" is a better descriptor than "standard library".

Can we document approximately which standard libraries, which compiler extensions, and other standards (posix, but I guess windows one day) are included?


Can we split out the main ideas a bit? I think these are: a) what include mapping is, b) IWYU pragmas, c) standard library.
We should also probably call out the relationship with the stuff in clangd/Headers.h.


At indexing time, we decide which file to #include for a symbol.
Usually this is the file with the canonical decl, but there are exceptions:

- private headers may have pragmas pointing to the matching public header.
  (These are "IWYU" pragmas, named after the include-what-you-use tool).
- the standard library is implemented in many files, without any pragmas. 
  We have a lookup table for common standard library implementations.
  libstdc++ puts char_traits in bits/char_traits.h, but we #include <string>.

The insert-time logic in clangd/Headers.h chooses how to spell the
filename we emit here; this depends on the search path at the insertion site.

(I think .inc files are conceptually similar and should also be handled and mentioned here, commented below)
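
The lookup-table idea in the suggested comment can be sketched with `std::regex`. The class `HeaderMapper` is a hypothetical miniature, not the CanonicalIncludes class from the patch; only the method names `addRegexMapping` and `mapHeader` are chosen to echo the discussion.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of a filename-rule mapping; not the real class.
class HeaderMapper {
public:
  // Maps all files whose path matches RE to CanonicalPath.
  void addRegexMapping(const std::string &RE,
                       const std::string &CanonicalPath) {
    Rules.emplace_back(std::regex(RE), CanonicalPath);
  }

  // Returns the canonical include for Path, or Path itself if no rule matches.
  // Rules are tried in registration order; the first match wins.
  std::string mapHeader(const std::string &Path) const {
    for (const auto &Rule : Rules)
      if (std::regex_search(Path, Rule.first))
        return Rule.second;
    return Path;
  }

private:
  std::vector<std::pair<std::regex, std::string>> Rules;
};
```

Registering `bits/char_traits\.h$` with `<string>` reproduces the libstdc++ example: a symbol declared in `bits/char_traits.h` is indexed with `<string>` as its include.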


nit: \brief needed?

I'm not sure this class actually collects anything - that's the handler returned by collectIWYUHeaderMaps.

Maybe "Maps a definition location onto an #include file, based on a set of filename rules."?


Maps all files matching \p RE to \p CanonicalPath?


Just this comment is probably enough for the whole function...


So I'm a bit concerned this is too narrow an interface, and we really want to deal with SourceLocation here which would give us the include stack.

Evidence #1: .inc handling really is the same job, but because this class has a single-file interface, we have to push it into the caller.
Evidence #2: I was thinking about a more... principled way of determining system headers. One option would be to have a whitelist of public headers, and walk up the include stack to see if you hit one. (I'm not 100% sure this would work, but still...) This isn't implementable with the current interface.

One compromise would be to pass in a stack<StringRef> or something. Efficiency doesn't really matter because the caller can cache based on the top element.


do we need both tables? it seems the system headers (which are regex-based) will typically outnumber the IWYU pragmas by a bunch.
And if we care about performance at all, we can cache the results of mapHeader?
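
As a sketch of the caching idea, a memoizing wrapper around any slow lookup could look like this. `CachedHeaderMapper` is hypothetical. One assumption worth stating: a cache like this is only safe if mappings do not change after the first lookup, so transient IWYU mappings registered later would need to invalidate it.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical memoizing wrapper; Slow is any expensive path-to-header lookup.
class CachedHeaderMapper {
public:
  explicit CachedHeaderMapper(
      std::function<std::string(const std::string &)> Slow)
      : Slow(std::move(Slow)) {}

  // Runs the slow lookup at most once per distinct Path.
  const std::string &mapHeader(const std::string &Path) {
    auto It = Cache.find(Path);
    if (It == Cache.end())
      It = Cache.emplace(Path, Slow(Path)).first;
    return It->second;
  }

private:
  std::function<std::string(const std::string &)> Slow;
  std::unordered_map<std::string, std::string> Cache;
};
```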


Explicitly mention that the mappings are registered with *Includes?

(interaction isn't totally obvious here because Includes has read and write methods)



ioeric updated this revision to Diff 133895.Feb 12 2018, 10:11 AM
ioeric marked 15 inline comments as done.
  • Addressed some review comments.

I see. Thanks!


This is created per TU now. In an earlier revision, this was one-per-program because we statically constructed a regex map and passed the map into the CanonicalIncludes via the constructor, like we do in include-fixer. But with the current design, I didn't do this because the one-per-TU approach seems cleaner, and the regex construction time seems relatively small compared to the time spent actually compiling a TU.

If we really want to squeeze performance here, we would probably need either an interface that takes a static regex mapping or one that takes pre-computed llvm::Regex. But I'm not really sure if it would be worth it since this is not a performance critical code path.


Switched both function and variable to "system headers" and updated the documentation.


Thanks for the suggestion!

We should also probably call out the relationship with the stuff in clangd/Headers.h.

I'm a bit inclined to not call out the relationship here as this doesn't seem to be a better place than the code where they interact, and the doc could easily be outdated if the interaction is changed from other libraries.


Evidence #1: .inc handling really is the same job, but because this class has a single-file interface, we have to push it into the caller.

I think this would depend on how you define the scope of this class. .inc handling is a subtle case that I'm a bit hesitant to build into the interface here.

Evidence #2: ....

This is actually very similar to how the hardcoded mapping was generated. I had a tool that examined include stacks for a library (e.g. STL) and applied a similar heuristic - treat the last header in the include stack within the library boundary as the "exporting" public header for a leaf include header, if there is no other public header that has a shorter distance to that include. For example, if we see a stack like stl/bits/internal.h -> stl/bits/another.h -> stl/public_1.h ->, we say public_1.h exports bits/internal.h and add a mapping from bits/internal.h$ to public_1.h. But if we see another (shorter) include stack like stl/bits/internal.h -> stl/public_2.h ->, we say stl/public_2.h exports stl/bits/internal.h. This heuristic works well for many cases. However, it may produce wrong mappings when an internal header is used by multiple public headers. For example, if we have two include stacks with the same length, e.g. bits/internal.h -> public_1.h -> and bits/internal.h -> public_2.h ->, the resulting mapping would depend on the order in which we see these two stacks; thus, we had to do some manual adjustment to make sure bits/internal.h is mapped to the correct header according to the standard.
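
The heuristic described above can be sketched as follows. This is a reconstruction for illustration, not the tool itself: `exportingHeaders` is a hypothetical name, and each stack is assumed to be already truncated at the library boundary, listed from the leaf header outward, so that its last element is the candidate exporting header.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Headers from the leaf outward, truncated at the library boundary, e.g.
// {"bits/internal.h", "bits/another.h", "public_1.h"}.
using IncludeStack = std::vector<std::string>;

// Hypothetical sketch: maps each leaf header to the exporting header with
// the shortest include distance.
std::map<std::string, std::string>
exportingHeaders(const std::vector<IncludeStack> &Stacks) {
  // Leaf header -> (distance to exporter, exporter).
  std::map<std::string, std::pair<std::size_t, std::string>> Best;
  for (const IncludeStack &Stack : Stacks) {
    if (Stack.size() < 2)
      continue; // A lone header exports nothing.
    const std::size_t Distance = Stack.size() - 1;
    auto It = Best.find(Stack.front());
    if (It == Best.end())
      Best.emplace(Stack.front(), std::make_pair(Distance, Stack.back()));
    else if (Distance < It->second.first) // Ties keep the first stack seen.
      It->second = {Distance, Stack.back()};
  }
  std::map<std::string, std::string> Result;
  for (const auto &Entry : Best)
    Result[Entry.first] = Entry.second.second;
  return Result;
}
```

Because ties keep whichever stack was seen first, two equal-length stacks through public_1.h and public_2.h give an order-dependent answer, which is the case that needed manual adjustment.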

I am happy to discuss a better solution here. But within the scope of this patch, I'd prefer to stick to interfaces that work well for the current working solution instead of designing for potential future solutions. It should be easy to iterate on the interfaces as these interfaces aren't going to be widely used in clangd after all. WDYT?


Makes sense.

ioeric updated this revision to Diff 134394.Feb 15 2018, 2:42 AM
  • Merged with origin/master
sammccall accepted this revision.Feb 16 2018, 12:42 AM

LG apart from the .inc handling (happy to chat more)


I think this would depend on how you define the scope of this class. .inc handling is a subtle case that I'm a bit hesitant to build into the interface here.

Sure it's subtle, but it's clearly in the scope of determining what the canonical header is for a symbol, which is the job of this class. We wouldn't be building it into the interface - on the contrary, the *current* proposed interface codifies *not* handling .inc files.

But you're right that we should check in something that handles most cases.

My preference would be to drop .inc from this patch until we can incorporate it into the design, but I'm also OK with a FIXME to move it.


This test would be clearer to me if you removed this helper and just did

FS.Files["sub/bar.h"] = ...

in the test.

Can we change buildTestFS in TestFS.cpp to call getVirtualTestFilePath on relative paths to allow this?

(I can do this as a followup if you like, but it seems like a trivial change)


nit: this is only used once, as Not(HasIncludeHeader()). Just use IncludeHeader("")?
The slight difference in detail handling doesn't seem to matter (I'm not even sure if it's exactly the right assertion)


Took me a while to understand this test, and still not sure I get it. Maybe "class string" here?

This revision is now accepted and ready to land.Feb 16 2018, 12:42 AM
sammccall added inline comments.Feb 16 2018, 1:45 AM

I thought better of that change to TestFS, but did some renames in r325326.

So this would be FS.Files[testPath("sub/bar.h")] = ... which still seems more transparent - up to you.

ioeric updated this revision to Diff 134600.Feb 16 2018, 5:56 AM
ioeric marked 3 inline comments as done.
  • Merged with origin/master
  • Addressed review comments; removed .inc handling.

Thank you for reviewing this!


Okay, I removed .inc handling from this patch ;)


Thanks a lot!

This revision was automatically updated to reflect the committed changes.

Hi @ioeric:

Just to let you know that your submission seems to break a Windows test.

 FAIL: Extra Tools Unit Tests :: clangd/Checking/./ClangdTests.exe/SymbolCollectorTest.IWYUPragma (15474 of 36599)

C:\xxx\llvm\tools\clang\tools\extra\unittests\clangd\SymbolCollectorTests.cpp(620): error : Value of: Symbols [C:\xxx\build\check-all.vcxproj]

  Expected: has 1 element and that element (q name "Foo") and ((decl u r i "file:///C%3a/clangd-test/symbol.h") and (include header "\"the/good/header.h\""))

    Actual: { Foo }, where the following matchers don't match any elements:

  matcher #0: (q name "Foo") and ((decl u r i "file:///C%3a/clangd-test/symbol.h") and (include header "\"the/good/header.h\""))

  and where the following elements don't match any matchers:

  element #0: Foo

  [  FAILED  ] SymbolCollectorTest.IWYUPragma (113 ms)

  [----------] 1 test from SymbolCollectorTest (113 ms total)