This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Using symbol name to map includes for STL symbols.
ClosedPublic

Authored by hokein on Feb 18 2019, 4:51 AM.

Details

Summary

Using suffix path mapping relies on the STL implementations, and it is
not portable. This patch is using symbol name mapping, which should
work with different STL implementations, fix clangd/clangd#9.

To generate the symbol mapping, we parse the cppreference symbol index
page to build a lookup table.

The mapping is not completed, a few TODOs:

  • support symbols from different headers (e.g. std::move)
  • support STL macros
  • support symbols from std's sub-namespaces (e.g. chrono)

Event Timeline

hokein created this revision.Feb 18 2019, 4:51 AM
Herald added a project: Restricted Project. · View Herald Transcript
hokein updated this revision to Diff 187225.Feb 18 2019, 4:53 AM

Fix style.

Looks great! Thanks for doing this.

Could you also check in the tool that is used to generate the mapping? We need a way to update the mapping when cppreference is updated.

clangd/index/CanonicalIncludes.cpp
123

Can we remove the suffix header mapping now? Is it for the std::chrono:: symbols? What are the reasons not to include them in this patch?

hokein marked an inline comment as done.Feb 18 2019, 7:44 AM

Looks great! Thanks for doing this.

Could you also check in the tool that is used to generate the mapping? We need a way to update the mapping when cppreference is updated.

The tool is not ready yet, I'm still polishing it, but the results are good, I think this patch should not be blocked by the tool.

clangd/index/CanonicalIncludes.cpp
123

Ideally, we could remove it once we get a perfect symbol mapping but I'd prefer to keep it (it serves as a fallback when we don't find symbols in the symbol mapping), so that we could add new symbol mapping increasingly without changing the current behavior.

Removing it should be done carefully without introducing regression, and is outside the scope of this patch.

sammccall added inline comments.Feb 18 2019, 7:57 AM
clangd/index/CanonicalIncludes.cpp
123

We do need fallback heuristics. We should conisder cases:

  • for known std:: symbols mapped to one system header, we don't need a fallback
  • for known std:: mapped to multiple system headers (std::move), we need some fallback. There are probably few enough of these that it doesn't justify listing *all* system header paths.
  • for known std:: symbols with a primary template in one file and specializations/overloads in another (swap?) always inserting the primary seems fine
  • For "random" unknown symbols from system header foo/bits/_bar.h, we can not insert, correct to <bar>, or use the full path name. I don't have a strong preference here, maybe we should do what's easiest.
  • for c standard library (::printf --> <stdio.h> etc) we probably want mappings just like in this patch, I think cppreference has them.
  • other "standardish" headers (POSIX etc) - hmm, unclear.

Thinking about all these cases, I'm thinking the nicest solution would be having the symbol -> header mapping, having a small (symbol,header) -> header mapping only for ambiguous cases, and not inserting "system" headers that aren't in the main mapping at all. Then we can improve coverage of posix, windows etc headers over time.
Question is, how can we recognize "system" headers?

Looks great! Thanks for doing this.

Could you also check in the tool that is used to generate the mapping? We need a way to update the mapping when cppreference is updated.

The tool is not ready yet, I'm still polishing it, but the results are good, I think this patch should not be blocked by the tool.

Generated files aren't really source code, I agree with Eric we should check in the tool with this patch.

clangd/StdSymbolMap.imp
1

Even if we don't use tablegen to produce it, maybe we want to produce similar output for flexibility? e.g.

SYMBOL(_Exit, std::, "<cstdlib>");

with multiple entries (or multiple AMBIGUOUS_SYMBOL entries) for symbols mapped to several headers.

1

is there precedent for .imp? I think this should probably be .inc

410

I'm not sure if the language comments are useful to human readers, who can just look this stuff up. If you think they might be useful to code, we could include them in tablegen-style macros.

In any case, this is a C++17 symbol, so maybe something's wrong with the generator.

1248

Now we come to the ambiguous cases :-)

I think it's actually quite clear what to do here:

  • the <cmath> and <cstdlib> are equivalent, but <cmath> is only in C++17, so we should probably prefer <cstdlib>
  • the <complex> version is written as "abs(std::complex)" in cppreference. We can tag that one with the string "complex" and prefer it if the symbol's signature contains "complex".
  • '<valarray> is the same as <complex>`
1256

I can't find a definition in <streambuf>?

1257

I can't find a definition for this in <locale>?

1283

we're missing the future variant I think?

1291

this is interesting: any of the options are valid. IMO the best one here is <cstddef>.
But we should be picking by policy, not by looking at the header where the symbol is actually defined. We could do this on either the generator or consumer side.

clangd/index/CanonicalIncludes.cpp
47

Looking at the actual data, I'm not sure this is the right strategy. (and it's not clear what "the clangd side" is :-)

The ambiguous cases seem to break down into:

  • cases where we should just pick one, e.g. cstdlib vs cmath for std::abs
  • cases where a function is overloaded for a particular type (e.g. complex), where passing more information in here (strawman: "the signature as a string") would let us get this correct. Alternatively (and easier), I believe always using the "primary" version of this function is going to be correct if not *strictly* iwyu, as the type comes from the same header. So In the short term I'd suggest just blacklisting the variants.
  • cases like std::move and... maybe one or two more? In the short term always returning <utility> seems good enough. In the long term, passing in the signature again would let us disambiguate here.
hokein marked 2 inline comments as done.Feb 20 2019, 11:08 AM
hokein added inline comments.
clangd/index/CanonicalIncludes.cpp
47

Looking at the actual data, I'm not sure this is the right strategy. (and it's not clear what "the clangd side" is :-)

clangd side I mean the consumer side.

cases where we should just pick one, e.g. cstdlib vs cmath for std::abs

For these cases, I think we can address them in generator side by using a hardcoded whitelist.

cases where a function is overloaded for a particular type (e.g. complex), where passing more information in here (strawman: "the signature as a string") would let us get this correct. Alternatively (and easier), I believe always using the "primary" version of this function is going to be correct if not *strictly* iwyu, as the type comes from the same header. So In the short term I'd suggest just blacklisting the variants.

Using primary version seems promising, and our indexer doesn't index the template specification symbols.

cases like std::move and... maybe one or two more? In the short term always returning <utility> seems good enough. In the long term, passing in the signature again would let us disambiguate here.

For std::move this particular case, I don't think using signature (as a string) really helps to disambiguate, since both of them are templates, the signature string doesn't contain any word specific to the header names <utility>/<algorithm>. The best way to disambiguate them is to use the number of parameters IMO.

I'm +1 on returning utility, since std::move in utility is more widely-used.

Or another way to disambiguate: use symbol name => header for symbols with unique headers; otherwise fallback to use header mapping.

123

I think we were talking about C++ std symbols.

for known std:: symbols mapped to one system header, we don't need a fallback
for known std:: mapped to multiple system headers (std::move), we need some fallback. There are probably few enough of these that it doesn't justify listing *all* system header paths.
for known std:: symbols with a primary template in one file and specializations/overloads in another (swap?) always inserting the primary seems fine

As discussed in this patch, we have other ways to disambiguate these symbols (rather than relying on the header mapping), so it is possible to remove STL headers mapping, but we'll lose correct headers for STL implement-related symbols (e.g. __hash_code), ideally we should drop these symbols in the indexer...

for c standard library (::printf --> <stdio.h> etc) we probably want mappings just like in this patch, I think cppreference has them.

Yes, cppreference has them.

other "standardish" headers (POSIX etc) - hmm, unclear.

I think we should still keep the header mapping. Not sure whether there are some sources like cppreference that list all symbols.

hokein updated this revision to Diff 187835.Feb 21 2019, 11:46 AM
hokein marked 6 inline comments as done.
  • address comments
  • narrow the scope of this patch -- only emit unique symbols
  • add the generator tool

Looks great! Thanks for doing this.

Could you also check in the tool that is used to generate the mapping? We need a way to update the mapping when cppreference is updated.

The tool is not ready yet, I'm still polishing it, but the results are good, I think this patch should not be blocked by the tool.

Generated files aren't really source code, I agree with Eric we should check in the tool with this patch.

Added the tool in this patch, but I think a separate repository under GitHub's clangd org might be a more suitable place.

For simplicity, I re-scoped this patch -- only emits unique symbols. We have ideas to handle ambiguous symbols, will address them in a followup.

clangd/StdSymbolMap.imp
410

In any case, this is a C++17 symbol, so maybe something's wrong with the generator.

hmm, it is a bug in cppreference index page, this symbol is wrongly marked C++11.

1283

looks like it was a bug in the generator code, fixed.

1291

this can be addressed by using a whitelist in the generator.

Sorry to be a pain here, but I'm not sure introducing Javascript is a good idea unless there's a strong reason for it.
All LLVM developers have python installed, many are comfortable with the language - it seems less likely to be a maintenance hazard. I can recommend BeautifulSoup for the HTML parsing stuff.
(For the record, I kind of hate python, but I'm more comfortable reviewing and maintaining it than JS).
Mitigating points: we have some in the VSCode plugin by necessity, and this isn't a build-time dependency.

Happy to get another opinion here.

Added the tool in this patch, but I think a separate repository under GitHub's clangd org might be a more suitable place.

I'm somewhat sympathetic to this idea, but I don't think we can do it as things stand. @klimek can confirm.

clangd/StdGen/Readme.md
1 ↗(On Diff #187835)

can we add the caveats here?

  • only symbols directly in namespace std are added
  • symbols with multiple variants or defined in multiple headers aren't added
3 ↗(On Diff #187835)

Can we be a bit more specific about what it generates?

From this text I'd have guess it generates headers like namespace std { int printf(char*, ...); }.

8 ↗(On Diff #187835)

Requriement -> Requirements

11 ↗(On Diff #187835)

nit: can we make this a command-line arg instead of requiring people to put it in their source tree?

clangd/StdGen/lib/main.js
1 ↗(On Diff #187835)

note: I haven't reviewed the JS files for style/readability as I'm weak on JS and it's unclear what style would govern here. e.g. I suspect when used as a niche language in a C++ project by people who don't know it well, we should probably have semicolons rather than relying on ASI.
But see code review comment - this may be a moot point.

clangd/StdGen/lib/parse.js
39 ↗(On Diff #187835)

Not sure this is necessary - it seems preferable to be forwards-compatible even if not 100% accurate, no?

(And it would simplify the code, as we don't need to parse out the version)

clangd/index/CanonicalIncludes.cpp
114–115

can you remove the ones that duplicate the .inc file?

Sorry to be a pain here, but I'm not sure introducing Javascript is a good idea unless there's a strong reason for it.
All LLVM developers have python installed, many are comfortable with the language - it seems less likely to be a maintenance hazard. I can recommend BeautifulSoup for the HTML parsing stuff.
(For the record, I kind of hate python, but I'm more comfortable reviewing and maintaining it than JS).
Mitigating points: we have some in the VSCode plugin by necessity, and this isn't a build-time dependency.

Happy to get another opinion here.

Added the tool in this patch, but I think a separate repository under GitHub's clangd org might be a more suitable place.

I'm somewhat sympathetic to this idea, but I don't think we can do it as things stand. @klimek can confirm.

I unfortunately agree with Sam on this.

hokein updated this revision to Diff 190218.Mar 12 2019, 1:56 AM
hokein marked 4 inline comments as done.

Address review comments, rewrite the tool with python.

hokein edited the summary of this revision. (Show Details)Mar 12 2019, 1:59 AM
hokein edited reviewers, added: ioeric; removed: sammccall, jfb, serge-sans-paille.
hokein edited projects, added Restricted Project; removed Restricted Project.
hokein added a subscriber: sammccall.
Herald added a project: Restricted Project. · View Herald Transcript

add @ioeric as a reviewer, since @sammccall is OOO.

ioeric added inline comments.Mar 12 2019, 7:24 AM
clangd/StdGen/StdGen.py
2 ↗(On Diff #190218)

I'd avoid abbreviation in the file name and the new directory name. StdGen sounds a lot like a library ;)

I'd suggest something like std-include-mapping/generate.py

14 ↗(On Diff #190218)

we should also mention what the tool's behaviors are for these cases.

15 ↗(On Diff #190218)

nit: s/"std::namespace"/std:: namespace/?

16 ↗(On Diff #190218)

it's not trivial what these are. could provide an example here.

18 ↗(On Diff #190218)

Suggestion: it would be nice to provide the common commands for each step. E.g. pip install <beautiful soup>.

23 ↗(On Diff #190218)

what is the <path/to/cppreference-doc> here? is it the downloaded zip, the unzipped (e.g. cppreference-doc-20181028/), or some specific html doc in the directory?

23 ↗(On Diff #190218)

Again, StdGen is not great name. I'd suggest StdHeaderMapping.

nit: s/.h/.inc/

32 ↗(On Diff #190218)

nit: maybe STDGEN_CODE_HEADER?

48 ↗(On Diff #190218)

could you provide a sample html snippet in the comment? it would make the code easier to understand. thanks!

(same for ParseIndexPage)

94 ↗(On Diff #190218)

why not check exists(index_page_path) instead?

116 ↗(On Diff #190218)

move this FIXME into the if body to make it clearer that this is for len(headers) > 1.

nit: it's not very clear whether ambiguous symbols = symbols with multiple headers. Maybe be more specific?

clangd/StdSymbolMap.inc
8 ↗(On Diff #190218)

can we include the information about the HTML archive (e.g. STL version) from which this table is generated? This would be useful for future maintenance.

11 ↗(On Diff #190218)

Add a pointer to the python script?

clangd/index/CanonicalIncludes.cpp
123

So, do we have a plan on how to handle ambiguous symbols properly? If so, could you please summarize this in the FIXME comment so that we don't lose track of it?

I think we should still keep the header mapping. Not sure whether there are some sources like cppreference that list all symbols.

What's the reasoning?

I am +1 on Sam's idea about skipping include insertion for std:: symbols that are not on cppreference. It's fine to keep the suffix header mapping in the short term while we are working out ambiguous symbols. But I don't think we would want to maintain both mappings in the long term.

Could you maybe add comment for the header mapping indicating that they are now the fallback mapping for std:: symbols?

hokein updated this revision to Diff 190275.Mar 12 2019, 9:03 AM
hokein marked 13 inline comments as done.

Address review comments.

hokein added inline comments.Mar 12 2019, 9:04 AM
clangd/StdGen/StdGen.py
2 ↗(On Diff #190218)

Personally I'd vote StdGen, and it follows llvm's TableGen. The name std-include-mapping/generate.py seems too verbose...

32 ↗(On Diff #190218)

renamed to STDGEN_CODE_PREFIX.

94 ↗(On Diff #190218)

I think either way works.

clangd/StdSymbolMap.inc
8 ↗(On Diff #190218)

The version of HTML archive is the date information, but this information is only present in the .zip name, we lose it after we unzip the zip file....

ioeric added inline comments.Mar 12 2019, 9:11 AM
clangd/StdGen/StdGen.py
2 ↗(On Diff #190218)

Why do we want to follow TableGen?

Two reasons why I think a more informative name should be used. 1) StdGen sounds like a name for library and doesn't follow the naming convention for tool directory (e.g. global-symbol-builder) and 2) it's hard for readers without much context to understand what StdGen actually means.

94 ↗(On Diff #190218)

It seems to me that using exists(index_page_path) would be less likely to trigger an exception (better use experience). I don't see a good reason to check the root.

clangd/StdSymbolMap.inc
8 ↗(On Diff #190218)

If we can't get this from the doc, we should ask users to explicitly provide this information. I think we would want some version info to relate the output to the source.

hokein updated this revision to Diff 190406.Mar 13 2019, 6:53 AM
hokein marked 3 inline comments as done.

Rename the tool.

hokein added inline comments.Mar 13 2019, 6:54 AM
clangd/StdGen/StdGen.py
2 ↗(On Diff #190218)

Changed to include-mapping/gen_std.py, according to offline discussion.

94 ↗(On Diff #190218)

It seems to me that using exists(index_page_path) would be less likely to trigger an exception (better use experience).

I don't see any big difference between checking exists(index_page_path) and exists(cpp_symbol_root) here. Checking cpp_symbol_root is more natural as we also construct the path of symbol page from it.

Anyway, changed to check index_page_path.

clangd/StdSymbolMap.inc
8 ↗(On Diff #190218)

If we can't get this from the doc, we should ask users to explicitly provide this information.

I'm not sure this is a good idea. Asking users provide the information seems error-prone.

I think we may use the file stat information (modified date) of the symbol_index.html page.

ioeric added inline comments.Mar 14 2019, 3:11 AM
clangd/include-mapping/gen_std.py
46 ↗(On Diff #190406)

I'd suggest dropping v prefix as it might be confusing (e.g. one would search for the exact same version number on cppreference archive). Maybe add a short comment explaining that this is the modification date of the doc?

114 ↗(On Diff #190406)

s/version/date/, as this not really a version.

129 ↗(On Diff #190406)

if not headers:

clangd/index/CanonicalIncludes.cpp
123

This is not done?

hokein updated this revision to Diff 190592.Mar 14 2019, 4:00 AM
hokein marked 4 inline comments as done.

Address comments

clangd/index/CanonicalIncludes.cpp
123

Oops, I missed it. Added a FIXME.

ioeric accepted this revision.Mar 14 2019, 4:08 AM
ioeric added inline comments.
clangd/index/CanonicalIncludes.cpp
111

nit: as these symbols are not covered by the new mapping, I think this comment is still meaningful.

This revision is now accepted and ready to land.Mar 14 2019, 4:08 AM
hokein updated this revision to Diff 190598.Mar 14 2019, 4:17 AM
hokein marked an inline comment as done.

One more comment.

This revision was automatically updated to reflect the committed changes.