This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Store system relative includes as verbatim
Changes PlannedPublic

Authored by kadircet on Mar 9 2021, 2:36 AM.

Details

Reviewers
sammccall
Summary

This makes our index more portable between different systems, e.g. we
can index a codebase with a --sysroot and make use of it on a system
that has a different abs path for those includes.

Diff Detail

Event Timeline

kadircet created this revision.Mar 9 2021, 2:36 AM
kadircet requested review of this revision.Mar 9 2021, 2:36 AM
sammccall added inline comments.Mar 9 2021, 3:25 AM
clang-tools-extra/clangd/index/SymbolCollector.cpp
900

This is a great heuristic, now I'm thinking of all the ways it can go wrong...


It could be slow: it's doing a string comparison for every include path entry (not just the -isystems) and that can be a long list for big codebases. We're calling this for most symbols we encounter, so in principle this looks quadratic in codebase size when indexing a preamble.

It's tempting to throw a cache on it but actually... the callsite looks like:

foreach header:
  foreach symbol:
    adjust(getIncludeHeader(symbol, header));

getIncludeHeader() is entirely dependent on the header almost ignores the symbol, so hoisting it out of the loop would be as good as caching it.

The wrinkle is the std::move hack, but we can move *that* part alone inside the loop instead.


It could fail to translate properly between paths, being inserted where the -isystem wasn't available. Often this means the *library* isn't available in this part of the codebase, so really the problem is offering the symbol at all (though using verbatim headers may make this harder to fix). Dynamic indexes spanning multiple projects are another case, but I think a marginal one. I don't think we need to address this.


It may give inconsistent results between TUs for the same symbol.
When a definition is indexed, it should win, so the problematic cases are:

  • header-only symbols (or no definition indexed) with inconsistent -isystem. This is not terribly important, as above I think.
  • definition is missing -isystem, because *within* the library headers are not considered "system". This seems likely to be common, and I'm not sure how to address it (other than giving verbatim headers priority always, which seems dubious). It's not a regression though (just the bug doesn't get fixed for these symbols.)
  • definition has -isystem (library impl considers its own headers as system) but other files don't. Also not sure how to solve this but it seems less likely to be common to me.

One design around these problems would be to store *both* a preferred spelling and a filename, and to use the spelling if it can be resolved. But this is invasive, means doing stat()s, doesn't work well with genfiles... I'm not really sure of a clean solution here.

Maybe we just move ahead as-is and see if we hit problems...

kadircet added inline comments.Mar 9 2021, 12:20 PM
clang-tools-extra/clangd/index/SymbolCollector.cpp
900

for latency concerns, i was mostly turning a blind eye as this would only happen once per indexing. but you are right, we can get away from any possible regressions by moving it to per-header layer.


It could fail to translate properly between paths, being inserted where the -isystem wasn't available. Often this means the *library* isn't available in this part of the codebase, so really the problem is offering the symbol at all (though using verbatim headers may make this harder to fix). Dynamic indexes spanning multiple projects are another case, but I think a marginal one. I don't think we need to address this.

as you mentioned this is likely to show up either when there's a remote-index/static-index built on another machine in use or dynamic index with multiple projects.

  • for remote-index/static-index case, i think it is unlikely to show up, as the symbol should be available during local development too.

-for dynamic index spanning multiple indexes though, previously we would just suggest a wrong symbol and now we'll also perform a wrong header insertion so it is sort-of worse. but as you mentioned this should probably be handled on symbol level as well.

my mental model around the fix was rather making use of the declaration/definition location of the symbol, which is guaranteed to be a URI, rather than a symbolic representation of the "owning" header. hence i feel like this shouldn't make things harder when we are fixing that issue.

It may give inconsistent results between TUs for the same symbol.

I didn't consider this case thoroughly, as I was assuming system search paths to be same for the whole codebase (leaving multiple-projects case out).

header-only symbols (or no definition indexed) with inconsistent -isystem. This is not terribly important, as above I think.

well, in theory we just accumulate include headers in this case, which means we can still recover somehow if it turns out to be important.

definition is missing -isystem, because *within* the library headers are not considered "system". This seems likely to be common, and I'm not sure how to address it (other than giving verbatim headers priority always, which seems dubious). It's not a regression though (just the bug doesn't get fixed for these symbols.)

ugh that's true :/ i am not really sure what's the convention for these in libraries themselves, but even with clangd if we somehow end up indexing definition locations for those symbols and user triggers a go-to, clangd will likely use fallback commands to index those files, which will be missing -isystem like flags. but as you mentioned this just means some cases are left unfixed.

as for solution, the best i can think of is we keep storing URIs for include headers, in all cases and never store verbatim. then we figure out the spelled include at use time, while making use of headersearchinfo and URI specific "get the spelling" logic (we probably want to have some other mechanism for the latter).
but it means we perform the resolution on every code completion item (which we already do, when the stored include isn't verbatim and uri spelling doesn't yield anything), so this shouldn't terribly regress completion latency.

basically what i am suggesting is propagating headersearchinfo into URI::getIncludeSpelling, which is more invasive then this change and conceptually doesn't sound so appealing. here's another terrible idea :), maybe we should have a module hook instead for generating custom include spellings ? (but we don't want to iterate over all modules on every completion i suppose, so this will definitely require some extra tricks like fetching a functor from the module, also there's the question of multiple modules implementing the hook..)

so yeah this is definitely more invasive, hence i'd rather go with the solution proposed here until we see some complaints around this specific workflow, wdyt?

definition has -isystem (library impl considers its own headers as system) but other files don't. Also not sure how to solve this but it seems less likely to be common to me.

i think the approach mentioned above should help with this too.

sammccall accepted this revision.Mar 9 2021, 12:37 PM

I think this is going to be more good than bad, and the alternatives are terrifically complicated.

clang-tools-extra/clangd/index/SymbolCollector.cpp
900

this would only happen once per indexing

I'm not quite sure what this means, am I wrong to think this happens each time we see a symbol in a preamble? In any case, avoiding calling this function in the inner loop is a separate optimization I can send a patch for.

my mental model around the fix was rather making use of the declaration/definition location of the symbol, which is guaranteed to be a URI

Of course, you're right. Nevermind this point.

I was assuming system search paths to be same for the whole codebase

I think that assumption is broken for the combination of:

  • decent size project with meaningful subcomponents whose deps vary, like llvm-project (or even smaller ones)
  • a liberal use of "system", e.g. i've often seen OpenGL headers used with <>, to pick a random example

well, in theory we just accumulate include headers in this case

... and I forgot we collect multiple and "vote" on them. That makes this much less bad I think.

definition is missing -isystem, because *within* the library headers are not considered "system".

ugh that's true :/ i am not really sure what's the convention for these in libraries themselves

Well, this isn't a regression (though inconsistent) and in the large majority of cases we're not actually going to be indexing the definitions. So let's ship it and see what breaks.

This revision is now accepted and ready to land.Mar 9 2021, 12:37 PM

I measured this as a mild (3%) increase in preamble indexing time when working on clangd itself, more details in D98329

This regression could be more significant with a long include search path, so I'd like to avoid adding this without caching. If we need this urgently then D98329 is enough, if not then D98371 is more comprehensive. Up to you how to proceed here.

I've landed the caching patch so this is good to go.
The merge is *conceptually* trivial but the enclosing function has moved so you may need to reapply it by hand, sorry...

clang-tools-extra/clangd/index/SymbolCollector.cpp
900

It's tempting to throw a cache on it but actually... the callsite looks like:

foreach header:
 foreach symbol:
  adjust(getIncludeHeader(symbol, header));

Haha, actually I was completely misreading the code, sorry for the detour here. (Worked out well in the end, mostly due to luck and a bit of stubbornness though)

kadircet updated this revision to Diff 329930.Mar 11 2021, 5:14 AM
  • Rebase
  • Add a storage for returned verbatim header spellings, as getHeaderIncludeUncached can only return a ref.
kadircet planned changes to this revision.Mar 29 2021, 2:32 PM

as discussed offline, this gets messy if the header can be included both as a system and quoted included, and the coding style actually prefers the quoted one. it is not a big deal yet, but if it turns out to be we can move forward this patch by also having a way for users to resolve such ambiguities (probably through config).

Herald added a project: Restricted Project. · View Herald TranscriptMar 29 2021, 2:32 PM