This makes our index more portable between different systems, e.g. we
can index a codebase with a --sysroot and make use of it on a system
that has a different abs path for those includes.
Details
- Reviewers
sammccall
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
900 | This is a great heuristic, now I'm thinking of all the ways it can go wrong... It could be slow: it's doing a string comparison for every include path entry (not just the -isystems) and that can be a long list for big codebases. We're calling this for most symbols we encounter, so in principle this looks quadratic in codebase size when indexing a preamble. It's tempting to throw a cache on it but actually... the callsite looks like: foreach header: foreach symbol: adjust(getIncludeHeader(symbol, header)); getIncludeHeader() is entirely dependent on the header almost ignores the symbol, so hoisting it out of the loop would be as good as caching it. The wrinkle is the std::move hack, but we can move *that* part alone inside the loop instead. It could fail to translate properly between paths, being inserted where the -isystem wasn't available. Often this means the *library* isn't available in this part of the codebase, so really the problem is offering the symbol at all (though using verbatim headers may make this harder to fix). Dynamic indexes spanning multiple projects are another case, but I think a marginal one. I don't think we need to address this. It may give inconsistent results between TUs for the same symbol.
One design around these problems would be to store *both* a preferred spelling and a filename, and to use the spelling if it can be resolved. But this is invasive, means doing stat()s, doesn't work well with genfiles... I'm not really sure of a clean solution here. Maybe we just move ahead as-is and see if we hit problems... |
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
900 | for latency concerns, i was mostly turning a blind eye as this would only happen once per indexing. but you are right, we can get away from any possible regressions by moving it to per-header layer.
as you mentioned this is likely to show up either when there's a remote-index/static-index built on another machine in use or dynamic index with multiple projects.
-for dynamic index spanning multiple indexes though, previously we would just suggest a wrong symbol and now we'll also perform a wrong header insertion so it is sort-of worse. but as you mentioned this should probably be handled on symbol level as well. my mental model around the fix was rather making use of the declaration/definition location of the symbol, which is guaranteed to be a URI, rather than a symbolic representation of the "owning" header. hence i feel like this shouldn't make things harder when we are fixing that issue.
I didn't consider this case thoroughly, as I was assuming system search paths to be same for the whole codebase (leaving multiple-projects case out).
well, in theory we just accumulate include headers in this case, which means we can still recover somehow if it turns out to be important.
ugh that's true :/ i am not really sure what's the convention for these in libraries themselves, but even with clangd if we somehow end up indexing definition locations for those symbols and user triggers a go-to, clangd will likely use fallback commands to index those files, which will be missing -isystem like flags. but as you mentioned this just means some cases are left unfixed. as for solution, the best i can think of is we keep storing URIs for include headers, in all cases and never store verbatim. then we figure out the spelled include at use time, while making use of headersearchinfo and URI specific "get the spelling" logic (we probably want to have some other mechanism for the latter). basically what i am suggesting is propagating headersearchinfo into URI::getIncludeSpelling, which is more invasive then this change and conceptually doesn't sound so appealing. here's another terrible idea :), maybe we should have a module hook instead for generating custom include spellings ? (but we don't want to iterate over all modules on every completion i suppose, so this will definitely require some extra tricks like fetching a functor from the module, also there's the question of multiple modules implementing the hook..) so yeah this is definitely more invasive, hence i'd rather go with the solution proposed here until we see some complaints around this specific workflow, wdyt?
i think the approach mentioned above should help with this too. |
I think this is going to be more good than bad, and the alternatives are terrifically complicated.
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
900 |
I'm not quite sure what this means, am I wrong to think this happens each time we see a symbol in a preamble? In any case, avoiding calling this function in the inner loop is a separate optimization I can send a patch for.
Of course, you're right. Nevermind this point.
I think that assumption is broken for the combination of:
... and I forgot we collect multiple and "vote" on them. That makes this much less bad I think.
Well, this isn't a regression (though inconsistent) and in the large majority of cases we're not actually going to be indexing the definitions. So let's ship it and see what breaks. |
I measured this as a mild (3%) increase in preamble indexing time when working on clangd itself, more details in D98329
This regression could be more significant with a long include search path, so I'd like to avoid adding this without caching. If we need this urgently then D98329 is enough, if not then D98371 is more comprehensive. Up to you how to proceed here.
I've landed the caching patch so this is good to go.
The merge is *conceptually* trivial but the enclosing function has moved so you may need to reapply it by hand, sorry...
clang-tools-extra/clangd/index/SymbolCollector.cpp | ||
---|---|---|
900 |
Haha, actually I was completely misreading the code, sorry for the detour here. (Worked out well in the end, mostly due to luck and a bit of stubbornness though) |
- Rebase
- Add a storage for returned verbatim header spellings, as getHeaderIncludeUncached can only return a ref.
as discussed offline, this gets messy if the header can be included both as a system and quoted included, and the coding style actually prefers the quoted one. it is not a big deal yet, but if it turns out to be we can move forward this patch by also having a way for users to resolve such ambiguities (probably through config).
This is a great heuristic, now I'm thinking of all the ways it can go wrong...
It could be slow: it's doing a string comparison for every include path entry (not just the -isystems) and that can be a long list for big codebases. We're calling this for most symbols we encounter, so in principle this looks quadratic in codebase size when indexing a preamble.
It's tempting to throw a cache on it but actually... the callsite looks like:
getIncludeHeader() is entirely dependent on the header almost ignores the symbol, so hoisting it out of the loop would be as good as caching it.
The wrinkle is the std::move hack, but we can move *that* part alone inside the loop instead.
It could fail to translate properly between paths, being inserted where the -isystem wasn't available. Often this means the *library* isn't available in this part of the codebase, so really the problem is offering the symbol at all (though using verbatim headers may make this harder to fix). Dynamic indexes spanning multiple projects are another case, but I think a marginal one. I don't think we need to address this.
It may give inconsistent results between TUs for the same symbol.
When a definition is indexed, it should win, so the problematic cases are:
One design around these problems would be to store *both* a preferred spelling and a filename, and to use the spelling if it can be resolved. But this is invasive, means doing stat()s, doesn't work well with genfiles... I'm not really sure of a clean solution here.
Maybe we just move ahead as-is and see if we hit problems...