This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Use URIs instead of paths in the index file list
ClosedPublic

Authored by ArcsinX on Feb 26 2021, 1:06 AM.

Details

Summary

Without this patch the file list of the preamble index contains URIs, but other indexes file lists contain file paths.
This makes indexedFiles() always returns IndexContents::None for the preamble index, because current implementation expects file paths inside the file list of the index.

This patch fixes this problem and also helps to avoid a lot of URI to path conversions during indexes merge.

Diff Detail

Event Timeline

ArcsinX created this revision.Feb 26 2021, 1:06 AM
ArcsinX requested review of this revision.Feb 26 2021, 1:06 AM
kadircet added inline comments.Feb 26 2021, 9:18 AM
clang-tools-extra/clangd/index/FileIndex.cpp
279

iterating over all the symbols here (and refs below) seems really unfortunate. But looking at the previous discussions that seems to be best of both worlds until we populate a file list in SymbolCollector. However, I think it would be better to do the looping after releasing the lock (we already have the information copied over to our snapshots).

moreover we seem to be still inserting keys of the Symbol and RefSlabs into Files, but not doing that for RelationSlabs, why? (i believe we shouldn't be inserting the keys at all, if we indeed want to just insert URIs and keep treating the keys as opaque objects).

ArcsinX updated this revision to Diff 326903.Feb 27 2021, 7:02 AM
  • Do not insert the keys into the file list
  • Iterate through refs and symbols to create the file list after releasing the lock
ArcsinX marked an inline comment as done.Feb 27 2021, 7:04 AM
ArcsinX added inline comments.
clang-tools-extra/clangd/index/FileIndex.cpp
279

However, I think it would be better to do the looping after releasing the lock (we already have the information copied over to our snapshots).

Done

we seem to be still inserting keys of the Symbol and RefSlabs into Files, but not doing that for RelationSlabs

Forgot to remove these insertions, thanks, fixed.

kadircet added inline comments.Mar 1 2021, 2:44 AM
clang-tools-extra/clangd/index/FileIndex.cpp
297

it feels weird to choose one or the other here, why not just insert both (after checking if definition exists, ofc).

We are likely to have a merged symbol information anyway, and the logic here will likely result in no index owning the header files, unless there are some symbols *defined* (not just declared) in it.

This will likely result in some overshooting as knowing about a symbols declaration location doesn't mean indexing that particular file. But so does the current version, as knowing about definition location might be because of a merged symbol, rather than indexing of particular file.

ArcsinX added inline comments.Mar 1 2021, 6:27 AM
clang-tools-extra/clangd/index/FileIndex.cpp
297

I will try to explain on example:

  • test.h
void f();
  • test.c
#include "test.h"
void f() { }
  • compile_commands.json contains a command for test.c compilation.

Scenario:

  • open test.c
  • try to find all references for f()

For that scenario result for find all references will be incorrect if both (decl and def) files are in the file list because:

  • decl location is inside test.h
  • def location is inside test.c
  • the file list for the main file index contains test.h and test.c
  • the main file index does not contain references from test.h
  • the background (static) index contains references from test.c, but results from the background index will be skipped, because test.h is in the main file (dynamic) index file list.
kadircet added inline comments.Mar 1 2021, 10:19 PM
clang-tools-extra/clangd/index/FileIndex.cpp
297

Ah i see. This all seems really fragile :/ We might as well have something like:

a.h:

struct Bar;

a.cc:

#include "a.h"
struct Bar;

and now as soon as you open a.cc, all the results from a.h will be gone, because canonical declaration location for Bar will be in a.h and there is no definition. (and i believe this kind of forward-decl madness is quite common in at least LLVM).
Even worse, the same will also happen even if you have definition in the header, but a forward decl in the main file, so even accepting the file for definition wouldn't be enough.

It is currently (i.e. without this patch) working as expected, as main file index only owns the information for the "main file" indexing was initiated for. This feels like a big regression to me (that I didn't notice initially, sorry for that), but I am ready to be convinced otherwise :D

As Sam mentioned what you do here and partitioning logic in FileShardedIndex is quite similar (yours undershoot, sharding logic overshoots) but in the sharding process we split indexing result of a full TU/preamble, and later on those shards will always be used in a merged fashion (e.g. when a.cc and b.cc, both including a.h gets indexed, the shard produced for a.h from b.cc won't contain any definition locations, but the shard produced for a.cc will know about those symbols definition locations and in a merged view those symbols will have all the necessary information.), whereas in here the resulting information is used in isolation (main file isn't merged with preamble symbols, but mixes Files view). Hence causing such regressions.

I think the proper thing to do here is to propagate relevant files with slabs on FileSymbols::update. What you do here and sharding isn't very different. The question is should we have a:

  • string File
  • optional<string> File
  • vector<string> Files

Currently we always have exactly one File associated with all of those slabs as we either:

  • always do sharding (preamble and background idx)
  • even though symbol informations tell otherwise, we've only processed a single file (main file idx)

We would need 3rd option if we were to use filesymbols with a monolithic index, but we don't. And even if we need such a thing in future it shouldn't be a huge change hopefully.

Sorry for the wall of text, I hope I do make sense, but please tell me if I misunderstood/missing something.

ArcsinX added inline comments.Mar 2 2021, 2:36 AM
clang-tools-extra/clangd/index/FileIndex.cpp
297

It is currently (i.e. without this patch) working as expected, as main file index only owns the information for the "main file" indexing was initiated for. This feels like a big regression to me (that I didn't notice initially, sorry for that), but I am ready to be convinced otherwise :D

Currently it works mostly as expected. The only thing which worries me is the preamble index update: the first parameter passed to PreambleSymbols.update() is an URI, but we expect a file path there. Thus, the file list of the preamble index contains URIs, which seems incorrect and PreambleIndex::indexedFiles() always returns IndexContents::None.

In other words, we need to make the file list to contains always file paths or always URIs (current situation: the preamble index contains URIs, other indexes contain file paths). I thought that the best solution is to convert URI to file path before PreambleSymbols.update() call, but after D94952#inline-892421 discussion I was no sure about my solution, and seems the solution introduces in this patch also inacceptable.

Currently I am not sure what is the best way to solve this problem =(

Right, i see the problem you are trying to solve, and all of this is a mess working nicely in unisom, but falls apart if you try to isolate each piece :/ (hence the need for figuring out list of files while indexing instead)

But until that day, I suppose the best we can do is change the contract of FileSymbols to sey keys are always URIs representing the file producing associated slabs. This works:

  • For preamble symbols, because even though some shards might contain partial information, when merged they indeed correspond to those set of files, and we always query preamble index in a merged fashion.
  • Bg-index is literally the same as preamble-index in that regard
  • Main file index will only associate slabs with the "main file" now, which will stop the leakage i mentioned above.

Sorry for going in circles here a bit but it took us some time to figure out the issue here, does that make sense for you too?

ArcsinX planned changes to this revision.Mar 2 2021, 7:46 AM

But until that day, I suppose the best we can do is change the contract of FileSymbols to sey keys are always URIs representing the file producing associated slabs.

Agree. I will rework this patch.

ArcsinX updated this revision to Diff 327853.Mar 3 2021, 10:51 AM
  • use URIs as keys
  • create the file list from the keys
ArcsinX edited the summary of this revision. (Show Details)Mar 3 2021, 10:56 AM
kadircet accepted this revision.Mar 3 2021, 9:34 PM

thanks (both for the patch and bearing with me :D), LGTM!

This revision is now accepted and ready to land.Mar 3 2021, 9:34 PM
ArcsinX updated this revision to Diff 328222.Mar 4 2021, 10:03 AM

Call FileIndex::updateMain() with an absolute file path in tests to avoid test failures in debug mode.

Thank you for review!

This revision was landed with ongoing or failed builds.Mar 5 2021, 11:48 PM
This revision was automatically updated to reflect the committed changes.