This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Avoid duplicates in findDefinitions response
ClosedPublic

Authored by simark on Jun 27 2018, 5:08 PM.

Details

Summary

When compile_commands.json contains some source files expressed as
relative paths, we can get duplicate responses to findDefinitions. The
responses only differ by the URI, which are different versions of the
same file:

"result": [
    {
        ...
        "uri": "file:///home/emaisin/src/ls-interact/cpp-test/build/../src/first.h"
    },
    {
        ...
        "uri": "file:///home/emaisin/src/ls-interact/cpp-test/src/first.h"
    }
]

In getAbsoluteFilePath, we try to obtain the realpath of the FileEntry
by calling tryGetRealPathName. However, this doesn't work for
FileEntries that don't actually open the file, because their
RealPathName field is never populated.

I changed getAbsoluteFilePath so that if tryGetRealPathName returns an
empty string, we try to find the real path using VFS.getRealPath. This
requires to pass the VFS all the way from ClangdServer.

I tried to write a test case similar to

TEST(GoToDefinition, RelPathsInCompileCommand)

that would trigger this issue, but I wasn't able to. No matter what I
do, the paths are returned by the FileManager as already normalized /
without the "..". There must be something different in the test case
than in real execution that I don't see.

Repro instructions are in bug #37963, though I can include them in the
commit message if you prefer.

Event Timeline

simark created this revision.Jun 27 2018, 5:08 PM

Thanks for the patch!
Could we try to figure out why the duplicates were there in the first place and why the paths were different?
It should be easy to mock exactly the same setup you have in #37963, i.e. create a vfs with three files and compilation database that has exactly those compile commands. You mention you tried to repro the issue, did you try doing that with the unit-test or a lit-test?

After looking at the makefile, my guess would be that the problem comes from the paths starting with ../ inside the compilation database.

clangd/XRefs.cpp
179

Do we really need an extra vfs param?
Could we use SourceMgr.getFileManager().getVirtualFileSystem() instead?

unittests/clangd/TestTU.h
47 ↗(On Diff #153209)

We don't need an extra output param here.
There's a way to get the vfs from the ASTContext: ParsedAST().getASTContext().getSourceManager().getFileManager().getVirtualFileSystem().

Also some heads-up: this would probably conflict D47846 that moves the same function into a different file.

Thanks for the patch!
Could we try to figure out why the duplicates were there in the first place and why the paths were different?

I tried to do that, but it goes deep in the clang internals with which I'm not familiar. All I could see was that when creating the FileEntry representing the /home/emaisin/src/ls-interact/cpp-test/build/../src/first.h file, FileManager::getFile is called with OpenFile=false. This makes it so that the RealPathName field is not set (at FileManager.cpp:320). Because RealPathName is not set (well, empty), we use the non-normalized name in getAbsoluteFilePath. That's all I can tell.

It should be easy to mock exactly the same setup you have in #37963, i.e. create a vfs with three files and compilation database that has exactly those compile commands. You mention you tried to repro the issue, did you try doing that with the unit-test or a lit-test?

I tried doing a unit test that mimics the setup in #37963. For some reason I can't explain, the header file would always come out "correct". If you want to investigate further, I can update the patch with what I have so far.

After looking at the makefile, my guess would be that the problem comes from the paths starting with ../ inside the compilation database.

Yes. But AFAIK it's valid (it's relative to directory, which is the build directory. The compile_commands.json is generated with Bear.

clangd/XRefs.cpp
179

Ah probably not. I couldn't find a way to get a handle on the VFS, but that looks good.

unittests/clangd/TestTU.h
47 ↗(On Diff #153209)

Thanks. It's just not obvious :).

Thanks for the patch!
Could we try to figure out why the duplicates were there in the first place and why the paths were different?

I tried to do that, but it goes deep in the clang internals with which I'm not familiar. All I could see was that when creating the FileEntry representing the /home/emaisin/src/ls-interact/cpp-test/build/../src/first.h file, FileManager::getFile is called with OpenFile=false. This makes it so that the RealPathName field is not set (at FileManager.cpp:320). Because RealPathName is not set (well, empty), we use the non-normalized name in getAbsoluteFilePath. That's all I can tell.

I think from what we've seen before, it had to do with location coming from the preamble vs not. Simon, maybe we can spend some time together to investigate this. I know I said I'd rather not touch the Clang part as I thought it might be on purpose and/or performance sensitive, but the fact that the locations are inconsistent between the preamble and outside...it's not ideal and error-prone. Either way I think we need this fix in Clangd as a defensive fix because it's not in the AST's contract/guarantees (preamble or not) to provide locations with realpaths.

hokein added a subscriber: hokein.Jun 28 2018, 6:48 AM

Could we try to figure out why the duplicates were there in the first place and why the paths were different?

+1, I think there are two issues here:

  1. the result contains two candidates, which should be one, IIUC.
  2. the absolute file path problem, we encountered similar problem in SymbolCollector, and we have similar function ToURI there, I think we can share the implementation instead of having different duplications.

I'm somewhat familiar with the internals of clang around the FileManager/VFS, so I could try creating the repro of this issue. The bugreport has enough info to get started. (Probably tomorrow, I certainly won't get to it today).

  1. the absolute file path problem, we encountered similar problem in SymbolCollector, and we have similar function ToURI there, I think we can share the implementation instead of having different duplications.

+1 to sharing the code. I guess we're struggling with similar problems here. Any pointers to the functions we should use?

unittests/clangd/TestTU.h
47 ↗(On Diff #153209)

Yeah, it is quite deeply hidden :-)

Btw, this seems to happen only when the included header is in the preamble. If I put a variable declaration before #include "first.h", things work as expected, we have not duplicates and the path is normalized.

FYI, I have uploaded my attempt at writing a unit test here:

https://reviews.llvm.org/D48726

I tried to make it as close as possible to the example in the bug report, but don't manage to reproduce the bug.

hokein added a comment.Jul 3 2018, 9:13 AM

After taking a look closely, I figured why there are two candidates here -- one is from AST (the one with ".." path); the other one is from dynamic index, the deduplication failed because of the different paths :(

I think the fixing way is to normalize the file path from AST (making it absolute).

+1 to sharing the code. I guess we're struggling with similar problems here. Any pointers to the functions we should use?

Yeah, we struggled this problem in SymbolCollector. We cleanup the file path manually. I'd suggest abstract that logic from toURI (https://github.com/llvm-mirror/clang-tools-extra/blob/master/clangd/index/SymbolCollector.cpp#L80), and use it here.

I tried to make it as close as possible to the example in the bug report, but don't manage to reproduce the bug.

To reproduce the issue, I think you'd also need to enable the dynamic index.

After taking a look closely, I figured why there are two candidates here -- one is from AST (the one with ".." path); the other one is from dynamic index, the deduplication failed because of the different paths :(

I think the fixing way is to normalize the file path from AST (making it absolute).

+1 to sharing the code. I guess we're struggling with similar problems here. Any pointers to the functions we should use?

Yeah, we struggled this problem in SymbolCollector. We cleanup the file path manually. I'd suggest abstract that logic from toURI (https://github.com/llvm-mirror/clang-tools-extra/blob/master/clangd/index/SymbolCollector.cpp#L80), and use it here.

I tried to make it as close as possible to the example in the bug report, but don't manage to reproduce the bug.

To reproduce the issue, I think you'd also need to enable the dynamic index.

Thanks for looking into it. In my unit test, the only path I receive is one without "..", although I don't use the dynamic index. Still, I tried to enable the dynamic index using:

auto Opts = ClangdServer::optsForTest();
Opts.BuildDynamicSymbolIndex = true;
ClangdServer Server(CDB, FS, DiagConsumer, Opts);

but it didn't change the result.

simark added a comment.Jul 3 2018, 3:46 PM

An update, I traced the difference in behavior to the difference in how RealFileSystem and InMemoryFileSystem return Statuses.

I uploaded a patch to change InMemoryFileSystem to work like RealFileSystem: https://reviews.llvm.org/D48903

With this patch applied on clang, it is now possible to write a test that reproduces the issue. As @hokein said, if I keep the dynamic disabled, I get a single result with the ".." in it. If I enable the dynamic index, I get two results, one with ".." and one without.

I think the fixing way is to normalize the file path from AST (making it absolute).

Totally agree. Could we run the code used to get the URI to store in the dynamic index? Should we expose and reuse code in getSymbolLocation() from SymbolCollector.cpp?

I took a look at SymbolCollector's toURI, and I am not sure what to get from it. It seems like a lot of it could be replaced with a call to FileSystem::getRealPath.

simark updated this revision to Diff 155057.Jul 11 2018, 1:36 PM

This is an update of my work in progress. I haven't done any sharing with
SymbolCollector, as it's not clear to me how to proceed with that.

simark updated this revision to Diff 159764.Aug 8 2018, 11:32 AM

New version. I tried to share the code between this and SymbolCollector, but
didn't find a good clean way. Do you have concrete suggestions on how to do
this? Otherwise, would it be acceptable to merge it as-is?

simark updated this revision to Diff 159768.Aug 8 2018, 12:09 PM

Formatting.

hokein added a comment.Aug 9 2018, 1:20 AM

Sorry for the delay. I thought there was a dependent patch of this patch, but it seems resolved, right?

A rough round of review.

New version. I tried to share the code between this and SymbolCollector, but
didn't find a good clean way. Do you have concrete suggestions on how to do
this? Otherwise, would it be acceptable to merge it as-is?

Yeah, I'm fine with the current change. I was not aware of the getAbsoluteFilePath has been pulled out to the SourceCode.h. I think the SymbolCollector could use this function as well (but that's out of scope of this patch).

clangd/SourceCode.h
65–70

Why rename this function?

Is it guaranteed that RealPath is always an absolute path?

unittests/clangd/TestFS.cpp
52–64

Can we use RelPathPrefix.empty() instead of comparing with StringRef()?

unittests/clangd/XRefsTests.cpp
325

It seems that there is no difference between HeaderInPreambleAnnotations and HeaderNotInPreambleAnnotations in the test. Both of them are treated equally.

simark marked 3 inline comments as done.Aug 9 2018, 12:06 PM

Sorry for the delay. I thought there was a dependent patch of this patch, but it seems resolved, right?

A rough round of review.

Right, the patch this one depends on is in clang, and there does not seem to be regressions this time.

New version. I tried to share the code between this and SymbolCollector, but
didn't find a good clean way. Do you have concrete suggestions on how to do
this? Otherwise, would it be acceptable to merge it as-is?

Yeah, I'm fine with the current change. I was not aware of the getAbsoluteFilePath has been pulled out to the SourceCode.h. I think the SymbolCollector could use this function as well (but that's out of scope of this patch).

Thanks.

clangd/SourceCode.h
65–70

Before, it only made sure that the path was absolute, now it goes a step further and makes it a "real path", resolving symlinks and removing . and ... When we talk about a "real path", it refers to the unix realpath syscall:

http://man7.org/linux/man-pages/man3/realpath.3.html

So yes, the result is guaranteed to be absolute too.

unittests/clangd/TestFS.cpp
52–64

Done.

unittests/clangd/XRefsTests.cpp
325

The difference is that one is included in the main file as part of the preamble and the other is out of the preamble. Since they take different code paths in clang, I think it's good to test both. At some point, I had a similar bug that would only happen with a header included in the preamble.

simark updated this revision to Diff 159972.Aug 9 2018, 12:08 PM
simark marked an inline comment as done.

Replace

RelPathPrefix == StringRef()

with

RelPathPrefix.empty()

Only a few NITs from my side.
Excited for this fix to get in, have been seeing duplicate in other cases too :-)

clangd/SourceCode.h
69

This function looks like a good default choice for normalizing paths before putting them into LSP structs, ClangdServer responses, etc.
I suggest we add a small comment here with a guideline for everyone to attempt using it whenever possible. WDYT?

unittests/clangd/TestFS.h
43

s/not null/not empty

46

s/not null/not empty

simark marked 8 inline comments as done.Aug 10 2018, 8:28 AM
simark added inline comments.
clangd/SourceCode.h
69

What about this:

/// Get the real/canonical path of \p F.  This means:
///
///   - Absolute path
///   - Symlinks resolved
///   - No "." or ".." component
///   - No duplicate or trailing directory separator
///
/// This function should be used when sending paths to clients, so that paths
/// are normalized as much as possible.
hokein accepted this revision.Aug 10 2018, 1:45 PM

Looks good with a few nits. Looks like you didn't update the patch correctly. You have marked comments done, but your code doesn't get changed accordingly. Please double check with it.

I tried your patch and it did fix the duplicated issue I encountered before. Thanks for fixing it!

clangd/SourceCode.h
65

nit: this comment is not needed.

65–70

That makes sense, thanks for the explanation and the useful link!

69

SG.

This revision is now accepted and ready to land.Aug 10 2018, 1:45 PM
simark marked an inline comment as done.Aug 10 2018, 1:50 PM

Looks good with a few nits. Looks like you didn't update the patch correctly. You have marked comments done, but your code doesn't get changed accordingly. Please double check with it.

I tried your patch and it did fix the duplicated issue I encountered before. Thanks for fixing it!

Ah sorry, I applied the fixes locally, but was waiting for Ilya's feedback for the function comment.

simark updated this revision to Diff 160175.Aug 10 2018, 1:54 PM
simark marked 4 inline comments as done.

Latest update.

This revision was automatically updated to reflect the committed changes.
simark added inline comments.Aug 10 2018, 3:29 PM
clangd/SourceCode.h
65

Woops, merge mistake.

ioeric added inline comments.Aug 23 2018, 1:07 PM
clangd/SourceCode.cpp
209

With the recent performance regression due to getRealPath() mentioned in D51159, I think we should re-evaluate whether VFS::getRealPath(), which calls expensive real_path system call on real FS, was necessary to fix the bug mentioned in the patch summary. I think vfs::makeAbsolutePath with remove_dot_dot (without symlink resolution) should be sufficient to address the issue. The code completion latency has increased dramatically with the real_path call, so I would also expect Xrefs/findDefinition to slow down due to this. As SymbolCollector is not in a latency sensitive code paths, it might be OK for it to call real_path, but I think we should try to avoid using real_path elsewhere.

In general, it's unclear whether clangd should always resolve symlink (it might not always be what users want), and it should probably be a decision made by the build system integration. I think we would need a more careful design if we decide to handle symlinks in clangd.