This is an archive of the discontinued LLVM Phabricator instance.

[clang][Tooling] Try to avoid file system access if there is no record for the file in compile_commads.json
ClosedPublic

Authored by ArcsinX on Jul 11 2020, 6:05 AM.

Details

Summary

If there is no record in compile_commands.json, we try to find suitable record with MatchTrie.findEquivalent() call.
This is very expensive operation with a lot of llvm::sys::fs::equivalent() calls in some cases.

This patch disables file symlinks for performance reasons.

Example scenario without this patch:

  • compile_commands.json generated at clangd build (contains ~3000 files).
  • it tooks more than 1 second to get compile command for newly created file in the root folder of LLVM project.
  • we wait for 1 second every time when clangd requests compile command for this file (at file change).

Diff Detail

Event Timeline

ArcsinX created this revision.Jul 11 2020, 6:05 AM

Also I think that FileMatchTrie introduced in https://reviews.llvm.org/D30 is not efficient solution to fight with symlinks and for most cases InterpolatingCompilationDatabase will be accurate enough. But I am not sure, is it safe to completely remove FileMatchTrie? What do you think?

Wow, yeah, this seems pretty bad! I'm not very familiar with this code. I'm curious, for "it tooks more than 1 second" - what OS is this on?

My guess is that way back when, the performance of looking up a CDB entry that didn't exist wasn't a big deal, or windows support wasn't a big thing, or this was just an oversight.

My reading of the code is that any TrieNode is a set of candidates that match a suffix of the path equally well, and we're trying to walk down the tree making the suffix longer and the candidate set more restrictive.
When we reach the node whether the path stops matching, we scan all the candidates to see if they're symlink-equivalent.
This is expensive if the set is too large, which is trivially the case if the filename is unique in the project (so we never get past the root node, and stat every file in the project).

I think it would be slightly cleaner to fix this in the trie itself, either:

  • don't scan for equivalences if the set of candidates exceeds some size threshold (10 or so)
  • don't scan for equivalences if the node we'd scan under is the root

@klimek In case you have context or thoughts here. Always fun to find a 9 year old bug :-)

But I am not sure, is it safe to completely remove FileMatchTrie?

Missed this... honestly I don't know. It does seem a little overly defensive to me, but any change in symlink behaviour tends to break *someone*.

klimek added a subscriber: djasper.Jul 13 2020, 2:04 AM

@djasper wrote this iirc, but I doubt he'll have much time to look into this :)

IIRC the symlink checking was there because some part of the system on linux tends to give us realpaths (with CMake builds), and that leads to not finding anything if there are symlinks involved.

Wow, yeah, this seems pretty bad! I'm not very familiar with this code. I'm curious, for "it tooks more than 1 second" - what OS is this on?

Windows.
I faced this problem on a huge project and it takes 7 seconds for MatchTrie.findEquivalent() to return.

My guess is that way back when, the performance of looking up a CDB entry that didn't exist wasn't a big deal, or windows support wasn't a big thing, or this was just an oversight.

I think it would be slightly cleaner to fix this in the trie itself, either:

Firstly, I wanted to fix this problem in FileMatchTrie, but for caching solution we need to update cache at every instert() call, but for caching solution inside JSONCompilationDatabase we can avoid this, because all inserts already done at JSONCompilationDatabase::parse().

  • don't scan for equivalences if the set of candidates exceeds some size threshold (10 or so)
  • don't scan for equivalences if the node we'd scan under is the root

After such fixes JSONCompilationDatabase::getCompileCommands() will return other results than before in some cases.

IIRC the symlink checking was there because some part of the system on linux tends to give us realpaths (with CMake builds), and that leads to not finding anything if there are symlinks involved.

That was before InterpolatingCompilationDatabase introduction in https://reviews.llvm.org/D45006.
For my experience InterpolatingCompilationDatabase works pretty well for symlinks cases.

  • don't scan for equivalences if the set of candidates exceeds some size threshold (10 or so)
  • don't scan for equivalences if the node we'd scan under is the root

After such fixes JSONCompilationDatabase::getCompileCommands() will return other results than before in some cases.

Can you elaborate on this - do you think there are reasonable cases where it will give the wrong answer?
I can certainly imagine cases where this changes behaviour, e.g. project/foo.cc is a symlink to project/bar.cc, compile_commands contains an entry for foo.cc, we query for bar.cc.
But I don't think this was ever really intended to handle arbitrary symlinks that change the names of files, and I do think this is OK to break.

For my experience InterpolatingCompilationDatabase works pretty well for symlinks cases.

This sounds reasonable but is a bit anecdotal - people use different build systems/source layouts, some of them use symlinks heavily and some not at all.
InterpolatingCompilationDatabase often works well but it's very fuzzy/heuristic, and there are layouts where it really doesn't work well.
(At the same time I feel uncomfortable because I'm not sure what would be sufficient evidence to remove the filematchtrie)

  • don't scan for equivalences if the set of candidates exceeds some size threshold (10 or so)
  • don't scan for equivalences if the node we'd scan under is the root

After such fixes JSONCompilationDatabase::getCompileCommands() will return other results than before in some cases.

Can you elaborate on this - do you think there are reasonable cases where it will give the wrong answer?
I can certainly imagine cases where this changes behaviour, e.g. project/foo.cc is a symlink to project/bar.cc, compile_commands contains an entry for foo.cc, we query for bar.cc.
But I don't think this was ever really intended to handle arbitrary symlinks that change the names of files, and I do think this is OK to break.

If breaking project/foo.cc <=> project/bar.cc is OK, then could we also break some/path/here/foo.cc <=> some/other/path/bar.cc?

In that case we can just prevent Comparator.equivalent(FileA,FileB) calls when llvm::sys::path::filename(FileA) != llvm::sys::path::filename(FileB).
I tried this locally:

  • it works fast.
  • does not break tests.
  • easy to add unit test for FileMatchTrie.

What do you think?

  • don't scan for equivalences if the set of candidates exceeds some size threshold (10 or so)
  • don't scan for equivalences if the node we'd scan under is the root

After such fixes JSONCompilationDatabase::getCompileCommands() will return other results than before in some cases.

Can you elaborate on this - do you think there are reasonable cases where it will give the wrong answer?
I can certainly imagine cases where this changes behaviour, e.g. project/foo.cc is a symlink to project/bar.cc, compile_commands contains an entry for foo.cc, we query for bar.cc.
But I don't think this was ever really intended to handle arbitrary symlinks that change the names of files, and I do think this is OK to break.

If breaking project/foo.cc <=> project/bar.cc is OK, then could we also break some/path/here/foo.cc <=> some/other/path/bar.cc?

Yes. We discussed offline a bit with @chandlerc who doesn't think identifying cases where symlinked files have different names was a motivating case or likely to be important.

In that case we can just prevent Comparator.equivalent(FileA,FileB) calls when llvm::sys::path::filename(FileA) != llvm::sys::path::filename(FileB).

You can - but this is equivalent to not doing the getAll() path when the node is the root (i.e. when ConsumedLength == 0).
So I think if (ConsumedLength == 0) return {} around line 124 in FileMatchTrie with an appropriate comment would also do the trick, and avoid pointless traversal/copy/string comparisons (much cheaper than IO but not free).

I tried this locally:

  • it works fast.
  • does not break tests.
  • easy to add unit test for FileMatchTrie.

What do you think?

+1, happy with this behavior.
I think we should try the implementation that skips the traversal first - it's no more code and should be a bit more efficient.
But either way works, really.

ArcsinX updated this revision to Diff 278161.Jul 15 2020, 6:11 AM

Support only directory simlinks.

sammccall accepted this revision.Jul 15 2020, 9:20 AM

Hmm, there is actually a case the old behavior may have been papering over: case-insensitive filesystems. If the real file is foo.cc and we query Foo.cc, then the trie would (inefficiently) return the correct file.
It may be a good idea to make the trie case-insensitive for this purpose, but maybe we should wait until someone complains. That's a different patch, anyway.

clang/lib/Tooling/FileMatchTrie.cpp
111

Here it's really just for consistency - we have a single candidate, and calling equivalent() on a single file isn't expensive. I'd be happy with or without this check, but the comment should mention consistency.

131

nit: This is quite a lot of words to come to "we do not support file symlinks" - I think it could be a bit shorter and also get into motivation:

"If ConsumedLength is zero, this is the root and we have no filename match. Give up in this case, we don't try to find symlinks with different names".

This revision is now accepted and ready to land.Jul 15 2020, 9:20 AM
ArcsinX updated this revision to Diff 278230.Jul 15 2020, 9:55 AM

Update comments.

ArcsinX edited the summary of this revision. (Show Details)Jul 15 2020, 10:02 AM
ArcsinX marked 3 inline comments as done.Jul 15 2020, 10:11 AM

As far as I do not have commit access, could you please commit for me?
Aleksandr Platonov <platonov.aleksandr@huawei.com>

clang/lib/Tooling/FileMatchTrie.cpp
111

Removed "expensive" word.

131

Fixed.

This revision was automatically updated to reflect the committed changes.

As far as I do not have commit access, could you please commit for me?
Aleksandr Platonov <platonov.aleksandr@huawei.com>

Oops, missed this! Committed as d19f0666bcd8f7d26aaf4019244c3ed91e47b1b7 :-)