This is an archive of the discontinued LLVM Phabricator instance.

[clang][deps] Normalize ignored filenames in minimizing file system
ClosedPublic

Authored by jansvoboda11 on Jul 15 2021, 6:50 AM.

Details

Summary

This patch normalizes filenames in DependencyScanningWorkerFilesystem so that lookup of ignored files works correctly on Windows (where / and \ are equivalent).

Diff Detail

Event Timeline

jansvoboda11 requested review of this revision.Jul 15 2021, 6:50 AM
jansvoboda11 created this revision.
dexonsmith added inline comments.Jul 16 2021, 12:23 PM
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
168–169

I'm a bit nervous about the impact of modifying the input filename on Windows before passing it into other APIs. This could change behaviour of lower layers of the VFS (since they'll see a different filename than when DependencyScanningWOrkerFileSystem is NOT on top of them).

Can we restrict this just to what's passed to IgnoredFiles? (Maybe add shouldIgnore() API, which returns false if the set is empty, and then locally converts to native and checks for membership...)

It also seems wasteful to be calling sys::path::native and the memcpy all the time, when usually it has no effect. Have you checked whether this affects performance of scanning something big?

175–176

Looking at this, makes me wonder if this is just fixing a specific instance of a more general problem.

Maybe IgnoredFiles should be a set of FileEntrys instead of StringRefs... but that'd create a different performance bottleneck when the set is big, since creating the FileEntrys would be expensive. We'd want the FileEntry lookup to be globally cached / etc. -- and FileManager isn't quite safe to use globally.

Do you think IgnoredFiles as-is will work well enough for where it'll be used for PCH? Or do we need to catch headers referenced in two different ways somehow?

jansvoboda11 added inline comments.Jul 19 2021, 6:55 AM
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
168–169

Yeah, I can see that path changing between VFS layers can be problematic. I'm pretty sure we can get away with only converting Filename to its native form when interacting with IgnoredFiles.

I haven't checked the performance impact. If it ends up being measurable, I could implement something like sys::path::is_native and avoid the copy most of the time on unix-like OSes. WDYT?

175–176

I think we could use llvm::sys::fs::UniqueID instead of the filename to refer to files. Since the VFS layer resolves symlinks when stat-ing a file, that should be a canonical file identifier. I can tackle that in a follow up patch.

dexonsmith added inline comments.Jul 19 2021, 10:07 AM
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
168–169

Probably it'll end up not being measurable, but if it is, something like is_native might help... that said, if this will eventually be replaced with logic relyin on fs::UniqueID it might not be worth optimizing.

175–176

Yup, a unique ID should work for a file identifier.

I'm concerned about the cost of looking up the unique ID — avoiding stat traffic was measured to be an important performance benefit in the dependency scanner model.

To avoid a perf regression, I think you could use caches like:

  • ids: filename -> unique-id
  • originals: unique-id -> original file content
  • minimized: unique-id -> minimized file content

Where "ids" and "originals" are read/cached in lock-step when accessing a filename, additionally computing "minimized" if not in the ignore-list. (Adding a file to the ignore-list would put content in "ids" and "originals".)

The goal is to amortize the stat cost across the lifetime of the service while ensuring a consistent view of the file content.

WDYT?

... regardless I think all of this is out of scope for the current patch, which is still useful for unblocking adding tests to the subsequent patches in the stack.

With the call to llvm::sys::path::native scoped only to IgnoredFiles, would this patch LGTY?

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
168–169

Agreed.

175–176

Yes, this is the cache structure I had in mind.

I agree that this should be tackled in a follow-up patch. I'm going to create a patch with xfailing test case that demonstrates how one file with two different names (e.g. symlink) can cause issues with the current approach.

dexonsmith accepted this revision.Jul 19 2021, 11:50 AM

With the call to llvm::sys::path::native scoped only to IgnoredFiles, would this patch LGTY?

Yes, this LGTM once you update that.

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
175–176

Might be nice to include that XFAIL'ed test in this patch, as well as a FIXME in the code, documenting the general problem. But if you'd rather land that separately/after it's fine with me.

This revision is now accepted and ready to land.Jul 19 2021, 11:50 AM
jansvoboda11 retitled this revision from [clang][deps] Normalize paths in minimizing file system to [clang][deps] Normalize ignored filenames in minimizing file system.Jul 20 2021, 2:26 AM
jansvoboda11 edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.Jul 20 2021, 2:32 AM
This revision was automatically updated to reflect the committed changes.