This is an archive of the discontinued LLVM Phabricator instance.

[clang-scan-deps] Implementation of dependency scanner over minimized sources
ClosedPublic

Authored by arphaman on Jun 27 2019, 5:04 PM.

Details

Summary

This patch implements the fast dependency scanning mode in clang-scan-deps: the preprocessing is done on files that are minimized using the dependency directives source minimizer.

A shared file system cache is used to ensure that file system requests and source minimization are performed only once. The cache assumes that the underlying filesystem won't change during the course of the scan (or that if it does, the changes will not affect the output), and its entries are never evicted. This means that the service and workers can be used for a single run of the dependency scanner and can't be reused across multiple, incremental runs. This is something that we'll most likely support in the future, though. Note that the driver still uses the underlying real filesystem.

This patch is also still missing the fast skipped-PP-block optimization that I mentioned in the EuroLLVM talk.
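As a rough illustration of the caching contract above, here is a minimal sketch; minimizeSource and readFile are hypothetical stand-ins (the real entry point is the dependency directives source minimizer), and the actual implementation shards the lock and stores richer per-file entries, as discussed in the review below.

#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MemoryBuffer.h"
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the dependency directives source minimizer.
static std::string minimizeSource(llvm::StringRef Original) {
  return Original.str();
}

static std::string readFile(llvm::StringRef Path) {
  llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>> Buffer =
      llvm::MemoryBuffer::getFile(Path);
  return Buffer ? (*Buffer)->getBuffer().str() : std::string();
}

struct CachedEntry {
  bool Populated = false;
  std::string MinimizedContents; // never evicted during the scan
};

class SharedCache {
  std::mutex Lock; // the real cache shards this lock (see the review below)
  llvm::StringMap<CachedEntry> Entries;

public:
  // All workers funnel reads through here; only the first request for a
  // given path does real filesystem work and minimization.
  llvm::StringRef getMinimized(llvm::StringRef Path) {
    std::lock_guard<std::mutex> Guard(Lock);
    CachedEntry &Entry = Entries[Path];
    if (!Entry.Populated) {
      Entry.MinimizedContents = minimizeSource(readFile(Path));
      Entry.Populated = true;
    }
    return Entry.MinimizedContents;
  }
};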

Diff Detail

Event Timeline

arphaman created this revision.Jun 27 2019, 5:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 27 2019, 5:04 PM
arphaman updated this revision to Diff 211385.Jul 23 2019, 4:52 PM

Fix a bug for empty minimized files where the lexer's null-terminator lookup was an out-of-bounds read

I will take a look next week when I get back!

aganea added inline comments.
clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
2

General comment for this file and the implementation: it seems there's nothing specific to dependency scanning here, except for replacing the contents with minimized contents? This cached FS could very well be used to compile a project with parallel workers. Could this be part of the VirtualFileSystem infrastructure (llvm/include/llvm/Support/VirtualFileSystem.h)? If yes, could you possibly create a separate patch for this? @JDevlieghere @sammccall
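For context, a minimal sketch of what such a generic caching layer over llvm::vfs::ProxyFileSystem could look like, assuming it only memoizes status() results and leaves content caching and minimization out; the class name and scope are illustrative rather than anything in the patch.

#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringMap.h"
#include "llvm/Support/VirtualFileSystem.h"
#include <mutex>

class CachingFileSystem : public llvm::vfs::ProxyFileSystem {
public:
  using ProxyFileSystem::ProxyFileSystem;

  llvm::ErrorOr<llvm::vfs::Status> status(const llvm::Twine &Path) override {
    llvm::SmallString<256> Key;
    Path.toVector(Key);
    std::lock_guard<std::mutex> Guard(CacheLock);
    llvm::StringMap<llvm::vfs::Status>::iterator It = StatCache.find(Key);
    if (It != StatCache.end())
      return It->getValue();
    llvm::ErrorOr<llvm::vfs::Status> Stat = ProxyFileSystem::status(Path);
    if (Stat)
      StatCache.try_emplace(Key, *Stat); // only successful stats are cached
    return Stat;
  }

private:
  std::mutex CacheLock;
  llvm::StringMap<llvm::vfs::Status> StatCache;
};

// Usage (illustrative):
//   llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> FS(
//       new CachingFileSystem(llvm::vfs::getRealFileSystem()));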

102

The struct is self-explanatory; I'm not sure the comment is needed here?

104

Would you mind renaming this to ValueLock so it is easier to search for? (and to remain consistent with CacheLock below)

130

Maybe worth mentioning that this is NOT a global, thread-safe class like DependencyScanningFilesystemSharedCache, but rather is meant to be used as per-thread instances?

I am also wondering if there could be a better name to signify at first sight that this is a per-thread class. DependencyScanningLocalFilesystem? DependencyScanningWorkerFilesystem?

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
85

This line needs a comment. Is this value based on empirical results across different hardware? (I can't recall your answer at the LLVM conf.) I am wondering what the right strategy would be here. The more cores/HT in your PC, the more likely you are to hit the same shard, and thus to lock. But the bigger we make this value, the more StringMaps we will have, and possibly more cache misses.
Should it be something like llvm::hardware_concurrency() / some_fudge? It'd be interesting to subsequently profile on high-core-count machines (or maybe you have done that already).

211

Don't use auto when the type is not obvious.

clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
148

Can we not use caching all the time?

arphaman updated this revision to Diff 212435.Jul 30 2019, 2:06 PM
arphaman marked 7 inline comments as done.
arphaman added inline comments.
clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
2

I think it could be possible to separate out the cache, but we definitely need a subclass of the VFS to handle some Clang-specific logic, for instance determining how to deal with module files. I wouldn't be opposed to factoring it out as a follow-up if people are interested.

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
85

I rewrote it to use a heuristic we settled on after doing empirical testing on an 18-core / 36-thread machine (max(2, thread_count / 4)), and added a comment. This is where the number 9 (36 / 4) that I mentioned at the talk came from; we arrived at it via the heuristic. I think that now, after some SDK/OS updates, the effect of adding more shards is less pronounced, so performance mostly flatlines past some number between 5 and 10. A better heuristic would probably be OS-specific, to take the cost of lock contention into account.

Note that an increased number of shards does not increase the number of cache misses, because the shard # is determined by the filename (we don't actually have global cache misses, as the cache is designed to have only one miss per file, when it's first accessed)! It's just that an increased number of shards doesn't improve performance past some specific limit, so we want to find a point where we can taper it off.

It would definitely still be interesting to profile this on other high-core-count machines with different OSes, to see whether the heuristic is reasonable in those scenarios too.
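A small sketch of the sharding scheme under discussion, assuming the max(2, thread_count / 4) heuristic above; the class names and the cached value type are illustrative stand-ins, not the patch's declarations.

#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"
#include <algorithm>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

struct CacheShard {
  std::mutex CacheLock;
  llvm::StringMap<std::string> Cache; // value type is a stand-in
};

class ShardedCache {
public:
  ShardedCache() {
    // The heuristic from the review: max(2, thread_count / 4), measured on
    // an 18-core / 36-thread machine.
    unsigned HW = std::thread::hardware_concurrency();
    NumShards = std::max(2u, HW / 4);
    Shards = std::make_unique<CacheShard[]>(NumShards);
  }

  // The shard is picked from the filename, so a given file always lands in
  // the same shard and is stat'ed/read/minimized at most once globally.
  CacheShard &getShardFor(llvm::StringRef Filename) {
    return Shards[static_cast<size_t>(llvm::hash_value(Filename)) % NumShards];
  }

private:
  unsigned NumShards = 0;
  std::unique_ptr<CacheShard[]> Shards;
};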

clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
148

We want a mode that is as close to a regular clang -E invocation as possible, for correctness CI and testing. We also haven't evaluated the cost of keeping non-minimized sources in memory, so it might be too expensive for practical use. I can add a third option that caches but doesn't minimize as a follow-up patch, though.

arphaman updated this revision to Diff 212442.Jul 30 2019, 2:29 PM
bruno added a subscriber: bruno.Jul 31 2019, 12:02 PM
aganea added inline comments.Aug 1 2019, 9:26 AM
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
85

I'll give it a try on Windows 10; one project here has a large codebase and some wild machines to test on.

156

This only creates short-lived objects, is that right? Just during the call to CachedFileSystemEntry::createFileEntry?

clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
94

"might be" twice.

148

Yes, that would be nice. As for the size taken in RAM, it would only be .H files, right? For Clang+LLVM+LLD I'm counting about 40 MB. But indeed, with a large project that would be about 1.5 GB of .H files. Although I doubt all these files will be loaded in memory at once (I'll check).

As for the usage: Fastbuild works like distcc (plain mode, not pump), so we were also planning on extracting the fully preprocessed output, not only the dependencies. There's one use case where we need to preprocess locally, then send the preprocessed output remotely for compilation. And another use case where we only want to extract the dependency list and compute a digest, to retrieve the OBJ from a network cache.

arphaman updated this revision to Diff 213511.Aug 5 2019, 6:38 PM
arphaman marked 3 inline comments as done.

Fix comment typo

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
156

Yes, these VFS buffer files are only alive for a particular invocation.

clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
148

Right now it doesn't differentiate between .H and other files, but we could teach it to have a header-only filter for the cache.

aganea added a comment.Aug 6 2019, 8:34 AM

Thanks for the update, Alex! Just a few more comments and we should be good to go:

clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
118

Why not use a bump allocator here? (It would be per-thread.) On Windows the heap is always thread-safe, which induces a lock in malloc for each new entry. You could also avoid the use of unique_ptr at the same time:

llvm::StringMap<SharedFileSystemEntry, SpecificBumpPtrAllocator<SharedFileSystemEntry>> Cache;

(unless you're planning on removing entries from the cache later on?)

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
77

llvm::vfs::Status &&Stat to avoid a copy?

104

Shard.Cache.try_emplace(Key) will provide a default constructed value if it's not there.
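A tiny illustration of the try_emplace suggestion; the value type and names are stand-ins rather than the patch's actual types.

#include "llvm/ADT/StringMap.h"

struct CachedFileSystemEntry {
  bool Valid = false; // illustrative field
};

llvm::StringMap<CachedFileSystemEntry> Cache;

CachedFileSystemEntry &lookupOrCreate(llvm::StringRef Key) {
  // try_emplace returns {iterator, inserted}; if Key is absent, a
  // default-constructed entry is inserted, so no separate find/insert pair
  // (and no second hash lookup) is needed.
  return Cache.try_emplace(Key).first->getValue();
}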

118

CachedFileSystemEntry *Entry ?

199

CachedFileSystemEntry *Entry ?

clang/lib/Tooling/DependencyScanning/DependencyScanningWorker.cpp
148

No worries, I was simply wondering about the size in memory.

arphaman updated this revision to Diff 213664.Aug 6 2019, 11:22 AM
arphaman marked 6 inline comments as done.

Address review comments.

clang/include/clang/Tooling/DependencyScanning/DependencyScanningFilesystem.h
118

Good idea; I switched to a bump pointer allocator (I don't think I can use a SpecificBumpPtrAllocator, as that would cause the values to be destroyed twice). I'm not planning on removing entries from the cache, no.
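For reference, a hedged sketch of roughly what the resulting declaration could look like under the sharded layout; the payload field is illustrative.

#include "llvm/ADT/StringMap.h"
#include "llvm/Support/Allocator.h"
#include <mutex>
#include <string>

struct SharedFileSystemEntry {
  std::string MinimizedContents; // stand-in for the real per-file state
};

struct CacheShard {
  std::mutex CacheLock;
  // StringMap destroys its values itself, so handing it a plain
  // BumpPtrAllocator only changes where the entry memory comes from. As
  // noted above, a SpecificBumpPtrAllocator would run the value destructors
  // a second time from its own destructor. Entries are never removed, so
  // the arena is only released when the whole cache goes away.
  llvm::StringMap<SharedFileSystemEntry, llvm::BumpPtrAllocator> Cache;
};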

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
77

The copy should already be avoided, as I move the argument both when passing it in and when it's assigned to the result.

Just for reference, this patch still doesn't reuse the FileManager across invocations in a thread. We expect to get even better performance once we reuse it, but I'm going to have to improve its API first.

aganea accepted this revision.Aug 6 2019, 11:48 AM

LGTM, thank you!

Just for reference, this patch still doesn't reuse the FileManager across invocations in a thread. We expect to get even better performance once we reuse it, but I'm going to have to improve its API first.

Can't wait! @SamChaps is already testing this patch. He found some minor edge cases with the minimizer (most likely unrelated to this); I'll post a patch.

clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
77

If Stat is not an rvalue reference, wouldn't the std::move at the call site end up as a copy? If you change int test(A a) to int test(A &&a), you can see the difference in the asm output. However, if the function is inlined, the extra copy will probably be optimized out. Not a big deal; OS calls like stat() will most likely dominate here.

This revision is now accepted and ready to land.Aug 6 2019, 11:48 AM
arphaman marked an inline comment as done.Aug 6 2019, 1:38 PM
arphaman added inline comments.
clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
77

Isn't the difference in the output caused by the details of the calling convention (passing the structure on the stack by value)? The move constructor should still be called for the stat; it will not copy its members. We can still optimize this and pass by reference directly, I agree, so let me do that.
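A standalone example (unrelated to the Clang types involved) of the point being debated: passing by value together with std::move still move-constructs the parameter rather than copying it, while an rvalue-reference parameter binds directly; the by-value parameter object is what shows up as extra code in the assembly.

#include <utility>
#include <vector>

struct A {
  std::vector<int> Data;
};

// Parameter is move-constructed from the argument at the call site; the
// members are moved, not copied, but the ABI still materializes a new A.
static int testByValue(A a) { return static_cast<int>(a.Data.size()); }

// Parameter binds directly to the argument; no A is constructed at all.
static int testByRValueRef(A &&a) { return static_cast<int>(a.Data.size()); }

int main() {
  A x{std::vector<int>(1000, 1)};
  int N1 = testByValue(std::move(x)); // move constructor of A runs (cheap)

  A y{std::vector<int>(1000, 2)};
  int N2 = testByRValueRef(std::move(y)); // no constructor call

  return N1 + N2;
}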

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 6 2019, 1:43 PM