This is an archive of the discontinued LLVM Phabricator instance.

[llvm-cov] Cache file status information
ClosedPublic

Authored by tunz on Feb 19 2021, 9:15 AM.

Details

Summary

Currently, getSourceFile accesses file system to check if two paths are
the same file with a thread lock, which is a huge performance bottleneck
in some cases. Currently, it's accessing file system size(files) * size(files) times.

Thus, cache file status information, which reduces file system access to size(files) times.

When I tested it with two binaries and 16 cpu cores,
it saved over 70% of time.

Binary 1: 56 secs -> 3 secs
Binary 2: 17 hours -> 4 hours

Diff Detail

Event Timeline

tunz requested review of this revision.Feb 19 2021, 9:15 AM
tunz created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 19 2021, 9:15 AM
tunz added a comment.Feb 19 2021, 9:21 AM

I'm not sure about the background context about sys::fs::equivalent usage, so I added a new option for this change. I think the only difference is to follow symlinks, which doesn't look common and isn't worth keeping as a default to me.
I'd like to get some feedbacks about this change. I'll add tests and documentation if this looks good.

tunz edited the summary of this revision. (Show Details)Feb 19 2021, 9:23 AM
tunz updated this revision to Diff 326755.Feb 26 2021, 11:26 AM

rebase and add doc

vsk added a comment.Mar 1 2021, 10:43 AM

It seems unexpected that there's such a huge performance cost associated with sys::fs::equivalent.

Could you share a profile of part of your 56sec workload? It'd be helpful to know whether the bottleneck is in the filesystem, or in llvm code.

Do you know what the queries look like (would caching help)?

I'd hope that the performance problem can be solved without changing the tool's behavior. But if not, I don't think it's possible to change the default behavior, as this could silently regress existing workflows which depend on symlinks.

llvm/tools/llvm-cov/CodeCoverage.cpp
295

Could you introduce a helper method that compares file paths for equality, so there's only one copy of the outer loop?

tunz updated this revision to Diff 327461.Mar 2 2021, 8:11 AM

Use a helper function

tunz added a comment.Mar 2 2021, 8:13 AM

@vsk Thanks for looking into it.

Here is some summary of the profiling of LLVMSupport and llvm-cov

top-5 exclusive time for a single core

  • llvm::sys::fs::status (44.2s)
  • llvm::sys::process::SafelyCloseFileDescriptor (14.2s)
  • llvm::sys::fs::openNativeFileInternal (13.6s)
  • llvm::sys::fs::getStatus (9.38s)
  • std::_Uninitialized_move<llvm::coverage::CountedRegion *, std::allocator<llvm::coverage::CountedRegion>> (4.76s)

top-5 exclusive time for 8 threads

  • llvm::ThreadPool::wait (76.6s)
  • CodeCoverageTool::getSourceFile (63.6s)
  • llvm::sys::fs::status (6.25s)
  • std::_Deallocate<16, 0> (4.71s)
  • llvm::sys::process::SafelyCloseFileDescriptor (1.99s)

It seems like the performance loss mostly comes from file system accesses.
The time of status function includes GetFileAttributesW and CreateFileW calls since I didn't instrument windows apis.

Caching of file path pairs (caching of the comparison function result) did not help because
all of the files are different in my case. The binary for testing contains about 1150 unique files,
which results 1.3 millions of file accesses. I think caching of status result might help,
but then we might need to update LLVMSupport.

tunz marked an inline comment as done.Mar 2 2021, 8:14 AM
tunz updated this revision to Diff 327531.Mar 2 2021, 11:29 AM

Use status cache

tunz updated this revision to Diff 327533.Mar 2 2021, 11:31 AM

Do not update test

tunz retitled this revision from [llvm-cov] Compare path only to find the same file to [llvm-cov] Cache file status information.Mar 2 2021, 11:35 AM
tunz edited the summary of this revision. (Show Details)
tunz added a comment.Mar 2 2021, 11:40 AM

@vsk Updated to use file status cache. It does not change the existing behavior, and performance loss is not huge (only 0.x secs for the 56 secs binary).
I didn't want to modify the LLVMSupport, so implemented the cache and rewrote the equivalent function only for llvm-cov. This still could result different outputs if some file type changes whilst running llvm-cov. But I don't think we need support that case.

phosek added inline comments.Mar 2 2021, 11:41 AM
llvm/tools/llvm-cov/CodeCoverage.cpp
163

StringMap is more efficient.

tunz updated this revision to Diff 327535.Mar 2 2021, 11:54 AM

Use StringMap

tunz marked an inline comment as done.Mar 2 2021, 11:54 AM

Thanks

vsk added a comment.Mar 2 2021, 11:56 AM

This looks nice.

llvm/tools/llvm-cov/CodeCoverage.cpp
257

The result variable shouldn't be necessary - the Optional should be implicitly converted wherever status is used.

268

Please keep the naming convention consistent (e.g. 'Status1' might work better here).

284

Please leave out unrelated whitespace changes.

tunz updated this revision to Diff 327538.Mar 2 2021, 12:19 PM
tunz marked 2 inline comments as done.

Address comments

tunz marked an inline comment as done.Mar 2 2021, 12:20 PM
tunz updated this revision to Diff 327544.Mar 2 2021, 12:39 PM

Fix clang-tidy check

vsk accepted this revision.Mar 2 2021, 3:05 PM

Thanks, lgtm. Do you need someone to land this on your behalf?

This revision is now accepted and ready to land.Mar 2 2021, 3:05 PM
tunz updated this revision to Diff 327648.Mar 2 2021, 6:39 PM

Fix typo

tunz added a comment.Mar 2 2021, 7:31 PM

Thanks for the review! yes, I need someone to land. I don't have commit access.

This revision was landed with ongoing or failed builds.Mar 3 2021, 10:08 AM
Closed by commit rG9d8a3e75b442: [llvm-cov] Cache file status information (authored by tunz, committed by vsk). · Explain Why
This revision was automatically updated to reflect the committed changes.