This is an archive of the discontinued LLVM Phabricator instance.

[clang][deps] Prune unused header search paths
ClosedPublic

Authored by jansvoboda11 on May 14 2021, 5:28 AM.

Details

Summary

To reduce the number of explicit builds of a single module, we can try to squash multiple occurrences of the module with different command-lines (and context hashes) by removing benign command-line options. The greatest contributors to benign differences between command-lines are the header search paths.

In this patch, the lookup cache in HeaderSearch is used to identify paths that were actually used when implicitly building the module during scanning. This information is serialized into the unhashed control block of the implicitly-built PCM. The dependency scanner then loads this and may use it to prune the header search paths before computing the context hash of the module and generating the command-line.

We could also prune the header search paths when serializing HeaderSearchOptions into the PCM. That way, we could do it only once instead of every load of the PCM file by dependency scanner. However, that would result in a PCM file whose contents don't produce the same context hash as the original build, which is probably highly surprising.

There is an alternative approach to storing extra information into the PCM: wire up preprocessor callbacks to capture the used header search paths on-the-fly during preprocessing of modularized headers (similar to what we currently do for the main source file and textual headers). Right now, that's not compatible with the fact that we do an actual implicit build producing PCM files during dependency scanning. The second run of dependency scanner loads the PCM from the first run, skipping the preprocessing altogether, which would result in different results between runs. We can revisit this approach when we stop building implicitly during dependency scanning.

Depends on D102923.

Diff Detail

Event Timeline

jansvoboda11 requested review of this revision.May 14 2021, 5:28 AM
jansvoboda11 created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptMay 14 2021, 5:28 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Undo changes to modules-full.cpp test

dexonsmith requested changes to this revision.May 14 2021, 11:54 AM

The code looks mostly good; some inline comments.

I think it'd be useful to have direct tests for "detecting used search paths". That might be a nice thing to separate out and test independently. E.g., you could add a remark, -Rused-search-path, which printed a diagnostic the first time any given search path is used in a TU.

It'd also be really nice to be able to dump this info from the .pcm file. I guess it'd be outside the scope of this patch, but I really think there should be a clang-pcm-analyzer where we can put functionality like that. Somewhat useful for more fine-grained testing; super useful as a tool for debugging.

clang/include/clang/Lex/HeaderSearch.h
295 ↗(On Diff #345412)

I think the word "get" here is a bit misleading, since this isn't a simple accessor, it's a potentially expensive computation. I suggest "compute", unless it's changed to be computed on-the-fly.

I wonder, would llvm::BitVector be more appropriate here, since we expect it to count from 0 and be relatively dense?

clang/lib/Serialization/ASTReader.cpp
4763–4771

If this were stored as a blob it could be lazily parsed; WDYT?

Alternatively, it might be more compact if stored as a bit vector, storing 64-bits positions per element of the record.

This revision now requires changes to proceed.May 14 2021, 11:54 AM

Use, serialize and deserialize llvm::BitVector instead of std::set.

jansvoboda11 planned changes to this revision.May 17 2021, 6:52 AM

I think it'd be useful to have direct tests for "detecting used search paths". That might be a nice thing to separate out and test independently. E.g., you could add a remark, -Rused-search-path, which printed a diagnostic the first time any given search path is used in a TU.

Good idea, I'll look into that.

It'd also be really nice to be able to dump this info from the .pcm file. I guess it'd be outside the scope of this patch, but I really think there should be a clang-pcm-analyzer where we can put functionality like that. Somewhat useful for more fine-grained testing; super useful as a tool for debugging.

Agreed, it's in my todo list.

clang/lib/Serialization/ASTReader.cpp
4763–4771

I'll look into making this lazy.

Make PCM parsing lazier

jansvoboda11 planned changes to this revision.May 17 2021, 10:14 AM

Fix test on Windows

jansvoboda11 planned changes to this revision.May 19 2021, 7:22 AM

Split out extra patch

jansvoboda11 marked an inline comment as done.May 21 2021, 8:33 AM

This is now ready for review.

clang/test/ClangScanDeps/Inputs/header-search-pruning/cdb.json
5

The extra numerical include directories here force AST{Writer,Reader} to work with multi-byte blob, providing extra test coverage.

clang/test/ClangScanDeps/header-search-pruning.cpp
28

The being and end include directories here are sentinels that make it easier to make positive/negative occurrence checks with -NEXT.

jansvoboda11 edited the summary of this revision. (Show Details)May 21 2021, 8:51 AM
jansvoboda11 added inline comments.Jun 2 2021, 5:13 AM
clang/tools/clang-scan-deps/ClangScanDeps.cpp
361

This change will be reverted in favor of D103516.

jansvoboda11 planned changes to this revision.Jul 26 2021, 4:54 AM

After speaking with @dexonsmith, the laziness is probably not necessary when dealing with a short bit vector. I'll also explore using llvm::BitVector instead of std::vector<bool>.

Rebase, switch from std::vector<bool> to llvm::BitVector, load the bit vector eagerly.

dexonsmith accepted this revision.Oct 8 2021, 1:25 PM

LGTM, thanks!

This revision is now accepted and ready to land.Oct 8 2021, 1:25 PM
This revision was automatically updated to reflect the committed changes.