This is an archive of the discontinued LLVM Phabricator instance.

[clang][deps] Prevent PCH validation failures by padding minimized files
AbandonedPublic

Authored by jansvoboda11 on Jun 17 2021, 7:44 AM.

Details

Summary

This patch fixes dependency scanning of a TU with prebuilt modular/PCH dependencies.

Such modules/PCH might have been built using non-minimized files, while dependency scanner does use the minimized versions of source files. Implicit build doesn't like the discrepancy in file sizes and errors out.

The issues is worked around by padding the minimized files with whitespace in such scenarios. This approach throws away one advantage of source minimization (the memory savings), but still keeps the benefits of faster preprocessing/lexing.

The padding solution comes with a 14% runtime regression compared to not padding the inputs. (I tested this by forcing padding and running clang-scan-deps -j 20 on LLVM's compile_commands.json with modules disabled.) Interestingly, padding only with newlines (compared to spaces), the regression is 39%.

Depends on D104462.

Diff Detail

Event Timeline

jansvoboda11 requested review of this revision.Jun 17 2021, 7:44 AM
jansvoboda11 created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2021, 7:44 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

This patch fixes dependency scanning of a TU with prebuilt modular/PCH dependencies.

Such modules/PCH might have been built using non-minimized files, while dependency scanner does use the minimized versions of source files. Implicit build doesn't like the discrepancy in file sizes and errors out.

The issues is worked around by padding the minimized files with whitespace in such scenarios. This approach throws away one advantage of source minimization (the memory savings), but still keeps the benefits of faster preprocessing/lexing.

I'm not sure this is the right approach.

  • Don't PCHs sometimes validate based on hashes of inputs instead of size? At least, that has been discussed, and this patch would block making the change.
  • Can PCHs embed inputs, like modules can? How would that work (or not) minimized sources generally?
  • What happens if the PCH has a delayed diagnostic, triggered when parsing the next thing? Will the fix-it point in the wrong place? (Are there other things that depend on reading the original sources when reading a PCH?)

A few other ideas:

  1. Disable PCH validation when minimizing. (Is that less robust than your current workaround? If so, can you explain why?)
  2. Use the original PCH header in the scanning -cc1s (translate -include-pch to -include) and switch back in the generated -cc1s (back to -include-pch).
  3. Embed instructions in the PCH for how to build it, and make a "minimized" version of the PCH.

A few other ideas:

  1. Disable PCH validation when minimizing. (Is that less robust than your current workaround? If so, can you explain why?)
  2. Use the original PCH header in the scanning -cc1s (translate -include-pch to -include) and switch back in the generated -cc1s (back to -include-pch).
  3. Embed instructions in the PCH for how to build it, and make a "minimized" version of the PCH.

Two more options to consider:

  1. If a compilation uses a PCH, use the original files for any input file transitively referenced by the PCH. Can be done by using a filesystem overlay that skips over the minimization layer for those files, or could tell the minimization layer not to minimize those files. You can figure out the files by adding a free function that preloads the PCH and extracts out all the input files.
  2. Add support for building/using a PCH, and only support PCHes that are built this way. E.g., add a -include-pch-auto option or something. Scanner would use -include, spit out a command for building a PCH using explicit modules, and the generated -cc1s would use -include-pch to refer to it. This way all the modules are generated "in house".

Thanks for sharing your ideas! I've left my initial thoughts below, but I want to revisit this and think about it some more.

I'm not sure this is the right approach.

  • Don't PCHs sometimes validate based on hashes of inputs instead of size? At least, that has been discussed, and this patch would block making the change.

The hash-based validation only kicks in when modification times don't match. The size of input files is always* validated.

  • Can PCHs embed inputs, like modules can? How would that work (or not) minimized sources generally?

I think PCHs can embed inputs the same way modules can. When reading an AST file, the embedded input files get registered in the current compilation along with their buffers via SourceManager::createFileID.
But it doesn't seem to be propagated to the FileManager level, which the input file size check uses. I might need to experiment with this a bit to fully understand how this works in practice.

  • What happens if the PCH has a delayed diagnostic, triggered when parsing the next thing? Will the fix-it point in the wrong place? (Are there other things that depend on reading the original sources when reading a PCH?)

Can you point me to the code that handles delayed diagnostics? I can't find anything related in ASTWriter.

A few other ideas:

  1. Disable PCH validation when minimizing. (Is that less robust than your current workaround? If so, can you explain why?)

I think PCH validation (and detection of out-of-date input files) is still useful for the dependency scanner as it might help debugging build failures (e.g. if the input files changed between explicit PCH build and TU dependency scan).
Compared to disabling PCH validation, padding of minimized files is fairly local change that isn't susceptible to concurrency bugs and integrates with the rest of the implicit build infrastructure fairly well IMO.

  1. Use the original PCH header in the scanning -cc1s (translate -include-pch to -include) and switch back in the generated -cc1s (back to -include-pch).

The two separate dependency scans (one for PCH, one for TU) could end up using different context hashes and we don't have an easy way to detect this.
If they both use a common module A, the explicitly-built PCH would contains version 1 of A, while the TU would attempt to build with version 2. This is currently not supported in explicit builds AFAIK.
The current implementation works around this by always choosing the module present in the explicitly-built module. How could we make this situation work here?

  1. Embed instructions in the PCH for how to build it, and make a "minimized" version of the PCH.

That could work. This makes me think if we could emit a minimized PCH during the PCH scan and just reuse it in the TU scan (similarly how we're currently reusing the explicitly-built PCH, but with the file size issues gone). Though passing state from one dependency scan to another through the filesystem might be unreliable.

Two more options to consider:

  1. If a compilation uses a PCH, use the original files for any input file transitively referenced by the PCH. Can be done by using a filesystem overlay that skips over the minimization layer for those files, or could tell the minimization layer not to minimize those files. You can figure out the files by adding a free function that preloads the PCH and extracts out all the input files.

I tend to prefer this option, since it integrates fairly well with the current way we build PCHs in XCBuild.
I have a WIP patch that implements this: D104536. I plan to run some benchmarks to see the performance impact compared to minification + padding.

  1. Add support for building/using a PCH, and only support PCHes that are built this way. E.g., add a -include-pch-auto option or something. Scanner would use -include, spit out a command for building a PCH using explicit modules, and the generated -cc1s would use -include-pch to refer to it. This way all the modules are generated "in house".

This would be the ideal solution IMO, but needs build system support.


BTW: I noticed the file size stored in AST also plays role when deciding whether to import or include header. HeaderSearch::findModuleForHeader calls HeaderSearch::getExistingFileInfo which pulls information (such as the file size) from ExternalSource (a.k.a. ASTReader).

jansvoboda11 abandoned this revision.Jul 15 2021, 2:08 PM