This is an archive of the discontinued LLVM Phabricator instance.

[Modules][PCH] Hash input files content
ClosedPublic

Authored by bruno on Sep 5 2019, 5:50 PM.

Details

Summary

When files often get touched during builds, the mtime based validation
leads to different problems in implicit modules builds, even when the
content doesn't actually change:

  • Modules only: module invalidation due to out of date files. Usually

causing rebuild traffic.

  • Modules + PCH: build failures because clang cannot rebuild a module if

it comes from building a PCH.

  • PCH: build failures because clang cannot rebuild a PCH in case one of

the input headers has different mtime.

This patch proposes hashing the content of input files (headers and
module maps), which is performed during serialization time. When looking
at input files for validation, clang only computes the hash in case
there's a mtime mismatch.

I've tested a couple of different hash algorithms availble in LLVM in
face of building modules+pch for #import <Cocoa/Cocoa.h>:

  • hash_code: performace diff within the noise, total module cache

increased by 0.07%.

  • SHA1: 5% slowdown. Haven't done real size measurements, but it'd be

BLOCK_ID+20 bytes per input file, instead of BLOCK_ID+8 bytes from
hash_code.

  • MD5: 3% slowdown. Like above, but BLOCK_ID+16 bytes per input file.

Given the numbers above, the patch uses hash_code. The patch also
improves invalidation error msgs to point out which type of problem the
user is facing: "mtime", "size" or "content".

rdar://problem/29320105

Diff Detail

Event Timeline

bruno created this revision.Sep 5 2019, 5:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 5 2019, 5:50 PM

Nice! Are you planning to address the FIXME's in a later update of this patch?

clang/lib/Serialization/ASTWriter.cpp
1768

Why is this not a uint64_t?

clang/test/Modules/validate-file-content.m
15

Are there other tests that use touch and file timestamps in general? I wonder if this could cause random issues when building on NFS or filesystems with a really low timestamp granularity. I don't have a better suggestion either since the mtime is part of the code being tested here.

bruno marked 2 inline comments as done.Sep 6 2019, 1:33 PM

Nice! Are you planning to address the FIXME's in a later update of this patch?

The FIXME's are just replaying what the code around does, both error dropping and FileEntryRef are recent changes that didn't make all the way through yet. I should help improving that at some point cause it will overall be good for modules, but those changes are orthogonal to this patch.

clang/lib/Serialization/ASTWriter.cpp
1768

Serializing a uint64_t directly instead of two uint32_t gives me a slightly bigger final cache. One could argue that the value is negligible (+800 bytes for a 40MB cache), but here's the rationale :)

clang/test/Modules/validate-file-content.m
15

Are there other tests that use touch and file timestamps in general?

Yes, I found quite a few around the tests. However, I just tested touch -m -a -A 01 on linux and it seems that -A isn't supported, maybe I'll just change the entire mtime for some time in the future, that's likely more portable.

I wonder if this could cause random issues when building on NFS or filesystems with a really low timestamp granularity. I don't have a better suggestion either since the mtime is part of the code being tested here.

Right, perhaps changing the mtime completely might help here.

bruno updated this revision to Diff 219166.Sep 6 2019, 1:51 PM

Update testcase to use a more portable version of touch

bruno updated this revision to Diff 219168.Sep 6 2019, 1:53 PM

Remove pasto from one of the testcases

Harbormaster completed remote builds in B37870: Diff 219168.
ributzka added inline comments.Sep 6 2019, 3:40 PM
clang/lib/Serialization/ASTWriter.cpp
1768

Creating an abbrev might fix this, because this should be a fixed size field. The generic emission code for a record without an abbrev is not very space efficient for single fields.

bruno marked an inline comment as done.Sep 10 2019, 2:15 PM
bruno added inline comments.
clang/lib/Serialization/ASTWriter.cpp
1768

Right, I did more experiments, and although VBR::64 or Fixed::64 aren't supported, I could shave off 10KB by using abbrev + two ::Fixed, 32. Using two ::VBR:32 is actually 400 bytes worse than the current approach. Thanks for the suggestion, gonna update the patch.

bruno updated this revision to Diff 219607.Sep 10 2019, 2:20 PM

Update the patch to use two ::Fixed, 32 in abbrev to encode hash.

Did you try xxHash64?

Did you try xxHash64?

No, and I'm not planning to. I believe hash_code is enough here, already has low overhead on performance and size, and bunch of prior use elsewhere in LLVM/Clang. If there's a strong reason to do it I wanna know why that's the case first.

xxhash64 is apparently faster than MD5 and SHA1, and produces good quality hashes. I am not sure about the hash quality of hash_code for this purpose.

rsmith accepted this revision.Oct 1 2019, 5:44 PM

LGTM

This revision is now accepted and ready to land.Oct 1 2019, 5:44 PM
This revision was automatically updated to reflect the committed changes.