This is an archive of the discontinued LLVM Phabricator instance.

Module: hash the pcm content and use it as SIGNATURE for implicit modules.
ClosedPublic

Authored by dexonsmith on Dec 12 2016, 3:09 PM.

Details

Summary

We change ASTFileSignature from a random 32-bit number to the actual SHA hash of the pcm content.

1> Definition of ASTFileSignature is moved to Basic/Module.h so Module and ASTSourceDescriptor can use it. Before this patch both are using uint64_t as the type of the signature.
2> Add DIAGNOSTIC_OPTIONS_BLOCK and make sure it is the last block of the pcm file. This block contains DIAGNOSTIC_OPTIONS and SIGNATURE.
We are using this patch to solve a PCH+Modules issue: PCH are handled by the build system and modules are handled by the compiler, when the compiler overwrites a pcm file with the same content except the diagnostic differences, we can still treat the PCH as valid and continue using it. This requires us not hashing the diagnostic differences. Moving the diagnostic differences to the end of the pcm makes the task easy:

In a few records, we save bit positions in the pcm file, if the size of the diagnostic options is different, the bit positions can be off.
We only need to hash the content till the start of the DIAGNOSTIC_OPTIONS_BLOCK.

3> When we hash the pcm content, there is no point in saving the file size and modification time for the imported module file. The patch will use 0 for both. And the validation will not check the size differences or the modification time differences.
4> In GlobalModuleIndexBuilder, when the signature is non-zero, we check the signature instead of file size/modification time for imported modules.
5> Debug info has a 64-bit slot for dwo id, this patch uses the lower 64 bits of the SHA hash.

I want to gather some feedback on the general direction of this. Given that this is a big patch, I am open to separate it into smaller patches.

Cheers,
Manman

Diff Detail

Event Timeline

manmanren updated this revision to Diff 81145.Dec 12 2016, 3:09 PM
manmanren retitled this revision from to Module: hash the pcm content and use it as SIGNATURE for implicit modules..
manmanren updated this object.
manmanren added reviewers: rsmith, benlangmuir, aprantl.
manmanren added subscribers: bruno, cfe-commits, dexonsmith.
aprantl added inline comments.Dec 12 2016, 3:28 PM
include/clang/Serialization/ASTBitCodes.h
322

FYI, could be fixed in a separate commit for the entire file: These uses of \brief are redundant and we don't need them any more. We use the autobrief option in our Doxygen configuration file.

lib/CodeGen/CGDebugInfo.cpp
2012–2014

Just wanted to comment that this change LGTM.

lib/CodeGen/ObjectFilePCHContainerOperations.cpp
247

Maybe factor this out into a getDwoId() helper function?

rsmith edited edge metadata.Dec 12 2016, 4:13 PM

Generally this seems reasonable to me. I don't see any particular need to split this patch up into smaller pieces.

include/clang/Serialization/ASTBitCodes.h
256

Add a comment describing this block.

Please also give this a name that describes its purpose (to hold records that should not be part of the signature). The signature is not a diagnostic option.

323

Move this and DIAGNOSTIC_OPTIONS to a separate enum holding record types for the new block.

include/clang/Serialization/ASTWriter.h
433–438

Fix indentation

436

Give the parameter a name

504–507

Fix indentation

537–540

Fix indentation

include/clang/Serialization/Module.h
19

Redundant include?

lib/Serialization/ASTWriter.cpp
1373

I don't think it makes sense to use ImplicitModules to control this. Doing that for the old signature computation was at best a hack.

If there is no measurable performance difference from the hashing, we should just do it unconditionally in all modes. Otherwise, there should be a separate flag to control it; we should probably force that flag to be enabled when the frontend implicitly builds a module and inserts it into the module cache, but otherwise let the user control it as they see fit.

dexonsmith added inline comments.Dec 12 2016, 5:42 PM
include/clang/Serialization/ASTBitCodes.h
256

Also add a comma to the end of the line to avoid future additions to the enum from touching this line.

manmanren updated this revision to Diff 83068.Jan 4 2017, 9:55 AM
manmanren edited edge metadata.

Addressing review comments!

I still need to measure the performance impact of calculating the hash then decide how to enable this.

manmanren updated this revision to Diff 83999.Jan 11 2017, 10:27 AM

Rebase on top of r291686.

Bruno helped on collecting performance data for this patch:
$ cat cocoa-test.h
#import <Cocoa/Cocoa.h>

$ rm -rf tmp; clang -x objective-c -isysroot xcrun --sdk macosx --show-sdk-path -fblocks -fmodules -fmodules-cache-path=tmp cocoa-test.h -fsyntax-only

clang-release (10 runs)

name avg min med max SD SD% total
user 2.0023 1.9804 1.9994 2.0408 0.0149 0.74 20.0225
sys 0.3601 0.3547 0.3505 0.4454 0.0287 7.98 3.6007
wall 2.5432 2.4291 2.4264 3.3604 0.2737 10.76 25.4324

clang-modhash-release (10 runs)

name avg min med max SD SD% total
user 2.1101 2.0935 2.1095 2.1274 0.0112 0.53 21.1012
sys 0.3485 0.3462 0.3556 0.3476 0.0052 1.50 3.4854
wall 2.5614 2.5163 2.5605 2.6657 0.0419 1.63 25.6138

This ~5% slowdown for the total time of compiling all the modules (30+ modules).

A similar test for one object file in selfhost clang:

[TemplateName.cpp.o] clang-release (10 runs)

name avg min med max SD SD% total
user 7.4332 7.3716 7.4369 7.4976 0.0444 0.60 74.3318
sys 0.5646 0.5512 0.5688 0.5570 0.0161 2.86 5.6464
wall 8.3357 8.1675 8.2991 8.2570 0.3240 3.89 83.3568

[TemplateName.cpp.o] clang-modhash-release (10 runs)

name avg min med max SD SD% total
user 7.6911 7.6202 7.6979 7.7779 0.0418 0.54 76.9114
sys 0.5531 0.5437 0.5520 0.5622 0.0105 1.89 5.5315
wall 8.4598 8.3868 8.4839 8.5497 0.0455 0.54 84.5977

~3.5% slowdown for total of modules:

$ ls module.cache/1UKP65OUKXURH/
Clang_AST-1XMT5DOVQWGLJ.pcm LLVM_C-2WKTHJHA6SDF0.pcm LLVM_Pass-HNKP05GYNHFQ.pcm _Builtin_stddef_max_align_t-2T552E4NSAIDI.pcm
Clang_Basic-1XMT5DOVQWGLJ.pcm LLVM_Config_ABI_Breaking-2FT1AFJTZ51QZ.pcm LLVM_Support_DataTypes-2FT1AFJTZ51QZ.pcm modules.idx
Darwin-399Z2LWDSMSTV.pcm LLVM_IR-HNKP05GYNHFQ.pcm LLVM_Utils-HNKP05GYNHFQ.pcm std-1337JMP4UGPIP.pcm

Note that the cache was wiped out right before every tests instance.

Thanks Bruno. I am going to guard the hashing with a CC1 option (something like fmodules-hash-content) and force that flag to be enabled when the frontend implicitly builds a module. We will try to improve the hashing performance later (either improving SHA1 or using MD5 instead of SHA1).

Manman

manmanren updated this revision to Diff 84406.Jan 13 2017, 3:56 PM
manmanren marked 7 inline comments as done.

Add CC1 option to control hashing of the module file content.

manmanren marked an inline comment as done.Jan 13 2017, 3:58 PM
manmanren added inline comments.
include/clang/Serialization/Module.h
19

Definition of ASTFileSignature is moved from Serialization/Module.h to Basic/Module.h so we need to have this include to get the definition.

lib/Serialization/ASTWriter.cpp
1373

Use HSOpts.ModulesHashContent to guard the hashing of module file content. This flag is set to true when we implicitly build a module.

manmanren marked an inline comment as done.Jan 18 2017, 1:47 PM

Ping :]
I am hoping to wrap this up this week. Thanks,
Manman

dexonsmith commandeered this revision.Jan 28 2017, 8:05 PM
dexonsmith added a reviewer: manmanren.

Taking this over to rebase.

dexonsmith updated this revision to Diff 86193.Jan 28 2017, 8:09 PM

I've rebased the patch on ToT and cleaned a few things up.

Sadly, the clang/test/Modules/diagnostic-options-out-of-date.m test fails because r293123 makes the command-line diagnostic options affect the AST, and thus makes -Wconversion vs -Wno-conversion change ASTFileSignature.

I'm not sure how to move forward.

dexonsmith updated this revision to Diff 87256.Feb 6 2017, 10:36 AM

Rebased past r293123, which filled in diagnostic pragma records for modules. I've moved the DIAG_PRAGMA_MAPPINGS to the unhashed block, as suggested by Richard, so the tests now pass.

Richard, can you take a look?

rsmith accepted this revision.Feb 21 2017, 1:26 PM
rsmith added inline comments.
clang/include/clang/Serialization/ASTBitCodes.h
258 ↗(On Diff #87256)

This comment is out of date. Maybe just point to the UnhashedControlBlockRecordTypes for the definitive list of records within this block?

clang/lib/Serialization/ASTReader.cpp
2240–2241 ↗(On Diff #87256)

I guess the reason to do this is because reading that block depends on certain things in this block having been already read, and reading other things in this block depends on that block having been read? A comment explaining that would be useful.

4014–4015 ↗(On Diff #87256)

We shouldn't return Failure without producing an error message.

This revision is now accepted and ready to land.Feb 21 2017, 1:26 PM
dexonsmith closed this revision.Mar 13 2017, 12:00 PM
dexonsmith marked 3 inline comments as done.

Thanks for the review Richard! Committed in r297655 with those changes.

Sorry for the long delay on this. Somehow I missed until today the review after my ultimate ping. I'd like to blame Phab, but it's more likely that I just failed at email :/.