Page MenuHomePhabricator

Add a module Hash in the bitcode and the combined index.

Authored by mehdi_amini on Mar 16 2016, 8:58 AM.



This is the equivalent of a "build-id". The module that is written is hashed on the fly, and the last record will be the hash itself.
It is intended to be used for ThinLTO incremental build, so it is included along with each module entry in the combined index. The function importer will compute the set of imports and where they are coming from. Then before the actual import, the hashes of all modules that we are about to import from are combined together to produce a single hash. This hash identifies uniquely the module we're about to create with the importing and thus can be used as a key for a cache supporting incremental build.

The SHA1 implementation is not very optimized, on the performance part WriteModule() takes 1.9s when disabled and 2.1s when enabled (writing LTO llvm-tblgen with full dbg-info).
Note that is it only performed when we emit the summary information, i.e. only with ThinLTO and only by the frontend. The link stage is just consuming this information.

Diff Detail

Event Timeline

mehdi_amini retitled this revision from to Add a module Hash in the bitcode and the combined index..
mehdi_amini updated this object.
mehdi_amini added reviewers: tejohnson, rafael.
mehdi_amini added a subscriber: llvm-commits.
rafael added inline comments.Mar 16 2016, 9:26 AM

This produces different results in 32 and 64 bit machines, no?

Disable by default...

mehdi_amini updated this object.Mar 16 2016, 1:51 PM
mehdi_amini added inline comments.

Uh, dunno, is it a property of hash_combine_range?

mehdi_amini updated this object.Mar 16 2016, 1:58 PM

Minor bug fix (forgot to nullptr check)

Insert the full 160bits of the SHA1 in the bitcode. Also add a unittests.

vsk added a subscriber: vsk.Mar 21 2016, 5:15 PM

Have you considered adding a textual-IR representation of the module hash?

I can think of two uses for it: 1) we can add -module-summary to the run line in compatibility.ll and get additional coverage, 2) we can debug any future hash-related issues by looking at textual IR.

I would be quite tricky to add the hash at the IR level, because it requires to generate bitcode to get a hash any time you want to print textual IR.

Add missing Inputs/ for the lit test.

tejohnson edited edge metadata.Mar 22 2016, 8:05 AM

Can we make generation of the hashes conditional? They aren't needed if a module cache is not being used for incremental rebuilds. I anticipate we will use a different mechanism with our distributed build system, for example. It is simple to suppress generation of the new MODULE_CODE_HASH record, but would be good to also suppress the larger module string table as well. Maybe have a separate set of record types for when we want to include the hashes?


Why not use uint32_t? That seems to be what is used in SHA1.h.


No longer returning an iterator. Ditto for non-const method below.


Comment needs update: "with the given Hash". Also, is it possible to combine this with the above using a default argument value of {{0}}?


5*i32? Assert below checking for 5 values. And 32*5==160, so that seems like the right number. And update the description of this record accordingly in LLVMBitCodes.h where it just says [unsigned].


Why is this one 6*i32 whereas the MODULE_CODE_HASH is only expected to have 5 i32 (according to implementation, comment above incorrectly says it is 6*i32)?


What is the added VBR8 here and in the 6-bit case below? I don't see anything else being written. And it isn't in the 8-bit case above.




Please make this conditional so it isn't generating a hash if we aren't using a module cache.






Won't these values change frequently as the compiler changes and generates slightly different bitcode? How about just checking if the per module values match the combined index values via FileCheck variables/pattern matching?



We can make it conditional (with a clang flag?), but the desired behavior is to be able to decide on enabling/disabling the caching at link time, so any bitcode static archive distributed would need to have a hash embedded. The user friendly choice would be to enabled it by *default* when -flto=thing, and an explicit option to disable it.

We can make it conditional (with a clang flag?), but the desired behavior is to be able to decide on enabling/disabling the caching at link time, so any bitcode static archive distributed would need to have a hash embedded. The user friendly choice would be to enabled it by *default* when -flto=thing, and an explicit option to disable it.

I'm less concerned about the default than just having a way to disable it.

mehdi_amini edited edge metadata.Mar 22 2016, 4:15 PM
mehdi_amini added a subscriber: dexonsmith.
mehdi_amini marked 13 inline comments as done.

Update after Teresa's comments

Thanks for the review Teresa, I updated accordingly (but for the FileCheck, see comment below).


The hope is that a "void foo()" won't change much, but yeah it's absolutely not nice.
What you suggest is fine for testing that we correctly read and propagate to the combined index.
HoweverI'm not comfortable not checking a value, because we lose coverage for the production of the hash. I.e. writing only zero in the first place would make the test pass :(
Would it be enough to add a comment here saying that is fine updating the value is someone change the IR/bitcode?
I'm gonna CC Duncan for an opinion on this.

Forgot to remote the SHA1 for the MST_ENTRY abbrev.

tejohnson added inline comments.Mar 28 2016, 7:17 AM

s/Table of module/Table of modules/ (here and below)


Now returns iterator not StringRef.


In order for this check to work after the first module id + hash is encountered, LastSeenModulePath needs to be set to the end() after it is used in the for loop below. Otherwise it will silently overwrite the earlier module path's values if not preceded by a new module path.


"Optionally emitted after..."

mehdi_amini marked 6 inline comments as done.

Taking comments into account.

Needs test with new llvm-as option then fed through llvm-bcanalyzer to trigger new matching check.



The test are checking for "match" in the output of llvm-bcanalyzer

In D18213#386848, @joker.eph wrote:

The test are checking for "match" in the output of llvm-bcanalyzer

Which test? Is there a new test missing from the patch? The hash generation is off by default so I assume you need to add a new one (or modify an existing) to pass the new -module-hash option to llvm-as. Otherwise I don't think the checking in llvm-bcanalyzer will ever kick in.


Is the MODULE_CODE_HASH ever expected outside a module block? If not, can the first part of the if be turned into an assert inside the if body?

Add back the test, I lost it from the commit at some point

mehdi_amini added inline comments.Mar 30 2016, 8:03 AM

Came up with a different approach!

  1. llvm-bcanalyzer will re-hash and check for mismatch
  2. using a temporary file to collect the hash, I check that whatever is in the module is correctly propagated into the combined index.

I think the code a reused across Blocks, so you need to know that you are in a module block to interpret the code as being the hash. Did I misunderstand your point?


tejohnson added inline comments.Mar 30 2016, 8:18 AM

Weird, I still don't see the test when I go to, only when I click on this comment which opens up

In any case, this sounds good. The version of the test I am seeing doesn't implement it yet (assume that is WIP).


Ok, got it, checking both does make sense.

Try to add the test, if it still does not work I'll make some coffee first...

I'm confident this one will work...

tejohnson accepted this revision.Mar 30 2016, 8:33 AM
tejohnson edited edge metadata.
In D18213#386957, @joker.eph wrote:

I'm confident this one will work...

Yep, it is there now!

LGTM, but as Duncan suggested, do the NFC commit of the clang-formatted LLVMBitCodes.h first.

This revision is now accepted and ready to land.Mar 30 2016, 8:33 AM