This is an archive of the discontinued LLVM Phabricator instance.

[modules] Use `HashBuilder` and `MD5` for the module hash.
ClosedPublic

Authored by arames on May 21 2021, 1:31 PM.

Details

Summary

Per the comments, hash_code values "are not stable to save or
persist", so are unsuitable for the module hash, which must persist
across compilations for the implicit module hashes to match. Note that
in practice, today, hash_code are stable. But this is an
implementation detail, with a clear FIXME indicating we should switch
to a per-execution seed.

The stability of MD5 also allows modules cross-compilation use-cases.
The size_t underlying storage for hash_code varying across platforms
could cause mismatching hashes when cross-compiling from a 64bit
target to a 32bit target.

Note that native endianness is still used for the hash computation. So hashes
will differ between platforms of different endianness.

Diff Detail

Event Timeline

arames created this revision.May 21 2021, 1:31 PM
arames requested review of this revision.May 21 2021, 1:31 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptMay 21 2021, 1:31 PM

why do module hashes need to be stable when cross-compiling?

why do module hashes need to be stable when cross-compiling?

IIUC, the use case is cross-compiling when building the modules, which are then sent to the target to use when compiling other things.

pcc added a subscriber: pcc.May 21 2021, 5:54 PM

Isn't the bug here that module hashing is using hash_code? So shouldn't the correct fix be to use a specific hashing algorithm for module hashes?

Isn't the bug here that module hashing is using hash_code? So shouldn't the correct fix be to use a specific hashing algorithm for module hashes?

Yeah, I tend to agree. I thought modules already used MD5, but maybe just for the AST signature.

Isn't the bug here that module hashing is using hash_code? So shouldn't the correct fix be to use a specific hashing algorithm for module hashes?

Yeah, I tend to agree. I thought modules already used MD5, but maybe just for the AST signature.

For reference, the issue I am trying to fix stems from the hash computed as part of CompilerInvocation::getModuleHash().

Reading the doc around llvm::hash_code, my first reaction was indeed that modules may have to use a different hash.
I thought the current proposal was something to consider, because:

  1. Practically, everything is done on uint64_t, so I would expect switching from size_t to uint64_t to have little to no impact on non-64bit targets.
  2. The current modules code kind of already assumes hash_code is stable over different executions. Plain simple use-case with implicit modules would be severely deficient if the hash was not stable across executions. To be fair, the assumption is different from the one I am trying to fix of 32bit vs 64bit.

Alternatively, we could introduce something along the lines of llvm::stable_hash_code (other name suggestions welcome).
We can probably restructure the code with very few changes. (Really, it looks like we only need a configurable storage type, and allow skipping the execution seed part.)

Would you think this is a better approach ? Or do you have something else to suggest ?

Cheers

why do module hashes need to be stable when cross-compiling?

IIUC, the use case is cross-compiling when building the modules, which are then sent to the target to use when compiling other things.

Yes, that's right.
In particular, we are talking about the hash used for for (prebuilt) implicit modules paths.

arames updated this revision to Diff 348403.May 27 2021, 3:33 PM

Introduce and use stable_hash_code instead of modifying hash_code.

The early commits are missing from the PR. Looking out to do this with arc.

arames updated this revision to Diff 348406.May 27 2021, 3:38 PM

Diff against the parent commit.

This new version is an attempt to have modules not rely on llvm::hash_code, but on a new llvm::stable_hash_code.
I understand modifying ADT/Hashing.h is sensitive, so maybe we need to discuss the high-level approach first.

This new version is an attempt to have modules not rely on llvm::hash_code, but on a new llvm::stable_hash_code.
I understand modifying ADT/Hashing.h is sensitive, so maybe we need to discuss the high-level approach first.

Just seeing the update here, but we already chatted offline. @arames is going to try using llvm::MD5, which is used elsewhere for a stable hash in modules (for the AST signature), rather than introducing stable_hash_*.

arames updated this revision to Diff 349139.Jun 1 2021, 5:11 PM

Use llvm::MD5.

arames added inline comments.Jun 1 2021, 5:14 PM
clang/include/clang/Basic/Sanitizers.h
82

An alternative to having those in class would be to have helpers directly in CompilerInvocation.cpp. I decided to keep them here for two reasons:

  1. Keep them next to hash_value, to make it clear they need to be updated as well.
  2. At least this class does not have all elements going into the hash public.
arames updated this revision to Diff 349141.Jun 1 2021, 5:17 PM

Fix detail namespace.

dexonsmith added a subscriber: Bigcheese.

This looks much cleaner; thanks for the update!

I'm not sure this has decided whether it's writing special case code for building a CompilerInvocation's module context hash, or creating (and using) a new generalized hash builder.

If the latter, I think it'd be worth splitting the hash builder out and landing it separately (in LLVM), then adopting it here; I think that's what you were doing with stable_hash_code (although that had the downsides of reifying city hash and not reusing the Hasher tools we have). This could take a few rounds of review to get the design right / get consensus on... not sure what you're able to invest. A few design questions I think you'd want to think about:

  • What's the interaction between hash_value and (e.g.) updateHash? (It'd be nice if implementing updateHash gave you an implementation of hash_value for free.)
  • Should updateHash be templated on the HasherT, or use type erasure? (E.g., a stripped down HashBuilder type could have a single member that was llvm::unique_function<void(ArrayRef<uint8_t>)>, and all the updateHash() functions are just free functions)
  • Should the hash builder reference a HasherT, or own it?
  • What are the API affordances for hashing structure? (E.g., ensuring that calls to updateHash with "a" and "bc" give a different hash than "ab" and "c".)

On the other hand, if you're not going to have time to generalize this, I'd suggest simplifying the code a bit, merging the "detail" subclass into the builder type. I think ModuleContextHashBuilder might be a better name than ModuleHash; I think in clangBasic, rather than clangSerialization, since it could be used for things other than serialization and doesn't really have any dependencies there.

In either case, there's a decent chunk of new code, and I think you should add some unit test coverage (a new file in llvm/unittests/Support/, in a separate patch, and/or clang/unittests/ (depending on direction)).

A have a bunch of comments inline, some which would only apply depending on which direction you're going with this.

@Bigcheese and @jansvoboda11 , any thoughts?

clang/include/clang/Serialization/ModuleHash.h
19–22 ↗(On Diff #349141)

I think this builder is an interesting idea, and could be generally useful.

  • It'd fit better in llvm/Support I think than hidden here.
  • Taking the hasher by reference might be more flexible. Clients can ask the builder to do some of the hashing, but do some on their own.
36 ↗(On Diff #349141)

I don't think the names HashValueT and Hash are quite right (I tend to think of "hash" as a value as well).

  • Maybe HasherT instead of HashValueT?
  • Maybe HashBuilder instead of Hash?

Nit: I think (?) we more often use class for type template parameters than typename.

43 ↗(On Diff #349141)

For a builder interface, it might be more convenient to take the hasher by reference.

47–51 ↗(On Diff #349141)

Because this doesn't hash the distance of the range, it won't see a difference between {{0}, {1, 2}} and {{0, 1}, {2}}. Sometimes that's fine, but sometimes it isn't. I worry about generalizing this... unless the onus is on the caller to hash the size when appropriate?

54–57 ↗(On Diff #349141)

I'm not sure we want it to be easy to hash pointers by-value. That's appropriate for an in-memory hash table, but not much else.

77–83 ↗(On Diff #349141)

Is there a way of reducing the boilerplate for these update_impl functions, which are currently types? Could they maybe be free functions?

If you hit trouble with using std::enable_if_t in a function template, I seem to remember needing to use something like this in the past:

std::enable_if_t<whatever<T>::value, bool> = false
96–104 ↗(On Diff #349141)

I wonder if this should also hash the size of the StringRef. Maybe it should also be called updateRange.

MD5/SHA1/etc. hashers have the perspective that they're seeing a flat stream of bytes, and I think this streams in the bytes from the string directly. I think the right way to model hashing the context hash is as a serialization, and serializing a string ref requires serializing the size before the content, so that the deserializer knows how big the string is.

clang/lib/Frontend/CompilerInvocation.cpp
4473–4474

I'm not sure MD5 is cryptographically sound either, so it probably makes sense to leave this FIXME in / update it.

4573

Do we just want 64 bits, or do we want the full hash?

clang/lib/Serialization/ModuleFileExtension.cpp
16–18

Seems like this could stay in the .cpp, not sure if moving it to the header is important (or could be done separately to make the functionality changes more obvious).

llvm/include/llvm/Support/VersionTuple.h
168–171

Could this be declared as a friend, similar to hash_value?

Note that if the new builder code doesn't live in llvm/, probably this updateHash free function should be an implementation detail in the CompilerInvocation / ModuleHash code.

arames updated this revision to Diff 369203.Aug 27 2021, 3:13 PM
arames marked an inline comment as not done.

Now using the HashBuilder interface.

arames retitled this revision from Hashing: use a 64-bit storage type on all platforms. to [modules] Use `HashBuilder` and `MD5` for the module hash..Aug 27 2021, 3:14 PM
arames edited the summary of this revision. (Show Details)

I still need to go through earlier comments.

arames marked 3 inline comments as done.Aug 27 2021, 4:28 PM
arames added inline comments.
clang/lib/Frontend/CompilerInvocation.cpp
4573

My guess is either would be fine. I thought combining the two parts was simple enough.

arames added inline comments.Aug 29 2021, 11:50 AM
clang/include/clang/Serialization/ModuleFileExtension.h
92

This obviously needs to be fixed.

arames updated this revision to Diff 369489.Aug 30 2021, 10:44 AM

Fix to native endianness.

arames added inline comments.Aug 30 2021, 10:48 AM
clang/include/clang/Basic/ObjCRuntime.h
485

I have added these in the same line as existing hash_value functions. The idea being others may also need to hash those objects.
Let me know if you would rather move some/all of these locally in CompilerInvocation.cpp.

A couple of suggestions inline, but this basically LGTM.

@jansvoboda11 @Bigcheese : can you take a look as well and confirm this looks reasonable from your perspectives?

clang/include/clang/Basic/ObjCRuntime.h
485

Seems reasonable to me.

An option to consider (maybe in a separate commit?) would be to add:

class HashCodeHasher {
public:
  void update(ArrayRef<uint8_t> Bytes) {
    llvm::hash_code BytesCode = llvm::hash_value(Bytes);
    Code = Code ? BytesCode : llvm::hash_combine(*Code, BytesCode);
  }
  Optional<llvm::hash_code> Code;
};
template <class T,
          std::enable_if_t<is_hashable<T>::value, bool> = true>
llvm::hash_code hash_value(const T &Value) {
  HashBuilder<HashCodeHasher, support::endianness::native> Builder;
  Builder.add(Value);
  return *Builder.Code;
}

Then objects that support HashBuilder also support llvm::hash_value without extra code...

clang/lib/Frontend/CompilerInvocation.cpp
4465–4467

I'd simplify:

// FIXME: Consider using SHA1 instead of MD5.
llvm/include/llvm/Support/HashBuilder.h
167 ↗(On Diff #369489)

Please fix in a separate commit (no need for a review IMO).

arames marked 2 inline comments as done.Aug 31 2021, 2:01 PM
arames added inline comments.
clang/include/clang/Basic/ObjCRuntime.h
485

I created https://reviews.llvm.org/D109024 for this.
If we decide to go for that mechanism, whichever PR lands first, I can update the other to use the new mechanism.

clang/lib/Frontend/CompilerInvocation.cpp
4465–4467

Do you think we should switch to SHA1 now ?
I picked MD5 at random amongst the available hashes.

arames updated this revision to Diff 369794.Aug 31 2021, 2:02 PM
arames marked an inline comment as done.

Address review comments.

jansvoboda11 accepted this revision.Sep 3 2021, 5:16 AM

This looks good to me.

This revision is now accepted and ready to land.Sep 3 2021, 5:16 AM
This revision was automatically updated to reflect the committed changes.