Page MenuHomePhabricator

Hashing: use a 64-bit storage type on all platforms.
Needs ReviewPublic

Authored by arames on May 21 2021, 1:31 PM.

Details

Summary

size_t varying across platforms can cause issues, for example mismatching
hashes when cross-compiling modules from a 64bit target to a 32bit target.

Although the comments note that hash_code values "are not stable to save or
persist", it is effectively used in such cases today, for example for implicit
module hashes.

Since hashing mechanisms in practice already use uint64_t for computations,
use uint64_t instead of size_t to store the value of hash_code.
This is similar to earlier changes in c0445098519defc4bd13624095fac92977c9cfe4.

Diff Detail

Event Timeline

arames created this revision.May 21 2021, 1:31 PM
arames requested review of this revision.May 21 2021, 1:31 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptMay 21 2021, 1:31 PM

why do module hashes need to be stable when cross-compiling?

why do module hashes need to be stable when cross-compiling?

IIUC, the use case is cross-compiling when building the modules, which are then sent to the target to use when compiling other things.

pcc added a subscriber: pcc.May 21 2021, 5:54 PM

Isn't the bug here that module hashing is using hash_code? So shouldn't the correct fix be to use a specific hashing algorithm for module hashes?

Isn't the bug here that module hashing is using hash_code? So shouldn't the correct fix be to use a specific hashing algorithm for module hashes?

Yeah, I tend to agree. I thought modules already used MD5, but maybe just for the AST signature.

Isn't the bug here that module hashing is using hash_code? So shouldn't the correct fix be to use a specific hashing algorithm for module hashes?

Yeah, I tend to agree. I thought modules already used MD5, but maybe just for the AST signature.

For reference, the issue I am trying to fix stems from the hash computed as part of CompilerInvocation::getModuleHash().

Reading the doc around llvm::hash_code, my first reaction was indeed that modules may have to use a different hash.
I thought the current proposal was something to consider, because:

  1. Practically, everything is done on uint64_t, so I would expect switching from size_t to uint64_t to have little to no impact on non-64bit targets.
  2. The current modules code kind of already assumes hash_code is stable over different executions. Plain simple use-case with implicit modules would be severely deficient if the hash was not stable across executions. To be fair, the assumption is different from the one I am trying to fix of 32bit vs 64bit.

Alternatively, we could introduce something along the lines of llvm::stable_hash_code (other name suggestions welcome).
We can probably restructure the code with very few changes. (Really, it looks like we only need a configurable storage type, and allow skipping the execution seed part.)

Would you think this is a better approach ? Or do you have something else to suggest ?

Cheers

why do module hashes need to be stable when cross-compiling?

IIUC, the use case is cross-compiling when building the modules, which are then sent to the target to use when compiling other things.

Yes, that's right.
In particular, we are talking about the hash used for for (prebuilt) implicit modules paths.

arames updated this revision to Diff 348403.May 27 2021, 3:33 PM

Introduce and use stable_hash_code instead of modifying hash_code.

The early commits are missing from the PR. Looking out to do this with arc.

arames updated this revision to Diff 348406.May 27 2021, 3:38 PM

Diff against the parent commit.

This new version is an attempt to have modules not rely on llvm::hash_code, but on a new llvm::stable_hash_code.
I understand modifying ADT/Hashing.h is sensitive, so maybe we need to discuss the high-level approach first.

This new version is an attempt to have modules not rely on llvm::hash_code, but on a new llvm::stable_hash_code.
I understand modifying ADT/Hashing.h is sensitive, so maybe we need to discuss the high-level approach first.

Just seeing the update here, but we already chatted offline. @arames is going to try using llvm::MD5, which is used elsewhere for a stable hash in modules (for the AST signature), rather than introducing stable_hash_*.

arames updated this revision to Diff 349139.Jun 1 2021, 5:11 PM

Use llvm::MD5.

arames added inline comments.Jun 1 2021, 5:14 PM
clang/include/clang/Basic/Sanitizers.h
81

An alternative to having those in class would be to have helpers directly in CompilerInvocation.cpp. I decided to keep them here for two reasons:

  1. Keep them next to hash_value, to make it clear they need to be updated as well.
  2. At least this class does not have all elements going into the hash public.
arames updated this revision to Diff 349141.Jun 1 2021, 5:17 PM

Fix detail namespace.

dexonsmith added a subscriber: Bigcheese.

This looks much cleaner; thanks for the update!

I'm not sure this has decided whether it's writing special case code for building a CompilerInvocation's module context hash, or creating (and using) a new generalized hash builder.

If the latter, I think it'd be worth splitting the hash builder out and landing it separately (in LLVM), then adopting it here; I think that's what you were doing with stable_hash_code (although that had the downsides of reifying city hash and not reusing the Hasher tools we have). This could take a few rounds of review to get the design right / get consensus on... not sure what you're able to invest. A few design questions I think you'd want to think about:

  • What's the interaction between hash_value and (e.g.) updateHash? (It'd be nice if implementing updateHash gave you an implementation of hash_value for free.)
  • Should updateHash be templated on the HasherT, or use type erasure? (E.g., a stripped down HashBuilder type could have a single member that was llvm::unique_function<void(ArrayRef<uint8_t>)>, and all the updateHash() functions are just free functions)
  • Should the hash builder reference a HasherT, or own it?
  • What are the API affordances for hashing structure? (E.g., ensuring that calls to updateHash with "a" and "bc" give a different hash than "ab" and "c".)

On the other hand, if you're not going to have time to generalize this, I'd suggest simplifying the code a bit, merging the "detail" subclass into the builder type. I think ModuleContextHashBuilder might be a better name than ModuleHash; I think in clangBasic, rather than clangSerialization, since it could be used for things other than serialization and doesn't really have any dependencies there.

In either case, there's a decent chunk of new code, and I think you should add some unit test coverage (a new file in llvm/unittests/Support/, in a separate patch, and/or clang/unittests/ (depending on direction)).

A have a bunch of comments inline, some which would only apply depending on which direction you're going with this.

@Bigcheese and @jansvoboda11 , any thoughts?

clang/include/clang/Serialization/ModuleHash.h
19–22

I think this builder is an interesting idea, and could be generally useful.

  • It'd fit better in llvm/Support I think than hidden here.
  • Taking the hasher by reference might be more flexible. Clients can ask the builder to do some of the hashing, but do some on their own.
36

I don't think the names HashValueT and Hash are quite right (I tend to think of "hash" as a value as well).

  • Maybe HasherT instead of HashValueT?
  • Maybe HashBuilder instead of Hash?

Nit: I think (?) we more often use class for type template parameters than typename.

43

For a builder interface, it might be more convenient to take the hasher by reference.

47–51

Because this doesn't hash the distance of the range, it won't see a difference between {{0}, {1, 2}} and {{0, 1}, {2}}. Sometimes that's fine, but sometimes it isn't. I worry about generalizing this... unless the onus is on the caller to hash the size when appropriate?

54–57

I'm not sure we want it to be easy to hash pointers by-value. That's appropriate for an in-memory hash table, but not much else.

77–83

Is there a way of reducing the boilerplate for these update_impl functions, which are currently types? Could they maybe be free functions?

If you hit trouble with using std::enable_if_t in a function template, I seem to remember needing to use something like this in the past:

std::enable_if_t<whatever<T>::value, bool> = false
96–104

I wonder if this should also hash the size of the StringRef. Maybe it should also be called updateRange.

MD5/SHA1/etc. hashers have the perspective that they're seeing a flat stream of bytes, and I think this streams in the bytes from the string directly. I think the right way to model hashing the context hash is as a serialization, and serializing a string ref requires serializing the size before the content, so that the deserializer knows how big the string is.

clang/lib/Frontend/CompilerInvocation.cpp
4483–4484

I'm not sure MD5 is cryptographically sound either, so it probably makes sense to leave this FIXME in / update it.

4580

Do we just want 64 bits, or do we want the full hash?

clang/lib/Serialization/ModuleFileExtension.cpp
16–18

Seems like this could stay in the .cpp, not sure if moving it to the header is important (or could be done separately to make the functionality changes more obvious).

llvm/include/llvm/Support/VersionTuple.h
164–167

Could this be declared as a friend, similar to hash_value?

Note that if the new builder code doesn't live in llvm/, probably this updateHash free function should be an implementation detail in the CompilerInvocation / ModuleHash code.