This is an archive of the discontinued LLVM Phabricator instance.

Differential D106910

[Support]: Introduce the `HashBuilder` interface.
ClosedPublic

Authored by arames on Jul 27 2021, 12:59 PM.

Details

Reviewers

t.p.northover
jansvoboda11
Bigcheese
dexonsmith

Commits

rG1076082a0d97: [Support]: Introduce the `HashBuilder` interface.

Summary

The HashBuilder interface allows conveniently building hashes of various data
types, without relying on the underlying hasher type to know about hashed data
types.
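
As a rough illustration of the split described in the summary, here is a minimal standalone sketch. `ToyHasher` (64-bit FNV-1a) and `ToyHashBuilder` are hypothetical names invented for this note and are not the classes from the patch. The hasher only ever sees bytes, while the builder decides how each data type is turned into bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical byte-oriented hasher (64-bit FNV-1a). It knows nothing about
// the types being hashed.
struct ToyHasher {
  std::uint64_t State = 14695981039346656037ULL;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Ptr[I];
      State *= 1099511628211ULL;
    }
  }
  std::uint64_t final() const { return State; }
};

// Hypothetical builder: turns typed values into bytes for any such hasher.
template <typename HasherT> class ToyHashBuilder {
  HasherT Hasher;

public:
  ToyHashBuilder &add(std::uint32_t V) {
    Hasher.update(reinterpret_cast<const std::uint8_t *>(&V), sizeof(V));
    return *this;
  }
  ToyHashBuilder &add(const std::string &S) {
    add(static_cast<std::uint32_t>(S.size())); // size first, then contents
    Hasher.update(reinterpret_cast<const std::uint8_t *>(S.data()), S.size());
    return *this;
  }
  template <typename T, typename U>
  ToyHashBuilder &add(const std::pair<T, U> &P) {
    return add(P.first).add(P.second);
  }
  auto final() { return Hasher.final(); }
};

// Hashes a (number, string) pair through the builder.
inline std::uint64_t hashPair(std::uint32_t N, const std::string &S) {
  return ToyHashBuilder<ToyHasher>().add(std::make_pair(N, S)).final();
}
```

Supporting a new data type in this sketch means adding an `add` overload to the builder; the hasher itself stays untouched.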

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

arames created this revision. Jul 27 2021, 12:59 PM

Herald added subscribers: dexonsmith, hiraditya, mgorny. Jul 27 2021, 12:59 PM

arames requested review of this revision. Jul 27 2021, 12:59 PM

Herald added a project: Restricted Project. Jul 27 2021, 12:59 PM

Herald added a subscriber: llvm-commits.

This is a follow-up to discussion about https://reviews.llvm.org/D102943.
This is a proposal to expose the HashBuilder helper that will then be used for the modules hashing.

Harbormaster completed remote builds in B116517: Diff 362146. Jul 27 2021, 3:22 PM

Thanks, this looks like a great start to me. Lots of comments inline (many of them nitpicks).

Expose a raw pointer update interface for hashes.
The new method update(const uint8_t *Ptr, size_t Size) is an alternative to
the existing update(ArrayRef<uint8_t> Data). It will allow the incoming
HashBuilder interface to not depend on ArrayRef.

If we add pointer+size APIs to the hashers, I think that should be in a separate prep patch...

... but IMO, ArrayRef is a safe/clean encapsulation of pointer+size for HashBuilder to use. Can you explain why you want to avoid it?

llvm/include/llvm/ADT/ArrayRef.h
17 ↗	(On Diff #362146)	I think HashBuilder will probably be needed in fewer places than ArrayRef. It might be better to invert the includes (and move the new functions to HashBuilder.h).
575–602 ↗	(On Diff #362146)	I suggest moving these to member functions of `HashBuilder`.
llvm/include/llvm/ADT/StringRef.h
15 ↗	(On Diff #362146)	Even more so than ArrayRef.h, we try really hard to avoid adding includes to StringRef.h since it's so widely included. I suggest inverting the include relationship.
966 ↗	(On Diff #362146)	(I suggest an overload in `HashBuilder` rather than a free function)
llvm/include/llvm/Support/HashBuilder.h
22	You should be able to remove this once you're including StringRef.h
140	StringRef would be more generally useful; I suggest that instead of std::string. (I don't think we usually bother handling other character types in LLVM... I suggest leaving it out, unless/until you have a specific use case that needs it?)
144–147	Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a pair.
148–153	Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a tuple.
167	Neat, I hadn't seen this pattern before. Can the variable name be skipped? (void)std::tuple<const Ts &...>{(update(Args), Args)...}; If not, please add `(void)Unused;` to suppress diagnostics.
170–173	I suggest adding a single-element overload that takes a generic range, calling `llvm::adl_begin()` and `llvm::adl_end()` to extract iterators. Also, this isn't going to work for any old input iterator. I suggest:
Name the parameter "ForwardIteratorT".
Add a test that confirms std::list iterators will work.
Add an enable_if to get a nice compile error for non-forward iterators.
Another option is to use random-access iterators and leave it for a future patch to handle forward iterators.
177	I suggest taking / using an `ArrayRef<uint8_t>` here. It'd be a nice convenience to add an overload for `updateBytes(StringRef)` as well -- something I've wanted a few times when using the Hasher interfaces with `MemoryBuffer::getBuffer()` -- which can just cast over to `ArrayRef<uint8_t>`.
199	I think this will have to be a `std::distance` call if you handle arbitrary forward iterators (not sure that's necessary). The template parameters could use an update in either case.
llvm/unittests/Support/HashBuilderTest.cpp
52	Lots of "invalid case style" warnings coming from clang-tidy in this file. Please follow the style conventions at https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (e.g., use `C` for this variable instead of `c`), if nothing else to make the patch easier to read.
76	Rather than duplicating the tests, I suggest using http://google.github.io/googletest/advanced.html#typed-tests to iterate through MD5, SHA1, and SHA256.
226–229	I suggest comparing these both (or at least one of them) against: update(1); update("string"); so that you're confirming the hash is updated correctly, not just the same way as each other (unless you tested that elsewhere and I missed it?)
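
Two of the inline suggestions above (the tuple pack-expansion trick at line 167 and the generic range overload at lines 170–173) can be sketched outside of LLVM as follows. `ToyHasher`, `ToyHashBuilder`, `addAll` and `addRange` are hypothetical names for this note. `std::begin`/`std::end` stand in for `llvm::adl_begin()`/`llvm::adl_end()`, and a `static_assert` replaces the suggested `enable_if`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <tuple>
#include <type_traits>
#include <vector>

// Hypothetical byte-oriented hasher (64-bit FNV-1a).
struct ToyHasher {
  std::uint64_t State = 14695981039346656037ULL;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Ptr[I];
      State *= 1099511628211ULL;
    }
  }
  std::uint64_t final() const { return State; }
};

class ToyHashBuilder {
  ToyHasher Hasher;

public:
  ToyHashBuilder &add(std::uint32_t V) {
    Hasher.update(reinterpret_cast<const std::uint8_t *>(&V), sizeof(V));
    return *this;
  }

  // Braced initialization evaluates its elements left to right, so each
  // add() call runs in argument order. The unnamed tuple is discarded.
  template <typename... Ts> ToyHashBuilder &addAll(const Ts &...Args) {
    (void)std::tuple<const Ts &...>{(add(Args), Args)...};
    return *this;
  }

  // Generic range overload: hashes the size first, then every element.
  template <typename RangeT> ToyHashBuilder &addRange(const RangeT &Range) {
    using std::begin;
    using std::end;
    auto First = begin(Range);
    auto Last = end(Range);
    using Category =
        typename std::iterator_traits<decltype(First)>::iterator_category;
    static_assert(std::is_base_of<std::forward_iterator_tag, Category>::value,
                  "addRange needs forward iterators: the range is walked twice");
    add(static_cast<std::uint32_t>(std::distance(First, Last)));
    for (; First != Last; ++First)
      add(*First);
    return *this;
  }

  std::uint64_t final() const { return Hasher.final(); }
};
```

Because only the iterator category is checked, a `std::list` and a `std::vector` holding the same elements produce the same hash in this sketch.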

This revision now requires changes to proceed. Jul 29 2021, 1:05 PM

Address review comments:
Main changes:

Switch dependency order: HashBuilder now depends on ArrayRef and StringRef
Clean up updateRange and updateRangeImpl
Fix case style

In D106910#2914469, @dexonsmith wrote:

Thanks, this looks like a great start to me. Lots of comments inline (many of them nitpicks).

Expose a raw pointer update interface for hashes.
The new method update(const uint8_t *Ptr, size_t Size) is an alternative to
the existing update(ArrayRef<uint8_t> Data). It will allow the incoming
HashBuilder interface to not depend on ArrayRef.

If we add pointer+size APIs to the hashers, I think that should be in a separate prep patch...

Sure. That was already separated in a different commit.

... but IMO, ArrayRef is a safe/clean encapsulation of pointer+size for HashBuilder to use. Can you explain why you want to avoid it?

I agree ArrayRef is a safe and clean encapsulation. It seemed to me it was cleaner to not have the HashBuilder interface depend on any particular type, in a similar way to hash_code.
However, your arguments about ArrayRef and StringRef being much more commonly used are on point, and I agree that switching the dependency order and having ArrayRef and StringRef specialized in HashBuilder.h will be better.

llvm/include/llvm/ADT/ArrayRef.h
17 ↗	(On Diff #362146)	Done. (See also global reply.)
575–602 ↗	(On Diff #362146)	Done. (See also global reply.)
llvm/include/llvm/ADT/StringRef.h
15 ↗	(On Diff #362146)	Done. (See also global reply.)
966 ↗	(On Diff #362146)	Done for `StringRef` and `ArrayRef`, since we now depend on both.
llvm/include/llvm/Support/HashBuilder.h
140	Nah; this was to have the generic form. Leaving it out.
144–147	Absolutely. Fixed, with a test.
148–153	Absolutely. Fixed with a test.
167	It can be skipped indeed.
170–173	The update should fit what you want, but please take a second look. Added an overload for a generic range using `adl_begin()` and `adl_end()`. Cleaned-up `updateRange` and `updateRangeImpl`, with a test for `std::list`. The check for forward iterator does not use `enable_if`, but should be good enough. Interestingly, with all this, implementations of `update` for `ArrayRef` and `StringRef` can simply forward to `updateRange`.
177	Sure thing. I can imagine the three overloads to be helpful in different contexts.

Harbormaster completed remote builds in B117745: Diff 363876. Aug 3 2021, 2:36 PM

Updates look great; a few more comments inline.

Since the ultimate goal is to make it easy for higher-level logic to use a stable hash across different architectures, I realize that endianness could be important too for the hashable-data overloads. Have you considered what to do there?

One idea: add an endianness template parameter to HashBuilder:

template <class HasherT, support::endianness Endianness = support::native>
class HashBuilder;

then change the update overloads for hashable data to hash the result of support::endian::byte_swap(V, Endianness). In the common case (native) the copy/etc should get optimized out. WDYT?
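
A minimal sketch of the endianness idea, under stated assumptions: `Endian`, `ToyHasher` and `ToyHashBuilder` are hypothetical stand-ins, and instead of calling `support::endian::byte_swap` the sketch serializes the value with shifts, which gives the same byte stream without needing to know the host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for support::endianness (without `native`).
enum class Endian { Big, Little };

// Hypothetical byte-oriented hasher (64-bit FNV-1a).
struct ToyHasher {
  std::uint64_t State = 14695981039346656037ULL;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Ptr[I];
      State *= 1099511628211ULL;
    }
  }
  std::uint64_t final() const { return State; }
};

template <typename HasherT, Endian E> class ToyHashBuilder {
  HasherT Hasher;

public:
  // Feeds the four bytes of V to the hasher in the byte order selected by E.
  ToyHashBuilder &add(std::uint32_t V) {
    std::uint8_t Bytes[4];
    for (int I = 0; I != 4; ++I) {
      int Shift = (E == Endian::Little) ? 8 * I : 8 * (3 - I);
      Bytes[I] = static_cast<std::uint8_t>(V >> Shift);
    }
    Hasher.update(Bytes, sizeof(Bytes));
    return *this;
  }
  auto final() const { return Hasher.final(); }
};

inline std::uint64_t hashLittle(std::uint32_t V) {
  return ToyHashBuilder<ToyHasher, Endian::Little>().add(V).final();
}

inline std::uint64_t hashBig(std::uint32_t V) {
  return ToyHashBuilder<ToyHasher, Endian::Big>().add(V).final();
}
```

With a fixed `Endian` argument the byte stream no longer depends on the machine running the code, which is the property a stable cross-architecture hash needs.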

llvm/include/llvm/Support/HashBuilder.h
87–95	Please use `makeArrayRef()` to avoid splitting the pointer from the size. Also, I wonder if this should take into account endianness.
196	I think you can drop the name here too.
206–208	I think it's important to have `updateRange` overloads for `ArrayRef` (at least for `uint8_t`) and `StringRef` that directly call `updateBytes` rather than iterating.
221–225	I don't think we should add this overload. The rare call site can call `makeArrayRef` itself, encouraging it to push ArrayRef further up the stack.
llvm/unittests/Support/HashBuilderTest.cpp
27–37	Can MD5 be updated in a prep patch to have an overload of `final` like this so this bridge code isn't needed?
39–45	Seeing this boilerplate makes me wonder if the builder API could be improved with something like:
Add an `Optional<HasherT>` data member and a default constructor for `HashBuilder` that constructs it.
Make `update*` return `HashBuilder&`.
Add a `HashBuilder::final()` that returns `HasherT::final()` (after adding it to MD5 as above).
Then the above logic becomes a one-liner:
HashBuilder<HasherT>().update(Args...).final()
(maybe still worth encapsulating in a function here for testing, but maybe not; not sure)
82–83	It's not ideal to need to spell this out for each `HasherT`, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:
Unify the `HasherT` interfaces as necessary in prep commits to allow using them generically (maybe my other suggestions above are sufficient for that?).
Test each of these types in a standalone test in a unified `HashBuilderTest` below, dropping BasicTest/etc.
Add a helper function to the test fixture below to avoid boilerplate, something like:
template <class T> void checkHashableData(T Data) {
  HasherT RawHasher;
  auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...);
  RawHasher.update(Bytes);
  CHECK_EQ(HashBuilder<HasherT>().update(Data).final(), RawHasher.final());
}
103	I suggest dropping the `Typed` and calling this simply `HashBuilderTest`.
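
The builder shape suggested for lines 39–45 and the comparison helper suggested for lines 82–83 can be sketched together. This is a standalone approximation: `ToyHasher`, `ToyHashBuilder`, `hashRawData`, `hashWithBuilder` and `hashThroughExternal` are hypothetical, and `std::unique_ptr` stands in for the `Optional<HasherT>` member.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical byte-oriented hasher (64-bit FNV-1a).
struct ToyHasher {
  std::uint64_t State = 14695981039346656037ULL;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Ptr[I];
      State *= 1099511628211ULL;
    }
  }
  std::uint64_t final() const { return State; }
};

template <typename HasherT> class ToyHashBuilder {
  std::unique_ptr<HasherT> Owned; // empty when wrapping an external hasher
  HasherT &Hasher;

public:
  ToyHashBuilder() : Owned(new HasherT()), Hasher(*Owned) {}
  explicit ToyHashBuilder(HasherT &External) : Hasher(External) {}

  // Returning the builder allows calls to be chained.
  ToyHashBuilder &add(std::uint32_t V) {
    Hasher.update(reinterpret_cast<const std::uint8_t *>(&V), sizeof(V));
    return *this;
  }
  auto final() { return Hasher.final(); }
};

// Hashes Data directly with the hasher, bypassing the builder.
template <typename HasherT> auto hashRawData(std::uint32_t Data) {
  HasherT RawHasher;
  RawHasher.update(reinterpret_cast<const std::uint8_t *>(&Data), sizeof(Data));
  return RawHasher.final();
}

// The one-liner form: default-construct, add, finalize.
template <typename HasherT> auto hashWithBuilder(std::uint32_t Data) {
  return ToyHashBuilder<HasherT>().add(Data).final();
}

// Same data, but the builder wraps a hasher owned by the caller.
inline std::uint64_t hashThroughExternal(std::uint32_t Data) {
  ToyHasher External;
  ToyHashBuilder<ToyHasher> Builder(External);
  Builder.add(Data);
  return External.final();
}
```

Comparing `hashWithBuilder` against `hashRawData` checks the builder against the hasher itself rather than against a hard-coded reference value.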

This revision now requires changes to proceed. Aug 3 2021, 3:08 PM

Address review comments.

The main change is support and tests for endianness.

Harbormaster completed remote builds in B118751: Diff 365283. Aug 9 2021, 2:16 PM

Run clang-format.

Harbormaster completed remote builds in B118932: Diff 365532. Aug 10 2021, 11:13 AM

Fix forgotten TODO.

dexonsmith added inline comments. Aug 12 2021, 6:49 PM

llvm/unittests/Support/HashBuilderTest.cpp
82–83	I'm still seeing magic numbers, although this comment seems to be in the wrong place now. See further down in ReferenceHashCheck.
102–103	It's not clear to me why you're including volatile ints here. Is there any reason to expect them to have different behaviour?
109–110	I'm not such a fan of this reference hash / magic value idea. I'd prefer to check each of these data types separately, and rely on comparing against the result of the underlying Hasher (which we can assume is already well-tested elsewhere). E.g.:
// helper for hashers
template <class H, class T> auto hashRawData(T Data) {
  H Hasher;
  Hasher.update(makeRawDataArrayRef(makeArrayRef(
      reinterpret_cast<uint8_t *>(&Data), sizeof(Data))));
  return Hasher.final();
}
// then in the test for raw integer data:
int I = 0x12345678;
EXPECT_EQ(HashBuilder<H>().update(I).final(), hashRawData<H>(I));
Could be each in their own test (one per type), or just in separate scopes, don't think it matters much. Or if you think the magic numbers are better, can you walk me through your logic?

Harbormaster completed remote builds in B119369: Diff 366162. Aug 12 2021, 6:50 PM

arames marked 2 inline comments as done. Aug 16 2021, 11:30 AM

arames added inline comments.

llvm/include/llvm/Support/HashBuilder.h
87–95	Updated to use `makeArrayRef()`.
Great point. I didn't pay attention to it, but taking it into account will significantly improve the value of this class. Supporting endianness looks pretty straightforward thanks to `Support/Endian.h`. Please take a look.
A couple of questions:
I kept the default as `native` instead of `little` to avoid penalizing big-endian platforms. On the other hand, defaulting to a stable hash may be safer. What do you think?
To add support for user types, we end up with
template <typename HasherT, support::endianness Endianness>
void updateHash(HashBuilder<HasherT, Endianness> &HBuilder, const UserType &Value)
I think that is ok. Or do you prefer "reserving" the `updateHash` name and going with the following instead?
template <typename HashBuilderT>
void updateHash(HashBuilderT &HBuilder, const UserType &Value)
206–208	Both `ArrayRef` and `StringRef` do end up calling the version of `updateRangeImpl` with `updateBytes`, due to how their `begin` and `end` are declared. Do you mean to do it explicitly, in case the `ArrayRef` implementation changes?
llvm/unittests/Support/HashBuilderTest.cpp
27–37	https://reviews.llvm.org/D107781 This PR will have to be rebased once the prep is reviewed and lands.
39–45	Done. That makes it more usable indeed. I kept the helper to keep the tests concise.
82–83	I have cleaned this up significantly as part of the endianness support. It does not look like what you suggest, but I think it is much better than before. There is now a single `ReferenceHashCheck` handling all hasher types, with the three big, little, and native endiannesses. This is the only test that verifies against a hard-coded reference hash. In my opinion it is valuable to ensure the stability of hashes across platforms. Other tests rely on equality or difference of different hashes.
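
The second `updateHash` signature discussed for lines 87–95 (templated on the whole builder type) can be sketched as an extension point found by argument-dependent lookup. Everything here is hypothetical: `ToyHasher`, `ToyHashBuilder`, the `user` namespace and `Point` exist only for this note.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical byte-oriented hasher (64-bit FNV-1a).
struct ToyHasher {
  std::uint64_t State = 14695981039346656037ULL;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Ptr[I];
      State *= 1099511628211ULL;
    }
  }
  std::uint64_t final() const { return State; }
};

class ToyHashBuilder {
  ToyHasher Hasher;

public:
  ToyHashBuilder &add(std::uint32_t V) {
    Hasher.update(reinterpret_cast<const std::uint8_t *>(&V), sizeof(V));
    return *this;
  }

  // Class types are delegated to a free function named updateHash. The call
  // is unqualified, so it is resolved by ADL when the template is instantiated.
  template <typename T>
  typename std::enable_if<std::is_class<T>::value, ToyHashBuilder &>::type
  add(const T &Value) {
    updateHash(*this, Value);
    return *this;
  }

  std::uint64_t final() const { return Hasher.final(); }
};

namespace user {
struct Point {
  std::uint32_t X;
  std::uint32_t Y;
};

// The user-side hook: does not name the hasher type or the endianness.
template <typename HashBuilderT>
void updateHash(HashBuilderT &HBuilder, const Point &P) {
  HBuilder.add(P.X).add(P.Y);
}
} // namespace user

inline std::uint64_t hashPoint(std::uint32_t X, std::uint32_t Y) {
  return ToyHashBuilder().add(user::Point{X, Y}).final();
}

inline std::uint64_t hashFields(std::uint32_t X, std::uint32_t Y) {
  return ToyHashBuilder().add(X).add(Y).final();
}
```

Templating the hook on the builder type keeps user code independent of the builder's template parameters, at the cost of effectively reserving the function name.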

arames added inline comments. Aug 16 2021, 11:30 AM

llvm/unittests/Support/HashBuilderTest.cpp
102–103	I had borrowed the basic tests for `hash_code`, which did include it. I agree they bring no value here. Removed.

Update test to compare against the hasher.

(Sorry it looks like I missed publishing a few earlier comments.)

Harbormaster completed remote builds in B119753: Diff 366685. Aug 16 2021, 12:01 PM

dexonsmith added a parent revision: D107781: [Support] Update `MD5` to follow other hashes. Aug 16 2021, 1:00 PM

I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.

Two high-level points, then more comments inline.

Firstly, I'm a bit concerned about the semantics of update(ArrayRef) (and update(StringRef) by proxy) due to inconsistency with HasherT.

HasherT::update(ArrayRef<uint8_t>) hashes the raw bytes pointed to by the array.
HashBuilder<HasherT>::update(ArrayRef) (including for uint8_t) hashes the size first, then the raw bytes.

This inconsistency could be confusing and error-prone.

I wonder if we should (mostly) avoid the word update in HashBuilder and use add instead. This also fixes the updateRange problem I point out inline (addRange works just fine without a preposition). We could leave update() around *just* for ArrayRef<uint8_t> (and maybe StringRef), forwarding to addElements/addRangeElements (see inline). The result would be:

HashBuilder<HasherT>::update does exactly the same thing as HasherT::update (also working with StringRef, for convenience)
HashBuilder<HasherT>::add (and add*) handles higher-level semantics, including primitives, endianness, ranges, etc.
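
To make the difference between the two meanings concrete, here is a small sketch that uses a hypothetical `RecordingHasher`, which simply stores the bytes it receives. `ToyHashBuilder`, `viaUpdate` and `viaAdd` are also invented for this note, and the size is written as a single byte only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical "hasher" that records the bytes it is fed, so the byte stream
// produced by each entry point can be inspected directly.
struct RecordingHasher {
  std::vector<std::uint8_t> Bytes;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    Bytes.insert(Bytes.end(), Ptr, Ptr + Size);
  }
};

template <typename HasherT> class ToyHashBuilder {
  HasherT &Hasher;

public:
  explicit ToyHashBuilder(HasherT &H) : Hasher(H) {}

  // update(): same meaning as HasherT::update, raw bytes only.
  void update(const std::string &S) {
    Hasher.update(reinterpret_cast<const std::uint8_t *>(S.data()), S.size());
  }

  // add(): higher-level semantics, the size first and then the bytes.
  ToyHashBuilder &add(const std::string &S) {
    std::uint8_t Size = static_cast<std::uint8_t>(S.size());
    Hasher.update(&Size, 1);
    update(S);
    return *this;
  }
};

inline std::vector<std::uint8_t> viaUpdate(const std::string &A,
                                           const std::string &B) {
  RecordingHasher H;
  ToyHashBuilder<RecordingHasher> Builder(H);
  Builder.update(A);
  Builder.update(B);
  return H.Bytes;
}

inline std::vector<std::uint8_t> viaAdd(const std::string &A,
                                        const std::string &B) {
  RecordingHasher H;
  ToyHashBuilder<RecordingHasher> Builder(H);
  Builder.add(A).add(B);
  return H.Bytes;
}
```

With raw `update`, the pairs ("ab", "c") and ("a", "bc") feed identical bytes to the hasher; with `add`, the size prefixes keep them apart.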

Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.

HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.
HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.
HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:

template <typename HasherT, support::endianness Endianness>
class HashBuilderImpl {
  static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
  // etc.
};

template <typename HasherT, support::endianness Endianness = native,
                            support::endianness EffectiveEndianness =
                                Endianness == native
                                   ? support::endian::system_endianness()
                                   : Endianness>
class HashBuilder
    : HashBuilderImpl<HasherT, EffectiveEndianness> {
  using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;

public:
  HashBuilder() = default;
  HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
};
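
A compilable approximation of the three layers, outside of LLVM. All names (`Endian`, `systemEndianness`, `ToyHasher`, `ToyBuilderBase`, `ToyBuilderImpl`, `ToyBuilder`) are hypothetical, and host byte-order detection relies on the `__BYTE_ORDER__` macro provided by GCC and Clang, falling back to little-endian as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for support::endianness.
enum class Endian { Big, Little, Native };

constexpr Endian systemEndianness() {
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) &&                \
    __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  return Endian::Big;
#else
  return Endian::Little; // assumption when the compiler does not say otherwise
#endif
}

// Hypothetical byte-oriented hasher (64-bit FNV-1a).
struct ToyHasher {
  std::uint64_t State = 14695981039346656037ULL;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Ptr[I];
      State *= 1099511628211ULL;
    }
  }
  std::uint64_t final() const { return State; }
};

// Layer 1: everything that does not depend on endianness.
template <typename HasherT> class ToyBuilderBase {
  HasherT Hasher;

public:
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    Hasher.update(Ptr, Size);
  }
  auto final() { return Hasher.final(); }
};

// Layer 2: the add* family, for a concrete byte order only.
template <typename HasherT, Endian E>
class ToyBuilderImpl : public ToyBuilderBase<HasherT> {
  static_assert(E != Endian::Native,
                "ToyBuilder should canonicalize endianness");

public:
  ToyBuilderImpl &add(std::uint32_t V) {
    std::uint8_t Bytes[4];
    for (int I = 0; I != 4; ++I)
      Bytes[I] = static_cast<std::uint8_t>(
          V >> (E == Endian::Little ? 8 * I : 8 * (3 - I)));
    this->update(Bytes, sizeof(Bytes));
    return *this;
  }
};

// Layer 3: maps Native to the host byte order before picking an Impl.
template <typename HasherT, Endian E,
          Endian Effective = (E == Endian::Native ? systemEndianness() : E)>
class ToyBuilder : public ToyBuilderImpl<HasherT, Effective> {};
```

In this sketch `ToyBuilder<ToyHasher, Endian::Native>` derives from the same `ToyBuilderImpl` instantiation as the builder for the host byte order, so the `add` code is generated once for both.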

llvm/include/llvm/Support/HashBuilder.h
81	I wonder if `std::is_floating_point` should be excluded. Float values might generally be a bit iffy to include in a stable hash, because of all the floating point modes that can affect their values. `long double` has a platform-dependent mantissa. This would disable `HashBuilder::add`, forcing clients to think about what they actually want to do with it. WDYT?
84–85	I think these should be private, and `Hasher` can have a `getHasher()` accessor. This'll make it easier to make HashBuilder move/copy-friendly in the future if for some reason that's useful.
87–95	Just seeing this question. I don't think it's necessary to default to a stable hash. Clients can do that when it's useful. But if you're uncomfortable with unstable-by-default, another option is not to have a default, and then the client will need to spell out `support::native`, making it more clear that the hash is unstable. (It's easier to add a default later, but changing the default might not be easy.) I think the explicit parameter for `Endianness` makes it clear to authors they should be thinking about Endianness. There should be a name change to align with `add`. `addToHash`?
88–91	Can this be simplified? HashBuilder() : OptionalHasher(in_place), Hasher(*OptionalHasher) {}
97–99	I made some comments on the patch this one depends on (https://reviews.llvm.org/D107781 -- FYI, I linked them using the add parent/child revision button above). I suggest the following to better align with the hashers themselves:
Remove the template / enable_if. Require hashers to provide this.
Return `StringRef`. Expect hashers to do the same.
Add a `result()` method also returning StringRef, matching the one from hashers.
I also think an analogue for the static `hash()` that the hashers have would be useful, at least in the tests. Maybe:
using HashArrayT = decltype(HasherT::hash(ArrayRef<uint8_t>()));
static HashArrayT toArrayImpl(StringRef HashString) {
  assert(HashString.size() == sizeof(HashArrayT));
  HashArrayT HashArray;
  llvm::copy(HashString, HashArray.begin());
  return HashArray;
}
HashArrayT finalHash() { return toArrayImpl(Hasher.final()); }
HashArrayT resultHash() { return toArrayImpl(Hasher.result()); }
I don't think there should be a `HashBuilder::hash()` function (static or otherwise).
Forwarding to `HashBuilder::add()` would be confusing / error-prone, since then `HashBuilder<HasherT>::hash(ArrayRef<uint8_t>)` would do something different from `HasherT::hash()`.
Forwarding to `HasherT::hash` would also be confusing, and doesn't really add any value since clients can already do that themselves.
101–108	I wonder if the body of this should be exposed as a general helper for clients, using a weaker `std::enable_if` check. I.e.:
Keep the signature and `std::enable_if` check for this overload of `add`.
Move the body to a new function, called something like `addRawBytes`, which takes a `const T&` and has a weaker `std::enable_if` check (maybe `std::is_primitive`? or what does `endian::byte_swap` support?).
Change `add` to call the new function.
WDYT?
206–208	"Both `ArrayRef` and `StringRef` do end up calling the version of `updateRangeImpl` with `updateBytes`, due to how their `begin` and `end` are declared. Do you mean to do it explicitly, in case the `ArrayRef` implementation changes?"
Just seeing this question. Two reasons to do it explicitly:
avoid compile-time overhead for this common case
make it obvious what happens for anyone reading the code (as a common case, people are likely to care that it's doing the right thing)
276–278	I think there's something off with the naming. `updateRange` sounds like it is mutating the range itself, rather than updating the hash. Could add a preposition, or rename to `add`.
285	Similar naming problem here (I'm pretty sure it's my fault), but worse, since I now realize there's nothing inherent about the word "bytes" that implies the size of the range is skipped. Someone could have a container of bytes, and expect that this encoded the container. Stepping back, I feel like this function -- skipping the size of a range -- could be more generally useful to clients. I wonder if it should be exposed for generic ranges as well, keeping specializations for `ArrayRef<uint8_t>` and `StringRef`. Name could be `addElements` or `addRangeElements` or something?
315–316	This can call `addRangeElements()`.
320–324	This overload can be dropped, but `addRangeElements` would want something similar.
llvm/unittests/Support/HashBuilderTest.cpp
39–45	If you take my suggestion to add `finalHash`, `computeHash` can be further simplified to: return HashBuilder<HasherTAndEndianness>().add(Args...).finalHash(); removing the static cast. IMO, inlining it would improve clarity, but renaming it to `hashWithBuilder` would also help.
54–57	(Same comments as `computeHash`: drop std::string, and simplify or rename)
82–83	Looks like you've adjusted the tests in the meantime, but since I hadn't seen this comment before, just wanted to give some more reasoning:
"In my opinion it is valuable to ensure the stability of hashes across platforms."
I agree that's a useful trait, but the tests for MD5, SHA1, and SHA256 should be doing that, respectively. Meanwhile, the following important trait was not being tested: do `HashBuilder<HashT>` and `HashT` give the same result as each other for each of these raw data types? (Since fixed!)
You could have both (the above, plus the magic number test). It would uncover things such as `double` having a different representation on different platforms, or some platforms being big- or little-endian. Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...
91	This is also byte-swapping. I suggest `byteSwapAndHashRawData()`. Maybe even add "WithoutBuilder"? Or, since this is only used in one function, you could make it a lambda.
96–98	I (re-)discovered the static method `hash` when reviewing the other patch, which returns a `std::array`. I suggest using that instead, and directly returning `H::hash(SwappedData)`. I suggest forwarding the return array directly rather than wrapping in a `std::string`.
101	I think the name is wrong since the change. Maybe `addHashableData`?
109–110	Update here looks like what I was asking for, but it's actually not clear at all how these functions are different (I probably suggested a bad name I think, sorry). I think `computeHash` can be inlined, avoiding any ambiguity about what it does, if you take my suggestion to add `finalHash`. But if you keep it, it should probably be renamed to `hashWithBuilder` to make it clear how the hash is being computed. That'd disambiguate the other one I think?
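
The suggestion at line 81 (leave floating point out of the generic `add`) can be illustrated with a standalone sketch. `IsToyHashableData`, `ToyHashBuilder`, `CanAdd` and `floatBits` are hypothetical names for this note. `floatBits` shows one way a client could still hash a `float` by spelling out the representation, assuming a 32-bit `float`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>

// Hypothetical byte-oriented hasher (64-bit FNV-1a).
struct ToyHasher {
  std::uint64_t State = 14695981039346656037ULL;
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I) {
      State ^= Ptr[I];
      State *= 1099511628211ULL;
    }
  }
  std::uint64_t final() const { return State; }
};

// Hypothetical trait: integers and enums only. Floating point is left out
// on purpose.
template <typename T>
struct IsToyHashableData
    : std::integral_constant<bool, std::is_integral<T>::value ||
                                       std::is_enum<T>::value> {};

class ToyHashBuilder {
  ToyHasher Hasher;

public:
  template <typename T>
  typename std::enable_if<IsToyHashableData<T>::value, ToyHashBuilder &>::type
  add(T Value) {
    Hasher.update(reinterpret_cast<const std::uint8_t *>(&Value),
                  sizeof(Value));
    return *this;
  }
  std::uint64_t final() const { return Hasher.final(); }
};

// Detects whether ToyHashBuilder::add(T) would compile.
template <typename T, typename = void> struct CanAdd : std::false_type {};
template <typename T>
struct CanAdd<T, decltype(void(
                     std::declval<ToyHashBuilder &>().add(std::declval<T>())))>
    : std::true_type {};

// A client that really wants to hash a float picks the representation itself.
inline std::uint32_t floatBits(float F) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "sketch assumes a 32-bit float");
  std::uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  return Bits;
}
```

Calling `add(1.5)` on this builder fails to compile, while `add(floatBits(1.5f))` works, which pushes the decision about the floating point representation to the caller.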

(forgot to mark "request changes" in last comment)

This revision now requires changes to proceed. Aug 16 2021, 4:34 PM

Thanks - again - for the comments. Please take a look at the update.

The major changes are around:

Renaming update* to add*
Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.

However, I am pushing back on a number of suggestions that would impose requirements on the hasher type. I think there is a lot of value in keeping the interface trivial to use for user-defined/other hasher types.
That includes pushing back on the suggested finalHash, and keeping support for ctors with arguments and the use of enable_if to conditionally forward methods like final() and result(), so that we can rely on the single HasherT::update(ArrayRef<uint8_t>) method. I think that is a fair implementation cost for the value it brings.
I have added a test for a custom minimal hasher.
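
A sketch of what "trivial to use with a custom hasher" could look like, under assumptions: `SeededXorHasher` is a made-up hasher that offers only a seeded constructor and `update`, and `ToyHashBuilder` forwards its constructor arguments to it. Neither is the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical minimal hasher: a seed, an update() method, and nothing else.
struct SeededXorHasher {
  std::uint8_t State;
  explicit SeededXorHasher(std::uint8_t Seed) : State(Seed) {}
  void update(const std::uint8_t *Ptr, std::size_t Size) {
    for (std::size_t I = 0; I != Size; ++I)
      State ^= Ptr[I];
  }
};

template <typename HasherT> class ToyHashBuilder {
  HasherT Hasher;

public:
  // Constructor arguments (for example a seed) are forwarded to the hasher.
  template <typename... ArgTs>
  explicit ToyHashBuilder(ArgTs &&...Args)
      : Hasher(std::forward<ArgTs>(Args)...) {}

  ToyHashBuilder &add(std::uint8_t V) {
    Hasher.update(&V, 1);
    return *this;
  }
  HasherT &getHasher() { return Hasher; }
};

// XORs the seed with both values through the builder.
inline std::uint8_t xorAll(std::uint8_t Seed, std::uint8_t A, std::uint8_t B) {
  ToyHashBuilder<SeededXorHasher> Builder(Seed);
  return Builder.add(A).add(B).getHasher().State;
}
```

The builder here requires nothing from the hasher beyond `update`; the result is read back through `getHasher()`.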

Let me know what you think. I'm always open to discuss and update.

Cheers!

In D106910#2948072, @dexonsmith wrote:

I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.

Two high-level points, then more comments inline.

Firstly, I'm a bit concerned about the semantics of update(ArrayRef) (and update(StringRef) by proxy) due to inconsistency with HasherT.

HasherT::update(ArrayRef<uint8_t>) hashes the raw bytes pointed to by the array.

HashBuilder<HasherT>::update(ArrayRef) (including for uint8_t) hashes the size first, then the raw bytes.

This inconsistency could be confusing and error-prone.

I wonder if we should (mostly) avoid the word update in HashBuilder and use add instead. This also fixes the updateRange problem I point out inline (addRange works just fine without a preposition). We could leave update() around *just* for ArrayRef<uint8_t> (and maybe StringRef), forwarding to addElements/addRangeElements (see inline). The result would be:

HashBuilder<HasherT>::update does exactly the same thing as HasherT::update (also working with StringRef, for convenience)

HashBuilder<HasherT>::add (and add*) handles higher-level semantics, including primitives, endianness, ranges, etc.

I did not consider the issue. But I like the proposal, as it removes potential confusions.

Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.

HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.

HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.

HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:

Done.
Doing so requires templated CRTP, as we need to return the reference to our derived class and keep track of HasherT/endianness. It looks slightly convoluted at the template declaration site, but is neat otherwise.

template <typename HasherT, support::endianness Endianness>
class HashBuilderImpl {
  static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
  // etc.
};

template <typename HasherT, support::endianness Endianness = native,
                            support::endianness EffectiveEndianness =
                                Endianness == native
                                   ? support::endian::system_endianness()
                                   : Endianness>
class HashBuilder
    : HashBuilderImpl<HasherT, EffectiveEndianness> {
  using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;

public:
  HashBuilder() = default;
  HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
};

llvm/include/llvm/Support/HashBuilder.h
81	That sounds reasonable to me. Forcing the hash would still be easy via other scalar types.
87–95	I'll drop the default value for `Endianness`. The template is simple enough for the rare use-cases, and simple enough to wrap if the lack of default is ever a problem for a user.
88–91	See my top-level comment. I think we should keep support for ctors with arguments (e.g. for a seed).
97–99	Added a forward to `result()`.
Per my top-level reply, I think we should keep the `final()` and `result()` forwards conditional on their existence in `HasherT`, to keep the use of this interface with a custom hasher type trivial.
I am not convinced that the `finalHash()` and `resultHash()` helpers should be members of `HashBuilder`. I may be missing context on how they are used, so feel free to push if you think the use-case is really valuable. I would assume that the return type of `HasherT::final()` is appropriate to represent the hash. It seems bothersome to have to provide these helpers in the interface, making additional assumptions (or more SFINAE code) about the hasher type.
Looking at other related comments, I think a lot of this discussion stems from the fact that the `final()` methods of `MD5`, `SHA1`, and `SHA256` return a reference (`StringRef`). It seems to me this discussion would be simplified if there was a method like `finalByValue()`. My suggestion would be to keep relying on the return type for `final()`, and have "type converting" wrappers implemented outside of `HasherT` or `HashBuilder` for now. If you like the `finalByValue()` idea, I could do it in a successor patch.
On this topic, I find the design of the `::hash` functions bears on the same issue. Being `static`, they return a result by value. Though both strings and `array<uint8_t>` are of course pretty close, I find `std::array` a somewhat surprising choice when `final()` and `result()` return `StringRef`. So the question will be what the return type of `finalByValue()` should be for these hashes: `std::array` like `hash()`, or maybe a `SmallString`.
101–108	Sounds good. Done, with enablement automatically whenever the type is handled by `support::endian::byte_swap()`.
206–208	Done, with comments.
285	I like it. Done.
315–316	This has become `addRangeElementsImpl`. Now, `addRange` simply does `add(size); addRangeElements(...);`.
320–324	This has become `addRangeElementsImpl`.
llvm/unittests/Support/HashBuilderTest.cpp
39–45	Per my other comment, I am not convinced by the `finalHash()` method. Renamed to `hashWithBuilder`.
54–57	Renamed to `hashRangeWithBuilder`.
82–83	I did adjust the tests following your suggestion.
"Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate..."
I agree. We would not want to start seeing the `HashBuilder` tests failing across different platforms because of reasons that are known (and maybe already correctly handled). So let's correctly separate the responsibilities.
91	Done. I was not familiar with using lambdas with `auto` for that type of use-case. Also renamed it to `ByteSwapAndHashWithHasher` for clarity.
96–98	I agree here a single call would be nice. The change would require the update of previous helpers to return `decltype(H::hash)`, that I am reluctant to do.
109–110	This ended up with explicit names `ByteSwapAndHashWithHasher` and `hashWithBuilder`.

Address review comments.

The major changes are around:

Renaming update* to add*
Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.

Harbormaster completed remote builds in B120818: Diff 368141. Aug 23 2021, 11:24 AM

Thanks for working through this! LGTM if you remove the CRTP (see below, I don't think it's needed). See also a few more nits I found to pick.

llvm/include/llvm/Support/HashBuilder.h
31–36	Using CRTP blocks the compile-time speedup / code size wins. Can we skip HashBuilder/Endianness/EffectiveEndianness/Derived?
46–49	I suggest dropping the return statement, for `void update(ArrayRef)`, a simple passthrough to HasherT::update.
No need for the CRTP (faster compile-time / avoid code size bloat by factoring out a common instantiation for the base class)
The same functionality is available as `HashBuilder& HashBuilder::addRangeElements`, so there's no loss of convenience
56	(this would also return `void`)
61–67	I don't think we need enable_if or return type deduction here. It seems like overkill to make "does HashBuilder::final work?" SFINAE-detectible.
69–77	Same as `final`.
92	I think this comment should explain that `support::endianness::native` is not supported here but that it's handled by `HashBuilder`. Basically, explain the static assertion in text.
101–103	Using CRTP in `HashBuilderImpl` also increases compile-time / code size since it blocks sharing code between "native" and whatever the system endianness is. I suggest dropping CRTP and returning `HashBuilderImpl&` instead of `HashBuilder&` from `add*`. (If this needed CRTP, which I don't think it does, I'd suggest using an opaque `DerivedT` to avoid passing through details like EffectiveEndianness.)
411–425	Given that `add*` will return `HashBuilderImpl&`, might be reasonable to reduce this to a typedef: template <class HasherT, support::endianness Endianness> using HashBuilder = HashBuilderImpl<HasherT, (Endianness == support::endianness::native ? support::endian::system_endianness() : Endianness)>; Either way seems fine though.
llvm/unittests/Support/HashBuilderTest.cpp
47–51	I'd skip `auto` type deduction unless the type is hard to spell. Also you can use StringRef::str here.
54–57	(same as above)
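
The typedef suggested in the comment on lines 411–425 can be illustrated with a small self-contained sketch. Everything here is a stand-in and not LLVM code: `Endianness`, `systemEndianness()`, `BuilderImpl`, `Builder` and `DummyHasher` are hypothetical names, and the byte-order detection assumes the GCC/Clang predefined macros. It only shows that an alias template can map `Native` onto the system byte order, so both spellings share one instantiation.

```cpp
#include <type_traits>

// Hypothetical stand-in for llvm::support::endianness.
enum class Endianness { Big, Little, Native };

// Hypothetical stand-in for support::endian::system_endianness().
// Assumption: uses the GCC/Clang byte-order macros and falls back to
// little-endian when they are not defined.
constexpr Endianness systemEndianness() {
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__)
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  return Endianness::Big;
#else
  return Endianness::Little;
#endif
#else
  return Endianness::Little;
#endif
}

// The implementation class rejects Native; callers go through the alias.
template <typename HasherT, Endianness E> class BuilderImpl {
  static_assert(E != Endianness::Native,
                "the alias should canonicalize endianness");

public:
  static constexpr Endianness Order = E;
};

// Alias that canonicalizes Native to the system byte order.
template <typename HasherT, Endianness E>
using Builder =
    BuilderImpl<HasherT, (E == Endianness::Native ? systemEndianness() : E)>;

struct DummyHasher {};

// Native and the system byte order name the same type, so the compiler
// instantiates the implementation once for both.
static_assert(std::is_same<Builder<DummyHasher, Endianness::Native>,
                           BuilderImpl<DummyHasher, systemEndianness()>>::value,
              "native and system endianness share one instantiation");
```

Because the alias resolves to the implementation type, functions returning `BuilderImpl&` need no knowledge of the outer name, which is the point made about dropping CRTP.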

This revision is now accepted and ready to land.Aug 23 2021, 12:40 PM

Main changes:

Remove CRTP
(Attempt to) fix Windows build
Fix nits

llvm/include/llvm/Support/HashBuilder.h
61–67	You may want to double-check my earlier comments. I think there is a strong value in allowing trivial support for user-defined (or simply other) hashes. Here, that means not requiring support for `HasherT::final()` and not assuming the return type.
101–103	Removed CRTP.

Harbormaster completed remote builds in B121044: Diff 368459.Aug 24 2021, 2:25 PM

Apply clang-format.

One more attempt to fix the Windows build.

Harbormaster completed remote builds in B121051: Diff 368472.Aug 24 2021, 3:02 PM

Explicitly disable hashable types for the addHash overload (for the Windows build).
Introduce checks to confirm debian x64 test failures (to be removed).

dexonsmith added inline comments.Aug 24 2021, 3:35 PM

llvm/include/llvm/Support/HashBuilder.h
61–67	On not requiring HasherT::final, my suggested edit doesn't change that (unless I'm missing something?). The call to HasherT::final is template-dependent. Both before and after the edit, there will be no compile errors if there are no calls to `HashBuilder<HasherT>::final`, and a compile error if there is such a call. The only semantic difference is whether HashBuilder::final is SFINAE-detectable (compile error because there is no overload for HashBuilder::final vs. compile error because the overload for HashBuilder::final can't be instantiated), which doesn't seem important. On the return type, I don't see the benefit of allowing flexibility here. I also don't see a tie-in to the argument about trivial support. And there is a cost: generic code that depends on HashBuilder::final will have to reason about other arbitrary return types. It's simpler for clients if this just returns `StringRef`. If you still think there's strong value in either of those (which both seem unrelated to allowing use of trivial HasherT types), can you clarify what it is?

arames added inline comments.Aug 24 2021, 3:56 PM

llvm/include/llvm/Support/HashBuilder.h
61–67	Both done. (will be updated in the next `arc diff`.) I misunderstood your earlier comment, thinking you suggested to still require `HasherT::final()`. And I didn't think of simply relying on the template for the purpose. I felt specializing the interface to follow particular hasher types might not be very wise. I could imagine a hash wanted to return say `uint64_t`. But since this is hypothetical, I buy the argument of just keeping things simple for now.
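
The point about relying on the template can be shown with a minimal sketch. The names `Builder`, `SumHasher` and `CountingHasher` are hypothetical and are not part of the patch. A member function of a class template is only instantiated when it is used, so a plain `final()` forward does not force every hasher to provide `final()`.

```cpp
#include <string>

// Hypothetical minimal builder: forwards final() without any enable_if.
template <typename HasherT> class Builder {
public:
  void update(unsigned char Byte) { Hasher.update(Byte); }

  // The body is only instantiated if some caller uses
  // Builder<HasherT>::final().
  std::string final() { return Hasher.final(); }

private:
  HasherT Hasher;
};

// A hasher that provides final().
struct SumHasher {
  unsigned Sum = 0;
  void update(unsigned char Byte) { Sum += Byte; }
  std::string final() { return std::to_string(Sum); }
};

// A hasher with no final() at all.
struct CountingHasher {
  unsigned Count = 0;
  void update(unsigned char) { ++Count; }
};

inline std::string hashWithFinal() {
  Builder<SumHasher> B;
  B.update(1);
  B.update(2);
  return B.final();
}

inline bool useWithoutFinal() {
  // Compiles even though CountingHasher has no final(): the forwarding
  // member is never called, so its body is never instantiated.
  Builder<CountingHasher> B;
  B.update(1);
  return true;
}
```

Calling `final()` on `Builder<CountingHasher>` would be a hard compile error rather than a substitution failure, which is the only semantic difference discussed above.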

Harbormaster completed remote builds in B121057: Diff 368482.Aug 24 2021, 4:01 PM

Ran through the address sanitizer and fixed stack-use-after-scope issues with ArrayRef.

Harbormaster completed remote builds in B121084: Diff 368518.Aug 24 2021, 6:43 PM

Test different variadic impl for Windows.

Harbormaster completed remote builds in B121224: Diff 368720.Aug 25 2021, 2:36 PM

Propagate variadic expansion workaround to the tuple overload.

arames edited the summary of this revision. Aug 25 2021, 4:14 PM

Harbormaster completed remote builds in B121245: Diff 368748.Aug 25 2021, 5:04 PM

Remove one of the workarounds for Windows.

Harbormaster completed remote builds in B121280: Diff 368788.Aug 25 2021, 7:16 PM

Finalize workaround for Windows.

As far as I can see, section 8.5.4 of the C++ spec guarantees the evaluation order of
the initializer-list in a braced-init-list. But this does not seem to hold with MSVC
19.28.29914.0. So simply use a recursive template instead.
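
A minimal sketch of the recursive approach, under stated assumptions: `Recorder` is a hypothetical class that logs its arguments instead of hashing them, and only the shape of the two overloads mirrors the variadic `add` in the diff. Each recursion step is an ordinary sequence of statements, so the order does not depend on how the compiler evaluates a braced-init-list.

```cpp
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical recorder standing in for a hash builder.
class Recorder {
public:
  // Base case: a single value.
  template <typename T> Recorder &add(const T &Value) {
    Log.push_back(std::to_string(Value));
    return *this;
  }

  // Recursive case: handle the first argument, then recurse on the rest.
  // The two statements run in source order, whatever the compiler does
  // with pack expansions inside initializer lists.
  template <typename T, typename... Ts>
  typename std::enable_if<(sizeof...(Ts) >= 1), Recorder &>::type
  add(const T &First, const Ts &...Rest) {
    add(First);
    add(Rest...);
    return *this;
  }

  const std::vector<std::string> &log() const { return Log; }

private:
  std::vector<std::string> Log;
};

inline std::vector<std::string> recordInOrder() {
  Recorder R;
  R.add(1, 2L, 3u);
  return R.log();
}
```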

Harbormaster completed remote builds in B121294: Diff 368802.Aug 25 2021, 10:08 PM

Closed by commit rG1076082a0d97: [Support]: Introduce the `HashBuilder` interface. (authored by arames). Aug 26 2021, 9:21 AM

This revision was automatically updated to reflect the committed changes.

arames added a commit: rG1076082a0d97: [Support]: Introduce the `HashBuilder` interface..

Revision Contents

Path	Size
llvm/include/llvm/Support/HashBuilder.h	404 lines
llvm/unittests/Support/CMakeLists.txt	1 line
llvm/unittests/Support/HashBuilderTest.cpp	336 lines

Diff 368908

llvm/include/llvm/Support/HashBuilder.h

This file was added.

//===- llvm/Support/HashBuilder.h - Convenient hashing interface-*- C++ -*-===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// This file implements an interface allowing to conveniently build hashes of

// various data types, without relying on the underlying hasher type to know

// about hashed data types.

//===----------------------------------------------------------------------===//

#ifndef LLVM_SUPPORT_HASHBUILDER_H

#define LLVM_SUPPORT_HASHBUILDER_H

#include "llvm/ADT/ArrayRef.h"

#include "llvm/ADT/STLExtras.h"

#include "llvm/ADT/StringRef.h"

#include "llvm/Support/Endian.h"

#include "llvm/Support/type_traits.h"

dexonsmithUnsubmitted

Done

You should be able to remove this once you're including StringRef.h


#include <iterator>

#include <utility>

namespace llvm {

/// Declares the hasher member, and functions forwarding directly to the hasher.

template <typename HasherT> class HashBuilderBase {

public:

HasherT &getHasher() { return Hasher; }

/// Forward to `HasherT::update(ArrayRef<uint8_t>)`.

///

/// This may not take the size of `Data` into account.

dexonsmithUnsubmitted

Done

Using CRTP blocks the compile-time speedup / code size wins. Can we skip HashBuilder/Endianness/EffectiveEndianness/Derived?


/// Users of this function should pay attention to respect endianness

/// constraints.

void update(ArrayRef<uint8_t> Data) { this->getHasher().update(Data); }

/// Forward to `HasherT::update(ArrayRef<uint8_t>)`.

///

/// This may not take the size of `Data` into account.

/// Users of this function should pay attention to respect endianness

/// constraints.

void update(StringRef Data) {

update(makeArrayRef(reinterpret_cast<const uint8_t *>(Data.data()),

Data.size()));

}

dexonsmithUnsubmitted

Done

I suggest dropping the return statement, for void update(ArrayRef), a simple passthrough to HasherT::update.

No need for the CRTP (faster compile-time / avoid code size bloat by factoring out a common instantiation for the base class)
The same functionality is available as HashBuilder& HashBuilder::addRangeElements, so there's no loss of convenience.


/// Forward to `HasherT::final()` if available.

template <typename HasherT_ = HasherT> StringRef final() {

return this->getHasher().final();

}

/// Forward to `HasherT::result()` if available.

dexonsmithUnsubmitted

Done

(this would also return void)


template <typename HasherT_ = HasherT> StringRef result() {

return this->getHasher().result();

}

protected:

explicit HashBuilderBase(HasherT &Hasher) : Hasher(Hasher) {}

template <typename... ArgTypes>

explicit HashBuilderBase(ArgTypes &&...Args)

: OptionalHasher(in_place, std::forward<ArgTypes>(Args)...),

Hasher(*OptionalHasher) {}

dexonsmithUnsubmitted

Not Done

Data.size()));

}

- template <typename T> using HasFinalT = decltype(std::declval<T &>().final());

/// Forward to `HasherT::final()` if available.

- template <typename HasherT_ = HasherT>

- std::enable_if_t<is_detected<HasFinalT, HasherT_>::value, HasFinalT<HasherT_>>

- final() {

- return this->getHasher().final();

- }

+ StringRef final() { return this->getHasher().final(); }

template <typename T>

I don't think we need enable_if or return type deduction here. It seems like overkill to make "does HashBuilder::final work?" SFINAE-detectable.


aramesAuthorUnsubmitted

Done

You may want to double-check my earlier comments.
I think there is a strong value in allowing trivial support for user-defined (or simply other) hashes. Here, that means:

not requiring support for HasherT::final()
not assuming the return type


dexonsmithUnsubmitted

Not Done

On not requiring HasherT::final, my suggested edit doesn't change that (unless I'm missing something?). The call to HasherT::final is template-dependent. Both before and after the edit, there will be no compile errors if there are no calls to HashBuilder<HasherT>::final, and a compile error if there is such a call. The only semantic difference is whether HashBuilder::final is SFINAE-detectable (compile error because there is no overload for HashBuilder::final vs. compile error because the overload for HashBuilder::final can't be instantiated), which doesn't seem important.

On the return type, I don't see the benefit of allowing flexibility here. I also don't see a tie-in to the argument about trivial support. And there is a cost: generic code that depends on HashBuilder::final will have to reason about other arbitrary return types. It's simpler for clients if this just returns StringRef.

If you still think there's strong value in either of those (which both seem unrelated to allowing use of trivial HasherT types), can you clarify what it is?


aramesAuthorUnsubmitted

Done

Both done. (will be updated in the next arc diff.)

I misunderstood your earlier comment, thinking you suggested to still require HasherT::final(). And I didn't think of simply relying on the template for the purpose.
I felt specializing the interface to follow particular hasher types might not be very wise. I could imagine a hash wanted to return say uint64_t. But since this is hypothetical, I buy the argument of just keeping things simple for now.


private:

Optional<HasherT> OptionalHasher;

HasherT &Hasher;

};

/// Implementation of the `HashBuilder` interface.

///

/// `support::endianness::native` is not supported. `HashBuilder` is

/// expected to canonicalize `support::endianness::native` to one of

dexonsmithUnsubmitted

Not Done

return this->getHasher().final();

}

- template <typename T>

- using HasResultT = decltype(std::declval<T &>().result());

/// Forward to `HasherT::result()` if available.

- template <typename HasherT_ = HasherT>

- std::enable_if_t<is_detected<HasResultT, HasherT_>::value,

- HasResultT<HasherT_>>

- result() {

- return this->getHasher().result();

- }

+ StringRef result() { return this->getHasher().result(); }

protected:

Same as final.


/// `support::endianness::big` or `support::endianness::little`.

template <typename HasherT, support::endianness Endianness>

class HashBuilderImpl : public HashBuilderBase<HasherT> {

static_assert(Endianness != support::endianness::native,

dexonsmithUnsubmitted

Done

I wonder if std::is_floating_point should be excluded.

Float values might generally be a bit iffy to include in a stable hash, because of all the floating point modes that can affect their values.
long double has a platform-dependent mantissa.

This would disable HashBuilder::add, forcing clients to think about what they actually want to do with it. WDYT?


aramesAuthorUnsubmitted

Done

That sounds reasonable to me.
Forcing the hash would still be easy via other scalar types.
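
As a rough sketch of what excluding floating point looks like in practice: the trait below is a simplified approximation built from standard type traits, not LLVM's `is_integral_or_enum`, and `floatBits` is a hypothetical helper showing how a client could still hash a `float` through an integer type when it really wants to. The expected bit pattern assumes IEEE 754 single precision.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Simplified approximation of the hashable-data trait: integers and enums
// only. Floating point types are deliberately left out.
template <typename T>
struct IsHashableData
    : std::integral_constant<bool, std::is_integral<T>::value ||
                                       std::is_enum<T>::value> {};

enum class Color : std::uint8_t { Red, Green };

static_assert(IsHashableData<int>::value, "integers are hashable");
static_assert(IsHashableData<Color>::value, "enums are hashable");
static_assert(!IsHashableData<float>::value, "float is excluded");
static_assert(!IsHashableData<long double>::value, "long double is excluded");

// Hypothetical explicit opt-in: copy the bits of a float into an integer,
// which can then go through the integer overload.
inline std::uint32_t floatBits(float Value) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "this sketch assumes a 32-bit float");
  std::uint32_t Bits;
  std::memcpy(&Bits, &Value, sizeof(Bits));
  return Bits;
}
```

The client makes the decision visible at the call site, which is the intent of disabling the direct overload.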


"HashBuilder should canonicalize endianness");

/// Trait to indicate whether a type's bits can be hashed directly (after

/// endianness correction).

template <typename U>

dexonsmithUnsubmitted

Done

I think these should be private, and Hasher can have a getHasher() accessor. This'll make it easier to make HashBuilder move/copy-friendly in the future if for some reason that's useful.


struct IsHashableData

: std::integral_constant<bool, is_integral_or_enum<U>::value> {};

public:

explicit HashBuilderImpl(HasherT &Hasher)

: HashBuilderBase<HasherT>(Hasher) {}

dexonsmithUnsubmitted

Not Done

Can this be simplified?

HashBuilder() : OptionalHasher(in_place), Hasher(*OptionalHasher) {}


aramesAuthorUnsubmitted

Done

See my top-level comment.
I think we should keep support for ctors with arguments (e.g. for a seed).


template <typename... ArgTypes>

dexonsmithUnsubmitted

Done

I think this comment should explain that support::endianness::native is not supported here but that it's handled by HashBuilder. Basically, explain the static assertion in text.


explicit HashBuilderImpl(ArgTypes &&...Args)

: HashBuilderBase<HasherT>(Args...) {}

dexonsmithUnsubmitted

Done

Please use makeArrayRef() to avoid splitting the pointer from the size.

Also, I wonder if this should take into account endianness.


aramesAuthorUnsubmitted

Done

Updated to use makeArrayRef().

Great point. I didn't pay attention to it, but taking it into account will significantly improve the value of this class.
Supporting endianness looks pretty straightforward thanks to Support/Endian.h. Please take a look.
A couple of questions:

I kept the default as native instead of little to avoid penalizing big-endian platforms. On the other hand, defaulting to a stable hash may be safer. What do you think?
To add support for user types, we end up with

template <typename HasherT, support::endianness Endianness>
void updateHash(HashBuilder<HasherT, Endianness> &HBuilder, const UserType &Value)

I think that is ok. Do you prefer "reserving" the updateHash name and going with

template <typename HashBuilderT>
void updateHash(HashBuilderT &HBuilder, const UserType &Value)


dexonsmithUnsubmitted

Done

Just seeing this question.

I don't think it's necessary to default to a stable hash. Clients can do that when it's useful. But if you're uncomfortable with unstable-by-default, another option is not to have a default, and then the client will need to spell out support::native, making it more clear that the hash is unstable. (It's easier to add a default later, but changing the default might not be easy.)
I think the explicit parameter for Endianness makes it clear to authors they should be thinking about Endianness.
There should be a name change to align with add. addToHash?


aramesAuthorUnsubmitted

Done

I'll drop the default value for Endianness. The template is simple enough for the rare use-cases, and simple enough to wrap if the lack of default is ever a problem for a user.
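
To make the stable versus unstable distinction concrete, here is a small sketch under stated assumptions. `ByteSink` is a hypothetical class, not the patch's implementation, and it uses shifts instead of `support::endian::byte_swap`. With an explicit byte order, the bytes handed to the hasher are the same on every host; with native order they would follow the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

enum class Endianness { Big, Little };

// Hypothetical byte sink that serializes unsigned integers in a fixed byte
// order chosen at compile time, independent of the host.
template <Endianness E> class ByteSink {
public:
  template <typename T> ByteSink &add(T Value) {
    static_assert(std::is_unsigned<T>::value,
                  "this sketch handles unsigned integers only");
    for (std::size_t I = 0; I != sizeof(T); ++I) {
      // Little-endian emits the least significant byte first, big-endian
      // the most significant byte first.
      std::size_t Shift = (E == Endianness::Little) ? I : sizeof(T) - 1 - I;
      Bytes.push_back(static_cast<std::uint8_t>(Value >> (8 * Shift)));
    }
    return *this;
  }

  const std::vector<std::uint8_t> &bytes() const { return Bytes; }

private:
  std::vector<std::uint8_t> Bytes;
};

inline std::vector<std::uint8_t> encodeLittle(std::uint32_t Value) {
  ByteSink<Endianness::Little> Sink;
  Sink.add(Value);
  return Sink.bytes();
}

inline std::vector<std::uint8_t> encodeBig(std::uint32_t Value) {
  ByteSink<Endianness::Big> Sink;
  Sink.add(Value);
  return Sink.bytes();
}
```

Requiring the template argument, as decided above, makes each client state which of these behaviors it wants.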


/// Implement hashing for hashable data types, e.g. integral or enum values.

template <typename T>

std::enable_if_t<IsHashableData<T>::value, HashBuilderImpl &> add(T Value) {

return adjustForEndiannessAndAdd(Value);

dexonsmithUnsubmitted

Not Done

I made some comments on the patch this one depends on (https://reviews.llvm.org/D107781 -- FYI, I linked them using the add parent/child revision button above).

I suggest the following to better align with the hashers themselves:

Remove the template / enable_if. Require hashers to provide this.
Return StringRef. Expect hashers to do the same.
Add a result() method also returning StringRef, matching the one from hashers.

I also think an analogue for the static hash() that the hashers have would be useful, at least in the tests. Maybe:

using HashArrayT = decltype(HasherT::hash(ArrayRef<uint8_t>()));
static HashArrayT toArrayImpl(StringRef HashString) {
  assert(HashString.size() == sizeof(HashArrayT));
  HashArrayT HashArray;
  llvm::copy(HashString, HashArray.begin());
  return HashArray;
}

HashArrayT finalHash() { return toArrayImpl(Hasher.final()); }
HashArrayT resultHash() { return toArrayImpl(Hasher.result()); }

I *don't* think there should be a HashBuilder::hash() function (static or otherwise). Forwarding to HashBuilder::add() would be confusing / error-prone, since then HashBuilder<HasherT>::hash(ArrayRef<uint8_t>) would do something different from HasherT::hash(). Forwarding to HasherT::hash would also be confusing, and doesn't really add any value since clients can already do that themselves.


aramesAuthorUnsubmitted

Done

Added a forward to result().

Per my top-level reply, I think we should keep the final() and result() forwards conditional on their existence in HasherT, to keep the use of this interface with a custom hasher type trivial.

I am not convinced that the finalHash() and resultHash() helpers should be members of HashBuilder. I may be missing context on how it is used, so feel free to push if you think the use-case is really valuable.
I would assume that the return type of HasherT::final() is appropriate to represent the hash. It seems bothersome to have to provide these helpers in the interface, making additional assumptions (or more SFINAE code) about the hasher type.

Looking at other related comments, I think a lot of this discussion stems from the fact that MD5, SHA1, and SHA256 final() methods return a reference (StringRef).
It seems to me this discussion would be simplified if there was a method like finalByValue().

My suggestion would be to keep relying on the return type for final(), and have "type converting" wrappers implemented outside of HasherT or HashBuilder for now.
If you like the finalByValue() idea, I could do it in a successor patch.

On this topic, I find the design of the ::hash functions ties into the issue. Being static, they return a result by value. Though both strings and array<uint8_t> are of course pretty close, I find std::array is a bit of a surprising choice when final() and result() return StringRef. So the question will be: what should be the return type for finalByValue() for these hashes? std::array like hash(), or a SmallString maybe.


}

/// Support hashing `ArrayRef`.

///

dexonsmithUnsubmitted

Done

Using CRTP in HashBuilderImpl also increases compile-time / code size since it blocks sharing code between "native" and whatever the system endianness is. I suggest dropping CRTP and returning HashBuilderImpl& instead of HashBuilder& from add*. (If this needed CRTP, which I don't think it does, I'd suggest using an opaque DerivedT to avoid passing through details like EffectiveEndianness.)


aramesAuthorUnsubmitted

Done

Removed CRTP.


/// `Value.size()` is taken into account to ensure cases like

/// ```

/// builder.add({1});

/// builder.add({2, 3});

/// ```

dexonsmithUnsubmitted

Done

I wonder if the body of this should be exposed as a general helper for clients, using a weaker std::enable_if check. I.e.:

Keep the signature and std::enable_if check for this overload of add.
Move the body to a new function, called something like addRawBytes, which takes a const T& and has a weaker std::enable_if check (maybe std::is_primitive? or what does endian::byte_swap support?).
Change add to call the new function.

WDYT?


aramesAuthorUnsubmitted

Done

Sounds good.

Done, with enablement automatically whenever the type is handled by support::endian::byte_swap().


/// and

/// ```

/// builder.add({1, 2});

/// builder.add({3});

/// ```

/// do not collide.

template <typename T> HashBuilderImpl &add(ArrayRef<T> Value) {

// As of implementation time, simply calling `addRange(Value)` would also go

// through the `update` fast path. But that would rely on the implementation

// details of `ArrayRef::begin()` and `ArrayRef::end()`. Explicitly call

// `update` to guarantee the fast path.

add(Value.size());

if (IsHashableData<T>::value &&

Endianness == support::endian::system_endianness()) {

this->update(

makeArrayRef(reinterpret_cast<const uint8_t *>(Value.begin()),

Value.size() * sizeof(T)));

} else {

for (auto &V : Value)

add(V);

}

return *this;

}
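
The collision that the size prefix avoids can be reproduced with a short sketch. This is an illustration only: `Bytes`, `appendElements`, `appendSized` and the two `encode*` helpers are hypothetical, and the size is stored in a single byte for brevity, whereas the code above hashes `Value.size()` as a full integer.

```cpp
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append the elements only, without recording how many there are.
inline void appendElements(Bytes &Out, const Bytes &Range) {
  Out.insert(Out.end(), Range.begin(), Range.end());
}

// Append the size first, then the elements.
// Assumption: one byte is enough for the size in this sketch.
inline void appendSized(Bytes &Out, const Bytes &Range) {
  Out.push_back(static_cast<std::uint8_t>(Range.size()));
  appendElements(Out, Range);
}

// Byte stream a hasher would see for two ranges added back to back,
// without size prefixes.
inline Bytes encodeUnsized(const Bytes &A, const Bytes &B) {
  Bytes Out;
  appendElements(Out, A);
  appendElements(Out, B);
  return Out;
}

// Same, with each range prefixed by its size.
inline Bytes encodeSized(const Bytes &A, const Bytes &B) {
  Bytes Out;
  appendSized(Out, A);
  appendSized(Out, B);
  return Out;
}
```

Without the prefix, `{1}` followed by `{2, 3}` and `{1, 2}` followed by `{3}` both produce the bytes `1 2 3`, so any hasher would return the same digest for both. With the prefix the streams are `1 1 2 2 3` and `2 1 2 1 3`.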

/// Support hashing `StringRef`.

///

/// `Value.size()` is taken into account to ensure cases like

/// ```

/// builder.add("a");

/// builder.add("bc");

/// ```

/// and

dexonsmithUnsubmitted

Done

StringRef would be more generally useful; I suggest that instead of std::string. (I don't think we usually bother handling other character types in LLVM... I suggest leaving it out, unless/until you have a specific use case that needs it?)


aramesAuthorUnsubmitted

Done

Nah; this was to have the generic form. Leaving it out.


/// ```

/// builder.add("ab");

/// builder.add("c");

/// ```

/// do not collide.

HashBuilderImpl &add(StringRef Value) {

// As of implementation time, simply calling `addRange(Value)` would also go

dexonsmithUnsubmitted

Done

Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a pair.


aramesAuthorUnsubmitted

Done

Absolutely. Fixed, with a test.


// through `update`. But that would rely on the implementation of

// `StringRef::begin()` and `StringRef::end()`. Explicitly call `update` to

// guarantee the fast path.

add(Value.size());

this->update(makeArrayRef(reinterpret_cast<const uint8_t *>(Value.begin()),

Value.size()));

dexonsmithUnsubmitted

Done

Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a tuple.


aramesAuthorUnsubmitted

Done

Absolutely. Fixed with a test.


return *this;

}

template <typename T>

using HasAddHashT =

decltype(addHash(std::declval<HashBuilderImpl &>(), std::declval<T &>()));

/// Implement hashing for user-defined `struct`s.

///

/// Any user-defined `struct` can participate in hashing via `HashBuilder` by

/// providing an `addHash` templated function.

///

/// ```

/// template <typename HasherT, support::endianness Endianness>

/// void addHash(HashBuilder<HasherT, Endianness> &HBuilder,

dexonsmithUnsubmitted

Done

Neat, I hadn't seen this pattern before. Can the variable name be skipped?

(void)std::tuple<const Ts &...>{(update(Args), Args)...};

If not, please add (void)Unused; to suppress diagnostics.


aramesAuthorUnsubmitted

Done

It can be skipped indeed.


/// const UserDefinedStruct &Value);

/// ```

///

/// For example:

/// ```

/// struct SimpleStruct {

dexonsmithUnsubmitted

Done

I suggest adding a single-element overload that takes a generic range and calls llvm::adl_begin() and llvm::adl_end() to extract iterators.

Also, this isn't going to work for any old input iterator. I suggest:

Name the parameter "ForwardIteratorT"
Add a test that confirms std::list iterators will work.
Add an enable_if to get a nice compile error for non-forward iterators.

Another option is to use random-access iterators and leave it for a future patch to handle forward iterators.


aramesAuthorUnsubmitted

Done

The update should fit what you want, but please take a second look.

Added an overload for a generic range using adl_begin() and adl_end().
Cleaned-up updateRange and updateRangeImpl, with a test for std::list. The check for forward iterator does not use enable_if, but should be good enough.

Interestingly, with all this, implementations of update for ArrayRef and StringRef can simply forward to updateRange.
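
One way to reject non-forward iterators without `enable_if` is a `static_assert` on the iterator category. The sketch below is an assumption about the general approach, not the check used in the patch: `collectRange` is a hypothetical function that records the size and the elements instead of hashing them, and it uses `std::begin`/`std::end` where the patch uses `adl_begin`/`adl_end`.

```cpp
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// Hypothetical analogue of a sized range add: records the distance, then
// each element. The range is traversed twice (std::distance, then the
// loop), which is why single-pass input iterators are not acceptable.
template <typename ForwardIteratorT>
std::vector<long long> collectRange(ForwardIteratorT First,
                                    ForwardIteratorT Last) {
  using Category =
      typename std::iterator_traits<ForwardIteratorT>::iterator_category;
  static_assert(std::is_base_of<std::forward_iterator_tag, Category>::value,
                "collectRange requires forward iterators");

  std::vector<long long> Out;
  Out.push_back(static_cast<long long>(std::distance(First, Last)));
  for (; First != Last; ++First)
    Out.push_back(*First);
  return Out;
}

// Generic range overload that extracts the iterators itself.
template <typename RangeT>
std::vector<long long> collectRange(const RangeT &Range) {
  return collectRange(std::begin(Range), std::end(Range));
}

inline std::vector<long long> collectFromList() {
  // std::list iterators are bidirectional, hence also forward iterators.
  std::list<int> Values{4, 5, 6};
  return collectRange(Values);
}
```

Passing a `std::istream_iterator` would trip the `static_assert` with a readable message instead of silently consuming the stream during `std::distance`.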


/// char c;

/// int i;

/// };

///

dexonsmithUnsubmitted

Done

I suggest taking / using an ArrayRef<uint8_t> here. It'd be a nice convenience to add an overload for updateBytes(StringRef) as well -- something I've wanted a few times when using the Hasher interfaces with MemoryBuffer::getBuffer() -- which can just cast over to ArrayRef<uint8_t>.


aramesAuthorUnsubmitted

Done

Sure thing. I can imagine the three overloads to be helpful in different contexts.


/// template <typename HasherT, support::endianness Endianness>

/// void addHash(HashBuilderImpl<HasherT, Endianness> &HBuilder,

/// const SimpleStruct &Value) {

/// HBuilder.add(Value.c);

/// HBuilder.add(Value.i);

/// }

/// ```

///

/// To avoid endianness issues, specializations of `addHash` should

/// generally rely on existing `add`, `addRange`, and `addRangeElements`

/// functions. If directly using `update`, an implementation must correctly

/// handle endianness.

///

/// ```

/// struct __attribute__ ((packed)) StructWithFastHash {

/// int I;

/// char C;

///

/// // If possible, we want to hash both `I` and `C` in a single

dexonsmithUnsubmitted

Done

I think you can drop the name here too.


/// // `update` call for performance concerns.

/// template <typename HasherT, support::endianness Endianness>

/// friend void addHash(HashBuilderImpl<HasherT, Endianness> &HBuilder,

dexonsmithUnsubmitted

Done

I think this will have to be a std::distance call if you handle arbitrary forward iterators (not sure that's necessary). The template parameters could use an update in either case.


/// const StructWithFastHash &Value) {

/// if (Endianness == support::endian::system_endianness()) {

/// HBuilder.update(makeArrayRef(

/// reinterpret_cast<const uint8_t *>(&Value), sizeof(Value)));

/// } else {

/// // Rely on existing `add` methods to handle endianness.

/// HBuilder.add(Value.I);

/// HBuilder.add(Value.C);

/// }

dexonsmithUnsubmitted

Done

I think it's important to have updateRange overloads for ArrayRef (at least for uint8_t) and StringRef that directly call updateBytes rather than iterating.


aramesAuthorUnsubmitted

Done

Both ArrayRef and StringRef do end up calling the version of updateRangeImpl with updateBytes, due to how their begin and end are declared.
Do you mean to do it explicitly, in case the ArrayRef implementation changes?


dexonsmithUnsubmitted

Done

Both ArrayRef and StringRef do end up calling the version of updateRangeImpl with updateBytes, due to how their begin and end are declared.
Do you mean to do it explicitly, in case the ArrayRef implementation changes?

Just seeing this question. Two reasons to do it explicitly:

avoid compile-time overhead for this common case
make it obvious what happens for anyone reading the code (as a common case, people are likely to care that it's doing the right thing)


aramesAuthorUnsubmitted

Done

Done, with comments.


/// }

/// };

/// ```

///

/// To avoid collisions, specializations of `addHash` for variable-size

/// types must take the size into account.

///

/// For example:

/// ```

/// struct CustomContainer {

/// private:

/// size_t Size;

/// int Elements[100];

///

/// public:

/// CustomContainer(size_t Size) : Size(Size) {

/// for (size_t I = 0; I != Size; ++I)

dexonsmithUnsubmitted

Done

I don't think we should add this overload. The rare call site can call makeArrayRef itself, encouraging it to push ArrayRef further up the stack.


/// Elements[I] = I;

/// }

/// template <typename HasherT, support::endianness Endianness>

/// friend void addHash(HashBuilderImpl<HasherT, Endianness> &HBuilder,

/// const CustomContainer &Value) {

/// if (Endianness == support::endian::system_endianness()) {

/// HBuilder.update(makeArrayRef(

/// reinterpret_cast<const uint8_t *>(&Value.Size),

/// sizeof(Value.Size) + Value.Size * sizeof(Value.Elements[0])));

/// } else {

/// // `addRange` will take care of encoding the size.

/// HBuilder.addRange(&Value.Elements[0], &Value.Elements[0] +

/// Value.Size);

/// }

/// }

/// };

/// ```

template <typename T>

std::enable_if_t<is_detected<HasAddHashT, T>::value &&

!IsHashableData<T>::value,

HashBuilderImpl &>

add(const T &Value) {

addHash(*this, Value);

return *this;

}

template <typename T1, typename T2>

HashBuilderImpl &add(const std::pair<T1, T2> &Value) {

add(Value.first);

add(Value.second);

return *this;

}

template <typename... Ts> HashBuilderImpl &add(const std::tuple<Ts...> &Arg) {

return addTupleHelper(Arg, typename std::index_sequence_for<Ts...>());

}

/// A convenience variadic helper.

/// It simply iterates over its arguments, in order.

/// ```

/// add(Arg1, Arg2);

/// ```

/// is equivalent to

/// ```

/// add(Arg1);

/// add(Arg2);

/// ```

template <typename T, typename... Ts>

typename std::enable_if<(sizeof...(Ts) >= 1), HashBuilderImpl &>::type

add(const T &FirstArg, const Ts &...Args) {

add(FirstArg);

add(Args...);

return *this;

dexonsmithUnsubmitted

Done

I think there's something off with the naming. updateRange sounds like it is mutating the range itself, rather than updating the hash. Could add a preposition, or rename it to add.

}

template <typename ForwardIteratorT>

HashBuilderImpl &addRange(ForwardIteratorT First, ForwardIteratorT Last) {

add(std::distance(First, Last));

return addRangeElements(First, Last);

}

dexonsmithUnsubmitted

Done

Similar naming problem here (I'm pretty sure it's my fault), but worse, since I now realize there's nothing inherent about the word "bytes" that implies the size of the range is skipped. Someone could have a container of bytes, and expect that this encoded the container.

Stepping back, I feel like this function -- skipping the size of a range -- could be more generally useful to clients. I wonder if it should be exposed for generic ranges as well, keeping specializations for ArrayRef<uint8_t> and StringRef. Name could be addElements or addRangeElements or something?

aramesAuthorUnsubmitted

Done

I like it. Done.
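
The difference between the two entry points discussed here can be sketched outside of LLVM. The following is a minimal illustration, not the patch's implementation: `RecordingHasher`, `addValue`, and `hashTwoRanges` are hypothetical names, and the "hasher" simply records the bytes it receives so that equality of the results is exact rather than probabilistic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a hasher: it records every byte it is fed, so
// two "hashes" are equal exactly when the byte streams are equal.
struct RecordingHasher {
  std::vector<uint8_t> Bytes;
  void update(const uint8_t *Data, size_t Size) {
    Bytes.insert(Bytes.end(), Data, Data + Size);
  }
};

// Feed the object representation of one trivially copyable value.
template <typename T> void addValue(RecordingHasher &H, const T &Value) {
  H.update(reinterpret_cast<const uint8_t *>(&Value), sizeof(Value));
}

// Elements only: nothing records where the range ends.
void addRangeElements(RecordingHasher &H, const std::vector<int> &Range) {
  for (int V : Range)
    addValue(H, V);
}

// Size first, then the elements: the boundary becomes part of the input.
void addRange(RecordingHasher &H, const std::vector<int> &Range) {
  addValue(H, Range.size());
  addRangeElements(H, Range);
}

// Hash two consecutive ranges, with or without the size prefix.
std::vector<uint8_t> hashTwoRanges(const std::vector<int> &A,
                                   const std::vector<int> &B, bool WithSize) {
  RecordingHasher H;
  if (WithSize) {
    addRange(H, A);
    addRange(H, B);
  } else {
    addRangeElements(H, A);
    addRangeElements(H, B);
  }
  return H.Bytes;
}
```

Without the size prefix, `{1}` followed by `{2, 3}` feeds the same bytes as `{1, 2}` followed by `{3}`; with the prefix the two inputs differ, which is the collision the header comment warns about.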

template <typename RangeT> HashBuilderImpl &addRange(const RangeT &Range) {

return addRange(adl_begin(Range), adl_end(Range));

}

template <typename ForwardIteratorT>

HashBuilderImpl &addRangeElements(ForwardIteratorT First,

ForwardIteratorT Last) {

return addRangeElementsImpl(

First, Last,

typename std::iterator_traits<ForwardIteratorT>::iterator_category());

}

template <typename RangeT>

HashBuilderImpl &addRangeElements(const RangeT &Range) {

return addRangeElements(adl_begin(Range), adl_end(Range));

}

template <typename T>

using HasByteSwapT = decltype(support::endian::byte_swap(

std::declval<T &>(), support::endianness::little));

/// Adjust `Value` for the target endianness and add it to the hash.

template <typename T>

std::enable_if_t<is_detected<HasByteSwapT, T>::value, HashBuilderImpl &>

adjustForEndiannessAndAdd(const T &Value) {

T SwappedValue = support::endian::byte_swap(Value, Endianness);

this->update(makeArrayRef(reinterpret_cast<const uint8_t *>(&SwappedValue),

sizeof(SwappedValue)));

return *this;

}

dexonsmithUnsubmitted

Done

This can call addRangeElements().

aramesAuthorUnsubmitted

Done

This has become addRangeElementsImpl. Now, addRange simply does add(size); addRangeElements(...);.

private:

template <typename... Ts, std::size_t... Indices>

HashBuilderImpl &addTupleHelper(const std::tuple<Ts...> &Arg,

std::index_sequence<Indices...>) {

add(std::get<Indices>(Arg)...);

return *this;

}

dexonsmithUnsubmitted

Done

This overload can be dropped, but addRangeElements would want something similar.

aramesAuthorUnsubmitted

Done

This has become addRangeElementsImpl.

// FIXME: Once available, specialize this function for `contiguous_iterator`s,

// and use it for `ArrayRef` and `StringRef`.

template <typename ForwardIteratorT>

HashBuilderImpl &addRangeElementsImpl(ForwardIteratorT First,

ForwardIteratorT Last,

std::forward_iterator_tag) {

for (auto It = First; It != Last; ++It)

add(*It);

return *this;

}

template <typename T>

std::enable_if_t<IsHashableData<T>::value &&

Endianness == support::endian::system_endianness(),

HashBuilderImpl &>

addRangeElementsImpl(T *First, T *Last, std::forward_iterator_tag) {

this->update(makeArrayRef(reinterpret_cast<const uint8_t *>(First),

(Last - First) * sizeof(T)));

return *this;

}

};

/// Interface to help hash various types through a hasher type.

///

/// Via provided specializations of `add`, `addRange`, and `addRangeElements`

/// functions, various types (e.g. `ArrayRef`, `StringRef`, etc.) can be hashed

/// without requiring any knowledge of hashed types from the hasher type.

///

/// The only method expected from the templated hasher type `HasherT` is:

/// * void update(ArrayRef<uint8_t> Data)

///

/// Additionally, the following methods will be forwarded to the hasher type:

/// * decltype(std::declval<HasherT &>().final()) final()

/// * decltype(std::declval<HasherT &>().result()) result()

///

/// From a user point of view, the interface provides the following:

/// * `template<typename T> add(const T &Value)`

/// The `add` function implements hashing of various types.

/// * `template <typename ItT> void addRange(ItT First, ItT Last)`

/// The `addRange` function is designed to aid hashing a range of values.

/// It explicitly adds the size of the range in the hash.

/// * `template <typename ItT> void addRangeElements(ItT First, ItT Last)`

/// The `addRangeElements` function is also designed to aid hashing a range of

/// values. In contrast to `addRange`, it **ignores** the size of the range,

/// behaving as if elements were added one at a time with `add`.

///

/// User-defined `struct` types can participate in this interface by providing

/// an `addHash` templated function. See the associated template specialization

/// for details.

///

/// This interface does not impose requirements on the hasher

/// `update(ArrayRef<uint8_t> Data)` method. We want to avoid collisions for

/// variable-size types; for example for

/// ```

/// builder.add({1});

/// builder.add({2, 3});

/// ```

/// and

/// ```

/// builder.add({1, 2});

/// builder.add({3});

/// ```

/// . Thus, specializations of `add` and `addHash` for variable-size types must

/// not assume that the hasher type considers the size as part of the hash; they

/// must explicitly add the size to the hash. See for example specializations

/// for `ArrayRef` and `StringRef`.

///

/// Additionally, since types are eventually forwarded to the hasher's

/// `void update(ArrayRef<uint8_t>)` method, endianness plays a role in the hash

/// computation (for example when computing `add((int)123)`).

/// Specifying a non-`native` `Endianness` template parameter allows computing

/// a stable hash across platforms with different endianness.

template <class HasherT, support::endianness Endianness>

using HashBuilder =

HashBuilderImpl<HasherT, (Endianness == support::endianness::native

? support::endian::system_endianness()

: Endianness)>;

} // end namespace llvm

#endif // LLVM_SUPPORT_HASHBUILDER_H

dexonsmithUnsubmitted

Done

Given that add* will return HashBuilderImpl&, might be reasonable to reduce this to a typedef:

template <class HasherT, support::endianness Endianness>
using HashBuilder =
    HashBuilderImpl<HasherT,
                    (Endianness == support::endianness::native
                         ? support::endian::system_endianness()
                         : Endianness)>;

Either way seems fine though.
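
As a rough illustration of what such an alias buys, the sketch below resolves `native` at compile time so the implementation template only ever sees `big` or `little`. All names (`Endianness`, `SystemEndianness`, `BuilderImpl`, `Builder`, `DummyHasher`) are hypothetical stand-ins for this example, and the byte-order detection assumes GCC/Clang-style predefined macros.

```cpp
#include <cassert>
#include <type_traits>

enum class Endianness { big, little, native };

// Assumption: GCC/Clang-style byte-order macros; anything else is treated as
// little-endian in this sketch.
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
constexpr Endianness SystemEndianness = Endianness::big;
#else
constexpr Endianness SystemEndianness = Endianness::little;
#endif

// The implementation only ever sees `big` or `little`.
template <typename HasherT, Endianness E> struct BuilderImpl {
  static_assert(E != Endianness::native, "native must be resolved first");
  static constexpr Endianness Order = E;
};

// The alias resolves `native` before instantiating the implementation.
template <typename HasherT, Endianness E>
using Builder =
    BuilderImpl<HasherT, (E == Endianness::native ? SystemEndianness : E)>;

struct DummyHasher {}; // Placeholder hasher type for the illustration.

// True when `native` and the resolved system order name the same type.
constexpr bool nativeAliasesSystemOrder() {
  return std::is_same<Builder<DummyHasher, Endianness::native>,
                      Builder<DummyHasher, SystemEndianness>>::value;
}
```

Because the alias and the explicit system order name the same instantiation, functions taking the implementation type by reference accept either spelling.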

llvm/unittests/Support/CMakeLists.txt

add_llvm_unittest(SupportTests
  ErrorTest.cpp
  ExtensibleRTTITest.cpp
  FileCollectorTest.cpp
  FileOutputBufferTest.cpp
  FileUtilitiesTest.cpp
  FormatVariadicTest.cpp
  FSUniqueIDTest.cpp
  GlobPatternTest.cpp
+ HashBuilderTest.cpp
  Host.cpp
  IndexedAccessorTest.cpp
  InstructionCostTest.cpp
  ItaniumManglingCanonicalizerTest.cpp
  JSONTest.cpp
  KnownBitsTest.cpp
  LEB128Test.cpp
  LinearPolyBaseTest.cpp

llvm/unittests/Support/HashBuilderTest.cpp

This file was added.

//===- llvm/unittest/Support/HashBuilderTest.cpp - HashBuilder unit tests -===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

#include "llvm/Support/HashBuilder.h"

#include "llvm/ADT/ArrayRef.h"

#include "llvm/Support/MD5.h"

#include "llvm/Support/SHA1.h"

#include "llvm/Support/SHA256.h"

#include "gtest/gtest.h"

#include <list>

#include <string>

#include <type_traits>

#include <utility>

#include <vector>

// gtest utilities and macros rely on using a single type. So wrap both the

// hasher type and endianness.

template <typename _HasherT, llvm::support::endianness _Endianness>

struct HasherTAndEndianness {

using HasherT = _HasherT;

static constexpr llvm::support::endianness Endianness = _Endianness;

};

using HasherTAndEndiannessToTest =

::testing::Types<HasherTAndEndianness<llvm::MD5, llvm::support::big>,

HasherTAndEndianness<llvm::MD5, llvm::support::little>,

HasherTAndEndianness<llvm::MD5, llvm::support::native>,

HasherTAndEndianness<llvm::SHA1, llvm::support::big>,

HasherTAndEndianness<llvm::SHA1, llvm::support::little>,

HasherTAndEndianness<llvm::SHA1, llvm::support::native>,

HasherTAndEndianness<llvm::SHA256, llvm::support::big>,

HasherTAndEndianness<llvm::SHA256, llvm::support::little>,

dexonsmithUnsubmitted

Done

Can MD5 be updated in a prep patch to have an overload of final like this so this bridge code isn't needed?

aramesAuthorUnsubmitted

Done

https://reviews.llvm.org/D107781

This PR will have to be rebased once the prep is reviewed and lands.

HasherTAndEndianness<llvm::SHA256, llvm::support::native>>;

template <typename HasherT> class HashBuilderTest : public testing::Test {};

TYPED_TEST_SUITE(HashBuilderTest, HasherTAndEndiannessToTest);

template <typename HasherTAndEndianness>

using HashBuilder = llvm::HashBuilder<typename HasherTAndEndianness::HasherT,

HasherTAndEndianness::Endianness>;

dexonsmithUnsubmitted

Done

Seeing this boilerplate makes me wonder if the builder API could be improved with something like:

Add an Optional<HasherT> data member and a default constructor for HashBuilder that constructs it.
Make update* return HashBuilder&.
Add a HashBuilder::final() that returns HasherT::final() (after adding it to MD5 as above).

Then the above logic becomes a one-liner:

HashBuilder<HasherT>().update(Args...).final()

(maybe still worth encapsulating in a function here for testing, but maybe not; not sure)
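
The shape being proposed (an owned hasher, `add` returning the builder, and a forwarded `final()`) could look roughly like the sketch below. This is not the patch's code: `ByteSumHasher` and `ChainingBuilder` are hypothetical, and the builder only accepts byte-sized values to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical hasher: sums the input bytes modulo 256.
struct ByteSumHasher {
  uint8_t Sum = 0;
  void update(const uint8_t *Data, size_t Size) {
    for (size_t I = 0; I != Size; ++I)
      Sum = static_cast<uint8_t>(Sum + Data[I]);
  }
  uint8_t final() const { return Sum; }
};

template <typename HasherT> class ChainingBuilder {
  HasherT Hasher; // Owned, default-constructed hasher.

public:
  // Base case: hash a single byte-sized value and return the builder.
  ChainingBuilder &add(uint8_t Value) {
    Hasher.update(&Value, 1);
    return *this;
  }

  // Variadic helper: adds each argument in order, then returns the builder.
  template <typename... Ts> ChainingBuilder &add(uint8_t First, Ts... Rest) {
    add(First);
    return add(Rest...);
  }

  // Forward `final()` to the hasher, whatever its return type is.
  auto final() -> decltype(std::declval<HasherT &>().final()) {
    return Hasher.final();
  }
};
```

With this shape a whole hash computation fits in one expression, for example `ChainingBuilder<ByteSumHasher>().add(1, 2, 3).final()`.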

aramesAuthorUnsubmitted

Done

Done. That makes it more usable indeed.
I kept the helper to keep the tests concise.

dexonsmithUnsubmitted

Done

If you take my suggestion to add finalHash, computeHash can be further simplified to:

return HashBuilder<HasherTAndEndianness>().add(Args...).finalHash();

removing the static cast.

IMO, inlining it would improve clarity, but renaming it to hashWithBuilder would also help.

aramesAuthorUnsubmitted

Done

Per my other comment, I am not convinced by the finalHash() method.
Renamed to hashWithBuilder.

template <typename HasherTAndEndianness, typename... Ts>

static std::string hashWithBuilder(const Ts &...Args) {

return HashBuilder<HasherTAndEndianness>().add(Args...).final().str();

}

template <typename HasherTAndEndianness, typename... Ts>

dexonsmithUnsubmitted

Done

HasherTAndEndianness::Endianness>;

template <typename HasherTAndEndianness, typename... Ts>

- static auto hashWithBuilder(const Ts &...Args) {

- return static_cast<std::string>(

- HashBuilder<HasherTAndEndianness>().add(Args...).final());

+ static std::string hashWithBuilder(const Ts &...Args) {

+ return HashBuilder<HasherTAndEndianness>().add(Args...).final().str();

}

template <typename HasherTAndEndianness, typename... Ts>

I'd skip auto type deduction unless the type is hard to spell. Also you can use StringRef::str here.

static std::string hashRangeWithBuilder(const Ts &...Args) {

dexonsmithUnsubmitted

Done

Lots of "invalid case style" warnings coming from clang-tidy in this file. Please follow the style conventions at https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (e.g., use C for this variable instead of c), if nothing else to make the patch easier to read.

return HashBuilder<HasherTAndEndianness>().addRange(Args...).final().str();

}

// All the test infrastructure relies on the variadic helpers. Test them first.

TYPED_TEST(HashBuilderTest, VariadicHelpers) {

dexonsmithUnsubmitted

Done

(Same comments as computeHash: drop std::string, and simplify or rename)

aramesAuthorUnsubmitted

Done

Renamed to hashRangeWithBuilder.

dexonsmithUnsubmitted

Done

template <typename HasherTAndEndianness, typename... Ts>

- static auto hashRangeWithBuilder(const Ts &...Args) {

- return static_cast<std::string>(

- HashBuilder<HasherTAndEndianness>().addRange(Args...).final());

+ static std::string hashRangeWithBuilder(const Ts &...Args) {

+ return HashBuilder<HasherTAndEndianness>().addRange(Args...).final().str();

}

// All the test infrastructure relies on the variadic helpers. Test them first.

(same as above)

{

HashBuilder<TypeParam> HBuilder;

HBuilder.add(100);

HBuilder.add('c');

HBuilder.add("string");

EXPECT_EQ(HBuilder.final(), hashWithBuilder<TypeParam>(100, 'c', "string"));

}

{

HashBuilder<TypeParam> HBuilder;

std::vector<int> Vec{100, 101, 102};

HBuilder.addRange(Vec);

EXPECT_EQ(HBuilder.final(), hashRangeWithBuilder<TypeParam>(Vec));

}

dexonsmithUnsubmitted

Done

Rather than duplicating the tests, I suggest using http://google.github.io/googletest/advanced.html#typed-tests to iterate through MD5, SHA1, and SHA256.

{

HashBuilder<TypeParam> HBuilder;

std::vector<int> Vec{200, 201, 202};

HBuilder.addRange(Vec.begin(), Vec.end());

EXPECT_EQ(HBuilder.final(),

dexonsmithUnsubmitted

Done

It's not ideal to need to spell this out for each HasherT, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:

Unify the HasherT interfaces as necessary in prep commits to allow using them generically (maybe my other suggestions above are sufficient for that?).
Test each of these types in a standalone test in a unified HashBuilderTest below, dropping BasicTest/etc.
Add a helper function to the test fixture below to avoid boilerplate, something like:

template <class T> void checkHashableData(T Data) {
  HasherT RawHasher;
  auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...);
  RawHasher.update(Bytes);
  CHECK_EQ(HashBuilder<HasherT>().update(Data).final(),
           RawHasher.final());
}

aramesAuthorUnsubmitted

Done

I have cleaned this up significantly as part of the endianness support. It does not look like what you suggest, but I think it is much better than before.
There is now a single ReferenceHashCheck handling all hasher types, with the three big, little, and native endiannesses.

This is the only test that verifies against a hard-coded reference hash. In my opinion it is valuable to ensure the stability of hashes across platforms.
Other tests rely on equality or difference of different hashes.

dexonsmithUnsubmitted

Done

I'm still seeing magic numbers, although this comment seems to be in the wrong place now. See further down in ReferenceHashCheck.

dexonsmithUnsubmitted

Done

Looks like you've adjusted the tests in the meantime, but since I hadn't seen this comment before, just wanted to give some more reasoning:

> In my opinion it is valuable to ensure the stability of hashes across platforms.

I agree that's a useful trait, but the tests for MD5, SHA1, and SHA256 should be doing that, respectively.

Meanwhile, the following important trait was *not* being tested: do HashBuilder<HashT> and HashT give the same result as each other for each of these raw data types? (Since fixed!)

You could have both (the above, plus the magic number test). It would uncover things such as double having a different representation on different platforms, or some platforms being big- or little-endian. Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

aramesAuthorUnsubmitted

Done

I did adjust the tests following your suggestion.

> Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

I agree. We would not want to start seeing the HashBuilder tests failing across different platforms because of reasons that are known (and maybe already correctly handled). So let's correctly separate the responsibilities.
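
To make the endianness concern concrete, here is a small sketch of why the bytes reaching a hasher depend on the host unless a byte order is fixed first. The helper names (`hostIsLittleEndian`, `nativeBytes`, `littleEndianBytes`) are hypothetical and unrelated to LLVM's `support::endian` utilities.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns true when the host stores the least significant byte first.
bool hostIsLittleEndian() {
  const uint16_t Probe = 1;
  uint8_t FirstByte;
  std::memcpy(&FirstByte, &Probe, 1);
  return FirstByte == 1;
}

// Bytes of `Value` exactly as they sit in memory on this host.
std::vector<uint8_t> nativeBytes(uint32_t Value) {
  std::vector<uint8_t> Bytes(sizeof(Value));
  std::memcpy(Bytes.data(), &Value, sizeof(Value));
  return Bytes;
}

// Bytes of `Value` in little-endian order, whatever the host order is.
std::vector<uint8_t> littleEndianBytes(uint32_t Value) {
  std::vector<uint8_t> Bytes(sizeof(Value));
  for (size_t I = 0; I != sizeof(Value); ++I)
    Bytes[I] = static_cast<uint8_t>(Value >> (8 * I));
  return Bytes;
}
```

Hashing `nativeBytes(V)` gives a host-dependent input, while hashing `littleEndianBytes(V)` gives the same input everywhere; a test pinned to a hard-coded digest is only portable in the second case.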

hashRangeWithBuilder<TypeParam>(Vec.begin(), Vec.end()));

}

}

TYPED_TEST(HashBuilderTest, AddRangeElements) {

HashBuilder<TypeParam> HBuilder;

int Values[] = {1, 2, 3};

HBuilder.addRangeElements(llvm::ArrayRef<int>(Values));

dexonsmithUnsubmitted

Done

This is also byte-swapping. I suggest byteSwapAndHashRawData(). Maybe even add "WithoutBuilder"? Or, since this is only used in one function, you could make it a lambda.

aramesAuthorUnsubmitted

Done

Done. I was not familiar with using lambdas with auto for that type of use-case.
Also renamed it to ByteSwapAndHashWithHasher for clarity.

EXPECT_EQ(HBuilder.final(), hashWithBuilder<TypeParam>(1, 2, 3));

}

TYPED_TEST(HashBuilderTest, AddHashableData) {

using HE = TypeParam;

auto ByteSwapAndHashWithHasher = [](auto Data) {

dexonsmithUnsubmitted

Not Done

I (re-)discovered the static method hash when reviewing the other patch, which returns a std::array. I suggest using that instead, and directly returning H::hash(SwappedData). I suggest forwarding the return array directly rather than wrapping in a std::string.
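
For illustration only, the "one-shot static `hash` that returns a `std::array`" shape might look like the sketch below. `XorFoldHasher` and `hashInTwoSteps` are hypothetical and are not the interface of LLVM's MD5, SHA1, or SHA256 classes; the point is only that a fixed-size digest array can be returned and compared directly, without a `std::string` round trip.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hasher with a 4-byte digest: XOR-folds the input bytes.
struct XorFoldHasher {
  std::array<uint8_t, 4> Digest{{0, 0, 0, 0}};
  size_t Next = 0;

  void update(const std::vector<uint8_t> &Data) {
    for (uint8_t Byte : Data) {
      Digest[Next] ^= Byte;
      Next = (Next + 1) % Digest.size();
    }
  }
  std::array<uint8_t, 4> final() const { return Digest; }

  // One-shot convenience: construct, update, and return the digest array.
  static std::array<uint8_t, 4> hash(const std::vector<uint8_t> &Data) {
    XorFoldHasher H;
    H.update(Data);
    return H.final();
  }
};

// Two-step path, returning the digest array as is (no string conversion).
std::array<uint8_t, 4> hashInTwoSteps(const std::vector<uint8_t> &Data) {
  XorFoldHasher H;
  H.update(Data);
  return H.final();
}
```

`std::array` already provides `operator==`, so test assertions can compare digests without converting them.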

aramesAuthorUnsubmitted

Done

I agree here a single call would be nice.
The change would require updating the previous helpers to return decltype(H::hash), which I am reluctant to do.

using H = typename HE::HasherT;

constexpr auto E = HE::Endianness;

H Hasher;

dexonsmithUnsubmitted

Done

I think the name is wrong since the change. Maybe addHashableData?

auto SwappedData = llvm::support::endian::byte_swap(Data, E);

Hasher.update(llvm::makeArrayRef(

dexonsmithUnsubmitted

Done

I suggest dropping the Typed and calling this simply HashBuilderTest.

dexonsmithUnsubmitted

Done

It's not clear to me why you're including volatile ints here. Is there any reason to expect them to have different behaviour?

aramesAuthorUnsubmitted

Done

I had borrowed the basic tests for hash_code, which did include it. I agree they bring no value here. Removed.

reinterpret_cast<const uint8_t *>(&SwappedData), sizeof(Data)));

return static_cast<std::string>(Hasher.final());

};

char C = 'c';

int32_t I = 0x12345678;

uint64_t UI64 = static_cast<uint64_t>(1) << 50;

dexonsmithUnsubmitted

Done

I'm not such a fan of this reference hash / magic value idea. I'd prefer to check each of these data types separately, and rely on comparing against the result of the underlying Hasher (which we can assume is already well-tested elsewhere).

E.g.:

// helper for hashers
template <class H, class T> auto hashRawData(T Data) {
  H Hasher;
  Hasher.update(makeArrayRef(
      reinterpret_cast<const uint8_t *>(&Data), sizeof(Data)));
  return Hasher.final();
}

// then in the test for raw integer data:

int I = 0x12345678;
EXPECT_EQ(HashBuilder<H>().update(I).final(), hashRawData<H>(I));

Could be each in their own test (one per type), or just in separate scopes, don't think it matters much.

Or if you think the magic numbers are better, can you walk me through your logic?
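
A self-contained version of this "compare against the underlying hasher" idea, under stated assumptions: `Fnv1aHasher` is a hypothetical stand-in for MD5/SHA1/SHA256, and `MiniBuilder` is a toy builder that ignores endianness and size prefixes. Neither is the API from this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical hasher standing in for a real one: 64-bit FNV-1a.
struct Fnv1aHasher {
  uint64_t State = 14695981039346656037ULL; // FNV offset basis.
  void update(const uint8_t *Data, size_t Size) {
    for (size_t I = 0; I != Size; ++I) {
      State ^= Data[I];
      State *= 1099511628211ULL; // FNV prime.
    }
  }
  uint64_t final() const { return State; }
};

// Reference path: feed the object representation straight to the hasher.
template <typename H, typename T> uint64_t hashRawData(const T &Data) {
  H Hasher;
  Hasher.update(reinterpret_cast<const uint8_t *>(&Data), sizeof(Data));
  return Hasher.final();
}

// Path under test: a minimal builder that forwards to the same hasher.
template <typename H> struct MiniBuilder {
  H Hasher;
  template <typename T> MiniBuilder &add(const T &Value) {
    Hasher.update(reinterpret_cast<const uint8_t *>(&Value), sizeof(Value));
    return *this;
  }
  uint64_t final() const { return Hasher.final(); }
};
```

The check needs no hard-coded digest: it only asserts that the builder and the raw hasher agree on the same input, leaving digest stability to the hasher's own tests.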

dexonsmithUnsubmitted

Done

Update here looks like what I was asking for, but it's actually not clear at all how these functions are different (I probably suggested a bad name I think, sorry).

I think computeHash can be inlined, avoiding any ambiguity about what it does, if you take my suggestion to add finalHash. But if you keep it, it should probably be renamed to hashWithBuilder to make it clear how the hash is being computed. That'd disambiguate the other one I think?

aramesAuthorUnsubmitted

Done

This ended up with explicit names ByteSwapAndHashWithHasher and hashWithBuilder.

enum TestEnumeration : uint16_t { TE_One = 1, TE_Two = 2 };

TestEnumeration Enum = TE_Two;

EXPECT_EQ(ByteSwapAndHashWithHasher(C), hashWithBuilder<HE>(C));

EXPECT_EQ(ByteSwapAndHashWithHasher(I), hashWithBuilder<HE>(I));

EXPECT_EQ(ByteSwapAndHashWithHasher(UI64), hashWithBuilder<HE>(UI64));

EXPECT_EQ(ByteSwapAndHashWithHasher(Enum), hashWithBuilder<HE>(Enum));

}

struct SimpleStruct {

char C;

int I;

};

template <typename HasherT, llvm::support::endianness Endianness>

void addHash(llvm::HashBuilderImpl<HasherT, Endianness> &HBuilder,

const SimpleStruct &Value) {

HBuilder.add(Value.C);

HBuilder.add(Value.I);

}

struct StructWithoutCopyOrMove {

int I;

StructWithoutCopyOrMove() = default;

StructWithoutCopyOrMove(const StructWithoutCopyOrMove &) = delete;

StructWithoutCopyOrMove &operator=(const StructWithoutCopyOrMove &) = delete;

template <typename HasherT, llvm::support::endianness Endianness>

friend void addHash(llvm::HashBuilderImpl<HasherT, Endianness> &HBuilder,

const StructWithoutCopyOrMove &Value) {

HBuilder.add(Value.I);

}

};

// The struct and associated tests are simplified to avoid failures caused by

// different alignments on different platforms.

struct /* __attribute__((packed)) */ StructWithFastHash {

int I;

// char C;

// If possible, we want to hash both `I` and `C` in a single `update`

// call for performance concerns.

template <typename HasherT, llvm::support::endianness Endianness>

friend void addHash(llvm::HashBuilderImpl<HasherT, Endianness> &HBuilder,

const StructWithFastHash &Value) {

if (Endianness == llvm::support::endian::system_endianness()) {

HBuilder.update(llvm::makeArrayRef(

reinterpret_cast<const uint8_t *>(&Value), sizeof(Value)));

} else {

// Rely on existing `add` methods to handle endianness.

HBuilder.add(Value.I);

// HBuilder.add(Value.C);

}

}

};

struct CustomContainer {

private:

size_t Size;

int Elements[100];

public:

CustomContainer(size_t Size) : Size(Size) {

for (size_t I = 0; I != Size; ++I)

Elements[I] = I;

}

template <typename HasherT, llvm::support::endianness Endianness>

friend void addHash(llvm::HashBuilderImpl<HasherT, Endianness> &HBuilder,

const CustomContainer &Value) {

if (Endianness == llvm::support::endian::system_endianness()) {

HBuilder.update(llvm::makeArrayRef(

reinterpret_cast<const uint8_t *>(&Value.Size),

sizeof(Value.Size) + Value.Size * sizeof(Value.Elements[0])));

} else {

HBuilder.addRange(&Value.Elements[0], &Value.Elements[0] + Value.Size);

}

}

};

TYPED_TEST(HashBuilderTest, HashUserDefinedStruct) {

using HE = TypeParam;

EXPECT_EQ(hashWithBuilder<HE>(SimpleStruct{'c', 123}),

hashWithBuilder<HE>('c', 123));

EXPECT_EQ(hashWithBuilder<HE>(StructWithoutCopyOrMove{1}),

hashWithBuilder<HE>(1));

EXPECT_EQ(hashWithBuilder<HE>(StructWithFastHash{123}),

hashWithBuilder<HE>(123));

EXPECT_EQ(hashWithBuilder<HE>(CustomContainer(3)),

hashWithBuilder<HE>(static_cast<size_t>(3), 0, 1, 2));

}

TYPED_TEST(HashBuilderTest, HashArrayRefHashableDataTypes) {

using HE = TypeParam;

int Values[] = {1, 20, 0x12345678};

llvm::ArrayRef<int> Array(Values);

EXPECT_NE(hashWithBuilder<HE>(Array), hashWithBuilder<HE>(1, 20, 0x12345678));

EXPECT_EQ(hashWithBuilder<HE>(Array),

hashRangeWithBuilder<HE>(Array.begin(), Array.end()));

EXPECT_EQ(

hashWithBuilder<HE>(Array),

hashRangeWithBuilder<HE>(Array.data(), Array.data() + Array.size()));

}

TYPED_TEST(HashBuilderTest, HashArrayRef) {

using HE = TypeParam;

int Values[] = {1, 2, 3};

llvm::ArrayRef<int> Array123(&Values[0], 3);

llvm::ArrayRef<int> Array12(&Values[0], 2);

llvm::ArrayRef<int> Array1(&Values[0], 1);

llvm::ArrayRef<int> Array23(&Values[1], 2);

llvm::ArrayRef<int> Array3(&Values[2], 1);

llvm::ArrayRef<int> ArrayEmpty(&Values[0], static_cast<size_t>(0));

auto Hash123andEmpty = hashWithBuilder<HE>(Array123, ArrayEmpty);

auto Hash12And3 = hashWithBuilder<HE>(Array12, Array3);

auto Hash1And23 = hashWithBuilder<HE>(Array1, Array23);

auto HashEmptyAnd123 = hashWithBuilder<HE>(ArrayEmpty, Array123);

EXPECT_NE(Hash123andEmpty, Hash12And3);

dexonsmithUnsubmitted

Done

I suggest comparing these both (or at least one of them) against:

update(1);
update("string");

so that you're confirming the hash is updated correctly, not just the same way as each other (unless you tested that elsewhere and I missed it?)

EXPECT_NE(Hash123andEmpty, Hash1And23);

EXPECT_NE(Hash123andEmpty, HashEmptyAnd123);

EXPECT_NE(Hash12And3, Hash1And23);

EXPECT_NE(Hash12And3, HashEmptyAnd123);

EXPECT_NE(Hash1And23, HashEmptyAnd123);

}

TYPED_TEST(HashBuilderTest, HashArrayRefNonHashableDataTypes) {

using HE = TypeParam;

SimpleStruct Values[] = {{'a', 100}, {'b', 200}};

llvm::ArrayRef<SimpleStruct> Array(Values);

EXPECT_NE(

hashWithBuilder<HE>(Array),

hashWithBuilder<HE>(SimpleStruct{'a', 100}, SimpleStruct{'b', 200}));

}

TYPED_TEST(HashBuilderTest, HashStringRef) {

using HE = TypeParam;

llvm::StringRef SEmpty("");

llvm::StringRef S1("1");

llvm::StringRef S12("12");

llvm::StringRef S123("123");

llvm::StringRef S23("23");

llvm::StringRef S3("3");

auto Hash123andEmpty = hashWithBuilder<HE>(S123, SEmpty);

auto Hash12And3 = hashWithBuilder<HE>(S12, S3);

auto Hash1And23 = hashWithBuilder<HE>(S1, S23);

auto HashEmptyAnd123 = hashWithBuilder<HE>(SEmpty, S123);

EXPECT_NE(Hash123andEmpty, Hash12And3);

EXPECT_NE(Hash123andEmpty, Hash1And23);

EXPECT_NE(Hash123andEmpty, HashEmptyAnd123);

EXPECT_NE(Hash12And3, Hash1And23);

EXPECT_NE(Hash12And3, HashEmptyAnd123);

EXPECT_NE(Hash1And23, HashEmptyAnd123);

}

TYPED_TEST(HashBuilderTest, HashStdString) {

using HE = TypeParam;

EXPECT_EQ(hashWithBuilder<HE>(std::string("123")),

hashWithBuilder<HE>(llvm::StringRef("123")));

}

TYPED_TEST(HashBuilderTest, HashStdPair) {

using HE = TypeParam;

EXPECT_EQ(hashWithBuilder<HE>(std::make_pair(1, "string")),

hashWithBuilder<HE>(1, "string"));

std::pair<StructWithoutCopyOrMove, std::string> Pair;

Pair.first.I = 1;

Pair.second = "string";

EXPECT_EQ(hashWithBuilder<HE>(Pair), hashWithBuilder<HE>(1, "string"));

}

TYPED_TEST(HashBuilderTest, HashStdTuple) {

using HE = TypeParam;

EXPECT_EQ(hashWithBuilder<HE>(std::make_tuple(1)), hashWithBuilder<HE>(1));

EXPECT_EQ(hashWithBuilder<HE>(std::make_tuple(2ULL)),

hashWithBuilder<HE>(2ULL));

EXPECT_EQ(hashWithBuilder<HE>(std::make_tuple("three")),

hashWithBuilder<HE>("three"));

EXPECT_EQ(hashWithBuilder<HE>(std::make_tuple(1, 2ULL)),

hashWithBuilder<HE>(1, 2ULL));

EXPECT_EQ(hashWithBuilder<HE>(std::make_tuple(1, 2ULL, "three")),

hashWithBuilder<HE>(1, 2ULL, "three"));

std::tuple<StructWithoutCopyOrMove, std::string> Tuple;

std::get<0>(Tuple).I = 1;

std::get<1>(Tuple) = "two";

EXPECT_EQ(hashWithBuilder<HE>(Tuple), hashWithBuilder<HE>(1, "two"));

}

TYPED_TEST(HashBuilderTest, HashRangeWithForwardIterator) {

using HE = TypeParam;

std::list<int> List;

List.push_back(1);

List.push_back(2);

List.push_back(3);

EXPECT_NE(hashRangeWithBuilder<HE>(List), hashWithBuilder<HE>(1, 2, 3));

}

TEST(CustomHasher, CustomHasher) {

struct SumHash {

explicit SumHash(uint8_t Seed1, uint8_t Seed2) : Hash(Seed1 + Seed2) {}

void update(llvm::ArrayRef<uint8_t> Data) {

for (uint8_t C : Data)

Hash += C;

}

uint8_t Hash;

};

{

llvm::HashBuilder<SumHash, llvm::support::endianness::little> HBuilder(0,

1);

EXPECT_EQ(HBuilder.add(0x02, 0x03, 0x400).getHasher().Hash, 0xa);

}

{

llvm::HashBuilder<SumHash, llvm::support::endianness::little> HBuilder(2,

3);

EXPECT_EQ(HBuilder.add("ab", 'c').getHasher().Hash,

static_cast<uint8_t>(/*seeds*/ 2 + 3 + /*range size*/ 2 +

/*characters*/ 'a' + 'b' + 'c'));

}

}

Revision Contents

Diff 368908

llvm/include/llvm/Support/HashBuilder.h

llvm/unittests/Support/CMakeLists.txt

llvm/unittests/Support/HashBuilderTest.cpp
