This is an archive of the discontinued LLVM Phabricator instance.

[Support]: Introduce the `HashBuilder` interface.
ClosedPublic

Authored by arames on Jul 27 2021, 12:59 PM.

Details

Summary

The HashBuilder interface allows conveniently building hashes of various data
types, without relying on the underlying hasher type to know about hashed data
types.

Diff Detail

Event Timeline

arames created this revision.Jul 27 2021, 12:59 PM
arames requested review of this revision.Jul 27 2021, 12:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 27 2021, 12:59 PM

This is a follow-up to discussion about https://reviews.llvm.org/D102943.
This is a proposal to expose the HashBuilder helper that will then be used for the modules hashing.

dexonsmith requested changes to this revision.Jul 29 2021, 1:05 PM

Thanks, this looks like a great start to me. Lots of comments inline (many of them nitpicks).

Expose a raw pointer update interface for hashes.
The new method update(const uint8_t *Ptr, size_t Size) is an alternative to
the existing update(ArrayRef<uint8_t> Data). It will allow the incoming
HashBuilder interface to not depend on ArrayRef.

If we add pointer+size APIs to the hashers, I think that should be in a separate prep patch...

... but IMO, ArrayRef is a safe/clean encapsulation of pointer+size for HashBuilder to use. Can you explain why you want to avoid it?

llvm/include/llvm/ADT/ArrayRef.h
17 ↗(On Diff #362146)

I think HashBuilder will probably be needed in fewer places than ArrayRef. It might be better to invert the includes (and move the new functions to HashBuilder.h).

575–602 ↗(On Diff #362146)

I suggest moving these to member functions of HashBuilder.

llvm/include/llvm/ADT/StringRef.h
15 ↗(On Diff #362146)

Even more so than ArrayRef.h, we try really hard to avoid adding includes to StringRef.h since it's so widely included. I suggest inverting the include relationship.

966 ↗(On Diff #362146)

(I suggest an overload in HashBuilder rather than a free function)

llvm/include/llvm/Support/HashBuilder.h
22

You should be able to remove this once you're including StringRef.h

140

StringRef would be more generally useful; I suggest that instead of std::string. (I don't think we usually bother handling other character types in LLVM... I suggest leaving it out, unless/until you have a specific use case that needs it?)

144–147

Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a pair.

148–153

Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a tuple.

167

Neat, I hadn't seen this pattern before. Can the variable name be skipped?

(void)std::tuple<const Ts &...>{(update(Args), Args)...};

If not, please add (void)Unused; to suppress diagnostics.

170–173

I suggest adding a single-element overload that takes a generic range, calling adl_begin() should also be a single-element overload that calls llvm::adl_begin() and llvm::adl_end()` to extract iterators.

Also, this isn't going to work for any old input iterator. I suggest:

  • Name the parameter "ForwardIteratorT"
  • Add a test that confirms std::list iterators will work.
  • Add an enable_if to get a nice compile error for non-forward iterators.

Another option is to use random-access iterators and leave it for a future patch to handle forward iterators.

177

I suggest taking / using an ArrayRef<uint8_t> here. It'd be a nice convenience to add an overload for updateBytes(StringRef) as well -- something I've wanted a few times when using the Hasher interfaces with MemoryBuffer::getBuffer() -- which can just cast over to ArrayRef<uint8_t>`.

199

I think this will have to be a std::distance call if you handle arbitrary forward iterators (not sure that's necessary). The template parameters could use an update in either case.

llvm/unittests/Support/HashBuilderTest.cpp
52

Lots of "invalid case style" warnings coming from clang-tidy in this file. Please follow the style conventions at https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (e.g., use C for this variable instead of c), if nothing else to make the patch easier to read.

76

Rather than duplicating the tests, I suggest using http://google.github.io/googletest/advanced.html#typed-tests to iterate through MD5, SHA1, and SHA256.

226–229

I suggest comparing these both (or at least one of them) against:

update(1);
update("string");

so that you're confirming the hash is updated correctly, not just the same way as each other (unless you tested that elsewhere and I missed it?)

This revision now requires changes to proceed.Jul 29 2021, 1:05 PM
arames updated this revision to Diff 363876.Aug 3 2021, 2:00 PM
arames marked 14 inline comments as done.

Address review comments:
Main changes:

  • Switch dependency order: HashBuilder now depends on ArrayRef and StringRef
  • Clean up updateRange and updateRangeImpl
  • Fix case style
arames added a comment.Aug 3 2021, 2:00 PM

Thanks, this looks like a great start to me. Lots of comments inline (many of them nitpicks).

Expose a raw pointer update interface for hashes.
The new method update(const uint8_t *Ptr, size_t Size) is an alternative to
the existing update(ArrayRef<uint8_t> Data). It will allow the incoming
HashBuilder interface to not depend on ArrayRef.

If we add pointer+size APIs to the hashers, I think that should be in a separate prep patch...

Sure. That was already separated in a different commit.

... but IMO, ArrayRef is a safe/clean encapsulation of pointer+size for HashBuilder to use. Can you explain why you want to avoid it?

I agree ArrayRef is a safe and clean encapsulation. It seemed to me it was cleaner to not have the HashBuilder interface depend on any particular type, in a similar way to hash_code.
However your arguments about ArrayRef and StringRef being much more commonly used are on point, and si agree switching the dependency order and having ArrayRef and StringRef specialized in HashBuilder.h will be better.

llvm/include/llvm/ADT/ArrayRef.h
17 ↗(On Diff #362146)

Done. (See also global reply.)

575–602 ↗(On Diff #362146)

Done. (See also global reply.)

llvm/include/llvm/ADT/StringRef.h
15 ↗(On Diff #362146)

Done. (See also global reply.)

966 ↗(On Diff #362146)

Done for StringRef and ArrayRef, since we now depend on both.

llvm/include/llvm/Support/HashBuilder.h
140

Nah; this was to have the generic form. Leaving it out.

144–147

Absolutely. Fixed, with a test.

148–153

Absolutely. Fixed with a test.

167

It can be skipped indeed.

170–173

The update should fit what you want, but please take a second look.

Added an overload for a generic range using adl_begin() and adl_end().
Cleaned-up updateRange and updateRangeImpl, with a test for std::list. The check for forward iterator does not use enable_if, but should be good enough.

Interestingly, with all this, implementations of update for ArrayRef and StringRef can simply forward to updateRange.

177

Sure thing. I can imagine the three overloads to be helpful in different contexts.

dexonsmith requested changes to this revision.Aug 3 2021, 3:08 PM

Updates look great; a few more comments inline.

Since the ultimate goal is to make it easy for higher-level logic to use a stable hash across different architectures, I realize that endianness could be important too for the hashable-data overloads. Have you considered what to do there?

One idea: add an endianness template parameter to HashBuilder:

template <class HasherT, support::endianness Endianness = support::native>
class HashBuilder;

then change the update overloads for hashable data to hash the result of support::endian::byte_swap(V, Endianness). In the common case (native) the copy/etc should get optimized out. WDYT?

llvm/include/llvm/Support/HashBuilder.h
87–95

Please use makeArrayRef() to avoid splitting the pointer from the size.

Also, I wonder if this should take into account endianness.

196

I think you can drop the name here too.

206–208

I think it's important to have updateRange overloads for ArrayRef (at least for uint8_t) and StringRef that directly call updateBytes rather than iterating.

221–225

I don't think we should add this overload. The rare call site can call makeArrayRef itself, encouraging it to push ArrayRef further up the stack.

llvm/unittests/Support/HashBuilderTest.cpp
27–37

Can MD5 be updated in a prep patch to have an overload of final like this so this bridge code isn't needed?

39–45

Seeing this boilerplate makes me wonder if the builder API could be improved with something like:

  • Add an Optional<HasherT> data member and a default constructor for HashBuilder that constructs it.
  • Make update* return HashBuilder&.
  • Add a HashBuilder::final() that returns HasherT::final() (after adding it to MD5 as above).

Then the above logic becomes a one-liner:

HashBuilder<HasherT>().update(Args...).final()

(maybe still worth encapsulating in a function here for testing, but maybe not; not sure)

82–83

It's not ideal to need to spell this out for each HasherT, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:

  • Unify the HasherT interfaces as necessary in prep commits to allow using them generically (maybe my other suggestions above are sufficient for that?).
  • Test each of these types in a standalone test in a unified HashBuilderTest below, dropping BasicTest/etc.
  • Add a helper function to the test fixture below to avoid boilerplate, something like:
template <class T> void checkHashableData(T Data) {
  HasherT RawHasher;
  auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...);
  RawHasher.update(Bytes);
  CHECK_EQ(HashBuilder<HasherT>().update(Data).final(),
           RawHasher.final());
}
103

I suggest dropping the Typed and calling this simply HashBuilderTest.

This revision now requires changes to proceed.Aug 3 2021, 3:08 PM
arames updated this revision to Diff 365283.Aug 9 2021, 2:00 PM
arames marked 7 inline comments as done.

Address review comments.

The main change is support and tests for endianness.

arames updated this revision to Diff 365532.Aug 10 2021, 10:26 AM

Run clang-format.

arames updated this revision to Diff 366162.Aug 12 2021, 6:17 PM

Fix forgotten TODO.

dexonsmith added inline comments.Aug 12 2021, 6:49 PM
llvm/unittests/Support/HashBuilderTest.cpp
82–83

I'm still seeing magic numbers, although this comment seems to be in the wrong place now. See further down in ReferenceHashCheck.

102–103

It's not clear to me why you're including volatile ints here. Is there any reason to expect them to have different behaviour?

109–110

I'm not such a fan of this reference hash / magic value idea. I'd prefer to check each of these data types separately, and rely on comparing against the result of the underlying Hasher (which we can assume is already well-tested elsewhere).

E.g.:

// helper for hashers
template <class H, class T> auto hashRawData(T Data) {
  H Hasher;
  Hasher.update(makeRawDataArrayRef(makeArrayRef(
      reinterpret_cast<uint8_t>(&Data), sizeof(Data)));
  return H.final();
};

// then in the test for raw integer data:

int I = 0x12345678;
EXPECT_EQ(HashBuilder<H>.update(I).final(), hashRawData<H>(I));

Could be each in their own test (one per type), or just in separate scopes, don't think it matters much.

Or if you think the magic numbers are better, can you walk me through your logic?

arames marked 2 inline comments as done.Aug 16 2021, 11:30 AM
arames added inline comments.
llvm/include/llvm/Support/HashBuilder.h
87–95

Updated to use makeArrayRef().

Great point. I didn't pay attention to it, but taking it into account will significantly improve the value of this class.
Supporting endianness looks pretty straight forward thanks to Support/Endian.h. Please take a look.
A couple questions:

  1. I kept the default as native instead of little to avoid penalizing big-endian platforms. On the other hand defaulting to a stable hash may be safer. What do you think ?
  2. To add support for user types, we end up with
template <typename HasherT, support::endianness Endianness>
void updateHash(HashBuilder<HasherT, Endianness> &HBuilder, const UserType &Value)

I think that is ok. Do you prefer "reserving" the updateHash name and going with

template <typename HashBuilderT>
void updateHash(HashBuilderT &HBuilder, const UserType &Value)

?

206–208

Both ArrayRef and StringRef do end up calling the version of updateRangeImpl with updateBytes, due to how their begin and end are declared.
Do you mean to do it explicitly, in case the ArrayRef implementation changes ?

llvm/unittests/Support/HashBuilderTest.cpp
27–37

https://reviews.llvm.org/D107781

This PR will have to be rebased once the prep is reviewed and lands.

39–45

Done. That makes it more usable indeed.
I kept the helper to keep the tests concise.

82–83

I have cleaned this up significantly as part of the endianness support. It does not look like what you suggest, but I think it is much better than before.
There is now a single ReferenceHashCheck handling all hasher types, with the three big, little, and native endiannesses.

This is the only test that verifies against a hard-coded reference hash. In my opinion it is valuable to ensure the stability of hashes across platforms.
Other tests rely on equality or difference of different hashes.

arames added inline comments.Aug 16 2021, 11:30 AM
llvm/unittests/Support/HashBuilderTest.cpp
102–103

I had borrowed the basic tests for hash_code, which did include it. I agree they bring no value here. Removed.

arames updated this revision to Diff 366685.Aug 16 2021, 11:31 AM
arames marked an inline comment as done.

Update test to compare against the hasher.

(Sorry it looks like I missed publishing a few earlier comments.)

I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.

Two high-level points, then more comments inline.

Firstly, I'm a bit concerned about the semantics of update(ArrayRef) (and update(StringRef) by proxy) due to inconsistency with HasherT.

  • HasherT::update(ArrayRef<uint8_t>) hashes the raw bytes pointed to by the array.
  • HashBuilder<HasherT>::update(ArrayRef) (including for uint8_t) hashes the size first, then the raw bytes.

This inconsistency could be confusing and error-prone.

I wonder if we should (mostly) avoid the word update in HashBuilder and use add instead. This also fixes the updateRange problem I point out inline (addRange works just fine without a preposition). We could leave update() around *just* for ArrayRef<uint8_t> (and maybe StringRef), forwarding to addElements/addRangeElements (see inline). The result would be:

  • HashBuilder<HasherT>::update does exactly the same thing as HasherT::update (also working with StringRef, for convenience)
  • HashBuilder<HasherT>::add (and add*) handles higher-level semantics, including primitives, endianness, ranges, etc.

Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.

  • HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.
  • HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.
  • HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:
template <typename HasherT, support::endianness Endianness>
class HashBuilderImpl {
  static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
  // etc.
};

template <typename HasherT, support::Endianness Endianness = native,
                            support::Endianness EffectiveEndianness =
                                Endianness == native
                                   ? support::endian::system_endianness()
                                   : Endianness>
class HashBuilder
    : HashBuilderImpl<HasherT, EffectiveEndianness> {
  using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;

public:
  HashBuilder() = default;
  HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
};
llvm/include/llvm/Support/HashBuilder.h
81

I wonder if std::is_floating_point should be excluded.

  • Float values might generally a bit iffy to be included in a stable hash, because of all the floating point modes that can affect their values.
  • long double has a platform-dependent mantissa.

This would disable HashBuilder::add, forcing clients to think about what they actually want to do with it. WDYT?

84–85

I think these should be private, and Hasher can have a getHasher() accessor. This'll make it easier to make HashBuilder move/copy-friendly in the future if for some reason that's useful.

87–95

Just seeing this question.

  • I don't think it's necessary to default to a stable hash. Clients can do that when it's useful. But if you're uncomfortable with unstable-by-default, another option is not to have a default, and then the client will need to spell out support::native, making it more clear that the hash is unstable. (It's easier to add a default later, but changing the default might not be easy.)
  • I think the explicit parameter for Endianness makes it clear to authors they should be thinking about Endianness.
  • There should be a name change to align with add. addToHash?
88–91

Can this be simplified?

HashBuilder() : OptionalHasher(in_place), Hasher(*OptionalHasher) {}
97–99

I made some comments on the patch this one depends on (https://reviews.llvm.org/D107781 -- FYI, I linked them using the add parent/child revision button above).

I suggest the following to better align with the hashers themselves:

  • Remove the template / enable_if. Require hashers to provide this.
  • Return StringRef. Expect hashers to do the same.
  • Add a result() method also returning StringRef, matching the one from hashers.

I also think an analogue for the static hash() that the hashers have would be useful, at least in the tests. Maybe:

using HashArrayT = decltype(HasherT::hash(ArrayRef<uint8_t>()));
static HashArrayT toArrayImpl(StringRef HashString) {
  assert(HashString.size() == sizeof(HashArrayT));
  HashArrayT HashArray;
  llvm::copy(HashString, HashArray.begin());
  return HashArray;
}

HashArrayT finalHash() { return toArrayImpl(Hasher.final()); }
HashArrayT resultHash() { return toArrayImpl(Hasher.result()); }

I *don't* think there should be a HashBuilder::hash() function (static or otherwise). Forwarding to HashBuilder::add() would be confusing / error-prone, since then HashBuilder<HasherT>::hash(ArrayRef<uint8_t>) would do something different from HasherT::hash(). Forwarding to HasherT::hash would also be confusing, and doesn't really add any value since clients can already do that themselves.

101–108

I wonder if the body of this should be exposed as a general helper for clients, using a weaker std::enable_if check. I.e.:

  • Keep the signature and std::enable_if check for this overload of add.
  • Move the body to a new function, called something like addRawBytes, which takes a const T& and has a weaker std::enable_if check (maybe std::is_primitive? or what does endian::byte_swap support?).
  • Change add to call the new function.

WDYT?

206–208

Both ArrayRef and StringRef do end up calling the version of updateRangeImpl with updateBytes, due to how their begin and end are declared.
Do you mean to do it explicitly, in case the ArrayRef implementation changes ?

Just seeing this question. Two reasons to do it explicitly:

  • avoid compile-time overhead for this common case
  • make it obvious what happens for anyone reading the code (as a common case, people are likely to care that it's doing the right thing)
276–278

I think there's something off with the naming. updateRange sounds like it is mutating the range itself, rather than updating the hash. Could add a preposition, or renaming to add.

285

Similar naming problem here (I'm pretty sure it's my fault), but worse, since I now realize there's nothing inherent about the word "bytes" that implies the size of the range is skipped. Someone could have a container of bytes, and expect that this encoded the container.

Stepping back, I feel like this function -- skipping the size of a range -- could be more generally useful to clients. I wonder if it should be exposed for generic ranges as well, keeping specializations for ArrayRef<uint8_t> and StringRef. Name could be addElements or addRangeElements or something?

315–316

This can call addRangeElements().

320–324

This overload can be dropped, but addRangeElements would want something similar.

llvm/unittests/Support/HashBuilderTest.cpp
39–45

If you take my suggestion to add finalHash, computeHash can be further simplified to:

return HashBuilder<HasherTAndEndianness>().add(Args...).finalHash();

removing the static cast.

IMO, inlining it would improve clarity, but renaming it to hashWithBuilder would also help.

54–57

(Same comments as computeHash: drop std::string, and simplify or rename)

82–83

Looks like you've adjusted the tests in the meantime, but since I hadn't seen this comment before, just wanted to give some more reasoning:

In my opinion it is valuable to ensure the stability of hashes across platforms.

I agree that's a useful trait, but the tests for MD5, SHA1, and SHA256 should be doing that, respectively.

Meanwhile, the following important trait was *not* being tested: do HashBuilder<HashT> and HashT give the same result as each other for each of these raw data types? (Since fixed!)

You could have both (the above, plus the magic number test). It would uncover things such as double having a different representation on different platforms, or some platforms being big- or little-endian. Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

91

This is also byte-swapping. I suggest byteSwapAndHashRawData(). Maybe even add "WithoutBuilder"? Or, since this is only used in one function, you could make it a lambda.

96–98

I (re-)discovered the static method hash when reviewing the other patch, which returns a std::array. I suggest using that instead, and directly returning H::hash(SwappedData). I suggest forwarding the return array directly rather than wrapping in a std::string.

101

I think the name is wrong since the change. Maybe addHashableData?

109–110

Update here looks like what I was asking for, but it's actually not clear at all how these functions are different (I probably suggested a bad name I think, sorry).

I think computeHash can be inlined, avoiding any ambiguity about what it does, if you take my suggestion to add finalHash. But if you keep it, it should probably be renamed to hashWithBuilder to make it clear how the hash is being computed. That'd disambiguate the other one I think?

dexonsmith requested changes to this revision.Aug 16 2021, 4:34 PM

(forgot to mark "request changes" in last comment)

This revision now requires changes to proceed.Aug 16 2021, 4:34 PM
arames marked 18 inline comments as done.Aug 23 2021, 10:44 AM

Thanks - again - for the comments. Please take a look at the update.

The major changes are around:

  • Renaming update* to add*
  • Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.

However I am pushing back on a number of suggestions that would impose requirement on the hasher type. I think there is a lot of value in keeping the interface trivial to use for user-defined/other hasher types.
That includes the suggested finalHash, and supporting ctors with arguments, and using enable_if to conditionally forward methods like final() and result(), so that we can rely on the single HasherT::update(ArrayRef<uint8_t>) method. I think that is a fair implementation cost for the value it brings.
I have added a test for a custom minimal hasher.

Let me know what you think. I'm always open to discuss and update.

Cheers!

I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.

Two high-level points, then more comments inline.

Firstly, I'm a bit concerned about the semantics of update(ArrayRef) (and update(StringRef) by proxy) due to inconsistency with HasherT.

  • HasherT::update(ArrayRef<uint8_t>) hashes the raw bytes pointed to by the array.
  • HashBuilder<HasherT>::update(ArrayRef) (including for uint8_t) hashes the size first, then the raw bytes.

This inconsistency could be confusing and error-prone.

I wonder if we should (mostly) avoid the word update in HashBuilder and use add instead. This also fixes the updateRange problem I point out inline (addRange works just fine without a preposition). We could leave update() around *just* for ArrayRef<uint8_t> (and maybe StringRef), forwarding to addElements/addRangeElements (see inline). The result would be:

  • HashBuilder<HasherT>::update does exactly the same thing as HasherT::update (also working with StringRef, for convenience)
  • HashBuilder<HasherT>::add (and add*) handles higher-level semantics, including primitives, endianness, ranges, etc.

I did not consider the issue. But I like the proposal, as it removes potential confusions.

Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.

  • HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.
  • HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.
  • HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:

Done.
Doing so requires templated CRTP, as we need to return the reference to our derived class and keep track of HasherT/endianness. It looks slightly convoluted at the template declaration site, but is neat otherwise otherwise.

template <typename HasherT, support::endianness Endianness>
class HashBuilderImpl {
  static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
  // etc.
};

template <typename HasherT, support::Endianness Endianness = native,
                            support::Endianness EffectiveEndianness =
                                Endianness == native
                                   ? support::endian::system_endianness()
                                   : Endianness>
class HashBuilder
    : HashBuilderImpl<HasherT, EffectiveEndianness> {
  using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;

public:
  HashBuilder() = default;
  HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
};
llvm/include/llvm/Support/HashBuilder.h
81

That sounds reasonable to me.
Forcing the hash would still be easy via other scalar types.

87–95

I'll drop the default value for Endianness. The template is simple enough for the rare use-cases, and simple enough to wrap if the lack of default is ever a problem for a user.

88–91

See my top-level comment.
I think we should keep support for ctors with arguments (e.g. for a seed).

97–99

Added a forward to result().

Per my top-level reply, I think we should keep the final() and result() forwards conditional on their existence in HasherT, to keep the use of this interface with a custom hasher type trivial.

I am not convinced by the finalHash() and resultHash() helpers should be members of HashBuilder. I may be missing context on how it is used, so feel free to push if you think the use-case is really valuable.
I would assume that the return type of HasherT::final() is appropriate to represent the hash. It seems bothersome to have to provide these helpers in the interface, making additional assumptions (or more SFINAE code) about the hasher type.

Looking at other related comments, I think a lot of this discussion stems from the fact that MD5, SHA1, and SHA256 final() methods return a reference (StringRef).
It seems to me this discussion would be simplified if there was a method like finalByValue().

My suggestion would be to keep relying on the return type for final(), and have "type converting" wrappers implemented outside of HasherT or HashBuilder for now.
If you like the finalByValue() idea, I could do it in a successor patch.

On this topic, I find the design of the ::hash functions join on the issue. Being static, they return a result by value. Though both strings and array<uint8_t> are of course pretty close, I find std::array is a bit surprising choice when final() and result() return StringRef. So the question will be, what should be the return type for finalByValue() for these hashes. std::array like hash(), or a SmallString maybe.

101–108

Sounds good.

Done, with enablement automatically whenever the type is handled by support::endian::byte_swap().

206–208

Done, with comments.

285

I like it. Done.

315–316

This has become addRangeElementsImpl. Now, addRange simply does add(size); addRangeElements(...);.

320–324

This has become addRangeElementsImpl.

llvm/unittests/Support/HashBuilderTest.cpp
39–45

Per my other comment, I am not convinced by the finalHash() method.
Renamed to hashWithBuilder.

54–57

Renamed to hashRangeWithBuilder.

82–83

I did adjust the tests following your suggestion.

Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

I agree. We would not want to start seeing the HashBuilder tests failing across different platforms because of reasons that are known (and maybe already correctly handled). So let's correctly separate the responsibilities.

91

Done. I was not familiar with using lambdas with auto for that type of use-case.
Also renamed it to ByteSwapAndHashWithHasher for clarity.

96–98

I agree here a single call would be nice.
The change would require the update of previous helpers to return decltype(H::hash), that I am reluctant to do.

109–110

This ended up with explicit names ByteSwapAndHashWithHasher and hashWithBuilder.

arames updated this revision to Diff 368141.Aug 23 2021, 10:46 AM
arames marked 12 inline comments as done.

Address review comments.

The major changes are around:

  • Renaming update* to add*
  • Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.
dexonsmith accepted this revision.Aug 23 2021, 12:40 PM

Thanks for the working through this! LGTM if you remove the CRTP (see below, I don't think it's needed). See also a few more nits I found to pick.

llvm/include/llvm/Support/HashBuilder.h
31–36

Using CRTP blocks the compile-time speedup / code size wins. Can we skip HashBuilder/Endianness/EffectiveEndianness/Derived?

46–49

I suggest dropping the return statement, for void update(ArrayRef), a simple passthrough to HasherT::update.

  • No need for the CRTP (faster compile-time / avoid code size bloat by factoring out a common instantiation for the base class)
  • The same functionality is available as HashBuilder& HasherBuilder::addRangeElements, so there's no loss of convenience
56

(this would also return void)

61–67

I don't think we need enable_if or return type deduction here. It seems like overkill to make "does HashBuilder::final work?" SFINAE-detectible.

69–77

Same as final.

92

I think this comment should explain that support::endianness::native is not supported here but that it's handled by HashBuilder. Basically, explain the static assertion in text.

101–103

Using CRTP in HashBuilderImpl also increases compile-time / code size since it blocks sharing code between "native" and whatever the system endianness is. I suggest dropping CRTP and returning HashBuilderImpl& instead of HashBuilder& from add*. (If this needed CRTP, which I don't think it does, I'd suggest using an opaque DerivedT to avoid passing through details like EffectiveEndianness.)

411–425

Given that add* will return HashBuilderImpl&, might be reasonable to reduce this to a typedef:

template <class HasherT, support::endianness Endianness>
using HashBuilder =
    HashBuilderImpl<HasherT,
                    (Endianness == support::endianness::native
                         ? support::endian::system_endianness()
                         : Endianness)>;

Either way seems fine though.

llvm/unittests/Support/HashBuilderTest.cpp
47–51

I'd skip auto type deduction unless the type is hard to spell. Also you can use StringRef::str here.

54–57

(same as above)

This revision is now accepted and ready to land.Aug 23 2021, 12:40 PM
arames updated this revision to Diff 368459.Aug 24 2021, 2:04 PM
arames marked 8 inline comments as done.

Main changes:

  • Remove CRTP
  • (Attempt to) fix Windows build
  • Fix nits
llvm/include/llvm/Support/HashBuilder.h
61–67

You may want to double-check my earlier comments.
I think there is a strong value in allowing trivial support for user-defined (or simply other) hashes. Here, that means:

  • not requiring support for HasherT::final()
  • not assuming the return type
101–103

Removed CRTP.

arames updated this revision to Diff 368471.Aug 24 2021, 2:33 PM

Apply clang-format.

arames updated this revision to Diff 368472.Aug 24 2021, 2:41 PM

One more attempt to fix the Windows build.

arames updated this revision to Diff 368482.Aug 24 2021, 3:26 PM
  • Explicitly disable hashable types for the addHash overload (for the Windows build).
  • Introduce checks to confirm debian x64 test failures (to be removed).
dexonsmith added inline comments.Aug 24 2021, 3:35 PM
llvm/include/llvm/Support/HashBuilder.h
61–67

On not requiring HasherT::final, my suggested edit doesn't change that (unless I'm missing something?). The call to HasherT::final is template-dependent. Both before and after the edit, there will be no compile errors if there are no calls to HashBuilder<HasherT>::final, and a compile error if there is such a call. The only semantic difference is whether HashBuilder::final is SFINAE-detectable (compile error because there is no overload for HashBuilder::final vs. compile error because the overload for HashBuilder::final can't be instantiated), which doesn't seem important.

On the return type, I don't see the benefit of allowing flexibility here. I also don't see a tie-in to the argument about trivial support. And there is a cost: generic code that depends on HashBuilder::final will have to reason about other arbitrary return types. It's simpler for clients if this just returns StringRef.

If you still think there's strong value in either of those (which both seem unrelated to allowing use of trivial HasherT types), can you clarify what it is?

arames added inline comments.Aug 24 2021, 3:56 PM
llvm/include/llvm/Support/HashBuilder.h
61–67

Both done. (will be updated in the next arc diff.)

I misunderstood your earlier comment, thinking you suggested to still require HasherT::final(). And I didn't think of simply relying on the template for the purpose.
I felt specializing the interface to follow particular hasher types might not be very wise. I could imagine a hash wanted to return say uint64_t. But since this is hypothetical, I buy the argument of just keeping things simple for now.

arames updated this revision to Diff 368518.Aug 24 2021, 6:08 PM

Ran through the address sanitizer and fixed stack-use-after-scope issues with ArrayRef.

arames updated this revision to Diff 368720.Aug 25 2021, 2:00 PM

Test different variadic impl for Windows.

arames updated this revision to Diff 368748.Aug 25 2021, 3:10 PM

Propagate variadic expansion workaround to the tuple overload.

arames edited the summary of this revision. (Show Details)Aug 25 2021, 4:14 PM
arames updated this revision to Diff 368788.Aug 25 2021, 6:20 PM

Remove one of the workarounds for Windows.

arames updated this revision to Diff 368802.Aug 25 2021, 9:28 PM

Finalize workaround for Windows.

As far as I can see in the C++ spec section 8.5.4 guarantees evaluation order of
the initializer-list in a braced-init-list. But this not seem to hold with MSVC
19.28.29914.0. So simply use a recursive template instead.

This revision was automatically updated to reflect the committed changes.