Page MenuHomePhabricator

[Support]: Introduce the `HashBuilder` interface.
ClosedPublic

Authored by arames on Jul 27 2021, 12:59 PM.

Details

Summary

The HashBuilder interface allows conveniently building hashes of various data
types, without relying on the underlying hasher type to know about hashed data
types.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
arames added inline comments.Aug 16 2021, 11:30 AM
llvm/unittests/Support/HashBuilderTest.cpp
102–103

I had borrowed the basic tests for hash_code, which did include it. I agree they bring no value here. Removed.

arames updated this revision to Diff 366685.Aug 16 2021, 11:31 AM
arames marked an inline comment as done.

Update test to compare against the hasher.

(Sorry it looks like I missed publishing a few earlier comments.)

I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.

Two high-level points, then more comments inline.

Firstly, I'm a bit concerned about the semantics of update(ArrayRef) (and update(StringRef) by proxy) due to inconsistency with HasherT.

  • HasherT::update(ArrayRef<uint8_t>) hashes the raw bytes pointed to by the array.
  • HashBuilder<HasherT>::update(ArrayRef) (including for uint8_t) hashes the size first, then the raw bytes.

This inconsistency could be confusing and error-prone.

I wonder if we should (mostly) avoid the word update in HashBuilder and use add instead. This also fixes the updateRange problem I point out inline (addRange works just fine without a preposition). We could leave update() around *just* for ArrayRef<uint8_t> (and maybe StringRef), forwarding to addElements/addRangeElements (see inline). The result would be:

  • HashBuilder<HasherT>::update does exactly the same thing as HasherT::update (also working with StringRef, for convenience)
  • HashBuilder<HasherT>::add (and add*) handles higher-level semantics, including primitives, endianness, ranges, etc.

Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.

  • HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.
  • HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.
  • HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:
template <typename HasherT, support::endianness Endianness>
class HashBuilderImpl {
  static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
  // etc.
};

template <typename HasherT, support::Endianness Endianness = native,
                            support::Endianness EffectiveEndianness =
                                Endianness == native
                                   ? support::endian::system_endianness()
                                   : Endianness>
class HashBuilder
    : HashBuilderImpl<HasherT, EffectiveEndianness> {
  using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;

public:
  HashBuilder() = default;
  HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
};
llvm/include/llvm/Support/HashBuilder.h
81

I wonder if std::is_floating_point should be excluded.

  • Float values might generally a bit iffy to be included in a stable hash, because of all the floating point modes that can affect their values.
  • long double has a platform-dependent mantissa.

This would disable HashBuilder::add, forcing clients to think about what they actually want to do with it. WDYT?

84–85

I think these should be private, and Hasher can have a getHasher() accessor. This'll make it easier to make HashBuilder move/copy-friendly in the future if for some reason that's useful.

87–95

Just seeing this question.

  • I don't think it's necessary to default to a stable hash. Clients can do that when it's useful. But if you're uncomfortable with unstable-by-default, another option is not to have a default, and then the client will need to spell out support::native, making it more clear that the hash is unstable. (It's easier to add a default later, but changing the default might not be easy.)
  • I think the explicit parameter for Endianness makes it clear to authors they should be thinking about Endianness.
  • There should be a name change to align with add. addToHash?
88–91

Can this be simplified?

HashBuilder() : OptionalHasher(in_place), Hasher(*OptionalHasher) {}
97–99

I made some comments on the patch this one depends on (https://reviews.llvm.org/D107781 -- FYI, I linked them using the add parent/child revision button above).

I suggest the following to better align with the hashers themselves:

  • Remove the template / enable_if. Require hashers to provide this.
  • Return StringRef. Expect hashers to do the same.
  • Add a result() method also returning StringRef, matching the one from hashers.

I also think an analogue for the static hash() that the hashers have would be useful, at least in the tests. Maybe:

using HashArrayT = decltype(HasherT::hash(ArrayRef<uint8_t>()));
static HashArrayT toArrayImpl(StringRef HashString) {
  assert(HashString.size() == sizeof(HashArrayT));
  HashArrayT HashArray;
  llvm::copy(HashString, HashArray.begin());
  return HashArray;
}

HashArrayT finalHash() { return toArrayImpl(Hasher.final()); }
HashArrayT resultHash() { return toArrayImpl(Hasher.result()); }

I *don't* think there should be a HashBuilder::hash() function (static or otherwise). Forwarding to HashBuilder::add() would be confusing / error-prone, since then HashBuilder<HasherT>::hash(ArrayRef<uint8_t>) would do something different from HasherT::hash(). Forwarding to HasherT::hash would also be confusing, and doesn't really add any value since clients can already do that themselves.

101–108

I wonder if the body of this should be exposed as a general helper for clients, using a weaker std::enable_if check. I.e.:

  • Keep the signature and std::enable_if check for this overload of add.
  • Move the body to a new function, called something like addRawBytes, which takes a const T& and has a weaker std::enable_if check (maybe std::is_primitive? or what does endian::byte_swap support?).
  • Change add to call the new function.

WDYT?

206–208

Both ArrayRef and StringRef do end up calling the version of updateRangeImpl with updateBytes, due to how their begin and end are declared.
Do you mean to do it explicitly, in case the ArrayRef implementation changes ?

Just seeing this question. Two reasons to do it explicitly:

  • avoid compile-time overhead for this common case
  • make it obvious what happens for anyone reading the code (as a common case, people are likely to care that it's doing the right thing)
276–278

I think there's something off with the naming. updateRange sounds like it is mutating the range itself, rather than updating the hash. Could add a preposition, or renaming to add.

285

Similar naming problem here (I'm pretty sure it's my fault), but worse, since I now realize there's nothing inherent about the word "bytes" that implies the size of the range is skipped. Someone could have a container of bytes, and expect that this encoded the container.

Stepping back, I feel like this function -- skipping the size of a range -- could be more generally useful to clients. I wonder if it should be exposed for generic ranges as well, keeping specializations for ArrayRef<uint8_t> and StringRef. Name could be addElements or addRangeElements or something?

315–316

This can call addRangeElements().

320–324

This overload can be dropped, but addRangeElements would want something similar.

llvm/unittests/Support/HashBuilderTest.cpp
39–45

If you take my suggestion to add finalHash, computeHash can be further simplified to:

return HashBuilder<HasherTAndEndianness>().add(Args...).finalHash();

removing the static cast.

IMO, inlining it would improve clarity, but renaming it to hashWithBuilder would also help.

54–57

(Same comments as computeHash: drop std::string, and simplify or rename)

82–83

Looks like you've adjusted the tests in the meantime, but since I hadn't seen this comment before, just wanted to give some more reasoning:

In my opinion it is valuable to ensure the stability of hashes across platforms.

I agree that's a useful trait, but the tests for MD5, SHA1, and SHA256 should be doing that, respectively.

Meanwhile, the following important trait was *not* being tested: do HashBuilder<HashT> and HashT give the same result as each other for each of these raw data types? (Since fixed!)

You could have both (the above, plus the magic number test). It would uncover things such as double having a different representation on different platforms, or some platforms being big- or little-endian. Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

91

This is also byte-swapping. I suggest byteSwapAndHashRawData(). Maybe even add "WithoutBuilder"? Or, since this is only used in one function, you could make it a lambda.

96–98

I (re-)discovered the static method hash when reviewing the other patch, which returns a std::array. I suggest using that instead, and directly returning H::hash(SwappedData). I suggest forwarding the return array directly rather than wrapping in a std::string.

101

I think the name is wrong since the change. Maybe addHashableData?

109–110

Update here looks like what I was asking for, but it's actually not clear at all how these functions are different (I probably suggested a bad name I think, sorry).

I think computeHash can be inlined, avoiding any ambiguity about what it does, if you take my suggestion to add finalHash. But if you keep it, it should probably be renamed to hashWithBuilder to make it clear how the hash is being computed. That'd disambiguate the other one I think?

dexonsmith requested changes to this revision.Aug 16 2021, 4:34 PM

(forgot to mark "request changes" in last comment)

This revision now requires changes to proceed.Aug 16 2021, 4:34 PM
arames marked 18 inline comments as done.Mon, Aug 23, 10:44 AM

Thanks - again - for the comments. Please take a look at the update.

The major changes are around:

  • Renaming update* to add*
  • Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.

However I am pushing back on a number of suggestions that would impose requirement on the hasher type. I think there is a lot of value in keeping the interface trivial to use for user-defined/other hasher types.
That includes the suggested finalHash, and supporting ctors with arguments, and using enable_if to conditionally forward methods like final() and result(), so that we can rely on the single HasherT::update(ArrayRef<uint8_t>) method. I think that is a fair implementation cost for the value it brings.
I have added a test for a custom minimal hasher.

Let me know what you think. I'm always open to discuss and update.

Cheers!

I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.

Two high-level points, then more comments inline.

Firstly, I'm a bit concerned about the semantics of update(ArrayRef) (and update(StringRef) by proxy) due to inconsistency with HasherT.

  • HasherT::update(ArrayRef<uint8_t>) hashes the raw bytes pointed to by the array.
  • HashBuilder<HasherT>::update(ArrayRef) (including for uint8_t) hashes the size first, then the raw bytes.

This inconsistency could be confusing and error-prone.

I wonder if we should (mostly) avoid the word update in HashBuilder and use add instead. This also fixes the updateRange problem I point out inline (addRange works just fine without a preposition). We could leave update() around *just* for ArrayRef<uint8_t> (and maybe StringRef), forwarding to addElements/addRangeElements (see inline). The result would be:

  • HashBuilder<HasherT>::update does exactly the same thing as HasherT::update (also working with StringRef, for convenience)
  • HashBuilder<HasherT>::add (and add*) handles higher-level semantics, including primitives, endianness, ranges, etc.

I did not consider the issue. But I like the proposal, as it removes potential confusions.

Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.

  • HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.
  • HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.
  • HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:

Done.
Doing so requires templated CRTP, as we need to return the reference to our derived class and keep track of HasherT/endianness. It looks slightly convoluted at the template declaration site, but is neat otherwise otherwise.

template <typename HasherT, support::endianness Endianness>
class HashBuilderImpl {
  static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
  // etc.
};

template <typename HasherT, support::Endianness Endianness = native,
                            support::Endianness EffectiveEndianness =
                                Endianness == native
                                   ? support::endian::system_endianness()
                                   : Endianness>
class HashBuilder
    : HashBuilderImpl<HasherT, EffectiveEndianness> {
  using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;

public:
  HashBuilder() = default;
  HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
};
llvm/include/llvm/Support/HashBuilder.h
81

That sounds reasonable to me.
Forcing the hash would still be easy via other scalar types.

87–95

I'll drop the default value for Endianness. The template is simple enough for the rare use-cases, and simple enough to wrap if the lack of default is ever a problem for a user.

88–91

See my top-level comment.
I think we should keep support for ctors with arguments (e.g. for a seed).

97–99

Added a forward to result().

Per my top-level reply, I think we should keep the final() and result() forwards conditional on their existence in HasherT, to keep the use of this interface with a custom hasher type trivial.

I am not convinced by the finalHash() and resultHash() helpers should be members of HashBuilder. I may be missing context on how it is used, so feel free to push if you think the use-case is really valuable.
I would assume that the return type of HasherT::final() is appropriate to represent the hash. It seems bothersome to have to provide these helpers in the interface, making additional assumptions (or more SFINAE code) about the hasher type.

Looking at other related comments, I think a lot of this discussion stems from the fact that MD5, SHA1, and SHA256 final() methods return a reference (StringRef).
It seems to me this discussion would be simplified if there was a method like finalByValue().

My suggestion would be to keep relying on the return type for final(), and have "type converting" wrappers implemented outside of HasherT or HashBuilder for now.
If you like the finalByValue() idea, I could do it in a successor patch.

On this topic, I find the design of the ::hash functions join on the issue. Being static, they return a result by value. Though both strings and array<uint8_t> are of course pretty close, I find std::array is a bit surprising choice when final() and result() return StringRef. So the question will be, what should be the return type for finalByValue() for these hashes. std::array like hash(), or a SmallString maybe.

101–108

Sounds good.

Done, with enablement automatically whenever the type is handled by support::endian::byte_swap().

206–208

Done, with comments.

285

I like it. Done.

315–316

This has become addRangeElementsImpl. Now, addRange simply does add(size); addRangeElements(...);.

320–324

This has become addRangeElementsImpl.

llvm/unittests/Support/HashBuilderTest.cpp
39–45

Per my other comment, I am not convinced by the finalHash() method.
Renamed to hashWithBuilder.

54–57

Renamed to hashRangeWithBuilder.

82–83

I did adjust the tests following your suggestion.

Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

I agree. We would not want to start seeing the HashBuilder tests failing across different platforms because of reasons that are known (and maybe already correctly handled). So let's correctly separate the responsibilities.

91

Done. I was not familiar with using lambdas with auto for that type of use-case.
Also renamed it to ByteSwapAndHashWithHasher for clarity.

96–98

I agree here a single call would be nice.
The change would require the update of previous helpers to return decltype(H::hash), that I am reluctant to do.

109–110

This ended up with explicit names ByteSwapAndHashWithHasher and hashWithBuilder.

arames updated this revision to Diff 368141.Mon, Aug 23, 10:46 AM
arames marked 12 inline comments as done.

Address review comments.

The major changes are around:

  • Renaming update* to add*
  • Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.
dexonsmith accepted this revision.Mon, Aug 23, 12:40 PM

Thanks for the working through this! LGTM if you remove the CRTP (see below, I don't think it's needed). See also a few more nits I found to pick.

llvm/include/llvm/Support/HashBuilder.h
31–36

Using CRTP blocks the compile-time speedup / code size wins. Can we skip HashBuilder/Endianness/EffectiveEndianness/Derived?

46–49

I suggest dropping the return statement, for void update(ArrayRef), a simple passthrough to HasherT::update.

  • No need for the CRTP (faster compile-time / avoid code size bloat by factoring out a common instantiation for the base class)
  • The same functionality is available as HashBuilder& HasherBuilder::addRangeElements, so there's no loss of convenience
56

(this would also return void)

61–67

I don't think we need enable_if or return type deduction here. It seems like overkill to make "does HashBuilder::final work?" SFINAE-detectible.

69–77

Same as final.

92

I think this comment should explain that support::endianness::native is not supported here but that it's handled by HashBuilder. Basically, explain the static assertion in text.

101–103

Using CRTP in HashBuilderImpl also increases compile-time / code size since it blocks sharing code between "native" and whatever the system endianness is. I suggest dropping CRTP and returning HashBuilderImpl& instead of HashBuilder& from add*. (If this needed CRTP, which I don't think it does, I'd suggest using an opaque DerivedT to avoid passing through details like EffectiveEndianness.)

411–425

Given that add* will return HashBuilderImpl&, might be reasonable to reduce this to a typedef:

template <class HasherT, support::endianness Endianness>
using HashBuilder =
    HashBuilderImpl<HasherT,
                    (Endianness == support::endianness::native
                         ? support::endian::system_endianness()
                         : Endianness)>;

Either way seems fine though.

llvm/unittests/Support/HashBuilderTest.cpp
47–51

I'd skip auto type deduction unless the type is hard to spell. Also you can use StringRef::str here.

54–57

(same as above)

This revision is now accepted and ready to land.Mon, Aug 23, 12:40 PM
arames updated this revision to Diff 368459.Tue, Aug 24, 2:04 PM
arames marked 8 inline comments as done.

Main changes:

  • Remove CRTP
  • (Attempt to) fix Windows build
  • Fix nits
llvm/include/llvm/Support/HashBuilder.h
61–67

You may want to double-check my earlier comments.
I think there is a strong value in allowing trivial support for user-defined (or simply other) hashes. Here, that means:

  • not requiring support for HasherT::final()
  • not assuming the return type
101–103

Removed CRTP.

arames updated this revision to Diff 368471.Tue, Aug 24, 2:33 PM

Apply clang-format.

arames updated this revision to Diff 368472.Tue, Aug 24, 2:41 PM

One more attempt to fix the Windows build.

arames updated this revision to Diff 368482.Tue, Aug 24, 3:26 PM
  • Explicitly disable hashable types for the addHash overload (for the Windows build).
  • Introduce checks to confirm debian x64 test failures (to be removed).
dexonsmith added inline comments.Tue, Aug 24, 3:35 PM
llvm/include/llvm/Support/HashBuilder.h
61–67

On not requiring HasherT::final, my suggested edit doesn't change that (unless I'm missing something?). The call to HasherT::final is template-dependent. Both before and after the edit, there will be no compile errors if there are no calls to HashBuilder<HasherT>::final, and a compile error if there is such a call. The only semantic difference is whether HashBuilder::final is SFINAE-detectable (compile error because there is no overload for HashBuilder::final vs. compile error because the overload for HashBuilder::final can't be instantiated), which doesn't seem important.

On the return type, I don't see the benefit of allowing flexibility here. I also don't see a tie-in to the argument about trivial support. And there is a cost: generic code that depends on HashBuilder::final will have to reason about other arbitrary return types. It's simpler for clients if this just returns StringRef.

If you still think there's strong value in either of those (which both seem unrelated to allowing use of trivial HasherT types), can you clarify what it is?

arames added inline comments.Tue, Aug 24, 3:56 PM
llvm/include/llvm/Support/HashBuilder.h
61–67

Both done. (will be updated in the next arc diff.)

I misunderstood your earlier comment, thinking you suggested to still require HasherT::final(). And I didn't think of simply relying on the template for the purpose.
I felt specializing the interface to follow particular hasher types might not be very wise. I could imagine a hash wanted to return say uint64_t. But since this is hypothetical, I buy the argument of just keeping things simple for now.

arames updated this revision to Diff 368518.Tue, Aug 24, 6:08 PM

Ran through the address sanitizer and fixed stack-use-after-scope issues with ArrayRef.

arames updated this revision to Diff 368720.Wed, Aug 25, 2:00 PM

Test different variadic impl for Windows.

arames updated this revision to Diff 368748.Wed, Aug 25, 3:10 PM

Propagate variadic expansion workaround to the tuple overload.

arames edited the summary of this revision. (Show Details)Wed, Aug 25, 4:14 PM
arames updated this revision to Diff 368788.Wed, Aug 25, 6:20 PM

Remove one of the workarounds for Windows.

arames updated this revision to Diff 368802.Wed, Aug 25, 9:28 PM

Finalize workaround for Windows.

As far as I can see in the C++ spec section 8.5.4 guarantees evaluation order of
the initializer-list in a braced-init-list. But this not seem to hold with MSVC
19.28.29914.0. So simply use a recursive template instead.

This revision was automatically updated to reflect the committed changes.