The HashBuilder interface allows conveniently building hashes of various data
types, without relying on the underlying hasher type to know about hashed data
types.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This is a follow-up to discussion about https://reviews.llvm.org/D102943.
This is a proposal to expose the HashBuilder helper that will then be used for the modules hashing.
Thanks, this looks like a great start to me. Lots of comments inline (many of them nitpicks).
Expose a raw pointer update interface for hashes.
The new method update(const uint8_t *Ptr, size_t Size) is an alternative to
the existing update(ArrayRef<uint8_t> Data). It will allow the incoming
HashBuilder interface to not depend on ArrayRef.
If we add pointer+size APIs to the hashers, I think that should be in a separate prep patch...
... but IMO, ArrayRef is a safe/clean encapsulation of pointer+size for HashBuilder to use. Can you explain why you want to avoid it?
llvm/include/llvm/ADT/ArrayRef.h | ||
---|---|---|
17 ↗ | (On Diff #362146) | I think HashBuilder will probably be needed in fewer places than ArrayRef. It might be better to invert the includes (and move the new functions to HashBuilder.h). |
575–602 ↗ | (On Diff #362146) | I suggest moving these to member functions of HashBuilder. |
llvm/include/llvm/ADT/StringRef.h | ||
15 ↗ | (On Diff #362146) | Even more so than ArrayRef.h, we try really hard to avoid adding includes to StringRef.h since it's so widely included. I suggest inverting the include relationship. |
966 ↗ | (On Diff #362146) | (I suggest an overload in HashBuilder rather than a free function) |
llvm/include/llvm/Support/HashBuilder.h | ||
22 | You should be able to remove this once you're including StringRef.h | |
140 | StringRef would be more generally useful; I suggest that instead of std::string. (I don't think we usually bother handling other character types in LLVM... I suggest leaving it out, unless/until you have a specific use case that needs it?) | |
144–147 | Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a pair. | |
148–153 | Should this be by-reference? Might be worth adding a test with a no-move no-copy type in a tuple. | |
167 | Neat, I hadn't seen this pattern before. Can the variable name be skipped? (void)std::tuple<const Ts &...>{(update(Args), Args)...}; If not, please add (void)Unused; to suppress diagnostics. | |
170–173 | I suggest adding a single-element overload that takes a generic range, calling adl_begin() should also be a single-element overload that calls llvm::adl_begin() and llvm::adl_end()` to extract iterators. Also, this isn't going to work for any old input iterator. I suggest:
Another option is to use random-access iterators and leave it for a future patch to handle forward iterators. | |
177 | I suggest taking / using an ArrayRef<uint8_t> here. It'd be a nice convenience to add an overload for updateBytes(StringRef) as well -- something I've wanted a few times when using the Hasher interfaces with MemoryBuffer::getBuffer() -- which can just cast over to ArrayRef<uint8_t>`. | |
199 | I think this will have to be a std::distance call if you handle arbitrary forward iterators (not sure that's necessary). The template parameters could use an update in either case. | |
llvm/unittests/Support/HashBuilderTest.cpp | ||
52 | Lots of "invalid case style" warnings coming from clang-tidy in this file. Please follow the style conventions at https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly (e.g., use C for this variable instead of c), if nothing else to make the patch easier to read. | |
76 | Rather than duplicating the tests, I suggest using http://google.github.io/googletest/advanced.html#typed-tests to iterate through MD5, SHA1, and SHA256. | |
226–229 | I suggest comparing these both (or at least one of them) against: update(1); update("string"); so that you're confirming the hash is updated correctly, not just the same way as each other (unless you tested that elsewhere and I missed it?) |
Address review comments:
Main changes:
- Switch dependency order: HashBuilder now depends on ArrayRef and StringRef
- Clean up updateRange and updateRangeImpl
- Fix case style
Sure. That was already separated in a different commit.
... but IMO, ArrayRef is a safe/clean encapsulation of pointer+size for HashBuilder to use. Can you explain why you want to avoid it?
I agree ArrayRef is a safe and clean encapsulation. It seemed to me it was cleaner to not have the HashBuilder interface depend on any particular type, in a similar way to hash_code.
However your arguments about ArrayRef and StringRef being much more commonly used are on point, and si agree switching the dependency order and having ArrayRef and StringRef specialized in HashBuilder.h will be better.
llvm/include/llvm/ADT/ArrayRef.h | ||
---|---|---|
17 ↗ | (On Diff #362146) | Done. (See also global reply.) |
575–602 ↗ | (On Diff #362146) | Done. (See also global reply.) |
llvm/include/llvm/ADT/StringRef.h | ||
15 ↗ | (On Diff #362146) | Done. (See also global reply.) |
966 ↗ | (On Diff #362146) | Done for StringRef and ArrayRef, since we now depend on both. |
llvm/include/llvm/Support/HashBuilder.h | ||
140 | Nah; this was to have the generic form. Leaving it out. | |
144–147 | Absolutely. Fixed, with a test. | |
148–153 | Absolutely. Fixed with a test. | |
167 | It can be skipped indeed. | |
170–173 | The update should fit what you want, but please take a second look. Added an overload for a generic range using adl_begin() and adl_end(). Interestingly, with all this, implementations of update for ArrayRef and StringRef can simply forward to updateRange. | |
177 | Sure thing. I can imagine the three overloads to be helpful in different contexts. |
Updates look great; a few more comments inline.
Since the ultimate goal is to make it easy for higher-level logic to use a stable hash across different architectures, I realize that endianness could be important too for the hashable-data overloads. Have you considered what to do there?
One idea: add an endianness template parameter to HashBuilder:
template <class HasherT, support::endianness Endianness = support::native> class HashBuilder;
then change the update overloads for hashable data to hash the result of support::endian::byte_swap(V, Endianness). In the common case (native) the copy/etc should get optimized out. WDYT?
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
87–95 | Please use makeArrayRef() to avoid splitting the pointer from the size. Also, I wonder if this should take into account endianness. | |
196 | I think you can drop the name here too. | |
206–208 | I think it's important to have updateRange overloads for ArrayRef (at least for uint8_t) and StringRef that directly call updateBytes rather than iterating. | |
221–225 | I don't think we should add this overload. The rare call site can call makeArrayRef itself, encouraging it to push ArrayRef further up the stack. | |
llvm/unittests/Support/HashBuilderTest.cpp | ||
27–37 | Can MD5 be updated in a prep patch to have an overload of final like this so this bridge code isn't needed? | |
39–45 | Seeing this boilerplate makes me wonder if the builder API could be improved with something like:
Then the above logic becomes a one-liner: HashBuilder<HasherT>().update(Args...).final() (maybe still worth encapsulating in a function here for testing, but maybe not; not sure) | |
82–83 | It's not ideal to need to spell this out for each HasherT, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:
template <class T> void checkHashableData(T Data) { HasherT RawHasher; auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...); RawHasher.update(Bytes); CHECK_EQ(HashBuilder<HasherT>().update(Data).final(), RawHasher.final()); } | |
103 | I suggest dropping the Typed and calling this simply HashBuilderTest. |
llvm/unittests/Support/HashBuilderTest.cpp | ||
---|---|---|
82–83 | I'm still seeing magic numbers, although this comment seems to be in the wrong place now. See further down in ReferenceHashCheck. | |
102–103 | It's not clear to me why you're including volatile ints here. Is there any reason to expect them to have different behaviour? | |
109–110 | I'm not such a fan of this reference hash / magic value idea. I'd prefer to check each of these data types separately, and rely on comparing against the result of the underlying Hasher (which we can assume is already well-tested elsewhere). E.g.: // helper for hashers template <class H, class T> auto hashRawData(T Data) { H Hasher; Hasher.update(makeRawDataArrayRef(makeArrayRef( reinterpret_cast<uint8_t>(&Data), sizeof(Data))); return H.final(); }; // then in the test for raw integer data: int I = 0x12345678; EXPECT_EQ(HashBuilder<H>.update(I).final(), hashRawData<H>(I)); Could be each in their own test (one per type), or just in separate scopes, don't think it matters much. Or if you think the magic numbers are better, can you walk me through your logic? |
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
87–95 | Updated to use makeArrayRef(). Great point. I didn't pay attention to it, but taking it into account will significantly improve the value of this class.
template <typename HasherT, support::endianness Endianness> void updateHash(HashBuilder<HasherT, Endianness> &HBuilder, const UserType &Value) I think that is ok. Do you prefer "reserving" the updateHash name and going with template <typename HashBuilderT> void updateHash(HashBuilderT &HBuilder, const UserType &Value) ? | |
206–208 | Both ArrayRef and StringRef do end up calling the version of updateRangeImpl with updateBytes, due to how their begin and end are declared. | |
llvm/unittests/Support/HashBuilderTest.cpp | ||
27–37 | https://reviews.llvm.org/D107781 This PR will have to be rebased once the prep is reviewed and lands. | |
39–45 | Done. That makes it more usable indeed. | |
82–83 | I have cleaned this up significantly as part of the endianness support. It does not look like what you suggest, but I think it is much better than before. This is the only test that verifies against a hard-coded reference hash. In my opinion it is valuable to ensure the stability of hashes across platforms. |
llvm/unittests/Support/HashBuilderTest.cpp | ||
---|---|---|
102–103 | I had borrowed the basic tests for hash_code, which did include it. I agree they bring no value here. Removed. |
I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.
Two high-level points, then more comments inline.
Firstly, I'm a bit concerned about the semantics of update(ArrayRef) (and update(StringRef) by proxy) due to inconsistency with HasherT.
- HasherT::update(ArrayRef<uint8_t>) hashes the raw bytes pointed to by the array.
- HashBuilder<HasherT>::update(ArrayRef) (including for uint8_t) hashes the size first, then the raw bytes.
This inconsistency could be confusing and error-prone.
I wonder if we should (mostly) avoid the word update in HashBuilder and use add instead. This also fixes the updateRange problem I point out inline (addRange works just fine without a preposition). We could leave update() around *just* for ArrayRef<uint8_t> (and maybe StringRef), forwarding to addElements/addRangeElements (see inline). The result would be:
- HashBuilder<HasherT>::update does exactly the same thing as HasherT::update (also working with StringRef, for convenience)
- HashBuilder<HasherT>::add (and add*) handles higher-level semantics, including primitives, endianness, ranges, etc.
Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.
- HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.
- HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.
- HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:
template <typename HasherT, support::endianness Endianness> class HashBuilderImpl { static_assert(Endianness != native, "HashBuilder should canonicalize endianness"); // etc. }; template <typename HasherT, support::Endianness Endianness = native, support::Endianness EffectiveEndianness = Endianness == native ? support::endian::system_endianness() : Endianness> class HashBuilder : HashBuilderImpl<HasherT, EffectiveEndianness> { using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>; public: HashBuilder() = default; HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {} };
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
81 | I wonder if std::is_floating_point should be excluded.
This would disable HashBuilder::add, forcing clients to think about what they actually want to do with it. WDYT? | |
84–85 | I think these should be private, and Hasher can have a getHasher() accessor. This'll make it easier to make HashBuilder move/copy-friendly in the future if for some reason that's useful. | |
87–95 | Just seeing this question.
| |
88–91 | Can this be simplified? HashBuilder() : OptionalHasher(in_place), Hasher(*OptionalHasher) {} | |
97–99 | I made some comments on the patch this one depends on (https://reviews.llvm.org/D107781 -- FYI, I linked them using the add parent/child revision button above). I suggest the following to better align with the hashers themselves:
I also think an analogue for the static hash() that the hashers have would be useful, at least in the tests. Maybe: using HashArrayT = decltype(HasherT::hash(ArrayRef<uint8_t>())); static HashArrayT toArrayImpl(StringRef HashString) { assert(HashString.size() == sizeof(HashArrayT)); HashArrayT HashArray; llvm::copy(HashString, HashArray.begin()); return HashArray; } HashArrayT finalHash() { return toArrayImpl(Hasher.final()); } HashArrayT resultHash() { return toArrayImpl(Hasher.result()); } I *don't* think there should be a HashBuilder::hash() function (static or otherwise). Forwarding to HashBuilder::add() would be confusing / error-prone, since then HashBuilder<HasherT>::hash(ArrayRef<uint8_t>) would do something different from HasherT::hash(). Forwarding to HasherT::hash would also be confusing, and doesn't really add any value since clients can already do that themselves. | |
101–108 | I wonder if the body of this should be exposed as a general helper for clients, using a weaker std::enable_if check. I.e.:
WDYT? | |
206–208 |
Just seeing this question. Two reasons to do it explicitly:
| |
276–278 | I think there's something off with the naming. updateRange sounds like it is mutating the range itself, rather than updating the hash. Could add a preposition, or renaming to add. | |
285 | Similar naming problem here (I'm pretty sure it's my fault), but worse, since I now realize there's nothing inherent about the word "bytes" that implies the size of the range is skipped. Someone could have a container of bytes, and expect that this encoded the container. Stepping back, I feel like this function -- skipping the size of a range -- could be more generally useful to clients. I wonder if it should be exposed for generic ranges as well, keeping specializations for ArrayRef<uint8_t> and StringRef. Name could be addElements or addRangeElements or something? | |
315–316 | This can call addRangeElements(). | |
320–324 | This overload can be dropped, but addRangeElements would want something similar. | |
llvm/unittests/Support/HashBuilderTest.cpp | ||
39–45 | If you take my suggestion to add finalHash, computeHash can be further simplified to: return HashBuilder<HasherTAndEndianness>().add(Args...).finalHash(); removing the static cast. IMO, inlining it would improve clarity, but renaming it to hashWithBuilder would also help. | |
54–57 | (Same comments as computeHash: drop std::string, and simplify or rename) | |
82–83 | Looks like you've adjusted the tests in the meantime, but since I hadn't seen this comment before, just wanted to give some more reasoning:
I agree that's a useful trait, but the tests for MD5, SHA1, and SHA256 should be doing that, respectively. Meanwhile, the following important trait was *not* being tested: do HashBuilder<HashT> and HashT give the same result as each other for each of these raw data types? (Since fixed!) You could have both (the above, plus the magic number test). It would uncover things such as double having a different representation on different platforms, or some platforms being big- or little-endian. Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate... | |
91 | This is also byte-swapping. I suggest byteSwapAndHashRawData(). Maybe even add "WithoutBuilder"? Or, since this is only used in one function, you could make it a lambda. | |
96–98 | I (re-)discovered the static method hash when reviewing the other patch, which returns a std::array. I suggest using that instead, and directly returning H::hash(SwappedData). I suggest forwarding the return array directly rather than wrapping in a std::string. | |
101 | I think the name is wrong since the change. Maybe addHashableData? | |
109–110 | Update here looks like what I was asking for, but it's actually not clear at all how these functions are different (I probably suggested a bad name I think, sorry). I think computeHash can be inlined, avoiding any ambiguity about what it does, if you take my suggestion to add finalHash. But if you keep it, it should probably be renamed to hashWithBuilder to make it clear how the hash is being computed. That'd disambiguate the other one I think? |
Thanks - again - for the comments. Please take a look at the update.
The major changes are around:
- Renaming update* to add*
- Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.
However I am pushing back on a number of suggestions that would impose requirement on the hasher type. I think there is a lot of value in keeping the interface trivial to use for user-defined/other hasher types.
That includes the suggested finalHash, and supporting ctors with arguments, and using enable_if to conditionally forward methods like final() and result(), so that we can rely on the single HasherT::update(ArrayRef<uint8_t>) method. I think that is a fair implementation cost for the value it brings.
I have added a test for a custom minimal hasher.
Let me know what you think. I'm always open to discuss and update.
Cheers!
I did not consider the issue. But I like the proposal, as it removes potential confusions.
Secondly, a base class that lacks the Endianness template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.
- HashBuilderBase can take just a HasherT parameter, and implement update (new meaning, matching HasherT::update), final(), and all other things that don't care about endianness.
- HashBuilderImpl layers the add* family on top and takes an Endianness parameter. But native is illegal here.
- HashBuilder canonicalizes native to system_endianness() to avoid code bloat, something like:
Done.
Doing so requires templated CRTP, as we need to return the reference to our derived class and keep track of HasherT/endianness. It looks slightly convoluted at the template declaration site, but is neat otherwise otherwise.
template <typename HasherT, support::endianness Endianness> class HashBuilderImpl { static_assert(Endianness != native, "HashBuilder should canonicalize endianness"); // etc. }; template <typename HasherT, support::Endianness Endianness = native, support::Endianness EffectiveEndianness = Endianness == native ? support::endian::system_endianness() : Endianness> class HashBuilder : HashBuilderImpl<HasherT, EffectiveEndianness> { using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>; public: HashBuilder() = default; HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {} };
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
81 | That sounds reasonable to me. | |
87–95 | I'll drop the default value for Endianness. The template is simple enough for the rare use-cases, and simple enough to wrap if the lack of default is ever a problem for a user. | |
88–91 | See my top-level comment. | |
97–99 | Added a forward to result(). Per my top-level reply, I think we should keep the final() and result() forwards conditional on their existence in HasherT, to keep the use of this interface with a custom hasher type trivial. I am not convinced by the finalHash() and resultHash() helpers should be members of HashBuilder. I may be missing context on how it is used, so feel free to push if you think the use-case is really valuable. Looking at other related comments, I think a lot of this discussion stems from the fact that MD5, SHA1, and SHA256 final() methods return a reference (StringRef). My suggestion would be to keep relying on the return type for final(), and have "type converting" wrappers implemented outside of HasherT or HashBuilder for now. On this topic, I find the design of the ::hash functions join on the issue. Being static, they return a result by value. Though both strings and array<uint8_t> are of course pretty close, I find std::array is a bit surprising choice when final() and result() return StringRef. So the question will be, what should be the return type for finalByValue() for these hashes. std::array like hash(), or a SmallString maybe. | |
101–108 | Sounds good. Done, with enablement automatically whenever the type is handled by support::endian::byte_swap(). | |
206–208 | Done, with comments. | |
285 | I like it. Done. | |
315–316 | This has become addRangeElementsImpl. Now, addRange simply does add(size); addRangeElements(...);. | |
320–324 | This has become addRangeElementsImpl. | |
llvm/unittests/Support/HashBuilderTest.cpp | ||
39–45 | Per my other comment, I am not convinced by the finalHash() method. | |
54–57 | Renamed to hashRangeWithBuilder. | |
82–83 | I did adjust the tests following your suggestion.
I agree. We would not want to start seeing the HashBuilder tests failing across different platforms because of reasons that are known (and maybe already correctly handled). So let's correctly separate the responsibilities. | |
91 | Done. I was not familiar with using lambdas with auto for that type of use-case. | |
96–98 | I agree here a single call would be nice. | |
109–110 | This ended up with explicit names ByteSwapAndHashWithHasher and hashWithBuilder. |
Address review comments.
The major changes are around:
- Renaming update* to add*
- Refactoring HashBuilder into HashBuilderBase, HashBuilderImpl, and HashBuilder.
Thanks for the working through this! LGTM if you remove the CRTP (see below, I don't think it's needed). See also a few more nits I found to pick.
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
31–36 | Using CRTP blocks the compile-time speedup / code size wins. Can we skip HashBuilder/Endianness/EffectiveEndianness/Derived? | |
46–49 | I suggest dropping the return statement, for void update(ArrayRef), a simple passthrough to HasherT::update.
| |
56 | (this would also return void) | |
61–67 | I don't think we need enable_if or return type deduction here. It seems like overkill to make "does HashBuilder::final work?" SFINAE-detectible. | |
69–77 | Same as final. | |
92 | I think this comment should explain that support::endianness::native is not supported here but that it's handled by HashBuilder. Basically, explain the static assertion in text. | |
101–103 | Using CRTP in HashBuilderImpl also increases compile-time / code size since it blocks sharing code between "native" and whatever the system endianness is. I suggest dropping CRTP and returning HashBuilderImpl& instead of HashBuilder& from add*. (If this needed CRTP, which I don't think it does, I'd suggest using an opaque DerivedT to avoid passing through details like EffectiveEndianness.) | |
411–425 | Given that add* will return HashBuilderImpl&, might be reasonable to reduce this to a typedef: template <class HasherT, support::endianness Endianness> using HashBuilder = HashBuilderImpl<HasherT, (Endianness == support::endianness::native ? support::endian::system_endianness() : Endianness)>; Either way seems fine though. | |
llvm/unittests/Support/HashBuilderTest.cpp | ||
47–51 | I'd skip auto type deduction unless the type is hard to spell. Also you can use StringRef::str here. | |
54–57 | (same as above) |
Main changes:
- Remove CRTP
- (Attempt to) fix Windows build
- Fix nits
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
61–67 | You may want to double-check my earlier comments.
| |
101–103 | Removed CRTP. |
- Explicitly disable hashable types for the addHash overload (for the Windows build).
- Introduce checks to confirm debian x64 test failures (to be removed).
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
61–67 | On not requiring HasherT::final, my suggested edit doesn't change that (unless I'm missing something?). The call to HasherT::final is template-dependent. Both before and after the edit, there will be no compile errors if there are no calls to HashBuilder<HasherT>::final, and a compile error if there is such a call. The only semantic difference is whether HashBuilder::final is SFINAE-detectable (compile error because there is no overload for HashBuilder::final vs. compile error because the overload for HashBuilder::final can't be instantiated), which doesn't seem important. On the return type, I don't see the benefit of allowing flexibility here. I also don't see a tie-in to the argument about trivial support. And there is a cost: generic code that depends on HashBuilder::final will have to reason about other arbitrary return types. It's simpler for clients if this just returns StringRef. If you still think there's strong value in either of those (which both seem unrelated to allowing use of trivial HasherT types), can you clarify what it is? |
llvm/include/llvm/Support/HashBuilder.h | ||
---|---|---|
61–67 | Both done. (will be updated in the next arc diff.) I misunderstood your earlier comment, thinking you suggested to still require HasherT::final(). And I didn't think of simply relying on the template for the purpose. |
Ran through the address sanitizer and fixed stack-use-after-scope issues with ArrayRef.
Finalize workaround for Windows.
As far as I can see in the C++ spec section 8.5.4 guarantees evaluation order of
the initializer-list in a braced-init-list. But this not seem to hold with MSVC
19.28.29914.0. So simply use a recursive template instead.
You should be able to remove this once you're including StringRef.h