This is an archive of the discontinued LLVM Phabricator instance.

Added hash_stream class for producing hash codes from data streams.
Abandoned · Public

Authored by teemperor on Jul 19 2016, 7:10 AM.

Details

Summary

So far, LLVM only offers hashing for data sets that have to be fully accessible during hashing (such as containers). To conform to this API, users sometimes have to first collect and store all the necessary data and only afterwards pass it to the hash functions. One example is FoldingSetNodeID, which is sometimes used just to generate hash codes and relies on this workaround.

This patch adds a class whose API allows providing the data piece by piece, eliminating the need to store all the data before hashing it. hash_stream produces hash codes of the same quality as hash_combine and hash_value, as it uses the same backend.
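A minimal usage sketch of the idea follows. Since the diff itself is not reproduced in this archive, the member names (add, finalize) are assumptions rather than the patch's actual API:

#include "llvm/ADT/Hashing.h"

#include <cstdint>

// Hypothetical usage of the proposed class; 'add' and 'finalize' are assumed names.
llvm::hash_code hashPieceByPiece(int A, float B, uint64_t C) {
  llvm::hash_stream Stream; // the class proposed by this patch
  Stream.add(A);            // data is provided piece by piece...
  Stream.add(B);
  Stream.add(C);            // ...without an intermediate container
  return Stream.finalize(); // same CityHash backend as hash_combine
}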

Diff Detail

Event Timeline

teemperor updated this revision to Diff 64485 (Jul 19 2016, 7:10 AM).
teemperor retitled this revision to "Added hash_stream class for producing hash codes from data streams".
teemperor updated this object.
teemperor added a subscriber: llvm-commits.
teemperor updated this revision to Diff 67689 (Aug 11 2016, 8:06 AM).
teemperor added reviewers: v.g.vassilev, NoQ.
mehdi_amini added a subscriber: mehdi_amini.
NoQ added inline comments (Aug 15 2016, 7:35 AM).
include/llvm/ADT/Hashing.h:37

I think it's worth mentioning that hash_stream wasn't proposed as part of N3333, unlike the other classes in this file. Or maybe it could be proposed :)

include/llvm/ADT/Hashing.h:731

Maybe compare size and remaining_size only once? The control flow would also be clearer that way.

if (size <= remaining_size) {
  // Everything fits into the remaining buffer space.
  append_to_buffer(c, size);
  return *this;
}

// Fill up the buffer; the rest of the input is handled below.
append_to_buffer(c, remaining_size);
unittests/ADT/HashingTest.cpp:160

Maybe you wanted to use '55l' here? (:

chandlerc edited edge metadata (Aug 16 2016, 5:13 PM).

I'd like to understand the immediate motivation here...

While I understand the abstract use case, I'm very hesitant to extend this interface in such a significant way without a very strong and immediate use case at hand. The committee is still trying to get this API standardized and I think it is quite likely to change. The more we extend and build up dependencies on it the harder it will be to follow any subsequent changes.

The intention of this patch wasn't to extend the proposed API in any way; the only reason the hash_stream class landed here is that LLVM's CityHash implementation lives in this file's ::detail:: namespace, and implementing hash_stream somewhere else would mean depending on implementation details from a different source file. Sorry about the confusion.

As the CityHash implementation isn't part of N3333: what about moving that implementation into its own implementation header and letting Hashing.h and hash_stream use it? With that, Hashing.h would only contain the N3333 API and hash_stream could use the same backend in a clean way (a possible layout is sketched below).
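For illustration, one possible layout for such a split; these file names are hypothetical and not part of the actual patch:

// Hypothetical header layout (all names are assumptions):
//
//   include/llvm/ADT/HashingImpl.h - CityHash-based detail:: routines
//   include/llvm/ADT/Hashing.h     - only the N3333 API; includes HashingImpl.h
//   include/llvm/ADT/HashStream.h  - hash_stream; includes HashingImpl.h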

> The intention of this patch wasn't to extend the proposed API in any way; the only reason the hash_stream class landed here is that LLVM's CityHash implementation lives in this file's ::detail:: namespace, and implementing hash_stream somewhere else would mean depending on implementation details from a different source file. Sorry about the confusion.
>
> As the CityHash implementation isn't part of N3333: what about moving that implementation into its own implementation header and letting Hashing.h and hash_stream use it? With that, Hashing.h would only contain the N3333 API and hash_stream could use the same backend in a clean way.

None of this actually addresses my comment though: "I'd like to understand the immediate motivation here..."

Shifting the interface we expand from one file to another doesn't really change much IMO.

The immediate motivation is that in D22515 we need to generate a hash code for data that isn't stored in a container but implicitly in the properties of some AST nodes. This hash code has to be generated for every single node in the AST, so we try to avoid calling any heavy code in the process. But to generate a hash code with the current API, we first have to forward the data stream we get from the visitor into a container like FoldingSetNodeID (which may even allocate memory) and then pass that container to the hashing function, which again treats it as a stream of data.

hash_stream would allow us to generate hash codes without this unnecessary buffer between the two streams. Because the API lets its users fill this buffer with whatever data they are interested in, the buffer could become a performance bottleneck in cases where a lot of data is added.

The other option would be to call hash_combine(OldHashCode, NewData) on every new piece of data, which is essentially a makeshift version of this patch, because the hash state then has to be re-finalized on every new chunk of data.
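For illustration, a rough sketch of the two workarounds described above; the ArrayRef input is a stand-in for the data coming out of the visitor:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/FoldingSet.h"
#include "llvm/ADT/Hashing.h"

// Workaround 1: buffer all the data in a FoldingSetNodeID first.
unsigned hashViaBuffer(llvm::ArrayRef<unsigned> Data) {
  llvm::FoldingSetNodeID ID; // may heap-allocate for larger inputs
  for (unsigned V : Data)
    ID.AddInteger(V);
  return ID.ComputeHash();
}

// Workaround 2: chain hash_combine, re-finalizing on every chunk.
llvm::hash_code hashViaCombine(llvm::ArrayRef<unsigned> Data) {
  llvm::hash_code H = llvm::hash_value(0u);
  for (unsigned V : Data)
    H = llvm::hash_combine(H, V); // finalizes the internal state each time
  return H;
}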

> The immediate motivation is that in D22515 we need to generate a hash code for data that isn't stored in a container but implicitly in the properties of some AST nodes.

Ok, thanks. This really helps.

The current code in Hashing.h is really strongly engineered toward container usage though. I'm not sure it is a reasonable approach for many other uses.

As one example, it is designed to be statistically resilient to collisions in the space in which containers are likely to exist, and unbiased if high bits are masked off. The use case you suggest doesn't necessarily seem to fit either of those.

I can imagine clone detection actually not wanting *any* collisions -- it essentially might want a *fingerprint* or *signature* rather than merely a hash code. If that is the case, I think an API for doing online-updates of MD5 (or better yet Blake2, but that isn't in-tree) would be a much better choice.

I can also imagine clone detection using this more like a hash-similarity search or bloom filter. In that case, cityhash is very likely to be a much more rigorous (and slow) hash than you would want.

Have you looked at these options at all? If so, what tradeoffs made them unappealing and made cityhash itself appealing?
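For reference, LLVM's in-tree MD5 (llvm/Support/MD5.h) already supports the online-update style mentioned above; a minimal sketch, with placeholder chunk contents:

#include "llvm/ADT/SmallString.h"
#include "llvm/Support/MD5.h"

// Incrementally fingerprint several chunks of data with LLVM's MD5.
llvm::SmallString<32> fingerprintChunks() {
  llvm::MD5 Hash;
  Hash.update("first chunk");  // feed data as it becomes available
  Hash.update("second chunk"); // no intermediate buffer needed
  llvm::MD5::MD5Result Result;
  Hash.final(Result);          // finalize into a 128-bit digest
  llvm::SmallString<32> Digest;
  llvm::MD5::stringifyResult(Result, Digest); // hex string form
  return Digest;
}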

teemperor abandoned this revision (Aug 18 2016, 4:52 PM).

Thanks for the tips and the review!

I think the main reason this backend is used is that the code evolved from the similar functionality in Stmt::Profile, which also uses it. We didn't look into alternatives (at least not that I'm aware of). And there is currently no performance data that would justify using CityHash over some other implementation, so I moved the clone detection to MD5 for the time being.

And we actually wanted just hashes in the clone detector, as generating a good fingerprint for every AST node would be a tough requirement for the user. The hash is intended more as a fast first guess about which nodes belong together.

I'll abandon this patch unless someone else has a need for it or we do performance testing that actually suggests using CityHash.