This is an archive of the discontinued LLVM Phabricator instance.

Added hash_stream class for producing hash codes from data streams.
Abandoned · Public

Authored by teemperor on Jul 19 2016, 7:10 AM.

Details

Summary

So far, LLVM only offers hashing for data sets that have to be fully accessible during hashing (such as containers). To conform to this API, users sometimes have to first collect and store all the necessary data and only afterwards pass it to the hash functions. One example is FoldingSetNodeID, which is sometimes used just to generate hash codes and relies on this workaround.

This patch adds a class whose API allows providing the data piece by piece, eliminating the need to store all the data before hashing it. hash_stream produces hash codes of the same quality as hash_combine and hash_value, as it uses the same backend.
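A minimal usage sketch of the idea follows. Since the diff itself is not reproduced in this archive, the member names (add, finalize) are assumptions rather than the patch's actual API:

#include "llvm/ADT/Hashing.h"

#include <cstdint>

// Hypothetical usage of the proposed class; 'add' and 'finalize' are assumed names.
llvm::hash_code hashPieceByPiece(int A, float B, uint64_t C) {
  llvm::hash_stream Stream; // the class proposed by this patch
  Stream.add(A);            // data is provided piece by piece...
  Stream.add(B);
  Stream.add(C);            // ...without an intermediate container
  return Stream.finalize(); // same CityHash backend as hash_combine
}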

Diff Detail

Event Timeline

teemperor updated this revision to Diff 64485 (Jul 19 2016, 7:10 AM).
teemperor retitled this revision to "Added hash_stream class for producing hash codes from data streams".
teemperor updated this object.
teemperor added a subscriber: llvm-commits.
teemperor updated this revision to Diff 67689 (Aug 11 2016, 8:06 AM).
teemperor added reviewers: v.g.vassilev, NoQ.
mehdi_amini added a subscriber: mehdi_amini.
NoQ added inline comments (Aug 15 2016, 7:35 AM).
include/llvm/ADT/Hashing.h:37

I think it's worth mentioning that hash_stream wasn't proposed as part of N3333, unlike the other classes in this file. Or maybe it could be proposed :)

include/llvm/ADT/Hashing.h:731

Maybe compare size and remaining_size only once? The control flow would also be clearer that way.

if (size <= remaining_size) {
  // Everything fits into the remaining buffer space.
  append_to_buffer(c, size);
  return *this;
}

// Fill up the buffer; the rest of the input is handled below.
append_to_buffer(c, remaining_size);
unittests/ADT/HashingTest.cpp:160

Maybe you wanted to use '55l' here? (:

chandlerc edited edge metadata (Aug 16 2016, 5:13 PM).

I'd like to understand the immediate motivation here...

While I understand the abstract use case, I'm very hesitant to extend this interface in such a significant way without a very strong and immediate use case at hand. The committee is still trying to get this API standardized and I think it is quite likely to change. The more we extend and build up dependencies on it the harder it will be to follow any subsequent changes.

The intention of this patch wasn't to extend the proposed API in any way; the only reason the hash_stream class landed here is that LLVM's CityHash implementation lives in this file's ::detail:: namespace, and implementing hash_stream somewhere else would mean depending on implementation details from a different source file. Sorry about the confusion.

As the CityHash implementation isn't part of N3333: what about moving that implementation into its own implementation header and letting Hashing.h and hash_stream use it? With that, Hashing.h would only contain the N3333 API and hash_stream could use the same backend in a clean way (a possible layout is sketched below).
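For illustration, one possible layout for such a split; these file names are hypothetical and not part of the actual patch:

// Hypothetical header layout (all names are assumptions):
//
//   include/llvm/ADT/HashingImpl.h - CityHash-based detail:: routines
//   include/llvm/ADT/Hashing.h     - only the N3333 API; includes HashingImpl.h
//   include/llvm/ADT/HashStream.h  - hash_stream; includes HashingImpl.h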

> The intention of this patch wasn't to extend the proposed API in any way; the only reason the hash_stream class landed here is that LLVM's CityHash implementation lives in this file's ::detail:: namespace, and implementing hash_stream somewhere else would mean depending on implementation details from a different source file. Sorry about the confusion.
>
> As the CityHash implementation isn't part of N3333: what about moving that implementation into its own implementation header and letting Hashing.h and hash_stream use it? With that, Hashing.h would only contain the N3333 API and hash_stream could use the same backend in a clean way.

None of this actually addresses my comment though: "I'd like to understand the immediate motivation here..."

Shifting the interface we expand from one file to another doesn't really change much IMO.

The immediate motivation is that in D22515 we need to generate a hash code for data that isn't stored in a container but implicitly in the properties of some AST nodes. This hash code has to be generated for every single node in the AST, so we try to avoid calling any heavy code in the process. But to generate a hash code with the current API, we first have to forward the data stream we get from the visitor into a container like FoldingSetNodeID (which may even allocate memory) and then pass that container to the hashing function, which again treats it as a stream of data.

hash_stream would allow us to generate hash codes without this unnecessary buffer between the two streams. Because the API lets its users fill this buffer with whatever data they are interested in, the buffer could become a performance bottleneck in cases where a lot of data is added.

The other option would be to call hash_combine(OldHashCode, NewData) on every new piece of data, which is essentially a makeshift version of this patch, because the hash state then has to be re-finalized on every new chunk of data.
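For illustration, a rough sketch of the two workarounds described above; the ArrayRef input is a stand-in for the data coming out of the visitor:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/FoldingSet.h"
#include "llvm/ADT/Hashing.h"

// Workaround 1: buffer all the data in a FoldingSetNodeID first.
unsigned hashViaBuffer(llvm::ArrayRef<unsigned> Data) {
  llvm::FoldingSetNodeID ID; // may heap-allocate for larger inputs
  for (unsigned V : Data)
    ID.AddInteger(V);
  return ID.ComputeHash();
}

// Workaround 2: chain hash_combine, re-finalizing on every chunk.
llvm::hash_code hashViaCombine(llvm::ArrayRef<unsigned> Data) {
  llvm::hash_code H = llvm::hash_value(0u);
  for (unsigned V : Data)
    H = llvm::hash_combine(H, V); // finalizes the internal state each time
  return H;
}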

> The immediate motivation is that in D22515 we need to generate a hash code for data that isn't stored in a container but implicitly in the properties of some AST nodes.

Ok, thanks. This really helps.

The current code in Hashing.h is really strongly engineered toward container usage though. I'm not sure it is a reasonable approach for many other uses.

As one example, it is designed to be statistically resilient to collisions in the space in which containers are likely to exist, and unbiased if high bits are masked off. The use case you suggest doesn't necessarily seem to fit either of those.

I can imagine clone detection actually not wanting *any* collisions -- it essentially might want a *fingerprint* or *signature* rather than merely a hash code. If that is the case, I think an API for doing online-updates of MD5 (or better yet Blake2, but that isn't in-tree) would be a much better choice.

I can also imagine clone detection using this more like a hash-similarity search or bloom filter. In that case, cityhash is very likely to be a much more rigorous (and slow) hash than you would want.

Have you looked at these options at all? If so, what tradeoffs made them unappealing and made cityhash itself appealing?
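For reference, LLVM's in-tree MD5 (llvm/Support/MD5.h) already supports the online-update style mentioned above; a minimal sketch, with placeholder chunk contents:

#include "llvm/ADT/SmallString.h"
#include "llvm/Support/MD5.h"

// Incrementally fingerprint several chunks of data with LLVM's MD5.
llvm::SmallString<32> fingerprintChunks() {
  llvm::MD5 Hash;
  Hash.update("first chunk");  // feed data as it becomes available
  Hash.update("second chunk"); // no intermediate buffer needed
  llvm::MD5::MD5Result Result;
  Hash.final(Result);          // finalize into a 128-bit digest
  llvm::SmallString<32> Digest;
  llvm::MD5::stringifyResult(Result, Digest); // hex string form
  return Digest;
}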

teemperor abandoned this revision (Aug 18 2016, 4:52 PM).

Thanks for the tips and the review!

I think the main reason this backend is used is that the code evolved from the similar functionality in Stmt::Profile, which also uses it. We didn't look into alternatives (at least not that I'm aware of). And there is currently no performance data that would justify using CityHash over some other implementation, so I moved the clone detection to MD5 for the time being.

And we actually wanted just hashes in the clone detector, as generating a good fingerprint for every AST node would be a tough requirement for the user. The hash is intended more as a fast first guess about which nodes belong together.

I'll abandon this patch unless someone else has a need for it or we do performance testing that actually suggests using CityHash.