This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profdata] Remove MD5 collision check in D147740
Needs Review · Public

Authored by huangjd on Jun 24 2023, 12:43 AM.

Details

Reviewers

davidxl
snehasish
hoy
wenlei

Summary

After testing D147740 on multiple industrial projects with ~10 million FunctionSamples, no MD5 collision has been found.
Assuming an ideal uniform hash, the probability of at least one collision for N symbols over K possible hash values is 1 - K!/((K-N)! * K^N). When N is 1 million and K is 2^64, the probability is about 3*10^-8; when N is 10 million it is about 3*10^-6, so we are unlikely to find an actual case in a real-world application. (However, if K is 2^32, the probability of collision is almost 1. This is indeed a problem if anyone still uses a large profile on a 32-bit machine, as hash_code is tied to size_t.)
Furthermore, when a collision happens we can't do anything to recover from it short of using a multi-map, which is significantly slower and defeats the purpose of optimizing the profile reader.
Finally, we have been using profiles with MD5 names, and they have to come from non-MD5 sources. If a hash collision were to happen, it would already have happened when converting a non-MD5 profile to an MD5 one, so there is no point in checking for it in the reader, and this feature can be removed.
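
The figures quoted in the summary can be sanity-checked with the standard birthday-bound approximation 1 - exp(-N(N-1)/(2K)), which stays close to the exact expression when N is much smaller than K. The sketch below is an illustration only, not code from this patch; the function name collisionProbability is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least two of N uniformly hashed keys share
// one of K possible hash values:
//   1 - K!/((K-N)! * K^N)  ~=  1 - exp(-N(N-1)/(2K))
// This is an approximation, not the exact factorial expression.
inline double collisionProbability(double N, double K) {
  assert(N >= 0 && K > 0);
  // -expm1(-x) evaluates 1 - exp(-x) without losing precision for tiny x.
  return -std::expm1(-N * (N - 1.0) / (2.0 * K));
}
```

With N = 10^6 and K = 2^64 this evaluates to roughly 2.7*10^-8, and with N = 10^7 to roughly 2.7*10^-6, consistent with the rounded values in the summary. With K = 2^32 and N = 10^6 the result is indistinguishable from 1.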

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	50 ms	x64 debian > Flang.Driver::pic-flags.f90
	60,030 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

huangjd created this revision. Jun 24 2023, 12:43 AM

Herald added a project: Restricted Project. Jun 24 2023, 12:43 AM

Herald added subscribers: hoy, wlei, ormris and 2 others.

huangjd requested review of this revision. Jun 24 2023, 12:43 AM

Herald added a project: Restricted Project. Jun 24 2023, 12:43 AM

Herald added a subscriber: llvm-commits.

huangjd added a parent revision: D147740: [llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map. Jun 24 2023, 12:46 AM

huangjd added reviewers: davidxl, snehasish, hoy, wenlei.

In SampleProf.h lines 1423, 1428, and 1448, added the fully qualified class name to the template template argument. Not sure if this is supposed to be required, or if it is an MSVC bug.

Harbormaster completed remote builds in B240933: Diff 534173. Jun 24 2023, 1:35 AM

Copying comments from the original patch for continuation..

In D147740#4443233, @huangjd wrote:
Actually do we really care about MD5 collision? ExtBinary format already ignored MD5 collision for regular string names (and therefore regular function profiles), as only one of two functions with colliding MD5 gets written to the name table (and the other is therefore lost). If we are using CS profiles, since different CS profiles have different serializations, their hashes are distributed as expected. The most important thing is that, even if we detect a hash collision, we can't do anything about it except logging it (using a multi-map makes the reader much slower), so I think the MD5 collision check should be marked as LLVM_DEBUG. This does reduce 0.5 second out of ~30 seconds (1.67%) over the 1 GB profile read.

I don't think we care. Is the new type HashKeyMap and SampleProfileMap all for detecting and reporting collision? I'd avoid all that complexity and prefer a simple DenseMap + a SampleContext->hash_code converter and not even bother with debug prints for collision...

In D153692#4447455, @wenlei wrote:

I don't think we care. Is the new type HashKeyMap and SampleProfileMap all for detecting and reporting collision? I'd avoid all that complexity and prefer a simple DenseMap + a SampleContext->hash_code converter and not even bother with debug prints for collision...

I will add a separate patch to remove the hash collision check. It would be noteworthy if users report collisions in actual use cases, and if that doesn't happen for a while the hash collision check can be removed. Even though the probability of collision is very small for any single program, LLVM is used everywhere, so we can't be so sure.
The new wrapper type also serves another purpose that existing code can use emplace, find, erase, etc without changing in most cases, and I am planning to have CallTargetMap and FunctionSamplesMap using that as well.

In D153692#4447689, @huangjd wrote:

I will add a separate patch to remove the hash collision check. It would be noteworthy if users report collisions in actual use cases, and if that doesn't happen for a while the hash collision check can be removed. Even though the probability of collision is very small for any single program, LLVM is used everywhere, so we can't be so sure.

I don't think there's much we can do still, even if there's a collision report. Maybe we can revert to using the full name instead of MD5, if that turns out to be a problem. But as you mentioned we already support MD5 in profile generation.

The new wrapper type also serves another purpose that existing code can use emplace, find, erase, etc without changing in most cases,

Can you elaborate? Is that critical or just for convenience?

and I am planning to have CallTargetMap and FunctionSamplesMap using that as well.

FWIW, for CallTargetMap, we experimented with changing from StringMap (which is bad since it owns copies of strings) to DenseMap<StringRef, uint64_t>, and it regressed memory usage noticeably. The problem is DenseMap grows to 64 entries on first insertion, but CallTargetMap is usually very sparse with zero or a few entries. FunctionSamplesMap is likely sparse too.

Sorry for being late on patch review. I should have given feedback on the original patch that it's preferable to keep things simple. Unless I'm missing something, the custom HashKeyMap/SampleProfileMap feels unjustified for added complexity.

SampleProfileMap is a simple wrapper that can probably introduce interfaces to ensure the right insertion policy. HashKeyMap does look a little heavyweight and may deserve some simplification (i.e. with the assumption that collision is a non-issue).

I don't think there's much we can do still, even if there's a collision report. Maybe we can revert to using the full name instead of MD5, if that turns out to be a problem. But as you mentioned we already support MD5 in profile generation.

If a collision is rare while there is a strong demand for correctness by users, we can add some ad-hoc logic to deal with that, which is still faster than using a multimap, and much faster than using the full function name. If collisions happen frequently, then we have no choice but to revert to using the full name, but this is extremely unlikely. I couldn't find any research on partial MD5 collisions (we only use 64 bits of it) for ASCII strings, which is why, out of caution, I am not ruling this out.

Can you elaborate? Is that critical or just for convenience?

It's to enforce OOP practice. The optimization passes see SampleProfileMap, which is supposed to map Functions to SampleProfiles. It should not care how keys are actually represented, and it's much cleaner to write profiles.find(FuncName) than profiles.find(MD5Hash(FuncName)). The latter has another potential risk of misuse: there already exist multiple ways to get a string's hash value (MD5Hash, llvm::hash_value, std::hash), and they are all different, so we have to make sure the correct hash function is used at all times.

FWIW, for CallTargetMap, we experimented with changing from StringMap (which is bad since it owns copies of strings) to DenseMap<StringRef, uint64_t>, and it regressed memory usage noticeably. The problem is DenseMap grows to 64 entries on first insertion, but CallTargetMap is usually very sparse with zero or a few entries. FunctionSamplesMap is likely sparse too.

Yes, I am aware of that. I am planning to change it to HashKeyMap<std::map, StringRef, uint64_t> as well (or unordered_map, depending on which one is more beneficial). After that I also plan to use a specialized data structure for CallTargetMap and FunctionSamplesMap, because it's true that they have zero or one entries in almost all cases.
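
The sparse-map concern discussed above (a hash table that reserves many buckets on first insertion, for maps that usually hold zero or a few entries) is the kind of case a flat, vector-backed map is often used for. The following is a minimal sketch under that assumption, not the design proposed in this review; SmallCountMap and its members are hypothetical names, and the key is assumed to be a 64-bit name hash.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical flat map for tables that usually hold zero or a few entries.
// A default-constructed std::vector typically performs no heap allocation,
// so an empty map stays cheap; lookups are a linear scan.
class SmallCountMap {
  std::vector<std::pair<uint64_t, uint64_t>> Entries; // (key hash, count)

public:
  // Add Count to the entry for Key, creating the entry if it is absent.
  void add(uint64_t Key, uint64_t Count) {
    auto It = std::find_if(Entries.begin(), Entries.end(),
                           [Key](const auto &E) { return E.first == Key; });
    if (It != Entries.end())
      It->second += Count;
    else
      Entries.emplace_back(Key, Count);
  }

  // Return the count stored for Key, or 0 if there is no such entry.
  uint64_t lookup(uint64_t Key) const {
    for (const auto &E : Entries)
      if (E.first == Key)
        return E.second;
    return 0;
  }

  std::size_t size() const { return Entries.size(); }
};
```

The trade-off is linear-time lookup, which is only reasonable while the entry count stays very small.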

In D153692#4450504, @huangjd wrote:

I don't think there's much we can do still, even if there's a collision report. Maybe we can revert to using the full name instead of MD5, if that turns out to be a problem. But as you mentioned we already support MD5 in profile generation.

If a collision is rare while there is a strong demand for correctness by users, we can add some ad-hoc logic to deal with that, which is still faster than using a multimap, and much faster than using the full function name. If collisions happen frequently, then we have no choice but to revert to using the full name, but this is extremely unlikely. I couldn't find any research on partial MD5 collisions (we only use 64 bits of it) for ASCII strings, which is why, out of caution, I am not ruling this out.

I'm still not convinced that we need to detect or report collisions. It's not a correctness issue when a collision happens: one of the functions will just lose its profile. It's going to be very rare for function names to collide (we're not talking about hashing the entire function, which would map a much bigger universe into an int), and it's going to be even less likely for a collision to cause noticeable perf loss. That said, if there's concern around collisions, maybe we should verify this assumption by building a large code base and seeing whether and how often it happens, as one-off testing for this change, instead of building collision detection into the final implementation (I doubt we would actually get any report even if it happens, if the report is debug only).

Can you elaborate? Is that critical or just for convenience?

It's to enforce OOP practice. The optimization passes see SampleProfileMap, which is supposed to map Functions to SampleProfiles. It should not care how keys are actually represented, and it's much cleaner to write profiles.find(FuncName) than profiles.find(MD5Hash(FuncName)). The latter has another potential risk of misuse: there already exist multiple ways to get a string's hash value (MD5Hash, llvm::hash_value, std::hash), and they are all different, so we have to make sure the correct hash function is used at all times.

Ok, I think this is fair - hiding hashing details behind the scene is reasonable. But if that's the intention, SampleProfileMap can be a very thin wrapper on top of existing container.
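
The thin-wrapper idea (callers pass names, the container stores only hashes, and names with equal hashes are treated as the same key) can be sketched roughly as follows. This is an illustration under stated assumptions, not the HashKeyMap or SampleProfileMap from the patch: NameHashMap and hashName are hypothetical names, std::unordered_map stands in for the underlying container, and std::hash stands in for the 64-bit MD5-based hash only to keep the example dependency-free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the profile's name hash (assumption: the real
// code derives the key from MD5; std::hash is used here for illustration).
inline uint64_t hashName(const std::string &Name) {
  return std::hash<std::string>{}(Name);
}

// Thin wrapper: the hash function is chosen in exactly one place, and
// callers never compute or see the hash themselves. Hash collisions are
// silently ignored, i.e. colliding names map to the same entry.
template <typename ValueT> class NameHashMap {
  using MapT = std::unordered_map<uint64_t, ValueT>;
  MapT Map;

public:
  using iterator = typename MapT::iterator;

  // Insert only if no entry with the same hash exists.
  std::pair<iterator, bool> try_emplace(const std::string &Name, ValueT V) {
    return Map.try_emplace(hashName(Name), std::move(V));
  }
  iterator find(const std::string &Name) { return Map.find(hashName(Name)); }
  iterator end() { return Map.end(); }
  std::size_t erase(const std::string &Name) {
    return Map.erase(hashName(Name));
  }
  std::size_t size() const { return Map.size(); }
};
```

Because every lookup goes through hashName, a caller cannot accidentally mix hash functions, which is the misuse risk raised in the discussion.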

FWIW, for CallTargetMap, we experimented with changing from StringMap (which is bad since it owns copies of strings) to DenseMap<StringRef, uint64_t>, and it regressed memory usage noticeably. The problem is DenseMap grows to 64 entries on first insertion, but CallTargetMap is usually very sparse with zero or a few entries. FunctionSamplesMap is likely sparse too.

Yes, I am aware of that. I am planning to change it to HashKeyMap<std::map, StringRef, uint64_t> as well (or unordered_map, depending on which one is more beneficial). After that I also plan to use a specialized data structure for CallTargetMap and FunctionSamplesMap, because it's true that they have zero or one entries in almost all cases.

Do you plan to change CallTargetMap to use MD5 as key as well? This will be different from SampleProfileMap in the sense that SampleProfileMap has function name as part of its value (function samples), but for CallTargetMap, its value is just a count, so if we change its key to MD5, we won't be able to recover "original" key from its value for debugging etc.

wenlei mentioned this in D147740: [llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map. Jul 13 2023, 6:53 PM

Removed MD5 collision check

huangjd retitled this revision from Fixed D147740 - [llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map to [llvm-profdata] Remove MD5 collision check in D147740. Aug 23 2023, 6:11 PM

huangjd edited the summary of this revision.

huangjd added inline comments.

llvm/include/llvm/ProfileData/SampleProf.h
1308	fixed mistake in comment

Harbormaster completed remote builds in B254501: Diff 552933. Aug 23 2023, 6:37 PM

GitHub <noreply@github.com> mentioned this in rGf4f85e0ab405: [llvm-profdata] Remove MD5 collision check in D147740 (#66544). Sep 15 2023, 3:31 PM

Revision Contents

	Path	Size
	llvm/include/llvm/ProfileData/SampleProf.h	68 lines
	llvm/unittests/tools/llvm-profdata/CMakeLists.txt	1 line
	llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp

Diff 552933

llvm/include/llvm/ProfileData/SampleProf.h

raw_ostream &operator<<(raw_ostream &OS, const FunctionSamples &FS);		raw_ostream &operator<<(raw_ostream &OS, const FunctionSamples &FS);

/// This class is a wrapper to associative container MapT<KeyT, ValueT> using		/// This class is a wrapper to associative container MapT<KeyT, ValueT> using
/// the hash value of the original key as the new key. This greatly improves the		/// the hash value of the original key as the new key. This greatly improves the
/// performance of insert and query operations especially when hash values of		/// performance of insert and query operations especially when hash values of
/// keys are available a priori, and reduces memory usage if KeyT has a large		/// keys are available a priori, and reduces memory usage if KeyT has a large
/// size.		/// size.
/// When performing any action, if an existing entry with a given key is found,		/// All keys with the same hash value are considered equivalent (i.e. hash
/// and the interface "KeyT ValueT::getKey<KeyT>() const" to retrieve a value's		/// collision is silently ignored). Given such feature this class should only be
/// original key exists, this class checks if the given key actually matches		/// used where it does not affect compilation correctness, for example, when
/// the existing entry's original key. If they do not match, this class behaves		/// loading a sample profile.
/// as if the entry did not exist (for insertion, this means the new value will
/// replace the existing entry's value, as if it is newly inserted). If
/// ValueT::getKey<KeyT>() is not available, all keys with the same hash value
/// are considered equivalent (i.e. hash collision is silently ignored). Given
/// such feature this class should only be used where it does not affect
/// compilation correctness, for example, when loading a sample profile.
/// Assuming the hashing algorithm is uniform, the probability of hash collision		/// Assuming the hashing algorithm is uniform, the probability of hash collision
/// with 1,000,000 entries is		/// with 1,000,000 entries is
/// (2^64)!/((2^64-1000000)!(2^64)^1000000) ~= 3*10^-8.		/// 1 - (2^64)!/((2^64-1000000)!(2^64)^1000000) ~= 3*10^-8.
		huangjd (inline comment): fixed mistake in comment
template <template <typename, typename, typename...> typename MapT,		template <template <typename, typename, typename...> typename MapT,
typename KeyT, typename ValueT, typename... MapTArgs>		typename KeyT, typename ValueT, typename... MapTArgs>
class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {		class HashKeyMap : public MapT<hash_code, ValueT, MapTArgs...> {
public:		public:
using base_type = MapT<hash_code, ValueT, MapTArgs...>;		using base_type = MapT<hash_code, ValueT, MapTArgs...>;
using key_type = hash_code;		using key_type = hash_code;
using original_key_type = KeyT;		using original_key_type = KeyT;
using mapped_type = ValueT;		using mapped_type = ValueT;
using value_type = typename base_type::value_type;		using value_type = typename base_type::value_type;

using iterator = typename base_type::iterator;		using iterator = typename base_type::iterator;
using const_iterator = typename base_type::const_iterator;		using const_iterator = typename base_type::const_iterator;

private:
// If the value type has getKey(), retrieve its original key for comparison.
template <typename U = mapped_type,
typename = decltype(U().template getKey<original_key_type>())>
static bool
CheckKeyMatch(const original_key_type &Key, const mapped_type &ExistingValue,
original_key_type *ExistingKeyIfDifferent = nullptr) {
const original_key_type &ExistingKey =
ExistingValue.template getKey<original_key_type>();
bool Result = (Key == ExistingKey);
if (!Result && ExistingKeyIfDifferent)
*ExistingKeyIfDifferent = ExistingKey;
return Result;
}

// If getKey() does not exist, this overload is selected, which assumes all
// keys with the same hash are equivalent.
static bool CheckKeyMatch(...) { return true; }

public:
template <typename... Ts>		template <typename... Ts>
std::pair<iterator, bool> try_emplace(const key_type &Hash,		std::pair<iterator, bool> try_emplace(const key_type &Hash,
const original_key_type &Key,		const original_key_type &Key,
Ts &&...Args) {		Ts &&...Args) {
assert(Hash == hash_value(Key));		assert(Hash == hash_value(Key));
auto Ret = base_type::try_emplace(Hash, std::forward<Ts>(Args)...);		return base_type::try_emplace(Hash, std::forward<Ts>(Args)...);
if (!Ret.second) {
original_key_type ExistingKey;
if (LLVM_UNLIKELY(!CheckKeyMatch(Key, Ret.first->second, &ExistingKey))) {
dbgs() << "MD5 collision detected: " << Key << " and " << ExistingKey
<< " has same hash value " << Hash << "\n";
Ret.second = true;
Ret.first->second = mapped_type(std::forward<Ts>(Args)...);
}
}
return Ret;
}		}

template <typename... Ts>		template <typename... Ts>
std::pair<iterator, bool> try_emplace(const original_key_type &Key,		std::pair<iterator, bool> try_emplace(const original_key_type &Key,
Ts &&...Args) {		Ts &&...Args) {
key_type Hash = hash_value(Key);		key_type Hash = hash_value(Key);
return try_emplace(Hash, Key, std::forward<Ts>(Args)...);		return try_emplace(Hash, Key, std::forward<Ts>(Args)...);
}		}

template <typename... Ts> std::pair<iterator, bool> emplace(Ts &&...Args) {		template <typename... Ts> std::pair<iterator, bool> emplace(Ts &&...Args) {
return try_emplace(std::forward<Ts>(Args)...);		return try_emplace(std::forward<Ts>(Args)...);
}		}

mapped_type &operator[](const original_key_type &Key) {		mapped_type &operator[](const original_key_type &Key) {
return try_emplace(Key, mapped_type()).first->second;		return try_emplace(Key, mapped_type()).first->second;
}		}

iterator find(const original_key_type &Key) {		iterator find(const original_key_type &Key) {
key_type Hash = hash_value(Key);		key_type Hash = hash_value(Key);
auto It = base_type::find(Hash);		auto It = base_type::find(Hash);
if (It != base_type::end())		if (It != base_type::end())
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
return It;		return It;
return base_type::end();		return base_type::end();
}		}

const_iterator find(const original_key_type &Key) const {		const_iterator find(const original_key_type &Key) const {
key_type Hash = hash_value(Key);		key_type Hash = hash_value(Key);
auto It = base_type::find(Hash);		auto It = base_type::find(Hash);
if (It != base_type::end())		if (It != base_type::end())
if (LLVM_LIKELY(CheckKeyMatch(Key, It->second)))
return It;		return It;
return base_type::end();		return base_type::end();
}		}

size_t erase(const original_key_type &Ctx) {		size_t erase(const original_key_type &Ctx) {
auto It = find(Ctx);		auto It = find(Ctx);
if (It != base_type::end()) {		if (It != base_type::end()) {
base_type::erase(It);		base_type::erase(It);
return 1;		return 1;
[24 lines not shown]

return HashKeyMap<llvm::DenseMap, SampleContext, FunctionSamples>::find(		return HashKeyMap<llvm::DenseMap, SampleContext, FunctionSamples>::find(
Ctx);		Ctx);
}		}

const_iterator find(const SampleContext &Ctx) const {		const_iterator find(const SampleContext &Ctx) const {
return HashKeyMap<llvm::DenseMap, SampleContext, FunctionSamples>::find(		return HashKeyMap<llvm::DenseMap, SampleContext, FunctionSamples>::find(
Ctx);		Ctx);
}		}

// Overloaded find() to lookup a function by name. This is called by IPO		// Overloaded find() to lookup a function by name.
// passes with an actual function name, and it is possible that the profile
// reader converted function names in the profile to MD5 strings, so we need
// to check if either representation matches.
iterator find(StringRef Fname) {		iterator find(StringRef Fname) {
uint64_t Hash = hashFuncName(Fname);		return base_type::find(hashFuncName(Fname));
auto It = base_type::find(hash_code(Hash));
if (It != end()) {
StringRef CtxName = It->second.getContext().getName();
if (LLVM_LIKELY(CtxName == Fname || CtxName == std::to_string(Hash)))
return It;
}
return end();
}		}

size_t erase(const SampleContext &Ctx) {		size_t erase(const SampleContext &Ctx) {
return HashKeyMap<llvm::DenseMap, SampleContext, FunctionSamples>::erase(		return HashKeyMap<llvm::DenseMap, SampleContext, FunctionSamples>::erase(
Ctx);		Ctx);
}		}

size_t erase(const key_type &Key) { return base_type::erase(Key); }		size_t erase(const key_type &Key) { return base_type::erase(Key); }

llvm/unittests/tools/llvm-profdata/CMakeLists.txt

	set(LLVM_LINK_COMPONENTS			set(LLVM_LINK_COMPONENTS
	Core			Core
	ProfileData			ProfileData
	Support			Support
	)			)

	add_llvm_unittest(LLVMProfdataTests			add_llvm_unittest(LLVMProfdataTests
	OutputSizeLimitTest.cpp			OutputSizeLimitTest.cpp
	MD5CollisionTest.cpp
	)			)

	target_link_libraries(LLVMProfdataTests PRIVATE LLVMTestingSupport)			target_link_libraries(LLVMProfdataTests PRIVATE LLVMTestingSupport)

	set_property(TARGET LLVMProfdataTests PROPERTY FOLDER "Tests/UnitTests/ToolTests")			set_property(TARGET LLVMProfdataTests PROPERTY FOLDER "Tests/UnitTests/ToolTests")

llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp

This file was deleted.

	//===- llvm/unittests/tools/llvm-profdata/MD5CollisionTest.cpp ------------===//
	//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//
	//===----------------------------------------------------------------------===//

	/// Test whether the MD5-key SampleProfileMap can handle collision correctly.
	/// Probability of collision is rare but not negligible since we only use the
	/// lower 64 bits of the MD5 value. A unit test is required because the function
	/// names are not printable ASCII characters.

	#include "llvm/ProfileData/SampleProfReader.h"
	#include "llvm/Support/VirtualFileSystem.h"
	#include "llvm/Testing/Support/Error.h"
	#include "gtest/gtest.h"

	/// According to https://en.wikipedia.org/wiki/MD5#Preimage_vulnerability, the
	/// MD5 of the two strings are 79054025255fb1a26e4bc422aef54eb4.

	// First 8 bytes of the MD5.
	const uint64_t ExpectedHash = 0xa2b15f2525400579;

	// clang-format off
	const uint8_t ProfileData[] = {
	0x84, 0xe4, 0xd0, 0xb1, 0xf4, 0xc9, 0x94, 0xa8,
	0x53, 0x67, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x7D, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x03, 0x01, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x80, 0x01, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x05, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x90, 0x01, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00,

	/// Name Table
	0x02,
	/// String1
	0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6, 0xee, 0xc4,
	0x69, 0x3d, 0x9a, 0x06, 0x98, 0xaf, 0xf9, 0x5c,
	0x2f, 0xca, 0xb5, 0x87, 0x12, 0x46, 0x7e, 0xab,
	0x40, 0x04, 0x58, 0x3e, 0xb8, 0xfb, 0x7f, 0x89,
	0x55, 0xad, 0x34, 0x06, 0x09, 0xf4, 0xb3, 0x02,
	0x83, 0xe4, 0x88, 0x83, 0x25, 0x71, 0x41, 0x5a,
	0x08, 0x51, 0x25, 0xe8, 0xf7, 0xcd, 0xc9, 0x9f,
	0xd9, 0x1d, 0xbd, 0xf2, 0x80, 0x37, 0x3c, 0x5b,
	0xd8, 0x82, 0x3e, 0x31, 0x56, 0x34, 0x8f, 0x5b,
	0xae, 0x6d, 0xac, 0xd4, 0x36, 0xc9, 0x19, 0xc6,
	0xdd, 0x53, 0xe2, 0xb4, 0x87, 0xda, 0x03, 0xfd,
	0x02, 0x39, 0x63, 0x06, 0xd2, 0x48, 0xcd, 0xa0,
	0xe9, 0x9f, 0x33, 0x42, 0x0f, 0x57, 0x7e, 0xe8,
	0xce, 0x54, 0xb6, 0x70, 0x80, 0xa8, 0x0d, 0x1e,
	0xc6, 0x98, 0x21, 0xbc, 0xb6, 0xa8, 0x83, 0x93,
	0x96, 0xf9, 0x65, 0x2b, 0x6f, 0xf7, 0x2a, 0x70, 0x00,
	/// String2
	0xd1, 0x31, 0xdd, 0x02, 0xc5, 0xe6, 0xee, 0xc4,
	0x69, 0x3d, 0x9a, 0x06, 0x98, 0xaf, 0xf9, 0x5c,
	0x2f, 0xca, 0xb5, 0x07, 0x12, 0x46, 0x7e, 0xab,
	0x40, 0x04, 0x58, 0x3e, 0xb8, 0xfb, 0x7f, 0x89,
	0x55, 0xad, 0x34, 0x06, 0x09, 0xf4, 0xb3, 0x02,
	0x83, 0xe4, 0x88, 0x83, 0x25, 0xf1, 0x41, 0x5a,
	0x08, 0x51, 0x25, 0xe8, 0xf7, 0xcd, 0xc9, 0x9f,
	0xd9, 0x1d, 0xbd, 0x72, 0x80, 0x37, 0x3c, 0x5b,
	0xd8, 0x82, 0x3e, 0x31, 0x56, 0x34, 0x8f, 0x5b,
	0xae, 0x6d, 0xac, 0xd4, 0x36, 0xc9, 0x19, 0xc6,
	0xdd, 0x53, 0xe2, 0x34, 0x87, 0xda, 0x03, 0xfd,
	0x02, 0x39, 0x63, 0x06, 0xd2, 0x48, 0xcd, 0xa0,
	0xe9, 0x9f, 0x33, 0x42, 0x0f, 0x57, 0x7e, 0xe8,
	0xce, 0x54, 0xb6, 0x70, 0x80, 0x28, 0x0d, 0x1e,
	0xc6, 0x98, 0x21, 0xbc, 0xb6, 0xa8, 0x83, 0x93,
	0x96, 0xf9, 0x65, 0xab, 0x6f, 0xf7, 0x2a, 0x70, 0x00,

	/// FuncOffsetTable
	0x02, 0x00, 0x00, 0x01, 0x17, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,

	/// Samples
	/// String1:10:1
	/// 1: 5
	/// 2.3: 6
	/// 4: String2:100
	/// 1: 100
	/// String2:7:3
	/// 9: 0
	0x01, 0x00, 0x0a, 0x02, 0x01, 0x00, 0x05, 0x00,
	0x02, 0x03, 0x06, 0x00, 0x01, 0x04, 0x00, 0x01,
	0x64, 0x01, 0x01, 0x00, 0x64, 0x00, 0x00,

	0x03, 0x01, 0x07, 0x01, 0x09, 0x00, 0x00, 0x00,
	0x00};
	// clang-format on

	using namespace llvm;
	using namespace llvm::sampleprof;

	TEST(MD5CollisionTest, TestCollision) {
	auto InputBuffer = MemoryBuffer::getMemBuffer(
	StringRef(reinterpret_cast<const char *>(ProfileData),
	sizeof(ProfileData)),
	"", false);
	LLVMContext Context;
	auto FileSystem = vfs::getRealFileSystem();
	auto Result = SampleProfileReader::create(InputBuffer, Context, *FileSystem);
	ASSERT_TRUE(Result);
	SampleProfileReader *Reader = Result->get();
	ASSERT_FALSE(Reader->read());

	std::vector<StringRef> &NameTable = *Reader->getNameTable();
	ASSERT_EQ(NameTable.size(), 2U);
	StringRef S1 = NameTable[0];
	StringRef S2 = NameTable[1];
	ASSERT_NE(S1, S2);
	ASSERT_EQ(MD5Hash(S1), ExpectedHash);
	ASSERT_EQ(MD5Hash(S2), ExpectedHash);

	// S2's MD5 value collides with S1, S1 is expected to be dropped when S2 is
	// inserted, as if S1 never existed.

	FunctionSamples ExpectedFS;
	ExpectedFS.setName(S2);
	ExpectedFS.setHeadSamples(3);
	ExpectedFS.setTotalSamples(7);
	ExpectedFS.addBodySamples(9, 0, 0);

	SampleProfileMap &Profiles = Reader->getProfiles();
	EXPECT_EQ(Profiles.size(), 1U);
	if (Profiles.size()) {
	auto &[Hash, FS] = *Profiles.begin();
	EXPECT_EQ(Hash, hash_code(ExpectedHash));
	EXPECT_EQ(FS, ExpectedFS);
	}

	// Inserting S2 again should fail, returning the existing sample unchanged.
	auto [It1, Inserted1] = Profiles.try_emplace(S2, FunctionSamples());
	EXPECT_FALSE(Inserted1);
	EXPECT_EQ(Profiles.size(), 1U);
	if (Profiles.size()) {
	auto &[Hash, FS] = *It1;
	EXPECT_EQ(Hash, hash_code(ExpectedHash));
	EXPECT_EQ(FS, ExpectedFS);
	}

	// Inserting S1 should success as if S2 never existed, and S2 is erased.
	FunctionSamples FS1;
	FS1.setName(S1);
	FS1.setHeadSamples(5);
	FS1.setTotalSamples(10);
	FS1.addBodySamples(1, 2, 5);

	auto [It2, Inserted2] = Profiles.try_emplace(S1, FS1);
	EXPECT_TRUE(Inserted2);
	EXPECT_EQ(Profiles.size(), 1U);
	if (Profiles.size()) {
	auto &[Hash, FS] = *It2;
	EXPECT_EQ(Hash, hash_code(ExpectedHash));
	EXPECT_EQ(FS, FS1);
	}
	}
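
The deleted test describes ExpectedHash as the "First 8 bytes of the MD5". The constant 0xa2b15f2525400579 matches the quoted digest 79054025255fb1a26e4bc422aef54eb4 if those first 8 bytes are read as a little-endian 64-bit integer. The sketch below checks that arithmetic; low64FromHexDigest is a hypothetical helper written for this illustration, not an LLVM API, and the little-endian interpretation is an assumption inferred from the constant itself.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Interpret the first 8 bytes of a hex-encoded digest as a little-endian
// 64-bit integer (byte 0 becomes the least significant byte).
inline uint64_t low64FromHexDigest(const std::string &Hex) {
  assert(Hex.size() >= 16 && "need at least 8 bytes (16 hex digits)");
  uint64_t Result = 0;
  for (int I = 0; I < 8; ++I) {
    // Parse two hex digits into one byte value.
    uint64_t Byte = std::stoul(Hex.substr(2 * I, 2), nullptr, 16);
    Result |= Byte << (8 * I);
  }
  return Result;
}
```

Applied to the digest above, the bytes 79 05 40 25 25 5f b1 a2 are assembled in reverse significance order, which gives the value used in the test.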