There are a couple of optimizations in __hash_table::find which are also
applicable to other places, such as __hash_table::__node_insert_unique_prepare
and __hash_table::__emplace_unique_key_args. The loop in find looks like this:
for (__nd = __nd->__next_;
     __nd != nullptr && (__nd->__hash() == __hash
                      // ^^^^^^^^^^^^^^^^^^^^^^ (1)
                         || std::__constrain_hash(__nd->__hash(), __bc) == __chash);
     __nd = __nd->__next_) {
  if ((__nd->__hash() == __hash)
   // ^^^^^^^^^^^^^^^^^^^^^^^^^^ (2)
      && key_eq()(__nd->__upcast()->__value_, __k))
    return iterator(__nd, this);
}
(1): We can avoid the expensive modulo operation in std::__constrain_hash
when the hashes match. This one is from commit 6a411472e3c4.
(2): We can avoid calling key_eq when the hashes don't match. Commit:
318d35a7bca6c4e5.
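
For context, std::__constrain_hash is what maps a hash value to a bucket
index. A rough sketch of its logic (not the verbatim libc++ code; details
may vary across versions):

#include <cstddef>

// Rough sketch of std::__constrain_hash: reduce a hash to a bucket index.
// When the bucket count is a power of two this is a cheap mask; otherwise
// it needs an integer modulo, which is the cost that optimization (1)
// avoids when the cached hashes already match.
inline std::size_t constrain_hash_sketch(std::size_t h, std::size_t bc) {
  return !(bc & (bc - 1)) ? h & (bc - 1)
                          : (h < bc ? h : h % bc);
}

Equal hashes always land in the same bucket, which is why comparing the
cached hashes first can short-circuit the bucket computation in (1); and
since a given key always hashes to the same value, differing cached hashes
mean key_eq can never succeed, which is what makes (2) safe.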
Both of them are applicable to the insert and emplace methods.
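
To illustrate how the same scan could look in the unique-insert path, here
is a minimal standalone sketch; Node, constrain_hash, and bucket_contains
are hypothetical stand-ins for the libc++ internals, not the actual
implementation:

#include <cstddef>
#include <string>

// Hypothetical node with a cached hash, standing in for libc++'s
// __hash_node. In libc++ all nodes live on one global linked list, so the
// scan must stop once it walks past the current bucket.
struct Node {
  std::size_t hash_;   // cached hash of value_
  std::string value_;
  Node* next_;
};

// Same reduction as the constrain_hash sketch above.
inline std::size_t constrain_hash(std::size_t h, std::size_t bc) {
  return !(bc & (bc - 1)) ? h & (bc - 1) : (h < bc ? h : h % bc);
}

// Scan one bucket's collision chain before inserting a unique key, with
// both optimizations applied:
//   (1) equal cached hashes short-circuit the modulo in constrain_hash;
//   (2) unequal cached hashes skip the potentially expensive key compare
//       (key_eq() in the real code; operator== here).
bool bucket_contains(Node* nd, std::size_t hash, std::size_t bc,
                     const std::string& k) {
  const std::size_t chash = constrain_hash(hash, bc);
  for (; nd != nullptr &&
         (nd->hash_ == hash ||                  // (1)
          constrain_hash(nd->hash_, bc) == chash);
       nd = nd->next_) {
    if (nd->hash_ == hash && nd->value_ == k)   // (2)
      return true;
  }
  return false;
}

An insert path built on this would allocate and link a new node only when
bucket_contains returns false.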
Results of the unordered_set_operations benchmark:
Before:
------------------------------------------------------------------------------------------------
Benchmark                                                       Time            CPU   Iterations
------------------------------------------------------------------------------------------------
BM_InsertValue/unordered_set_uint32/1024                        38521 ns       38520 ns        18174
BM_InsertValue/unordered_set_uint32_sorted/1024                 29297 ns       29296 ns        23804
BM_InsertValue/unordered_set_top_bits_uint32/1024               32547 ns       32546 ns        21655
BM_InsertValueRehash/unordered_set_top_bits_uint32/1024         56896 ns       56895 ns        12334
BM_InsertValue/unordered_set_string/1024                       277956 ns      277952 ns         2487
BM_InsertValueRehash/unordered_set_string/1024                 578982 ns      578912 ns         1224
BM_InsertDuplicate/unordered_set_int/1024                       11793 ns       11793 ns        61427
BM_InsertDuplicate/unordered_set_string/1024                   164009 ns      164005 ns         4339
BM_EmplaceDuplicate/unordered_set_int/1024                      12728 ns       12728 ns        57487
BM_EmplaceDuplicate/unordered_set_string/1024                  164516 ns      164512 ns         4251
BM_InsertDuplicate/unordered_set_int_insert_arg/1024            11366 ns       11365 ns        62265
BM_InsertDuplicate/unordered_set_string_insert_arg/1024        165776 ns      165772 ns         4225
BM_EmplaceDuplicate/unordered_set_int_insert_arg/1024           42768 ns       42765 ns        16301
BM_EmplaceDuplicate/unordered_set_string_arg/1024              371531 ns      371524 ns         1885

After:
------------------------------------------------------------------------------------------------
Benchmark                                                       Time            CPU   Iterations
------------------------------------------------------------------------------------------------
BM_InsertValue/unordered_set_uint32/1024                        38988 ns       38985 ns        17789
BM_InsertValue/unordered_set_uint32_sorted/1024                 28707 ns       28707 ns        24285
BM_InsertValue/unordered_set_top_bits_uint32/1024               32790 ns       32788 ns        21272
BM_InsertValueRehash/unordered_set_top_bits_uint32/1024         58293 ns       58281 ns        12329
BM_InsertValue/unordered_set_string/1024                       303982 ns      303894 ns         2352
BM_InsertValueRehash/unordered_set_string/1024                 608259 ns      608121 ns         1133
BM_InsertDuplicate/unordered_set_int/1024                       12377 ns       12375 ns        56721
BM_InsertDuplicate/unordered_set_string/1024                   167777 ns      167752 ns         4167
BM_EmplaceDuplicate/unordered_set_int/1024                      11469 ns       11468 ns        61094
BM_EmplaceDuplicate/unordered_set_string/1024                  165482 ns      165450 ns         4245
BM_InsertDuplicate/unordered_set_int_insert_arg/1024            11945 ns       11945 ns        57164
BM_InsertDuplicate/unordered_set_string_insert_arg/1024        164507 ns      164482 ns         4294
BM_EmplaceDuplicate/unordered_set_int_insert_arg/1024           41152 ns       41151 ns        17134
BM_EmplaceDuplicate/unordered_set_string_arg/1024              392580 ns      392529 ns         1808
The time differences look more like noise to me. But if we want these
optimizations in find, we probably want them in insert and emplace as
well.