This is an archive of the discontinued LLVM Phabricator instance.

[ADT] add ConcurrentHashtable class.
ClosedPublic

Authored by avl on Aug 23 2022, 2:30 AM.

Details

Summary

ConcurrentHashTable is a resizeable concurrent hashtable.
The resize range is limited to x2^32 (the table can grow by up to 2^32 times its initial size). The hashtable supports only concurrent insertions.

Concurrent hashtable is necessary for the D96035 patch.

Diff Detail

Event Timeline

avl created this revision.Aug 23 2022, 2:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 2:30 AM
avl requested review of this revision.Aug 23 2022, 2:30 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2022, 2:30 AM
avl added a comment.Aug 23 2022, 12:08 PM

The comment says that Swift's ConcurrentMap is a binary tree, which is usually slower than a hashmap. Another difference is that it does not support rebalancing, while this patch supports rehashing. That said, I have not compared this patch with Swift's ConcurrentMap directly.

lkail added a subscriber: lkail.Aug 24 2022, 6:32 AM
avl updated this revision to Diff 484377.Dec 20 2022, 2:11 PM

added possibility to specify allocator as template parameter,
added possibility to change range of resizings,
set default resizing range to x2^32

avl edited the summary of this revision. (Show Details)Dec 20 2022, 2:12 PM
avl updated this revision to Diff 484524.Dec 21 2022, 4:01 AM

removed dependence on <experimental/random>

@dexonsmith & co working on the CAS have also proposed a thread safe hash table of sorts ( https://reviews.llvm.org/D133715 )- it's a bit more esoteric/specialized, but I wonder if the use cases overlap enough to be able to unify them?

avl updated this revision to Diff 485614.Dec 29 2022, 7:29 AM

fixed the Windows build.

@dexonsmith & co working on the CAS have also proposed a thread safe hash table of sorts ( https://reviews.llvm.org/D133715 )- it's a bit more esoteric/specialized, but I wonder if the use cases overlap enough to be able to unify them?

I won’t have time to take a look myself for a couple of weeks, but adding other interested parties.

Certainly sounds like there’s crossover! The data structure in the other patch supports concurrent insertion and look-up and uses atomics rather than locks. It does not support iteration, although that could be implemented. It does not directly support arbitrary keys, but could be used to implement a more general map; the client is expected to do hashing and decide what to do with collisions. Likewise, it does not support erase, but the client could use a tombstone. Not sure if your use case requires those operations, or if the overhead would be worth it.

avl added a comment.Dec 29 2022, 11:06 AM

@dexonsmith & co working on the CAS have also proposed a thread safe hash table of sorts ( https://reviews.llvm.org/D133715 )- it's a bit more esoteric/specialized, but I wonder if the use cases overlap enough to be able to unify them?

David, thank you for pointing out this other patch. It would be good to have a unified solution.

avl added a comment.Dec 29 2022, 11:25 AM

@dexonsmith & co working on the CAS have also proposed a thread safe hash table of sorts ( https://reviews.llvm.org/D133715 )- it's a bit more esoteric/specialized, but I wonder if the use cases overlap enough to be able to unify them?

I won’t have time to take a look myself for a couple of weeks, but adding other interested parties.

Certainly sounds like there’s crossover! The data structure in the other patch supports concurrent insertion and look-up and uses atomics rather than locks. It does not support iteration, although that could be implemented. It does not directly support arbitrary keys, but could be used to implement a more general map; the client is expected to do hashing and decide what to do with collisions. Likewise, it does not support erase, but the client could use a tombstone. Not sure if your use case requires those operations, or if the overhead would be worth it.

This hashtable (D132455) was implemented for the https://reviews.llvm.org/D96035 patch.
The main requirement is the ability to store aggregate key/data pairs
(i.e. the key is separate from the data) in parallel. Another requirement is to know
whether the data was inserted by the current call or by a previous one. It should use a memory pool
(like BumpPtrAllocator).

So far, I created a table comparing patches:

------------------------------------------------------------------------
                    |   HashMappedTrie     |    ConcurrentHashtable    |
------------------------------------------------------------------------
    thread-safe     |         yes          |           yes             |
------------------------------------------------------------------------
 range of resizings |     not limited      |          x2^32            |
                    |                      |     can be increased      |
------------------------------------------------------------------------
    key/data pairs  |         no           |           yes             |
------------------------------------------------------------------------
     lock-free?     |         yes          |           no              |
                    |                      | uses mutexes for locking  |
------------------------------------------------------------------------
    insertions      |         yes          |           yes             |
------------------------------------------------------------------------
      lookups       |         yes          |           no              |
                    |                      |   can be easily added     |
------------------------------------------------------------------------
     deletions      |          no          |           no              |
                    |                      |   can be easily added     |
------------------------------------------------------------------------
     iterations     |          no          |           no              |
                    |  can be easily added | only an inefficient, non- |
                    |                      |thread-safe solution could |
                    |                      |         be done           |
------------------------------------------------------------------------
  hash collisions   |          yes         |      no collisions        |
                    |   should be handled  |                           |
                    |      by client       |                           |
------------------------------------------------------------------------

I did some first-look performance comparisons of the patches (using
this utility: https://reviews.llvm.org/D132548).
The numbers might be inaccurate if I used HashMappedTrie incorrectly.
There is a difference: I did not set any initial size for HashMappedTrie,
while initial sizes for ConcurrentHashtable were set. Also, only the key
was stored for HashMappedTrie, while the key and data pair were stored for
ConcurrentHashtable. All runs insert 100000000 strings converted from
the corresponding integers.

------------------------------------------------------------------------------
                          |   HashMappedTrie     |    ConcurrentHashtable    |
------------------------------------------------------------------------------
 --num-threads 1          | time:       62sec    | time:           30sec     |                    
 --initial-size 100000000 | memory:     13.2G    | memory:         16.1G     |                    
------------------------------------------------------------------------------
 --num-threads 1          | time:       62sec    | time:           34sec     |                    
 --initial-size       100 | memory:     13.2G    | memory:         18.1G     |                    
------------------------------------------------------------------------------
 --num-threads 16         | time:       38sec    | time:          3.5sec     |                    
 --initial-size 100000000 | memory:     13.2G    | memory:         16.1G     |                    
------------------------------------------------------------------------------
 --num-threads 16         | time:       38sec    | time:          7.3sec     |                    
 --initial-size       100 | memory:     13.2G    | memory:         18.1G     |                    
------------------------------------------------------------------------------

Thanks for doing the comparison with https://reviews.llvm.org/D133715.

There are definitely some crossovers that could help both data structures; for example, I also have a patch for a thread-safe allocator: https://reviews.llvm.org/D133713. HashMappedTrie can definitely store key/data pairs. For example, InMemoryCAS (https://reviews.llvm.org/D133716) stores its data in a HashMappedTrie to implement a CAS.

There are also definitely more differences, like one data structure being table-like and the other tree-like. I am not sure it is worth unifying the interfaces so you can switch between them, but we can consider that. I think the biggest difference is how collisions are handled. For a CAS implementation intended for caching, a hash collision cannot be accepted. I guess it is possible to extend HashMappedTrie to support collisions and have a mode that errors on collision, but it is not free (though it might not be too costly).

I am super interested in how you did the comparison. Can you post a patch with the code you have for that? The HashMappedTrie hasn't really been tuned for performance/memory usage; it would be interesting to use your tool to do some investigation (for example, NumRootBits and NumSubtrieBits can be tuned for memory/performance).

avl added a comment.Jan 3 2023, 10:23 AM

I am super interested in how you did the comparison. Can you post a patch with the code you have for that?

Sure. I will update https://reviews.llvm.org/D132548 to support HashMappedTrie in a couple of days.

avl added a comment.Jan 4 2023, 9:15 AM

I am super interested in how you did the comparison. Can you post a patch with the code you have for that? The HashMappedTrie hasn't really been tuned for performance/memory usage; it would be interesting to use your tool to do some investigation (for example, NumRootBits and NumSubtrieBits can be tuned for memory/performance).

@steven_wu I've updated D132548 to make it possible to measure HashMappedTrie. It depends on D125979, D132455, D133715.
In its current state the patch does not set NumRootBits or NumSubtrieBits, though such options might be added.
If the Intel Threading Building Blocks hashmap is not necessary, USE_ITBB should be unset.
If the libcuckoo hashmap is not necessary, USE_LIBCUCKOO should be unset.
Command line to run the tool:

/usr/bin/time -f " %E %M " ./check-hashtable --data-set random --num-threads 1 --table-kind hash-mapped-trie --aggregate-data

Any changes/suggestions are welcome :-)

Thanks! It works great. The only downside is that a different memory allocator is used in each implementation, so the numbers are not directly comparable, but they do reflect the performance for simple use cases.

It is already useful to look at the scaling factor of the implementations (size and threads). For example, I see that the default configuration for HashMappedTrie scales to about 4 threads, then degrades badly (because the root bits value is only 6, so high contention is expected at the beginning).

avl updated this revision to Diff 486795.Jan 6 2023, 3:13 AM

improved dumping, did several refactorings, improved resizing code.

avl updated this revision to Diff 489390.Jan 15 2023, 12:23 PM

refactored.
simplified the implementation (removed the version for integral data).
implemented a couple of space optimizations.

avl edited the summary of this revision. (Show Details)Jan 26 2023, 7:54 AM
avl added a comment.Feb 6 2023, 3:48 AM

ping.

@JDevlieghere @aprantl Do you think it is better to move this ConcurrentHashtable into the DWARFLinkerParallel folder?

avl updated this revision to Diff 500444.Feb 25 2023, 12:27 PM

rebased. refactored&simplified.

avl updated this revision to Diff 500530.Feb 26 2023, 2:40 AM

fixed build.

avl added a comment.Feb 26 2023, 8:19 AM

@aprantl @JDevlieghere @dblaikie @MaskRay Would you mind taking a look at this review, please?

A performance comparison for this hashtable, reading strings from the DWARF info of a clang binary (done by the utility from D132548):

--num-threads 16

                             time      memory
1. llvm-concurrent-hashmap: 0.82 sec     3.1G
2. lldb-const-string:       0.98 sec     3.1G


--num-threads 1

                             time      memory
1. llvm-concurrent-hashmap: 5.7 sec     3.1G
2. lldb-const-string:       7.1 sec     3.1G
3. llvm-string-map:         5.7 sec     3.1G

The advantages compared to the lldb-const-string implementation:

  1. ConcurrentHashTableByPtr is general. It may be used not only for strings.
  2. ConcurrentHashTableByPtr has slightly better performance numbers.
  3. ConcurrentHashTableByPtr is more scalable.
JDevlieghere accepted this revision.Mar 15 2023, 11:38 AM

ping.

@JDevlieghere @aprantl Do you think it is better to move this ConcurrentHashtable into the DWARFLinkerParallel folder?

I'm fine either way. If someone has concerns about the implementation, it's probably less contentious to land it in the DWARFLinkerParallel first, but so far I've not heard any objections. If we do want to use this from LLDB, then we'll need it to be in ADT eventually, though I'd like to do a comparison with Steven's HashMappedTrie for LLDB's real world usage.

I read through the code and most of it makes sense to me, but I wouldn't mind if someone who deals with data structures on a daily basis had another look. I'll mark this as accepted but would ask you to let it sit here for a few more days so others can take a look.

llvm/include/llvm/ADT/ConcurrentHashtable.h
297

What's the benefit of rehashing at 90% capacity? It seems like this will always leave a few empty slots in the table. I understand you always need to have one free slot because you rehash after insertion, but couldn't you rehash here when you've exhausted the bucket?

This revision is now accepted and ready to land.Mar 15 2023, 11:38 AM
avl added a comment.Mar 16 2023, 5:32 AM

ping.

@JDevlieghere @aprantl Do you think it is better to move this ConcurrentHashtable into the DWARFLinkerParallel folder?

I'm fine either way. If someone has concerns about the implementation, it's probably less contentious to land it in the DWARFLinkerParallel first, but so far I've not heard any objections. If we do want to use this from LLDB, then we'll need it to be in ADT eventually, though I'd like to do a comparison with Steven's HashMappedTrie for LLDB's real world usage.

I have a utility to do the comparison: https://reviews.llvm.org/D132548. The problem is that HashMappedTrie does not resolve hash collisions; it is assumed that such collisions will be resolved by the client of HashMappedTrie. I am not sure what a good way to resolve such collisions would be. Thus, I do not know how to make a "fair" comparison at the moment.

llvm/include/llvm/ADT/ConcurrentHashtable.h
297

When the hashtable is nearly 100% full, it needs to scan many entries while searching for a free slot. The worst scenario: if a bucket has 1000000 slots and already holds 999999 entries, a search might need to enumerate all 999999 entries, which is slow. If a bucket has 1000000 slots but is only 90% full, the number of entries that must be enumerated is much smaller.

So wasting 10% of memory buys roughly a 20% performance improvement. The exact value "0.9" was determined experimentally. It would probably be good to have a way to change this value (for cases when memory is more important).

avl updated this revision to Diff 507499.Mar 22 2023, 2:11 PM

addressed comments.

avl added a comment.Mar 23 2023, 6:34 AM

Thank you for the review!

This revision was automatically updated to reflect the committed changes.
SeanP added a subscriber: SeanP.Apr 5 2023, 10:59 AM

Alexey, not all platforms support thread-local storage. Can you wrap the tests in

#ifdef LLVM_ENABLE_THREADS

#else

#endif

You should also use LLVM_THREAD_LOCAL instead of thread_local.

Thanks

avl added a comment.Apr 5 2023, 11:55 AM

@SeanP Thank you for catching this! It looks like it is not necessary to insert #ifdef LLVM_ENABLE_THREADS; just using LLVM_THREAD_LOCAL should be enough. Please consider https://reviews.llvm.org/D147649

andrew.w.kaylor added a subscriber: andrew.w.kaylor.

I stumbled across a potential integer overflow in this code. A proposed fix is posted as D158117.