PerThreadBumpPtrAllocator allows separating allocations by thread id.
That makes allocations race free. It is possible because
ThreadPoolExecutor class creates threads, keeps them until
the destructor of ThreadPoolExecutor is called, and assigns ids
to the threads. Thus PerThreadBumpPtrAllocator should be used with only
threads created by ThreadPoolExecutor. This allocator is useful when
thread safe BumpPtrAllocator is needed.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
It's nice how simple this is!
Thus ThreadPoolAllocator should be used with only
threads created by ThreadPoolExecutor.
I hadn't thought deeply about this limitation previously...
This seems fine for many APIs. But, I don't think the CAS could use this, since clients might share a CAS between multiple threads NOT owned by ThreadPoolAllocator. For the CAS, probably we need to stick with https://reviews.llvm.org/D133713 for now (probably eventually use a lock-free ConcurrentBumpPtrAllocator instead).
I agree that this probably doesn't work for CAS since the use of CAS is not bounded to any context like a ThreadPoolExecutor, for example, it is currently a legal use case to have multiple thread pool to insert into CAS at the same time. It is not feasible to put such a restriction on CAS since CAS should be safe to read/write concurrently from different process.
For this PoolAllocator, it might be a good idea to require it to be initialized with an instance of ThreadPoolExecutor and maybe add a method for Thread to get the context of its Executor so you can double check the allocator is used in the correct Exectuor when assertion is on.
I see, My first intention was to implement some similar solution. But it requires an additional refactoring(f.e. ThreadPoolExecutor is not currently visible). This patch is simple working implementation allowing to check the idea.
I was thinking about something like that:
class ThreadPoolAllocator { void *Allocate(size_t size, size_t alignment); }; class ThreadPool { std::unique_ptr<ThreadPoolAllocator> getAllocator (); }; class ThreadPoolExecutor { std::unique_ptr<ThreadPoolAllocator> getAllocator (); };
llvm/include/llvm/Support/PerThreadBumpPtrAllocator.h | ||
---|---|---|
32 | initialize Allocators instead of initializing it to empty then calling resize | |
37 | Delete this comment and the blank line above. One // Pull in base class overloads. suffices. Actually the code self complains, so I personally prefer removing the comment completely. | |
95 | Since this cannot be resized, we can use unique_ptr<AllocatorTy[]> and a separate size variable. | |
llvm/lib/Support/Parallel.cpp | ||
161 | A thread ID is smaller than this number, so the name seems confusing. Use getNumThreads? |
addressed comments. added getThreadLocalAllocator().
grouped methods on whether it could be or not called asynchronously.
I have a suggested comment and some nits, otherwise looks good. It'd be good to have @andrewng's review as well.
llvm/include/llvm/Support/PerThreadBumpPtrAllocator.h | ||
---|---|---|
19 | Suggest: PerThreadAllocator is used in conjunction with ThreadPoolExecutor to allow per-thread allocations. It wraps a possibly thread-unsafe allocator, e.g. BumpPtrAllocator. PerThreadAllocator must be used inside ThreadPoolExecutor as it utilizes getThreadIndex, which is set by ThreadPoolExecutor threads. | |
llvm/unittests/Support/PerThreadBumpPtrAllocatorTest.cpp | ||
22 | delete blank line | |
42 | 5000 seems too much. 1000 suffices. |
I still have a general concern: this utility isn't safe to use in general LLVM library code, and while that's documented in the header, there's nothing enforcing that or checking for it. I think it'd be easy to get this wrong, and our existing test coverage would be unlikely to catch mistakes, but it could be a big problem for tools/libraries that have their own thread pools and depend on LLVM code.
Is there any way of adding assertions, or a structural change, to confirm this hasn't been used in the wrong place? I think it would be okay to add a bit of API surface to ThreadPoolExecutor, or add some new fields under LLVM_ENABLE_API_BREAKING_CHECKS.
the possible solution might be initializing threadIndex to some unrelated value by default.
f.e. setting threadIndex to -1; Threads created by ThreadPoolExecutor would have indexes in range 0 ... ThreadsNum.
It will trigger assertions "assert(getThreadIndex() < NumOfAllocators);" for wrong threads inside PerThreadAllocator methods. Does it sound OK?
When a llvm::parallel::parallel_* function returns, the ThreadPoolExecutor doesn't reset llvm::parallel::threadIndex. So the check won't be effective for a PerThreadBumpPtrAllocator misuse after a parallel_* function call.
Any expensive check mechanism should also support nested llvm::parallel::parallel_* calls (even if the inner calls are serial).
I don't have a good approach off my head...
Sounds reasonable to me. Maybe we'd want to keep zero-init for non-assertions builds to avoid unnecessary static initializers?
I don't think we need to catch every possible failure. It sounds useful (even if not exhaustive) to catch misuse in the cases where no llvm::parallel code has been called yet. Then, regular unit test coverage for such library code can trigger assertion failures....
Any expensive check mechanism
Note, LLVM_ABI_BREAKING_CHECKS (spelling?) is on-by-default in assertions builds. This is different from LLVM_EXPENSIVE_CHECKS (sp?). I think if we can't make the assertion cheap enough to be on-by-default in assertions builds it probably isn't worth it.
(Not sure @avl's suggestion is ABI-breaking anyway so it's probably moot.)
BTW, if others feel strongly that such an assertion wouldn't be useful (say, maybe there's reason to believe that even unit tests wouldn't trigger it in practice due to @MaskRay's points?), happy to back away and let this land without it.
If it would not catch all the cases, probably it still would be useful if catch some of them.
f.e. following case would be coaght:
PerThreadBumpPtrAllocator Allocator; parallelFor() { Allocator.Allocate(); } Allocator.Allocate(); << the call is done on main thread. << assertion should be raised.
PerThreadBumpPtrAllocator Allocator; ThreadPool Pool(parallel::strategy); Pool.async([&]() { Allocator.Allocate(); }); << the call is done on the thread created by not ThreadPoolExecutor. << assertion should be raised.
Can it be useful?
It also could probably be done in more general way:
#if LLVM_ENABLE_ASSERTIONS thread_local unsigned threadIndex = -1; #else thread_local unsigned threadIndex; #endif inline unsigned getThreadIndex() { assert(threadIndex != -1); return threadIndex; }
However, above suggestion forbids usage of the main thread which is currently used by parallelFor(). Thus the above check either should not be done either parallelFor() should be refactored to not use main thread.
It also could probably be done in more general way:
#if LLVM_ENABLE_ASSERTIONS thread_local unsigned threadIndex = -1; #else thread_local unsigned threadIndex; #endif inline unsigned getThreadIndex() { assert(threadIndex != -1); return threadIndex; }However, above suggestion forbids usage of the main thread which is currently used by parallelFor(). Thus the above check either should not be done either parallelFor() should be refactored to not use main thread.
Yes, any checking would need to somehow handle the case where parallelFor() falls back to using the main thread or this optimisation would need to be dropped. Also, IIRC, some of the other parallel methods may/will still use the main thread concurrently with threads from the thread pool.
I don't really have much time right now but I've scanned the change and it LGTM barring this concern over "safety of usage" which I think would be good to have if it can be done relatively easily.
so far I suggest to implement safety check as a separate patch. After having a good solution for this.
I think if we don't add the check now it's unlikely to happen later.
Seems like threads are assigned IDs from 1 in the ThreadPoolExecutor constructor via calls to work(). The main thread assigns threadIndex to 0 in the same place:
explicit ThreadPoolExecutor(ThreadPoolStrategy S = hardware_concurrency()) { unsigned ThreadCount = S.compute_thread_count(); // Spawn all but one of the threads in another thread as spawning threads // can take a while. Threads.reserve(ThreadCount); Threads.resize(1); std::lock_guard<std::mutex> Lock(Mutex); Threads[0] = std::thread([this, ThreadCount, S] { for (unsigned I = 1; I < ThreadCount; ++I) { Threads.emplace_back([=] { work(S, I); }); if (Stop) break; } ThreadsCreated.set_value(); work(S, 0); }); } void work(ThreadPoolStrategy S, unsigned ThreadID) { threadIndex = ThreadID;
Can you explain what will go wrong with the main thread?
(Although I do think an independent patch makes sense so it can be reverted independently if it causes trouble.)
Aha, looks like I misread the code. The work() calls are coming from within a lambda that's executed by the first created thread. So, right now, the main thread has the same threadIndex as the first spawned thread.
(But if that's the case, doesn't that cause a problem for the allocator? Doesn't the allocator require that the main thread has a different ID from the worker threads?)
Assuming it's okay for the main thread to alias the worker threads, what if we just check if the ThreadPoolExecutor has been constructed?
// Parallel.h bool hasDefaultExecutor(); unsigned getThreadIndex() { assert(hasDefaultExecutor()); return threadIndex; } // Parallel.cpp static std::atomic<bool> HasDefaultExecutor = false; bool hasDefaultExecutor() { return HasDefaultExecutor; } Executor *Executor::getDefaultExecutor() { HasDefaultExecutor = true;
Specifically, if the main thread is doing work concurrently with Thread0, and won't they be using the same slice of the PerThreadAllocator?
My understanding is that it should not be so that the main thread is concurrently doing work with Thread0. If that is the case then the problem exists even without PerThreadAllocator. There was recently fixed a bug when exactly main thread doing work concurrently with Thread0 - D142317
Thus current assumption is that the main thread never works concurrently with Thread0. Because they never work concurrently it is OK to have slice 0 for both : main thread and Thread0.
I am OK to do that separate patch right after the current patch. Just do not have a good idea for this at the moment.
WDYT of the idea above, to have a Boolean flag that checks whether getDefaultExecutor() has been called, and assert on that in getThreadIndex()?
I think this solves only part of the problem: it checks the fact that executor is already created when getThreadIndex() is requested. But it does not check that thread index is valid. If thread was created not by ThreadPoolExecutor then it would have zero index which clashes with thread index of main thread and Thread0. I thought we want to check that other threads were not used with getThreadIndex.
Checking ThreadPoolExecutor existence still useful check and it would be good to implement it. If we found a good way to check thread indexes it would also be useful.
Yeah, seems like a good start for now. This would catch the case where someone is NOT using llvm::parallel at all, but has a bunch of threads, and is wrongly assuming this allocator is safe for concurrent use in general.
Checking the thread indexes seems hard, since the "main" thread could be a different client thread on different entries to llvm::parallel.
This check will help for pure users of getThreadIndex() but will not help users of PerThreadBumpPtrAllocator as it calls "detail::Executor::getDefaultExecutor()->getThreadsNum();" in the constructor. Thus any call to getThreadIndex() after PerThreadBumpPtrAllocator is created will have HasDefaultExecutor == true.
Checking the thread indexes seems hard, since the "main" thread could be a different client thread on different entries to llvm::parallel.
I see; I've been missing that detail.
You could introduce a free function, getDefaultThreadsNum() to Parallel.h/cpp, which in assertions mode calls hardware_concurrency().compute_thread_count() (redundantly) without calling getDefaultExecutor(). WDYT?
Originally, I tried to avoid such duplication. As there is no guarantee that DefaultExecutor() uses exactly hardware_concurrency().compute_thread_count() threads. If the default executor strategy would be changed it is easy to forget to update getDefaultThreadsNum() function. Probably, it could be verified by adding more asserts though.
Anyway, I tend to lean towards initializing threadIndex with -1:
thread_local unsigned threadIndex = -1; inline unsigned getThreadIndex() { assert((threadIndex != -1) || (parallel::strategy.ThreadsRequested == 1)); return threadIndex; }
This solution guarantees that getThreadIndex() is used with only threads created by ThreadPoolExecutor.
It also helps to increase the level of parallelization by allowing parallel execution of llvm::parallelFor().
Currently, there is a limitation to not having two TaskGroups running in parallel:
// Latch::sync() called by the dtor may cause one thread to block. If is a dead // lock if all threads in the default executor are blocked. To prevent the dead // lock, only allow the first TaskGroup to run tasks parallelly. In the scenario // of nested parallel_for_each(), only the outermost one runs parallelly. TaskGroup::TaskGroup() : Parallel(TaskGroupInstances++ == 0) {}
This is done to avoid nested llvm::parallelFor() but it also avoids parallel llvm::parallelFor():
ThreadPool Pool(parallel::strategy); Pool.async([&]() { llvm::parallelFor(); }); Pool.async([&]() { llvm::parallelFor(); }); Pool.wait();
The above solution allows running llvm::parallelFor() parallelly without having a deadlock.
Using threadIndex will make only nested llvm::parallelFor() to be sequential and allow parallel llvm::parallelFor():
TaskGroup::TaskGroup() : Parallel(threadIndex == -1) {}
Another thing, is that using assertion to check threadIndex will forbid the calling of getThreadIndex() on the main thread.
It will break this code lld/ELF/Relocations.cpp:
// Deterministic parallellism needs sorting relocations which is unsuitable // for -z nocombreloc. MIPS and PPC64 use global states which are not suitable // for parallelism. bool serial = !config->zCombreloc || config->emachine == EM_MIPS || config->emachine == EM_PPC64; parallel::TaskGroup tg; for (ELFFileBase *f : ctx.objectFiles) { auto fn = [f]() { RelocationScanner scanner; for (InputSectionBase *s : f->getSections()) { if (s && s->kind() == SectionBase::Regular && s->isLive() && (s->flags & SHF_ALLOC) && !(s->type == SHT_ARM_EXIDX && config->emachine == EM_ARM)) scanner.template scanSection<ELFT>(*s); } }; if (serial) fn(); <<<<< called on the main thread, it calls getThreadIndex() inside. else tg.execute(fn); } // Both the main thread and thread pool index 0 use getThreadIndex()==0. Be // careful that they don't concurrently run scanSections. When serial is // true, fn() has finished at this point, so running execute is safe.
To solve this problem it is possible to add a sequential mode to TaskGroup, so that tasks marked with
"sequential" flags are always executed on Thread0. It makes them always executed in the same order.
It also will allow executing scanSection concurrently as no indexes overlap now.
tg.execute(fn, serial);
List of changes shortly:
- initialiase threadIndex with -1 and add assertion:
thread_local unsigned threadIndex = -1; inline unsigned getThreadIndex() { assert((threadIndex != -1) || (parallel::strategy.ThreadsRequested == 1)); return threadIndex; }
- remove optimization for NumItems to avoid executing on the main thread.
void llvm::parallelFor(size_t Begin, size_t End, llvm::function_ref<void(size_t)> Fn) { // If we have zero or one items, then do not incur the overhead of spinning up // a task group. They are surprisingly expensive, and because they do not // support nested parallelism, a single entry task group can block parallel // execution underneath them. #if LLVM_ENABLE_THREADS auto NumItems = End - Begin; if (NumItems > 1 && parallel::strategy.ThreadsRequested != 1) { <<< delete check for NumItems
- Use threadIndex to avoid nested parallelization:
TaskGroup::TaskGroup() : Parallel(threadIndex == -1) {}
- Add sequential mode to TaskGroup forcing marked tasks to be executed on the same thread in sequential order.
What do you think about that approach? It looks like it could give a good level of checking: no other threads are allowed to call getThreadIndex() and helps to get more parallelization.
This sounds okay to me, but I admit I don't know llvm::parallel well enough to understand the implications.
@MaskRay, WDYT?
Anyway, I tend to lean towards initializing threadIndex with -1:
This looks good to me to avoid collision between the main thread and the thread pool thread 0. This also fulfills "remove zero initialization." in a comment in D142317.
It will break this code lld/ELF/Relocations.cpp:
if (serial) fn(); <<<<< called on the main thread, it calls getThreadIndex() inside. else tg.execute(fn);
Yes. It seems that we need a serial tg.xxxxx that makes getThreadIndex() == 0, even if llvm::parallel::strategy requests more than one threads.
I haven't thought about other changes very clearly. The safeguard changes and discussions can go to another patch.
Sounds good; happy for this to land while you continue working on how to do the checking.
- remove optimization for NumItems to avoid executing on the main thread.
Sorry but I'm quite busy right now, so haven't been able to give this much attention. I did notice this whilst scanning the updates to this review and IIRC this was quite an important optimisation for some use case (can't remember the details). So might want to take some care before removing it.
yep. I am going to create an appropriate set of reviews, so that it would be possible to pay attention to each aspect separately.
Apart from the minor suggestion, all the parallel changes LGTM.
llvm/lib/Support/Parallel.cpp | ||
---|---|---|
106 | Perhaps change getThreadsNum -> getNumThreads or getThreadCount? |
Suggest:
PerThreadAllocator is used in conjunction with ThreadPoolExecutor to allow per-thread allocations. It wraps a possibly thread-unsafe allocator, e.g. BumpPtrAllocator.
PerThreadAllocator must be used inside ThreadPoolExecutor as it utilizes getThreadIndex, which is set by ThreadPoolExecutor threads.