This is an archive of the discontinued LLVM Phabricator instance.

[Support] Avoid using main thread for llvm::parallelFor().
ClosedPublic

Authored by avl on Jan 22 2023, 2:20 PM.

Details

Summary

llvm::parallelFor() uses the threads created by ThreadPoolExecutor as well as the main thread.
The index of the main thread is the same as the index of the first thread created by ThreadPoolExecutor.
As a result, getThreadIndex() returns the same value for two different threads.
To avoid this thread-index clash, do not use the main thread for llvm::parallelFor():

parallel::TaskGroup TG;
for (; Begin + TaskSize < End; Begin += TaskSize) {
  TG.spawn([=, &Fn] {
    for (size_t I = Begin, E = Begin + TaskSize; I != E; ++I)
      Fn(I);
  });
}
for (; Begin != End; ++Begin)    <<<< executed by main thread.
  Fn(Begin);                     <<<<
return;                          <<<<

Diff Detail

Event Timeline

avl created this revision.Jan 22 2023, 2:20 PM
avl requested review of this revision.Jan 22 2023, 2:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 22 2023, 2:20 PM
andrewng added inline comments.Jan 23 2023, 4:30 AM
llvm/lib/Support/Parallel.cpp
216–222

Perhaps a "simpler" fix would be not to use the main thread here, i.e. should use TG.spawn? However, that assumes that getThreadIndex() is only for use in the "parallelFor" style functions otherwise the other parallel functions would also need to avoid using the main thread too.

I'm surprised that this hasn't been causing more issues.

Perhaps I'm missing something here, but I thought that thread_local variables are zero initialized. Therefore, the threadIndex for the main thread should already be 0. However, what is definitely of concern is that parallelFor could create contention between the main thread and thread pool thread "0" because they share the index 0. I definitely missed this in the original review that introduced this change. I spotted the potential for the issue within LLD itself but completely forgot that the parallel support code also makes use of the main thread!

Perhaps I'm missing something here, but I thought that thread_local variables are zero initialized.

That is my understanding too. An uninitialized TLS variable is assigned to .tbss, which is the TLS equivalent of .bss.

avl added a comment.EditedJan 23 2023, 7:00 AM

Perhaps I'm missing something here, but I thought that thread_local variables are zero initialized. Therefore, the threadIndex for the main thread should already be 0. However, what is definitely of concern is that parallelFor could create contention between the main thread and thread pool thread "0" because they share the index 0. I definitely missed this in the original review that introduced this change. I spotted the potential for the issue within LLD itself but completely forgot that the parallel support code also makes use of the main thread!

I did not find a direct specification of that behavior (that thread_local variables are zero-initialized), thus I created a patch which initializes it.

Another thing, as mentioned in the quote above, is that the main thread with index zero clashes with the first thread created by parallelFor.

I did not find a direct specification of that behavior (that thread_local variables are zero-initialized), thus I created a patch which initializes it.

Did you see actual issues arising from threadIndex being uninitialized? If so, what platform and toolchain were you using?

Another thing, as mentioned in the quote above, is that the main thread with index zero clashes with the first thread created by parallelFor.

See my suggestion.

llvm/lib/Support/Parallel.cpp
216–222

My gut feeling is that this would be the "better" solution, i.e. not including the invoking thread (might not be the main thread) as part of the pool of worker threads. It just feels cleaner to me.

However, can we also assume that the use of getThreadIndex() is not supported with regard to parallel_sort, which I believe can also make use of the invoking thread? I think this would be reasonable.

avl added a comment.Jan 23 2023, 9:34 AM

I did not find a direct specification of that behavior (that thread_local variables are zero-initialized), thus I created a patch which initializes it.

Did you see actual issues arising from threadIndex being uninitialized? If so, what platform and toolchain were you using?

No, I did not. The problem I see is a data race when accessing the same index from different threads.

I did not see that the C++ standard says thread-local data is zero-initialized; that is why I decided to initialize it. If we think this is not necessary, we can skip the initialization part.

Another thing, as mentioned in the quote above, is that the main thread with index zero clashes with the first thread created by parallelFor.

See my suggestion.

llvm/lib/Support/Parallel.cpp
216–222

yes, we can avoid using main thread in this case.

Though it seems that having the same index for the main thread and for one of the ThreadPool threads is a potential bug. We would always need to remember to avoid mixing the main thread with ThreadPool threads. Instead, it looks better to always have distinct indexes, as this patch does: 0 for the main thread, 1... for the others.

How about using TG.spawn in this place, but still having different indexes for main thread and others?

andrewng added inline comments.Jan 23 2023, 11:33 AM
llvm/lib/Support/Parallel.cpp
216–222

My feeling is that the current getThreadIndex() should really only apply in the context of the parallel thread pool threads and the parallel algorithms. Remember there are other thread pools in LLVM too and there may be more threads than just the main one (although perhaps these would just share the same index of 0?). Other threads may also not be long-lived.

Mixing the concept of the parallel getThreadIndex() with other threads just doesn't feel right.

avl added a comment.EditedJan 23 2023, 2:54 PM

See comment in place.

I also have a related question, not directly connected to this patch. What do you think about the allocator from D142318 and the following design for such an allocator:

class ThreadPoolAllocator {
  void *Allocate(size_t size, size_t alignment);
};

class ThreadPool {
  std::unique_ptr<ThreadPoolAllocator> getAllocator();
};

class ThreadPoolExecutor {
  std::unique_ptr<ThreadPoolAllocator> getAllocator();
};

The idea is to have a thread-safe allocator connected with its execution context.

llvm/lib/Support/Parallel.cpp
216–222

I see. To summarize the changes:

  1. remove zero initialization.
  2. remove advancing index for zero thread.
  3. use TG.spawn instead of main thread.

correct?

I also have a related question, not directly connected to this patch. What do you think about the allocator from D142318 and the following design for such an allocator:

class ThreadPoolAllocator {
  void *Allocate(size_t size, size_t alignment);
};

class ThreadPool {
  std::unique_ptr<ThreadPoolAllocator> getAllocator();
};

class ThreadPoolExecutor {
  std::unique_ptr<ThreadPoolAllocator> getAllocator();
};

The idea is to have a thread-safe allocator connected with its execution context.

I think you need to be a bit careful with terminology because IIUC it's not actually "thread-safe" but a "per thread" allocator. I think this could be useful in some situations but perhaps a better overall approach would be to use a low-level thread aware memory allocator. I think some toolchains/runtimes already have such an allocator and there are other options such as rpmalloc and mimalloc. This way, the benefit is more widespread and doesn't require any extra effort (although some allocator libraries have some limitations on certain platforms).

llvm/lib/Support/Parallel.cpp
216–222

To keep things simple, I think the above change, i.e. avoiding the use of the main thread, is all that's required to fix the important concurrency issue. Any other changes related to D142318 can be moved to that patch.

avl retitled this revision from [Support][LLD] Avoid using uninitialized threadIndex. to [Support] Avoid using main thread for llvm::parallelFor()..Jan 24 2023, 5:04 AM
avl edited the summary of this revision. (Show Details)
avl updated this revision to Diff 491727.Jan 24 2023, 5:26 AM
avl edited the summary of this revision. (Show Details)

addressed comments.

avl added a comment.Jan 24 2023, 5:39 AM

I think you need to be a bit careful with terminology because IIUC it's not actually "thread-safe" but a "per thread" allocator. I think this could be useful in some situations but perhaps a better overall approach would be to use a low-level thread aware memory allocator. I think some toolchains/runtimes already have such an allocator and there are other options such as rpmalloc and mimalloc. This way, the benefit is more widespread and doesn't require any extra effort (although some allocator libraries have some limitations on certain platforms).

Other than the limitations on certain platforms, AFAICT these libraries do not have a notion of memory pools, i.e. allocating data in "one big chunk" (to avoid memory fragmentation and to be able to free resources fast). Thus, if someone needs a thread-aware memory pool, rpmalloc and mimalloc would not be enough.

andrewng accepted this revision.Jan 24 2023, 8:25 AM

I think you need to be a bit careful with terminology because IIUC it's not actually "thread-safe" but a "per thread" allocator. I think this could be useful in some situations but perhaps a better overall approach would be to use a low-level thread aware memory allocator. I think some toolchains/runtimes already have such an allocator and there are other options such as rpmalloc and mimalloc. This way, the benefit is more widespread and doesn't require any extra effort (although some allocator libraries have some limitations on certain platforms).

Other than the limitations on certain platforms, AFAICT these libraries do not have a notion of memory pools, i.e. allocating data in "one big chunk" (to avoid memory fragmentation and to be able to free resources fast). Thus, if someone needs a thread-aware memory pool, rpmalloc and mimalloc would not be enough.

Yes, as I mentioned, in some situations a "memory pool" may be more useful, e.g. more efficient freeing of allocated resources by releasing the whole pool rather than performing individual deallocations. Although, having quickly scanned D142318, it looked like those allocators were long-lived and weren't intended for this use case. I believe that both rpmalloc and mimalloc already do try to reduce memory fragmentation.

LGTM and thanks for catching this issue!

This revision is now accepted and ready to land.Jan 24 2023, 8:25 AM
MaskRay accepted this revision.Jan 24 2023, 6:16 PM

I'm not sure of the latest procedure regarding release branches, but this patch should be cherry-picked to the latest 16.x branch.

avl added a comment.Jan 25 2023, 3:47 AM

@andrewng @MaskRay Thank you for the review!

This revision was landed with ongoing or failed builds.Jan 25 2023, 3:48 AM
This revision was automatically updated to reflect the committed changes.

@avl, could you please get this fix cherry picked to the 16.x release branch. The instructions for doing so are here: Backporting Fixes to the Release Branches and the GitHub milestone is here: LLVM 16.0.0 Release. Thanks.

avl added a comment.Jan 27 2023, 5:15 AM

@avl, could you please get this fix cherry picked to the 16.x release branch. The instructions for doing so are here: Backporting Fixes to the Release Branches and the GitHub milestone is here: LLVM 16.0.0 Release. Thanks.

Will do. Thanks.

avl added a comment.Jan 30 2023, 9:42 AM

@avl, could you please get this fix cherry picked to the 16.x release branch. The instructions for doing so are here: Backporting Fixes to the Release Branches and the GitHub milestone is here: LLVM 16.0.0 Release. Thanks.

Will do. Thanks.

https://github.com/llvm/llvm-project/issues/60338