This is an archive of the discontinued LLVM Phabricator instance.

Parallel: only allow the first TaskGroup to run tasks parallelly
ClosedPublic

Authored by MaskRay on Apr 24 2019, 11:14 PM.

Details

Summary

Concurrent (e.g. nested) llvm::parallel::for_each() calls may lead to
deadlocks. See PR35788 (fixed by rLLD322041) and PR41508 (fixed by D60757).

When parallel_for_each() is about to return, ~Latch(), called from
~TaskGroup(), may block a thread (in a nested call, a worker thread of the
default executor) in Latch::sync() waiting for Count to become zero. If all
threads in the default executor are blocked this way, it is a deadlock.
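
For context, the Latch behind TaskGroup looks roughly like the following. This is a simplified sketch, not the exact code from llvm/Support/Parallel.h; sync(), which ~Latch() calls, is the blocking step described above:

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Simplified sketch of the Latch used by TaskGroup (details may differ from
// the real implementation). Each spawned task calls inc() when it is queued
// and dec() when it finishes; sync() blocks until Count drops to zero.
class Latch {
  uint32_t Count = 0;
  mutable std::mutex Mutex;
  mutable std::condition_variable Cond;

public:
  ~Latch() { sync(); } // The blocking call reached via ~TaskGroup().

  void inc() {
    std::lock_guard<std::mutex> Lock(Mutex);
    ++Count;
  }

  void dec() {
    std::lock_guard<std::mutex> Lock(Mutex);
    if (--Count == 0)
      Cond.notify_all();
  }

  void sync() const {
    std::unique_lock<std::mutex> Lock(Mutex);
    Cond.wait(Lock, [&] { return Count == 0; });
  }
};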

To fix this, force serial execution if the current TaskGroup is not the
first one. For nested llvm::parallel::for_each() calls, only the outermost
one runs in parallel; inner ones run serially.
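
Roughly, the fix looks like this. This is an illustrative sketch only; the member names Parallel and L and the counter TaskGroupInstances are assumptions here, and the committed diff is authoritative:

#include <atomic>
#include <functional>

// Sketch: a global counter decides whether a TaskGroup may go parallel.
// Only the first (outermost) TaskGroup hands work to the default executor;
// nested TaskGroups run their tasks immediately on the calling thread.
class TaskGroup {
  Latch L;        // the Latch sketched above
  bool Parallel;  // true only for the outermost TaskGroup

public:
  TaskGroup();
  ~TaskGroup();
  void spawn(std::function<void()> F);
  void sync() const { L.sync(); }
};

static std::atomic<unsigned> TaskGroupInstances;

TaskGroup::TaskGroup() : Parallel(TaskGroupInstances++ == 0) {}
TaskGroup::~TaskGroup() { --TaskGroupInstances; }

void TaskGroup::spawn(std::function<void()> F) {
  if (Parallel) {
    L.inc();
    // Executor::getDefaultExecutor() is LLVM's internal default thread pool.
    Executor::getDefaultExecutor()->add([&, F] {
      F();
      L.dec();
    });
    return;
  }
  F(); // Nested group: run serially so executor threads are never exhausted.
}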

Diff Detail

Repository
rL LLVM

Event Timeline

MaskRay created this revision. Apr 24 2019, 11:14 PM
Herald added a project: Restricted Project. Apr 24 2019, 11:14 PM

A triply nested loop easily reproduces the deadlock:

#include <llvm/Support/Parallel.h>
#include <unistd.h>

int main() {
  int a[99] = {};
  // Each nesting level spawns tasks on the same default executor, so the
  // worker threads end up blocked in Latch::sync() waiting for each other.
  llvm::parallel::for_each(llvm::parallel::par, a, a + 99, [&](int) {
    usleep(1000);
    llvm::parallel::for_each(llvm::parallel::par, a, a + 99, [&](int) {
      usleep(1000);
      llvm::parallel::for_each(llvm::parallel::par, a, a + 99,
                               [&](int) { usleep(1000); });
    });
  });
}

strace -f ./a => You soon see that no nanosleep syscalls are made because all threads are stuck in futex(). This patch fixes that.

ruiu added a comment. Apr 25 2019, 12:25 AM

Don't you think we can just get rid of the notion of TaskGroup? Looks like we don't need TaskGroup to implement a thread pool and a parallel-for-loop.

Generally, it doesn't make much sense to have more than one thread pool in a process (spawning more threads than necessary defeats the purpose of a thread pool), so we can make the thread pool a singleton class. Let's call the thread pool class ThreadPool. That class would define getInstance() to get the instance. The only other member function would be void execute(std::function<void()> Fn), which executes Fn on some thread. Internally, that function would add the given function object to a queue and wake up worker threads using a condition variable.

I believe it shouldn't be too hard to implement such a thread pool class.
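
Such a pool could look roughly like this (an editorial sketch: only the names ThreadPool, getInstance(), and execute() come from the description above; the queue, worker count, and shutdown logic are assumptions):

#include <algorithm>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
  static ThreadPool &getInstance() {
    static ThreadPool Instance(
        std::max(1u, std::thread::hardware_concurrency()));
    return Instance;
  }

  // Enqueue Fn; some worker thread will pick it up and run it.
  void execute(std::function<void()> Fn) {
    {
      std::lock_guard<std::mutex> Lock(Mu);
      Queue.push(std::move(Fn));
    }
    Cond.notify_one();
  }

private:
  explicit ThreadPool(unsigned N) {
    for (unsigned I = 0; I != N; ++I)
      Workers.emplace_back([this] { work(); });
  }

  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> Lock(Mu);
      Stop = true;
    }
    Cond.notify_all();
    for (std::thread &T : Workers)
      T.join();
  }

  // Worker loop: wait for a queued function, run it, repeat until shutdown.
  void work() {
    for (;;) {
      std::function<void()> Fn;
      {
        std::unique_lock<std::mutex> Lock(Mu);
        Cond.wait(Lock, [this] { return Stop || !Queue.empty(); });
        if (Stop && Queue.empty())
          return;
        Fn = std::move(Queue.front());
        Queue.pop();
      }
      Fn();
    }
  }

  std::vector<std::thread> Workers;
  std::queue<std::function<void()>> Queue;
  std::mutex Mu;
  std::condition_variable Cond;
  bool Stop = false;
};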

Once we implement the thread pool, then we can implement parallel-for-each like this:

#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <iterator>
#include <mutex>

template <class IterTy, class FuncTy>
void parallel_for_each(IterTy Begin, IterTy End, FuncTy Fn) {
  // Split tasks into groups of roughly 1/1024 of the input.
  ptrdiff_t TaskSize = std::distance(Begin, End) / 1024;
  if (TaskSize == 0)
    TaskSize = 1;

  size_t NumTasks = 0;
  std::mutex Mu;
  std::condition_variable Cond;

  // Submit jobs to the thread pool.
  while (TaskSize < std::distance(Begin, End)) {
    {
      std::lock_guard<std::mutex> Lock(Mu);
      ++NumTasks;
    }
    // Capture Begin and TaskSize by value: Begin is advanced below before
    // the task necessarily runs.
    ThreadPool::getInstance().execute([&, Begin, TaskSize] {
      std::for_each(Begin, Begin + TaskSize, Fn);
      std::lock_guard<std::mutex> Lock(Mu);
      if (--NumTasks == 0)
        Cond.notify_all();
    });
    Begin += TaskSize;
  }

  // Process the remaining elements on the calling thread.
  std::for_each(Begin, End, Fn);

  // Wait for everybody to complete.
  std::unique_lock<std::mutex> Lock(Mu);
  Cond.wait(Lock, [&] { return NumTasks == 0; });
}

This parallel_for_each should be completely reentrant.

> Generally, it doesn't make much sense to have more than one thread pool in a process

The existing Executor::getDefaultExecutor() implements the thread pool. I agree that one thread pool suffices, so this patch parallelizes the outermost loop and serializes inner loops.

Note that a thread pool alone doesn't solve the problem. Latch::sync() still blocks one thread, because there must be a synchronization point before the function returns. (This could be fixed by introducing a thread scheduler, but that is unnecessary for our use cases.)

> Once we implement the thread pool, then we can implement parallel-for-each like this:

TaskGroup encapsulates the common code that parallel_sort, parallel_for_each, and parallel_for_each_n need: 1) spawning a task, and 2) providing a synchronization point (see the sketch below).
If we had only parallel_for_each and not the other parallel_* variants, I would agree that inlining TaskGroup would be good.
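
To illustrate what that factoring buys, here is a hypothetical parallel_for_each_n written on top of TaskGroup; the chunking and names are assumptions, not the library's actual implementation:

#include <algorithm>
#include <functional>

// Hypothetical sketch: spawn() submits one task per chunk, and the TaskGroup
// destructor (via Latch::sync()) is the synchronization point that waits for
// all of them. parallel_sort and parallel_for_each reuse the same two pieces.
template <class IndexTy, class FuncTy>
void parallel_for_each_n(IndexTy Begin, IndexTy End, FuncTy Fn) {
  TaskGroup TG;
  IndexTy TaskSize = std::max<IndexTy>((End - Begin) / 1024, 1);
  for (IndexTy I = Begin; I < End; I += TaskSize) {
    IndexTy Last = std::min<IndexTy>(I + TaskSize, End);
    TG.spawn([=, &Fn] {
      for (IndexTy J = I; J != Last; ++J)
        Fn(J);
    });
  }
  // ~TaskGroup() blocks here until every spawned chunk has finished.
}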

ruiu accepted this revision. Apr 25 2019, 3:58 AM

LGTM

I spent a few hours trying to simplify this code and make it fundamentally reentrant, but failed. Multi-threaded code is hard to debug. For now, I think this is a good measure to prevent accidental nested use of this thread pool executor.

This revision is now accepted and ready to land. Apr 25 2019, 3:58 AM
MaskRay updated this revision to Diff 196610. Apr 25 2019, 4:20 AM
MaskRay edited the summary of this revision.

Update comments

MaskRay updated this revision to Diff 196611. Apr 25 2019, 4:21 AM
MaskRay edited the summary of this revision.

Fix typo

This revision was automatically updated to reflect the committed changes.