This is an archive of the discontinued LLVM Phabricator instance.

[Support/Parallel] - Do not use a task group for a very small task.
ClosedPublic

Authored by grimar on Aug 11 2017, 3:52 AM.

Details

Summary

parallel_for_each_n splits a given task into small pieces and then
passes them to background threads managed by a thread pool to process them
in parallel. TaskGroup then waits for all tasks to be done; that waiting
happens in TaskGroup's destructor.

In the previous code, all tasks were passed to background threads, and the
main thread just waited for them to finish their jobs. This patch changes
the logic so that the main thread processes a task just like other
worker threads instead of just waiting for workers.

This patch improves the performance of parallel_for_each_n for a task that
is so small that we do not split it into multiple tasks. Previously, such a task
was submitted to another thread and the main thread waited for its completion.
That involves multiple inter-thread synchronizations, which are not cheap for
small tasks. Now, such a task is processed by the main thread, so no inter-thread
communication is necessary.
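
To make the described change concrete, here is a minimal sketch of the pattern, with simplified names; it is not the actual diff and only assumes a TaskGroup whose spawn() hands work to the thread pool and whose destructor blocks until all spawned work is done, as described above. The 1024-way split is the arbitrary chunking mentioned later in this review.

#include <cstddef> // ptrdiff_t

template <class IndexTy, class FuncTy>
void parallel_for_each_n_sketch(IndexTy Begin, IndexTy End, FuncTy Fn) {
  ptrdiff_t TaskSize = (End - Begin) / 1024; // split into at most 1024 chunks
  if (TaskSize == 0)
    TaskSize = 1;

  TaskGroup TG;
  IndexTy I = Begin;
  for (; I + TaskSize < End; I += TaskSize)
    TG.spawn([=, &Fn] {                  // all but the last chunk go to workers
      for (IndexTy J = I; J != I + TaskSize; ++J)
        Fn(J);
    });
  for (IndexTy J = I; J < End; ++J)      // the last (or only) chunk runs here,
    Fn(J);                               // on the calling thread
}                                        // ~TaskGroup() waits for the workers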

Diff Detail

Event Timeline

grimar created this revision.Aug 11 2017, 3:52 AM
ruiu edited edge metadata.Aug 11 2017, 7:01 AM

This doesn't seem correct. Imagine that you have an infinite number of cores. Then the new code would take 2x the time of the old code. I guess you are "fixing" the problem in the wrong place.

In D36607#839043, @ruiu wrote:

This doesn't seem correct. Imagine that you have an infinite number of cores. Then the new code would take 2x the time of the old code. I guess you are "fixing" the problem in the wrong place.

Why would the new code take 2x the time?

ruiu added a comment.Aug 11 2017, 7:37 AM

Because the last one is not parallelized by the taskgroup, no?

In D36607#839086, @ruiu wrote:

Because the last one is not parallelized by the taskgroup, no?

And at the end of parallel_for_each_n() it waits for all threads to finish anyway.
I think it makes no difference whether the last task is executed on the main thread or on another thread if we wait for all of them anyway, no?

If we have an infinite number of threads and each of the tasks 0..N takes time T, then:

  1. The original code takes total time T, because each task is assigned to its own thread, which takes T to finish. The main thread waits T until the secondary threads finish.
  2. My code also takes total time T, because tasks 0..(N-1) take time T and the main thread takes time T for task N at the same time. The total time is T.
ruiu added a comment.Aug 11 2017, 7:59 AM

Then why does it take so much time in some test cases if both are the same in big-O terms? It should never create 65k+ threads. If it does, something's broken. This patch seems to be trying to fix it in the wrong place.

In D36607#839139, @ruiu wrote:

Then why does it take so much time in some test cases if both are the same in big-O terms? It should never create 65k+ threads. If it does, something's broken. This patch seems to be trying to fix it in the wrong place.

Because it calls:

void parallel_for_each_n(IndexTy Begin, IndexTy End, FuncTy Fn)

where Begin == 0 and End == 1, and creates a thread for a single task when it should not: there is no reason
to use a separate thread in this case when the main thread is free to do the same work.

And creating 65k threads takes that time.
I really believe the patch is fine; let's see what others think.

It should never create 65k+ threads. If it does, something's broken.

In that test case we have 65k output sections, and we call parallelForEachN for each of them; that is how we parallelize writing in LLD.
It does not seem broken, because usually we have a small number of output sections and many more input sections.
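
A hypothetical reduction of that usage pattern (illustrative names only, not the actual LLD writer code): with ~65k output sections that each contain very few pieces, most of these calls hit the End - Begin == 1 case, and before this patch each such call still pushed one task to a background thread and blocked on it. parallelForEachN here is the entry point mentioned above, assumed to come from llvm/Support/Parallel.h.

#include <cstddef>
#include <vector>

struct OutputSectionLike {
  std::vector<int> Pieces; // stand-in for whatever is written in parallel
};

void writePiece(OutputSectionLike &Sec, size_t I); // illustrative only

void writeAllSections(std::vector<OutputSectionLike> &Sections) {
  for (OutputSectionLike &Sec : Sections)            // ~65k output sections
    // Often Pieces.size() == 1, so the range is a single element.
    parallelForEachN(0, Sec.Pieces.size(), [&](size_t I) { writePiece(Sec, I); });
}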

ruiu added a comment.Aug 11 2017, 8:12 AM

It is broken. The task group should not spawn more threads than the number of physical cores, no matter how it is used. It is intended to be a thread pool, and spawning a thread per task is nonsense.

zturner edited edge metadata.

I think there is a balance between what the library should do for you and what the programmer should be responsible for. Writing:

for (int I=0; I < 65000; ++I)
  parallel_for_each_n(0, 1, callback);

is kind of a bad thing to do, but at the same time I guess it shouldn't give pathologically bad performance.
Microsoft apparently implements this optimization in ConcRT, so that gives me some confidence that it might be an ok optimization.

On the other hand, task-based threading is fraught with potential for subtle "anti-optimizations" that are hard to evaluate. I think a lot of people have probably thought about this before, so perhaps we could ask someone on the libc++ side who is (or is going to be) responsible for the parallelism algorithms in libc++? I don't know who that would be off the top of my head, but +mclow, as even if it's not him, he should at least know who it is.

In D36607#839160, @ruiu wrote:

It is broken. The task group should not spawn more threads than the number of physical cores, no matter how it is used. It is intended to be a thread pool, and spawning a thread per task is nonsense.

Ah, in that sense you're right, I think. It even has a comment about that:

// TaskGroup has a relatively high overhead, so we want to reduce
// the number of spawn() calls. We'll create up to 1024 tasks here.
// (Note that 1024 is an arbitrary number. This code probably needs
// improving to take the number of available cores into account.)

But still, in our case it spawns a single thread 65k times, not 65k threads at once,
and we can simply avoid that.

So it does not spawn more threads than the number of cores in the case I am trying to fix.

ruiu added a comment.Aug 11 2017, 8:26 AM

Well, since it is supposed to be a thread pool, it shouldn't spawn a thread for a new task. It should keep threads running until the process is terminated.

grimar abandoned this revision.Aug 14 2017, 5:53 AM

I was mistaken. I did not notice that TaskGroup::spawn() has a different implementation for the Windows and non-Windows cases.
The Linux implementation is actually based on ThreadPoolExecutor, which is implemented via a thread pool,
so I was mistaken when I said that it creates 65k threads; the issue happens because of something else.

This patch still looks like a correct optimization to me and solves the problem too, by avoiding the use of an unnecessary thread.
It seems to help because it avoids the thread-pool-based implementation, which appears to have some bottleneck.
I am going to find the real issue now.

grimar reclaimed this revision.Aug 14 2017, 7:06 AM
grimar edited the summary of this revision. (Show Details)

After looking into ThreadPoolExecutor, I think its implementation is fine.
It creates std::thread::hardware_concurrency() - 1 threads and reuses them correctly.

So it looks like the slowdown is caused by the overhead of using it a huge number of times for tiny single tasks.
The patch still fixes that.

ruiu added a comment.Aug 14 2017, 9:32 AM

I still wonder whether communication between threads can make something that slow. Where is the exact location where you observe the slowdown? Until we understand the exact reason, we cannot exclude the possibility that this change is hiding a real issue.

In D36607#840916, @ruiu wrote:

I still wonder whether communication between threads can make something that slow. Where is the exact location where you observe the slowdown? Until we understand the exact reason, we cannot exclude the possibility that this change is hiding a real issue.

It is ThreadPoolExecutor::work().
(https://github.com/llvm-mirror/llvm/blob/master/lib/Support/Parallel.cpp#L102)

Currently it has the following implementation:

void work() {
  while (true) {
    std::unique_lock<std::mutex> Lock(Mutex);
    Cond.wait(Lock, [&] { return Stop || !WorkStack.empty(); });
    if (Stop)
      break;
    auto Task = WorkStack.top();
    WorkStack.pop();
    Lock.unlock();
    Task();
  }
  Done.dec();
}

If I modify it slightly to avoid waiting on the condition variable (and use busy-waiting instead), then it works much faster for me:

void work() {
  while (true) {
    std::unique_lock<std::mutex> Lock(Mutex);
    if (!WorkStack.empty()) {
      auto Task = WorkStack.top();
      WorkStack.pop();
      Lock.unlock();
      Task();
    }
    if (Stop)
      break;
  }
  Done.dec();
}

That makes me think the synchronization overhead is significant in this case.

Another example is when I stop calling Task() from work() and call it in add() instead.
There are two cases: in the first I commented out all the synchronization code; in the second I left it as is:

void add(std::function<void()> F) override {
  //std::unique_lock<std::mutex> Lock(Mutex); // COMMENTED OUT
  //WorkStack.push(F);                        // COMMENTED OUT
  //Lock.unlock();                            // COMMENTED OUT
  //Cond.notify_one();                        // COMMENTED OUT
  F();                                        // NEW LINE
}

void add(std::function<void()> F) override {
  std::unique_lock<std::mutex> Lock(Mutex);
  WorkStack.push(F);
  Lock.unlock();
  Cond.notify_one();
  F();                                        // NEW LINE
}

In the first case (when all the synchronization is commented out in add()) the link time is instant; in the second case it takes some time, even though
all the work is done in add() by F() anyway, which means all the time is spent on the synchronization, I believe.
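
A hypothetical way to reproduce this overhead in isolation (not part of the patch; the required header and exact spelling of parallel_for_each_n are whatever your tree exposes): issue many one-element calls, as the 65k-output-section link effectively does, and the per-call synchronization dominates the measured time.

#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
  std::atomic<unsigned> Sink{0};
  auto Start = std::chrono::steady_clock::now();
  for (int I = 0; I < 65000; ++I)
    parallel_for_each_n(0, 1, [&](int) { ++Sink; }); // one tiny task per call
  auto End = std::chrono::steady_clock::now();
  auto Ms = std::chrono::duration_cast<std::chrono::milliseconds>(End - Start);
  std::printf("65k single-task calls took %lld ms\n", (long long)Ms.count());
}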

davide added a subscriber: davide.Aug 16 2017, 4:24 AM

Adaptive mutexes (e.g. the pthread_* ones) work on this principle. The kernel maintains a shared page, and threads contending on the lock spin as long as the lock owner isn't descheduled.
This works because in general the cost of a context switch exceeds that of wasting a few cycles spinning (under the assumption that the critical section is small).
An approximation of this scheme is to spin for a certain number of cycles, decreasing a counter each time (and going to sleep when the counter reaches zero).
This is, FWIW, what FreeBSD pthread mutexes do.
A different (and possibly more effective) direction for Parallel would be to explore work-stealing algorithms for queueing.
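
A sketch of that spin-then-sleep approximation (illustrative only, not code from Parallel.cpp): poll the predicate for a bounded number of iterations before blocking on the condition variable, so that short waits avoid a context switch. It returns holding the lock with the predicate satisfied; a work() loop could call it in place of the bare Cond.wait() shown earlier.

#include <condition_variable>
#include <mutex>
#include <thread>

template <class Pred>
std::unique_lock<std::mutex> spinThenWait(std::mutex &Mutex,
                                          std::condition_variable &Cond,
                                          Pred Ready) {
  for (int Budget = 1000; Budget > 0; --Budget) { // arbitrary spin budget
    std::unique_lock<std::mutex> Lock(Mutex);
    if (Ready())
      return Lock;              // became ready while spinning; never slept
    Lock.unlock();
    std::this_thread::yield();
  }
  std::unique_lock<std::mutex> Lock(Mutex);
  Cond.wait(Lock, Ready);       // fall back to blocking until notified
  return Lock;
}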

Interesting.
What this patch does is orthogonal, though, since it avoids all of the threading/synchronization overhead entirely, for free, in some cases.

This is not true in general. Imagine that the Fn(J) callback takes a considerable amount of time, much larger than the synchronization overhead; then your sequential version will take much more time than the parallel version (unless I'm missing some detail).
It's hard to find the right balance without a characterization of the length of a critical section (or, FWIW, the data access pattern).
In other words, I think your patch just "fixes" this problem but may cause a significant slowdown somewhere else. LLVM doesn't make heavy use of parallel/concurrent data structures, so it's unlikely to show up in practice.
If you still want to go this route (I'm not sure it's the best way forward, alas), you should at least guarantee this doesn't regress ThinLTO (which is the only other main consumer of parallel).

My version just executes the last task on the main thread (or the single task if there is only one).
The original code would execute it on a separate thread, and the main thread would wait until it completed anyway (TaskGroup's destructor calls sync()).
So I believe there is no difference in execution time.

ruiu added a comment.Aug 16 2017, 2:57 PM

George,

First of all, busy-waiting is a no-no, so we can't eliminate the condition variable from the thread pool. When a thread has to wait for some event to happen, it has to use a condition variable to put itself into the "blocked" state instead of busy-looping until the condition is met. That is in general true even if using busy loops makes your program faster, because your OS supports multiple processes and you don't want one process to eat up all CPU resources.

This patch looks fine, though. But the description is a bit pointless; that is why you are getting so many questions about this patch. You generally don't want to copy-paste code to describe what you are doing (which applies not only to this patch but also to your other patches). Instead, use prose to describe what you are trying to achieve with a patch. I'd update the patch description as follows:

Do not use a task group for a very small task

parallel_for_each_n splits a given task into small pieces and then
passes them to background threads managed by a thread pool to process them
in parallel. TaskGroup then waits for all tasks to be done; that waiting
happens in TaskGroup's destructor.

In the previous code, all tasks were passed to background threads, and the
main thread just waited for them to finish their jobs. This patch changes
the logic so that the main thread processes a task just like other
worker threads instead of just waiting for workers.

This patch improves the performance of parallel_for_each_n for a task that
is so small that we do not split it into multiple tasks. Previously, such a task
was submitted to another thread and the main thread waited for its completion.
That involves multiple inter-thread synchronizations, which are not cheap for
small tasks. Now, such a task is processed by the main thread, so no inter-thread
communication is necessary.

In D36607#843789, @ruiu wrote:

George,

First of all, busy-waiting is a no-no, so we can't eliminate the condition variable from the thread pool. When a thread has to wait for some event to happen, it has to use a condition variable to put itself into the "blocked" state instead of busy-looping until the condition is met. That is in general true even if using busy loops makes your program faster, because your OS supports multiple processes and you don't want one process to eat up all CPU resources.

Sure, I would not suggest busy-waiting for a generic thread pool use case like this; it was an experiment.
The idea mentioned by Davide about "spinning for a certain number of cycles, decreasing a counter each time (and going to sleep when the counter reaches zero)" could probably fit nicely here, though.
Since when we start a parallel_for we assume each thread will have several tasks, it seems reasonable to check whether there are any more tasks before going to sleep after finishing each one.
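
A sketch of that idea (illustrative, not a committed change), reusing the names from the work() snippet quoted above: after finishing a task, try to pop another one immediately and only fall back to the condition variable when the stack is genuinely empty.

void work() {
  while (true) {
    std::unique_lock<std::mutex> Lock(Mutex);
    if (!WorkStack.empty()) {          // more work queued: take it, don't sleep
      auto Task = WorkStack.top();
      WorkStack.pop();
      Lock.unlock();
      Task();
      continue;                        // look for the next task right away
    }
    Cond.wait(Lock, [&] { return Stop || !WorkStack.empty(); });
    if (Stop)
      break;
  }
  Done.dec();
}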

This patch looks fine, though. But the description is a bit pointless; that is why you are getting so many questions about this patch. You generally don't want to copy-paste code to describe what you are doing (which applies not only to this patch but also to your other patches). Instead, use prose to describe what you are trying to achieve with a patch.

I'll try. Thank you.

grimar retitled this revision from [Support/Parallel] - Do not spawn thread for single/last task. to [Support/Parallel] - Do not use a task group for a very small task..Aug 17 2017, 12:51 AM
grimar edited the summary of this revision. (Show Details)
ruiu added a comment.Aug 17 2017, 8:50 AM

I *believe* busy-waiting for a very short period of time before putting itself into the blocked state is what the condition variable or the mutex implemented in the standard library already does. If you want to tune parameters for multi-threaded programming primitives, you can try, but I personally wouldn't do that, because it is hard to improve on something that has already been tuned by experts in that field.

Parallel.h
165 (On Diff #110683)

Do the same thing here.

grimar updated this revision to Diff 111640.Aug 18 2017, 2:21 AM
  • Addressed review comment.
ruiu accepted this revision.Aug 18 2017, 2:34 PM

LGTM

This revision is now accepted and ready to land.Aug 18 2017, 2:34 PM
This revision was automatically updated to reflect the committed changes.