
Add task pool to LLDB
ClosedPublic

Authored by tberghammer on Oct 14 2015, 7:32 AM.

Details

Summary

Add a new task pool class to LLDB to make it easy to execute tasks in parallel

Basic design goals:

  • Have a very lightweight and easy to use interface where a list of lambdas can be executed in parallel
  • Use a global thread pool to limit the number of threads used (std::async doesn't do this on Linux) and to eliminate the thread creation overhead

Possible future improvements (please weigh in on the priorities of these and add any additional features you want to see):

  • Possibility to cancel already added, but not yet started tasks
  • Lazy creation of the worker threads
  • Removing unused worker threads after some time
  • Parallel for_each implementation

The first user of the thread pool will be the dwarf parsing code. An example of how it will be used is available at http://reviews.llvm.org/D13662 (diff 2)
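For illustration, a minimal sketch of the kind of interface described above — a global pool of worker threads draining a queue of lambdas, with an AddTask method returning a std::future — might look like the following. All names here are illustrative, not the actual API from the patch:

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative sketch only: a fixed-size pool of worker threads draining a
// shared queue of tasks. Not the actual LLDB TaskPool implementation.
class SimpleTaskPool {
public:
  explicit SimpleTaskPool(
      unsigned num_threads = std::thread::hardware_concurrency()) {
    for (unsigned i = 0; i < num_threads; ++i)
      m_workers.emplace_back([this] { Worker(); });
  }

  ~SimpleTaskPool() {
    {
      std::lock_guard<std::mutex> guard(m_mutex);
      m_done = true;
    }
    m_cv.notify_all();
    for (auto &t : m_workers)
      t.join();
  }

  // Queue a callable; the returned future yields its result.
  template <typename F> auto AddTask(F &&f) -> std::future<decltype(f())> {
    using Ret = decltype(f());
    auto task =
        std::make_shared<std::packaged_task<Ret()>>(std::forward<F>(f));
    auto future = task->get_future();
    {
      std::lock_guard<std::mutex> guard(m_mutex);
      m_tasks.push([task] { (*task)(); });
    }
    m_cv.notify_one();
    return future;
  }

private:
  void Worker() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_done || !m_tasks.empty(); });
        if (m_tasks.empty())
          return; // m_done is set and no work is left
        task = std::move(m_tasks.front());
        m_tasks.pop();
      }
      task(); // run the task outside the lock
    }
  }

  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::queue<std::function<void()>> m_tasks;
  std::vector<std::thread> m_workers;
  bool m_done = false;
};
```

The actual patch additionally provides a TaskRunner for collecting results as they complete; this sketch only shows the core pool mechanics.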

Diff Detail

Repository
rL LLVM

Event Timeline

tberghammer retitled this revision from to Add task pool to LLDB.
tberghammer updated this object.
tberghammer added a subscriber: lldb-commits.
labath requested changes to this revision.Oct 14 2015, 8:38 AM
labath edited edge metadata.

Although it may not seem so from the number of my comments, I actually quite like this implementation. However, I think more work needs to be done to make this future-proof.

  • we definitely need unit tests for this.
  • we need more documentation (especially on the public interfaces, but the trickier internal parts could use some explanation as well).
include/lldb/Utility/TaskPool.h
23 ↗(On Diff #37351)

belonging to

24 ↗(On Diff #37351)

s/have/has

33 ↗(On Diff #37351)

Perhaps the entire TaskPool could be an implementation detail and this functionality would be accessed (via a static method on) TaskRunner? What do you think? I found it confusing in the other commit why you sometimes use TaskPool and other times TaskRunner...

80 ↗(On Diff #37351)

WaitForAllTasks

103 ↗(On Diff #37351)

Suggestion: name these Head, Tail. You use T for other purposes elsewhere, and I spent a lot of time trying to figure out why this is H and not F before I figured out what's going on here.

126 ↗(On Diff #37351)

task_sp

138 ↗(On Diff #37351)

You use T for both "task" and "the result value of a task". Could you disambiguate the two? Suggestion: Ret (or Return) and Task.

145 ↗(On Diff #37351)

This construction is quite convoluted. Do you actually need a list of the pending tasks anywhere? As far as I can tell, you only need to check whether there are any tasks pending, which can be done with a simple (atomic) integer.
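The reviewer's suggestion — replacing a list of pending tasks with a plain counter — could be sketched roughly like this (all names are illustrative, not from the patch):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative sketch: track only the *count* of pending tasks, not the
// tasks themselves, and let waiters block until it reaches zero.
class PendingCounter {
public:
  void TaskAdded() { ++m_pending; }

  void TaskFinished() {
    if (--m_pending == 0) {
      // Notify under the mutex so a waiter checking the predicate
      // cannot miss the wakeup.
      std::lock_guard<std::mutex> guard(m_mutex);
      m_cv.notify_all();
    }
  }

  // Blocks until every added task has reported completion.
  void WaitForAllTasks() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cv.wait(lock, [this] { return m_pending.load() == 0; });
  }

private:
  std::atomic<int> m_pending{0};
  std::mutex m_mutex;
  std::condition_variable m_cv;
};
```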

148 ↗(On Diff #37351)

Is std::forward needed here?

This revision now requires changes to proceed.Oct 14 2015, 8:38 AM
zturner edited edge metadata.Oct 14 2015, 9:55 AM

Can this be done in such a way that everything boils down to a single call to std::async on platforms that support thread limiting?

Alternatively, why not just put a semaphore inside the lambda that you run with std::async to limit the number of threads? This seems overly complicated when you could solve it much more simply in a way that makes the most efficient use of every platform's asynchronous handling support.
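For reference, that proposal would look roughly like the sketch below (C++11 has no standard counting semaphore, so one is hand-rolled here; all names and the squaring work are illustrative). Note the caveat raised in the replies: this limits how many task *bodies* run concurrently, but does not stop the implementation from spawning one OS thread per std::async call.

```cpp
#include <condition_variable>
#include <future>
#include <mutex>
#include <vector>

// Hand-rolled counting semaphore (std::counting_semaphore is C++20).
class CountingSemaphore {
public:
  explicit CountingSemaphore(unsigned count) : m_count(count) {}
  void Acquire() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cv.wait(lock, [this] { return m_count > 0; });
    --m_count;
  }
  void Release() {
    std::lock_guard<std::mutex> guard(m_mutex);
    ++m_count;
    m_cv.notify_one();
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  unsigned m_count;
};

// Launch one std::async call per item, but gate each body on the semaphore
// so at most `limit` bodies do work at once.
int SumWithGatedAsync(const std::vector<int> &values, unsigned limit) {
  CountingSemaphore sem(limit);
  std::vector<std::future<int>> futures;
  for (int v : values)
    futures.push_back(std::async(std::launch::async, [v, &sem] {
      sem.Acquire();      // blocks if `limit` bodies are already running
      int result = v * v; // the actual work
      sem.Release();
      return result;
    }));
  int total = 0;
  for (auto &f : futures)
    total += f.get();
  return total;
}
```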

zturner requested changes to this revision.Oct 14 2015, 9:55 AM
zturner edited edge metadata.

We can change this class to use std::async on Windows and an std::thread based implementation on Linux with the same interface (other OSes would need a decision), but I would prefer to use just one implementation as it is easier to maintain.

Limiting the number of threads by adding a semaphore to the beginning of the lambda given to std::async wouldn't work, because if the number of threads isn't limited inside the implementation of std::async, then we still create a huge number of threads that will be blocked on the semaphore call but still consume resources. Guarding the std::async calls with a semaphore can limit the number of threads, but it would make those calls blocking, which isn't really something I want to see.

On Linux, using this thread pooling code instead of std::async achieved a ~25% speedup in the DWARF parsing code. I haven't investigated the exact reasons, but I think the difference is that we don't create a lot of threads (thread creation isn't cheap) and the lower number of threads reduces congestion.

Ok, it seems reasonable to just use std::async on platforms where it works well, and not use it on platforms where it doesn't. I think it should be enabled on Windows, but I'll leave it up to you to decide whether it's a whitelist or a blacklist.

clayborg requested changes to this revision.Oct 14 2015, 11:21 AM
clayborg edited edge metadata.

I agree with labath's comments. Also, see if we can move TaskPoolImpl into the .cpp file.

include/lldb/Utility/TaskPool.h
39–66 ↗(On Diff #37351)

Can TaskPoolImpl be moved to the .cpp file in the anonymous namespace?

Any std::async adoption will need to be able to deliver tasks as they complete via "TaskRunner<T>::WaitForNextCompletedTask()".

We had a previous example where 1000 items could be pushed onto a std::vector of tasks and then the code was doing:

for (size_t i = 0; i < tasks.size(); ++i)
    tasks[i].wait();

This means that if the first of 1000 jobs took 100 seconds, but the other 999 took 0.1 seconds each, then it would stall the pipeline waiting for the first one to complete. I want to make sure we can do:

while (auto future = task_runner.WaitForNextCompletedTask())
{
    // process future.get() as soon as it is ready
}

So we can process the items as they complete. As long as we can still do this with the implementation we use, I will be happy.

I thought about this some more and I'm fine with going with a single implementation and not using std::async. It would be nice to take advantage of any deep optimizations std::async provides over a hand-rolled solution, but at the same time there's a cost to adding complexity and asymmetric implementations, even if the asymmetry is hidden behind this interface. And the gain we get from going single threaded -> this implementation is much greater than we would theoretically get from going from this implementation -> std::async.

I'll look more closely at the implementation and see if I have other comments, but the general idea is fine.

The only real suggestion / question I have is a design one.

By using this implementation we can't take advantage of the system thread pool. That was the point of using std::async in the first place, but we found that it doesn't always limit the number of threads. Maybe there's a way to get the best of both worlds.

What if, instead of storing a std::queue<std::function<void()>>, you instead store a std::queue<std::packaged_task>? Now the only problem that remains is how to guarantee that no more than std::thread::hardware_concurrency() of these packaged_tasks are running at any given time. You could do this by taking the std::function that someone gives you and wrapping it in a packaged_task which first runs the function and then signals a condition variable after the function completes. Then a single "dispatch" thread (for lack of a better word) could wake on this condition variable, pull a new packaged_task off the queue, and execute it asynchronously. You'd also need to signal that same condition variable when a new item is added to the queue, so that the dispatch thread could decide whether to run it immediately (if the pool is under-scheduled) or wait (if it's full).

This would probably also make the implementation quite a bit simpler, as well as letting us take advantage of any deep optimizations a platform has in its own thread pool implementation (if any).
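A rough sketch of this dispatch-thread design might look like the following (illustrative only; it queues plain std::function work rather than packaged_task for brevity, but keeps the single condition variable and a hardware-concurrency-style limit):

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative sketch: one dispatch thread hands queued work to std::async,
// never letting more than `limit` task bodies run at once.
class Dispatcher {
public:
  explicit Dispatcher(unsigned limit)
      : m_limit(limit), m_thread([this] { Run(); }) {}

  ~Dispatcher() {
    {
      std::lock_guard<std::mutex> guard(m_mutex);
      m_done = true;
    }
    m_cv.notify_one();
    m_thread.join(); // dispatch thread drains the queue before exiting
    for (auto &f : m_inflight)
      f.wait(); // wait for the last running task bodies
  }

  void Enqueue(std::function<void()> work) {
    {
      std::lock_guard<std::mutex> guard(m_mutex);
      m_queue.push_back(std::move(work));
    }
    m_cv.notify_one(); // wake the dispatch thread for the new item
  }

private:
  void Run() {
    std::unique_lock<std::mutex> lock(m_mutex);
    for (;;) {
      // Wake when shutdown is complete, or there is work and a free slot.
      m_cv.wait(lock, [this] {
        return (m_done && m_queue.empty()) ||
               (!m_queue.empty() && m_running < m_limit);
      });
      if (m_done && m_queue.empty())
        return;
      std::function<void()> work = std::move(m_queue.front());
      m_queue.pop_front();
      ++m_running;
      m_inflight.push_back(std::async(std::launch::async, [this, work] {
        work(); // run the task, then free the slot and wake the dispatcher
        {
          std::lock_guard<std::mutex> guard(m_mutex);
          --m_running;
        }
        m_cv.notify_one();
      }));
    }
  }

  unsigned m_limit;
  unsigned m_running = 0;
  bool m_done = false;
  std::deque<std::function<void()>> m_queue;
  std::vector<std::future<void>> m_inflight;
  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::thread m_thread; // declared last so it starts after other members
};
```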

Zach: If these are implementation details, let's get this in first and then worry about optimizations later.

Well, it's not just an optimization. Threading code is hard to reason about, and the more complicated an implementation is, the more likely it is to have race conditions or other problems. So any opportunity to reduce the amount of manual thread management is a win in that sense.

Anyway, it was mostly a question / suggestion, not something I'm going to block over, so if everyone feels it's ok to go in this way then go for it. But if you read through the CL, it's not easy to be 100% confident that there are no race conditions. If all you've got is a single condition variable which wakes up and sends something to the standard library implementation, it's a lot easier to have that confidence. The optimization is just a side benefit.

tberghammer edited edge metadata.
tberghammer marked 7 inline comments as done.
  • Address review comments
  • Add some more documentation
  • Add some unit tests

Zach: I thought about your idea and I think it won't make the code any simpler, because your suggestion would only implement the functionality that is currently provided by TaskPool and TaskPoolImpl, which already use just a single condition variable and a single queue (plus a mutex, which is needed in every implementation).

To implement the WaitForNextCompletedTask function we need some more logic that wouldn't be handled better by your approach either, because the ThreadPool is a global object while WaitForNextCompletedTask should wait only for a task from a given list, so you still have to add at least one additional storage with its own synchronization.

From an efficiency perspective, I don't think using std::async instead of a manually managed thread pool can make a big positive difference. The advantage of your solution would be that we don't have the worker threads running in the background when we don't use them, but on the other hand, in some implementations of std::async we might have to start a new thread for each call (I expect something like this to happen on Linux), which isn't cheap even when we know the number of threads is limited to a low number. If having the worker threads running in the background is an issue, then we can add teardown to my implementation with a smart heuristic about when to do it (e.g. 100 ms after the last task is done).

The main thought I had was just that it would make the code simpler and delegate more of the synchronization stuff to the library. But if we don't get any of that benefit then I guess there's no point in doing that.

That said, I really don't want to see 48 threads running inside the LLDB process at all times (I have a 48-core machine). It makes it really annoying to debug. So some way to tear down threads would be a big help. But that's even more complexity.

If this is only going to be used occasionally (like each time some DWARF is being parsed), then what about making TaskPool non-static? Just create a new TaskPool each time you want to do work in parallel and tear it down when you're done doing the work? If someone creates 2 task pools they get 2 different sets of threads. This is kind of why I prefer delegating to the standard library whenever possible, because it should already handle all of these cases.

Another option is to go back to the original one line implementation using std::async, but have each asynchronous thread work on num_compile_units / hardware_concurrency() compile units inside of a loop. That way you never get more than hardware_concurrency() threads.
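That chunking alternative could be sketched like this (the squaring work and all names are illustrative; the point is that the number of std::async calls, and hence threads, is bounded by hardware_concurrency()):

```cpp
#include <algorithm>
#include <future>
#include <thread>
#include <vector>

// Split the items into at most hardware_concurrency() chunks and hand each
// chunk to one std::async call, so each thread loops over many items.
long SumSquaresChunked(const std::vector<int> &items) {
  unsigned num_chunks = std::max(1u, std::thread::hardware_concurrency());
  size_t chunk_size = (items.size() + num_chunks - 1) / num_chunks;
  std::vector<std::future<long>> futures;
  for (size_t begin = 0; begin < items.size(); begin += chunk_size) {
    size_t end = std::min(begin + chunk_size, items.size());
    futures.push_back(std::async(std::launch::async, [&items, begin, end] {
      long sum = 0;
      for (size_t i = begin; i < end; ++i) // one thread handles a whole chunk
        sum += (long)items[i] * items[i];
      return sum;
    }));
  }
  long total = 0;
  for (auto &f : futures)
    total += f.get();
  return total;
}
```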

clayborg accepted this revision.Oct 15 2015, 9:47 AM
clayborg edited edge metadata.

Very nice! I look forward to seeing this patch and the associated DWARF patches make it in!

tfiala added a subscriber: tfiala.Oct 15 2015, 9:49 AM

The main thought I had was just that it would make the code simpler and delegate more of the synchronization stuff to the library. But if we don't get any of that benefit then I guess there's no point in doing that.

That said, I really don't want to see 48 threads running inside the LLDB process at all times (I have a 48-core machine). It makes it really annoying to debug. So some way to tear down threads would be a big help. But that's even more complexity.

If this is only going to be used occasionally (like each time some DWARF is being parsed), then what about making TaskPool non-static? Just create a new TaskPool each time you want to do work in parallel and tear it down when you're done doing the work? If someone creates 2 task pools they get 2 different sets of threads. This is kind of why I prefer delegating to the standard library whenever possible, because it should already handle all of these cases.

Ooo - danger lies down that path. We don't want different parts of lldb that can be thread-ized (and there are several of them) all doing work as if they have all the threads available to them. That should be managed by a global thread pool for the app. I think we'll see much more use of this facility once we have it in.

Another option is to go back to the original one line implementation using std::async, but have each asynchronous thread work on num_compile_units / hardware_concurrency() compile units inside of a loop. That way you never get more than hardware_concurrency() threads.

The main thought I had was just that it would make the code simpler and delegate more of the synchronization stuff to the library. But if we don't get any of that benefit then I guess there's no point in doing that.

That said, I really don't want to see 48 threads running inside the LLDB process at all times (I have a 48-core machine). It makes it really annoying to debug. So some way to tear down threads would be a big help. But that's even more complexity.

If this is only going to be used occasionally (like each time some DWARF is being parsed), then what about making TaskPool non-static? Just create a new TaskPool each time you want to do work in parallel and tear it down when you're done doing the work? If someone creates 2 task pools they get 2 different sets of threads. This is kind of why I prefer delegating to the standard library whenever possible, because it should already handle all of these cases.

We really need a global thread pool so we don't make too many threads. Many GUI components have 3 or 4 threads, we have a private process state thread that sometimes does work and responds to the dynamic loader breakpoints, the dynamic loaders often do work, and when shared libraries get loaded we have LanguageRuntime and other plug-ins that look for a symbol inside one or more modules. So this should definitely be a global queue. If you run task_runner.WaitForAllTasks() you could actually have the implementation run one task on the current thread instead of just parking and waiting, so you will probably get as good performance as before even if all the threads from the thread pool are busy. So this needs to stay global so we only spawn up to N threads in a process when using the thread pool. We can add a setting to LLDB that can be changed via "settings set" to adjust the concurrency if needed. If you spawn too many threads you can bring a machine to a halt quite quickly.

Another option is to go back to the original one line implementation using std::async, but have each asynchronous thread work on num_compile_units / hardware_concurrency() compile units inside of a loop. That way you never get more than hardware_concurrency() threads.

Again, this means you won't get more than "hardware_concurrency()" threads for each invocation of std::async, right?

The main thought I had was just that it would make the code simpler and delegate more of the synchronization stuff to the library. But if we don't get any of that benefit then I guess there's no point in doing that.

That said, I really don't want to see 48 threads running inside the LLDB process at all times (I have a 48-core machine). It makes it really annoying to debug. So some way to tear down threads would be a big help. But that's even more complexity.

If this is only going to be used occasionally (like each time some DWARF is being parsed), then what about making TaskPool non-static? Just create a new TaskPool each time you want to do work in parallel and tear it down when you're done doing the work? If someone creates 2 task pools they get 2 different sets of threads. This is kind of why I prefer delegating to the standard library whenever possible, because it should already handle all of these cases.

We really need a global thread pool so we don't make too many threads. Many GUI components have 3 or 4 threads, we have a private process state thread that sometimes does work and responds to the dynamic loader breakpoints, the dynamic loaders often do work, and when shared libraries get loaded we have LanguageRuntime and other plug-ins that look for a symbol inside one or more modules. So this should definitely be a global queue. If you run task_runner.WaitForAllTasks() you could actually have the implementation run one task on the current thread instead of just parking and waiting, so you will probably get as good performance as before even if all the threads from the thread pool are busy. So this needs to stay global so we only spawn up to N threads in a process when using the thread pool. We can add a setting to LLDB that can be changed via "settings set" to adjust the concurrency if needed. If you spawn too many threads you can bring a machine to a halt quite quickly.

Another option is to go back to the original one line implementation using std::async, but have each asynchronous thread work on num_compile_units / hardware_concurrency() compile units inside of a loop. That way you never get more than hardware_concurrency() threads.

Again, this means you won't get more than "hardware_concurrency()" threads for each invocation of std::async, right?

Yes. But an implementation of std::async is either going to use a global thread pool or it's not. If it does there's no problem, and if it doesn't, it's going to create one thread for each call, and when the work in that thread is done, the thread will exit. You wouldn't have a situation where you call std::async 100 times, join on all 100 results, and the 100 threads are still sitting around. After the work is done the threads would go away. So by dividing the work into no more than hardware_concurrency chunks and sending those chunks to separate invocations of std::async, you're safe.

Having a global thread pool is fine, but if that's what we want to do I think it needs to be a much smaller value than hardware_concurrency. Because having that many threads sitting around doing nothing all the time makes debugging LLDB a real chore (not to mention being wasteful).

tberghammer edited edge metadata.

Addressing comments from the discussion by destroying the threads not in use, while keeping a global thread pool with at most hardware_concurrency threads.

IMO this update also simplifies the implementation of the ThreadPool by removing the condition_variable.

A possible future improvement is to destroy the threads only if they are idle for a given time (e.g. 50 ms), so we still don't have them hanging around while avoiding possibly unnecessary thread creation and destruction in case tasks come in slowly.
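That idle-timeout idea could be sketched as a worker loop built on condition_variable::wait_for (the timeout value and all names are illustrative):

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative shared state for a pool whose workers retire when idle.
struct WorkQueue {
  std::mutex mutex;
  std::condition_variable cv;
  std::queue<std::function<void()>> tasks;
  bool shutting_down = false;
};

// Worker body: run tasks until the queue stays empty past `idle_timeout`,
// then return so the thread can be torn down (and re-created on demand).
void WorkerLoop(WorkQueue &q, std::chrono::milliseconds idle_timeout) {
  std::unique_lock<std::mutex> lock(q.mutex);
  for (;;) {
    if (!q.cv.wait_for(lock, idle_timeout, [&q] {
          return q.shutting_down || !q.tasks.empty();
        }))
      return; // idle for the whole timeout: retire this thread
    if (q.tasks.empty())
      return; // woken for shutdown with no work left
    std::function<void()> task = std::move(q.tasks.front());
    q.tasks.pop();
    lock.unlock();
    task(); // run the task outside the lock
    lock.lock();
  }
}
```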

Yes. But an implementation of std::async is either going to use a global thread pool or it's not. If it does there's no problem, and if it doesn't, it's going to create one thread for each call, and when the work in that thread is done, the thread will exit. You wouldn't have a situation where you call std::async 100 times, join on all 100 results, and the 100 threads are still sitting around. After the work is done the threads would go away. So by dividing the work into no more than hardware_concurrency chunks and sending those chunks to separate invocations of std::async, you're safe.

Performance will be worse if there is a thread created for each task as thread creation is very expensive and can often be slower than just doing things serially for small tasks.

Having a global thread pool is fine, but if that's what we want to do I think it needs to be a much smaller value than hardware_concurrency. Because having that many threads sitting around doing nothing all the time makes debugging LLDB a real chore (not to mention being wasteful).

I don't agree. They are either parked doing nothing, taking up no real resources, or they are doing something useful. When tasks come in, the parked threads are ready to go, with no spin-up time like when using std::async(). Debugging LLDB already involves many threads, and on MacOSX there are already some libdispatch threads around and they don't add much noise IMHO. We could actually back this new approach with libdispatch on MacOSX, but I would want to verify that the performance is as good or better than the current approach before doing so. The benefit of libdispatch is that it will tear down worker threads if they haven't been used in a while. But we could do the same with our approach.

If you want to optimize windows with std::async() I would say go for it. Maybe unit tests can run using multiple ways (std::async, TaskPool, libdispatch) and then recommend a default for your platform based on real performance tests. Demangling a bunch of names is one example that could be used for performance testing this.

Addressing comments from the discussion by destroying the threads not in use, while keeping a global thread pool with at most hardware_concurrency threads.

We should perf test this, but I would be OK with parked threads to tell the truth. I think we should prove they are taking up resources first before we try and destroy them.

IMO this update also simplifies the implementation of the ThreadPool by removing the condition_variable.

A possible future improvement is to destroy the threads only if they are idle for a given time (e.g. 50 ms), so we still don't have them hanging around while avoiding possibly unnecessary thread creation and destruction in case tasks come in slowly.

Not sure what the right timeout would be, I would rather just leave them parked and ready unless we prove they are taking up resources. They are going to be parked in the kernel awaiting synchronization, so I don't really see the point of tearing the thread down.

I agree that parking the threads in the kernel should be very cheap (I am not sure about Windows), but on high-end multi-core machines having 40+ threads around just waiting for work is a bit strange for me too, and it definitely makes debugging LLDB more difficult.

In the near future I expect that we will use the TaskPool only for very heavy computations while we fill up our internal caches (such as debug info parsing or symtab parsing), and then not use it for anything later.

I agree that parking the threads in the kernel should be very cheap (I am not sure about Windows), but on high-end multi-core machines having 40+ threads around just waiting for work is a bit strange for me too, and it definitely makes debugging LLDB more difficult.

Ok, maybe we can pick a larger timeout then like say 2 seconds?

In the near future I expect that we will use the TaskPool only for very heavy computations while we fill up our internal caches (such as debug info parsing or symtab parsing), and then not use it for anything later.

I want to use it in many ModuleList functions like:

ModuleList::FindFunctionSymbols()
ModuleList::FindFunctions()
ModuleList::FindCompileUnits()
ModuleList::FindGlobalVariables()
ModuleList::FindSymbolsWithNameAndType()
ModuleList::FindSymbolsMatchingRegExAndType()

and many more.

I agree that parking the threads in the kernel should be very cheap (I am not sure about Windows), but on high-end multi-core machines having 40+ threads around just waiting for work is a bit strange for me too, and it definitely makes debugging LLDB more difficult.

Ok, maybe we can pick a larger timeout then like say 2 seconds?

Yeah I definitely like the idea of retiring the threads when they're not getting used for some length of time (and configurable but in the seconds range sounds reasonable as a default). When we're on OS X machines, they tend to be capped out at 8 cores (although the Mac Pros can get up to 24 logical cores). It is ugly on systems with higher core counts to leave threads around.

(As an aside, it might be interesting to have the debugger keep track of the last time a thread ran and hide threads that have been inactive if we know about well-known "park" locations, but not for the general case --- IDEs could conceivably make use of that to hide dormant threads. Although I suppose the IDE could do that entirely without lldb tagging a thread state as dormant).

In the near future I expect that we will use the TaskPool only for very heavy computations while we fill up our internal caches (such as debug info parsing or symtab parsing), and then not use it for anything later.

I want to use it in many ModuleList functions like:

ModuleList::FindFunctionSymbols()
ModuleList::FindFunctions()
ModuleList::FindCompileUnits()
ModuleList::FindGlobalVariables()
ModuleList::FindSymbolsWithNameAndType()
ModuleList::FindSymbolsMatchingRegExAndType()

and many more.

Yes. But an implementation of std::async is either going to use a global thread pool or it's not. If it does there's no problem, and if it doesn't, it's going to create one thread for each call, and when the work in that thread is done, the thread will exit. You wouldn't have a situation where you call std::async 100 times, join on all 100 results, and the 100 threads are still sitting around. After the work is done the threads would go away. So by dividing the work into no more than hardware_concurrency chunks and sending those chunks to separate invocations of std::async, you're safe.

Performance will be worse if there is a thread created for each task as thread creation is very expensive and can often be slower than just doing things serially for small tasks.

Having a global thread pool is fine, but if that's what we want to do I think it needs to be a much smaller value than hardware_concurrency. Because having that many threads sitting around doing nothing all the time makes debugging LLDB a real chore (not to mention being wasteful).

I don't agree. They are either parked doing nothing, taking up no real resources, or they are doing something useful. When tasks come in, the parked threads are ready to go, with no spin-up time like when using std::async(). Debugging LLDB already involves many threads, and on MacOSX there are already some libdispatch threads around and they don't add much noise IMHO. We could actually back this new approach with libdispatch on MacOSX, but I would want to verify that the performance is as good or better than the current approach before doing so. The benefit of libdispatch is that it will tear down worker threads if they haven't been used in a while. But we could do the same with our approach.

If you want to optimize windows with std::async() I would say go for it. Maybe unit tests can run using multiple ways (std::async, TaskPool, libdispatch) and then recommend a default for your platform based on real performance tests. Demangling a bunch of names is one example that could be used for performance testing this.

I agree having a few threads around isn't a big deal, but 48-60 is a really high amount of noise when you debug LLDB and get a list of threads. I'm kind of leaning back towards just making this use std::async on Windows because having 60-70 threads in my thread list every time I start up LLDB is going to be a big problem. At the same time, even if I had only 32 cores, or 16 cores, I don't think I would want LLDB trying to consume 100% of my CPU *ever* because it makes the machine unusable even if only for a short period of time. If I'm trying to load 2 GB of debug information (not a hypothetical situation) this is going to be a big problem.

We could try heuristics but I feel like this becomes a lot easier and less error prone by just letting the system manage the thread pool wherever possible.

Tamas: I know you mentioned earlier that it's possible to have 2 different implementations, one that delegates to std::async and one that uses this approach. Looking over the code it looks like only a couple lines of code need to be different. If it's that straightforward, can we just do that? This would also give Greg and others on FreeBSD or elsewhere a chance to try flipping the switch on their own platform and seeing which gives better results.

Yes. But an implementation of std::async is either going to use a global thread pool or it's not. If it does there's no problem, and if it doesn't, it's going to create one thread for each call, and when the work in that thread is done, the thread will exit. You wouldn't have a situation where you call std::async 100 times, join on all 100 results, and the 100 threads are still sitting around. After the work is done the threads would go away. So by dividing the work into no more than hardware_concurrency chunks and sending those chunks to separate invocations of std::async, you're safe.

Performance will be worse if there is a thread created for each task as thread creation is very expensive and can often be slower than just doing things serially for small tasks.

Having a global thread pool is fine, but if that's what we want to do I think it needs to be a much smaller value than hardware_concurrency. Because having that many threads sitting around doing nothing all the time makes debugging LLDB a real chore (not to mention being wasteful).

I don't agree. They are either parked doing nothing, taking up no real resources, or they are doing something useful. When tasks come in, the parked threads are ready to go, with no spin-up time like when using std::async(). Debugging LLDB already involves many threads, and on MacOSX there are already some libdispatch threads around and they don't add much noise IMHO. We could actually back this new approach with libdispatch on MacOSX, but I would want to verify that the performance is as good or better than the current approach before doing so. The benefit of libdispatch is that it will tear down worker threads if they haven't been used in a while. But we could do the same with our approach.

If you want to optimize windows with std::async() I would say go for it. Maybe unit tests can run using multiple ways (std::async, TaskPool, libdispatch) and then recommend a default for your platform based on real performance tests. Demangling a bunch of names is one example that could be used for performance testing this.

I agree having a few threads around isn't a big deal, but 48-60 is a really high amount of noise when you debug LLDB and get a list of threads. I'm kind of leaning back towards just making this use std::async on Windows because having 60-70 threads in my thread list every time I start up LLDB is going to be a big problem. At the same time, even if I had only 32 cores, or 16 cores, I don't think I would want LLDB trying to consume 100% of my CPU *ever* because it makes the machine unusable even if only for a short period of time. If I'm trying to load 2 GB of debug information (not a hypothetical situation) this is going to be a big problem.

We could try heuristics but I feel like this becomes a lot easier and less error prone by just letting the system manage the thread pool wherever possible.

Tamas: I know you mentioned earlier that it's possible to have 2 different implementations, one that delegates to std::async and one that uses this approach. Looking over the code it looks like only a couple lines of code need to be different. If it's that straightforward, can we just do that? This would also give Greg and others on FreeBSD or elsewhere a chance to try flipping the switch on their own platform and seeing which gives better results.

Yes, I can make an implementation that is based on std::async (the only necessary change is to call std::async in TaskPool::AddTask instead of using the current implementation), but I think if we start 5000 std::async calls at the same time then it will still consume 100% of your CPU.

Greg: Would you prefer to have the threads in the thread pool always around, or to stop them when there is no work to do? (I don't really have a preference about it from the Linux side, especially if you plan to use the thread pool for a lot of tasks.)

Stopping the threads is fine as it seems to be desired from most others.

Zachary, I don't think using std::async is a good idea because it provides a very different threading model than the one we want here. Let me demonstrate that with an example:

#include <condition_variable>
#include <cstdio>
#include <future>
#include <mutex>
#include <vector>

using namespace std;


unsigned X = 0;
mutex m;
condition_variable cv;

int f()
{
    unique_lock<mutex> l(m);
    X++;
    printf("X = %d\n", X);
    cv.wait(l, [] { return X > 1000; });
    cv.notify_one();
    return X;
}

int main()
{
    vector<future<int>> v;
    for(unsigned i = 0; i < 1000; ++i) 
        v.push_back(async(launch::async, f));

    v.push_back(async(launch::async, f)); // break here

    for(auto &f: v)
        printf("future = %d\n", f.get());

    return 0;
}

This (convoluted) example starts 1000 (+1) asynchronous tasks and wires them up in a way that they can't complete until all of them are running. If the implementation was limiting the number of concurrently running tasks to n<1000, then this program would not complete. Nevertheless, it *does* complete.

When I run this program under Linux, it completes instantly. That's because the Linux implementation of std::async just kicks off a new thread for each task and lets them run freely. Windows (for better or for worse) does something different here. Initially, it only spawns a small number of tasks and then periodically checks if the tasks have finished, and if not, it starts more of them. That's why it takes this program several minutes to complete, but it still completes, and if you check it with a debugger in the end, you will see that there were 1000 threads running (i.e. it was not doing anything clever, like multiplexing multiple tasks over the same OS thread, etc.). This is not the threading model we want here, I think (in fact we should probably ban any communication between the tasks in the pool). Our requirement seems to be "being able to limit the number of tasks running concurrently". Having the thread pool exhibit one behavior on Windows and another elsewhere would be very confusing and error-prone.
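For contrast, the semantics being asked for -- a hard cap on the number of concurrently running tasks -- look roughly like the minimal pool below. This is an illustrative sketch, not the code under review; class and member names are made up:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// A fixed-size pool: at most `n` tasks ever run at once, the rest queue up.
class SimplePool {
public:
    explicit SimplePool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            m_threads.emplace_back([this] { Worker(); });
    }

    // The destructor drains the remaining queue, then joins all workers.
    ~SimplePool() {
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_done = true;
        }
        m_cv.notify_all();
        for (auto &t : m_threads)
            t.join();
    }

    void AddTask(std::function<void()> task) {
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_tasks.push_back(std::move(task));
        }
        m_cv.notify_one();
    }

private:
    void Worker() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [this] { return m_done || !m_tasks.empty(); });
                if (m_tasks.empty())
                    return; // shutting down and nothing left to run
                task = std::move(m_tasks.front());
                m_tasks.pop_front();
            }
            task(); // run outside the lock so other workers can proceed
        }
    }

    std::vector<std::thread> m_threads;
    std::deque<std::function<void()>> m_tasks;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_done = false;
};
```

Note that Pavel's 1000-task example above would deadlock on a pool like this, which is exactly why inter-task communication should be banned.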

Having said that, I do think there is one more thing that speaks against having hardware_concurrency() threads running constantly, that we haven't considered yet. It is our test suite. It already spawns a large number of LLDB processes in parallel, and having each process spawn a large number of threads might be dangerous. Tamas, if you're going to keep the threads alive, I think we should evaluate the impact of it on our test suite.

Considering all of this, I think it is a good idea to shut down threads when they are not used as an initial implementation. Later, we can evaluate potential improvements, like keeping some number of threads on stand-by.
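The idle-teardown behavior can be sketched like this (the helper name, signature, and timeout handling are hypothetical, not part of this patch): a worker waits for new work with a timeout, and a false return tells the caller to retire the thread so it can be respawned later on demand:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Wait until `has_work` becomes true or `idle` elapses. Returns true if
// work arrived; false on timeout, which the worker loop would treat as
// "this thread has been idle too long -- tear it down".
bool WaitForWorkOrTimeout(std::mutex &m, std::condition_variable &cv,
                          bool &has_work, std::chrono::milliseconds idle) {
    std::unique_lock<std::mutex> lock(m);
    // wait_for with a predicate returns the predicate's value, so a
    // spurious wakeup cannot be mistaken for real work arriving.
    return cv.wait_for(lock, idle, [&] { return has_work; });
}
```

Keeping some threads on stand-by would then just be a policy decision in the caller: respawn eagerly, or only when the queue is non-empty.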

Create optional std::async based implementation

Based on Pavel's example and some additional experimenting we have done, I am not sure std::async will give us any benefit even on Windows, as it doesn't limit the number of threads to the number of cores (because it can't if it wants to conform to the standard). I have no problem with having two implementations (considering that one of them is very simple), but if it doesn't give us any benefit, then I would prefer to ignore it.


To back these with some numbers, let me explain what I have done: I have modified the f() function in my example to do some actual work (count to a very large number). On my Windows VM with hardware_concurrency()==16, after one second, I had about 70 threads running (more than 4 times the number of cpus). Since our tasks are going to be cpu-intensive, there is no point in having more than 16 of them running, and using std::async will cause a lot of contention, which will slow everything down. I believe the async implementation is not suited for our purpose here and we should not use it.
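The experiment described here can be roughly reproduced with something like the following; the iteration and task counts are placeholders, not the exact numbers used:

```cpp
#include <future>
#include <vector>

// volatile store keeps the busy loop from being optimized away.
volatile unsigned long g_sink;

// A CPU-bound stand-in for "count to a very large number".
int busy_f(unsigned long iterations) {
    for (unsigned long i = 0; i < iterations; ++i)
        g_sink = i;
    return 0;
}

// Kick off many more CPU-bound tasks than cores. On Windows' std::async
// the runtime keeps spawning extra threads anyway, producing the
// oversubscription measured above (~70 threads on a 16-core VM).
int run_experiment(unsigned tasks, unsigned long iterations) {
    std::vector<std::future<int>> v;
    for (unsigned i = 0; i < tasks; ++i)
        v.push_back(std::async(std::launch::async, busy_f, iterations));
    int sum = 0;
    for (auto &f : v)
        sum += f.get();
    return sum;
}
```

Watching the process's thread count while this runs (e.g. in a debugger) is what showed the 70-threads-on-16-cores behavior.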

I'm in agreement with Pavel's assessment on this.

Regarding the tests, I had brought that up with Greg earlier. We can handle that in a few different ways:

  • We do nothing, and perhaps some collections of tests run slower if they are all pounding on their thread pools at the same time with heavy CPU work.
  • We add a "max cpu" setting to lldb and just have an option that flips it to something relatively low (enough to test the concurrent nature but not giving it too much CPU). The test suite can then limit this.
  • Add a ninja-like cpu monitor to the test runner for the non-pool test runners (and also get those running right on Windows), and only keep enough test workers running to keep the machine hot. That would self-adjust. (I haven't thought about this at all, so I don't yet have an idea of how hard it will be to do it in an all-platform-encompassing way, but I'm sure it's a solvable problem.)

Just some thoughts....

zturner accepted this revision.Oct 16 2015, 8:41 AM
zturner edited edge metadata.

Your example makes sense. So go ahead and delete the std::async implementation for now. As Tamas said, it's very easy to put in, so a better thing to do perhaps is, once there's something doing actual parallel work, just test it in a real situation. I still wish there was a way to avoid reinventing the wheel here, but it looks like there's nothing that gives the semantics we need.

Sorry, should have mentioned that the lgtm is contingent on Todd's concerns. I like the idea of having a setting in LLDB that controls the maximum number of threads to use for parallel work. We should probably set this at a very high level in dotest so that all instances automatically get this value without having to remember to do it in each test.

I thought a bit about the points raised by Todd, and in general I am happy with overloading the system during testing with a very high number of tests, as it will most likely help us detect race conditions, and based on my testing (on a 40-core Linux machine) it doesn't make any measurable difference in testing time.

I am fine with the idea of creating a setting to specify the number of threads in the thread pool (even though I don't see any immediate use case for it), but implementing it isn't easy, because in the current system every setting is owned by an instance of the Debugger class (explicitly or implicitly). Adding a setting there to limit the number of threads would require the thread pool to get a reference to the Debugger instance, which isn't a feasible solution. (Who would pass the Debugger to the thread pool? What happens when we have multiple debuggers?) The best solution I can currently think of (though I don't like it) is to specify the maximum number of threads with an environment variable so it can be set before starting LLDB, but it isn't a nice interface, and using it when LLDB is loaded as a library is also problematic.
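The environment-variable fallback could look something like this; the variable name LLDB_MAX_THREADS is made up for illustration, and this is not code from the patch:

```cpp
#include <cstdlib>
#include <thread>

// Read the thread cap from the environment once at startup, falling back
// to hardware_concurrency() when the variable is unset or unparsable.
unsigned GetMaxPoolThreads() {
    if (const char *env = std::getenv("LLDB_MAX_THREADS")) {
        unsigned long n = std::strtoul(env, nullptr, 10);
        if (n > 0)
            return static_cast<unsigned>(n);
    }
    unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 4; // hardware_concurrency() may legally return 0
}
```

This sidesteps the Debugger-ownership problem, at the cost of being awkward for library users, as noted above.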

My preference would be to commit in the current implementation and add the setting to limit the number of threads afterward so we can start adding multi threaded code to LLDB. What do you think?

labath accepted this revision.Oct 19 2015, 7:30 AM
labath edited edge metadata.

I think this can go in now. Currently, the parallelization happens at the compile unit level, and the test programs generally have only one compile unit. Since the threads are created on an as-needed basis, I don't expect any significant impact of these changes on the test suite.

include/lldb/Utility/TaskPool.h
30 ↗(On Diff #37566)

s/Non/None/

36 ↗(On Diff #37566)

thread pool/task pool ?
(several other places as well)

69 ↗(On Diff #37566)

doesn't

source/Utility/TaskPool.cpp
51 ↗(On Diff #37566)

num_threads is now unused.

This revision is now accepted and ready to land.Oct 19 2015, 7:30 AM

That seems totally reasonable. We're fine with "this isn't an issue until it's proven to be an issue."

(And I like stressing the concurrency - we initially found a number of issues when the concurrent test runner came online in the first place).

Yes, let's get this in and iterate on it afterwards!

This revision was automatically updated to reflect the committed changes.
tberghammer marked 4 inline comments as done.