LLVM's ThreadPool recently gained the concept of a ThreadPoolTaskGroup: a way to "partition" the thread pool into groups of tasks, enabling nested parallelism through this grouping at every level of nesting.
We make use of this feature in the MLIR threading abstraction to fix a long-standing TODO and enable nested parallelism.
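As a rough illustration, here is a minimal sketch of how ThreadPoolTaskGroup supports nesting, assuming LLVM's llvm::ThreadPool API; the function name, loop bounds, and leaf work are placeholders, not code from this patch:

```cpp
#include "llvm/Support/ThreadPool.h"

void nestedExample(llvm::ThreadPool &pool) {
  llvm::ThreadPoolTaskGroup outer(pool);
  for (int i = 0; i < 4; ++i) {
    outer.async([&pool] {
      // Each outer task may open its own group on the same pool.
      llvm::ThreadPoolTaskGroup inner(pool);
      for (int j = 0; j < 4; ++j)
        inner.async([] { /* leaf work */ });
      // When called from a worker thread, waiting on a group processes
      // other tasks of that group instead of blocking, so nesting does
      // not deadlock the pool.
      inner.wait();
    });
  }
  outer.wait();
}
```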
Event Timeline
mlir/include/mlir/IR/Threading.h, lines 71–72:
Shouldn't we be waiting on the task group here now instead? With that, can we also drop the need to capture all of the futures explicitly?
mlir/include/mlir/IR/Threading.h, lines 71–72:
Presumably. It's less code, and note that only ThreadPool's wait() can process other tasks while waiting; waiting on futures blocks (see the last paragraph of the ThreadPool doc description).
Remove futures and use taskgroup.wait()
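For context, the shape of this change can be sketched as follows, assuming the llvm::ThreadPool API of the time; runTasks, runTasksWithGroup, numTasks, and processFn are illustrative names, not code from the diff:

```cpp
#include "llvm/Support/ThreadPool.h"
#include <functional>
#include <future>
#include <vector>

// Before: capture one future per task and block on each of them.
// A future's wait() blocks outright and cannot process other tasks.
void runTasks(llvm::ThreadPool &pool, unsigned numTasks,
              std::function<void()> processFn) {
  std::vector<std::shared_future<void>> futures;
  for (unsigned i = 0; i < numTasks; ++i)
    futures.push_back(pool.async(processFn));
  for (auto &future : futures)
    future.wait();
}

// After: a task group owns the tasks, and wait() on the group can also
// process pending tasks of that group when called from a worker thread.
void runTasksWithGroup(llvm::ThreadPool &pool, unsigned numTasks,
                       std::function<void()> processFn) {
  llvm::ThreadPoolTaskGroup group(pool);
  for (unsigned i = 0; i < numTasks; ++i)
    group.async(processFn);
  group.wait();
}
```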
mlir/include/mlir/IR/Threading.h, lines 71–72:
Oh right! Of course...
mlir/include/mlir/IR/Threading.h, lines 74–75:
This only applies to worker threads, though; if this is the main thread, won't we block? That would be a regression from the current behavior, which lets the main thread also participate in the work. Should we still process one of the work items on this thread as before?
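The pre-existing behavior being referred to looks roughly like this: a hedged sketch of a parallel-for pattern in which every participant, including the caller, pulls items from a shared atomic counter. Names such as parallelApply are illustrative, and getThreadCount() is assumed to be the pool's concurrency accessor:

```cpp
#include "llvm/Support/ThreadPool.h"
#include <atomic>
#include <functional>

void parallelApply(llvm::ThreadPool &pool, unsigned numElements,
                   std::function<void(unsigned)> fn) {
  std::atomic<unsigned> curIndex(0);
  auto processFn = [&] {
    unsigned index;
    while ((index = curIndex++) < numElements)
      fn(index);
  };
  llvm::ThreadPoolTaskGroup group(pool);
  // Queue one task fewer than there are workers...
  for (unsigned i = 1, e = pool.getThreadCount(); i < e; ++i)
    group.async(processFn);
  // ...and let the calling ("main") thread chew through items as well
  // before waiting on the group.
  processFn();
  group.wait();
}
```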
mlir/include/mlir/IR/Threading.h, lines 74–75:
Yes, the main thread will block here. But assuming the thread pool is configured with optimal concurrency (I don't know if that's the case), the main thread wouldn't have a CPU core to run on and would have to compete for one with the thread pool's threads. Calling the process function on the main thread before waiting would only save spawning one thread pool thread (assuming it has not already been spawned by something else).
Adding Chris for visibility; I know he was thinking about our threading model and utilities recently!
mlir/include/mlir/IR/Threading.h, lines 74–75:
If the main thread is competing with the worker threads, then it is already a problem by the time you get to this point, isn't it? I'll keep the pre-existing behavior; the good point raised here is that the ThreadPool size should perhaps account for the "main" thread.
mlir/include/mlir/IR/Threading.h, lines 74–75:
What I mean is this: suppose you get to this point and only the main thread exists. The thread pool is set up to use up to the number of available CPU cores, so if you queue up enough tasks, thread pool tasks will occupy all CPU cores, and keeping the main thread running would make it compete for a core that should be used by a worker thread. You seemingly get around this by queuing one fewer thread pool task, but that doesn't work with recursive usage. Say you have 8 CPU cores: on the first call here you launch 7 worker threads and keep the main thread working too. So far so good. But if one of those tasks calls here recursively, it queues another 7 tasks, and the thread pool will launch the 8th and last thread. Now you have 9 active threads for 8 CPU cores.
Now that ThreadPool::wait() can keep processing tasks while waiting for a group when called from a worker thread, it presumably shouldn't be difficult to make that work in all cases, assuming you know that the main thread will always wait() and not, e.g., get stuck waiting on a socket. But is one extra spawned thread worth spending time on that?
mlir/include/mlir/IR/Threading.h, lines 74–75:
What I had in mind was actually to initialize the thread pool with 7 threads on an 8-core machine. I agree with you that if you let the thread pool run 8 threads it'll oversubscribe inconsistently. When I referred to the pool already competing with the main thread, I was assuming that other tasks were potentially already running in the pool. Anyway, I'm fine with going either way here; the fact that the default strategy for the thread pool is to use as many threads as there are cores makes me favor @llunak's approach. @rriddle, is this fine with you?
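A hedged sketch of the sizing idea mentioned above, reserving one core for the calling thread; whether the pool should actually be sized this way is exactly what is being debated here, and sizedPoolExample is a hypothetical name:

```cpp
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"

void sizedPoolExample() {
  // Leave one core for the "main" thread: e.g. 7 workers on an
  // 8-core machine, as suggested above.
  unsigned cores = llvm::hardware_concurrency().compute_thread_count();
  llvm::ThreadPool pool(
      llvm::hardware_concurrency(cores > 1 ? cores - 1 : 1));
  // ... submit work to the pool and wait ...
}
```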
mlir/include/mlir/IR/Threading.h, lines 74–75:
I agree that is a better approach. I am just recalling some serious performance problems with doing that before. It would be nice if we could benchmark something reasonable to see what the results of this are.