This is an archive of the discontinued LLVM Phabricator instance.

Enable the use of ThreadPoolTaskGroup in MLIR threading helper to enable nested parallelism
ClosedPublic

Authored by mehdi_amini on May 3 2022, 9:59 PM.

Details

Summary

The LLVM ThreadPool recently gained the concept of a
ThreadPoolTaskGroup: this is a way to "partition" the thread pool
into groups of tasks and enable nested parallelism through this
grouping at every level of nesting.
We make use of this feature in MLIR's threading abstraction to fix a
long-standing TODO and enable nested parallelism.
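The core idea is that each level of nesting only waits on its own group of subtasks, so recursion never deadlocks waiting on the whole pool. The sketch below illustrates that shape using only the standard library (`std::async`), not LLVM's actual `ThreadPool`/`ThreadPoolTaskGroup` classes; in LLVM the same per-level waiting happens on a fixed-size pool.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Recursively sum a range, splitting into two subtasks per level.
// Each invocation waits only on the future of its own "group", so
// nesting to arbitrary depth is safe. (Illustrative only; LLVM's
// ThreadPoolTaskGroup provides the same per-level wait on a shared pool.)
long long parallelSum(const std::vector<int> &data, std::size_t lo,
                      std::size_t hi) {
  if (hi - lo <= 1024)
    return std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);
  std::size_t mid = lo + (hi - lo) / 2;
  // The left half runs as this level's grouped subtask.
  auto left =
      std::async(std::launch::async, parallelSum, std::cref(data), lo, mid);
  long long right = parallelSum(data, mid, hi); // current thread helps too
  return left.get() + right;
}
```

With `std::async` every subtask gets its own thread, which sidesteps the pool-exhaustion problem the task-group mechanism solves for a bounded pool; the sketch only shows the wait-per-level structure.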

Diff Detail

Event Timeline

mehdi_amini created this revision. · May 3 2022, 9:59 PM
Herald added a project: Restricted Project. · May 3 2022, 9:59 PM
mehdi_amini requested review of this revision. · May 3 2022, 9:59 PM
rriddle added inline comments. · May 3 2022, 10:14 PM
mlir/include/mlir/IR/Threading.h
74–75

Shouldn't we be waiting on the task group here now instead? With that, can we also drop the need to capture all of the futures explicitly?

llunak added inline comments. · May 3 2022, 11:20 PM
mlir/include/mlir/IR/Threading.h
74–75

Presumably it's less code. Also note that only ThreadPool's wait() can process other tasks while waiting; waiting on futures blocks (see the last paragraph of the ThreadPool doc description).
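The distinction matters for nesting: a thread blocked in `std::future::wait()` just sleeps, while a pool-aware `wait()` can drain pending tasks and make progress. A minimal single-threaded sketch of the second behavior (not LLVM's implementation; the type name is invented for illustration):

```cpp
#include <deque>
#include <functional>
#include <utility>

// TinyQueue is a hypothetical stand-in for a task queue whose wait()
// executes pending work instead of blocking. This is the property that
// makes a worker thread calling wait() deadlock-free: it keeps the
// queue moving rather than sleeping on a future.
struct TinyQueue {
  std::deque<std::function<void()>> pending;

  void async(std::function<void()> fn) { pending.push_back(std::move(fn)); }

  void wait() {
    // Drain the queue: "waiting" doubles as executing outstanding tasks.
    while (!pending.empty()) {
      auto task = std::move(pending.front());
      pending.pop_front();
      task();
    }
  }
};
```

A real pool does this across threads with locking and condition variables; the sketch only shows why wait-by-processing cannot starve the queue the way future-based blocking can.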

Remove futures and use taskgroup.wait()

mlir/include/mlir/IR/Threading.h
74–75

Oh right! Of course...

Add a comment on the wait()

rriddle accepted this revision. · May 4 2022, 3:56 AM
rriddle added inline comments.
mlir/include/mlir/IR/Threading.h
77–78

This only applies to worker threads though; if this is the main thread, won't we block? That would be a regression from the current behavior, which lets the main thread also participate in the work. Should we still process one of the work items on this thread, as before?

This revision is now accepted and ready to land. · May 4 2022, 3:56 AM
llunak added inline comments. · May 4 2022, 4:42 AM
mlir/include/mlir/IR/Threading.h
77–78

Yes, the main thread will block here. But assuming the thread pool is configured with optimal concurrency (I don't know if that's the case), the main thread wouldn't have a CPU core to run on and would have to compete for one with the thread pool's threads. Calling the process function in the main thread before waiting would only save spawning one thread pool thread (assuming it hasn't already been spawned by something else).
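The "main thread participates" pattern under discussion can be sketched as follows: work items are claimed through a shared atomic counter, so the calling thread runs the same task loop as the workers before it waits. This is an assumed shape, not MLIR's exact `parallelForEach` code; the function and parameter names are invented for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch: numWorkers threads plus the calling thread all pull
// indices from one atomic counter, so the caller contributes work instead
// of blocking immediately in a wait().
void parallelForEachSketch(std::size_t numItems, unsigned numWorkers,
                           const std::function<void(std::size_t)> &processFn) {
  std::atomic<std::size_t> next{0};
  auto runTasks = [&] {
    std::size_t i;
    while ((i = next.fetch_add(1)) < numItems)
      processFn(i);
  };
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < numWorkers; ++w)
    workers.emplace_back(runTasks);
  runTasks(); // the calling thread takes a share of the work too
  for (std::thread &t : workers)
    t.join(); // stands in for tasksGroup.wait()
}
```

Because each index is claimed exactly once via `fetch_add`, adding the caller as an extra consumer needs no other coordination; the trade-off debated here is only whether that extra consumer oversubscribes the cores.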

Adding Chris for visibility, I know he was thinking about our threading model and utilities recently!

mehdi_amini added inline comments. · May 4 2022, 7:49 AM
mlir/include/mlir/IR/Threading.h
77–78

If the main thread is competing with the worker threads, then it is already a problem when you get to this point isn't it?

I'll keep the pre-existing behavior. The good point raised here is that the ThreadPool size should maybe account for the "main" thread?

Process one action in the current thread before waiting for the tasksGroup.

llunak added inline comments. · May 4 2022, 9:00 AM
mlir/include/mlir/IR/Threading.h
77–78

If the main thread is competing with the worker threads, then it is already a problem when you get to this point isn't it?

What I mean is this: let's say you get to this point and only the main thread exists. The thread pool is set up to use up to the number of available CPU cores. So if you queue enough tasks, thread pool tasks will use all CPU cores, and keeping the main thread running would make it compete for a CPU core that should be used by a worker thread.

You seemingly get around this by using one less thread pool task, but that doesn't work with recursive usage. Let's say you have 8 CPU cores. On the first call here you launch 7 worker threads and keep the main thread working too. So far so good. But if one of the tasks calls here recursively, it queues another 7 tasks, and the thread pool will launch its 8th and last thread. So now you have 9 active threads for 8 CPU cores.

I'll keep the pre-existing behavior, the good point raised here is that the ThreadPool size should account for the "main" thread maybe?

Now that ThreadPool::wait() can keep processing tasks when waiting for a group and called from a worker thread, it presumably shouldn't be difficult to make that work in all cases, assuming you know that the main thread will always wait() and not e.g. get stuck waiting on a socket. But is one extra spawned thread worth spending time on that?
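The oversubscription arithmetic in this comment can be written out as a small accounting helper (hypothetical model, not anything in LLVM: it just counts the threads the scenario describes, assuming the pool spawns up to `poolSize` workers as tasks arrive).

```cpp
#include <algorithm>

// Rough model of the discussed scenario: the pool runs at most
// min(poolSize, queuedTasks) worker threads, and the main thread adds
// one more active thread if it participates in the work.
int activeThreads(int poolSize, int queuedTasks, bool mainThreadWorks) {
  int workers = std::min(poolSize, queuedTasks);
  return workers + (mainThreadWorks ? 1 : 0);
}
```

On 8 cores with a pool of 8: the first call queues 7 tasks and the main thread helps, giving 8 active threads, a perfect fit. A recursive call queues 7 more tasks, the pool spawns its 8th worker, and with the main thread still working the count reaches 9, one more than the cores available, matching the oversubscription described above.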

mehdi_amini added inline comments. · May 4 2022, 10:04 AM
mlir/include/mlir/IR/Threading.h
77–78

You seemingly get around this by using one less threadpool task, but that doesn't work with recursive usage. Let's say you have 8 CPU cores. On first call here you launch 7 worker threads and keep main thread working too. So far so good. But if one of the tasks calls here recursively, it queues another 7 tasks, and the threadpool will launch the last 8th thread. So now you have 9 active threads for 8 CPU cores.

What I had in mind was actually to initialize the thread pool to 7 threads on an 8-core machine. I agree with you that if you let the thread pool run 8 threads, it'll oversubscribe inconsistently.

When I was referring to the pool already competing with the main thread, I was assuming that there were potentially already other tasks running in the pool.

Anyway, I'm fine with going either way here. The fact that the default strategy for the thread pool is to use as many threads as there are cores makes me favor @llunak's approach. @rriddle, is this fine with you?

rriddle added inline comments. · May 6 2022, 10:13 AM
mlir/include/mlir/IR/Threading.h
77–78

I agree that this is a better approach. I am just recalling some serious performance problems with doing that before. It would be nice if we could benchmark something reasonable to see what the results of this are.