This is an archive of the discontinued LLVM Phabricator instance.

Enable the use of ThreadPoolTaskGroup in MLIR threading helper to enable nested parallelism
ClosedPublic

Authored by mehdi_amini on May 3 2022, 9:59 PM.

Details

Summary

The LLVM ThreadPool recently gained the concept of a
ThreadPoolTaskGroup: this is a way to "partition" the thread pool
into groups of tasks and enable nested parallelism through this
grouping at every level of nesting.
We make use of this feature in MLIR's threading abstraction to fix a
long-standing TODO and enable nested parallelism.
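The core idea is that each level of nesting only waits on its own group of subtasks, so recursion never deadlocks waiting on the whole pool. The sketch below illustrates that shape using only the standard library (`std::async`), not LLVM's actual `ThreadPool`/`ThreadPoolTaskGroup` classes; in LLVM the same per-level waiting happens on a fixed-size pool.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Recursively sum a range, splitting into two subtasks per level.
// Each invocation waits only on the future of its own "group", so
// nesting to arbitrary depth is safe. (Illustrative only; LLVM's
// ThreadPoolTaskGroup provides the same per-level wait on a shared pool.)
long long parallelSum(const std::vector<int> &data, std::size_t lo,
                      std::size_t hi) {
  if (hi - lo <= 1024)
    return std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);
  std::size_t mid = lo + (hi - lo) / 2;
  // The left half runs as this level's grouped subtask.
  auto left =
      std::async(std::launch::async, parallelSum, std::cref(data), lo, mid);
  long long right = parallelSum(data, mid, hi); // current thread helps too
  return left.get() + right;
}
```

With `std::async` every subtask gets its own thread, which sidesteps the pool-exhaustion problem the task-group mechanism solves for a bounded pool; the sketch only shows the wait-per-level structure.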

Diff Detail

Event Timeline

mehdi_amini created this revision. · May 3 2022, 9:59 PM
Herald added a project: Restricted Project. · May 3 2022, 9:59 PM
mehdi_amini requested review of this revision. · May 3 2022, 9:59 PM
rriddle added inline comments. · May 3 2022, 10:14 PM
mlir/include/mlir/IR/Threading.h
74–75

Shouldn't we be waiting on the task group here now instead? With that, can we also drop the need to capture all of the futures explicitly?

llunak added inline comments. · May 3 2022, 11:20 PM
mlir/include/mlir/IR/Threading.h
74–75

Presumably it's less code. Also note that only ThreadPool's wait() can process other tasks while waiting; waiting on futures blocks (see the last paragraph of the ThreadPool doc description).
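The distinction matters for nesting: a thread blocked in `std::future::wait()` just sleeps, while a pool-aware `wait()` can drain pending tasks and make progress. A minimal single-threaded sketch of the second behavior (not LLVM's implementation; the type name is invented for illustration):

```cpp
#include <deque>
#include <functional>
#include <utility>

// TinyQueue is a hypothetical stand-in for a task queue whose wait()
// executes pending work instead of blocking. This is the property that
// makes a worker thread calling wait() deadlock-free: it keeps the
// queue moving rather than sleeping on a future.
struct TinyQueue {
  std::deque<std::function<void()>> pending;

  void async(std::function<void()> fn) { pending.push_back(std::move(fn)); }

  void wait() {
    // Drain the queue: "waiting" doubles as executing outstanding tasks.
    while (!pending.empty()) {
      auto task = std::move(pending.front());
      pending.pop_front();
      task();
    }
  }
};
```

A real pool does this across threads with locking and condition variables; the sketch only shows why wait-by-processing cannot starve the queue the way future-based blocking can.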

Remove futures and use taskgroup.wait()

mlir/include/mlir/IR/Threading.h
74–75

Oh right! Of course...

Add a comment on the wait()

rriddle accepted this revision. · May 4 2022, 3:56 AM
rriddle added inline comments.
mlir/include/mlir/IR/Threading.h
77–78

This only applies to worker threads though; if this is the main thread, won't we block? That would be a regression from the current behavior, which lets the main thread also participate in the work. Should we still process one of the work items on this thread, as before?

This revision is now accepted and ready to land. · May 4 2022, 3:56 AM
llunak added inline comments. · May 4 2022, 4:42 AM
mlir/include/mlir/IR/Threading.h
77–78

Yes, the main thread will block here. But assuming the thread pool is configured with optimal concurrency (I don't know if that's the case), the main thread wouldn't have a CPU core to run on and would have to compete for one with the thread pool's threads. Calling the process function in the main thread before waiting would only save spawning one thread pool thread (assuming it hasn't already been spawned by something else).
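The "main thread participates" pattern under discussion can be sketched as follows: work items are claimed through a shared atomic counter, so the calling thread runs the same task loop as the workers before it waits. This is an assumed shape, not MLIR's exact `parallelForEach` code; the function and parameter names are invented for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch: numWorkers threads plus the calling thread all pull
// indices from one atomic counter, so the caller contributes work instead
// of blocking immediately in a wait().
void parallelForEachSketch(std::size_t numItems, unsigned numWorkers,
                           const std::function<void(std::size_t)> &processFn) {
  std::atomic<std::size_t> next{0};
  auto runTasks = [&] {
    std::size_t i;
    while ((i = next.fetch_add(1)) < numItems)
      processFn(i);
  };
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < numWorkers; ++w)
    workers.emplace_back(runTasks);
  runTasks(); // the calling thread takes a share of the work too
  for (std::thread &t : workers)
    t.join(); // stands in for tasksGroup.wait()
}
```

Because each index is claimed exactly once via `fetch_add`, adding the caller as an extra consumer needs no other coordination; the trade-off debated here is only whether that extra consumer oversubscribes the cores.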

Adding Chris for visibility, I know he was thinking about our threading model and utilities recently!

mehdi_amini added inline comments. · May 4 2022, 7:49 AM
mlir/include/mlir/IR/Threading.h
77–78

If the main thread is competing with the worker threads, then it is already a problem when you get to this point isn't it?

I'll keep the pre-existing behavior. The good point raised here is that the ThreadPool size should maybe account for the "main" thread?

Process one action in the current thread before waiting for the tasksGroup.

llunak added inline comments. · May 4 2022, 9:00 AM
mlir/include/mlir/IR/Threading.h
77–78

If the main thread is competing with the worker threads, then it is already a problem when you get to this point isn't it?

What I mean is this: let's say you get to this point and only the main thread exists. The thread pool is set up to use up to the number of available CPU cores. So if you queue enough tasks, thread pool tasks will use all CPU cores, and keeping the main thread running would make it compete for a CPU core that should be used by a worker thread.

You seemingly get around this by using one less thread pool task, but that doesn't work with recursive usage. Let's say you have 8 CPU cores. On the first call here you launch 7 worker threads and keep the main thread working too. So far so good. But if one of the tasks calls here recursively, it queues another 7 tasks, and the thread pool will launch its 8th and last thread. So now you have 9 active threads for 8 CPU cores.

I'll keep the pre-existing behavior, the good point raised here is that the ThreadPool size should account for the "main" thread maybe?

Now that ThreadPool::wait() can keep processing tasks when waiting for a group and called from a worker thread, it presumably shouldn't be difficult to make that work in all cases, assuming you know that the main thread will always wait() and not e.g. get stuck waiting on a socket. But is one extra spawned thread worth spending time on that?
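The oversubscription arithmetic in this comment can be written out as a small accounting helper (hypothetical model, not anything in LLVM: it just counts the threads the scenario describes, assuming the pool spawns up to `poolSize` workers as tasks arrive).

```cpp
#include <algorithm>

// Rough model of the discussed scenario: the pool runs at most
// min(poolSize, queuedTasks) worker threads, and the main thread adds
// one more active thread if it participates in the work.
int activeThreads(int poolSize, int queuedTasks, bool mainThreadWorks) {
  int workers = std::min(poolSize, queuedTasks);
  return workers + (mainThreadWorks ? 1 : 0);
}
```

On 8 cores with a pool of 8: the first call queues 7 tasks and the main thread helps, giving 8 active threads, a perfect fit. A recursive call queues 7 more tasks, the pool spawns its 8th worker, and with the main thread still working the count reaches 9, one more than the cores available, matching the oversubscription described above.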

mehdi_amini added inline comments. · May 4 2022, 10:04 AM
mlir/include/mlir/IR/Threading.h
77–78

You seemingly get around this by using one less threadpool task, but that doesn't work with recursive usage. Let's say you have 8 CPU cores. On first call here you launch 7 worker threads and keep main thread working too. So far so good. But if one of the tasks calls here recursively, it queues another 7 tasks, and the threadpool will launch the last 8th thread. So now you have 9 active threads for 8 CPU cores.

What I had in mind was actually to initialize the thread pool to 7 threads on an 8-core machine. I agree with you that if you let the thread pool run 8 threads, it'll oversubscribe inconsistently.

When I was referring to the pool already competing with the main thread, I was assuming that there were potentially already other tasks running in the pool.

Anyway, I'm fine with going either way here. The fact that the default strategy for the thread pool is to use as many threads as there are cores makes me favor @llunak's approach. @rriddle, is this fine with you?

rriddle added inline comments. · May 6 2022, 10:13 AM
mlir/include/mlir/IR/Threading.h
77–78

I agree that this is a better approach. I am just recalling some serious performance problems with doing that before. It would be nice if we could benchmark something reasonable to see what the results of this are.