This is an archive of the discontinued LLVM Phabricator instance.


Differential D101699

[Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.
ClosedPublic

Authored by lattner on May 1 2021, 2:07 PM.


Details

Reviewers

ruiu
zturner
rnk
MaskRay
lattner

Commits

rG5fa9d4163421: [Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.

Summary

This avoids the non-trivial overhead of creating a TaskGroup in these degenerate
cases, but also exposes parallelism. It turns out that the default executor
underlying TaskGroup prevents recursive parallelism - so an instance of a task
group being alive will make nested ones become serial.

This is a big issue in MLIR in some dialects, if they have a single instance of
an outer op (e.g. a firrtl.circuit) that has many parallel ops within it (e.g.
a firrtl.module). This patch side-steps the problem by avoiding creating the
TaskGroup in the unneeded case. See this issue for more details:
https://github.com/llvm/circt/issues/993

Note that this isn't a really great solution for the general case of nested
parallelism. A redesign of the TaskGroup stuff would be better, but would be
a much more invasive change.
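The serialization described above can be illustrated with a toy model. This is only a sketch, assuming the executor behaves the way the summary describes (only the outermost live task group runs in parallel). ToyTaskGroup, toyParallelForEach, Stats, and runNested are hypothetical names, not LLVM APIs, and every task runs inline so the result is deterministic.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a non-reentrant task group: only the first live
// instance is marked parallel; any group created while another is alive is not.
class ToyTaskGroup {
  static int &liveGroups() {
    static int N = 0;
    return N;
  }
  bool Parallel;

public:
  ToyTaskGroup() : Parallel(liveGroups()++ == 0) {}
  ~ToyTaskGroup() {
    assert(liveGroups() > 0);
    --liveGroups();
  }
  bool isParallel() const { return Parallel; }
  // A real group would hand F to worker threads when Parallel is true.
  // This model always runs it inline.
  template <class FuncTy> void spawn(FuncTy F) { F(); }
};

// Counts how many groups were created in each mode.
struct Stats {
  int ParallelGroups = 0;
  int SerialGroups = 0;
};

template <class IterTy, class FuncTy>
void toyParallelForEach(IterTy Begin, IterTy End, FuncTy Fn, bool UseFastPath,
                        Stats &S) {
  auto NumItems = std::distance(Begin, End);
  // The special case from this patch: no group for zero or one items.
  if (UseFastPath && NumItems <= 1) {
    if (NumItems)
      Fn(*Begin);
    return;
  }
  ToyTaskGroup TG;
  if (TG.isParallel())
    ++S.ParallelGroups;
  else
    ++S.SerialGroups;
  for (; Begin != End; ++Begin) {
    IterTy It = Begin;
    TG.spawn([It, &Fn] { Fn(*It); });
  }
}

// One outer item (think of a single firrtl.circuit) holding many inner items.
Stats runNested(bool UseFastPath) {
  Stats S;
  std::vector<int> Outer(1), Inner(100);
  toyParallelForEach(
      Outer.begin(), Outer.end(),
      [&](int &) {
        toyParallelForEach(
            Inner.begin(), Inner.end(), [](int &X) { ++X; }, UseFastPath, S);
      },
      UseFastPath, S);
  return S;
}
```

In this model, runNested(false) creates the inner group while the outer one is still alive, so the inner group is counted as serial. runNested(true) skips the outer group entirely, so the inner group is the only live one and is counted as parallel.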

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

lattner created this revision. May 1 2021, 2:07 PM

Herald added subscribers: dexonsmith, rriddle. May 1 2021, 2:07 PM

lattner requested review of this revision. May 1 2021, 2:07 PM

Herald added a project: Restricted Project. May 1 2021, 2:07 PM

Herald added subscribers: llvm-commits, stephenneuendorffer.

This seems obvious to me, but I'd appreciate a quick look from someone else.

Harbormaster completed remote builds in B102124: Diff 342176. May 1 2021, 2:41 PM

This is an obvious improvement, so I'm going to self-approve. TaskGroup could be improved significantly, but this fits within the current architecture.

This revision is now accepted and ready to land. May 3 2021, 10:07 AM

Closed by commit rG5fa9d4163421: [Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each. (authored by lattner). May 3 2021, 10:08 AM

This revision was automatically updated to reflect the committed changes.

lattner added a commit: rG5fa9d4163421: [Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.

LGTM from me. Though I think it would be good to reduce some of the duplication in these methods.

Change looks good.

In the long run, we should try to build on an existing C++17 parallel algorithms library, such as the pstl in our own monorepo. I don't know how far off it is before we can raise the LLVM toolchain requirements to C++17, and after that, how much platform support will hold us back, but maybe we can hack the pstl into the LLVM build if it becomes a sticking point.

The important thing for LLVM is that we express our code using well-known, deterministic parallel algorithms, and we can pick up better implementations later.

How expensive is spinning up a TaskGroup? Mostly the ctor/dtor of TaskGroup?

I created D117510 to remove one Fn(...) call which may make instantiated code slightly smaller.

Herald added a subscriber: Chia-hungDuan. Jan 17 2022, 12:31 PM

In D101699#2736758, @rnk wrote:

Change looks good.

In the long run, we should try to build on an existing C++17 parallel algorithms library, such as the pstl in our own monorepo. I don't know how far off it is before we can raise the LLVM toolchain requirements to C++17, and after that, how much platform support will hold us back, but maybe we can hack the pstl into the LLVM build if it becomes a sticking point.

The important thing for LLVM is that we express our code using well-known, deterministic parallel algorithms, and we can pick up better implementations later.

Keeping LLVM's own version provides some flexibility. E.g. when lld is writing output sections in parallel, we may tune the function to have a better TaskSize.
I tested that for a -DCMAKE_BUILD_TYPE=Release clang, using a fixed TaskSize==128 (https://gist.github.com/MaskRay/540e7bb31408afcee2b827140bef33e3) is 1.02x as fast as the current code, though it makes no noticeable difference when linking chrome.
Such a preference may not be expressible with the C++ STL.

Another thing we need to think about is code bloat. Many template-heavy libraries expand to a huge amount of code. For some programs like a linker, we favor calling parallel* in many places. The expanded code may be significant.
(I think mold has experienced this.)
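To make the TaskSize discussion concrete, here is a small sketch of the chunking arithmetic used by parallel_for_each in this diff. computeTaskSize and countSpawnedTasks are hypothetical helper names introduced only for illustration. The change in the linked gist is not reproduced here, so the fixed size of 128 is simply passed in as a parameter.

```cpp
#include <cstddef>

// Same constant as in Parallel.h in this diff.
enum { MaxTasksPerGroup = 1024 };

// Chunk size handed to each spawned task: NumItems / MaxTasks, but at least 1.
inline std::ptrdiff_t
computeTaskSize(std::ptrdiff_t NumItems,
                std::ptrdiff_t MaxTasks = MaxTasksPerGroup) {
  std::ptrdiff_t TaskSize = NumItems / MaxTasks;
  return TaskSize == 0 ? 1 : TaskSize;
}

// Number of spawn() calls made by a loop of the form
//   while (TaskSize < remaining) { spawn(chunk); remaining -= TaskSize; }
// The final chunk is not spawned; the calling thread processes it itself.
inline std::ptrdiff_t countSpawnedTasks(std::ptrdiff_t NumItems,
                                        std::ptrdiff_t TaskSize) {
  std::ptrdiff_t Spawned = 0;
  for (std::ptrdiff_t Remaining = NumItems; TaskSize < Remaining;
       Remaining -= TaskSize)
    ++Spawned;
  return Spawned;
}
```

For 4096 items the default policy gives a TaskSize of 4 and 1023 spawned tasks, whereas a fixed TaskSize of 128 over 5000 items spawns only 39. Fewer, larger tasks mean less scheduling overhead but coarser load balancing, which is the kind of trade-off a project-specific implementation can tune.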

Spinning up the threadpool is expensive, but the bigger issue is that Threading.h doesn't support reentrant concurrency. If you have a parallel for loop with one element, then that element does a parallel for loop over 10000 elements, it will run in serial. :-(

lattner mentioned this in D117510: [Support] Simplify parallelForEach{,N}. Jan 17 2022, 10:37 PM

Revision Contents

Path: llvm/include/llvm/Support/Parallel.h
Size: 26 lines

Diff 342176

llvm/include/llvm/Support/Parallel.h

[... 123 lines not shown ...]
  // TaskGroup has a relatively high overhead, so we want to reduce
  // the number of spawn() calls. We'll create up to 1024 tasks here.
  // (Note that 1024 is an arbitrary number. This code probably needs
  // improving to take the number of available cores into account.)
  enum { MaxTasksPerGroup = 1024 };

  template <class IterTy, class FuncTy>
  void parallel_for_each(IterTy Begin, IterTy End, FuncTy Fn) {
+   // If we have zero or one items, then do not incur the overhead of spinning up
+   // a task group. They are surprisingly expensive, and because they do not
+   // support nested parallelism, a single entry task group can block parallel
+   // execution underneath them.
+   auto NumItems = std::distance(Begin, End);
+   if (NumItems <= 1) {
+     if (NumItems)
+       Fn(*Begin);
+     return;
+   }
+
    // Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
    // overhead on large inputs.
-   ptrdiff_t TaskSize = std::distance(Begin, End) / MaxTasksPerGroup;
+   ptrdiff_t TaskSize = NumItems / MaxTasksPerGroup;
    if (TaskSize == 0)
      TaskSize = 1;

    TaskGroup TG;
    while (TaskSize < std::distance(Begin, End)) {
      TG.spawn([=, &Fn] { std::for_each(Begin, Begin + TaskSize, Fn); });
      Begin += TaskSize;
    }
    std::for_each(Begin, End, Fn);
  }

  template <class IndexTy, class FuncTy>
  void parallel_for_each_n(IndexTy Begin, IndexTy End, FuncTy Fn) {
+   // If we have zero or one items, then do not incur the overhead of spinning up
+   // a task group. They are surprisingly expensive, and because they do not
+   // support nested parallelism, a single entry task group can block parallel
+   // execution underneath them.
+   auto NumItems = End - Begin;
+   if (NumItems <= 1) {
+     if (NumItems)
+       Fn(Begin);
+     return;
+   }
+
    // Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
    // overhead on large inputs.
-   ptrdiff_t TaskSize = (End - Begin) / MaxTasksPerGroup;
+   ptrdiff_t TaskSize = NumItems / MaxTasksPerGroup;
    if (TaskSize == 0)
      TaskSize = 1;

    TaskGroup TG;
    IndexTy I = Begin;
    for (; I + TaskSize < End; I += TaskSize) {
      TG.spawn([=, &Fn] {
        for (IndexTy J = I, E = I + TaskSize; J != E; ++J)
[... 149 lines not shown ...]