This is an archive of the discontinued LLVM Phabricator instance.

llvm/include/llvm/Support/Parallel.h
214	Removing `for (; Begin != End; ++Begin) Fn(Begin); return;` is still correct (my lld executable will be 9KiB smaller on top of the current decrease) but makes the TaskGroup have less parallelism.

Harbormaster completed remote builds in B143849: Diff 400619. Jan 17 2022, 1:18 PM

Cool, thank you for working to improve this. As mentioned on https://reviews.llvm.org/D101699, dropping the special case for 1 element will have catastrophic performance impacts for some workloads (e.g. in the MLIR/CIRCT world) because of the problems that Threading.h has with nested parallelism.

Have you tried detemplating this entirely? If something is interesting to parallelize, then the granule of work should not be tiny. I'd consider moving this to take a unique_function<void()>, which would allow moving the implementation details of all of this out of line to a .cpp file.
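A minimal sketch of the detemplating idea, not the patch itself: the loop lives in one non-template function that takes a type-erased callable, so its definition could move to a .cpp file, and only a thin adapter stays templated. `std::function` stands in for an LLVM callable wrapper such as `unique_function` so the snippet builds without LLVM; `forEachIndex` and `forEachItem` are hypothetical names, and the loop is serial for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Non-template core: could be declared in a header and defined in a .cpp.
// std::function is an assumption standing in for an LLVM callable wrapper.
void forEachIndex(std::size_t Begin, std::size_t End,
                  const std::function<void(std::size_t)> &Fn) {
  assert(Begin <= End && "range must not be reversed");
  for (std::size_t I = Begin; I != End; ++I)
    Fn(I); // One indirect call per item.
}

// Thin template wrapper: only this adapter is instantiated per callable type.
// Requires random-access iterators because it indexes with Begin[I].
template <class IterTy, class FuncTy>
void forEachItem(IterTy Begin, IterTy End, FuncTy Fn) {
  forEachIndex(0, static_cast<std::size_t>(End - Begin),
               [&](std::size_t I) { Fn(Begin[I]); });
}
```

The trade-off is one indirect call per item in exchange for a single copy of the loop body in the binary, which only pays off when each item is a reasonably large unit of work.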

simplify

Herald added a subscriber: hiraditya. Jan 22 2022, 11:39 AM

Harbormaster completed remote builds in B145036: Diff 402244. Jan 22 2022, 12:21 PM

niiice, thank you!

This revision is now accepted and ready to land. Jan 22 2022, 4:59 PM

Closed by commit rG8e382ae91b97: [Support] Simplify parallelForEach{,N} (authored by MaskRay). Jan 23 2022, 10:35 AM

This revision was automatically updated to reflect the committed changes.

MaskRay added a commit: rG8e382ae91b97: [Support] Simplify parallelForEach{,N}.

dblaikie added a subscriber: dblaikie. Jan 23 2022, 10:49 AM

rnk added inline comments. Jan 24 2022, 8:43 AM

llvm/lib/Support/Parallel.cpp
197	I believe the old code was templated in an effort to avoid having an indirect call in the inner loop. However, I don't think that actually matters. These constructs have so much overhead that they are only useful for coarse-grained parallelism, not vectorizable inner loops.

lattner added inline comments. Jan 24 2022, 10:45 AM

llvm/lib/Support/Parallel.cpp
197	Right, I completely agree with you

MaskRay mentioned this in D119908: [ELF] Move duplicate symbol check after input file parsing. Feb 16 2022, 10:21 AM

Revision Contents

Path	Size
llvm/include/llvm/Support/Parallel.h	80 lines
llvm/lib/Support/Parallel.cpp	32 lines

Diff 402354

llvm/include/llvm/Support/Parallel.h

(124 lines not shown)
}		}

// TaskGroup has a relatively high overhead, so we want to reduce		// TaskGroup has a relatively high overhead, so we want to reduce
// the number of spawn() calls. We'll create up to 1024 tasks here.		// the number of spawn() calls. We'll create up to 1024 tasks here.
// (Note that 1024 is an arbitrary number. This code probably needs		// (Note that 1024 is an arbitrary number. This code probably needs
// improving to take the number of available cores into account.)		// improving to take the number of available cores into account.)
enum { MaxTasksPerGroup = 1024 };		enum { MaxTasksPerGroup = 1024 };

template <class IterTy, class FuncTy>
void parallel_for_each(IterTy Begin, IterTy End, FuncTy Fn) {
// If we have zero or one items, then do not incur the overhead of spinning up
// a task group. They are surprisingly expensive, and because they do not
// support nested parallelism, a single entry task group can block parallel
// execution underneath them.
auto NumItems = std::distance(Begin, End);
if (NumItems <= 1) {
if (NumItems)
Fn(*Begin);
return;
}

// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
// overhead on large inputs.
ptrdiff_t TaskSize = NumItems / MaxTasksPerGroup;
if (TaskSize == 0)
TaskSize = 1;

TaskGroup TG;
while (TaskSize < std::distance(Begin, End)) {
TG.spawn([=, &Fn] { std::for_each(Begin, Begin + TaskSize, Fn); });
Begin += TaskSize;
}
std::for_each(Begin, End, Fn);
}

template <class IndexTy, class FuncTy>
void parallel_for_each_n(IndexTy Begin, IndexTy End, FuncTy Fn) {
// If we have zero or one items, then do not incur the overhead of spinning up
// a task group. They are surprisingly expensive, and because they do not
// support nested parallelism, a single entry task group can block parallel
// execution underneath them.
auto NumItems = End - Begin;
if (NumItems <= 1) {
if (NumItems)
Fn(Begin);
return;
}

// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
// overhead on large inputs.
ptrdiff_t TaskSize = NumItems / MaxTasksPerGroup;
if (TaskSize == 0)
TaskSize = 1;

TaskGroup TG;
IndexTy I = Begin;
for (; I + TaskSize < End; I += TaskSize) {
TG.spawn([=, &Fn] {
for (IndexTy J = I, E = I + TaskSize; J != E; ++J)
Fn(J);
});
}
for (IndexTy J = I; J < End; ++J)
Fn(J);
}

template <class IterTy, class ResultTy, class ReduceFuncTy,		template <class IterTy, class ResultTy, class ReduceFuncTy,
class TransformFuncTy>		class TransformFuncTy>
ResultTy parallel_transform_reduce(IterTy Begin, IterTy End, ResultTy Init,		ResultTy parallel_transform_reduce(IterTy Begin, IterTy End, ResultTy Init,
ReduceFuncTy Reduce,		ReduceFuncTy Reduce,
TransformFuncTy Transform) {		TransformFuncTy Transform) {
// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling		// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
// overhead on large inputs.		// overhead on large inputs.
size_t NumInputs = std::distance(Begin, End);		size_t NumInputs = std::distance(Begin, End);
(47 lines not shown)
#if LLVM_ENABLE_THREADS		#if LLVM_ENABLE_THREADS
if (parallel::strategy.ThreadsRequested != 1) {		if (parallel::strategy.ThreadsRequested != 1) {
parallel::detail::parallel_sort(Start, End, Comp);		parallel::detail::parallel_sort(Start, End, Comp);
return;		return;
}		}
#endif		#endif
llvm::sort(Start, End, Comp);		llvm::sort(Start, End, Comp);
}		}

		void parallelForEachN(size_t Begin, size_t End, function_ref<void(size_t)> Fn);

template <class IterTy, class FuncTy>		template <class IterTy, class FuncTy>
void parallelForEach(IterTy Begin, IterTy End, FuncTy Fn) {		void parallelForEach(IterTy Begin, IterTy End, FuncTy Fn) {
#if LLVM_ENABLE_THREADS		parallelForEachN(0, End - Begin, [&](size_t I) { Fn(Begin[I]); });
if (parallel::strategy.ThreadsRequested != 1) {
parallel::detail::parallel_for_each(Begin, End, Fn);
return;
}
#endif
std::for_each(Begin, End, Fn);
}

template <class FuncTy>
void parallelForEachN(size_t Begin, size_t End, FuncTy Fn) {
#if LLVM_ENABLE_THREADS
if (parallel::strategy.ThreadsRequested != 1) {
parallel::detail::parallel_for_each_n(Begin, End, Fn);
return;
}
#endif
for (size_t I = Begin; I != End; ++I)
Fn(I);
}		}

template <class IterTy, class ResultTy, class ReduceFuncTy,		template <class IterTy, class ResultTy, class ReduceFuncTy,
class TransformFuncTy>		class TransformFuncTy>
ResultTy parallelTransformReduce(IterTy Begin, IterTy End, ResultTy Init,		ResultTy parallelTransformReduce(IterTy Begin, IterTy End, ResultTy Init,
ReduceFuncTy Reduce,		ReduceFuncTy Reduce,
TransformFuncTy Transform) {		TransformFuncTy Transform) {
#if LLVM_ENABLE_THREADS		#if LLVM_ENABLE_THREADS
if (parallel::strategy.ThreadsRequested != 1) {		if (parallel::strategy.ThreadsRequested != 1) {
return parallel::detail::parallel_transform_reduce(Begin, End, Init, Reduce,		return parallel::detail::parallel_transform_reduce(Begin, End, Init, Reduce,
Transform);		Transform);
}		}
#endif		#endif
for (IterTy I = Begin; I != End; ++I)		for (IterTy I = Begin; I != End; ++I)
Init = Reduce(std::move(Init), Transform(*I));		Init = Reduce(std::move(Init), Transform(*I));
return std::move(Init);		return std::move(Init);
}		}

// Range wrappers.		// Range wrappers.
template <class RangeTy,		template <class RangeTy,
class Comparator = std::less<decltype(*std::begin(RangeTy()))>>		class Comparator = std::less<decltype(*std::begin(RangeTy()))>>
void parallelSort(RangeTy &&R, const Comparator &Comp = Comparator()) {		void parallelSort(RangeTy &&R, const Comparator &Comp = Comparator()) {
(36 lines not shown)

llvm/lib/Support/Parallel.cpp

(168 lines not shown)
if (Parallel) {		if (Parallel) {
F();		F();
}		}
}		}

} // namespace detail		} // namespace detail
} // namespace parallel		} // namespace parallel
} // namespace llvm		} // namespace llvm
#endif // LLVM_ENABLE_THREADS		#endif // LLVM_ENABLE_THREADS

		void llvm::parallelForEachN(size_t Begin, size_t End,
		llvm::function_ref<void(size_t)> Fn) {
		// If we have zero or one items, then do not incur the overhead of spinning up
		// a task group. They are surprisingly expensive, and because they do not
		// support nested parallelism, a single entry task group can block parallel
		// execution underneath them.
		#if LLVM_ENABLE_THREADS
		auto NumItems = End - Begin;
		if (NumItems > 1 && parallel::strategy.ThreadsRequested != 1) {
		// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
		// overhead on large inputs.
		auto TaskSize = NumItems / parallel::detail::MaxTasksPerGroup;
		if (TaskSize == 0)
		TaskSize = 1;

		parallel::detail::TaskGroup TG;
		for (; Begin + TaskSize < End; Begin += TaskSize) {
		TG.spawn([=, &Fn] {
		for (size_t I = Begin, E = Begin + TaskSize; I != E; ++I)
		Fn(I);
		});
		}
		for (; Begin != End; ++Begin)
		Fn(Begin);
		return;
		}
		#endif

		for (; Begin != End; ++Begin)
		Fn(Begin);
		}
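
To make the chunking arithmetic in `llvm::parallelForEachN` concrete, the sketch below replays the same loop bounds without spawning anything. `ChunkPlan` and `planChunks` are hypothetical names introduced only for illustration; the constant and the loop condition are copied from the diff above. The real function only takes this path when `NumItems > 1` and more than one thread is requested.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the MaxTasksPerGroup constant in the diff above.
constexpr std::size_t MaxTasksPerGroup = 1024;

struct ChunkPlan {
  std::size_t TaskSize;  // Items handled by each spawned task.
  std::size_t Spawned;   // Number of TG.spawn() calls the loop would make.
  std::size_t TailItems; // Items the calling thread runs itself.
};

// Hypothetical helper: replays the loop bounds without spawning anything.
ChunkPlan planChunks(std::size_t Begin, std::size_t End) {
  assert(Begin <= End && "range must not be reversed");
  std::size_t NumItems = End - Begin;
  std::size_t TaskSize = NumItems / MaxTasksPerGroup;
  if (TaskSize == 0)
    TaskSize = 1;
  std::size_t Spawned = 0;
  for (; Begin + TaskSize < End; Begin += TaskSize)
    ++Spawned; // The real code calls TG.spawn() here.
  return {TaskSize, Spawned, End - Begin};
}
```

With these bounds, 10 items give 9 spawned tasks plus 1 item on the calling thread, and 2048 items give a TaskSize of 2 with 1023 spawns. Because the division rounds down, 5000 items give a TaskSize of 4 and 1249 spawns, so the spawn count can exceed MaxTasksPerGroup for some input sizes.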
