This is an archive of the discontinued LLVM Phabricator instance.


Differential D117510

[Support] Simplify parallelForEach{,N}
ClosedPublic

Authored by MaskRay on Jan 17 2022, 12:27 PM.


Details

Reviewers

lattner
rnk
rriddle

Commits

rG8e382ae91b97: [Support] Simplify parallelForEach{,N}

Summary

Merge parallel_for_each into parallelForEach (this removes 1 Fn(...) call)
Change parallelForEach to use parallelForEachN
Move parallelForEachN into Parallel.cpp

My x86-64 lld executable is 100KiB smaller.
No noticeable difference in performance.
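The second summary item (routing the iterator-based loop through the index-based one) can be pictured with a minimal sketch. The names forEachIndex and forEach are hypothetical stand-ins, not the LLVM functions; std::function is used only as a convenient type-erased callable, the stand-in runs sequentially, and random-access iterators are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical stand-in for an out-of-line, index-based loop.
// It runs sequentially here; a real version would hand chunks to a task group.
inline void forEachIndex(std::size_t Begin, std::size_t End,
                         const std::function<void(std::size_t)> &Fn) {
  assert(Begin <= End && "expected a forward (possibly empty) range");
  for (std::size_t I = Begin; I != End; ++I)
    Fn(I);
}

// Thin templated adapter: only this wrapper is instantiated per iterator and
// callable type. Assumes IterTy is a random-access iterator.
template <class IterTy, class FuncTy>
void forEach(IterTy Begin, IterTy End, FuncTy Fn) {
  const auto NumItems = static_cast<std::size_t>(std::distance(Begin, End));
  forEachIndex(0, NumItems, [&](std::size_t I) { Fn(Begin[I]); });
}
```

With this shape, the loop logic exists once and each call site instantiates only the small adapter, which is one plausible reason for the smaller executable reported above.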

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	1,630 ms	x64 debian > SanitizerCommon-asan-x86_64-Linux.Linux::decorate_proc_maps.cpp
	540 ms	x64 debian > SanitizerCommon-lsan-x86_64-Linux.Linux::decorate_proc_maps.cpp
	940 ms	x64 debian > SanitizerCommon-msan-x86_64-Linux.Linux::decorate_proc_maps.cpp
	1,460 ms	x64 debian > SanitizerCommon-tsan-x86_64-Linux.Linux::decorate_proc_maps.cpp

Event Timeline

MaskRay created this revision. Jan 17 2022, 12:27 PM

Herald added subscribers: dexonsmith, pengfei. Jan 17 2022, 12:27 PM

MaskRay requested review of this revision. Jan 17 2022, 12:27 PM

Herald added a project: Restricted Project. Jan 17 2022, 12:27 PM

Herald added a subscriber: llvm-commits.

MaskRay mentioned this in D101699: [Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each. Jan 17 2022, 12:31 PM

MaskRay added inline comments. Jan 17 2022, 12:33 PM

llvm/include/llvm/Support/Parallel.h
246	Removing for (; Begin != End; ++Begin) Fn(Begin); return; is still correct (my lld executable would be 9KiB smaller on top of the current decrease), but it makes the TaskGroup have less parallelism.
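To make the chunking in that loop concrete, here is a small sketch that only computes the partition. ChunkPlan and planChunks are hypothetical names introduced for illustration; the arithmetic mirrors the TaskSize computation and loop condition shown in the diff, and nothing is actually spawned.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical description of how an index range would be split up.
struct ChunkPlan {
  std::vector<std::pair<std::size_t, std::size_t>> Spawned; // [first, last)
  std::pair<std::size_t, std::size_t> Tail; // left for the calling thread
};

inline ChunkPlan planChunks(std::size_t Begin, std::size_t End,
                            std::size_t MaxTasks) {
  assert(Begin <= End && MaxTasks > 0);
  // Same rule as in the diff: at most roughly MaxTasks chunks, never size 0.
  std::size_t TaskSize = (End - Begin) / MaxTasks;
  if (TaskSize == 0)
    TaskSize = 1;

  ChunkPlan Plan;
  // Full chunks that would be passed to TaskGroup::spawn().
  for (; Begin + TaskSize < End; Begin += TaskSize)
    Plan.Spawned.push_back({Begin, Begin + TaskSize});
  // The remaining items are the ones the trailing loop handles.
  Plan.Tail = {Begin, End};
  return Plan;
}
```

For example, planChunks(0, 10, 4) yields four spawned chunks of two items and a tail of [8, 10). In the diff, that tail loop sits inside the scope of the task group, so the calling thread can work on it while spawned chunks run; as far as the comment indicates, dropping it would leave the same items to the sequential loop after the group has finished.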

Harbormaster completed remote builds in B143849: Diff 400619. Jan 17 2022, 1:18 PM

Cool, thank you for working to improve this. As mentioned on https://reviews.llvm.org/D101699, dropping the special case for 1 element will have catastrophic performance impacts for some workloads (e.g. in the MLIR/CIRCT world) because of the problems that Threading.h has with nested parallelism.

Have you tried detemplating this entirely? If something is interesting to parallelize, then the granule of work should not be tiny. I'd consider moving this to take a unique_function<void()>, which would allow moving the implementation details of all of this out of line to a .cpp file.
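A rough sketch of what a detemplated entry point could look like, combining the type-erased callable with the zero-or-one-item fast path discussed above. This is an illustration under assumptions, not the committed code: forEachIndexErased is a hypothetical name, std::function stands in for an LLVM type-erased callable such as unique_function, and the multi-item path is simulated sequentially rather than using a task group.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Non-template signature, so the body could live in a .cpp file.
// Returns true if the (simulated) task-group path was taken.
bool forEachIndexErased(std::size_t Begin, std::size_t End,
                        const std::function<void(std::size_t)> &Fn) {
  assert(Begin <= End && "expected a forward (possibly empty) range");
  const std::size_t NumItems = End - Begin;

  // Fast path: with zero or one items, skip the task group entirely so a
  // single-entry group cannot block parallel work nested underneath it.
  if (NumItems <= 1) {
    if (NumItems == 1)
      Fn(Begin);
    return false;
  }

  // Placeholder for the parallel path: a real implementation would split
  // the range into chunks and spawn them on a task group.
  for (std::size_t I = Begin; I != End; ++I)
    Fn(I);
  return true;
}
```

The cost of this design is one indirect call per item, which only matters when the per-item work is tiny.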

simplify

Herald added a subscriber: hiraditya. Jan 22 2022, 11:39 AM

Harbormaster completed remote builds in B145036: Diff 402244. Jan 22 2022, 12:21 PM

niiice, thank you!

This revision is now accepted and ready to land. Jan 22 2022, 4:59 PM

Closed by commit rG8e382ae91b97: [Support] Simplify parallelForEach{,N} (authored by MaskRay). Jan 23 2022, 10:35 AM

This revision was automatically updated to reflect the committed changes.

MaskRay added a commit: rG8e382ae91b97: [Support] Simplify parallelForEach{,N}.

dblaikie added a subscriber: dblaikie. Jan 23 2022, 10:49 AM

rnk added inline comments. Jan 24 2022, 8:43 AM

llvm/lib/Support/Parallel.cpp
198 ↗	(On Diff #402244)	I believe the old code was templated in an effort to avoid having an indirect call in the inner loop. However, I don't think that actually matters. These constructs have so much overhead that they are only useful for coarse-grained parallelism, not vectorizable inner loops.

lattner added inline comments. Jan 24 2022, 10:45 AM

llvm/lib/Support/Parallel.cpp
198 ↗	(On Diff #402244)	Right, I completely agree with you

MaskRay mentioned this in D119908: [ELF] Move duplicate symbol check after input file parsing. Feb 16 2022, 10:21 AM


Diff 400619

llvm/include/llvm/Support/Parallel.h

[...]
}		}

// TaskGroup has a relatively high overhead, so we want to reduce		// TaskGroup has a relatively high overhead, so we want to reduce
// the number of spawn() calls. We'll create up to 1024 tasks here.		// the number of spawn() calls. We'll create up to 1024 tasks here.
// (Note that 1024 is an arbitrary number. This code probably needs		// (Note that 1024 is an arbitrary number. This code probably needs
// improving to take the number of available cores into account.)		// improving to take the number of available cores into account.)
enum { MaxTasksPerGroup = 1024 };		enum { MaxTasksPerGroup = 1024 };

template <class IterTy, class FuncTy>
void parallel_for_each(IterTy Begin, IterTy End, FuncTy Fn) {
// If we have zero or one items, then do not incur the overhead of spinning up
// a task group. They are surprisingly expensive, and because they do not
// support nested parallelism, a single entry task group can block parallel
// execution underneath them.
auto NumItems = std::distance(Begin, End);
if (NumItems <= 1) {
if (NumItems)
Fn(*Begin);
return;
}

// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
// overhead on large inputs.
ptrdiff_t TaskSize = NumItems / MaxTasksPerGroup;
if (TaskSize == 0)
TaskSize = 1;

TaskGroup TG;
while (TaskSize < std::distance(Begin, End)) {
TG.spawn([=, &Fn] { std::for_each(Begin, Begin + TaskSize, Fn); });
Begin += TaskSize;
}
std::for_each(Begin, End, Fn);
}

template <class IndexTy, class FuncTy>
void parallel_for_each_n(IndexTy Begin, IndexTy End, FuncTy Fn) {
// If we have zero or one items, then do not incur the overhead of spinning up
// a task group. They are surprisingly expensive, and because they do not
// support nested parallelism, a single entry task group can block parallel
// execution underneath them.
auto NumItems = End - Begin;
if (NumItems <= 1) {
if (NumItems)
Fn(Begin);
return;
}

// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
// overhead on large inputs.
ptrdiff_t TaskSize = NumItems / MaxTasksPerGroup;
if (TaskSize == 0)
TaskSize = 1;

TaskGroup TG;
IndexTy I = Begin;
for (; I + TaskSize < End; I += TaskSize) {
TG.spawn([=, &Fn] {
for (IndexTy J = I, E = I + TaskSize; J != E; ++J)
Fn(J);
});
}
for (IndexTy J = I; J < End; ++J)
Fn(J);
}

template <class IterTy, class ResultTy, class ReduceFuncTy,		template <class IterTy, class ResultTy, class ReduceFuncTy,
class TransformFuncTy>		class TransformFuncTy>
ResultTy parallel_transform_reduce(IterTy Begin, IterTy End, ResultTy Init,		ResultTy parallel_transform_reduce(IterTy Begin, IterTy End, ResultTy Init,
ReduceFuncTy Reduce,		ReduceFuncTy Reduce,
TransformFuncTy Transform) {		TransformFuncTy Transform) {
// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling		// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
// overhead on large inputs.		// overhead on large inputs.
size_t NumInputs = std::distance(Begin, End);		size_t NumInputs = std::distance(Begin, End);
[...]
if (parallel::strategy.ThreadsRequested != 1) {
return;		return;
}		}
#endif		#endif
llvm::sort(Start, End, Comp);		llvm::sort(Start, End, Comp);
}		}

template <class IterTy, class FuncTy>		template <class IterTy, class FuncTy>
void parallelForEach(IterTy Begin, IterTy End, FuncTy Fn) {		void parallelForEach(IterTy Begin, IterTy End, FuncTy Fn) {
		// If we have zero or one items, then do not incur the overhead of spinning up
		// a task group. They are surprisingly expensive, and because they do not
		// support nested parallelism, a single entry task group can block parallel
		// execution underneath them.
#if LLVM_ENABLE_THREADS		#if LLVM_ENABLE_THREADS
if (parallel::strategy.ThreadsRequested != 1) {		auto NumItems = std::distance(Begin, End);
parallel::detail::parallel_for_each(Begin, End, Fn);		if (NumItems > 1 && parallel::strategy.ThreadsRequested != 1) {
		// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
		// overhead on large inputs.
		auto TaskSize = NumItems / parallel::detail::MaxTasksPerGroup;
		if (TaskSize == 0)
		TaskSize = 1;

		parallel::detail::TaskGroup TG;
		while (TaskSize < std::distance(Begin, End)) {
		TG.spawn([=, &Fn] { std::for_each(Begin, Begin + TaskSize, Fn); });
		Begin += TaskSize;
		}
		std::for_each(Begin, End, Fn);
return;		return;
}		}
#endif		#endif

std::for_each(Begin, End, Fn);		std::for_each(Begin, End, Fn);
}		}

template <class FuncTy>		template <class FuncTy>
void parallelForEachN(size_t Begin, size_t End, FuncTy Fn) {		void parallelForEachN(size_t Begin, size_t End, FuncTy Fn) {
		// If we have zero or one items, then do not incur the overhead of spinning up
		// a task group. They are surprisingly expensive, and because they do not
		// support nested parallelism, a single entry task group can block parallel
		// execution underneath them.
#if LLVM_ENABLE_THREADS		#if LLVM_ENABLE_THREADS
if (parallel::strategy.ThreadsRequested != 1) {		auto NumItems = End - Begin;
parallel::detail::parallel_for_each_n(Begin, End, Fn);		if (NumItems > 1 && parallel::strategy.ThreadsRequested != 1) {
		// Limit the number of tasks to MaxTasksPerGroup to limit job scheduling
		// overhead on large inputs.
		auto TaskSize = NumItems / parallel::detail::MaxTasksPerGroup;
		if (TaskSize == 0)
		TaskSize = 1;

		parallel::detail::TaskGroup TG;
		for (; Begin + TaskSize < End; Begin += TaskSize) {
		TG.spawn([=, &Fn] {
		for (size_t I = Begin, E = Begin + TaskSize; I != E; ++I)
		Fn(I);
		});
		}
		for (; Begin != End; ++Begin)
		Fn(Begin);
return;		return;
}		}
#endif		#endif
for (size_t I = Begin; I != End; ++I)
Fn(I);		for (; Begin != End; ++Begin)
		Fn(Begin);
}		}

template <class IterTy, class ResultTy, class ReduceFuncTy,		template <class IterTy, class ResultTy, class ReduceFuncTy,
class TransformFuncTy>		class TransformFuncTy>
ResultTy parallelTransformReduce(IterTy Begin, IterTy End, ResultTy Init,		ResultTy parallelTransformReduce(IterTy Begin, IterTy End, ResultTy Init,
ReduceFuncTy Reduce,		ReduceFuncTy Reduce,
TransformFuncTy Transform) {		TransformFuncTy Transform) {
#if LLVM_ENABLE_THREADS		#if LLVM_ENABLE_THREADS
[...]