This is an archive of the discontinued LLVM Phabricator instance.

[Support][Parallel] Initialize threadIndex and add assertion checking its usage.
ClosedPublic

Authored by avl on Apr 21 2023, 4:58 AM.

Details

Summary

This patch adds a check that threadIndex is used only by threads
created by ThreadPoolExecutor. This helps catch two types of errors:

  1. If a thread is not created by ThreadPoolExecutor, its index may clash with the index of another thread. Using threadIndex in that case may lead to a data race.
  2. The index of the main thread (threadIndex == 0) currently clashes with the index of thread0 among the ThreadPoolExecutor threads. That may lead to a data race if the main thread and thread0 run concurrently.

This patch allows executing tasks on the main thread only when
parallel::strategy.ThreadsRequested == 1. In all other cases,
assertions check that threadIndex != UINT_MAX (i.e. that the task
is executed on a thread created by ThreadPoolExecutor).
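
As a rough illustration of that check (a sketch with assumed names and placement, not the actual diff):

// Every thread starts with a sentinel index. ThreadPoolExecutor overwrites it
// with a real index for each worker thread it creates, so any other thread
// still holding UINT_MAX can be detected.
thread_local unsigned threadIndex = UINT_MAX;

unsigned getThreadIndex() {
  assert(threadIndex != UINT_MAX &&
         "getThreadIndex() used on a thread not created by ThreadPoolExecutor");
  return threadIndex;
}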

This patch is a follow-up to the discussion in D142318.

Depends on D148728

Diff Detail

Event Timeline

avl created this revision.Apr 21 2023, 4:58 AM
avl requested review of this revision.Apr 21 2023, 4:58 AM
Herald added a project: Restricted Project.Apr 21 2023, 4:58 AM
avl edited the summary of this revision.Apr 21 2023, 5:27 AM
andrewng added inline comments.Apr 21 2023, 8:29 AM
llvm/lib/Support/Parallel.cpp
226

The following is the commit for which the NumItems <= 1 case was important: 5fa9d416 "[Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.". I don't know if it's still important, but it probably should be checked.

avl added inline comments.Apr 22 2023, 2:50 AM
llvm/lib/Support/Parallel.cpp
226

The following is the commit for which the NumItems <= 1 case was important: 5fa9d416 "[Support/Parallel] Add a special case for 0/1 items to llvm::parallel_for_each.". I don't know if it's still important, but it probably should be checked.

Thank you for the pointer!

That problem is still important. I would suggest using another workaround, though.

The original problem is that the parallelFor() functions do not support nested parallelism.
If nested TaskGroups were allowed to be parallel, a deadlock could happen:
when all threads are busy waiting for nested tasks, the task queue does not move.
The test case from https://github.com/llvm/circt/issues/993 looks like this:

parallelFor (Files) {
   parallelFor(Passes) {
   }
}

The nested loop over Passes is not parallel since nested parallelism is forbidden. But if "Files" contains only one file,
the workaround is to run the root parallelFor sequentially and the nested parallelFor in parallel, which gives a performance improvement.
That is why the NumItems workaround is necessary.
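
For context, the 0/1 special case under discussion boils down to something like this (a simplified sketch, not the exact code in Parallel.cpp):

void parallelFor(size_t Begin, size_t End,
                 llvm::function_ref<void(size_t)> Fn) {
  // Special case for 0/1 items: run directly on the calling thread, so a
  // parallel loop nested underneath is still allowed to run in parallel.
  if (End - Begin <= 1) {
    for (size_t I = Begin; I != End; ++I)
      Fn(I);
    return;
  }
  // ... otherwise split the range into tasks executed by ThreadPoolExecutor ...
}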

The drawbacks of the workaround:

  1. It helps only one concrete case.
  2. It makes using getThreadIndex() dangerous if parallelFor() is called from different threads: parallelFor() may run on the calling threads, and then the same threadIndex would be used concurrently.

The following workaround could probably solve the problem better:

std::function<void()> Fn = [&]() {
  parallelFor(Passes) {
  }
};

ThreadPool Pool;

for (Files)
  Pool.async(Fn);

Pool.wait();

It does not have the problem with using threadIndex, and it could also give more parallelism. Let's say we have two files:
the current solution will use only two threads (as the internal parallelFor is sequential), while the above solution would use all
available threads.
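
Spelled out a bit more concretely (File, Passes and runPass are placeholders for the caller's data, not existing APIs), the suggested approach might look like:

#include "llvm/Support/Parallel.h"
#include "llvm/Support/ThreadPool.h"

void processFiles(llvm::MutableArrayRef<File> Files) {
  llvm::ThreadPool Pool;
  for (size_t I = 0, E = Files.size(); I != E; ++I)
    Pool.async([&Files, I] {
      // The inner loop is a top-level parallelFor on a pool thread, so it is
      // not subject to the nested-parallelism restriction.
      llvm::parallelFor(0, Files[I].Passes.size(),
                        [&](size_t P) { runPass(Files[I], P); });
    });
  Pool.wait();
}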

Another solution would be to implement support for nested parallelism in TaskGroups, but that looks to be a much bigger task.

I will put this info into https://github.com/llvm/circt/issues/993

avl added inline comments.Apr 25 2023, 2:22 AM
llvm/lib/Support/Parallel.cpp
226

@lattner Chris, would you mind commenting on whether it would be OK to remove the "NumItems > 1" optimization and use ThreadPool for nested parallelFor instead, please?

I haven't looked in detail at this patch, but I can describe the problem that I was hitting. The situation was happening in MLIR and other clients that expose nested parallelism, e.g. in their parallel pass managers. The thread pool at the time refused to do nested cases in parallel, so if you did:

parallel for x in ...
  parallel for y in ...
     work

it would do the outer loop in parallel and then refuse to do the inner loop in parallel. This was sort of ok, but was really really bad when the outer loop actually had only a single entry. That would end up running the inner loop in series even though there was no parallelism in play.

The change to do 0/1 without "parallelism" led to a 50x speedup on a CIRCT workload. It also eliminates a bit of fixed overhead in those small cases, but that is a more minor issue.

-Chris

avl added a comment.EditedApr 25 2023, 8:58 AM

I haven't looked in detail at this patch, but I can describe the problem that I was hitting. The situation was happening in MLIR and other clients that expose nested parallelism, e.g. in their parallel pass managers. The thread pool at the time refused to do nested cases in parallel, so if you did:

parallel for x in ...
  parallel for y in ...
     work

it would do the outer loop in parallel and then refuse to do the inner loop in parallel. This was sort of ok, but was really really bad when the outer loop actually had only a single entry. That would end up running the inner loop in series even though there was no parallelism in play.

The change to do 0/1 without "parallelism" led to a 50x speedup on a CIRCT workload. It also eliminates a bit of fixed overhead in those small cases, but that is a more minor issue.

-Chris

I suggest using another approach to get more parallelism:

std::function<void()> Fn = [&]() {
  parallel for y in ...
};

ThreadPool Pool;

for x in ...
  Pool.async(Fn);

Pool.wait();

What do you think of it? It could give more benefit than the 0/1 approach.

avl added a comment.Apr 25 2023, 2:16 PM

Btw, looking at circt/llvm/mlir/include/mlir/IR/Threading.h, it seems that it does not use llvm::parallel::parallelFor and already uses a similar solution with a separate ThreadPool. Thus it looks like we can remove the "NumItems > 1" workaround in llvm/lib/Support/Parallel.cpp:

template <typename IteratorT, typename FuncT>
LogicalResult failableParallelForEach(MLIRContext *context, IteratorT begin,
                                      IteratorT end, FuncT &&func) {
  unsigned numElements = static_cast<unsigned>(std::distance(begin, end));
  if (numElements == 0)
    return success();

  // If multithreading is disabled or there is a small number of elements,
  // process the elements directly on this thread.
  if (!context->isMultithreadingEnabled() || numElements <= 1) {
    for (; begin != end; ++begin)
      if (failed(func(*begin)))
        return failure();
    return success();
  }

  // Build a wrapper processing function that properly initializes a parallel
  // diagnostic handler.
  ParallelDiagnosticHandler handler(context);
  std::atomic<unsigned> curIndex(0);
  std::atomic<bool> processingFailed(false);
  auto processFn = [&] {
    while (!processingFailed) {
      unsigned index = curIndex++;
      if (index >= numElements)
        break;
      handler.setOrderIDForThread(index);
      if (failed(func(*std::next(begin, index))))
        processingFailed = true;
      handler.eraseOrderIDForThread();
    }
  };

  // Otherwise, process the elements in parallel.
  llvm::ThreadPool &threadPool = context->getThreadPool();
  llvm::ThreadPoolTaskGroup tasksGroup(threadPool);
  size_t numActions = std::min(numElements, threadPool.getThreadCount());
  for (unsigned i = 0; i < numActions; ++i)
    tasksGroup.async(processFn);
  // If the current thread is a worker thread from the pool, then waiting for
  // the task group allows the current thread to also participate in processing
  // tasks from the group, which avoid any deadlock/starvation.
  tasksGroup.wait();
  return failure(processingFailed);
}

It isn't just the function pass manager:

~/Projects/circt> grep -i -r parallel lib (trimmed):
lib/Transforms/StripDebugInfoWithPred.cpp:  parallelForEach(
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  parallelForEach(context, thingsToEmit, [&](StringOrOpToEmit &stringOrOp) {
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  if (failed(failableParallelForEach(
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  emitter.emitOps(list, output->os(), /*parallelize=*/false);
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  // Emit each file in parallel if context enables it.
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  parallelForEach(module->getContext(), emitter.files.begin(),
lib/Conversion/ExportVerilog/ExportVerilog.cpp:  if (failed(failableParallelForEach(
lib/Conversion/ExportVerilog/LegalizeNames.cpp:  mlir::parallelForEach(
lib/Conversion/FIRRTLToHW/LowerToHW.cpp:    memories = llvm::parallelTransformReduce(
lib/Conversion/FIRRTLToHW/LowerToHW.cpp:  auto result = mlir::failableParallelForEachN(
lib/Dialect/HW/InnerSymbolTable.cpp:  return mlir::failableParallelForEach(
lib/Dialect/HW/InnerSymbolTable.cpp:  return mlir::failableParallelForEach(
lib/Dialect/FIRRTL/Transforms/IMDeadCodeElim.cpp:  mlir::parallelForEach(circuit.getContext(),
lib/Dialect/FIRRTL/Transforms/IMConstProp.cpp:  mlir::parallelForEach(circuit.getContext(),
lib/Dialect/FIRRTL/Transforms/VBToBV.cpp:      failableParallelForEach(&getContext(), modules, [&](FModuleOp module) {
lib/Dialect/FIRRTL/Transforms/InnerSymbolDCE.cpp:  parallelForEach(&getContext(), modules,
lib/Dialect/FIRRTL/Transforms/LowerTypes.cpp:  auto result = failableParallelForEach(&getContext(), ops, lowerModules);
lib/Dialect/FIRRTL/Transforms/MemToRegOfVec.cpp:    mlir::parallelForEach(circtOp.getContext(), dutModuleSet,
lib/Dialect/FIRRTL/Import/FIRParser.cpp:  auto anyFailed = mlir::failableParallelForEachN(

similarly, the MLIR inliner, verifier, and pass manager have things like:

./mlir/lib/Pass/Pass.cpp:  if (failed(failableParallelForEach(context, opInfos, processFn)))
./mlir/lib/IR/Verifier.cpp:  if (failed(failableParallelForEach(
./mlir/lib/Transforms/Inliner.cpp:  return failableParallelForEach(ctx, nodesToVisit, [&](CallGraphNode *node) {
avl added a comment.Apr 26 2023, 2:12 AM

All the above places are calls to mlir::failableParallelForEach, right? llvm::parallel::parallelFor is not used. Thus, it looks like llvm::parallel::parallelFor may be changed?

All the above places are calls to mlir::failableParallelForEach, right? llvm::parallel::parallelFor is not used. Thus, it looks like llvm::parallel::parallelFor may be changed?

I haven't checked thoroughly, but it does look like MLIR has moved away from llvm::parallel, and its parallelFor implementation supports nested parallel for loops.

llvm/include/llvm/Support/Parallel.h
50–54

I tend to get a bad feeling when code is duplicated, as it is here and in Parallel.cpp. I know macros aren't great either, but perhaps define the "implementation" as a macro and use that here and in Parallel.cpp?

llvm/lib/Support/Parallel.cpp
249–250

IIUC, this will be executed if parallel::strategy.ThreadsRequested == 1 and in this case, getThreadIndex() will return UINT_MAX. Is this the intended behaviour?

avl added inline comments.Apr 26 2023, 9:26 AM
llvm/lib/Support/Parallel.cpp
249–250

No, that is not intended. Good catch! Thank you.

avl updated this revision to Diff 517314.Apr 26 2023, 2:12 PM

addressed comments (used macros for the getThreadIndex implementation,
set threadIndex to 0 when parallel::strategy.ThreadsRequested == 1)

andrewng added inline comments.Apr 28 2023, 7:34 AM
llvm/include/llvm/Support/Parallel.h
32–37

I think this can be moved within the #if LLVM_ENABLE_THREADS just below.

Perhaps change "the" -> "a" in the assert string.

llvm/lib/Support/Parallel.cpp
103

This setting of parallel::threadIndex = 0 (also in llvm::parallelFor) feels a little "odd". Just wondering if an alternative would be to check parallel::strategy.ThreadsRequested == 1 in getThreadIndex() and return 0 there?
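
Concretely, that alternative would make getThreadIndex() look roughly like this (a sketch, not the literal patch):

unsigned getThreadIndex() {
  // With only one thread requested, tasks legitimately run on the main
  // thread, which never gets a ThreadPoolExecutor-assigned index.
  if (parallel::strategy.ThreadsRequested == 1)
    return 0;
  assert(threadIndex != UINT_MAX &&
         "getThreadIndex() used on a thread not created by ThreadPoolExecutor");
  return threadIndex;
}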

avl updated this revision to Diff 518158.Apr 29 2023, 3:45 AM

addressed comments (added a check for parallel::strategy.ThreadsRequested == 1 into getThreadIndex)

andrewng accepted this revision.May 2 2023, 6:17 AM

LGTM.

This revision is now accepted and ready to land.May 2 2023, 6:17 AM
avl added a comment.May 2 2023, 8:46 AM

Thank you for the review!

It looks like this caused a failure on AIX; could you take a look, please?

https://lab.llvm.org/buildbot/#/builders/214/builds/7230/steps/6/logs/FAIL__LLVM-Unit__84

avl added a comment.May 3 2023, 6:51 AM

It looks like this caused a failure on AIX; could you take a look, please?

https://lab.llvm.org/buildbot/#/builders/214/builds/7230/steps/6/logs/FAIL__LLVM-Unit__84

Hi, thank you for reporting the issue. At first glance it looks unrelated, but I will check more.

https://lab.llvm.org/buildbot/#/builders/214/builds/7230/steps/6/logs/FAIL__LLVM-Unit__84

Hi, thank you for reporting the issue. At first glance it looks unrelated, but I will check more.

I suspect it's more likely because of D149173.

My mistake, I confirmed locally that it's D149173. Thanks.