This is an archive of the discontinued LLVM Phabricator instance.

Differential D68820

win: Move Parallel.h off concrt to cross-platform code
ClosedPublic

Authored by thakis on Oct 10 2019, 11:27 AM.

Details

Reviewers

rnk

Commits

rGd49600320598: win: Move Parallel.h off concrt to cross-platform code
rL374421: win: Move Parallel.h off concrt to cross-platform code

Summary

r179397 added Parallel.h and implemented it in terms of concrt in 2013.

In 2015, a cross-platform implementation of the functions appeared and is in use everywhere but on Windows (r232419).
r246219 hints that <thread> had issues in MSVC2013, but r296906 suggests they've been fixed now that we require 2015+.

So remove the concrt code. It's less code, and it sounds like concrt has conceptual and performance issues; see PR41198.

I built blink_core.dll in a debug component build with full symbols and in a release component build without any symbols.
I couldn't measure a performance difference for linking blink_core.dll before and after this patch.
(Raw data: https://gist.github.com/nico/d4b02c7dd835bb96ed67e919f3558e6f)

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

thakis created this revision. · Oct 10 2019, 11:27 AM

Herald added a project: Restricted Project. · Oct 10 2019, 11:27 AM

Herald added a subscriber: hiraditya.

thakis mentioned this in D59676: Make Parallel.h build with libc++ on Windows. · Oct 10 2019, 11:30 AM

thakis mentioned this in D68709: (not yet for review) win: Use cross-platform code in Parallel.h/.cpp.

thakis added subscribers: BillyONeal, aganea.

Looks very good. :)

This revision is now accepted and ready to land. · Oct 10 2019, 11:36 AM

Closed by commit rGd49600320598: win: Move Parallel.h off concrt to cross-platform code (authored by thakis). · Oct 10 2019, 12:00 PM

This revision was automatically updated to reflect the committed changes.

Looks slightly better without ConcRT, thanks Nico!
Here are some quick results showing the difference for the global hash parallelization in LLD, with MSVC OBJs. The algorithm iterates over the .debug$T records for a few thousand OBJs on 72 hyper-threads:
(we're saving about 2 secs on this test)

Before: [image not preserved in this archive]

After: [image not preserved in this archive]

There's still this memory map lock eating half of the CPU time; I'm not sure yet how to avoid it, if at all. Maybe touch the pages for the OBJ files in advance? Anyway, unrelated.

BillyONeal added inline comments. · Oct 10 2019, 9:24 PM

llvm/include/llvm/Support/Parallel.h
124	If you get a chance to benchmark I'm curious how this compares to our std::sort(std::execution::par, ...) version :)

aganea marked an inline comment as done. · Oct 11 2019, 8:31 AM

aganea added inline comments.

llvm/include/llvm/Support/Parallel.h
124	I ran a few AB/BA tests on LLD with my dataset. The cumulative time on all cores with ConcRT is consistently higher by about 300 ms on my 36-core Skylake (~1.9 sec for the ConcRT version, ~1.6 sec after this patch). There are only three places where we `parallelSort` in LLD, so maybe this is not representative. But the dataset is quite big, ~22 GB of OBJs and LIBs. This is a Unity build of the Editor Release target of one of our games. I can also try with no Unity files; usually the dataset is about an order of magnitude greater. Before: [image not preserved in this archive] After: [image not preserved in this archive]

BillyONeal added inline comments. · Oct 11 2019, 9:12 AM

llvm/include/llvm/Support/Parallel.h
124	Not concrt; the std::sort(par...) standard parallel algorithm is an unrelated implementation.

aganea mentioned this in D69582: Let clang driver support parallel jobs. · Oct 31 2019, 2:48 PM

rnk added inline comments. · May 14 2020, 12:06 PM

llvm/lib/Support/Parallel.cpp
73	I've belatedly realized that this means that LLVM is doing thread management on its own, i.e. every linker invocation spawns `hardware_concurrency()` threads. My understanding is that ConCRT is built on the system worker thread pool, which helps prevent oversubscription of CPU resources. While @aganea measured that this change improved benchmarks, this change could lead to bad throughput when multiple link jobs run concurrently. Today, LLD is not very parallel, but this may become more of an issue as we use more and more parallelism for debug info merging. At some point in the future, we should try measuring the impact of this change on the performance of three links running in parallel, and see if using the NT worker pool gives benefits in that case. For now, though, let's not get ahead of ourselves with unmeasured concerns and leave this as is.
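
To put the oversubscription concern in concrete terms: three concurrent links that each spawn `hardware_concurrency()` workers on a 72-thread machine would ask for 216 runnable threads. The sketch below shows an even split of the hardware threads across jobs. `threadsPerJob` and `defaultThreadsPerJob` are hypothetical helpers for illustration, not LLVM APIs, and the sketch assumes the number of concurrent jobs is already known, which a single linker process normally cannot see.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper (not an LLVM API): split the machine's hardware threads
// evenly across ConcurrentJobs processes, giving each job at least one thread.
inline unsigned threadsPerJob(unsigned HardwareThreads,
                              unsigned ConcurrentJobs) {
  // std::thread::hardware_concurrency() may return 0 when the value is not
  // computable, so treat 0 as "one thread".
  if (HardwareThreads == 0)
    HardwareThreads = 1;
  if (ConcurrentJobs == 0)
    ConcurrentJobs = 1;
  unsigned Share = std::max(1u, HardwareThreads / ConcurrentJobs);
  assert(Share >= 1 && Share <= HardwareThreads);
  return Share;
}

// Same split, using the thread count reported for the current machine.
inline unsigned defaultThreadsPerJob(unsigned ConcurrentJobs) {
  return threadsPerJob(std::thread::hardware_concurrency(), ConcurrentJobs);
}
```

With these assumptions, `threadsPerJob(72, 3)` yields 24 threads per link instead of 72 each.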

aganea added inline comments. · May 14 2020, 2:44 PM

llvm/lib/Support/Parallel.cpp
73	One cheap alternative is to always use `heavyweight_hardware_concurrency()` by default, and let the user pass `--threads=%NUMBER_OF_PROCESSORS%` if they want `hardware_concurrency()`. In the absence of a global decision-maker, `heavyweight_hardware_concurrency()` is a bit of a hack.

Letting an external build system like Ninja do that through static flags, i.e. `--threads` or `/opt:lldltojobs`, doesn't work too well either. You can end up with large spans of time where nothing is happening, because that part of the application (LLD) isn't multi-threaded, or because the `ThreadPool`'s jobs are cooling down, as below at time 100: [image not preserved in this archive]

I've tried increasing the number of threads to see how it would react. It seems every extra ThinLTO thread above my hardware thread count adds roughly 150 ms to the execution. For example, running an input on 72 threads takes 108 sec, while the same input on 100 threads takes 113 sec. I don't know if the relation is linear, but it gives an idea. Context-switching between applications would probably be even more costly; I assume two lld-link processes running side by side, each using 72 threads, would cost even more.

I think a platform-independent solution is needed here. If we have several LLDs running, we could dynamically throttle the number of threads for each `ThreadPool` through some kind of IPC. We "just" need to ensure there aren't more than N threads at one time, while taking into account affinity, hyper-threading/cache affinity, core-local memory, and multi-socket machines.

How would we interface with Ninja? LLD wouldn't know how many free "lanes" Ninja has. Should we retain, increase, or remove `LLVM_PARALLEL_LINK_JOBS`? We could build some kind of generic IPC API to be used in Ninja, but then what happens for build systems that don't implement it (make, Fastbuild, MSBuild, etc.)?

Another way would be to embed the compiler and the linker into the build system (not necessarily in the way I was showing last year). There's value in doing so: one example is the usage of clang-scan-deps I was showing, which lets the build system extract dependency information very quickly instead of invoking thousands of processes, while doing memoization as much as possible. The same thing can be achieved for pre-processing, compilation or linking. Lots of things to be done, not enough time :-)
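
The in-process half of "no more than N threads at one time" can be sketched with a counting limiter built from a mutex and a condition variable. This is only an illustration: `ConcurrencyLimiter` and `peakConcurrency` are hypothetical names, and throttling several LLD processes against each other would additionally need the IPC layer discussed above, which this sketch does not attempt.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical in-process limiter: at most Limit callers hold a slot at once.
class ConcurrencyLimiter {
public:
  explicit ConcurrencyLimiter(unsigned Limit) : Free(Limit) {
    assert(Limit > 0 && "a zero limit would block every caller forever");
  }

  // Block until a slot is free, then take it.
  void acquire() {
    std::unique_lock<std::mutex> Lock(Mutex);
    Cond.wait(Lock, [this] { return Free > 0; });
    --Free;
  }

  // Give the slot back and wake one waiter.
  void release() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      ++Free;
    }
    Cond.notify_one();
  }

private:
  std::mutex Mutex;
  std::condition_variable Cond;
  unsigned Free;
};

// Start Tasks threads that each take a slot, and report the largest number
// that held a slot at the same time.
inline unsigned peakConcurrency(unsigned Tasks, unsigned Limit) {
  ConcurrencyLimiter Limiter(Limit);
  std::atomic<unsigned> Running{0};
  std::atomic<unsigned> Peak{0};
  std::vector<std::thread> Threads;
  for (unsigned I = 0; I < Tasks; ++I)
    Threads.emplace_back([&] {
      Limiter.acquire();
      unsigned Now = ++Running;
      // Record a new maximum if this thread observed one.
      unsigned Old = Peak.load();
      while (Now > Old && !Peak.compare_exchange_weak(Old, Now)) {
      }
      std::this_thread::yield(); // stand-in for real work
      --Running;
      Limiter.release();
    });
  for (std::thread &T : Threads)
    T.join();
  return Peak.load();
}
```

Calling `peakConcurrency(16, 3)` starts 16 threads but never lets more than 3 of them run the guarded section together.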

BTW ThreadPoolExecutor has undefined behavior since there's a detached thread touching the standard library when the program exits violating [basic.start.term]/6. Detached threads are almost never safe.

In D68820#2037557, @BillyONeal wrote:

BTW ThreadPoolExecutor has undefined behavior since there's a detached thread touching the standard library when the program exits violating [basic.start.term]/6. Detached threads are almost never safe.

This was fixed later by D70447.
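
One way to avoid the detached-thread hazard described above is to keep the `std::thread` objects and join them before the pool is destroyed. The sketch below is illustrative only: `JoiningPool` and `runAndCount` are hypothetical names, and it is not claimed to be how D70447 changed `ThreadPoolExecutor`.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool that owns its workers and joins them in the destructor,
// so no worker can still be running when the pool's members are destroyed.
class JoiningPool {
public:
  explicit JoiningPool(unsigned ThreadCount) {
    assert(ThreadCount > 0 && "need at least one worker to drain the queue");
    for (unsigned I = 0; I < ThreadCount; ++I)
      Workers.emplace_back([this] { work(); });
  }

  ~JoiningPool() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Stop = true;
    }
    Cond.notify_all();
    for (std::thread &T : Workers)
      T.join(); // workers finish the remaining tasks, then exit
  }

  void add(std::function<void()> F) {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Queue.push_back(std::move(F));
    }
    Cond.notify_one();
  }

private:
  void work() {
    for (;;) {
      std::function<void()> Task;
      {
        std::unique_lock<std::mutex> Lock(Mutex);
        Cond.wait(Lock, [this] { return Stop || !Queue.empty(); });
        if (Queue.empty())
          return; // Stop was requested and nothing is left to run
        Task = std::move(Queue.back()); // last in, first out
        Queue.pop_back();
      }
      Task();
    }
  }

  std::mutex Mutex;
  std::condition_variable Cond;
  std::vector<std::function<void()>> Queue;
  std::vector<std::thread> Workers;
  bool Stop = false;
};

// Run Tasks increments on a pool of Threads workers and return the total.
inline int runAndCount(unsigned Threads, int Tasks) {
  std::atomic<int> Count{0};
  {
    JoiningPool Pool(Threads);
    for (int I = 0; I < Tasks; ++I)
      Pool.add([&Count] { ++Count; });
  } // the destructor drains the queue and joins every worker here
  return Count.load();
}
```

Because the destructor joins, `runAndCount(4, 100)` returns only after all 100 tasks have run.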

Revision Contents

Path                                    Size
llvm/include/llvm/Support/Parallel.h    27 lines
llvm/lib/Support/Parallel.cpp           31 lines

Diff 224434

llvm/include/llvm/Support/Parallel.h

[12 lines not shown]
 #include "llvm/Config/llvm-config.h"
 #include "llvm/Support/MathExtras.h"

 #include <algorithm>
 #include <condition_variable>
 #include <functional>
 #include <mutex>

-#if defined(_MSC_VER) && LLVM_ENABLE_THREADS
-#pragma warning(push)
-#pragma warning(disable : 4530)
-#include <concrt.h>
-#include <ppl.h>
-#pragma warning(pop)
-#endif
-
 namespace llvm {

 namespace parallel {
 struct sequential_execution_policy {};
 struct parallel_execution_policy {};

 template <typename T>
 struct is_execution_policy
[42 lines not shown] public:
   TaskGroup();
   ~TaskGroup();

   void spawn(std::function<void()> f);

   void sync() const { L.sync(); }
 };

-#if defined(_MSC_VER)
-template <class RandomAccessIterator, class Comparator>
-void parallel_sort(RandomAccessIterator Start, RandomAccessIterator End,
-                   const Comparator &Comp) {
-  concurrency::parallel_sort(Start, End, Comp);
-}
-template <class IterTy, class FuncTy>
-void parallel_for_each(IterTy Begin, IterTy End, FuncTy Fn) {
-  concurrency::parallel_for_each(Begin, End, Fn);
-}
-
-template <class IndexTy, class FuncTy>
-void parallel_for_each_n(IndexTy Begin, IndexTy End, FuncTy Fn) {
-  concurrency::parallel_for(Begin, End, Fn);
-}
-
-#else
 const ptrdiff_t MinParallelSize = 1024;

 /// Inclusive median.
 template <class RandomAccessIterator, class Comparator>
 RandomAccessIterator medianOf3(RandomAccessIterator Start,
                                RandomAccessIterator End,
                                const Comparator &Comp) {
   RandomAccessIterator Mid = Start + (std::distance(Start, End) / 2);
[29 lines not shown] void parallel_quick_sort(RandomAccessIterator Start, RandomAccessIterator End,
   });
   parallel_quick_sort(Pivot + 1, End, Comp, TG, Depth - 1);
 }

 template <class RandomAccessIterator, class Comparator>
 void parallel_sort(RandomAccessIterator Start, RandomAccessIterator End,
                    const Comparator &Comp) {
   TaskGroup TG;
   parallel_quick_sort(Start, End, Comp, TG,
                       llvm::Log2_64(std::distance(Start, End)) + 1);
 }

 template <class IterTy, class FuncTy>
 void parallel_for_each(IterTy Begin, IterTy End, FuncTy Fn) {
   // TaskGroup has a relatively high overhead, so we want to reduce
   // the number of spawn() calls. We'll create up to 1024 tasks here.
   // (Note that 1024 is an arbitrary number. This code probably needs
[25 lines not shown] for (; I + TaskSize < End; I += TaskSize) {
     });
   }
   for (IndexTy J = I; J < End; ++J)
     Fn(J);
 }

 #endif

-#endif
-
 template <typename Iter>
 using DefComparator =
     std::less<typename std::iterator_traits<Iter>::value_type>;

 } // namespace detail

 // sequential algorithm implementations.
 template <class Policy, class RandomAccessIterator,
[50 lines not shown]
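
The chunking idea in the `parallel_for_each` comment above can be sketched on its own. `chunkRanges` is a hypothetical helper, not part of Parallel.h. It mirrors the visible loop shape (`for (; I + TaskSize < End; I += TaskSize)` followed by a tail) and assumes, because those lines are hidden in this view, that the task size is the element count divided by 1024 with a minimum of 1.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split the half-open index range [Begin, End) into
// contiguous chunks so that one task can be spawned per chunk instead of one
// task per element.
inline std::vector<std::pair<std::size_t, std::size_t>>
chunkRanges(std::size_t Begin, std::size_t End, std::size_t MaxTasks = 1024) {
  assert(MaxTasks > 0 && "need a positive task budget");
  std::vector<std::pair<std::size_t, std::size_t>> Ranges;
  if (Begin >= End)
    return Ranges;

  // Assumed sizing rule: elements per task, rounded down, but never zero.
  std::size_t TaskSize = (End - Begin) / MaxTasks;
  if (TaskSize == 0)
    TaskSize = 1;

  // Full-size chunks; the loop stops while at least one element remains.
  std::size_t I = Begin;
  for (; I + TaskSize < End; I += TaskSize)
    Ranges.emplace_back(I, I + TaskSize);

  // Tail chunk. The code in the diff runs this remainder inline on the
  // calling thread rather than spawning a task for it.
  Ranges.emplace_back(I, End);
  return Ranges;
}
```

Under this rule 4096 indices give exactly 1024 chunks of 4, while 5000 indices give 1250 chunks of 4, so with rounded-down sizing the 1024 figure behaves as a target rather than a strict ceiling.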

llvm/lib/Support/Parallel.cpp

[26 lines not shown]
 class Executor {
 public:
   virtual ~Executor() = default;
   virtual void add(std::function<void()> func) = 0;

   static Executor *getDefaultExecutor();
 };

-#if defined(_MSC_VER)
-/// An Executor that runs tasks via ConcRT.
-class ConcRTExecutor : public Executor {
-  struct Taskish {
-    Taskish(std::function<void()> Task) : Task(Task) {}
-
-    std::function<void()> Task;
-
-    static void run(void *P) {
-      Taskish *Self = static_cast<Taskish *>(P);
-      Self->Task();
-      concurrency::Free(Self);
-    }
-  };
-
-public:
-  virtual void add(std::function<void()> F) {
-    Concurrency::CurrentScheduler::ScheduleTask(
-        Taskish::run, new (concurrency::Alloc(sizeof(Taskish))) Taskish(F));
-  }
-};
-
-Executor *Executor::getDefaultExecutor() {
-  static ConcRTExecutor exec;
-  return &exec;
-}
-
-#else
 /// An implementation of an Executor that runs closures on a thread pool
 /// in filo order.
 class ThreadPoolExecutor : public Executor {
 public:
   explicit ThreadPoolExecutor(unsigned ThreadCount = hardware_concurrency())
       : Done(ThreadCount) {
     // Spawn all but one of the threads in another thread as spawning threads
     // can take a while.
     std::thread([&, ThreadCount] {
       for (size_t i = 1; i < ThreadCount; ++i) {
         std::thread([=] { work(); }).detach();
       }
       work();
     }).detach();
   }

   ~ThreadPoolExecutor() override {
     std::unique_lock<std::mutex> Lock(Mutex);
     Stop = true;
[30 lines not shown] private:
   std::condition_variable Cond;
   parallel::detail::Latch Done;
 };

 Executor *Executor::getDefaultExecutor() {
   static ThreadPoolExecutor exec;
   return &exec;
 }
-#endif
-}
+} // namespace

 static std::atomic<int> TaskGroupInstances;

 // Latch::sync() called by the dtor may cause one thread to block. If is a dead
 // lock if all threads in the default executor are blocked. To prevent the dead
 // lock, only allow the first TaskGroup to run tasks parallelly. In the scenario
 // of nested parallel_for_each(), only the outermost one runs parallelly.
 TaskGroup::TaskGroup() : Parallel(TaskGroupInstances++ == 0) {}
[18 lines not shown]
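
The "only the outermost TaskGroup runs in parallel" rule from the comment above can be sketched with a single atomic counter. `GroupGuard`, `LiveGroups`, and `innerRunsParallel` are hypothetical names for illustration. The sketch also assumes the counter is decremented when a group is destroyed, which falls in the lines hidden in this view.

```cpp
#include <atomic>
#include <cassert>

// Number of guards currently alive (stands in for TaskGroupInstances).
static std::atomic<int> LiveGroups{0};

// Hypothetical guard: the first live instance may run work in parallel; any
// instance created while another is alive must run its work sequentially, so
// a nested group never waits on pool threads that are themselves blocked.
class GroupGuard {
public:
  GroupGuard() : Parallel(LiveGroups++ == 0) {}

  ~GroupGuard() {
    int Remaining = --LiveGroups;
    assert(Remaining >= 0 && "unbalanced guard lifetime");
    (void)Remaining;
  }

  bool isParallel() const { return Parallel; }

private:
  const bool Parallel;
};

// Create an outer guard and a nested one, and report the nested guard's mode.
inline bool innerRunsParallel() {
  GroupGuard Outer; // first live guard: allowed to run in parallel
  GroupGuard Inner; // created while Outer is alive: sequential
  return Inner.isParallel();
}
```

Once the outer guard is destroyed the counter returns to zero, so the next top-level guard is parallel again.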