Uses ConcRT and PPL on Windows.
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | You might want to add an else case with an assert. | |
69–70 | I think size_t is pretty big for a thread count. | |
74–81 | Does not handle the case when some of the threads can't be created. | |
74–82 | Nice! | |
80 | It might be good to add a name to the task, for tracing purposes. | |
98 | This could return an ErrorOr to indicate success/failure of this thread. | |
104–105 | You would need to unlock the mutex here. | |
109 | It would be nice to get the task's return status, check for success/failure, and set the stop field. |
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | There's nothing to assert here. | |
69–70 | I'll change it to the return type of hardware_concurrency (unsigned). | |
74–81 | We don't use exception handling. There's no way to detect this case. | |
80 | Which task? | |
98 | It can't fail. And what would look at the return value? | |
104–105 | unique_lock unlocks the mutex in the destructor. | |
109 | Task returns void. |
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | What I meant was: if the latch were decremented twice, you could catch that error. | |
80 | Any task that the thread pool would manage. For example, when you add tasks that read input files in parallel in lld, you could attach the name of the file so that you know when the thread starts processing the work. | |
98 | I thought the task scheduled to execute on this thread could fail, and the whole process would have to be stopped. For example: if you parsed all the files in lld in parallel and one of them was a bad file, you might want to return that failure immediately from the thread and stop everything else. | |
109 | OK. I was thinking that if tasks were made to return an ErrorOr/bool, you could get the task's return value and stop the other threads on failure. |
Michael, can you add more background on where you want to go with this? On darwin the preferred way to parallelize is to use libdispatch (aka GCD). In that model, you divide up your work into small chunks as functions/lambdas/blocks, and use serial or parallel queues to have the OS (libdispatch) run them all in the correct order as efficiently as possible (see http://libdispatch.macosforge.org/trac/wiki/tutorial for a good tutorial). For instance, this code might spin up hardware_concurrency() threads, but another process could be doing the same thing. Then you have two processes thrashing each other for CPU threads. With libdispatch you never say how many threads to use, and the OS balances across processes.
So, I'd like a design that allows lld parallelism to be layered on libdispatch for darwin.
include/lld/Core/Parallel.h | ||
---|---|---|
69–70 | The core count is probably not a good choice for the default, since you then risk getting blocked waiting on IO. Since LLD uses memory-mapped IO, there's no way to do it "async": the IO is incurred when you take a major fault (which is transparent to the program). So, for tasks that require IO, you really need as many threads as parallel IOs you are going to be doing. Another alternative is careful use of madvise(), but I don't think LLD is currently introspective enough about its access patterns to do that effectively at this point. For compute-intensive tasks, though, core count is probably fine. So I would discourage having this take a default. Only the client of the API really knows how much parallelism it needs, and it should specify it explicitly. | |
195–197 | Have you benchmarked this? I'm curious what the cutoff is before this starts beating std::sort. (especially libcxx's std::sort). I wouldn't be surprised if the cutoff is *very* large. | |
199 | This algorithm can go quadratic. You should fall back to std::sort if the depth becomes larger than logarithmic in the length of the range to be sorted to ensure n*log(n). Also, it would make sense to fall back to std::sort if the depth gets greater than log2(coreCount) to avoid increasing the parallelism beyond the number of cores. | |
214–220 | You can "tail call" into one of these. |
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | You mean below 0? Decrementing twice is perfectly valid. | |
80 | That's what the ScopedTask stuff is for. That overhead shouldn't be added by default. | |
98 | That should be handled at a higher level. ThreadPoolExecutor threads live until the process exits. | |
109 | An Executor is a low level interface for tasks. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3562.pdf for what it's based on. Also, we don't want other threads to stop when a task reports an error. We may be able to recover from it. All threads will stop before the process exits anyway. If a task needs to return a value, it should use std::packaged_task. If you find the need to use this often, it's trivial to make a wrapper for it. |
> So, I'd like a design that allows lld parallelism to be layered on libdispatch for darwin.
The executor model is designed to allow this. I did this for Windows by using ConcRT in the ConcRTExecutor class.
include/lld/Core/Parallel.h | ||
---|---|---|
69–70 | This is for the default Executor, there's really not a better choice to make. If a specific client finds that it needs more, it can create its own ThreadPoolExecutor and give some other thread count. Removing the default here would just move it over to getDefaultExecutor(). | |
195–197 | I benchmarked this with ParallelTest.cpp with an array taking up 1GiB (that's as high as MSVC will go, even in x64 mode :(). This was the best cutoff for that specific case. I expect it may vary widely depending on how expensive your comparator is. PPL actually takes this as a parameter to parallel_sort. | |
199 | Good idea on the max depth to use. As for the max task count, I found limiting the max splits to the number of cores to be slower, as chunks take differing times to complete. Smaller chunks balance better. | |
214–220 | yep. |
On a side note: now that it seems that LLD is starting to get into a phase of its life where it is being optimized, we *really* should have bots that test for performance regressions. Otherwise, we have no real baseline and can't tell whether things are actually getting faster or not.
We probably want at least:
- Release build of LLVM+Clang
- Debug build of LLVM+Clang
- Release build of Chromium (or some other gargantuan project)
- Debug build of Chromium
on linux, mac, windows. Especially now that system dependent parallel stuff is being added, it is important to have a common baseline for meaningful progress to be made (commits benefitting one platform may adversely affect another). It shouldn't take much computing power since the link step does not take very long (compared to compiling), so it would be sufficient to just stash a "golden" set of object files and run just the actual link step.
We're in the process of putting together a large internal buildbot farm that will include tests almost exactly like the ones Sean has listed.
I'd love for the lab to also be doing some LNT runs here.
I'd also suggest some benchmarking of small resulting executables, as per Sean's point about parallel_sort.
Alex
include/lld/Core/Parallel.h | ||
---|---|---|
199 | Ah, OK then. If you incrementally build up a set of addresses and need to emit them in sorted order at the end, there are trie-like data structures that allow for *extremely* dense encoding of sorted maps/sets (because key prefixes are shared) and constant-time insertion; I believe these data structures are also very amenable to lock-free variants (one common usage is kernel IP routing tables). They also have the advantage that they can be reconstituted into a sorted array of addresses in O(n) time. Specialized data structures like these might be something worth looking into as a memory optimization. |
- Add documentation
- Make ThreadPoolExecutor FILO
- Make threadCount unsigned.
- parallel_sort
- Base max depth on input size
- Tail execute one half of the sort.