Uses ConcRT and PPL on Windows.
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | You might want to add an else case with an assert. | |
69–70 | I think size_t is pretty big for a thread count. | |
74–81 | Does not handle the case when some of the threads can't be created. | |
74–82 | Nice! | |
80 | It might be good to add a name to the task, for tracing purposes. | |
98 | This could return an ErrorOr to indicate success/failure of this thread. | |
104–105 | You would need to unlock the mutex here. | |
109 | It would be nice to get the task's return status, check for success/failure, and set the stop field. |
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | There's nothing to assert here. | |
69–70 | I'll change it to the return type of hardware_concurrency (unsigned). | |
74–81 | We don't use exception handling. There's no way to detect this case. | |
80 | Which task? | |
98 | It can't fail. And what would look at the return value? | |
104–105 | unique_lock unlocks the mutex in the destructor. | |
109 | Task returns void. |
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | What I meant was: if the latch were decremented twice, you could catch that error. | |
80 | Any task that the thread pool would manage. For example, when you add tasks that read input files in parallel in lld, you could attach the name of the file so that you know when the thread starts processing the work. | |
98 | I thought the task scheduled to execute on this thread could fail, and the whole process would have to be stopped. For example: if you parsed all the files in lld in parallel and one of them was a bad file, you might want to return that failure immediately from the thread and stop everything else. | |
109 | OK. I was thinking that if tasks were made to return an ErrorOr/bool, you could get the task's return value and stop the other threads on failure. |
Michael, can you add more background on where you want to go with this? On darwin the preferred way to parallelize is to use libdispatch (aka GCD). In that model, you divide up your work into small chunks as functions/lambdas/blocks, and use serial or parallel queues to have the OS (libdispatch) run them all in the correct order as efficiently as possible (see http://libdispatch.macosforge.org/trac/wiki/tutorial for a good tutorial). For instance, this code might spin up hardware_concurrency() threads, but another process could be doing the same thing. Then you have two processes thrashing each other for CPU threads. With libdispatch you never say how many threads to use, and the OS balances across processes.
So, I'd like a design that allows lld parallelism to be layered on libdispatch for darwin.
include/lld/Core/Parallel.h | ||
---|---|---|
69–70 | The core count is probably not a good choice for the default, since you then risk getting blocked waiting on IO. Since LLD uses memory-mapped IO, there's no way to do it "async": the IO is incurred when you take a major fault (which is transparent to the program). So, for tasks that require IO, you really need as many threads as parallel IOs you are going to be doing. Another alternative is careful use of madvise(), but I don't think LLD is currently introspective enough about its access patterns to do that effectively at this point. For compute-intensive tasks, though, core count is probably fine. So I would discourage having this take a default. Only the client of the API really knows how much parallelism it needs, and it should specify it explicitly. | |
195–197 | Have you benchmarked this? I'm curious what the cutoff is before this starts beating std::sort. (especially libcxx's std::sort). I wouldn't be surprised if the cutoff is *very* large. | |
199 | This algorithm can go quadratic. You should fall back to std::sort if the depth becomes larger than logarithmic in the length of the range to be sorted to ensure n*log(n). Also, it would make sense to fall back to std::sort if the depth gets greater than log2(coreCount) to avoid increasing the parallelism beyond the number of cores. | |
214–220 | You can "tail call" into one of these. |
include/lld/Core/Parallel.h | ||
---|---|---|
48–51 | You mean below 0? Decrementing twice is perfectly valid. | |
80 | That's what the ScopedTask stuff is for. That overhead shouldn't be added by default. | |
98 | That should be handled at a higher level. ThreadPoolExecutor threads live until the process exits. | |
109 | An Executor is a low level interface for tasks. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3562.pdf for what it's based on. Also, we don't want other threads to stop when a task reports an error. We may be able to recover from it. All threads will stop before the process exits anyway. If a task needs to return a value, it should use std::packaged_task. If you find the need to use this often, it's trivial to make a wrapper for it. |
> So, I'd like a design that allows lld parallelism to be layered on libdispatch for darwin.
The executor model is designed to allow this. I did this for Windows by using ConcRT in the ConcRTExecutor class.
include/lld/Core/Parallel.h | ||
---|---|---|
69–70 | This is for the default Executor, there's really not a better choice to make. If a specific client finds that it needs more, it can create its own ThreadPoolExecutor and give some other thread count. Removing the default here would just move it over to getDefaultExecutor(). | |
195–197 | I benchmarked this with ParallelTest.cpp with an array taking up 1GiB (that's as high as MSVC will go, even in x64 mode :(). This was the best cutoff for that specific case. I expect it may vary widely depending on how expensive your comparator is. PPL actually takes this as a parameter to parallel_sort. | |
199 | Good idea on the max depth to use. As for the max task count, I found limiting the max splits to the number of cores to be slower, as chunks take differing times to complete. Smaller chunks balance better. | |
214–220 | yep. |
On a side note: now that it seems that LLD is starting to get into a phase of its life where it is being optimized, we *really* should have bots that test for performance regressions. Otherwise, we have no real baseline and can't tell whether things are actually getting faster or not.
We probably want at least:
- Release build of LLVM+Clang
- Debug build of LLVM+Clang
- Release build of Chromium (or some other gargantuan project)
- Debug build of Chromium
on linux, mac, windows. Especially now that system dependent parallel stuff is being added, it is important to have a common baseline for meaningful progress to be made (commits benefitting one platform may adversely affect another). It shouldn't take much computing power since the link step does not take very long (compared to compiling), so it would be sufficient to just stash a "golden" set of object files and run just the actual link step.
We're in the process of putting together a large internal buildbot farm that will include tests almost exactly like the ones Sean has listed.
I'd love for the lab to also be doing some LNT runs here.
I'd also suggest some benchmarking of small resulting executables, as per Sean's point about parallel_sort.
Alex
include/lld/Core/Parallel.h | ||
---|---|---|
199 | Ah, OK then. If you incrementally build up a set of addresses and need to emit them in sorted order at the end, there are trie-like data structures that allow for *extremely* dense encoding of sorted maps/sets (because key prefixes are shared) and constant-time insertion; I believe these data structures are also very amenable to lock-free variants (one common usage is kernel IP routing tables). They also have the advantage that they can be reconstituted into a sorted array of addresses in O(n) time. Specialized data structures like these might be something worth looking into as a memory optimization. |
- Add documentation
- Make ThreadPoolExecutor FILO
- Make threadCount unsigned.
- parallel_sort
- Base max depth on input size
- Tail execute one half of the sort.