This is an archive of the discontinued LLVM Phabricator instance.

[Core] Add parallel infrastructure to lld.
ClosedPublic

Authored by Bigcheese on Apr 9 2013, 3:24 PM.

Details

Summary

Uses ConcRT and PPL on Windows.

Diff Detail

Event Timeline

shankarke added inline comments.Apr 9 2013, 4:06 PM
include/lld/Core/Parallel.h
48–51

You might want to add an else case to assert, probably.

69–70

I think size_t is pretty big for a thread count.

74–81

Doesn't handle the case when some of the threads can't be created.

74–82

Nice!

80

It might be good to add a name to the task, for tracing purposes.

98

This could return an ErrorOr to indicate success/failure of this thread.

104–105

You would need to unlock the mutex here.

109

It would be nice to get the task's return status, check for success/failure, and set the stop field.

Bigcheese added inline comments.Apr 9 2013, 4:35 PM
include/lld/Core/Parallel.h
48–51

There's nothing to assert here.

69–70

I'll change it to the return type of hardware_concurrency (unsigned).

74–81

We don't use exception handling. There's no way to detect this case.

80

Which task?

98

It can't fail. And what would look at the return value?

104–105

unique_lock unlocks the mutex in the destructor.

109

Task returns void.

shankarke added inline comments.Apr 9 2013, 4:45 PM
include/lld/Core/Parallel.h
48–51

What I meant was: if the latch was decremented twice, you could catch that error.

80

Any task that the thread pool would manage. For example, when you add tasks that read input files in parallel in lld, you could attach the name of the file, so that you know when the thread starts processing the work.

98

I thought the task scheduled to execute on this thread could fail, and the whole process would have to be stopped.

For example: if lld parsed all the files in parallel and one of them was bad, you might want to return that failure immediately from the thread and stop everything else.

109

Ok. I was thinking that if tasks were made to return an ErrorOr/bool, you could get the task's return value and stop other threads on failure.

Michael, can you add more background on where you want to go with this? On Darwin the preferred way to parallelize is to use libdispatch (aka GCD). In that model, you divide your work into small chunks as functions/lambdas/blocks and use serial or parallel queues to have the OS (libdispatch) run them all in the correct order as efficiently as possible (see http://libdispatch.macosforge.org/trac/wiki/tutorial for a good tutorial). For instance, this code might spin up hardware_concurrency() threads, but another process could be doing the same thing. Then you have two processes thrashing each other for CPU threads. With libdispatch you never say how many threads to use, and the OS balances across processes.

So, I'd like a design that allows lld parallelism to be layered on libdispatch for darwin.

silvas added inline comments.Apr 9 2013, 5:16 PM
include/lld/Core/Parallel.h
69–70

The core count is probably not a good choice for the default, since then you risk getting blocked waiting on IO.

Since LLD uses memory-mapped IO, there's no way to do it "async" since the IO will be incurred when you take a major fault (which is transparent to the program). So, for tasks that require IO, you really need as many threads as parallel IO's that you are going to be doing. Another alternative is careful use of madvise(), but I don't think LLD is currently introspective enough about its access patterns to be able to effectively do that at this point.

For compute-intensive tasks though, core count is probably fine. So I would discourage having this take a default. Only the client of the API really knows how much parallelism they need, and they should specify it explicitly.

195–197

Have you benchmarked this? I'm curious what the cutoff is before this starts beating std::sort (especially libcxx's std::sort). I wouldn't be surprised if the cutoff is *very* large.

199

This algorithm can go quadratic. You should fall back to std::sort if the depth becomes larger than logarithmic in the length of the range to be sorted, to ensure O(n log n). Also, it would make sense to fall back to std::sort if the depth gets greater than log2(coreCount), to avoid increasing the parallelism beyond the number of cores.

214–220

You can "tail call" into one of these.

Bigcheese added inline comments.Apr 9 2013, 5:17 PM
include/lld/Core/Parallel.h
48–51

You mean below 0? decrementing twice is perfectly valid.

80

That's what the ScopedTask stuff is for. That overhead shouldn't be added by default.

98

That should be handled at a higher level. ThreadPoolExecutor threads live until the process exits.

109

An Executor is a low level interface for tasks. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3562.pdf for what it's based on.

Also, we don't want other threads to stop when a task reports an error. We may be able to recover from it. All threads will stop before the process exits anyway.

If a task needs to return a value, it should use std::packaged_task. If you find the need to use this often, it's trivial to make a wrapper for it.

Michael, can you add more background on where you want to go with this? On Darwin the preferred way to parallelize is to use libdispatch (aka GCD). In that model, you divide your work into small chunks as functions/lambdas/blocks and use serial or parallel queues to have the OS (libdispatch) run them all in the correct order as efficiently as possible (see http://libdispatch.macosforge.org/trac/wiki/tutorial for a good tutorial). For instance, this code might spin up hardware_concurrency() threads, but another process could be doing the same thing. Then you have two processes thrashing each other for CPU threads. With libdispatch you never say how many threads to use, and the OS balances across processes.

So, I'd like a design that allows lld parallelism to be layered on libdispatch for darwin.

The executor model is designed to allow this. I did this for Windows by using ConcRT in the ConcRTExecutor class.

Bigcheese added inline comments.Apr 9 2013, 5:38 PM
include/lld/Core/Parallel.h
69–70

This is for the default Executor, there's really not a better choice to make. If a specific client finds that it needs more, it can create its own ThreadPoolExecutor and give some other thread count. Removing the default here would just move it over to getDefaultExecutor().

195–197

I benchmarked this with ParallelTest.cpp, with the array taking up 1 GiB (that's as high as MSVC will go, even in x64 mode :(). This was the best cutoff for that specific case. I expect that it may vary widely depending on how expensive your comparator is. PPL actually takes this as a parameter to parallel_sort.

199

Good idea on the max depth to use. As for the max task count, I found limiting the max splits to the number of cores to be slower, as chunks take differing times to complete. Smaller chunks balance better.

214–220

yep.

silvas added a comment.Apr 9 2013, 5:54 PM

On a side note: now that it seems that LLD is starting to get into a phase of its life where it is being optimized, we *really* should have bots that test for performance regressions. Otherwise, we have no real baseline and can't tell whether things are actually getting faster or not.

We probably want at least:

  • Release build of LLVM+Clang
  • Debug build of LLVM+Clang
  • Release build of Chromium (or some other gargantuan project)
  • Debug build of Chromium

on Linux, Mac, and Windows. Especially now that system-dependent parallel stuff is being added, it is important to have a common baseline for meaningful progress to be made (commits benefiting one platform may adversely affect another). It shouldn't take much computing power since the link step does not take very long (compared to compiling), so it would be sufficient to just stash a "golden" set of object files and run just the actual link step.

alexr added a comment.Apr 9 2013, 7:03 PM

We're in the process of putting together a large internal buildbot farm that will include tests almost exactly like the ones Sean has listed.

I'd love for the lab to also be doing some LNT runs here.

I'd also suggest some benchmarking of small resulting executables, as per Sean's point about parallel_sort.

Alex

silvas added inline comments.Apr 9 2013, 7:41 PM
include/lld/Core/Parallel.h
199

As for the max task count, I found that limiting the max splits to the number of cores to be slower, as chunks take differing times to complete. Smaller chunks balance better.

Ah, ok then.

If you incrementally build up a set of addresses and need to in the end emit them in sorted order, there are trie-like data structures that allow for *extremely* dense encoding of sorted maps/sets (because key prefixes are shared) and constant-time insertion; I believe that these data structures are also very amenable to lock-free variants (one common usage is for kernel IP routing tables). They also have the advantage that they can be reconstituted into a sorted array of addresses in O(n) time. Specialized data structures like these might be something worth looking into as a memory optimization.

Bigcheese updated this revision.Apr 10 2013, 1:34 PM
  • Add documentation
  • Make ThreadPoolExecutor FILO
  • Make threadCount unsigned.
  • parallel_sort
    • Base max depth on input size
    • Tail execute one half of the sort.
shankarke added inline comments.Apr 10 2013, 1:59 PM
include/lld/Core/Parallel.h
76

biref -> brief

126

Do you want to call it workStack instead?

Bigcheese added inline comments.Apr 10 2013, 2:24 PM
include/lld/Core/Parallel.h
76

Fixed.

126

Sure.

Bigcheese accepted this revision.May 24 2013, 4:46 PM
Bigcheese closed this revision.

Committed.