[lld-macho] Speed up markLive()
From the trace report, it's one of the most substantial pieces (only after load-input and write-output) and it seems like an easy win here.
Changes:
 - parallelize scanning symtab, and sections.
 - also add additional timescope
Stats for 10 runs (on Mac, 2.4 GHz 8-Core Intel Core i9 | Memory: 32 GB 2667 MHz DDR4)
    N           Min           Max        Median           Avg        Stddev
x  10         19.77         23.35         21.51        21.569     1.1866053
+  10          19.9         20.09            20        19.999   0.062795966
Difference at 95.0% confidence
        -1.57 +/- 0.789477
        -7.27897% +/- 3.66024%
        (Student's t, pooled s = 0.840231)
Differential Revision: https://reviews.llvm.org/D110018Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| lld/MachO/InputSection.h | ||
|---|---|---|
| 101 ↗ | (On Diff #373398) | Why do we need to define our own copy constructor? Isn't the compiler doing that for us already? | 
| 118 ↗ | (On Diff #373398) | Shouldn't this method also have the mutex as well? | 
| lld/MachO/MarkLive.cpp | ||
| 72 | I think we might have a race condition here. If two different symbols pointing to the same isec and off happen to enter this function, we could end up writing the same value twice to worklist. It feels like the lock should be at the enqueue scope instead of the list here. Doing so, I believe we can also remove all the locks on the isLive and markLive methods. | |
| lld/MachO/InputSection.h | ||
|---|---|---|
| 101 ↗ | (On Diff #373398) | because the mutex makes it non-trivially copyable(ie., the default copy ctor won't work). | 
| 118 ↗ | (On Diff #373398) | yes - missed this one | 
| lld/MachO/MarkLive.cpp | ||
| 72 | No, if the lock were to be put on the whole enqueue() then the whole thing would essentially be single-threaded (because all different sections would have to wait for the same lock). We should minimise the locked region here. | |
I'm a bit ambivalent about this change. The typical way that parallel mark-sweep is implemented is via a work-stealing queue. That way we have one lock per thread, instead of all threads contending on one lock. Granted this diff is already an improvement, but perhaps it would be best to implement the optimal solution right off the bat...
| lld/MachO/InputFiles.cpp | ||
|---|---|---|
| 278 ↗ | (On Diff #373398) | why this change? | 
| lld/MachO/InputSection.h | ||
| 120 ↗ | (On Diff #373398) | I think this lock isn't necessary... multiple concurrent calls to this method will still end up setting live to true | 
| 259–260 ↗ | (On Diff #373398) | atomics might perform better here, assuming low contention | 
I was hoping for a easy fix here rather than an overhaul. But yeah, I guess it doesn't make a lot of sense to leave it half-done ...
Taking this off review queue to implement a proper concurrent mark-sweep
| lld/MachO/InputFiles.cpp | ||
|---|---|---|
| 278 ↗ | (On Diff #373398) | ConcatInputSection is no longer trivially copy-able. | 
Updated diff:
- add a simple WorkStealingQueue based on existing implementation + tests
- first pass at making markLive using the queue.
Haven't taken a proper look yet, but could you fix the lints first to make it easier to read? :)
Also, that was pretty fast... I guess you've implemented something similar before?
Done - sorry, hard to switch back and forth with different naming style
Also, that was pretty fast... I guess you've implemented something similar before?
The queue is mostly from existing code :)
The mark-and-sweep is based on code I wrote ~3 years ago ... (which is to say, there might be bugs ...)
| lld/MachO/MarkLive.cpp | ||
|---|---|---|
| 33–41 | hardware_concurrency() seems like a good initializer | |
| 43–63 | hm, this seems more like a Worker or Executor to me, each of which executes a number of tasks or jobs... | |
| 49 | ||
| 56–57 | if start and end denoted half-open (instead of closed) intervals, would we still need skip? also, how do you feel about creating an array of ArrayRefs here, instead of working with raw indices? That uses a few extra ints, but given that the number of threads in the pool is small, that shouldn't be an issue. | |
| 96 | would prefer a more descriptive name... run / execute? | |
| 116–139 | I'm kind of suspicious of this. It's marking sections as live based on the liveness of other sections, which are concurrently being marked as live. Depending on the order of writes, it may incorrectly conclude that a certain live section X is not live, and therefore not mark the sections that X points to as live. | |
| 165 | work is usually a noun, not a verb... maybe this could also be run() (or have run above take a default argument) | |
| 167 | can we have a more descriptive name, like groupSize? | |
| 169 | ||
| 190–191 | i'm kind of confused here. why are we comparing start to idx -- isn't start an index into the size array of jobs, whereas idx is the index of the Task vector of size poolSize? seems like we're comparing numbers on different scales... same thing for idx and size | |
| 320 | I think the "no" makes it clearer :) | |
| lld/MachO/WorkStealingQueue.h | ||
| 39 | the rest of the codebase doesn't use the _ suffix; I think we can just use capacity here and have the getter be getCapacity() | |
| lld/unittests/MachOUnitTests/CMakeLists.txt | ||
| 3 | ||
| lld/MachO/MarkLive.cpp | ||
|---|---|---|
| 116–139 | Good point! (kind of surprised not a lot of tests were failing ...) | |
| 190–191 | "idx" is the index of the Task vector, yes but it correlates to where its work items are: So the division of labour is like this: Task[0] : start=0, end = start + d -1 = d -1 Task[1] : start=d, end = d + d - 1 Task[2]: start = d + d, end = d + d + d -1 .... Task[n]: start = n * d, end = start + d -1. | |
| lld/MachO/MarkLive.cpp | ||
|---|---|---|
| 116–139 | Race conditions often don't show up on small inputs :) I would recommend running dead_strip on a large program before and after this change, and checking that the outputs are identical. I think the easiest way to fix this is to create a mapping of sections to their live support sections (basically reversing the pointer), and then have addSect visit the live support sections. | |
| 190–191 | right but... what does start < idx mean here, and when do we expect it to be true? shouldn't it never be true, since n >= n * d for all nonnegative n and d? | |
| lld/MachO/MarkLive.cpp | ||
|---|---|---|
| 190–191 | Ah- I misunderstood what you were confusing about. 
 d could be zero, which is non-negative :) ie., when we have fewer tasks than the number of workers, then you dont need to use all of the workers, when that happens, you just need the first [0, x) workers, and give each of them 1 task. Similarly, end >= size is true when start== d == 0 (because its type is size_t , hence the -1 will ensure it). (but I guess you're right that this is written in a much more convoluted way than it should be ... will fix) | |
addressed review comments:
- comestic issues
- potential race-condition w.r.t handling of live-support
How does the work-stealing queue compare with oneTBB tbb::parallel_for_each with a feeder?
| lld/MachO/MarkLive.cpp | ||
|---|---|---|
| 230 | open-ended intervals are usually easy to work with and less error-prone | |
| 251 | ||
| lld/MachO/WorkStealingQueue.h | ||
| 76 | I find that SmallVector<X, 0> is usually more efficient and compiles to less code. | |
| 82 | Why is the .inc split from the .h file? | |
@int3 Do you also have a plan to implement a work-stealing queue? It can be in lld/Common so that lld/ELF and lld/COFF can use it as well...
I'd asked @oontvoo to hold off on this until we'd stabilized the dead_strip code. But yeah, it would make sense for whatever work-stealing queue we implement to go under Common :)
hardware_concurrency() seems like a good initializer