This is an archive of the discontinued LLVM Phabricator instance.

[lld-macho] Speed up markLive()
Needs ReviewPublic

Authored by oontvoo on Sep 17 2021, 11:09 PM.

Details

Reviewers
gkm
smeenai
Group Reviewers
Restricted Project
Summary
[lld-macho] Speed up markLive()

From the trace report, it's one of the most substantial pieces (only after load-input and write-output) and it seems like an easy win here.

Changes:
 - parallelize scanning symtab, and sections.
 - also add additional timescope

Stats for 10 runs (on Mac, 2.4 GHz 8-Core Intel Core i9 | Memory: 32 GB 2667 MHz DDR4)

    N           Min           Max        Median           Avg        Stddev
x  10         19.77         23.35         21.51        21.569     1.1866053
+  10          19.9         20.09            20        19.999   0.062795966
Difference at 95.0% confidence
        -1.57 +/- 0.789477
        -7.27897% +/- 3.66024%
        (Student's t, pooled s = 0.840231)

Differential Revision: https://reviews.llvm.org/D110018

Diff Detail

Event Timeline

oontvoo created this revision.Sep 17 2021, 11:09 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript
oontvoo requested review of this revision.Sep 17 2021, 11:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 17 2021, 11:09 PM
oontvoo edited the summary of this revision. (Show Details)Sep 17 2021, 11:10 PM
thevinster added inline comments.
lld/MachO/InputSection.h
101 ↗(On Diff #373398)

Why do we need to define our own copy constructor? Isn't the compiler doing that for us already?

118 ↗(On Diff #373398)

Shouldn't this method also have the mutex as well?

lld/MachO/MarkLive.cpp
56

I think we might have a race condition here. If two different symbols pointing to the same isec and off happen to enter this function, we could end up writing the same value twice to worklist. It feels like the lock should be at the enqueue scope instead of the list here.

Doing so, I believe we can also remove all the locks on the isLive and markLive methods.

oontvoo marked 2 inline comments as done.Sep 20 2021, 9:33 AM
oontvoo added inline comments.
lld/MachO/InputSection.h
101 ↗(On Diff #373398)

because the mutex makes it non-trivially copyable(ie., the default copy ctor won't work).
(This reminds me I should change this to a proper copy ctor :) )

118 ↗(On Diff #373398)

yes - missed this one

lld/MachO/MarkLive.cpp
56

No, if the lock were to be put on the whole enqueue() then the whole thing would essentially be single-threaded (because all different sections would have to wait for the same lock).

We should minimise the locked region here.
And yes - this should have checked isLive() again before pushing into the queue

int3 added a subscriber: int3.Sep 20 2021, 9:41 AM

I'm a bit ambivalent about this change. The typical way that parallel mark-sweep is implemented is via a work-stealing queue. That way we have one lock per thread, instead of all threads contending on one lock. Granted this diff is already an improvement, but perhaps it would be best to implement the optimal solution right off the bat...

lld/MachO/InputFiles.cpp
278 ↗(On Diff #373398)

why this change?

lld/MachO/InputSection.h
120 ↗(On Diff #373398)

I think this lock isn't necessary... multiple concurrent calls to this method will still end up setting live to true

259–260 ↗(On Diff #373398)

atomics might perform better here, assuming low contention

oontvoo planned changes to this revision.Sep 22 2021, 10:36 AM
oontvoo marked 3 inline comments as done.

I'm a bit ambivalent about this change. The typical way that parallel mark-sweep is implemented is via a work-stealing queue. That way we have one lock per thread, instead of all threads contending on one lock. Granted this diff is already an improvement, but perhaps it would be best to implement the optimal solution right off the bat...

I was hoping for a easy fix here rather than an overhaul. But yeah, I guess it doesn't make a lot of sense to leave it half-done ...

Taking this off review queue to implement a proper concurrent mark-sweep

lld/MachO/InputFiles.cpp
278 ↗(On Diff #373398)

ConcatInputSection is no longer trivially copy-able.

oontvoo updated this revision to Diff 374438.Sep 22 2021, 9:57 PM
oontvoo marked an inline comment as done.

Updated diff:

  • add a simple WorkStealingQueue based on existing implementation + tests
  • first pass at making markLive using the queue.
oontvoo edited the summary of this revision. (Show Details)Sep 22 2021, 9:57 PM
oontvoo updated this revision to Diff 374439.Sep 22 2021, 9:59 PM
oontvoo edited the summary of this revision. (Show Details)

typo in makefile

int3 added a comment.Sep 27 2021, 4:56 AM

Haven't taken a proper look yet, but could you fix the lints first to make it easier to read? :)

Also, that was pretty fast... I guess you've implemented something similar before?

oontvoo updated this revision to Diff 375768.Sep 28 2021, 7:59 PM

fixed lint issues

Haven't taken a proper look yet, but could you fix the lints first to make it easier to read? :)

Done - sorry, hard to switch back and forth with different naming style

Also, that was pretty fast... I guess you've implemented something similar before?

The queue is mostly from existing code :)
The mark-and-sweep is based on code I wrote ~3 years ago ... (which is to say, there might be bugs ...)

oontvoo updated this revision to Diff 379919.Oct 14 2021, 10:47 PM

fixed more lints

int3 added inline comments.Oct 18 2021, 8:22 PM
lld/MachO/MarkLive.cpp
27–28

hardware_concurrency() seems like a good initializer

30–51

hm, this seems more like a Worker or Executor to me, each of which executes a number of tasks or jobs...

36
43–44

if start and end denoted half-open (instead of closed) intervals, would we still need skip?

also, how do you feel about creating an array of ArrayRefs here, instead of working with raw indices? That uses a few extra ints, but given that the number of threads in the pool is small, that shouldn't be an issue.

75

would prefer a more descriptive name... run / execute?

95–118

I'm kind of suspicious of this. It's marking sections as live based on the liveness of other sections, which are concurrently being marked as live. Depending on the order of writes, it may incorrectly conclude that a certain live section X is not live, and therefore not mark the sections that X points to as live.

144

work is usually a noun, not a verb... maybe this could also be run() (or have run above take a default argument)

146

can we have a more descriptive name, like groupSize?

148
169–170

i'm kind of confused here. why are we comparing start to idx -- isn't start an index into the size array of jobs, whereas idx is the index of the Task vector of size poolSize? seems like we're comparing numbers on different scales...

same thing for idx and size

264

I think the "no" makes it clearer :)

lld/MachO/WorkStealingQueue.h
39

the rest of the codebase doesn't use the _ suffix; I think we can just use capacity here and have the getter be getCapacity()

lld/unittests/MachOUnitTests/CMakeLists.txt
3
oontvoo marked an inline comment as done.Oct 20 2021, 2:04 PM
oontvoo added inline comments.
lld/MachO/MarkLive.cpp
95–118

Good point! (kind of surprised not a lot of tests were failing ...)

169–170

"idx" is the index of the Task vector, yes but it correlates to where its work items are:

So the division of labour is like this:

Task[0] : start=0, end = start + d -1 = d -1 
Task[1] : start=d, end = d + d - 1
Task[2]: start = d + d, end = d + d + d -1
....
Task[n]: start = n * d, end = start + d -1.
oontvoo planned changes to this revision.Oct 20 2021, 2:05 PM
oontvoo marked an inline comment as done.

addressing comments

int3 added inline comments.Oct 20 2021, 4:42 PM
lld/MachO/MarkLive.cpp
95–118

Race conditions often don't show up on small inputs :) I would recommend running dead_strip on a large program before and after this change, and checking that the outputs are identical.

I think the easiest way to fix this is to create a mapping of sections to their live support sections (basically reversing the pointer), and then have addSect visit the live support sections.

169–170

right but... what does start < idx mean here, and when do we expect it to be true? shouldn't it never be true, since n >= n * d for all nonnegative n and d?

oontvoo added inline comments.Oct 20 2021, 5:17 PM
lld/MachO/MarkLive.cpp
169–170

Ah- I misunderstood what you were confusing about.

shouldn't it never be true, since n >= n * d for all nonnegative n and d

d could be zero, which is non-negative :)

ie., when we have fewer tasks than the number of workers, then you dont need to use all of the workers, when that happens, you just need the first [0, x) workers, and give each of them 1 task.

Similarly, end >= size is true when start== d == 0 (because its type is size_t , hence the -1 will ensure it).

(but I guess you're right that this is written in a much more convoluted way than it should be ... will fix)

oontvoo updated this revision to Diff 381816.Oct 24 2021, 6:30 PM
oontvoo marked 11 inline comments as done.

addressed review comments:

  • comestic issues
  • potential race-condition w.r.t handling of live-support
MaskRay added a subscriber: MaskRay.EditedMar 1 2022, 11:07 PM

How does the work-stealing queue compare with oneTBB tbb::parallel_for_each with a feeder?

lld/MachO/MarkLive.cpp
212

open-ended intervals are usually easy to work with and less error-prone

233
lld/MachO/WorkStealingQueue.h
77

I find that SmallVector<X, 0> is usually more efficient and compiles to less code.

83

Why is the .inc split from the .h file?

Herald added a project: Restricted Project. · View Herald TranscriptMar 1 2022, 11:07 PM

@int3 Do you also have a plan to implement a work-stealing queue? It can be in lld/Common so that lld/ELF and lld/COFF can use it as well...

int3 added a comment.Mar 2 2022, 12:17 PM

I'd asked @oontvoo to hold off on this until we'd stabilized the dead_strip code. But yeah, it would make sense for whatever work-stealing queue we implement to go under Common :)

smeenai resigned from this revision.Apr 28 2022, 1:55 PM