
[lld-macho] Make writing map file asynchronous
ClosedPublic

Authored by thevinster on Dec 8 2021, 6:26 PM.

Details

Reviewers
oontvoo
int3
Group Reviewers
Restricted Project
Commits
rGd17b092fe690: [lld-macho] Make writing map file asynchronous
Summary

For large applications that write map files, doing so can take quite a bit of
time. Sorted by contribution to link time, writing the map file ranks second,
behind only loading input files. Moving map-file writing off the critical path
(onto its own thread) saves ~2-3 seconds when linking chromium framework on a
16-core Intel Xeon W.

           base            diff            difference (95% CI)
sys_time   1.617 ± 0.034   1.657 ± 0.026   [  +1.5% ..   +3.5%]
user_time  28.536 ± 0.245  28.609 ± 0.180  [  -0.1% ..   +0.7%]
wall_time  23.833 ± 0.271  21.684 ± 0.194  [  -9.5% ..   -8.5%]
samples    31              24

Diff Detail

Event Timeline

thevinster published this revision for review.Dec 8 2021, 6:33 PM
Herald added a project: Restricted Project. Dec 8 2021, 6:33 PM
oontvoo added a subscriber: oontvoo.Dec 8 2021, 6:39 PM
oontvoo added inline comments.
lld/MachO/Writer.cpp
1163–1164
This revision is now accepted and ready to land.Dec 8 2021, 6:39 PM
int3 requested changes to this revision.Dec 8 2021, 8:16 PM
int3 added a subscriber: int3.

did you forget to add the implementation of writeMapAsync?

lld/MachO/Writer.cpp
1163–1164

yeah, I don't think the map file contains anything from LinkEdit, but just to be safe I would move it after everything's finalized. I'm assuming writeOutputFile takes longer than writeMapAsync anyway

This revision now requires changes to proceed.Dec 8 2021, 8:16 PM

did you forget to add the implementation of writeMapAsync?

Isn't writeMapAsync on line 1151 a variable? :) (not a function call)

lld/MachO/Writer.cpp
1163–1164

PS: actually, don't you want std::future<void> writeMapAsync (std::async(std::launch::async, []{ writeMapFile(); })); ?

int3 added a comment.Dec 8 2021, 8:37 PM

Isn't line 1151, writeMapAsync a variable? :) (not a func call)

d'oh, yes, brain fart there...

lld/MachO/Writer.cpp
1163–1164

yup, based on https://en.cppreference.com/w/cpp/thread/async I think @oontvoo 's right.

Also, that page says

If the std::future obtained from std::async is not moved from or bound to a reference, the destructor of the std::future will block at the end of the full expression until the asynchronous operation completes

so I don't think the get() call below is necessary?

I'm assuming writeOutputFile takes longer than writeMapAsync anyway

Profiling on my end shows that writeMapFile actually takes longer than writeOutputFile and in some cases longer than finalizeLinkEditSegment. I personally believe we'd get more wins if we did writeMapFile before finalizeLinkEditSegment. But if we plan on having writeMapFile depend on contents of finalizeLinkEditSegment in the future, I'm happy to move it after finalizeLinkEditSegment.

dont you want std::future<void> writeMapAsync (std::async(std::launch::async, []{ writeMapFile(); })); ?

By not providing the policy, we let the implementation decide whether the task should run asynchronously or deferred. The async policy is preferred, but the implementation will choose deferred if the system is running out of memory and can't spawn another thread, in an effort not to hard-crash. I think we can agree it's better not to crash, even if that means not running in parallel? Also, the lambda isn't necessary at all: it works without the lambda on my end, and the reference docs don't say the callable must be wrapped in one.
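To make the trade-off concrete, here's a minimal, self-contained sketch of the pattern under discussion (the writeMapFile stand-in and function names are illustrative, not lld's actual code): default-policy std::async with a plain function pointer, where the trailing .get() both joins and, if the implementation chose the deferred policy, forces the task to actually run.

```cpp
#include <atomic>
#include <cassert>
#include <future>

std::atomic<bool> mapWritten{false};
// Stand-in for lld's writeMapFile; hypothetical, for illustration only.
void writeMapFile() { mapWritten = true; }

void linkWithAsyncMapFile() {
  // Default policy (async | deferred): the implementation picks. No lambda
  // is needed; a plain function pointer is a valid callable for std::async.
  std::future<void> writeMapFuture = std::async(writeMapFile);
  // ... critical-path work (finalize segments, write the output file) ...
  // .get() joins the task and, if the deferred policy was selected, forces
  // it to run -- which is why keeping this call is the safe choice.
  writeMapFuture.get();
}
```

Note that if the future were discarded instead of bound to a variable, its destructor would block at the end of the full expression, serializing everything.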

so I don't think the get() call below is necessary?

In the majority of cases it's probably not necessary, but I think it's better to be safe than sorry. If the implementation changes behind the scenes and starts producing undefined behavior, it might not be obvious where crashes originate from, right?

int3 added a comment.EditedDec 8 2021, 9:14 PM

Profiling on my end shows that writeMapFile actually takes longer than writeOutputFile and in some cases longer than finalizeLinkEditSegment.

Oh interesting... well we can stick with this order then

The async policy is preferred, but the system will choose deferred if the system is running out of memory and can't spawn another thread in efforts to not hard crash.

Do you have a source for this? Because that's not what cppreference says... it just says

If both the std::launch::async and std::launch::deferred flags are set in policy, it is up to the implementation whether to perform asynchronous execution or lazy evaluation.

I agree the lambda isn't necessary.

If the implementation changes behind the scenes for us

cppreference describes what the standard indicates implementations should do, so this shouldn't be a concern

oontvoo added a comment.EditedDec 9 2021, 7:42 AM


Ok so looks like there are two options:

  • if we don't specify the policy explicitly, then we need the .get() call to ensure the deferred task actually runs.
  • if we do specify std::launch::async explicitly, then there's no need to call .get().

(What's worse is that all of this is implementation-defined, which is to say, there's no guarantee it won't change.)

I guess there are no great options here, but the current one (i.e., the first option) is fine. Can we at least edit the comment to say more clearly that since the policy is not specified, the task may be lazy or async, and therefore we need to call .get() to make sure it's evaluated?

(Same question as for int3, though: where does it say the implementation will pick the best policy based on thread/memory availability?)

P.S.: Another suggestion (since std::future generally isn't great, and I've heard WG21 isn't using it in any future designs): maybe just use LLVM's parallel API?
(i.e., use parallel ...() to run writeMapFile() in parallel with { finalizeLinkEditSegment(); writeOutputFile(); })
This could be better than mixing multiple threading APIs.

Do you have a source for this?

Over at https://www.cplusplus.com/reference/future/async/ under automatic selection, it says: Automatic: The function chooses the policy automatically (at some point). This depends on the system and library implementation, which generally optimizes for the current availability of concurrency in the system. I think this means the system will be smart enough to know which one to use based on its available resources.

Also in the cppreference you have linked, there's a section on Exceptions that says Throws std::system_error with error condition std::errc::resource_unavailable_try_again if the launch policy equals std::launch::async and the implementation is unable to start a new thread (if the policy is async|deferred or has additional bits set, it will fall back to deferred or the implementation-defined policies in this case).
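That exception clause can be made explicit in code. Below is a hedged sketch (the helper name launchTask is illustrative, not from the patch) of requesting std::launch::async and falling back to deferred execution when the implementation can't start a new thread, which mirrors the fallback the quoted passage describes for the combined policy:

```cpp
#include <cassert>
#include <future>
#include <system_error>

int tasksRun = 0;
void countTask() { ++tasksRun; } // trivial task used in the usage example

// Prefer a dedicated thread; if no thread can be spawned, run the task
// lazily on the first wait()/get() instead of crashing.
std::future<void> launchTask(void (*task)()) {
  try {
    return std::async(std::launch::async, task);
  } catch (const std::system_error &) {
    // std::errc::resource_unavailable_try_again: thread creation failed.
    return std::async(std::launch::deferred, task);
  }
}
```

With the combined async | deferred policy, the library performs this fallback internally; spelling it out only buys you control over (and visibility into) which branch was taken.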

use the parallel ...() to run writeMapFile() in parallel with { finalizeLinkEditSegment(); writeOutputFile(); }

This was my initial thought. I abandoned this approach because 1/ I didn't quite know the right syntax to get this particular thing working and 2/ we aren't really trying to run writeMapFile in parallel with *only* finalizeLinkEditSegment or writeOutputFile. It's more about taking things off the critical path. Say finalizeLinkEditSegment gets its own thread (for whatever reason); then things get tricky, since we'd have a main thread, a writeMapFile thread, and a finalizeLinkEditSegment thread. With async it's easy, but what does that look like with parallel?


(see suggested edit inline)

I'm not sure I understand how running writeMapFile() in its own thread (in parallel with the other two functions) is any different here... Either way, the "main" thread still has to block until all three are finished before it can return from this.

So the premise here is that writeMap is slow, hence we want to do something else in parallel with it rather than waiting for it, yes?

lld/MachO/Writer.cpp
1162–1166

perhaps something like this ^ ?

I'll admit your suggestion is clever, but I think it sacrifices readability. How would something like this be modeled in the parallelForEach statement?

auto f1 = std::async(func1); // bind the futures: a discarded temporary's
func2();                     // destructor would block immediately
auto f3 = std::async(func3);
func4();

With parallelForEach, it sounds like we'll need to nest things, because parallelForEach isn't meant to model concurrency; it's more about data parallelism.

int3 added a comment.Dec 9 2021, 9:18 PM

Ah, thanks for the sources @thevinster! I was actually gonna agree that async | deferred seemed like the best choice, but then I found this: https://eli.thegreenplace.net/2016/the-promises-and-challenges-of-stdasync-task-based-parallelism-in-c11/

It makes the argument that the default policy should "never" be used because of all the potential pitfalls. I don't think any of those pitfalls apply to our relatively simple use case, but maybe it's nice to just use the async policy regardless so we don't have to think about that? But in any case...

I agree with @oontvoo that using raw std::async is not ideal since it's semi-deprecated. I'll also point out (belatedly) that it's not great to be running an async job in parallel with a call to parallelForEachN. The latter spawns jobs based on the number of cores available because it assumes that it is the only thing spawning jobs. Ideally, we would be spawning jobs from a single thread pool instead.

Parallel.h is designed for simple fork-join patterns where the forking and joining all happen at a single point, but that's not really what we're doing here. ThreadPool.h seems to be a better fit for what we are trying to do; it exposes a very similar API to std::async but I think without its pitfalls. If I'm reading the implementation correctly, Pool.async() always launches the jobs immediately if there is a thread available, and always launches them in a separate thread even if they get deferred. And it's being used throughout LLVM's codebase, so we should probably use it too :)

Here's an example we could draw on: https://github.com/llvm/llvm-project/blob/be0a7e9f27083ada6072fcc0711ffa5630daa5ec/mlir/include/mlir/IR/Threading.h#L76. I think we'd create one ThreadPool, use it to dispatch writeMapFile, then pass the same instance to finalizeLinkEditSegment, which dispatches more jobs with it (but only waits on those jobs it dispatched, and not the writeMapFile job). Also the ThreadPool dtor is blocking, so we don't actually need to wait on the writeMapFile job.
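The key property int3 leans on (a pool whose destructor joins all outstanding jobs, so the map-file job never needs an explicit wait) can be illustrated with a toy, standard-library-only analog. This is not llvm::ThreadPool; TinyPool and the stand-in functions below are hypothetical, and a real pool would reuse a fixed set of worker threads rather than spawning one per job:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy analog of llvm::ThreadPool's blocking destructor: every job handed to
// async() is joined when the pool goes out of scope.
struct TinyPool {
  std::vector<std::thread> workers;
  template <class F> void async(F f) { workers.emplace_back(std::move(f)); }
  ~TinyPool() {
    for (auto &t : workers)
      t.join(); // blocks, like ThreadPool's destructor
  }
};

std::atomic<bool> mapWritten{false};
void writeMapFile() { mapWritten = true; } // stand-in for lld's function
void finalizeLinkEditSegment(TinyPool &) {} // could dispatch its own jobs
void writeOutputFile() {}

void runWriter() {
  TinyPool pool;
  pool.async(writeMapFile);      // off the critical path
  finalizeLinkEditSegment(pool); // shares the same pool, waits only on its own jobs
  writeOutputFile();
} // pool's destructor joins the writeMapFile job here, no explicit get() needed
```

The single shared pool is what keeps the total thread count under control, which is the concern raised above about running std::async alongside parallelForEachN.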

thevinster planned changes to this revision.Dec 10 2021, 11:17 AM

Use threadpool

thevinster edited the summary of this revision. (Show Details)Dec 10 2021, 6:31 PM
thakis added a subscriber: thakis.Dec 12 2021, 9:42 AM

Looks cool, but have you tried optimizing the map writing code a bit first? IIRC it's fairly unoptimized.

(But moving it on a thread and optimizing it are both good things to do of course.)

Looks cool, but have you tried optimizing the map writing code a bit first? IIRC it's fairly unoptimized.

Thanks for taking a glance! That's one area I'm thinking of improving as well, but right now this seemed like lower-hanging fruit that should give us relatively significant wins.

int3 added a comment.Dec 13 2021, 10:41 AM

@thakis I think they're orthogonal and equally useful improvements. One reason we haven't optimized the actual MapFile generation yet is that we're keeping it as ramp-up work for a new team member :)

lld/MachO/Writer.cpp
68

super nit, but "writer" here seems redundant since this is the only threadpool used in Writer

1164–1165

so as my earlier block comment was saying, we should pass writerThreadPool to finalizeLinkEditSegment too and use that in lieu of parallelForEachN, so that we maintain control of the total number of threads.

thevinster added inline comments.Dec 13 2021, 11:04 AM
lld/MachO/Writer.cpp
68

I think it might be more relevant now that we want to use it in finalizeLinkEditSegment. Any reason why we want to instantiate it in the run scope as opposed to having it as a member of the class? It's possible we may want to use it elsewhere within the Writer class. In fact, scanning usages of parallelForEach{N} also shows writeUuid benefiting from using async. In that case, it makes sense to put it at class-level scope.

1164–1165

Oh I see. It wasn't clear from your previous comment that you wanted to replace parallelForEachN with it. I'll change.

int3 added inline comments.Dec 13 2021, 11:13 AM
lld/MachO/Writer.cpp
68

I have no issues with the scope. I'm just saying that having "writer" in the variable name seems redundant

thevinster planned changes to this revision.Dec 13 2021, 11:15 AM
thevinster added inline comments.
lld/MachO/Writer.cpp
68

My bad. I misinterpreted the statement. 👍

Show two different ways of using threadPool

thevinster added a comment.EditedDec 13 2021, 6:26 PM

Not meant as a commit for submitting, but wanted to poll the audience on what pattern we'd like to use with threadPool. We essentially have three options here:

  • Option 1 - continue using parallelForEach methods
  • Option 2 - Re-use our declared class-level threadPool, but in areas where we only want to wait on a subset of threads, we have to re-implement the fork-join model by declaring a vector and waiting on the jobs pushed into it (See L1040-1050)
  • Option 3 - Declare class-level and method-level threadPools. Simplifies the for loop from Option 2, but might be harder to manage, since we now have to control N threadPools with potentially N different concurrency settings. (See L1103-1115)

I ran a benchmark across all the options and there's no meaningful performance difference between them. So it seems this all comes down to syntax and what looks best.
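The manual fork-join in Option 2 can be sketched in standard C++ (using std::future handles as a stand-in for the shared ThreadPool; squareSum and its contents are illustrative only): collect handles for just the jobs this fork dispatched, then wait on exactly that subset, leaving unrelated jobs such as the map-file write untouched.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Fork: dispatch one job per item; join: wait only on those jobs, not on
// anything else that may be running on the shared pool.
int squareSum(const std::vector<int> &items) {
  std::vector<std::future<int>> jobs;
  for (int v : items)
    jobs.push_back(std::async(std::launch::async, [v] { return v * v; }));
  int total = 0;
  for (auto &job : jobs) // join only this fork's jobs
    total += job.get();
  return total;
}
```

This is the extra bookkeeping Option 1's parallelForEach hides, which is why the thread kept parallelForEach for the simple single-point fork-joins.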

thevinster planned changes to this revision.Dec 13 2021, 6:27 PM
oontvoo added inline comments.Dec 13 2021, 6:28 PM
lld/MachO/Writer.cpp
1044–1047
1045

nit: no brace

1100–1111

why is this change needed?

lld/MachO/Writer.cpp
1100–1111

This commit is meant as a way to show the different ways we could be using ThreadPool. It's more of an RFC (See https://reviews.llvm.org/D115416#3190992). As far as why we want to change it (per @int3's suggestion), we want to change all our uses of parallelForEach in favor of ThreadPool so we can control the total number of threads.

int3 added inline comments.Dec 13 2021, 9:18 PM
lld/MachO/Writer.cpp
68

so can we rename writerThreadPool to just threadPool? :)

1100–1111

we should be reusing the writerThreadPool here instead of creating a writeUuidThreadPool... each ThreadPool maintains its own counter of the number of threads spawned

thevinster added inline comments.Dec 13 2021, 9:23 PM
lld/MachO/Writer.cpp
68

I'll clean this up. This revision is more of a "which option do we want to go for first" before I clean up all the nit details

1100–1111

Okay, so it sounds like we want to go for Option 2 then?

int3 added a comment.Dec 13 2021, 9:32 PM

Oh yeah sorry I forgot to reply to your other comment. Yes, I'd prefer option 2. We should use parallelForEach whenever possible, and fall back to the single class-level threadPool whenever we want to fork/join at multiple different points.

oontvoo added inline comments.Dec 14 2021, 6:24 AM
lld/MachO/Writer.cpp
1100–1111

we should be reusing the writerThreadPool here instead of creating a writeUuidThreadPool... each ThreadPool maintains its own counter of the number of threads spawned

Yeah +1 to this.
Maybe my question was too terse - I was asking why you needed to swap parallelForEachN for a more manual approach but did NOT use the writerThreadPool.

@thakis I think they're orthogonal and equally useful improvements.

Absolutely agreed :)

thevinster marked 14 inline comments as done.Dec 14 2021, 3:15 PM

Address feedback + use option 2

oontvoo added inline comments.Dec 14 2021, 4:53 PM
lld/MachO/Writer.cpp
1047–1052

No need to enqueue a NULL osec

oontvoo added inline comments.Dec 14 2021, 4:55 PM
lld/MachO/Writer.cpp
1047–1052

(although I'm not sure it could ever be null here )

int3 accepted this revision.Dec 14 2021, 6:26 PM

lgtm, thanks!

lld/MachO/Writer.cpp
1047–1052

+1

This revision is now accepted and ready to land.Dec 14 2021, 6:26 PM
oontvoo accepted this revision.Dec 14 2021, 7:54 PM

LG - thanks for putting up with the nits!

Prevent writing map files on the critical path

Maybe a better name is something like "Make writing map file asynchronous"

thevinster retitled this revision from [lld-macho] Prevent writing map files on the critical path to [lld-macho] Make writing map file asynchronous.Dec 15 2021, 4:35 PM
thevinster marked 3 inline comments as done.

Don't enqueue if null osec

This revision was landed with ongoing or failed builds.Dec 15 2021, 4:41 PM
This revision was automatically updated to reflect the committed changes.
crownyanguan added inline comments.
lld/MachO/Writer.cpp
1111

This change leads to a build failure on Windows, like this:
W:\ll_1074497988\20211217_000000\llvm\lld\MachO\Writer.cpp(1111): error C2365: 'i': redefinition; previous definition was 'data variable'
W:\ll_1074497988\20211217_000000\llvm\lld\MachO\Writer.cpp(1109): note: see declaration of 'i'
c:/Program files (x86)/Microsoft Visual Studio/2017/Professional/VC/Tools/MSVC/14.16.27023/include\functional(1914): fatal error C1060: compiler is out of heap space

thevinster added a comment.EditedJan 10 2022, 12:26 AM
lld/MachO/Writer.cpp
1111

Thanks for flagging! I don't have a Windows machine, but https://reviews.llvm.org/rG7a161eb43b280a6194fef5b65a81ef59921ebf02 should fix the shadowing error. I'm not sure why this would cause heap issues, but could you try re-running your build and see if it still reproduces?
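For readers unfamiliar with C2365: the error arises when a name captured by a lambda is redeclared inside the lambda body, which some MSVC versions reject. A hedged, self-contained illustration of the pattern and its fix (this is not the actual Writer.cpp code; the names are made up):

```cpp
#include <cassert>
#include <cstddef>

// Some MSVC versions reject reusing a captured loop variable's name inside
// the lambda body; the fix is to rename the inner variable.
size_t sumDoubled() {
  size_t total = 0;
  for (size_t i = 0; i < 4; ++i) {
    auto job = [i, &total] {
      // size_t i = i * 2;   // MSVC C2365: 'i': redefinition
      size_t doubled = i * 2; // renamed inner variable compiles everywhere
      total += doubled;
    };
    job();
  }
  return total;
}
```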

crownyanguan added inline comments.Jan 16 2022, 6:15 PM
lld/MachO/Writer.cpp
1111

I don't know whether this is what fixed it, but currently it builds and passes on Windows.