Page MenuHomePhabricator

win: Move Parallel.h off concrt to cross-platform code
ClosedPublic

Authored by thakis on Oct 10 2019, 11:27 AM.

Details

Summary

r179397 added Parallel.h and implemented it terms of concrt in 2013.

In 2015, a cross-platform implementation of the functions has appeared and is in use everywhere but on Windows (r232419).
r246219 hints that <thread> had issues in MSVC2013, but r296906 suggests they've been fixed now that we require 2015+.

So remove the concrt code. It's less code, and it sounds like concrt has conceptual and performance issues, see PR41198.

I built blink_core.dll in a debug component build with full symbols and in a release component build without any symbols.
I couldn't measure a performance difference for linking blink_core.dll before and after this patch.
(Raw data: https://gist.github.com/nico/d4b02c7dd835bb96ed67e919f3558e6f)

Diff Detail

Event Timeline

thakis created this revision.Oct 10 2019, 11:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 10 2019, 11:27 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
rnk accepted this revision.Oct 10 2019, 11:36 AM

Looks very good. :)

This revision is now accepted and ready to land.Oct 10 2019, 11:36 AM
This revision was automatically updated to reflect the committed changes.

Looks slightly better without ConcRT, thanks Nico!
Here's some quick results showing the difference for the global hash parallelization in LLD, with MSVC OBJs. The algorithm is iterating on the .debug$T records, for a few thousands OBJs on 72 hyper threads:
(we're saving about 2 secs on this test)

Before:

After:

There's still this memory map lock eating half of the CPU time, I'm not too sure yet how to avoid it, if ever. Maybe touch the pages for the OBJ files in advance? Anyway, unrelated.

BillyONeal added inline comments.Oct 10 2019, 9:24 PM
llvm/include/llvm/Support/Parallel.h
124

If you get a chance to benchmark I'm curious how this compares to our std::sort(std::execution::par, ...) version :)

aganea marked an inline comment as done.Oct 11 2019, 8:31 AM
aganea added inline comments.
llvm/include/llvm/Support/Parallel.h
124

I ran a few AB/BA tests on LLD with my dataset. The cumulated time on all cores with ConcRT is consistently over by about 300ms on my 36-core Skylake (~1.9 sec for ConcRT version, ~1.6 sec after this patch). There are only three places where we parallelSort in LLD, so maybe this not representative. But the dataset is quite big, ~22 GB of OBJs and LIBs. This is a Unity build of the Editor Release target of one of our games. I can try also with no Unity files, usually the dataset is about an order of magnitude greater.

Before:

After:

BillyONeal added inline comments.Oct 11 2019, 9:12 AM
llvm/include/llvm/Support/Parallel.h
124

Not concrt; the std::sort(par...) standard parallel algorithm is an unrelated implementation.

rnk added inline comments.May 14 2020, 12:06 PM
llvm/lib/Support/Parallel.cpp
73

I've belatedly realized that this means that LLVM is doing thread management on its own, i.e. every linker invocation spawns hardware_concurrency() threads. My understanding is that ConCRT is built on the system worker thread pool, which helps prevent oversubscription of CPU resources.

While @aganea measured that this change improved benchmarks, this change could lead to bad throughput when multiple link jobs run concurrently. Today, LLD is not very parallel, but this may become more of an issue as we use more and more parallelism for debug info merging. At some point in the future, we should try measuring the impact of this change on the performance of three links running in parallel, and see if using the NT worker pool gives benefits in that case. For now, though, let's not get ahead of ourselves with unmeasured concerns and leave this as is.

aganea added inline comments.May 14 2020, 2:44 PM
llvm/lib/Support/Parallel.cpp
73

One cheap alternative is to always use heavyweight_hardware_concurrency() by default, and let the user do --threads=%NUMBER_OF_PROCESSORS% if they want hardware_concurrency().

In the absence of a global decision-maker, heavyweight_hardware_concurrency() is bit of a hack. Letting an external build system like Ninja doing that though static flags, ie --threads or /opt:lldltojobs doesn't work too well either. You can end up with large spans of time where nothing is happening, because that part of the application (LLD) isn't multi-threaded. Or because the ThreadPool's jobs are cooling down, as below at time 100:

I've tried increasing the number of threads, to see how it would react. It seems every extra ThinLTO thread above the my hardware threads, is adding roughly 150 ms to the execution. For example, running an input on 72 threads takes 108 sec, while the same input on 100 threads takes 113 sec. I don't know if the relation is linear, but it gives an idea. Probably context-switching between applications would be even more costly, I assume two lld-link running side-by-side using each 72 threads would cost even more.

I think a platform-independent solution is needed here. If we have several LLDs running, we could dynamically throttle the number of threads for each ThreadPool, through some kind of IPC. We "just" need to ensure there aren't more than N threads at one time, while taking into account: affinity, hyper-threading/cache affinity, core-local memory, and multi-socket machines.

How we would interface with Ninja? LLD wouldn't know how many free "lanes" Ninja has. Should we retain, increase, or remove LLVM_PARALLEL_LINK_JOBS ? We could build some kind of generic IPC API to be used in Ninja, but then what happens for build systems that don't implement it? make, Fastbuild, MSBuild, etc.

Another way would be to embed the the compiler & the linker into the build system (not necessarly in the way I was showing last year). There's value for doing so, one example is the usage of clang-scan-deps I was showing, it lets the build system extract dependency information very quickly, instead of invoking thousands of processes, while doing memoization as much as possible. The same thing can be achieved for pre-processing, compilation or linking. Lots of things to be done, not enough time :-)

BTW ThreadPoolExecutor has undefined behavior since there's a detached thread touching the standard library when the program exits violating [basic.start.term]/6. Detached threads are almost never safe.

BTW ThreadPoolExecutor has undefined behavior since there's a detached thread touching the standard library when the program exits violating [basic.start.term]/6. Detached threads are almost never safe.

This was fixed later by D70447.