Page MenuHomePhabricator

win: Move Parallel.h off concrt to cross-platform code

Authored by thakis on Oct 10 2019, 11:27 AM.



r179397 added Parallel.h and implemented it terms of concrt in 2013.

In 2015, a cross-platform implementation of the functions has appeared and is in use everywhere but on Windows (r232419).
r246219 hints that <thread> had issues in MSVC2013, but r296906 suggests they've been fixed now that we require 2015+.

So remove the concrt code. It's less code, and it sounds like concrt has conceptual and performance issues, see PR41198.

I built blink_core.dll in a debug component build with full symbols and in a release component build without any symbols.
I couldn't measure a performance difference for linking blink_core.dll before and after this patch.
(Raw data:

Diff Detail

Event Timeline

thakis created this revision.Oct 10 2019, 11:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 10 2019, 11:27 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
rnk accepted this revision.Oct 10 2019, 11:36 AM

Looks very good. :)

This revision is now accepted and ready to land.Oct 10 2019, 11:36 AM
This revision was automatically updated to reflect the committed changes.

Looks slightly better without ConcRT, thanks Nico!
Here's some quick results showing the difference for the global hash parallelization in LLD, with MSVC OBJs. The algorithm is iterating on the .debug$T records, for a few thousands OBJs on 72 hyper threads:
(we're saving about 2 secs on this test)



There's still this memory map lock eating half of the CPU time, I'm not too sure yet how to avoid it, if ever. Maybe touch the pages for the OBJ files in advance? Anyway, unrelated.

BillyONeal added inline comments.Oct 10 2019, 9:24 PM

If you get a chance to benchmark I'm curious how this compares to our std::sort(std::execution::par, ...) version :)

aganea marked an inline comment as done.Oct 11 2019, 8:31 AM
aganea added inline comments.

I ran a few AB/BA tests on LLD with my dataset. The cumulated time on all cores with ConcRT is consistently over by about 300ms on my 36-core Skylake (~1.9 sec for ConcRT version, ~1.6 sec after this patch). There are only three places where we parallelSort in LLD, so maybe this not representative. But the dataset is quite big, ~22 GB of OBJs and LIBs. This is a Unity build of the Editor Release target of one of our games. I can try also with no Unity files, usually the dataset is about an order of magnitude greater.



BillyONeal added inline comments.Oct 11 2019, 9:12 AM

Not concrt; the std::sort(par...) standard parallel algorithm is an unrelated implementation.