Page MenuHomePhabricator

Make Parallel.h build with libc++ on Windows.
AbandonedPublic

Authored by thakis on Mar 21 2019, 4:56 PM.

Details

Reviewers
rnk
EricWF
Summary

ppl.h requires yvals.h to be included, but it only includes a few
standard library headers and depends on those including yvals.h.

It seems unfortunate to make libc++/win include yvals.h in all header
files, so just add an explicit include of yvals.h here instead.

Diff Detail

Event Timeline

thakis created this revision.Mar 21 2019, 4:56 PM
Herald added a project: Restricted Project. · View Herald Transcript

Attempting to use ConcRT powered machinery (like the Concurrency:: primitives in this file) with a different STL like libc++ is likely to be non-functional, since that calls into concrt140.dll which assumes MSVC++'s STL. There's also an issue here that this code is using std::mutex with ConcRT powered machinery like Concurrency::parallel_for_each, which is not supported and is likely to result in deadlocks. ConcRT assumes that everything you give to it is non-blocking, so you can't use Concurrency::parallel_for_each with platform synchronization primitives. (This is the biggest reason we, MSVC++, are moving away from ConcRT as a technoligy and did not use it to power our parallel algorithms implementation)

There's an additional concern here that ppl.h (and yvals.h) will force linking with MSVC++'s STL DLL, msvcp140.dll, because support machinery for ppl.h live in there. yvals.h enforces this with pragma comment(lib, "msvcp140.dll") et al. yvals.h also emits corresponding pragma detect_mismatch directives which libc++ probably doesn't want to expose.

I'm not really sure what the path forward here is.

EricWF added inline comments.Mar 21 2019, 9:11 PM
llvm/include/llvm/Support/Parallel.h
29

This is an internal header we're not supposed to really be touching.
Can we guard if with #ifdef _LIBCPP_VERSION?

Thanks for chiming in!

Attempting to use ConcRT powered machinery (like the Concurrency:: primitives in this file) with a different STL like libc++ is likely to be non-functional, since that calls into concrt140.dll which assumes MSVC++'s STL.

'assumes' in what way? If concrt140.dll uses msvcp140.dll internally but no STL objects are passed from user code (built with libc++) to ppl functions to concrt140.dll, things should be fine for example.

(libc++ has all its symbols in an inline namespace, so you don't get symbol collisions and can have e.g. libc++ and libstdc++ in the same process. From what I understand, COFF is less symbol-name-based than ELF and there are fewer issues around the symbol name problem there anyways.)

There's also an issue here that this code is using std::mutex with ConcRT powered machinery like Concurrency::parallel_for_each, which is not supported and is likely to result in deadlocks. ConcRT assumes that everything you give to it is non-blocking, so you can't use Concurrency::parallel_for_each with platform synchronization primitives. (This is the biggest reason we, MSVC++, are moving away from ConcRT as a technoligy and did not use it to power our parallel algorithms implementation)

I filed https://bugs.llvm.org/show_bug.cgi?id=41198 for this part. Since that's a problem with the normal MSVC standard library, this seems like the biggest issue here. (But in practice, things seem to work fine so far.) But it's also independent of the libc++ part.

There's an additional concern here that ppl.h (and yvals.h) will force linking with MSVC++'s STL DLL, msvcp140.dll, because support machinery for ppl.h live in there. yvals.h enforces this with pragma comment(lib, "msvcp140.dll") et al. yvals.h also emits corresponding pragma detect_mismatch directives which libc++ probably doesn't want to expose.

IIRC we need bits of the lower-level part of MSVC++'s standard library for abi things like stdexcept (which in llvm land is in libc++abi and not libc++, but it's in the same library in Windows). I don't remember the details, but I can look them up if necessary.

I'm not really sure what the path forward here is.

Two things:

  1. For Chromium, we don't need to build clang and lld etc (for building chrome) with libc++. We do have a fork of LLVM 7.0 that's used by some GPU shader software emulation layer that we ship, and we need to build that with libc++ – but it's probably fine to build with LLVM_ENABLE_THREADS to sidestep this issue.
  1. For LLVM built with libc++ (which would just abstractly nice if it worked), I think it depends a bit on what exactly 'likely non-functional' above means :) Maybe it happens to work, and then we can land this patch for now. The concrt + mutex incompat issue might mean we have to move off concrt eventually (independent of libc++), which would then also solve this issue here.

'assumes' in what way? If concrt140.dll uses msvcp140.dll internally but no STL objects are passed from user code (built with libc++) to ppl functions to concrt140.dll, things should be fine for example.

STL objects are passed over the concrt140.dll boundary. For example, _TaskCollection et al. are passed around there and they contain exception_ptr objects. I don't have an explicit example of "this will cause doom" but these kinds of ODR problems tend to be the "impossible bug that you have to investigate for weeks" territory. Combine that with multithreading and that will likely be unpleasant to investigate.

From what I understand, COFF is less symbol-name-based than ELF

COFF is just as symbol name based, but the operating system linkage model is namespaced per-module on Windows vs. process-wide on POSIX. So each particular DLL can have their own concept of what vector::vector is. But if you pass pointer to vector over the DLL boundary, nothing will protect you and you get the usual ODR problems.

Since that's a problem with the normal MSVC standard library, this seems like the biggest issue here. (But in practice, things seem to work fine so far.) But it's also independent of the libc++ part.

That's true; this problem is existing. Note that this would have been okay in MSVC++2012, when std::mutex was itself powered by ConcRT. std::mutex is no longer powered by ConcRT because too many customers complained about being unable to use std::mutex in with other platform synchronization primitives like SRWLOCK or CRITICAL_SECTION.

The concrt + mutex incompat issue might mean we have to move off concrt eventually

Indeed, this is what we are recommending people do. Notably, our parallel algorithms implementation (should I get permission from our lawyers to contribute it to libc++, still pending :/) avoids any ConcRT interaction and calls into the Vista threadpool directly. I'm assuming you folks don't care about Windows XP support?

Billy3

thakis abandoned this revision.Mar 22 2019, 3:43 PM

'assumes' in what way? If concrt140.dll uses msvcp140.dll internally but no STL objects are passed from user code (built with libc++) to ppl functions to concrt140.dll, things should be fine for example.

STL objects are passed over the concrt140.dll boundary. For example, _TaskCollection et al. are passed around there and they contain exception_ptr objects. I don't have an explicit example of "this will cause doom" but these kinds of ODR problems tend to be the "impossible bug that you have to investigate for weeks" territory. Combine that with multithreading and that will likely be unpleasant to investigate.

That does sound like a deal breaker for this patch. Abandoning; and thanks again for taking the time.

From what I understand, COFF is less symbol-name-based than ELF

COFF is just as symbol name based, but the operating system linkage model is namespaced per-module on Windows vs. process-wide on POSIX. So each particular DLL can have their own concept of what vector::vector is. But if you pass pointer to vector over the DLL boundary, nothing will protect you and you get the usual ODR problems.

Right.

Since that's a problem with the normal MSVC standard library, this seems like the biggest issue here. (But in practice, things seem to work fine so far.) But it's also independent of the libc++ part.

That's true; this problem is existing. Note that this would have been okay in MSVC++2012, when std::mutex was itself powered by ConcRT. std::mutex is no longer powered by ConcRT because too many customers complained about being unable to use std::mutex in with other platform synchronization primitives like SRWLOCK or CRITICAL_SECTION.

The concrt + mutex incompat issue might mean we have to move off concrt eventually

Indeed, this is what we are recommending people do. Notably, our parallel algorithms implementation (should I get permission from our lawyers to contribute it to libc++, still pending :/) avoids any ConcRT interaction and calls into the Vista threadpool directly. I'm assuming you folks don't care about Windows XP support?

No, neither Chromium nor LLVM care about XP.

Sounds like the way forward is then:

  • In Chromium, build the llvm bits with LLVM_ENABLE_THREADS=off, which makes this issue go away.

f you happen to have pointers to the APIs to use here… :)

Our parallel algorithms implementation multiplies the hardware_concurrency by some multiplier (I think 32), breaks the input into that many chunks, calls CreateThreadpoolWork, calls SubmitThreadpoolWork once for each callback, has the calling thread also try to retire chunks, and calls WaitForThreadpoolWorkCallbacks and CloseThreadpoolWork when done. I talked about this at CppCon last September https://www.youtube.com/watch?v=nOpwhTbulmk

Billy3

aganea added a subscriber: aganea.EditedMay 10 2019, 9:28 AM

In addition to the std::mutex issue @BillyONeal mentions, it seems there's a fair amount of "friction"/spinning in concrt.
As a good example, while using WIP patch D55585, about ~15% of time is spent in concrt yielding the CPU on a 6-core PC. And the more cores you have, the worse it gets, up to ~33% of time yielding the CPU on a 36-cores PC (see below).

@BillyONeal do you think your new parallel algorithm(s) will give a better throughput?


The tests below are from a LLD compiled with Clang 8.0 and /MT /INCREMENTAL:NO to ensure no thunking, using concrt/ppl includes from MSVC 19.16.27030.1 x64
The dataset exercised by the parallel loop below is 8.8 GB, out of memory-mapped OBJs files. Both machines below have 128 GB and a dual NVMe SSD RAID0 drive, so I/O is not an issue (I've checked in WPA)
The vector being iterated has ~4700 elements (pointers), see https://reviews.llvm.org/D55585#change-l7lid0abOGLS PDB.cpp, L1660

First test, running on a 6-core Intel Xeon W-2135, Windows 10 1803:

View on a debug build in VS:

Second test, running on a 36-core Dual Intel Xeon Gold 6140, Windows 10 1703 - interestingly, the CPU graph at the top shows that it was never able to fully use all cores

NOTE: timings in the "Weight" column are in ms, cumulative for all cores.

@BillyONeal do you think your new parallel algorithm(s) will give a better throughput?

Yes; in fact this spinning behavior is one of the big reasons we didn't move forward with ConcRT as a technology (the other being inability to compose with OS primitives).

Moving us off concrt in D68820.