This is the first review of this code, so there's a lot to look at, and I fully expect a lot of changes to be requested.
This patch contains:
A) Changes to __threading_support that introduce:
- Low-level semaphores on POSIX and Apple GCD, and futex on Linux.
- Declarations for a sharded table of contention state in the dylib, used by atomic::wait.
- Declarations for a thread_local variable in the dylib, used by barriers.
B) The <atomic> changes from P1135:
- High QoI: multi-layered back-off, using either or both of the contention state table and the futexes from A.
- Low QoI: exponential time back-off, using chrono only.
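For reviewers who want a concrete picture of the low-QoI path, here is a minimal sketch of an exponential time back-off wait that uses only chrono and this_thread. It is not the code in the patch: the name wait_with_backoff, the 1 microsecond starting interval, and the 1 millisecond cap are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not from the patch): returns once `a` no longer holds `old`.
// Assumed policy: yield once, then sleep for intervals that double up to `max_sleep`.
template <class T>
void wait_with_backoff(const std::atomic<T>& a, T old,
                       std::chrono::nanoseconds max_sleep = std::chrono::milliseconds(1)) {
    assert(max_sleep.count() > 0);
    std::chrono::nanoseconds sleep{0};
    while (a.load(std::memory_order_acquire) == old) {
        if (sleep.count() == 0) {
            // First miss: give up the time slice before paying for a timed sleep.
            std::this_thread::yield();
            sleep = std::chrono::microseconds(1);
        } else {
            std::this_thread::sleep_for(sleep);
            // Double the interval, capped so wake-up latency stays bounded.
            sleep = std::min<std::chrono::nanoseconds>(sleep * 2, max_sleep);
        }
    }
}
```

The cap trades wake-up latency against CPU burned while polling; the high-QoI path avoids that trade-off by blocking on a futex or semaphore instead.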
C) Barrier:
- High QoI: a tree barrier, using the thread_local acceleration state from A to amortize the extra round.
- Low QoI: a central barrier, with a specialization for the empty completion function.
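To illustrate what "central barrier" means in the low-QoI case, the sketch below shows one common shape: a single arrival counter plus a phase counter, with no completion function (the empty-completion case). The class name central_barrier and the yield-based spin are assumptions for illustration, not the implementation in this patch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical central barrier with an empty completion step (not the patch's code).
class central_barrier {
    const std::ptrdiff_t expected_;
    std::atomic<std::ptrdiff_t> arrived_{0};
    std::atomic<std::size_t> phase_{0};

public:
    explicit central_barrier(std::ptrdiff_t expected) : expected_(expected) {
        assert(expected > 0);
    }

    void arrive_and_wait() {
        // Read the phase before arriving; it cannot advance until we have arrived.
        const std::size_t my_phase = phase_.load(std::memory_order_acquire);
        if (arrived_.fetch_add(1, std::memory_order_acq_rel) + 1 == expected_) {
            // Last arriver: reset the counter, then publish the next phase.
            arrived_.store(0, std::memory_order_relaxed);
            phase_.store(my_phase + 1, std::memory_order_release);
        } else {
            // Everyone else polls the single shared phase word.
            while (phase_.load(std::memory_order_acquire) == my_phase)
                std::this_thread::yield();
        }
    }
};
```

All waiters poll the same word, which is the contention a tree barrier is meant to spread out.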
D) Semaphore:
- All QoI: a general template semaphore for very large ptrdiff_t values.
- High QoI: a specialization for “reasonable” ptrdiff_t values, using the low-level semaphores from A and acceleration atomics.
- Low QoI: a specialization for unit count.
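As a rough illustration of why a unit-count specialization can be cheap, the sketch below needs only a single atomic word. The class name unit_semaphore and the yield-based acquire loop are assumptions for illustration; the patch may structure this differently.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical unit-count (binary) semaphore, not the patch's code.
class unit_semaphore {
    std::atomic<int> available_;

public:
    explicit unit_semaphore(int desired) : available_(desired) {
        assert(desired == 0 || desired == 1);
    }

    // Precondition (assumed, as for a max-count-1 semaphore): the count is currently 0.
    void release() { available_.store(1, std::memory_order_release); }

    bool try_acquire() {
        int expected = 1;
        // Take the single unit only if it is present.
        return available_.compare_exchange_strong(expected, 0,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed);
    }

    void acquire() {
        while (!try_acquire())
            std::this_thread::yield();
    }
};
```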
E) Latch:
- All QoI (low): a central latch. (If there's a desire to see it, I could borrow the same QoI knobs from barrier, but sizeof() would grow a lot.)
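For comparison with the barrier, a central latch can be as small as one counter, which is where the sizeof() concern comes from. The sketch below is an assumption-laden illustration (the class name central_latch and the yield loop are hypothetical), not the patch's implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical central latch (not the patch's code): a single shared counter.
class central_latch {
    std::atomic<std::ptrdiff_t> remaining_;

public:
    explicit central_latch(std::ptrdiff_t expected) : remaining_(expected) {
        assert(expected >= 0);
    }

    void count_down(std::ptrdiff_t n = 1) {
        // Release so that work done before count_down is visible to waiters.
        remaining_.fetch_sub(n, std::memory_order_release);
    }

    bool try_wait() const {
        return remaining_.load(std::memory_order_acquire) == 0;
    }

    void wait() const {
        while (!try_wait())
            std::this_thread::yield();
    }

    void arrive_and_wait() {
        count_down();
        wait();
    }
};
```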
F) The first basic tests for each facility from P1135.
G) Miscellaneous tweaks I needed to get this to build as libcu++ (the CUDA variant). We can drop these or you can take them as improvements. One of them is a legit macro bug.
Make sure this is wrapped in an #if !defined(_LIBCPP_HAS_THREAD_API_EXTERNAL). (It might already be, but I can't see the context.)