This is an archive of the discontinued LLVM Phabricator instance.

[libcxx][libcxxabi][docs] Document the various places we use threading in libcxx and libcxxabi
Needs Revision · Public

Authored by DanielMcIntosh-IBM on Jan 14 2022, 4:10 PM.

Details

Reviewers
ldionne
Mordante
Quuxplusone
Group Reviewers
Restricted Project
Summary

libcxx and libcxxabi use threading in some unexpected places. This documents all
those places, including how we use it and why it's needed.

Hopefully this will also help to clarify why z/OS needs to make changes to each
of these locations in order to support POSIX(OFF). See the discussion on D110349
for more context. However, most of the information in this document is not
specific to z/OS or AIX, and should be useful to anyone working on any platform.

Diff Detail

Event Timeline

DanielMcIntosh-IBM requested review of this revision. Jan 14 2022, 4:10 PM
DanielMcIntosh-IBM created this revision.
Herald added a project: Restricted Project. Jan 14 2022, 4:10 PM
Herald added a reviewer: Restricted Project.

Fix "Duplicate explicit target name" error.

Add InternalThreadSynchronization to libcxx/docs/index.rst

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
18–20

Thread-safe static initialization is the usual term for this.

21–24

Also struct Big { int a[100]; }; std::atomic<Big>, right? My ill-informed impression is that atomic<shared_ptr> requires DCAS (double-wide compare-and-swap) in hardware, but atomic<Big> requires arbitrarily wide atomics that don't exist anywhere.

std::atomic<std::shared_ptr<T>> is standard C++20. std::atomic<Big> is standard C++11.
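
For what it's worth, a quick way to check that impression on any given target (results are implementation- and target-dependent; `Big` is just the hypothetical struct above):

```cpp
#include <atomic>
#include <cstdio>

struct Big { int a[100]; };

int main() {
    std::atomic<int> small{0};
    std::atomic<Big> big{Big{}};  // standard C++11, but almost certainly not lock-free
    // is_lock_free() reports whether this specialization uses hardware atomics
    // or an implementation-internal lock.
    std::printf("atomic<int> lock-free: %d, atomic<Big> lock-free: %d\n",
                (int)small.is_lock_free(), (int)big.is_lock_free());
}
```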

43
57–64

I think it would make sense to explain the basic functionality in terms of thread_local first. Then the no-threads version is just "that, but replace thread_local with static"; and the no-thread-local version is "that, but simulate thread_local by hand in the following way...". (Assuming I got that right, of course. My eyes glazed over at the first paragraph. I'm hypothesizing that it will become more comprehensible once the basic functionality is separated from the mechanics of thread_local-simulation.)
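
Concretely, I'm imagining the explanation building on a skeleton like this (a sketch of the shape only — `PerThreadState` and the configuration macros are stand-ins I made up, not the real names):

```cpp
#include <pthread.h>
#include <cstdlib>

struct PerThreadState { int data; };  // stand-in for the real per-thread structure

#if defined(HAVE_THREAD_LOCAL)
// Basic functionality: one PerThreadState per thread.
PerThreadState *get_state() {
    static thread_local PerThreadState state;
    return &state;
}
#elif defined(HAVE_NO_THREADS)
// No-threads version: "that, but replace thread_local with static".
PerThreadState *get_state() {
    static PerThreadState state;
    return &state;
}
#else
// No-thread_local version: simulate thread_local by hand with a pthread key.
static pthread_key_t key;
static pthread_once_t once = PTHREAD_ONCE_INIT;

static void destroy_state(void *p) { std::free(p); }
static void create_key() {
    if (pthread_key_create(&key, destroy_state) != 0)
        std::abort();
}

PerThreadState *get_state() {
    pthread_once(&once, create_key);
    void *p = pthread_getspecific(key);
    if (p == nullptr) {
        p = std::calloc(1, sizeof(PerThreadState));
        if (p == nullptr || pthread_setspecific(key, p) != 0)
            std::abort();
    }
    return static_cast<PerThreadState *>(p);
}
#endif
```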

81–85

intialized
"For reasons described [on a different website not run by us]" is a risky game. ;) The usual term for what they're describing is "Static Initialization Order Fiasco": we need (what object?)->__id_ to get initialized before the first time it's used, but the first time it's used might be inside the constructor of std::cout, which might run before the constructor of (what object?) because static objects' initialization order is not generally controllable.
We actually do control initialization order in a couple of places in the standard library, via __attribute__((init_priority(SOME-HIGH-NUMBER))) a.k.a. _LIBCPP_INIT_PRIORITY_MAX; do we avoid the attribute here only for historical reasons?
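
For reference, using the attribute looks roughly like this (a hypothetical example; 101 is just the lowest priority ordinarily available outside the implementation, and whether the attribute is applicable here is exactly the question above):

```cpp
// Hypothetical use of init_priority: force this object's dynamic initializer
// to run before other static-storage-duration objects in the same image.
// Lower numbers run earlier.
struct Registry {
    Registry() { /* runs very early during static initialization */ }
};
__attribute__((init_priority(101))) static Registry early_registry;
```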

91
98

excecution, responsability, gaurd, aquired
I'd strongly prefer just spelling out static local [variable] on each reference, instead of introducing an idiosyncratic acronym SLV.
"Unlike C, C++ permits dynamic initialization of static local variables. Dynamic initializers for static locals run the first time execution passes through the declaration. So C++ needs to keep track of whether a static local has already been initialized, and avoid races if multiple threads attempt to perform the initialization at once. The Itanium C++ ABI defines three runtime functions to help with this: each thread should call __cxa_guard_acquire before attempting initialization, and either __cxa_guard_abort or __cxa_guard_release after initialization fails (via an exception) or succeeds, respectively."
And then we can get into how libc++abi implements those three functions in detail.
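
It might also help to show roughly what the compiler emits for such a declaration, so readers know where these three calls come from. A simplified sketch (real codegen differs in detail, and the guard word is 64 bits on most non-ARM targets):

```cpp
#include <cstdint>
#include <new>

extern "C" int  __cxa_guard_acquire(std::uint64_t *guard);
extern "C" void __cxa_guard_release(std::uint64_t *guard);
extern "C" void __cxa_guard_abort(std::uint64_t *guard);

struct Widget { int value; Widget() : value(42) {} };

// Roughly what the compiler emits for:
//     Widget &get() { static Widget w; return w; }
Widget &get() {
    alignas(Widget) static unsigned char storage[sizeof(Widget)];
    static std::uint64_t guard;  // first byte acts as the "already initialized" flag
    if (*reinterpret_cast<unsigned char *>(&guard) == 0) {   // fast path
        if (__cxa_guard_acquire(&guard)) {  // nonzero: this thread should initialize
            try {
                ::new (storage) Widget();
            } catch (...) {
                __cxa_guard_abort(&guard);  // initialization failed: let another thread retry
                throw;
            }
            __cxa_guard_release(&guard);    // mark initialized and wake any waiters
        }
    }
    return *reinterpret_cast<Widget *>(storage);
}
```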

107

"sometimes"? It would be useful to say when and why.

114–118

Contrary to what the Itanium ABI suggests, we do not hold on to the mutex (or any other mutex) after returning from __cxa_guard_acquire().

I think this is a narrower-than-useful definition of "mutex." When we return from __cxa_guard_acquire we're definitely "holding" something, which we must remember to release later, and which (as long as we hold it) will block any other thread from successfully completing a __cxa_guard_acquire on the same thing; instead those threads will go to sleep until the thing is available again. We could implement this thing as a std::mutex, but we don't (because std::mutex is large?) so instead we implement it as ____.

You don't describe how the three primitives are implemented when _LIBCXXABI_USE_FUTEX is defined.

121

IIUC: There is only one std::condition_variable, global to the whole program. Any thread which is waiting to attempt initialization of any static local (waiting its turn to climb the glass hill) will be blocked on that std::condition_variable. When anyone succeeds at their initialization attempt, they notify_all on the condition variable (because they want to unblock all the waiters associated with their own static local). When anyone fails at their initialization attempt, they notify_all on the condition variable (because they want to unblock just one waiter associated with their own static local, but the condition variable's waiter list might be a mix of threads associated with many different static locals so they don't know who to wake).
This is all gotten from your description; I didn't try to correlate it with the actual code. Did I get the right impression?
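
If I do have the right impression, the whole thing could be summarized with a sketch along these lines (heavily simplified, written from your description rather than from the code):

```cpp
#include <condition_variable>
#include <mutex>

// One global mutex, one global condition variable, one status byte per
// static local.  Not the real libc++abi code.
static std::mutex              guard_mut;
static std::condition_variable guard_cv;

enum GuardState : unsigned char { uninitialized = 0, in_progress = 1, complete = 2 };

int sketch_guard_acquire(unsigned char *state) {
    std::unique_lock<std::mutex> lk(guard_mut);
    for (;;) {
        if (*state == complete)
            return 0;               // someone else finished: skip initialization
        if (*state == uninitialized) {
            *state = in_progress;   // this thread will run the initializer
            return 1;
        }
        // Another thread is initializing *some* static local; sleep on the
        // single global condition variable until anything changes.
        guard_cv.wait(lk);
    }
}

void sketch_guard_release(unsigned char *state) {
    { std::lock_guard<std::mutex> lk(guard_mut); *state = complete; }
    guard_cv.notify_all();  // wake everyone; each waiter re-checks its own byte
}

void sketch_guard_abort(unsigned char *state) {
    { std::lock_guard<std::mutex> lk(guard_mut); *state = uninitialized; }
    guard_cv.notify_all();  // we can't wake only "our" waiters, so wake them all
}
```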

129

Throughout: If you mean shared_ptr, please say shared_ptr. "Shared pointer" means a pointer (T*) that is shared.

152–158

I don't particularly understand this section. My best guess is that you're speculating that std::atomic<shared_ptr<T>> might hold a data member of type std::mutex; but I expect/hope that we'd never do that, because it's (not standard-mandated but) really important for QoI that sizeof(std::atomic<T>) == sizeof(T), and I don't see any reason for shared_ptr to be the unique exception to that rule. It would also just be super confusing if atomic_foo(shared_ptr*) and atomic<shared_ptr> used two different synchronization mechanisms.

176–180

I'd say:

std::random_shuffle depends on a global pseudorandom number generator (PRNG), similar to std::rand(). Calls to the global PRNG must be synchronized to avoid data races. The dylib provides a function named __rs_get() which returns an object of type __rs_default. __rs_default::operator() calls the operator() of a static local std::mt19937, which is protected by the global mutex __rs_mut. __rs_default's constructor locks __rs_mut; __rs_default's destructor unlocks __rs_mut. Note that this means libc++'s std::random_shuffle is not reentrant.
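
It might even be worth condensing that into a sketch (simplified; the real code lives in the dylib and uses a manual lock plus a reference count rather than a unique_lock):

```cpp
#include <mutex>
#include <random>

static std::mutex __rs_mut;  // the global mutex described above

class __rs_default {
    std::unique_lock<std::mutex> lk_;
public:
    __rs_default() : lk_(__rs_mut) {}       // constructor locks the global mutex
    __rs_default(__rs_default &&) = default;
    std::mt19937::result_type operator()() {
        static std::mt19937 gen;            // the single global PRNG state
        return gen();
    }
};                                          // destructor unlocks via unique_lock

__rs_default __rs_get() { return __rs_default(); }

// Usage inside random_shuffle would then look like:
//     auto rng = __rs_get();
//     std::size_t i = rng() % n;
```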

251–252

This sounds like a "different people do different things" cop-out. Are you trying to describe what we actually do (if so, you need only describe the actual implementation, not the choices we didn't make)? Or is your target audience an implementor of a new platform, where you think they should do something different (if so, you need only describe the thing you think they should do)?

258–260

This sounds very much like "However, cxa_exception was not written with this behaviour in mind, so don't do that." — and then this whole paragraph can be removed!

DanielMcIntosh-IBM marked 4 inline comments as done.

Address some of @Quuxplusone's comments

DanielMcIntosh-IBM added a comment (edited). Jan 19 2022, 9:03 AM

@Quuxplusone I've addressed the comments you made that were easy to address. The rest will take me some time to get to as I have other items I need to work on at the moment.

However, even without addressing them all, I think this still provides valuable background for the POSIX(OFF) work in D110349 and/or D117375.

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
21–24

First, this is about the standalone atomic_load(const shared_ptr<_Tp>* __p)/atomic_store(shared_ptr<_Tp>* __p, shared_ptr<_Tp> __r)/etc., which are standard C++11 and already implemented in libcxx, unlike atomic<shared_ptr>.

Second, yes, implementing atomic<Big> directly using builtin atomics would require arbitrarily wide atomics. They kind of exist through the __c11_atomic builtins and/or the gcc __atomic builtins. Without these, atomic<Big> can still be implemented indirectly using spin-locks or regular locks.
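
Roughly what I mean by implementing it indirectly with a spin-lock (an illustrative sketch, not the real __cxx_atomic_lock_impl; note that it gives up sizeof(atomic<T>) == sizeof(T)):

```cpp
#include <atomic>

// Lock-based "atomic<Big>": a spin-lock lives next to the value, so no
// arbitrarily wide hardware atomics are needed.
template <class T>
class LockedAtomic {
    mutable std::atomic<bool> locked_{false};
    T value_;

    void acquire() const {
        while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void release() const { locked_.store(false, std::memory_order_release); }

public:
    LockedAtomic() : value_() {}
    explicit LockedAtomic(T v) : value_(v) {}

    T load() const {
        acquire();
        T result = value_;
        release();
        return result;
    }
    void store(T v) {
        acquire();
        value_ = v;
        release();
    }
};

struct Big { int a[100]; };
static LockedAtomic<Big> shared_big;  // every operation goes through the spin-lock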

Technically, the standalone shared pointer overloads could also use spin-locks, but at present they don't. I assumed this is because unlike std::atomic<Big> and std::atomic<std::shared_ptr<T>> they need to share the mutex(es) or spin-lock(s) across all shared-pointers, whereas atomic<Big> and atomic<shared_ptr> can put a bool or mutex inside the atomic object, making it unique to a specific shared_ptr or Big. However, looking at the description of rGd77851e8374c5a48de6e7694196b714abd673d84, it seems it was just because of a "pathological distrust of spin locks".
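
For illustration, sharing the lock(s) across all shared_ptrs amounts to something like hashing the object's address into a fixed pool of mutexes (made-up names, not the dylib's actual code):

```cpp
#include <cstdint>
#include <memory>
#include <mutex>

static const std::size_t kNumLocks = 16;
static std::mutex sp_locks[kNumLocks];

// Pick a mutex based on the address of the shared_ptr object itself.
static std::mutex &lock_for(const void *p) {
    return sp_locks[(reinterpret_cast<std::uintptr_t>(p) >> 4) % kNumLocks];
}

template <class T>
std::shared_ptr<T> my_atomic_load(const std::shared_ptr<T> *p) {
    std::lock_guard<std::mutex> g(lock_for(p));
    return *p;
}

template <class T>
void my_atomic_store(std::shared_ptr<T> *p, std::shared_ptr<T> r) {
    std::lock_guard<std::mutex> g(lock_for(p));
    p->swap(r);
}
```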

107

The thread-id is only used to detect recursive initialization (in a multi-threaded environment), and it's not super relevant to this discussion, but I can include a short blurb on it if you think that would be appropriate.
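
If I do add the blurb, it would boil down to something like this (sketch only; the names and guard layout are made up):

```cpp
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <thread>

// A thread-id stored in the guard object can detect recursive initialization
// (which is undefined behaviour per the standard).
struct GuardObject {
    unsigned char initialized;
    unsigned int  initializing_thread;  // nonzero while an initializer is running
};

static unsigned int current_thread_id() {
    return static_cast<unsigned int>(
        std::hash<std::thread::id>()(std::this_thread::get_id())) | 1u;  // never 0
}

int sketch_guard_acquire(GuardObject *g) {
    if (g->initialized)
        return 0;
    if (g->initializing_thread == current_thread_id()) {
        // The initializer of this static local re-entered its own declaration
        // (directly or indirectly) on the same thread: report and bail out.
        std::fputs("recursive initialization of a static local\n", stderr);
        std::abort();
    }
    // ... normal path: record our id in the guard, or wait for the other thread ...
    g->initializing_thread = current_thread_id();
    return 1;
}
```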

114–118

That's a fair point, but I think that in the context of this discussion, where the focus is on when and where we use the base threading support library (e.g. pthreads), it makes sense to limit the definition of a "mutex" to objects which we acquire/release using said threading support library (i.e. __libcpp_mutex_t and/or __libcpp_recursive_mutex_t). If we use a broader definition of mutex/lock, it becomes a lot less clear whether an operation would require using the thread support library. I've excluded primitive spin-locks (which can be implemented using atomic operations) from the definition of a lock/mutex as well, for similar reasons. If you think it's necessary, I can include a brief discussion of this choice at the beginning of the document, but I think it's pretty clear as-is.

I glossed over what happens when _LIBCXXABI_USE_FUTEX is defined because it doesn't appear to be in use at the moment and doesn't rely on __threading_support, but I'll add a small blurb explaining exactly that.
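
For that blurb: the primitive underneath is the Linux futex syscall, along these lines (Linux-only sketch, not the actual _LIBCXXABI_USE_FUTEX code):

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <atomic>
#include <climits>
#include <cstdint>

// Waiters sleep in the kernel until the 32-bit word changes, with no pthread
// mutex or condition variable involved.
static void futex_wait(std::atomic<std::int32_t> *addr, std::int32_t expected) {
    // Sleeps only if *addr still equals `expected` when the kernel checks it.
    syscall(SYS_futex, reinterpret_cast<std::int32_t *>(addr),
            FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
}

static void futex_wake_all(std::atomic<std::int32_t> *addr) {
    syscall(SYS_futex, reinterpret_cast<std::int32_t *>(addr),
            FUTEX_WAKE_PRIVATE, INT_MAX, nullptr, nullptr, 0);
}

// A guard-acquire loop would then wait with
//     while (state->load() == in_progress) futex_wait(state, in_progress);
// and release/abort would update the state and call futex_wake_all(state).
```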

121

Yes, that is correct.

152–158

When I wrote this, I was looking at __cxx_atomic_lock_impl, which does not have sizeof(std::atomic<T>) == sizeof(T) and uses a spin-lock as I described; I suggested std::atomic<std::shared_ptr<T>> might do the same.

Looking at it again more carefully, I see that __cxx_atomic_lock_impl is only used when _LIBCPP_ATOMIC_ONLY_USE_BUILTINS is defined (right now this only happens for freestanding implementations).

It seems I have also misunderstood the reason these overloads don't use a lock-free implementation (it has more to do with the shared_ptr's non-trivial constructors than anything else). I'll update this whole section with a new description when I get the chance.

251–252

As I discuss below, I wanted to document both the current implementation, which doesn't support loading the thread library in the middle of exception handling, as well as how support for that could be added. You're right though that this wording isn't great, and I'll update it when I get the chance.

258–260

Unfortunately, this is also probably the best and least disruptive option at the current time.

The only other option I can think of is described in the next paragraph, and to implement that well would require a lot more work (which I gloss over completely with the statement "This approach does require some method of determining whether the current thread is the main thread, however that can be accomplished in a very platform agnostic manner that is outside the scope of this document."). In the long term, I personally think this second approach will be better for us, because it allows the threading library to be loaded in the middle of exception handling, but it will be much more disruptive to libc++ and the broader llvm community. Given the push-back I've already received from Louis in D110349, I suspect that unless there is a demonstrable performance benefit to it on other platforms, I will have a hard time getting it approved.

I don't have the resources to invest in implementing and performance testing the second option on the off chance it is better and I can convince the llvm community to switch. (Especially since there are some messy situations around the user spawning threads without using std::thread - in which case what we decide is the 'main' thread might not actually be the main thread. This turns out to be a non-issue, but it gets a little messy and would require extensive comments and documentation). I did however still want to document this as an option since the AIX team (for whom this is most relevant) have yet to look at this in detail.

ldionne requested changes to this revision. Jan 24 2022, 8:08 AM

There's a lot of quite useful information in there. However, it's not clear to me that a "DesignDoc" is the right place to put it. I'm truly scared that this is going to get out of sync with the actual code in no time at all, especially the parts that explain the current implementation of <something>. Instead, would it make sense to improve the in-code documentation of the various parts you describe by adding those descriptions as high-level comments in the code itself, close to the implementation? Then, this design doc can stay higher level and not risk getting out of sync.

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
38
81
95

This is usually called a function-local static I think? Or is Static Local Variable the official naming and I'm using ad-hoc terminology? (that's quite possible)

This revision now requires changes to proceed. Jan 24 2022, 8:08 AM

DanielMcIntosh-IBM replied to the comment above:

That's a good point. When I circle back around to this, I'll move the description of the implementation out of the design doc and into the code where possible, but I worry that some parts of this will be very difficult to explain in a clear and concise manner without getting a little bit into the actual implementation. We'll see how it goes when I actually try moving things.

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
95

There doesn't seem to be a whole lot of consistency in terminology as far as I can tell. cppreference refers to them as "Static Local Variables", the standard seems to have settled on "block variable with static storage duration" (e.g. basic.stc.static), but has also used "local static variables" on occasion in the past (e.g. basic.stc.static in the C++14 draft).