diff --git a/libcxx/docs/DesignDocs/InternalThreadSynchronization.rst b/libcxx/docs/DesignDocs/InternalThreadSynchronization.rst new file mode 100644 --- /dev/null +++ b/libcxx/docs/DesignDocs/InternalThreadSynchronization.rst @@ -0,0 +1,270 @@ +=============================== +Internal Thread Synchronization +=============================== + +.. contents:: + :local: + +Overview +======== + +Several parts of C++ and the C++ standard library unrelated to the `C++11 Thread Support Library +`__ are required to operate in a thread-aware manner. +In particular, + +* exceptions are thrown and caught on a per-thread basis. +* ``std::locale`` must allow concurrent access and modification. +* `"If multiple threads attempt to initialize the same static local variable concurrently, the + initialization occurs exactly once" + `__ +* the `shared pointer overloads `__ for + ``std::atomic_...`` cannot be easily implemented without using locks. (``std::atomic_...`` + is part of the `Atomic Operations Library `__ rather + than the C++11 Thread Support Library) + +In addition, libc++ and libc++abi have some pieces of their own that need to be thread-aware: + +* ``std::__libcpp_db``, since containers and iterators may be used in a concurrent fashion. + (See :ref:`Debug Mode ` for more information) +* ``std::__rs_default``, because of its use of a random number engine with static storage duration. +* ``__aligned_malloc_with_fallback``, ``__calloc_with_fallback`` and their ``free_with_fallback`` + counterparts. + + +.. _exceptions: + +Exceptions +---------- +To manage the caught and uncaught exception stacks for a thread, the libc++abi uses a +``__cxa_eh_globals`` structure. A pointer to the ``__cxa_eh_globals`` instance associated with the +current thread is obtained using ``__cxa_get_globals()`` and/or ``__cxa_get_globals_fast()``. 
+ +Assuming neither ``_LIBCXXABI_HAS_NO_THREADS`` nor ``HAS_THREAD_LOCAL`` is defined, the current +implementation for these functions is: + +* The first time ANY thread calls ``__cxa_get_globals_fast()``, we initialize a ``__libcpp_tls_key`` + using ``__libcpp_execute_once`` +* Then, ``__cxa_get_globals_fast()`` will use ``__libcpp_tls_get()`` to retrieve a pointer to any + potential previously allocated ``__cxa_eh_globals`` associated with the current thread. If we + haven't allocated any before, this will return ``NULL``. +* ``__cxa_get_globals()`` will invoke ``__cxa_get_globals_fast()``. If it returns non-null, that + means there is already an instance of ``__cxa_eh_globals`` for the current thread. Use that + instance. Otherwise, this is the first time the current thread called ``__cxa_get_globals()`` so + we allocate a new instance of ``__cxa_eh_globals``, then use ``__libcpp_tls_set()`` to store it + so later invocations can retrieve it. Finally we return the new ``__cxa_eh_globals``. + +When ``_LIBCXXABI_HAS_NO_THREADS`` is defined, both ``__cxa_get_globals()`` and +``__cxa_get_globals_fast()`` call a third function that simply returns a pointer to a static local +instance of ``__cxa_eh_globals``. Without multiple threads, none of the above complexity is +required. + +When ``_LIBCXXABI_HAS_NO_THREADS`` is not defined, but ``HAS_THREAD_LOCAL`` is defined, we take a +very similar tactic, except instead of a static local copy of ``__cxa_eh_globals``, we use a static +``thread_local`` copy so that each thread gets a different instance. + + +Locales +------- +Concurrent access and modification of ``std::locale``\s is mostly a non-issue, with many locale +modification and I/O functions being naturally thread safe through the use of local variables. 
+There are a few places we use static local variables to ensure an initialization function is only +called once, such as in ``locale::__global()``, as well as a few places we use ``__shared_count`` +(which relies on atomics instead of thread synchronization), such as for the reference counting of +``locale::facet``\s. Other than that, there is very little that could require thread synchronization +or other thread-aware behaviour. + +There is one notable place we *do* need to perform more sophisticated thread management: +Every subclass of ``locale::facet`` must have a public static ``locale::id`` member that is used by +``std::locale`` to index facets. Specifically, each ``locale::id`` instance has a unique value for +``locale::id::__id_``, which is 1 greater than the index at which ``std::locale`` will store the +facet in the ``locale::__imp::facets_`` vector. For reasons described `here +`__, ``locale::id::__id_`` is zero-initialized +at first, and only truly initialized when the corresponding facet is first added to a locale (the +initialization occurs via the first call to ``locale::id::__get()``, which invokes +``locale::id::__init()``). + +This initialization must happen once for each ``locale::id``, but can occur on any thread. Since +there is more than one ``locale::id``, and each one needs to independently perform this +initialization, a static local variable is insufficient. Instead, to prevent a race condition +between multiple threads trying to initialize the same facet's id, libcxx uses ``std::call_once`` +with each ``locale::id`` having its own ``once_flag``, and to prevent a race condition on +the increment of ``locale::id::__next_id`` between threads trying to initialize different facets' +ids, it uses an atomic add. + + +Static local variables +---------------------- +Unlike C, in C++, static local variables (SLVs) are initialized the first time execution passes +through the declaration. 
As a result, we need to keep track of whether an SLV has been initialized +or not, and there is a potential race condition between threads attempting to perform the +initialization. + +To keep track of the initialization status of an SLV, we use a 4 or 8 byte 'guard object'. Reading +and writing to this guard object is the responsibility of ``__cxa_guard_acquire()``, +``__cxa_guard_release()`` and ``__cxa_guard_abort()``. + +The guard object consists of a 'guard byte', an 'init byte', 2 unused bytes, and sometimes a 4 byte +thread id. If the guard byte is non-zero, then initialization has been completed. Beyond that, if +the guard byte is zero, the rest of the initialization state is tracked using the init byte. While +the guard byte can be read and written to using simple atomic operations, the init byte requires +more careful thread management. + +Assuming neither ``_LIBCXXABI_HAS_NO_THREADS`` nor ``_LIBCXXABI_USE_FUTEX`` is defined, the current +implementation protects the init byte using a single condition-variable & mutex pair common to all +SLVs. Contrary to what the Itanium ABI suggests, we do not hold on to the mutex (or any other +mutex) after returning from ``__cxa_guard_acquire()``. Instead, we only hold onto the mutex while +we're reading from or writing to the init byte - once a thread has started the initialization +(``__cxa_guard_acquire()`` finished and returned 1), it no longer holds the mutex. Any other +threads whose execution arrives at the SLV's declaration while the initialization is ongoing will +acquire the mutex, then determine from the init byte that initialization has already been started +by another thread, and wait on the condition variable (temporarily relinquishing the mutex, but not +yet returning from ``__cxa_guard_acquire``). 
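The protocol described above can be sketched roughly as follows. This is a minimal illustration only, not the libc++abi code: the ``sketch`` namespace, the ``Guard`` struct, the state constants and the function names are all hypothetical, and for simplicity the guard byte is read under the mutex here rather than with atomic operations.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical names throughout; a simplified model, not the libc++abi source.
namespace sketch {

enum : std::uint8_t { kUnset = 0, kPending = 1 };

struct Guard {
    std::uint8_t guard_byte = 0;      // non-zero once initialization has completed
    std::uint8_t init_byte = kUnset;  // only touched while g_mutex is held
};

// One mutex / condition-variable pair shared by every guard object.
std::mutex g_mutex;
std::condition_variable g_cond;

// Returns 1 if the caller must run the initializer, 0 if it is already done.
int guard_acquire(Guard& g) {
    std::unique_lock<std::mutex> lock(g_mutex);
    // Sleep while some other thread is running the initializer.
    g_cond.wait(lock, [&] { return g.init_byte != kPending; });
    if (g.guard_byte != 0)
        return 0;
    g.init_byte = kPending;
    return 1;  // the mutex is released on return; it is not held during init
}

// Called after the initializer finished successfully.
void guard_release(Guard& g) {
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g.guard_byte = 1;
        g.init_byte = kUnset;
    }
    g_cond.notify_all();
}

// Called when the initializer threw; a later thread may retry.
void guard_abort(Guard& g) {
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g.init_byte = kUnset;
    }
    g_cond.notify_all();
}

}  // namespace sketch
```

Because the condition variable is shared, ``notify_all`` wakes waiters for every guard; each waiter rechecks its own init byte before proceeding.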
+ +This produces an effect similar to giving each SLV its own mutex that is acquired in +``__cxa_guard_acquire()`` and released in ``__cxa_guard_release()`` or ``__cxa_guard_abort()``, +like the Itanium ABI suggests. + + +Shared pointer overloads of ``std::atomic_...`` +----------------------------------------------- +Shared pointers are under most circumstances thread-safe, and can be used to simplify memory +management of resources used by multiple threads. Usually this is accomplished by each thread +having a shared pointer that is used to reference count the common resource. When the last shared +pointer is destroyed, the pointed-to object is also freed/destroyed. Assuming concurrent access or +modification of the pointed-to object is thread-safe, this common usage of shared pointers is also +thread safe and does not internally need to use any thread synchronization mechanisms (instead it +relies on atomic operations to modify the ref-count). + +However, even if concurrent access/modification of the pointed-to object is safe, multiple threads +directly accessing the *same* shared pointer is NOT thread safe (in contrast to the above scenario +where each thread had its own copy of the shared pointer). One solution the standard library +provides is a set of overloads for the standalone atomic operation functions such as +``std::atomic_store`` and ``std::atomic_load`` that work on (pointers to) shared pointers (`See +cppreference `__). + +Since ``shared_ptr``\s are usually too large to use builtin atomic operations with, these overloads +need to use some kind of locking to enforce atomicity of the operations. For this they map each +pointer-to-shared_ptr to one of 16 ``std::__sp_mut``\s, which they hold for the duration of the +atomic operation. Each ``__sp_mut`` is effectively a light wrapper around ``__libcpp_mutex_lock()``/ +``trylock()`` / ``unlock()``. 
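The idea of mapping each pointer-to-shared_ptr onto a small fixed pool of locks might look like the sketch below. It assumes ``std::mutex`` in place of ``__sp_mut`` and uses ``std::hash`` on the address to pick a slot; the names ``mutex_for``, ``locked_load`` and ``locked_store`` are hypothetical and the hashing scheme is not taken from libc++.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical helpers; std::mutex stands in for __sp_mut.
namespace sketch {

constexpr std::size_t kPoolSize = 16;

// Picks one of kPoolSize mutexes based on the address of the shared_ptr.
std::mutex& mutex_for(const void* p) {
    static std::mutex pool[kPoolSize];
    return pool[std::hash<const void*>()(p) % kPoolSize];
}

// Copies *p while holding the mutex assigned to that address.
template <class T>
std::shared_ptr<T> locked_load(const std::shared_ptr<T>* p) {
    std::lock_guard<std::mutex> lock(mutex_for(p));
    return *p;
}

// Replaces *p while holding the mutex assigned to that address.
template <class T>
void locked_store(std::shared_ptr<T>* p, std::shared_ptr<T> r) {
    std::lock_guard<std::mutex> lock(mutex_for(p));
    p->swap(r);  // the previous value now lives in r and is destroyed with it
}

}  // namespace sketch
```

Two unrelated shared pointers can land on the same slot and contend with each other; that is the cost of a fixed pool compared with a lock per object.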
+ +`P0718R2 `__ introduces another option for performing atomic operations +on ``shared_ptr``\s to C++20 with ``std::atomic<std::shared_ptr<T>>`` (`See cppreference +`__). This will still need some +form of lock, but unlike with the standalone ``std::atomic_...`` functions, it will be easy to use +a lock specific to that particular pointer-to-shared_ptr instead of sharing the locks. This may +make the existing techniques used by ``std::atomic`` that don't rely on threading primitives, +such as a rudimentary spin-lock, a more attractive implementation option. + + +LIBCPP DEBUG +------------ +With ``_LIBCPP_DEBUG == 1`` libcxx adds "additional assertions about the validity of iterators" +(See :ref:`Debug Mode `). It does this by keeping track of the sizes of +containers, the positions of iterators within those containers, etc. using ``std::__libcpp_db``. + +Due to the thread safety guarantees associated with iterators (`See cppreference +`__), there are many operations that +``__libcpp_db`` must pay attention to which can safely happen concurrently to iterators on the same +container. As such, every time an iterator or container ``__libcpp_db`` is tracking gets updated, it +needs to acquire a lock before reading from or writing to its internal state. + + +Random shuffle +-------------- +To avoid generating the same sequence every time the two-argument form of ``std::random_shuffle`` +is called, ``std::__rs_default`` needs to preserve some state after destruction. It does this by +using a random number engine with static storage duration (specifically a static local instance of +``std::mt19937`` in ``__rs_default::operator()()``). However, the Mersenne Twister engine is not +thread safe, so ``__rs_default`` needs to protect it using a mutex. 
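As a rough illustration of the pattern described for random shuffle, the sketch below keeps a static ``std::mt19937`` behind a mutex. The function name ``next_random`` is hypothetical, and the real ``__rs_default`` is structured differently (it is a class with ``operator()``); only the locking idea is shown.

```cpp
#include <cassert>
#include <mutex>
#include <random>

// Hypothetical helper; shows only the "static engine behind a mutex" idea.
namespace sketch {

std::mt19937::result_type next_random() {
    static std::mutex m;
    static std::mt19937 engine;  // state survives between calls
    // The engine is not thread safe, so serialize every use of it.
    std::lock_guard<std::mutex> lock(m);
    return engine();
}

}  // namespace sketch
```

The static locals themselves are initialized through the static local variable mechanism described earlier, so their construction is safe even if two threads make the first call concurrently.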
+ + +Fallback malloc +--------------- +In order to make sure exceptions still work when we can't allocate memory (important for exceptions +like ``std::bad_alloc``), we need a fallback option for constructing the exception and/or +``__cxa_eh_globals`` that doesn't require memory allocation. This is what +``__aligned_malloc_with_fallback`` and ``__calloc_with_fallback`` are for. They share a small 512 +byte array with static storage duration which is divided up and managed using a freelist embedded +within the array itself. When regular heap allocation fails, they fall back to 'allocating' space +from the array instead. However, since the array may be used concurrently by multiple threads, +access to and modification of it is protected using a mutex. + + +Implications when threading may not be available +================================================ +Not all platforms have a threading library like pthreads available, so libc++ cannot always rely on +the mutexes or functions like ``__libcpp_tls_get`` that the ``__threading_support`` header +provides. If the user uses the C++11 Thread Support Library without an available threading library, +it can safely be considered user error. While detecting it might be useful, the std Thread Support +Library doesn't need to work under those conditions, so it can ignore the possibility. That's not +true for any of the above seven situations - they need to work with or without thread support. + +In cases where we know at the time libcxx is compiled that there is no threading library available, +we address this by providing an alternative implementation when the preprocessor macro +``_LIBCXXABI_HAS_NO_THREADS`` and/or ``_LIBCPP_HAS_NO_THREADS`` is defined. 
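A compile-time alternative of this kind could be sketched as below. The macro ``SKETCH_HAS_NO_THREADS`` and every name in the ``sketch`` namespace are hypothetical stand-ins; the bump counter only loosely models carving space out of the fallback array and is not the freelist libc++abi uses.

```cpp
#include <cassert>
#if !defined(SKETCH_HAS_NO_THREADS)
#include <mutex>
#endif

// Hypothetical macro and names; illustrates selecting a no-op lock at build time.
namespace sketch {

#if defined(SKETCH_HAS_NO_THREADS)
// No threading library: locking becomes a no-op.
struct heap_mutex {
    void lock() {}
    void unlock() {}
};
#else
using heap_mutex = std::mutex;
#endif

// Minimal RAII wrapper that works with either definition of heap_mutex.
class scoped_lock {
public:
    explicit scoped_lock(heap_mutex& m) : m_(m) { m_.lock(); }
    ~scoped_lock() { m_.unlock(); }
    scoped_lock(const scoped_lock&) = delete;
    scoped_lock& operator=(const scoped_lock&) = delete;

private:
    heap_mutex& m_;
};

heap_mutex g_heap_mutex;
int g_used = 0;  // bytes handed out so far from an imaginary fallback buffer

// Reserves n bytes and returns the offset of the reserved region.
int reserve(int n) {
    scoped_lock lock(g_heap_mutex);
    int offset = g_used;
    g_used += n;
    return offset;
}

}  // namespace sketch
```

The calling code is identical in both configurations; only the definition of the lock type changes.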
+ + +z/OS, AIX and runtime-dependent threading availability +------------------------------------------------------ +On some platforms such as z/OS or AIX, availability of a threading library might not be known until +runtime, or worse, the threading library may become available in the middle of application +execution because the application requested it. In such cases, libcxx and libcxxabi need to +dynamically change their behaviour depending on the availability of a threading library. + +As discussed above, using classes like ``std::mutex`` or ``std::condition_variable`` without an +available threading library is user error, so they don't need to dynamically change their behaviour. +While they could be turned into no-ops or throw an error when no threading library is available, +for performance reasons libcxx doesn't bother to check and assumes one is available. + +Aside from exceptions_, our only concern in the remaining six situations described above is +preventing concurrent use of a resource which is not thread-safe: + +* For locales the thread-unsafe resource is the ``once_flag`` used in ``std::call_once`` (which is + ensuring we only increment ``locale::id::__next_id`` once per facet) +* For static local variables the thread-unsafe resource is the init byte +* For the shared pointer overloads of ``std::atomic_...`` the thread-unsafe resource is the + ``shared_ptr`` itself +* For LIBCPP DEBUG the thread-unsafe resource is ``__libcpp_db``'s internal state +* For random shuffle the thread-unsafe resource is the static local instance of ``std::mt19937`` +* For fallback malloc the thread-unsafe resource is the byte array global variable + +In all of these situations, the critical section(s) of code that use the resource which we're +trying to protect don't spawn threads, and don't invoke user code (while ``std::call_once`` can +invoke user code, that happens between two separate critical sections). 
Thus, if there is only one +thread in existence when we would normally need to acquire the mutex (as would happen if the +threading library is unavailable), there will only be one thread for the entirety of the critical +section, meaning that it is safe to skip acquisition of the mutex. We can use this to avoid relying +on a non-existent threading library. + +Exceptions pt. 2 +---------------- +The situation is more complex for exceptions_. Exceptions aren't using the thread library to +protect a thread-unsafe resource, but to provide a unique ``__cxa_eh_globals`` for each thread. +Since the ``__cxa_eh_globals`` instance associated with a given thread must remain consistent +throughout exception handling, we cannot as easily avoid using a potentially non-existent threading +library. + +When the threading library is unavailable, there's only one thread, so we associate it with a static +local instance of ``__cxa_eh_globals``, similar to what we do when ``_LIBCXXABI_HAS_NO_THREADS`` is +defined. When the threading library is available or becomes available there are a couple of reasonable +choices of implementation. + +The simplest is to use the current implementation with ``__libcpp_tls_get`` as is. If the platform +allows applications to load the thread library during execution, the ``__cxa_eh_globals`` instance +associated with the main thread would change after loading the thread library. With the +current implementation of cxa_exception, this is unlikely to cause issues unless the library is +loaded while exception handling is ongoing. However, cxa_exception was not written with this +behaviour in mind, so this could produce unexpected results and may become unsafe if +cxa_exception changes. + +Another option is to continue to associate the main thread with the static local instance of +``__cxa_eh_globals`` like we do when the threading library is not available, and only use the +``__libcpp_tls_get`` approach for other threads. 
This approach does require some method of +determining whether the current thread is the main thread; however, that can be accomplished in a +very platform agnostic manner that is outside the scope of this document. This approach may also +provide a very small performance benefit to single threaded applications as well as multi-threaded +applications where most exception handling occurs on the main thread. If implemented on all +platforms, adding support for platforms with runtime-dependent threading availability can be done +by checking whether the thread library is available when determining whether the current thread is +the main thread (if it's not available, we're the main thread).
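The second option could be sketched as follows. This is only an illustration under several assumptions: ``eh_globals`` is a hypothetical stand-in for ``__cxa_eh_globals``, the main thread is detected by comparing ``std::thread::id`` values captured during static initialization (the document leaves the detection method out of scope), and ``thread_local`` is swapped in for the ``__libcpp_tls_get`` machinery.

```cpp
#include <cassert>
#include <thread>

// Hypothetical names; thread_local replaces the __libcpp_tls_get machinery.
namespace sketch {

struct eh_globals {
    void* caught_exceptions = nullptr;
    unsigned uncaught_exceptions = 0;
};

// Assumption: static initialization of this variable runs on the main thread.
const std::thread::id g_main_thread = std::this_thread::get_id();

bool is_main_thread() {
    return std::this_thread::get_id() == g_main_thread;
}

eh_globals* get_globals() {
    if (is_main_thread()) {
        // The main thread always uses the same static instance, whether or
        // not a threading library is (or later becomes) available.
        static eh_globals main_globals;
        return &main_globals;
    }
    // Every other thread gets its own instance.
    thread_local eh_globals per_thread;
    return &per_thread;
}

}  // namespace sketch
```

On a platform with runtime-dependent threading availability, ``is_main_thread`` would first check whether the thread library is loaded and return true if it is not, so the main thread's instance never changes when the library becomes available.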