diff --git a/libcxx/docs/DesignDocs/InternalThreadSynchronization.rst b/libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
new file mode 100644
--- /dev/null
+++ b/libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
@@ -0,0 +1,270 @@
+===============================
+Internal Thread Synchronization
+===============================
+
+.. contents::
+   :local:
+
+Overview
+========
+
+Several parts of C++ and the C++ standard library unrelated to the `C++11 Thread Support Library
+<https://en.cppreference.com/w/cpp/thread>`__ are required to operate in a thread-aware manner.
+In particular,
+
+* exceptions are thrown and caught on a per-thread basis.
+* ``std::locale`` must allow concurrent access and modification.
+* `"If multiple threads attempt to initialize the same static local variable concurrently, the
+  initialization occurs exactly once"
+  <https://en.cppreference.com/w/cpp/language/storage_duration#Static_local_variables>`__
+* the `shared pointer overloads <https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic>`__ for
+  ``std::atomic_...`` cannot be easily implemented without using locks. (``std::atomic_...``
+  is part of the `Atomic Operations Library <https://en.cppreference.com/w/cpp/atomic>`__ rather
+  than the C++11 Thread Support Library)
+
+In addition, libc++ and libc++abi have some pieces of their own that need to be thread-aware:
+
+* ``std::__libcpp_db``, since containers and iterators may be used in a concurrent fashion.
+  (See :ref:`Debug Mode <using-debug-mode>` for more information)
+* ``std::__rs_default``, because of its use of a random number engine with static storage duration.
+* ``__aligned_malloc_with_fallback``, ``__calloc_with_fallback`` and their ``free_with_fallback``
+  counterparts.
+
+
+.. _exceptions:
+
+Exceptions
+----------
+To manage the caught and uncaught exception stacks for a thread, libc++abi uses a
+``__cxa_eh_globals`` structure. A pointer to the ``__cxa_eh_globals`` instance associated with the
+current thread is obtained using ``__cxa_get_globals()`` and/or ``__cxa_get_globals_fast()``.
+
+Assuming neither ``_LIBCXXABI_HAS_NO_THREADS`` nor ``HAS_THREAD_LOCAL`` is defined, the current
+implementation for these functions is:
+
+* The first time ANY thread calls ``__cxa_get_globals_fast()``, we initialize a ``__libcpp_tls_key``
+  using ``__libcpp_execute_once``
+* Then, ``__cxa_get_globals_fast()`` will use ``__libcpp_tls_get()`` to retrieve a pointer to any
+  potential previously allocated ``__cxa_eh_globals`` associated with the current thread. If we
+  haven't allocated any before, this will return ``NULL``.
+* ``__cxa_get_globals()`` will invoke ``__cxa_get_globals_fast()``. If it returns non-null, that
+  means there is already an instance of ``__cxa_eh_globals`` for the current thread. Use that
+  instance. Otherwise, this is the first time the current thread called ``__cxa_get_globals()`` so
+  we allocate a new instance of ``__cxa_eh_globals``, then use ``__libcpp_tls_set()`` to store it
+  so later invocations can retrieve it. Finally we return the new ``__cxa_eh_globals``.
+
+When ``_LIBCXXABI_HAS_NO_THREADS`` is defined, both ``__cxa_get_globals()`` and
+``__cxa_get_globals_fast()`` call a third function that simply returns a pointer to a static local
+instance of ``__cxa_eh_globals``. Without multiple threads, none of the above complexity is
+required.
+
+When ``_LIBCXXABI_HAS_NO_THREADS`` is not defined, but ``HAS_THREAD_LOCAL`` is defined, we take a
+very similar tactic, except instead of a static local copy of ``__cxa_eh_globals``, we use a static
+``thread_local`` copy so that each thread gets a different instance.
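+
+The two simpler variants above can be sketched as follows. This is only an illustration: the
+names ``eh_globals``, ``get_globals_no_threads`` and ``get_globals_thread_local`` are
+hypothetical, and the structure is a simplified stand-in for ``__cxa_eh_globals``.
+
+.. code-block:: cpp
+
+   // Simplified stand-in; the real structure stores a __cxa_exception pointer.
+   struct eh_globals {
+       void* caughtExceptions;
+       unsigned int uncaughtExceptions;
+   };
+
+   // No-threads variant: a single instance shared by the whole program.
+   eh_globals* get_globals_no_threads() {
+       static eh_globals globals = {};
+       return &globals;
+   }
+
+   // thread_local variant: every thread that calls this gets its own instance.
+   eh_globals* get_globals_thread_local() {
+       static thread_local eh_globals globals = {};
+       return &globals;
+   }
+
+In both variants, repeated calls from the same thread return the same pointer, which is the
+property exception handling depends on.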
+
+
+Locales
+-------
+Concurrent access and modification of ``std::locale``\s is mostly a non-issue, with many locale
+modification and I/O functions being naturally thread safe through the use of local variables.
+There are a few places we use static local variables to ensure an initialization function is only
+called once, such as in ``locale::__global()``, as well as a few places we use ``__shared_count``
+(which relies on atomics instead of thread synchronization), such as for the reference counting of
+``locale::facet``\s. Other than that, there is very little that could require thread synchronization
+or other thread-aware behaviour.
+
+There is one notable place we *do* need to perform more sophisticated thread management:
+Every subclass of ``locale::facet`` must have a public static ``locale::id`` member that is used by
+``std::locale`` to index facets. Specifically, each ``locale::id`` instance has a unique value for
+``locale::id::__id_``, which is 1 greater than the index at which ``std::locale`` will store the
+facet in the ``locale::__imp::facets_`` vector. For reasons described `here
+<https://en.cppreference.com/w/cpp/locale/locale/id/id>`__, ``locale::id::__id_`` is zero-initialized
+at first, and only truly initialized when the corresponding facet is first added to a locale (the
+initialization occurs via the first call to ``locale::id::__get()``, which invokes
+``locale::id::__init()``).
+
+This initialization must happen once for each ``locale::id``, but can occur on any thread. Since
+there is more than one ``locale::id``, and each one needs to independently perform this
+initialization, a static local variable is insufficient. Instead, to prevent a race condition
+between multiple threads trying to initialize the same facet's id, libcxx uses ``std::call_once``
+with each ``locale::id`` having its own ``once_flag``, and to prevent a race condition on
+the increment of ``locale::id::__next_id`` between threads trying to initialize different facets'
+ids, it uses an atomic add.
+
+
+Static local variables
+----------------------
+Unlike C, in C++, static local variables (SLVs) are initialized the first time execution passes
+through the declaration. As a result, we need to keep track of whether an SLV has been initialized
+or not, and there is a potential race condition between threads attempting to perform the
+initialization.
+
+To keep track of the initialization status of an SLV, we use a 4 or 8 byte 'guard object'. Reading
+and writing to this guard object is the responsibility of ``__cxa_guard_acquire()``,
+``__cxa_guard_release()`` and ``__cxa_guard_abort()``.
+
+The guard object consists of a 'guard byte', an 'init byte', 2 unused bytes, and sometimes a 4 byte
+thread id. If the guard byte is non-zero, then initialization has been completed. Beyond that, if
+the guard byte is zero, the rest of the initialization state is tracked using the init byte. While
+the guard byte can be read and written to using simple atomic operations, the init byte requires
+more careful thread management.
+
+Assuming neither ``_LIBCXXABI_HAS_NO_THREADS`` nor ``_LIBCXXABI_USE_FUTEX`` is defined, the current
+implementation protects the init byte using a single condition-variable & mutex pair common to all
+SLVs. Contrary to what the Itanium ABI suggests, we do not hold on to the mutex (or any other
+mutex) after returning from ``__cxa_guard_acquire()``. Instead, we only hold the mutex while
+we're reading from or writing to the init byte. Once a thread has started the initialization
+(``__cxa_guard_acquire()`` finished and returned 1), it no longer holds the mutex. Any other
+threads whose execution arrives at the SLV's declaration while the initialization is ongoing will
+acquire the mutex, then determine from the init byte that initialization has already been started
+by another thread, and wait on the condition variable (temporarily relinquishing the mutex, but not
+yet returning from ``__cxa_guard_acquire``).
+
+This has much the same effect as giving each SLV its own mutex that is acquired in
+``__cxa_guard_acquire()`` and released in ``__cxa_guard_release()`` or ``__cxa_guard_abort()``,
+as the Itanium ABI suggests.
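+
+As a rough illustration of the scheme described in this section, the sketch below implements the
+same protocol with standard library types. The names ``Guard``, ``guard_acquire``,
+``guard_release`` and ``guard_abort`` are hypothetical and the guard layout is simplified, so
+this should not be read as the actual libc++abi code.
+
+.. code-block:: cpp
+
+   #include <atomic>
+   #include <condition_variable>
+   #include <cstdint>
+   #include <mutex>
+
+   // Hypothetical, simplified guard object.
+   struct Guard {
+       std::atomic<std::uint8_t> guard_byte{0};  // non-zero once initialization is complete
+       std::uint8_t init_byte = 0;               // 1 while some thread is initializing
+   };
+
+   // A single mutex and condition variable shared by every guard.
+   static std::mutex g_mutex;
+   static std::condition_variable g_cond;
+
+   // Returns 1 if the caller must run the initializer, 0 if it has already completed.
+   int guard_acquire(Guard& g) {
+       if (g.guard_byte.load(std::memory_order_acquire) != 0)
+           return 0;                              // fast path, no lock needed
+       std::unique_lock<std::mutex> lock(g_mutex);
+       while (g.init_byte == 1)                   // another thread is initializing
+           g_cond.wait(lock);                     // gives up the mutex while waiting
+       if (g.guard_byte.load(std::memory_order_acquire) != 0)
+           return 0;                              // the other thread finished
+       g.init_byte = 1;                           // claim the initialization
+       return 1;                                  // the mutex is released on return
+   }
+
+   // Called after the initializer completed successfully.
+   void guard_release(Guard& g) {
+       {
+           std::lock_guard<std::mutex> lock(g_mutex);
+           g.guard_byte.store(1, std::memory_order_release);
+           g.init_byte = 0;
+       }
+       g_cond.notify_all();
+   }
+
+   // Called when the initializer threw; a waiting thread may retry.
+   void guard_abort(Guard& g) {
+       {
+           std::lock_guard<std::mutex> lock(g_mutex);
+           g.init_byte = 0;
+       }
+       g_cond.notify_all();
+   }
+
+Note that the initializer itself runs with no mutex held; only the reads and writes of
+``init_byte`` are protected.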
+
+
+Shared pointer overloads of ``std::atomic_...``
+-----------------------------------------------
+Shared pointers are under most circumstances thread-safe, and can be used to simplify memory
+management of resources used by multiple threads. Usually this is accomplished by each thread
+having a shared pointer that is used to reference count the common resource. When the last shared
+pointer is destroyed, the pointed-to object is also freed/destroyed. Assuming concurrent access or
+modification of the pointed-to object is thread-safe, this common usage of shared pointers is also
+thread safe and does not internally need to use any thread synchronization mechanisms (instead it
+relies on atomic operations to modify the ref-count).
+
+However, even if concurrent access/modification of the pointed-to object is safe, multiple threads
+directly accessing the *same* shared pointer is NOT thread safe (in contrast to the above scenario
+where each thread had its own copy of the shared pointer). One solution the standard library
+provides is a set of overloads for the standalone atomic operation functions such as
+``std::atomic_store`` and ``std::atomic_load`` that work on (pointers to) shared pointers (`See
+cppreference <https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic>`__).
+
+Since ``shared_ptr``\s are usually too large to use builtin atomic operations with, these overloads
+need to use some kind of locking to enforce atomicity of the operations. For this they map each
+pointer-to-shared_ptr to one of 16 ``std::__sp_mut``\s, which they hold for the duration of the
+atomic operation. Each ``__sp_mut`` is effectively a light wrapper around ``__libcpp_mutex_lock()``/
+``trylock()`` / ``unlock()``.
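+
+The idea of mapping an address to one of a fixed number of mutexes can be sketched as follows.
+The names ``mutex_index`` and ``with_lock`` are hypothetical, and the use of ``std::hash`` to
+pick a slot is an assumption made for this example rather than a description of what libc++
+does.
+
+.. code-block:: cpp
+
+   #include <cstddef>
+   #include <functional>
+   #include <mutex>
+
+   static const std::size_t kMutexCount = 16;  // must be a power of two for the mask below
+   static std::mutex g_mutexes[kMutexCount];
+
+   // Map an address to a slot in [0, kMutexCount).
+   std::size_t mutex_index(const void* p) {
+       return std::hash<const void*>()(p) & (kMutexCount - 1);
+   }
+
+   // Run op() while holding the mutex associated with the address p.
+   template <class Op>
+   auto with_lock(const void* p, Op op) -> decltype(op()) {
+       std::lock_guard<std::mutex> lock(g_mutexes[mutex_index(p)]);
+       return op();
+   }
+
+Two unrelated addresses may map to the same slot. That costs some concurrency but is still
+correct, because the mutex is held only for the duration of a single operation.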
+
+`P0718R2 <https://wg21.link/P0718R2>`__ introduces another option for performing atomic operations
+on ``shared_ptr``\s to C++20 with ``std::atomic<std::shared_ptr<T>>`` (`See cppreference
+<https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2>`__). This will still need some
+form of lock, but unlike with the standalone ``std::atomic_...`` functions, it will be easy to use
+a lock specific to that particular pointer-to-shared_ptr instead of sharing the locks. This may
+make the existing techniques used by ``std::atomic<T>`` that don't rely on threading primitives,
+such as a rudimentary spin-lock, a more attractive implementation option.
+
+
+LIBCPP DEBUG
+------------
+With ``_LIBCPP_DEBUG == 1`` libcxx adds "additional assertions about the validity of iterators"
+(See :ref:`Debug Mode <using-debug-mode>`). It does this by keeping track of the sizes of
+containers, the positions of iterators within those containers, etc. using ``std::__libcpp_db``.
+
+Due to the thread safety guarantees associated with iterators (`See cppreference
+<https://en.cppreference.com/w/cpp/container#Thread_safety>`__), there are many operations that
+``__libcpp_db`` must pay attention to which can safely happen concurrently to iterators on the same
+container. As such, every time an iterator or container ``__libcpp_db`` is tracking gets updated, it
+needs to acquire a lock before reading from or writing to its internal state.
+
+
+Random shuffle
+--------------
+To avoid generating the same sequence every time the two-argument form of ``std::random_shuffle``
+is called, ``std::__rs_default`` needs to preserve some state after destruction. It does this by
+using a random number engine with static storage duration (specifically a static local instance of
+``std::mt19937`` in ``__rs_default::operator()()``). However, the Mersenne Twister engine is not
+thread safe, so ``__rs_default`` needs to protect it using a mutex.
+
+
+Fallback malloc
+---------------
+In order to make sure exceptions still work when we can't allocate memory (important for exceptions
+like ``std::bad_alloc``), we need a fallback option for constructing the exception and/or
+``__cxa_eh_globals`` that doesn't require memory allocation. This is what
+``__aligned_malloc_with_fallback`` and ``__calloc_with_fallback`` are for. They share a small 512
+byte array with static storage duration which is divided up and managed using a freelist embedded
+within the array itself. When regular heap allocation fails, they fall back to 'allocating' space
+from the array instead. However, since the array may be used concurrently by multiple threads,
+access to and modification of it is protected using a mutex.
+
+
+Implications when threading may not be available
+================================================
+Not all platforms have a threading library like pthreads available, so libc++ cannot always rely on
+the mutexes or functions like ``__libcpp_tls_get`` that the ``__threading_support`` header
+provides. If the user uses the C++11 Thread Support Library without an available threading library,
+it can safely be considered user error. While detecting it might be useful, the std Thread Support
+Library doesn't need to work under those conditions, so it can ignore the possibility. That's not
+true for any of the above seven situations - they need to work with or without thread support.
+
+In cases where we know at the time libcxx is compiled that there is no threading library available,
+we address this by providing an alternative implementation when the preprocessor macro
+``_LIBCXXABI_HAS_NO_THREADS`` and/or ``_LIBCPP_HAS_NO_THREADS`` is defined.
+
+
+z/OS, AIX and runtime-dependent threading availability
+------------------------------------------------------
+On some platforms such as z/OS or AIX, availability of a threading library might not be known until
+runtime, or worse, the threading library may become available in the middle of application
+execution because the application requested it. In such cases, libcxx and libcxxabi need to
+dynamically change their behaviour depending on the availability of a threading library.
+
+As discussed above, using classes like ``std::mutex`` or ``std::condition_variable`` without an
+available threading library is user error, so they don't need to dynamically change their behaviour.
+While they could be turned into no-ops or throw an error when no threading library is available,
+for performance reasons libcxx doesn't bother to check and assumes one is available.
+
+Aside from exceptions_, our only concern in the remaining six situations described above is
+preventing concurrent use of a resource which is not thread-safe:
+
+* For locales the thread-unsafe resource is the ``once_flag`` used in ``std::call_once`` (which is
+  ensuring we only increment ``locale::id::__next_id`` once per facet)
+* For static local variables the thread-unsafe resource is the init byte
+* For the shared pointer overloads of ``std::atomic_...`` the thread-unsafe resource is the
+  ``shared_ptr`` itself
+* For LIBCPP DEBUG the thread-unsafe resource is ``__libcpp_db``'s internal state
+* For random shuffle the thread-unsafe resource is the static local instance of ``std::mt19937``
+* For fallback malloc the thread-unsafe resource is the byte array global variable
+
+In all of these situations, the critical section(s) of code that use the resource which we're
+trying to protect don't spawn threads, and don't invoke user code (while ``std::call_once`` can
+invoke user code, that happens between two separate critical sections). Thus, if there is only one
+thread in existence when we would normally need to acquire the mutex (as would happen if the
+threading library is unavailable), there will only be one thread for the entirety of the critical
+section, meaning that it is safe to skip acquisition of the mutex. We can use this to avoid relying
+on a non-existent threading library.
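+
+One way to express this is a lock guard that decides once, on entry to the critical section,
+whether to touch the mutex at all. The sketch below is only an illustration:
+``threading_available`` and ``MaybeLockGuard`` are hypothetical names, and a real port would
+replace the flag with a platform-specific query.
+
+.. code-block:: cpp
+
+   #include <mutex>
+
+   // Hypothetical stand-in for a platform query; a real port would ask the system.
+   static bool g_threads_available = false;
+   bool threading_available() { return g_threads_available; }
+
+   // Locks the mutex only if a threading library is present when the guard is created.
+   class MaybeLockGuard {
+   public:
+       explicit MaybeLockGuard(std::mutex& m)
+           : mutex_(threading_available() ? &m : nullptr) {
+           if (mutex_) mutex_->lock();
+       }
+       ~MaybeLockGuard() {
+           if (mutex_) mutex_->unlock();
+       }
+       MaybeLockGuard(const MaybeLockGuard&) = delete;
+       MaybeLockGuard& operator=(const MaybeLockGuard&) = delete;
+
+       bool owns_lock() const { return mutex_ != nullptr; }
+
+   private:
+       std::mutex* mutex_;  // null when the lock was skipped
+   };
+
+Because the decision is recorded in the guard, the destructor never unlocks a mutex that was not
+locked, even if the answer from ``threading_available`` changes later.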
+
+Exceptions pt. 2
+----------------
+The situation is more complex for exceptions_. Exceptions aren't using the thread library to
+protect a thread-unsafe resource, but to provide a unique ``__cxa_eh_globals`` for each thread.
+Since the ``__cxa_eh_globals`` instance associated with a given thread must remain consistent
+throughout exception handling, we cannot as easily avoid using a potentially non-existent threading
+library.
+
+When the threading library is unavailable, there's only 1 thread, so we associate it with a static
+local instance of ``__cxa_eh_globals``, similar to what we do when ``_LIBCXXABI_HAS_NO_THREADS`` is
+defined. When the threading library is available or becomes available there are a couple reasonable
+choices of implementation.
+
+The simplest is to use the current implementation with ``__libcpp_tls_get`` as is. If the platform
+allows applications to load the thread library during execution, the ``__cxa_eh_globals`` instance
+associated with the main thread would change after loading the thread library. With the
+current implementation of cxa_exception, this is unlikely to cause issues unless the library is
+loaded while exception handling is ongoing. However, cxa_exception was not written with this
+behaviour in mind, so this could produce unexpected results and may become unsafe if
+cxa_exception changes.
+
+Another option is to continue to associate the main thread with the static local instance of
+``__cxa_eh_globals`` like we do when the threading library is not available, and only use the
+``__libcpp_tls_get`` approach for other threads. This approach does require some method of
+determining whether the current thread is the main thread, however that can be accomplished in a
+very platform agnostic manner that is outside the scope of this document. This approach may also
+provide a very small performance benefit to single threaded applications as well as multi-threaded
+applications where most exception handling occurs on the main thread. If implemented on all
+platforms, support for runtime-dependent threading availability can be added by checking whether the
+thread library is available when determining whether the current thread is the main thread (if it's
+not available, we're the main thread).