This is an archive of the discontinued LLVM Phabricator instance.

[libcxx][libcxxabi][docs] Document the various places we use threading in libcxx and libcxxabi
Needs RevisionPublic

Authored by DanielMcIntosh-IBM on Jan 14 2022, 4:10 PM.

Details

Reviewers
ldionne
Mordante
Quuxplusone
Group Reviewers
Restricted Project
Summary

libcxx and libcxxabi use threading in some unexpected places. This documents all
those places, including how we use it and why it's needed.

Hopefully this will also help to clarify why z/OS needs to make changes to each
of these locations in order to support POSIX(OFF). See the discussion on D110349
for more context. However, most of the information in this document is not
specific to z/OS or AIX, and should be useful to anyone working on any platform.

Diff Detail

Event Timeline

DanielMcIntosh-IBM requested review of this revision.Jan 14 2022, 4:10 PM
DanielMcIntosh-IBM created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJan 14 2022, 4:10 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript

Fix "Duplicate explicit target name" error.

Add InternalThreadSynchronization to libcxx/docs/index.rst

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
17–19

Thread-safe static initialization is the usual term for this.

20–23

Also struct Big { int a[100]; }; std::atomic<Big>, right? My ill-informed impression is that atomic<shared_ptr> requires DCAS (double-wide compare-and-swap) in hardware, but atomic<Big> requires arbitrarily wide atomics that don't exist anywhere.

std::atomic<std::shared_ptr<T>> is standard C++20. std::atomic<Big> is standard C++11.

42
56–63

I think it would make sense to explain the basic functionality in terms of thread_local first. Then the no-threads version is just "that, but replace thread_local with static"; and the no-thread-local version is "that, but simulate thread_local by hand in the following way...". (Assuming I got that right, of course. My eyes glazed over at the first paragraph. I'm hypothesizing that it will become more comprehensible once the basic functionality is separated from the mechanics of thread_local-simulation.)

80–84

intialized
"For reasons described [on a different website not run by us]" is a risky game. ;) The usual term for what they're describing is "Static Initialization Order Fiasco": we need (what object?)->__id_ to get initialized before the first time it's used, but the first time it's used might be inside the constructor of std::cout, which might run before the constructor of (what object?) because static objects' initialization order is not generally controllable.
We actually do control initialization order in a couple of places in the standard library, via __attribute__((init_priority(SOME-HIGH-NUMBER))) a.k.a. _LIBCPP_INIT_PRIORITY_MAX; do we avoid the attribute here only for historical reasons?

90
97

excecution, responsability, gaurd, aquired
I'd strongly prefer just spelling out static local [variable] on each reference, instead of introducing an idiosyncratic acronym SLV.
"Unlike C, C++ permits dynamic initialization of static local variables. Dynamic initializers for static locals run the first time execution passes through the declaration. So C++ needs to keep track of whether a static local has already been initialized, and avoid races if multiple threads attempt to perform the initialization at once. The Itanium C++ ABI defines three runtime functions to help with this: each thread should call __cxa_guard_acquire before attempting initialization, and either __cxa_guard_abort or __cxa_guard_release after initialization fails (via an exception) or succeeds, respectively."
And then we can get into how libc++abi implements those three functions in detail.

106

"sometimes"? It would be useful to say when and why.

113–117

Contrary to what the Itanium ABI suggests, we do not hold on to the mutex (or any other mutex) after returning from __cxa_guard_acquire().

I think this is a narrower-than-useful definition of "mutex." When we return from __cxa_guard_acquire we're definitely "holding" something, which we must remember to release later, and which (as long as we hold it) will block any other thread from successfully completing a __cxa_guard_acquire on the same thing; instead those threads will go to sleep until the thing is available again. We could implement this thing as a std::mutex, but we don't (because std::mutex is large?) so instead we implement it as ____.

You don't describe how the three primitives are implemented when _LIBCXXABI_USE_FUTEX is defined.

120

IIUC: There is only one std::condition_variable, global to the whole program. Any thread which is waiting to attempt initialization of any static local (waiting its turn to climb the glass hill) will be blocked on that std::condition_variable. When anyone succeeds at their initialization attempt, they notify_all on the condition variable (because they want to unblock all the waiters associated with their own static local). When anyone fails at their initialization attempt, they notify_all on the condition variable (because they want to unblock just one waiter associated with their own static local, but the condition variable's waiter list might be a mix of threads associated with many different static locals so they don't know who to wake).
This is all gotten from your description; I didn't try to correlate it with the actual code. Did I get the right impression?

128

Throughout: If you mean shared_ptr, please say shared_ptr. "Shared pointer" means a pointer (T*) that is shared.

151–157

I don't particularly understand this section. My best guess is that you're speculating that std::atomic<shared_ptr<T>> might hold a data member of type std::mutex; but I expect/hope that we'd never do that, because it's (not standard-mandated but) really important for QoI that sizeof(std::atomic<T>) == sizeof(T), and I don't see any reason for shared_ptr to be the unique exception to that rule. It would also just be super confusing if atomic_foo(shared_ptr*) and atomic<shared_ptr> used two different synchronization mechanisms.

175–179

I'd say:

std::random_shuffle depends on a global pseudorandom number generator (PRNG), similar to std::rand(). Calls to the global PRNG must be synchronized to avoid data races. The dylib provides a function named __rs_get() which returns an object of type __rs_default. __rs_default::operator() calls the operator() of a static local std::mt19937, which is protected by the global mutex __rs_mut. __rs_default's constructor locks __rs_mut; __rs_default's destructor unlocks __rs_mut. Note that this means libc++'s std::random_shuffle is not reentrant.

250–251

This sounds like a "different people do different things" cop-out. Are you trying to describe what we actually do (if so, you need only describe the actual implementation, not the choices we didn't make)? Or is your target audience an implementor of a new platform, where you think they should do something different (if so, you need only describe the thing you think they should do)?

257–259

This sounds very much like "However, cxa_exception was not written with this behaviour in mind, so don't do that." — and then this whole paragraph can be removed!

DanielMcIntosh-IBM marked 4 inline comments as done.

Address some of @Quuxplusone's comments

DanielMcIntosh-IBM added a comment.EditedJan 19 2022, 9:03 AM

@Quuxplusone I've addressed the comments you made that were easy to address. The rest will take me some time to get to as I have other items I need to work on at the moment.

However, even without addressing them all, I think this still provides valuable background for the POSIX(OFF) work in D110349 and/or D117375.

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
20–23

First, this is about the standalone atomic_load(const shared_ptr<_Tp>* __p)/atomic_store(shared_ptr<_Tp>* __p, shared_ptr<_Tp> __r)/etc., which is standard C++11 and already implemented in libcxx, unlike atomic<shared_ptr>.

Second, yes, implementing atomic<Big> directly using builtin atomics would require arbitrarily wide atomics. They kind of exist through the __c11_atomic builtins and/or the gcc __atomic builtins. Without these, atomic<Big> can still be implemented indirectly using spin-locks or regular locks.

Technically, the standalone shared pointer overloads could also use spin-locks, but at present they don't. I assumed this is because unlike std::atomic<Big> and std::atomic<std::shared_ptr<T>> they need to share the mutex(es) or spin-lock(s) across all shared-pointers, whereas atomic<Big> and atomic<shared_ptr> can put a bool or mutex inside the atomic object, making it unique to a specific shared_ptr or Big. However, looking at the description of rGd77851e8374c5a48de6e7694196b714abd673d84, it seems it was just because of a "pathological distrust of spin locks".

106

The thread-id is only used to detect recursive initialization (in a multi-threaded environment), and not super relevant to this discussion/topic, but I can include a short blurb on it if you think that would be appropriate.

113–117

That's a fair point, but I think that in the context of this discussion, where the focus is on when and where we use the base threading support library (e.g. pthreads), it makes sense to limit the definition of a "mutex" to objects which we acquire/release using said threading support library (i.e. __libcpp_mutex_t and/or __libcpp_recursive_mutex_t). If we use a more broad definition of mutex/lock, it becomes a lot less clear whether an operation would require using the thread support library. I've excluded primitive spin-locks (which can be implemented using atomic operations) from the definition of a lock/mutex as well for similar reasons. If you think it's necessary, I can include a brief discussion of this choice at the beginning of the document, but I think it's pretty clear as-is.

I glossed over what happens when _LIBCXXABI_USE_FUTEX is defined because it doesn't appear to be in use at the moment, and doesn't rely on __threading_support, but I'll add a small blurb about it, and explain that it doesn't rely on the thread support library.

120

Yes, that is correct.

151–157

When I wrote this, I was looking at __cxx_atomic_lock_impl, which does not have sizeof(std::atomic<T>) == sizeof(T), and uses a spin-lock as I described and suggested std::atomic<std::shared_ptr<T>> might also do.

Looking at it again more carefully, I see that __cxx_atomic_lock_impl is only used when _LIBCPP_ATOMIC_ONLY_USE_BUILTINS is defined (right now this only happens for freestanding implementations).

It seems I have also misunderstood the reason these overloads don't use a lock-free implementation (it has more to do with the shared_ptr's non-trivial constructors than anything else). I'll update this whole section with a new description when I get the chance.

250–251

As I discuss below, I wanted to document both the current implementation, which doesn't support loading the thread library in the middle of exception handling, as well as how support for that could be added. You're right though that this wording isn't great, and I'll update it when I get the chance.

257–259

Unfortunately, this is also probably the best and least disruptive option at the current time.

The only other option I can think of is described in the next paragraph, and to implement that well would require a lot more work (which I gloss over completely with the statement "This approach does require some method of determining whether the current thread is the main thread, however that can be accomplished in a very platform agnostic manner that is outside the scope of this document."). In the long term, I personally think this second approach will be better for us, because it allows the threading library to be loaded in the middle of exception handling, but it will be much more disruptive to libc++ and the broader llvm community. Given the push-back I've already received from Louis in D110349, I suspect that unless there is a demonstrable performance benefit to it on other platforms, I will have a hard time getting it approved.

I don't have the resources to invest in implementing and performance testing the second option on the off chance it is better and I can convince the llvm community to switch. (Especially since there are some messy situations around the user spawning threads without using std::thread - in which case what we decide is the 'main' thread might not actually be the main thread. This turns out to be a non-issue, but it gets a little messy and would require extensive comments and documentation). I did however still want to document this as an option since the AIX team (for whom this is most relevant) have yet to look at this in detail.

ldionne requested changes to this revision.Jan 24 2022, 8:08 AM

There's a lot of quite useful information in there. However, it's not clear to me that a "DesignDoc" is the right place to put it. I'm truly scared that this is going to get out of sync with the actual code in no time at all, especially the parts that explain the current implementation of <something>. Instead, would it make sense to improve the in-code documentation of the various parts you describe by adding those descriptions as high-level comments in the code itself, close to the implementation? Then, this design doc can stay higher level and not risk getting out of sync.

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
39
82
96

This is usually called a function-local static I think? Or is Static Local Variable the official naming and I'm using ad-hoc terminology? (that's quite possible)

This revision now requires changes to proceed.Jan 24 2022, 8:08 AM

There's a lot of quite useful information in there. However, it's not clear to me that a "DesignDoc" is the right place to put it. I'm truly scared that this is going to get out of sync with the actual code in no time at all, especially the parts that explain the current implementation of <something>. Instead, would it make sense to improve the in-code documentation of the various parts you describe by adding those descriptions as high-level comments in the code itself, close to the implementation? Then, this design doc can stay higher level and not risk getting out of sync.

That's a good point. When I circle back around to this, I'll move the description of the implementation out of the design doc and into the code where possible, but I worry that some parts of this will be very difficult to explain in a clear and concise manner without getting a little bit into the actual implementation. We'll see how it goes when I actually try moving things.

libcxx/docs/DesignDocs/InternalThreadSynchronization.rst
96

There doesn't seem to be a whole lot of consistency in terminology as far as I can tell. cppreference refers to them as "Static Local Variables", the standard seems to have settled on "block variable with static storage duration" (e.g. basic.stc.static), but has also used "local static variables" on occasion in the past (e.g. basic.stc.static in the C++14 draft).