This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] switch locale from using std::call_once to __libcpp_mutex_t
Needs Revision · Public

Authored by DanielMcIntosh-IBM on Jan 14 2022, 5:01 PM.

Details

Reviewers
ldionne
Mordante
Quuxplusone
EricWF
Group Reviewers
Restricted Project
Summary

This is the 2nd of 4 changes to add support for POSIX(OFF) on z/OS.
See D117366 for more background, and D110349 for discussion of an
alternative implementation.

Depends on D117366

Diff Detail

Event Timeline

DanielMcIntosh-IBM requested review of this revision. Jan 14 2022, 5:01 PM
DanielMcIntosh-IBM created this revision.
Herald added a project: Restricted Project. · View Herald Transcript · Jan 14 2022, 5:01 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
DanielMcIntosh-IBM edited the summary of this revision. (Show Details) Jan 14 2022, 5:17 PM
libcxx/include/__locale
206–207

Kind of ugly that we still need to #include <mutex> just for this. Maybe I should change this to

#if defined(_LIBCPP_ABI_MICROSOFT)
   uintptr_t __reserved;
#else
   unsigned long __reserved;
#endif

And add a static_assert(sizeof(locale::id::__reserved) == sizeof(once_flag)) somewhere. Not clear where I'd add it, though.
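One option (just a sketch, not something in this diff): put the check in a locale::id member that is defined in libcxx/src/locale.cpp, since a member function can name the private field and the source file can include <mutex> purely for this compile-time check. Roughly:

// Hypothetical placement; __get stands in for whichever locale::id member
// ends up defined in the .cpp file. sizeof on a non-static member is fine
// here because it's an unevaluated context.
long locale::id::__get() {
    static_assert(sizeof(__reserved) == sizeof(std::once_flag),
                  "__reserved must be able to hold the once_flag it replaces");
    // ... existing body ...
}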

215–216

I think technically this removes a symbol from the ABI, but it's one that really shouldn't ever get used by anybody.
Does this mean I need to update something in libcxx/lib?

libcxx/src/locale.cpp
716–732

I'm not 100% certain the atomic load/store is necessary or that I got the memory ordering right.

Looking at std::call_once as a reference, perhaps this should be

if(__libcpp_atomic_load(&__id_, _AO_Acquire) != 0)

and on line 726

__libcpp_atomic_store(&__id_, ++__next_id, _AO_Release);

However, I think the reason std::call_once uses acquire/release memory ordering is to make sure "all concurrent calls to call_once are guaranteed to observe any side-effects made by the active call, with no additional synchronization", which doesn't really apply here.
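For what it's worth, here's a standalone illustration of the case acquire/release is actually for (using std::atomic rather than the libc++ internals; not taken from this diff): the flag publishes some other, non-atomic data. In __get, __id_ is both the flag and the only value being published, so a relaxed load that observes a non-zero value already has everything it needs.

#include <atomic>

int data;                       // non-atomic payload published via the flag
std::atomic<int> ready{0};

void writer() {
    data = 42;                                      // side effect to publish
    ready.store(1, std::memory_order_release);      // release: makes `data` visible
}

int reader() {
    if (ready.load(std::memory_order_acquire) != 0) // acquire: pairs with the release store
        return data;                                // guaranteed to read 42
    return -1;                                      // not published yet
}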

732

I could also have addressed this by pulling __call_once out of mutex.cpp into a separate file and having it use internal_threading_support.h. That would have made std::call_once usable when the base threading support library (e.g. pthread) is unavailable at runtime. However, since std::call_once is part of the C++11 Thread Support Library, I figured it's better to change things here, so that we stay consistent about what we do and don't support when the thread support library is unavailable (and we don't needlessly affect the performance of std::call_once).

Assuming this and D117373 are the only reasons we support including <mutex> when _LIBCPP_HAS_NO_THREADS is defined (instead of emitting an #error like we do for almost every other header in the C++11 Thread Support Library), and assuming both of these changes are accepted, it may make sense to stop supporting <mutex> with _LIBCPP_HAS_NO_THREADS.

Fix copy-paste error

Update abi lists

No change. Rebase onto D117366 and re-trigger CI

EricWF requested changes to this revision. Mar 10 2022, 3:34 PM
EricWF added a subscriber: EricWF.

Do we have tests that fire up a bunch of threads and construct various locale::ids concurrently? If not we should add some. That will allow TSAN to diagnose bugs.

Overall, I think the current approach is still the best approach. call_once is always available in libc++, so you can use it unconditionally.

libcxx/src/locale.cpp
717

Where does __id_ get initialized?

718

Before this change, there was no synchronization between different instances of locale::id, because each object had its own __flag_. There was an atomic increment, but that was non-blocking.

Now, every construction of a new locale::id potentially blocks every other locale::id construction. This could significantly affect the behavior of existing multithreaded programs that use a lot of locales and a lot of threads.

Also, couldn't this just be written

static int& __id_ref = (__id_ = __cxx_atomic_add(__next_id, 1));
return __id_;

This is just a fancier way of writing the current call_once version (which is available in all dialects, and with or without threads).

This revision now requires changes to proceed. Mar 10 2022, 3:34 PM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 10 2022, 3:34 PM
EricWF added inline comments. Mar 10 2022, 3:36 PM
libcxx/lib/abi/arm64-apple-darwin.libcxxabi.v1.stable.exceptions.no_new_in_libcxx.abilist
1707 ↗(On Diff #401281)

The fact that this function went away means the ABI is potentially broken.

libcxx/src/locale.cpp
717

Nevermind, I see it now.

DanielMcIntosh-IBM added a comment. Edited · Mar 10 2022, 5:20 PM

Do we have tests that fire up a bunch of threads and construct various locale::ids concurrently? If not we should add some. That will allow TSAN to diagnose bugs.

None that I can find. Should that be part of a separate patch?
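For reference, a minimal sketch of what such a test could look like (hypothetical, not part of this patch; the facet name and thread count are made up, and a real test would probably use several facet types so that several ids are exercised concurrently):

#include <cassert>
#include <cstddef>
#include <locale>
#include <thread>
#include <vector>

struct my_facet : std::locale::facet {
    static std::locale::id id;   // assigned lazily, on first use
    explicit my_facet(std::size_t refs = 0) : std::locale::facet(refs) {}
};
std::locale::id my_facet::id;

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i != 16; ++i) {
        threads.emplace_back([] {
            // Installing and querying the facet forces concurrent calls into
            // the lazy id initialization, which TSAN can then observe.
            std::locale loc(std::locale::classic(), new my_facet);
            assert(std::has_facet<my_facet>(loc));
            (void)std::use_facet<my_facet>(loc);
        });
    }
    for (auto& t : threads)
        t.join();
}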

Overall, I think the current approach is still the best approach. call_once is always available in libc++, so you can use it unconditionally.

Update: it appears that in the C++ standard, [atomics] has been moved to [thread.atomics] and what was previously known as the "Thread Support Library" is now the "Concurrency support library" (see https://github.com/cplusplus/draft/commit/d74c2170a9f4c928519461d7742293af2d141852). This will change how we have to word things in documentation & comments, but that's about it. I still believe libc++ shouldn't use anything other than atomics from the Concurrency Support Library, especially when _LIBCPP_HAS_NO_THREADS is defined. Support for atomics was separated from support for threads by D114109, so this re-organization of the standard doesn't really change anything.

I would argue that call_once shouldn't always be available in libc++. Specifically, it shouldn't be available when _LIBCPP_HAS_NO_THREADS is defined. Nearly every header that is part of the C++11 Thread Support Library (i.e. <thread>, <mutex>, <shared_mutex>, <condition_variable>, <semaphore>, <latch> and <future>) has the following right near the top:

#ifdef _LIBCPP_HAS_NO_THREADS
# error <latch> is not supported on this single threaded system
#endif

The only exceptions are <mutex> and <condition_variable>. However, all of <condition_variable> is wrapped in an #ifndef _LIBCPP_HAS_NO_THREADS, so including it does nothing when _LIBCPP_HAS_NO_THREADS is defined (aside from the #include of <__config>, <__mutex_base>, <memory> and <version> at the very start, but <__mutex_base> is similarly wrapped in an #ifndef _LIBCPP_HAS_NO_THREADS). <mutex> does exactly the same thing, except that it's careful to put all the definitions and forward declarations needed for std::call_once outside the #if directive, just so that we can use std::call_once here in libcxx/src/locale.cpp.

This seems like a very poor design to me; my guess is that it's only like this because _LIBCPP_HAS_NO_THREADS didn't exist when locale::id::__get was written, and we didn't bother to change the design when we added it. I would actually argue that we should have forbidden the use of <mutex> and <condition_variable> with _LIBCPP_HAS_NO_THREADS, because there's nothing in them that a single-threaded application should realistically use. As a user, if you're on a platform without threads, using std::call_once is frankly ridiculous: replacing the std::once_flag with a boolean and checking that boolean with a simple if statement makes far more sense. However, at this point it's probably a little late to do that, since we've already opened the door to such bizarre uses of libc++. We could maybe still add something like this to <mutex> and <condition_variable> (though #warning is non-standard):

#ifdef _LIBCPP_HAS_NO_THREADS
# warning <mutex> included on a single threaded system
#endif

Any changes to <mutex> and <condition_variable> aside, when we're on a single-threaded system, libc++ itself certainly (in my mind) shouldn't be relying on anything from <mutex> or <condition_variable> (including std::call_once).

libcxx/lib/abi/arm64-apple-darwin.libcxxabi.v1.stable.exceptions.no_new_in_libcxx.abilist
1707 ↗(On Diff #401281)

I was under the impression that this wouldn't be the first time we've removed a function from the ABI?
For example, in libcxx/lib/abi/CHANGELOG.TXT there's this at line 816:

This change also marks __start_std_streams as hidden -- this variable is only required to initialize the streams, and nobody should depend on it from outside the dylib.

This function is in a similar situation: it should only ever be called from libcxx/src/locale.cpp, which is a compiled libc++ source file and thus part of the dylib.

libcxx/src/locale.cpp
718

There was still synchronization; it was just buried inside std::call_once (specifically in std::__call_once at libcxx/src/mutex.cpp:231).

If anything, this version is less restrictive, since the memory ordering here is _AO_Relaxed instead of _AO_Acquire as in std::call_once. That could improve performance and reduce the time spent blocked/waiting, since it allows more re-ordering of instructions than before. However, as I mentioned in a comment a little further down, I'm not 100% sure I actually got the memory ordering right, since I'm not very familiar with memory orderings.

As for your proposed alternative, I'm not super familiar with how static local references work, so I might be missing something, but won't that result in only 1 call to __cxx_atomic_add and thus only one increment operation on __next_id? The current implementation gets around this by using a different __flag_ per instance of locale::id (recall that __flag_ is a non-static member of locale::id). I can't see any way of replicating this behaviour using just static locals.
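To make the concern concrete, here's a standalone illustration (not libc++ code; the names are made up): a function-local static in a member function is initialized once per function, not once per object, so only the first instance would ever get an id assigned.

#include <cassert>

struct fake_id {
    long id_ = 0;
    long get() {
        static long next = 0;
        static long once = (id_ = ++next);  // runs only for the first fake_id that calls get()
        (void)once;
        return id_;
    }
};

int main() {
    fake_id a, b;
    assert(a.get() == 1);  // the static initializer ran here and assigned a.id_
    assert(b.get() == 0);  // already initialized, so b.id_ is never assigned
}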

DanielMcIntosh-IBM added a comment. Edited · Mar 16 2022, 6:45 PM

Follow-up on @EricWF's comments, based on a video chat with him:
Since id::__get is apparently hit quite often, the extra performance we might get from the relaxed memory ordering instead of acquire/release ordering could be significant. On top of that, in the happy path we replace one atomic load (from id::__flag_.__state_) and one non-atomic load (from id::__id_) with a single atomic load (from id::__id_). I still have to run some benchmarks to double-check, but I would be shocked if this somehow hurt performance.
Comparing the old and new sequences side by side makes this pretty obvious, I think.

The common/happy path:

old version using std::call_once                                    |   new version using __threading_support directly

call into std::call_once                                            <=  -
atomic_load(id::__flag_.__state_, _AO_Acquire)                      <=  atomic_load(id::__id_, _AO_Relaxed)
-                                                                   ..  assign load result to local_variable
compare load result NEQ ~0                                          ==  compare local_variable NEQ 0
//(False)
return from std::call_once                                          <=  -
return id::__id_ - 1                                                <   return local_variable - 1

And the uncommon path is WAY simpler after the change:

old version using std::call_once                                    |   new version using __threading_support directly

call into std::call_once                                            <=  -
atomic_load(id::__flag_.__state_, _AO_Acquire)                      <=  atomic_load(id::__id_, _AO_Relaxed)
-                                                                   ..  assign load result to local_variable
compare load result NEQ ~0                                          ==  compare local_variable NEQ 0
{
    junk wrapper code                                               <=  -
    call into __call_once                                           <=  -
//(NO_THREADS):
    compare (volatile) id::__flag_.__state_ EQ 0                    <=  compare local_variable EQ 0
    {
        start try block                                             <=  -
        assign 1 to (volatile) id::__flag_.__state_                 <   -
        junk wrapper code                                           <=  -
        atomic_add(&__next_id, 1, _AO_Seq);                         <=  increment __next_id
        -                                                           ..  assign __next_id to local_variable
        assign __next_id to id::__id_                               ??  atomic_store(id::__id_, local_variable, _AO_Relaxed)
        assign ~0 to (volatile) id::__flag_.__state_                <   -
        catch block/end of try-block                                <=  -
    }
//(THREADS):
    lock mutex common to all std::call_once calls                   <   lock mutex common to all id::__get calls (a subset of the previous calls to std::call_once)
    -                                                               ..  assign id::__id_ to local_variable
    compare (volatile) id::__flag_.__state_ EQ 1                    <   -
        condvar_wait                                                <   -
    compare (volatile) id::__flag_.__state_ EQ 0                    <=  compare local_variable EQ 0
    {
        start try block                                             <=  -
        atomic_store(id::__flag_.__state_, 1, _AO_Relaxed)          <   -
        unlock mutex                                                <   -
        junk wrapper code                                           <=  -
        atomic_add(&__next_id, 1, _AO_Seq);                         <=  increment __next_id
        -                                                           ..  assign __next_id to local_variable
        assign __next_id to id::__id_                               <   -
        re-lock mutex                                               <   -
        atomic_store(id::__flag_.__state_, ~0, _AO_Release)         <=  atomic_store(id::__id_, local_variable, _AO_Relaxed)
        unlock mutex                                                ==  unlock mutex
        condvar_broadcast                                           <   -
        catch block/end of try-block                                <=  -
    }
    else
        unlock mutex                                                ==  unlock mutex
}
return from std::call_once                                          <=  -
return id::__id_ - 1                                                <   return local_variable - 1
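
Putting the right-hand column together, the new path looks roughly like this (a sketch only: the spellings follow the __threading_support/atomic_support conventions, and the actual diff may split this across locale::id's members differently):

long locale::id::__get() {
    long __result = __libcpp_atomic_load(&__id_, _AO_Relaxed);     // happy path: one relaxed load
    if (__result == 0) {                                           // id not assigned yet
#ifndef _LIBCPP_HAS_NO_THREADS
        static __libcpp_mutex_t __mut = _LIBCPP_MUTEX_INITIALIZER; // shared by all id::__get calls
        __libcpp_mutex_lock(&__mut);
        __result = __id_;                                          // re-check under the lock
        if (__result == 0) {
            __result = ++__next_id;
            __libcpp_atomic_store(&__id_, __result, _AO_Relaxed);
        }
        __libcpp_mutex_unlock(&__mut);
#else
        __result = ++__next_id;
        __libcpp_atomic_store(&__id_, __result, _AO_Relaxed);
#endif
    }
    return __result - 1;
}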

Once this becomes more of a priority for us, I'll run some actual benchmarks to make sure before I consider landing this.