Details
- Reviewers
ldionne Mordante • Quuxplusone EricWF - Group Reviewers
Restricted Project
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
libcxx/include/__locale | ||
---|---|---|
206–207 | Kind of ugly that we still need to #include "mutex" just for this. Maybe I should change this to #if defined(_LIBCPP_ABI_MICROSOFT) uintptr_t __reserved; #else unsigned long __reserved; #endif And add a static_assert(sizeof(locale::id.__reserved) == sizeof(once_flag)) somewhere. Not clear where I'd add it though. | |
215–216 | I think technically this removes a symbol from the ABI, but it's one that really shouldn't ever get used by anybody. | |
libcxx/src/locale.cpp | ||
716–732 | I'm not 100% certain the atomic load/store is necessary or that I got the memory ordering right. Looking at std::call_once as a reference, perhaps this should be if(__libcpp_atomic_load(&__id_, _AO_Acquire) != 0) and on line 726 __libcpp_atomic_store(&__id_, ++__next_id, _AO_Release); However, I think the reason std::call_once uses acquire/release memory ordering is to make sure "all concurrent calls to call_once are guaranteed to observe any side-effects made by the active call, with no additional synchronization", which doesn't really apply here. | |
732 | I could have also addressed this by pulling __call_once out of mutex.cpp into a separate file, and had that use internal_threading_support.h. That would have made std::call_once usable when the base threading support library (e.g. pthread) is unavailable at runtime. However, since std::call_once is part of the C++11 Thread Support Library, I figured it's better to change things here so we're consistent about what we do and don't support when the thread support library is unavailable (and we don't needlessly affect the performance of std::call_once). Assuming this and D117373 are the only reason we support including <mutex> when _LIBCPP_HAS_NO_THREADS is defined (instead of giving an #error like we do with almost every other header that is part of the C++11 Thread support library), and both of these changes are accepted, it may make sense to stop supporting <mutex> with _LIBCPP_HAS_NO_THREADS. |
Do we have tests that fire up a bunch of threads and construct various locale::ids concurrently? If not we should add some. That will allow TSAN to diagnose bugs.
Overall, I think the current approach is still the best approach. call_once is always available in libc++, so you can use it unconditionally.
libcxx/src/locale.cpp | ||
---|---|---|
717 | Where does __id_ get initialized? | |
718 | Before this change, there was no synchronization between different instances of locale::id objects because each object had its own __flag. There was an atomic increment, but that was non-blocking. Now, every construction of a new locale::id potentially blocks every construction of locale::id. This could significantly effect the behavior of existing multithreaded programs that use a lot of locales and a lot of threads. Also, couldn't this just be written static int& __id_ref = (__id_ = __cxx_atomic_add(__next_id, 1)); return __id; Which is just a fancier way of writing the current call_once version. (Which is available in all dialects and with/without threads). |
None that I can find. Should that be part of a separate patch?
Overall, I think the current approach is still the best approach. call_once is always available in libc++, so you can use it unconditionally.
Update: it appears that in the C++ standard, [atomics] has been moved to [thread.atomics] and what was previously known as the "Thread Support Library" is now the "Concurrency support library" (see https://github.com/cplusplus/draft/commit/d74c2170a9f4c928519461d7742293af2d141852). This will change how we have to word things in documentation & comments, but that's about it. I still believe libc++ shouldn't use anything other than atomics from the Concurrency Support Library, especially when _LIBCPP_HAS_NO_THREADS is defined. Support for atomics was separated from support for threads by D114109, so this re-organization of the standard doesn't really change anything.
I would argue that call_once shouldn't always be available in libc++. Specifically, it shouldn't be available when _LIBCPP_HAS_NO_THREADS is defined. Nearly every header that is part of the C++11 Thread Support Library (i.e. <thread>, <mutex>, <shared_mutex>, <condition_variable>, <semaphore>, <latch> and <future>) has the following right near at the top:
#ifdef _LIBCPP_HAS_NO_THREADS # error <latch> is not supported on this single threaded system #endif
The only exceptions are <mutex> and <condition_variable>. However, all of <condition_variable> is wrapped in a #ifndef _LIBCPP_HAS_NO_THREADS, so including it does nothing when _LIBCPP_HAS_NO_THREADS is defined (aside from the #include of <__config>, <__mutex_base>, <memory> and <version> at the very start, but <__mutex_base> is similarly wrapped in a #ifndef _LIBCPP_HAS_NO_THREADS). <mutex> does the exact same thing except it's careful to put all the definitions and forward declarations needed for std::call_once outside the #if directive, just so that we can use std::call_once here in libcxx/src/locale.cpp.
This seems like a very poor design to me, and if I had to guess the only reason it's like this is probably because _LIBCPP_HAS_NO_THREADS didn't exist when locale::id::__get was written, and we didn't bother to change the design when we added it. I would actually argue that we should have forbidden the use of <mutex> and <condition_variable> with _LIBCPP_HAS_NO_THREADS, because there's nothing in there that should realistically be used by a single-threaded application. As a user, if you're on a platform without threads, using std::call_once is frankly ridiculous - replacing the std::once_flag with a boolean, and using a simple if statement to check the boolean makes WAYYYYY more sense. However, at this point, it's maybe a little late to be doing that, since we've already opened the door to such bizarre uses of libc++. We could maybe still add something like this to <mutex> and <condition_variable> (though #warning is non-standard):
#ifdef _LIBCPP_HAS_NO_THREADS # warning <mutex> included on a single threaded system #endif
Any changes to <mutex> and <condition_variable> aside, when we're on a single-threaded system, libc++ itself certainly (in my mind) shouldn't be relying on anything from <mutex> or <condition_variable> (including std::call_once).
libcxx/lib/abi/arm64-apple-darwin.libcxxabi.v1.stable.exceptions.no_new_in_libcxx.abilist | ||
---|---|---|
1707 ↗ | (On Diff #401281) | I was under the impression this wouldn't be the first time we've removed a function from the ABI?
This function is in a similar situation: It should only be getting called from libcxx/src/locale.cpp, which is a compiled libc++ source file and thus part of the dylib. |
libcxx/src/locale.cpp | ||
718 | There was still synchronization, it was just buried in std::call_once (specifically inside std::__call_once at libcxx/src/mutex.cpp:231). If anything, this version is less restrictive since the memory ordering here is _AO_Relaxed instead of _AO_Acquire like in std::call_once. This could improve performance and reduce the amount of time spent blocked/waiting since it allows more re-ordering of instructions than before. However, as I mentioned in a comment a little further down I'm not 100% sure I actually got this memory ordering right, since I'm not very familiar with them. As for your proposed alternative, I'm not super familiar with how static local references work, so I might be missing something, but won't that result in only 1 call to __cxx_atomic_add and thus only one increment operation on __next_id? The current implementation gets around this by using a different __flag_ per instance of locale::id (recall that __flag_ is a non-static member of locale::id). I can't see any way of replicating this behaviour using just static locals. |
Follow up on @EricWF's comments based on a video chat with him:
Since id::__get is apparently hit quite often, the extra performance we might get here because of the relaxed memory ordering instead of acquire/release ordering could potentially be significant. Not to mention that in the happy path we replace 1 atomic load (from id::__flag_.__state_) and 1 non-atomic load (from id::__id_) with a single atomic load (from id::__id_). I do have to run some benchmarks to double check, but I would be shocked if this somehow hurt performance.
Comparing the old sequence to the new sequence side by side makes this pretty obvious I think.
The common/happy path:
old version using std::call_once | new version using __threading_support directly call into std::call_once <= - atomic_load(id::__flag_.__state_, _AO_Acquire) <= atomic_load(id::__id_, _AO_Relaxed) - .. assign load result to local_variable compare load result NEQ ~0 == compare local_variable NEQ 0 //(False) return from std::call_once <= - return id::__id_ - 1 < return local_variable - 1
And the uncommon path is WAY simpler after the change:
old version using std::call_once | new version using __threading_support directly call into std::call_once <= - atomic_load(id::__flag_.__state_, _AO_Acquire) <= atomic_load(id::__id_, _AO_Relaxed) - .. assign load result to local_variable compare load result NEQ ~0 == compare local_variable NEQ 0 { junk wrapper code <= - call into __call_once <= - //(NO_THREADS): compare (volatile) id::__flag_.__state_ EQ 0 <= compare local_variable EQ 0 { start try block <= - assign 1 to (volatile) id::__flag_.__state_ < - junk wrapper code <= - atomic_add(&__next_id, 1, _AO_Seq); <= increment __next_id - .. assign __next_id to local_variable assign __next_id to id::__id_ ?? atomic_store(id::__id_, local_variable, _AO_Relaxed) assign ~0 to (volatile) id::__flag_.__state_ < - catch block/end of try-block <= - } //(THREADS): lock mutex common to all std::call_once calls < lock mutex common to all id::__get calls (a subset of the previous calls to std::call_once) - .. assign id::__id_ to local_variable compare (volatile) id::__flag_.__state_ EQ 1 < - condvar_wait < - compare (volatile) id::__flag_.__state_ EQ 0 <= compare local_variable EQ 0 { start try block <= - atomic_store(id::__flag_.__state_, 1, _AO_Relaxed) < - unlock mutex < - junk wrapper code <= - atomic_add(&__next_id, 1, _AO_Seq); <= increment __next_id - .. assign __next_id to local_variable assign __next_id to id::__id_ < - re-lock mutex < - atomic_store(id::__flag_.__state_, ~0, _AO_Release) <= atomic_store(id::__id_, local_variable, _AO_Relaxed) unlock mutex == unlock mutex condvar_broadcast < - catch block/end of try-block <= - } else unlock mutex == unlock mutex } return from std::call_once <= - return id::__id_ - 1 < return local_variable - 1
Once this becomes more of a priority for us, I'll run some actual benchmarks to make sure before I consider landing this.
Kind of ugly that we still need to #include "mutex" just for this. Maybe I should change this to
And add a static_assert(sizeof(locale::id.__reserved) == sizeof(once_flag)) somewhere. Not clear where I'd add it though.