Fixes PR#47013. The implementation in libstdc++ seems to have had the same problem (see Bug #100806) and also solved it by always notifying all (instead of just one if __update is equal to 1).
Details
Diff Detail
Event Timeline
I believe the old code had a problem. I'm not fully convinced that the new code doesn't still have a similar problem. ;) But this PR seems like a good idea to me, and might even be a candidate for 13.x (@ldionne?).
Is it possible to write a very naïve test for this? E.g.
int main(int, char**) { std::counting_semaphore s(0); std::barrier b(3); std::thread t1 = std::thread([&]() { for (int i=0; i < 100000; ++i) { s.acquire(); b.arrive_and_wait(); } }); std::thread t2 = std::thread([&]() { for (int i=0; i < 100000; ++i) { s.acquire(); b.arrive_and_wait(); } }); std::thread t3 = std::thread([&]() { for (int i=0; i < 100000; ++i) { s.release(1); s.release(1); b.arrive_and_wait(); } }); t1.join(); t2.join(); t3.join(); return 0; }
(This fails to reproduce the problem on my personal laptop, but I don't think that's too surprising.)
LGTM % test nitpicks. However, I believe we should force @ldionne to look at this one. :) <semaphore> has already gotten two patches merged back to 13.x, and this might be another one.
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp | ||
---|---|---|
10 | I assume that REQUIRES: long_tests is a tradeoff — faster tests in most configurations, but testing of this specific test only in one (or a few) configurations; right? | |
34–37 | Throughout: please int [ij] = 0 instead of auto [ij] = 0. |
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp | ||
---|---|---|
34–37 | It's hard to say. From my understanding, the releases have to happen in a quick enough succession that none of the acquirer threads get around to decrementing the variable in between. This might be more likely with two releaser threads to happen, but maybe not because it increases the contention on the atomic variable. I really don't know either, so I've implemented your suggestion for now. |
I would actually like @__simt__ to take a look at this one.
If we do agree this is a proper fix, then yeah I would cherry-pick to release/13.x.
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp | ||
---|---|---|
10 | REQUIRES: long_tests was used for slow devices like simulators. It's not really maintained anymore. By default, it is enabled. I'm OK with the current state, i.e. we always run this test. | |
12 | You will need to add: // This test requires the dylib support introduced in D68480, which shipped in macOS 11.0. // XFAIL: use_system_cxx_lib && target={{.+}}-apple-macosx10.{{9|10|11|12|13|14|15}} You will also need to add: // TODO(ldionne): This test fails on Ubuntu Focal on our CI nodes (and only there), in 32 bit mode. // UNSUPPORTED: linux && 32bits-on-64bits Sorry, I still have not figured this one out. | |
52–54 | This is used for some freestanding configurations where there is no way to create a thread without providing a stack size. In those configurations, support::make_test_thread can be used to pass a stack size. That isn't covered by CI yet so you couldn't notice. |
I've just read up the backstory of this.
I agree with the explanation of the bug, and agree the fix is to just notify_all() regardless of the update amount.
Wouldn't it also be possible to just unconditionally notify the atomic? Something like this:
__a.fetch_add(__update, memory_order_release); if(__update > 1) __a.notify_all(); else __a.notify_one();
One downside is that there might be some unneeded calls to the platform wake function (futex, etc.), but it would have the advantage of avoiding thundering herd problems if there are many consumers and a single producer repeatedly calling notify_one().
IMO it makes more sense to try and diminish spurious wakeups, but I'm far from an expert in that domain.
The current fix LGTM -- I did my due diligence and I understand the problem and how this fixes it. I'd simply like to make sure the test is not flaky.
libcxx/include/semaphore | ||
---|---|---|
88–92 | We might want to reformulate this as if(__a.fetch_add(__update, memory_order_release) == 0) __a.notify_all(); This seems easier to understand. | |
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp | ||
1 | Do you think this test is the reason why the CI nodes timed out after running for 2 hours? It seems like it's not consistently slow (I just restarted the macOS CI and it finished quickly, but before that it froze for 2 hours, which I've never seen). https://buildkite.com/llvm-project/libcxx-ci/builds/6835#4278d152-fa4d-482f-a12e-e9697caa99e3 |
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp | ||
---|---|---|
1 | I don't see what could be causing this, but I agree that it looks suspicious. The fix itself should be uncontroversial, though, at least as far as correctness is concerned, because we are always waking up at least as many threads as before, so if anything, we're making deadlocks less likely to occur. What options do we have regarding this test? Could the problem also lie with, say, std::barrier? |
I've been studying the libcxx atomic "futex table" implementation in
src/atomic.cpp and include/atomic and I think I have some theories why
there still might be some deadlocks, at least under macOS.
The "futex table" in src/atomic.cpp contains 256 platform native futexes
(something like Linux/OpenBSD/NetBSD futex, macOS __ulock_*, FreeBSD
umtx, Windows WaitOnAddress). This table is needed because some platforms
don't natively support waiting on atomics of sizes 8/16/32/64 bits. On Linux,
for example, futexes are always 32 bit. On these platforms, calling
atomic_wait on an atomic of unsupported size then waits using one of
the table futexes "under the hood".
Each futex of the table has an associated "waiter count" (__contention_state)
which tracks the number of waiters on that futex. This count is used to avoid
calling the platform wake function needlessly.
If the platform _does_ support the size of the atomic natively then the
wait/notify syscalls (e.g. futex) are called directly on the atomic instead
of using one from the table. However, the "waiter count" of the table futex
is still used to minimize native wake up calls.
The part of __libcpp_atomic_wait_backoff_impl where the waiting actually
happens looks like this:
auto const __monitor = __libcpp_atomic_monitor(__a); if(__test_fn()) return true; __libcpp_atomic_wait(__a, __monitor);
...so the current value of the futex (either the table one if the size is not
natively supported, or the actual one otherwise) is saved. Then the test
function is called. If the test is unsuccessful (in case of semaphores the
atomic is "0") then we wait on the futex, telling it the __monitor value to
be free of races.
This will work absolutely fine if one of the table futexes is used. Waking one
of them up will always increase the futex value by one, so the futex wait will
instantly return if the current atomic value is different than __monitor
(which indicates that a wakeup must have happened _after_ the call to
__test_fn but before going to sleep). There is a corner case when there are
~4 billion wakeups in this short interval. This is mitigated by waiting for at
most 2s when the native platform futex size is only 32 bit (explained by
Olivier Giroux here: https://github.com/ogiroux/atomic_wait/issues/3).
Now, if the native atomic/futex is used (which is the case on macOS for
counting_semaphore!), deadlocks are _much_ more likely to occur, because now
the futex is no longer a simple counter that always increments, but instead
the "real" value (in case of counting_semaphore just the semaphore value
itself). There might be executions like the following:
- T1 calls acquire() - T2 calls try_acquire() - T3 calls release() two times - semaphore starts with "0" T1 T2 T3 sem ------------------------------------------------------------------------ 0 ------------------------------------------------------------------------ acquire(): sees that sema=0 goes into __libcpp_atomic_wait_backoff_impl 0 ------------------------------------------------------------------------ release() 1 ------------------------------------------------------------------------ __monitor=1 1 ------------------------------------------------------------------------ try_acquire(): acquires one successfully, returns true 0 ------------------------------------------------------------------------ __test_fn(): returns false since sem==0 0 ------------------------------------------------------------------------ release() 1 ------------------------------------------------------------------------ calls __libcpp_atomic_wait(__a, 1); --> deadlock! 1 ------------------------------------------------------------------------
Here, __libcpp_atomic_wait(__a, 0); should have been called, because
__test_fn() made a decision to sleep based on the value "0" of sem. But
instead, the value of __monitor is used which is "1" and leads to this ABA
style problem where T1 sleeps although the semaphore is unlocked.
I think to solve this problem you could require the "test function" to save its
current understanding of the atomic into an in/out argument so that the atomic
wait function knows which value to use for its second argument. In case of the
semaphores, it could look like this:
_LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY void acquire() { - auto const __test_fn = [this]() -> bool { - auto __old = __a.load(memory_order_relaxed); - return (__old != 0) && __a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed); + auto const __test_fn = [this](ptrdiff_t& __old) -> bool { + while (true) { + if (__old == 0) + return false; + if (__a.compare_exchange_weak(__old, __old - 1, memory_order_acquire, memory_order_relaxed)) + return true; + } }; - __cxx_atomic_wait(&__a.__a_, __test_fn); + __cxx_atomic_wait_fn(&__a.__a_, __test_fn, memory_order_relaxed); }
For semaphores, the resulting __old is always "0" if "false" is returned. So
the second argument to futex is always "0" which should be correct.
I've been hacking on a patch which is very much WIP but I could post it if
there's interest (and if this analysis even makes sense!).
The second problem is related to the "waiter count" (__contention_state)
which is used to minimize platform wake calls. The only thing that must never
happen is that the wait calls happens but the wake call is not executed. But
I think this is currently not guaranteed, at least according to the C11 memory
model. I guess in practice this problem is highly unlikely to appear, but I
think it's still worth fixing to be correct without doubt.
Currently, the __contention_state is incremented/decremented around the
futex wait call like this (in src/atomic.cpp):
__cxx_atomic_fetch_add(__contention_state, __cxx_contention_t(1), memory_order_seq_cst); // We sleep as long as the monitored value hasn't changed. __libcpp_platform_wait_on_address(__platform_state, __old_value); __cxx_atomic_fetch_sub(__contention_state, __cxx_contention_t(1), memory_order_release);
...and the wake is done like this:
if(0 != __cxx_atomic_load(__contention_state, memory_order_seq_cst)) // We only call 'wake' if we consumed a contention bit here. __libcpp_platform_wake_by_address(__platform_state, __notify_one);
To visualize the problem I modeled a simple example with herd7
(http://diy.inria.fr/doc/herd.html). The example is a simple
atomic_notify/atomic_wait execution with two threads "P0" and "P1". "P0" sets
an atomic variable from "0" to "1" and calls atomic_notify, and "P1" just
calls atomic_wait(atomic, 0). So "P1" must never block for long (if it blocks
at all):
C fut {} P0 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* do_notify) { atomic_store_explicit(atomic, 1, memory_order_relaxed); // `atomic_notify(atomic)` { int tmp = atomic_fetch_add_explicit(futex, 1, memory_order_release); *do_notify = atomic_load_explicit(waiters, memory_order_seq_cst) != 0; } } P1 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* monitor, int *do_block) { int expected = 0; // `atomic_wait(atomic, expected)` { *monitor = atomic_load_explicit(futex, memory_order_acquire); int actual = atomic_load_explicit(atomic, memory_order_relaxed); if (actual == expected) { int tmp = atomic_fetch_add_explicit(waiters, 1, memory_order_seq_cst); int success = atomic_compare_exchange_strong_explicit(futex, monitor, *monitor, memory_order_relaxed, memory_order_relaxed); *do_block = success != 0; } } } exists (1:actual=0)
atomic is the atomic for atomic_notify/atomic_wait, futex is the
"backing" futex from the table, and waiters corresponds to
__contention_state.
I used `herd7 -c11 -cat rc11.cat -show prop -graph columns -unshow hb,eco,co
-showinitwrites false -oneinit false -xscale 4.0 -yscale 0.8 -evince
sem.litmus` to generate graphs.
Since herd7 does not support "read-compare-block" operations (like futex) I
modelled the wait with a relaxed, strong "read-compare-exchange" and look at
the return value. If and only if it's true, the futex call would have
blocked. I also elided the fetch_sub operation after the futex wait because
it isn't relevant. I think memory_order_release for the fetch_sub is still
important to avoid compiler reorderings.
herd7 finds an execution (https://ibb.co/zrNFJNZ) where the waker reads "0"
from waiters (so this read is ordered before the fetch_add in the
sequential consistency order "psc") but the "read-compare-block" operation of
the futex wait call reads "0" from futex even though futex was increased by
the waker right before the atomic_load_explicit! Then, the waiter would wait
(because it read "0" and this is the same as expected) and the waker
_wouldn't_ wake (because it read a waiter count of "0"). This really is
mind-bending to me and I think the explanation is that the sequential
consistency order "psc" does not necessarily imply "happens-before", so "P1" is
free to read an "older" value of futex.
To fix this, a "read-don't-modify-write"/"load release" operation (similar to
https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf) can be used to read
the __contention_state variable from the waker:
if(0 != __cxx_atomic_fetch_add(__contention_state, 0, memory_order_release)) // We only call 'wake' if we consumed a contention bit here. __libcpp_platform_wake_by_address(__platform_state, __notify_one);
The seq_cst fetch_add in the waiter can then be weakened to acquire:
__cxx_atomic_fetch_add(__contention_state, __cxx_contention_t(1), memory_order_acquire); // We sleep as long as the monitored value hasn't changed. __libcpp_platform_wait_on_address(__platform_state, __old_value); __cxx_atomic_fetch_sub(__contention_state, __cxx_contention_t(1), memory_order_release);
Here is the herd7 model of the "read-don't-modify-write" version:
C fut {} P0 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* do_notify) { atomic_store_explicit(atomic, 1, memory_order_relaxed); // `atomic_notify(atomic)` { int tmp = atomic_fetch_add_explicit(futex, 1, memory_order_release); *do_notify = atomic_fetch_add_explicit(waiters, 0, memory_order_release) != 0; } } P1 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* monitor, int *do_block) { int expected = 0; // `atomic_wait(atomic, expected)` { *monitor = atomic_load_explicit(futex, memory_order_acquire); int actual = atomic_load_explicit(atomic, memory_order_relaxed); if (actual == expected) { int tmp = atomic_fetch_add_explicit(waiters, 1, memory_order_acquire); int success = atomic_compare_exchange_strong_explicit(futex, monitor, *monitor, memory_order_relaxed, memory_order_relaxed); *do_block = success != 0; } } } exists (1:actual=0)
This model disallows the execution where do_block is "1" and do_notify is
"0".
Thanks a lot for your analysis @jiixyj! Your analysis of the ABA problem makes sense to me.
To summarize, the current acquire() implementation does this:
auto const __test_fn = [this]() -> bool { auto __old = __a.load(memory_order_relaxed); return (__old != 0) && __a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed); }; __cxx_atomic_wait(&__a.__a_, __test_fn);
where __cxx_atomic_wait does this:
auto const __monitor = __libcpp_atomic_monitor(__a); if(__test_fn()) return true; __libcpp_atomic_wait(__a, __monitor);
i.e. __cxx_atomic_wait loads the old value, and then the test function also loads the old value, which might have changed in between, so the test function tests a different value than __cxx_atomic_wait then waits on. In particular, __cxx_atomic_wait might read 1, __test_fn might read zero and therefore return false, and then the value of __a.__a_ might concurrently be updated to 1 before __cxx_atomic_wait enters the wait with an old value of 1, hence causing a deadlock/lost wakeup.
So I think the main problem here is that __cxx_atomic_wait loads the value, and then __test_fn has to load it again. The fix proposed by @jiixyj, namely to have __cxx_atomic_wait pass the value of __monitor as a reference into __test_fn, appears sensible to me; what do you think, @ldionne?
Woah, I somehow missed the comment on December 14. What an analysis -- my mind is blown TBH, thank you so much @jiixyj for looking into it like that.
If I followed correctly, then yes it makes sense to me, and yes the proposed fix also makes sense. Also, if you re-upload a patch, you should be able to see if it hangs forever (we now enforce a timeout in our CI).
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp | ||
---|---|---|
16–17 | This can be removed now. | |
19 |
We might want to reformulate this as
This seems easier to understand.