This is an archive of the discontinued LLVM Phabricator instance.

Differential D114119

[libcxx] Fix potential lost wake-up in counting semaphore
Accepted · Public

Authored by ldionne on Nov 17 2021, 1:31 PM.

Details

Reviewers

• Quuxplusone
fwolff

Group Reviewers

Restricted Project

Summary

Fixes PR#47013. The implementation in libstdc++ seems to have had the same problem (see Bug #100806) and also solved it by always notifying all (instead of just one if __update is equal to 1).
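
To make the failure mode concrete, here is a minimal single-threaded sketch of the bookkeeping. It is a toy model, not libc++ code: `ToySemaphore`, `release_old`, `release_new` and `stranded_waiters` are hypothetical names, and "waking" is just a counter. It assumes two waiters are already asleep and two `release(1)` calls arrive before either waiter runs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model (hypothetical, not libc++ code): tracks how many sleeping
// waiters receive a wake-up from a sequence of release() calls.
struct ToySemaphore {
    std::ptrdiff_t count = 0;     // semaphore value
    std::ptrdiff_t sleeping = 0;  // waiters blocked in acquire()

    void wake(std::ptrdiff_t n) { sleeping -= std::min(n, sleeping); }

    // Old policy: notify only if the previous value was 0, and wake a
    // single waiter when update == 1.
    void release_old(std::ptrdiff_t update) {
        assert(update >= 1);
        std::ptrdiff_t old = count;
        count += update;
        if (old > 0) return;
        wake(update > 1 ? sleeping : 1);
    }

    // New policy: if the previous value was 0, wake every waiter.
    void release_new(std::ptrdiff_t update) {
        assert(update >= 1);
        std::ptrdiff_t old = count;
        count += update;
        if (old == 0) wake(sleeping);
    }
};

// Two waiters are asleep and two back-to-back release(1) calls arrive.
// Returns the number of waiters that never received a wake-up.
inline std::ptrdiff_t stranded_waiters(bool use_new_policy) {
    ToySemaphore s;
    s.sleeping = 2;
    for (int i = 0; i < 2; ++i) {
        if (use_new_policy) s.release_new(1);
        else s.release_old(1);
    }
    return s.sleeping;
}
```

Under the old policy the second `release(1)` sees a previous value of 1 and skips the notification, so one waiter stays asleep even though the count is 2. Under the new policy the first release wakes both.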

Diff Detail

Unit Tests: Failed

	Time	Test
	3,840 ms	libcxx CI GCC 11 / C++latest > llvm-libc++-shared-gcc-cfg-in.std/atomics/atomics_types_operations/atomics_types_operations_wait::atomic_notify_all.pass.cpp
	3,780 ms	libcxx CI GCC 11 / C++latest > llvm-libc++-shared-gcc-cfg-in.std/atomics/atomics_types_operations/atomics_types_operations_wait::atomic_notify_one.pass.cpp
	3,780 ms	libcxx CI GCC 11 / C++latest > llvm-libc++-shared-gcc-cfg-in.std/atomics/atomics_types_operations/atomics_types_operations_wait::atomic_wait.pass.cpp
	3,770 ms	libcxx CI GCC 11 / C++latest > llvm-libc++-shared-gcc-cfg-in.std/atomics/atomics_types_operations/atomics_types_operations_wait::atomic_wait_explicit.pass.cpp
	430 ms	libcxx CI Modular build > llvm-libc++-shared-cfg-in.std/iterators/predef_iterators/move_iterators/move_sentinel::base.pass.cpp

Event Timeline

fwolff requested review of this revision. Nov 17 2021, 1:31 PM

fwolff created this revision.

Herald added 1 blocking reviewer(s): Restricted Project. Nov 17 2021, 1:31 PM

Herald added a subscriber: libcxx-commits.

I believe the old code had a problem. I'm not fully convinced that the new code doesn't still have a similar problem. ;) But this PR seems like a good idea to me, and might even be a candidate for 13.x (@ldionne?).
Is it possible to write a very naïve test for this? E.g.

#include <barrier>
#include <semaphore>
#include <thread>

int main(int, char**)
{
    std::counting_semaphore s(0);
    std::barrier b(3);
    std::thread t1 = std::thread([&]() {
        for (int i=0; i < 100000; ++i) {
            s.acquire();
            b.arrive_and_wait();
        }
    });
    std::thread t2 = std::thread([&]() {
        for (int i=0; i < 100000; ++i) {
            s.acquire();
            b.arrive_and_wait();
        }
    });
    std::thread t3 = std::thread([&]() {
        for (int i=0; i < 100000; ++i) {
            s.release(1);
            s.release(1);
            b.arrive_and_wait();
        }
    });
    t1.join();
    t2.join();
    t3.join();

    return 0;
}

(This fails to reproduce the problem on my personal laptop, but I don't think that's too surprising.)

I've added a simple test, although I couldn't reproduce the issue locally either.

Harbormaster completed remote builds in B134805: Diff 388044. Nov 17 2021, 2:57 PM

fwolff updated this revision to Diff 388302. Nov 18 2021, 1:11 PM

LGTM % test nitpicks. However, I believe we should force @ldionne to look at this one. :) <semaphore> has already gotten two patches merged back to 13.x, and this might be another one.

libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp
10	I assume that `REQUIRES: long_tests` is a tradeoff — faster tests in most configurations, but testing of this specific test only in one (or a few) configurations; right? This test takes only 1.3 seconds on my laptop. Maybe we should just knock the iteration count down from `100'000` to `10'000` (so it takes 0.13 seconds) and then we can definitely remove this `REQUIRES`. Defer to @ldionne here.
34–37	Throughout: please `int [ij] = 0` instead of `auto [ij] = 0`. Also, I find this nested-loop version much harder to grok at first sight (I mean, maybe just because I wrote the other version myself ;)). IMO it would help to unroll the `j` loop. I guess the real question is, does the change from "2 acquiring threads, 2 releases in 1 releasing thread" to "8 acquiring threads, 8 releases in 2 releasing threads" make the buggy situation more or less likely to occur? From the description in https://bugs.llvm.org/show_bug.cgi?id=47013#c1 , I think that increasing the number of waiting acquirers is good (because each adjacent pair of acquirers has a chance to race with each other, so we have 7x the chances to reproduce in each loop iteration), but I don't see how increasing the number of releasers helps. If you think it'd be harmless to go back to just one releaser (keeping all 8 acquirers, and ideally still unrolling the loop), that would simplify this code IMHO.

fwolff updated this revision to Diff 388335. Nov 18 2021, 3:21 PM

fwolff marked 2 inline comments as done. Nov 18 2021, 3:29 PM

fwolff added inline comments.

libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp
34–37	It's hard to say. From my understanding, the releases have to happen in quick enough succession that none of the acquirer threads get around to decrementing the variable in between. This might be more likely to happen with two releaser threads, but maybe not, because that increases the contention on the atomic variable. I really don't know either, so I've implemented your suggestion for now.

Harbormaster completed remote builds in B134998: Diff 388335. Nov 18 2021, 9:41 PM

I would actually like @__simt__ to take a look at this one.

If we do agree this is a proper fix, then yeah I would cherry-pick to release/13.x.

libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp
10	`REQUIRES: long_tests` was used for slow devices like simulators. It's not really maintained anymore. By default, it is enabled. I'm OK with the current state, i.e. we always run this test.
12	You will need to add:
	// This test requires the dylib support introduced in D68480, which shipped in macOS 11.0.
	// XFAIL: use_system_cxx_lib && target={{.+}}-apple-macosx10.{{9|10|11|12|13|14|15}}
	You will also need to add:
	// TODO(ldionne): This test fails on Ubuntu Focal on our CI nodes (and only there), in 32 bit mode.
	// UNSUPPORTED: linux && 32bits-on-64bits
	Sorry, I still have not figured this one out.
52–54	This is used for some freestanding configurations where there is no way to create a thread without providing a stack size. In those configurations, `support::make_test_thread` can be used to pass a stack size. That isn't covered by CI yet so you couldn't notice.

This revision now requires changes to proceed. Nov 22 2021, 10:37 AM

I've just read up on the backstory of this.

I agree with the explanation of the bug, and agree the fix is to just notify_all() regardless of the update amount.

Wouldn't it also be possible to just unconditionally notify the atomic? Something like this:

__a.fetch_add(__update, memory_order_release);
if(__update > 1)
    __a.notify_all();
else
    __a.notify_one();

One downside is that there might be some unneeded calls to the platform wake function (futex, etc.), but it would have the advantage of avoiding thundering herd problems if there are many consumers and a single producer repeatedly calling notify_one().

fwolff updated this revision to Diff 389308. Nov 23 2021, 1:37 PM

fwolff marked an inline comment as done.

fwolff marked 2 inline comments as done.

Harbormaster completed remote builds in B135716: Diff 389308. Nov 23 2021, 9:49 PM

In D114119#3147218, @jiixyj wrote:
Wouldn't it also be possible to just unconditionally notify the atomic? Something like this:
__a.fetch_add(__update, memory_order_release);
if(__update > 1)
    __a.notify_all();
else
    __a.notify_one();
One downside is that there might be some unneeded calls to the platform wake function (futex, etc.), but it would have the advantage of avoiding thundering herd problems if there are many consumers and a single producer repeatedly calling notify_one().

IMO it makes more sense to try and diminish spurious wakeups, but I'm far from an expert in that domain.

The current fix LGTM -- I did my due diligence and I understand the problem and how this fixes it. I'd simply like to make sure the test is not flaky.

libcxx/include/semaphore
91–96	We might want to reformulate this as `if(__a.fetch_add(__update, memory_order_release) == 0) __a.notify_all();` This seems easier to understand.
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp
2	Do you think this test is the reason why the CI nodes timed out after running for 2 hours? It seems like it's not consistently slow (I just restarted the macOS CI and it finished quickly, but before that it froze for 2 hours, which I've never seen). https://buildkite.com/llvm-project/libcxx-ci/builds/6835#4278d152-fa4d-482f-a12e-e9697caa99e3

This revision is now accepted and ready to land. Nov 24 2021, 3:25 PM

fwolff updated this revision to Diff 389889. Nov 25 2021, 4:52 PM

fwolff marked 2 inline comments as done. Nov 25 2021, 5:05 PM

fwolff added inline comments.

libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp
2	I don't see what could be causing this, but I agree that it looks suspicious. The fix itself should be uncontroversial, though, at least as far as correctness is concerned, because we are always waking up at least as many threads as before, so if anything, we're making deadlocks less likely to occur. What options do we have regarding this test? Could the problem also lie with, say, `std::barrier`?

Harbormaster completed remote builds in B136130: Diff 389889. Nov 25 2021, 11:13 PM

ldionne added inline comments. Dec 1 2021, 8:56 AM

libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp
2	I pulled the patch down to run it locally. First time went well, second time -- I'm still running the test after a couple minutes. It looks like it's hanging for some reason.
2	Oh and I'm running on a Mac obviously.

I've been studying the libcxx atomic "futex table" implementation in
src/atomic.cpp and include/atomic and I think I have some theories why
there still might be some deadlocks, at least under macOS.

The "futex table" in src/atomic.cpp contains 256 platform native futexes
(something like Linux/OpenBSD/NetBSD futex, macOS __ulock_*, FreeBSD
umtx, Windows WaitOnAddress). This table is needed because some platforms
don't natively support waiting on atomics of sizes 8/16/32/64 bits. On Linux,
for example, futexes are always 32 bit. On these platforms, calling
atomic_wait on an atomic of unsupported size then waits using one of
the table futexes "under the hood".

Each futex of the table has an associated "waiter count" (__contention_state)
which tracks the number of waiters on that futex. This count is used to avoid
calling the platform wake function needlessly.

If the platform _does_ support the size of the atomic natively then the
wait/notify syscalls (e.g. futex) are called directly on the atomic instead
of using one from the table. However, the "waiter count" of the table futex
is still used to minimize native wake up calls.
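
As a rough illustration of the table idea (not the actual code in src/atomic.cpp), the sketch below hashes an address to one of 256 entries and uses the entry's waiter count to decide whether a wake call would be needed. `TableEntry`, `g_table`, `entry_for` and `notify` are hypothetical names, the hash is an arbitrary choice, and the sketch only covers the case where the table futex itself is waited on.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for one slot of the "futex table".
struct TableEntry {
    std::atomic<std::int64_t> waiters{0};  // plays the role of __contention_state
    std::atomic<std::int32_t> futex{0};    // plays the role of the platform futex
};

constexpr std::size_t kTableSize = 256;
static TableEntry g_table[kTableSize];

// Map the address of an arbitrary atomic to one of the 256 entries.
inline TableEntry& entry_for(const void* address) {
    return g_table[std::hash<const void*>()(address) % kTableSize];
}

// Bump the table futex and report whether a platform wake call would be
// issued, i.e. whether the waiter count is non-zero.
inline bool notify(const void* address) {
    TableEntry& e = entry_for(address);
    e.futex.fetch_add(1, std::memory_order_release);
    return e.waiters.load(std::memory_order_seq_cst) != 0;
}
```

Several unrelated atomics may hash to the same entry, which is harmless for correctness: a waiter woken by a notification meant for another atomic simply re-checks its own condition.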

The part of __libcpp_atomic_wait_backoff_impl where the waiting actually
happens looks like this:

auto const __monitor = __libcpp_atomic_monitor(__a);
if(__test_fn())
    return true;
__libcpp_atomic_wait(__a, __monitor);

...so the current value of the futex (either the table one if the size is not
natively supported, or the actual one otherwise) is saved. Then the test
function is called. If the test is unsuccessful (in case of semaphores the
atomic is "0") then we wait on the futex, telling it the __monitor value to
be free of races.

This will work absolutely fine if one of the table futexes is used. Waking one
of them up will always increase the futex value by one, so the futex wait will
instantly return if the current atomic value is different than __monitor
(which indicates that a wakeup must have happened _after_ the call to
__test_fn but before going to sleep). There is a corner case when there are
~4 billion wakeups in this short interval. This is mitigated by waiting for at
most 2s when the native platform futex size is only 32 bit (explained by
Olivier Giroux here: https://github.com/ogiroux/atomic_wait/issues/3).

Now, if the native atomic/futex is used (which is the case on macOS for
counting_semaphore!), deadlocks are _much_ more likely to occur, because now
the futex is no longer a simple counter that always increments, but instead
the "real" value (in case of counting_semaphore just the semaphore value
itself). There might be executions like the following:

- T1 calls acquire()
- T2 calls try_acquire()
- T3 calls release() two times
- semaphore starts with "0"

T1                        T2                        T3              sem
------------------------------------------------------------------------
                                                                     0
------------------------------------------------------------------------
acquire():
sees that sema=0
goes into
__libcpp_atomic_wait_backoff_impl                                    0
------------------------------------------------------------------------
                                                    release()        1
------------------------------------------------------------------------
__monitor=1                                                          1
------------------------------------------------------------------------
                          try_acquire():
                          acquires one
                          successfully,
                          returns true                               0
------------------------------------------------------------------------
__test_fn():
returns false
since sem==0                                                         0
------------------------------------------------------------------------
                                                    release()        1
------------------------------------------------------------------------
calls
__libcpp_atomic_wait(__a, 1);
--> deadlock!                                                        1
------------------------------------------------------------------------

Here, __libcpp_atomic_wait(__a, 0); should have been called, because
__test_fn() made a decision to sleep based on the value "0" of sem. But
instead, the value of __monitor is used which is "1" and leads to this ABA
style problem where T1 sleeps although the semaphore is unlocked.
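
The interleaving above can be replayed deterministically in a single thread. This is only a sketch: `wait_would_block` is a hypothetical model of a futex-style wait (sleep only if the current value equals the expected one), and `Replay` / `replay_interleaving` are illustrative names.

```cpp
#include <cstddef>

// Model of a futex-style wait: the caller goes to sleep only if the current
// value still equals the value it expects.
inline bool wait_would_block(std::ptrdiff_t current, std::ptrdiff_t expected) {
    return current == expected;
}

struct Replay {
    bool blocks_with_monitor;   // waiting on the stale __monitor value
    bool blocks_with_observed;  // waiting on the value __test_fn actually saw
};

inline Replay replay_interleaving() {
    std::ptrdiff_t sem = 0;         // T1 enters the backoff path with sem == 0
    sem += 1;                       // T3: release()
    std::ptrdiff_t monitor = sem;   // T1: __monitor = 1
    sem -= 1;                       // T2: try_acquire() succeeds
    std::ptrdiff_t observed = sem;  // T1: __test_fn() sees 0 and returns false
    sem += 1;                       // T3: release(); T1 is not asleep yet
    return Replay{wait_would_block(sem, monitor),
                  wait_would_block(sem, observed)};
}
```

Waiting with the stale monitor value 1 blocks although the semaphore holds a token, while waiting with the observed value 0 returns immediately.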

I think to solve this problem you could require the "test function" to save its
current understanding of the atomic into an in/out argument so that the atomic
wait function knows which value to use for its second argument. In case of the
semaphores, it could look like this:

     _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
     void acquire()
     {
-        auto const __test_fn = [this]() -> bool {
-            auto __old = __a.load(memory_order_relaxed);
-            return (__old != 0) && __a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed);
+        auto const __test_fn = [this](ptrdiff_t& __old) -> bool {
+            while (true) {
+                if (__old == 0)
+                    return false;
+                if (__a.compare_exchange_weak(__old, __old - 1, memory_order_acquire, memory_order_relaxed))
+                    return true;
+            }
         };
-        __cxx_atomic_wait(&__a.__a_, __test_fn);
+        __cxx_atomic_wait_fn(&__a.__a_, __test_fn, memory_order_relaxed);
     }

For semaphores, the resulting __old is always "0" if "false" is returned. So
the second argument to futex is always "0" which should be correct.
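
Outside of libc++ internals, the in/out idea can be sketched with a plain `std::atomic`. The function name `try_acquire_with_view` is hypothetical; the point is only that the caller's view of the counter goes in, and the value the decision was based on comes back out.

```cpp
#include <atomic>
#include <cstddef>

// `old` carries the caller's current view of the counter in. When the
// function returns false, `old` is 0, which is the value the caller should
// pass to the wait call as the expected value.
inline bool try_acquire_with_view(std::atomic<std::ptrdiff_t>& counter,
                                  std::ptrdiff_t& old) {
    while (true) {
        if (old == 0)
            return false;
        if (counter.compare_exchange_weak(old, old - 1,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed))
            return true;
        // On failure, compare_exchange_weak has reloaded `old`; retry.
    }
}
```

A stale view is corrected by the failed compare-exchange, so the decision to sleep is always made on a value that was actually read from the counter.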

I've been hacking on a patch which is very much WIP but I could post it if
there's interest (and if this analysis even makes sense!).

The second problem is related to the "waiter count" (__contention_state)
which is used to minimize platform wake calls. The only thing that must never
happen is that the wait call happens but the wake call is not executed. But
I think this is currently not guaranteed, at least according to the C11 memory
model. I guess in practice this problem is highly unlikely to appear, but I
think it's still worth fixing to be correct without doubt.

Currently, the __contention_state is incremented/decremented around the
futex wait call like this (in src/atomic.cpp):

__cxx_atomic_fetch_add(__contention_state, __cxx_contention_t(1), memory_order_seq_cst);
// We sleep as long as the monitored value hasn't changed.
__libcpp_platform_wait_on_address(__platform_state, __old_value);
__cxx_atomic_fetch_sub(__contention_state, __cxx_contention_t(1), memory_order_release);

...and the wake is done like this:

if(0 != __cxx_atomic_load(__contention_state, memory_order_seq_cst))
    // We only call 'wake' if we consumed a contention bit here.
    __libcpp_platform_wake_by_address(__platform_state, __notify_one);

To visualize the problem I modeled a simple example with herd7
(http://diy.inria.fr/doc/herd.html). The example is a simple
atomic_notify/atomic_wait execution with two threads "P0" and "P1". "P0" sets
an atomic variable from "0" to "1" and calls atomic_notify, and "P1" just
calls atomic_wait(atomic, 0). So "P1" must never block for long (if it blocks
at all):

C fut

{}

P0 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* do_notify) {
  atomic_store_explicit(atomic, 1, memory_order_relaxed);

  // `atomic_notify(atomic)`
  {
    int tmp = atomic_fetch_add_explicit(futex, 1, memory_order_release);

    *do_notify = atomic_load_explicit(waiters, memory_order_seq_cst) != 0;
  }
}

P1 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* monitor, int *do_block) {
  int expected = 0;

  // `atomic_wait(atomic, expected)`
  {
    *monitor = atomic_load_explicit(futex, memory_order_acquire);

    int actual = atomic_load_explicit(atomic, memory_order_relaxed);
    if (actual == expected) {
      int tmp = atomic_fetch_add_explicit(waiters, 1, memory_order_seq_cst);


      int success = atomic_compare_exchange_strong_explicit(futex,
        monitor, *monitor, memory_order_relaxed, memory_order_relaxed);
      *do_block = success != 0;
    }
  }
}

exists (1:actual=0)

atomic is the atomic for atomic_notify/atomic_wait, futex is the
"backing" futex from the table, and waiters corresponds to
__contention_state.

I used `herd7 -c11 -cat rc11.cat -show prop -graph columns -unshow hb,eco,co
-showinitwrites false -oneinit false -xscale 4.0 -yscale 0.8 -evince
sem.litmus` to generate graphs.

Since herd7 does not support "read-compare-block" operations (like futex) I
modelled the wait with a relaxed, strong "read-compare-exchange" and look at
the return value. If and only if it's true, the futex call would have
blocked. I also elided the fetch_sub operation after the futex wait because
it isn't relevant. I think memory_order_release for the fetch_sub is still
important to avoid compiler reorderings.

herd7 finds an execution (https://ibb.co/zrNFJNZ) where the waker reads "0"
from waiters (so this read is ordered before the fetch_add in the
sequential consistency order "psc") but the "read-compare-block" operation of
the futex wait call reads "0" from futex even though futex was increased by
the waker right before the atomic_load_explicit! Then, the waiter would wait
(because it read "0" and this is the same as expected) and the waker
_wouldn't_ wake (because it read a waiter count of "0"). This really is
mind-bending to me and I think the explanation is that the sequential
consistency order "psc" does not necessarily imply "happens-before", so "P1" is
free to read an "older" value of futex.

To fix this, a "read-don't-modify-write"/"load release" operation (similar to
https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf) can be used to read
the __contention_state variable from the waker:

if(0 != __cxx_atomic_fetch_add(__contention_state, 0, memory_order_release))
    // We only call 'wake' if we consumed a contention bit here.
    __libcpp_platform_wake_by_address(__platform_state, __notify_one);

The seq_cst fetch_add in the waiter can then be weakened to acquire:

__cxx_atomic_fetch_add(__contention_state, __cxx_contention_t(1), memory_order_acquire);
// We sleep as long as the monitored value hasn't changed.
__libcpp_platform_wait_on_address(__platform_state, __old_value);
__cxx_atomic_fetch_sub(__contention_state, __cxx_contention_t(1), memory_order_release);
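
With standard atomics, the "read-don't-modify-write" on the waker side could look roughly like the sketch below. `should_wake` is a hypothetical helper and `int` is an arbitrary counter type. Because `fetch_add(0, ...)` is a read-modify-write operation, it reads the last value in the modification order of the counter, which a plain load is not required to do.

```cpp
#include <atomic>

// Reads the waiter count with an RMW that adds 0, so the stored value is
// left unchanged while the read still takes part in the modification order.
inline bool should_wake(std::atomic<int>& waiters) {
    return waiters.fetch_add(0, std::memory_order_release) != 0;
}
```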

Here is the herd7 model of the "read-don't-modify-write" version:

C fut

{}

P0 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* do_notify) {
  atomic_store_explicit(atomic, 1, memory_order_relaxed);

  // `atomic_notify(atomic)`
  {
    int tmp = atomic_fetch_add_explicit(futex, 1, memory_order_release);

    *do_notify = atomic_fetch_add_explicit(waiters, 0, memory_order_release) != 0;
  }
}

P1 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* monitor, int *do_block) {
  int expected = 0;

  // `atomic_wait(atomic, expected)`
  {
    *monitor = atomic_load_explicit(futex, memory_order_acquire);

    int actual = atomic_load_explicit(atomic, memory_order_relaxed);
    if (actual == expected) {
      int tmp = atomic_fetch_add_explicit(waiters, 1, memory_order_acquire);


      int success = atomic_compare_exchange_strong_explicit(futex,
        monitor, *monitor, memory_order_relaxed, memory_order_relaxed);
      *do_block = success != 0;
    }
  }
}

exists (1:actual=0)

This model disallows the execution where do_block is "1" and do_notify is
"0".

Thanks a lot for your analysis @jiixyj! Your analysis of the ABA problem makes sense to me.

To summarize, the current acquire() implementation does this:

auto const __test_fn = [this]() -> bool {
    auto __old = __a.load(memory_order_relaxed);
    return (__old != 0) && __a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed);
};
__cxx_atomic_wait(&__a.__a_, __test_fn);

where __cxx_atomic_wait does this:

auto const __monitor = __libcpp_atomic_monitor(__a);
if(__test_fn())
    return true;
__libcpp_atomic_wait(__a, __monitor);

i.e. __cxx_atomic_wait loads the old value, and then the test function loads it again. The value might have changed in between, so the test function tests a different value than the one __cxx_atomic_wait then waits on. In particular, __cxx_atomic_wait might read 1, __test_fn might read 0 and therefore return false, and then the value of __a.__a_ might concurrently be updated to 1 before __cxx_atomic_wait enters the wait with an old value of 1, hence causing a deadlock/lost wakeup.

So I think the main problem here is that __cxx_atomic_wait loads the value, and then __test_fn has to load it again. The fix proposed by @jiixyj, namely to have __cxx_atomic_wait pass the value of __monitor as a reference into __test_fn, appears sensible to me; what do you think, @ldionne?

Herald added a project: Restricted Project. Apr 13 2022, 9:11 AM

@fwolff @jiixyj

Woah, I somehow missed the comment on December 14. What an analysis -- my mind is blown TBH, thank you so much @jiixyj for looking into it like that.

If I followed correctly, then yes it makes sense to me, and yes the proposed fix also makes sense. Also, if you re-upload a patch, you should be able to see if it hangs forever (we now enforce a timeout in our CI).

libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp
16–17	This can be removed now.
19

In D114119#3449098, @ldionne wrote:

Also, if you re-upload a patch, you should be able to see if it hangs forever (we now enforce a timeout in our CI).

OK, I've implemented the change now, fingers crossed...

fwolff marked 5 inline comments as done. Apr 14 2022, 7:04 AM

Harbormaster completed remote builds in B159684: Diff 422851. Apr 14 2022, 7:29 AM

[Github PR transition cleanup]

Commandeering to rebase.

Revision Contents

Path	Size
libcxx/include/atomic	27 lines
libcxx/include/latch	2 lines
libcxx/include/semaphore	23 lines
libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp	64 lines

Diff 422851

libcxx/include/atomic

 _LIBCPP_AVAILABILITY_SYNC _LIBCPP_EXPORTED_FROM_ABI void __cxx_atomic_notify_one(__cxx_atomic_contention_t const volatile*);
 _LIBCPP_AVAILABILITY_SYNC _LIBCPP_EXPORTED_FROM_ABI void __cxx_atomic_notify_all(__cxx_atomic_contention_t const volatile*);
 _LIBCPP_AVAILABILITY_SYNC _LIBCPP_EXPORTED_FROM_ABI __cxx_contention_t __libcpp_atomic_monitor(__cxx_atomic_contention_t const volatile*);
 _LIBCPP_AVAILABILITY_SYNC _LIBCPP_EXPORTED_FROM_ABI void __libcpp_atomic_wait(__cxx_atomic_contention_t const volatile*, __cxx_contention_t);

 template <class _Atp, class _Fn>
 struct __libcpp_atomic_wait_backoff_impl {
     _Atp* __a;
-    _Fn __test_fn;
+    _Fn __check_fn;
     _LIBCPP_AVAILABILITY_SYNC
     _LIBCPP_INLINE_VISIBILITY bool operator()(chrono::nanoseconds __elapsed) const
     {
         if(__elapsed > chrono::microseconds(64))
         {
-            auto const __monitor = __libcpp_atomic_monitor(__a);
-            if(__test_fn())
+            __cxx_contention_t __monitor = __libcpp_atomic_monitor(__a);
+            if(__check_fn(__monitor))
                 return true;
             __libcpp_atomic_wait(__a, __monitor);
         }
         else if(__elapsed > chrono::microseconds(4))
             __libcpp_thread_yield();
         else
             {} // poll
         return false;
     }
 };

-template <class _Atp, class _Fn>
+template <class _Atp, class _TFn, class _CFn>
 _LIBCPP_AVAILABILITY_SYNC
-_LIBCPP_INLINE_VISIBILITY bool __cxx_atomic_wait(_Atp* __a, _Fn && __test_fn)
+_LIBCPP_INLINE_VISIBILITY bool __cxx_atomic_wait(_Atp* __a, _TFn && __test_fn, _CFn && __check_fn)
 {
-    __libcpp_atomic_wait_backoff_impl<_Atp, typename decay<_Fn>::type> __backoff_fn = {__a, __test_fn};
+    __libcpp_atomic_wait_backoff_impl<_Atp, typename decay<_CFn>::type> __backoff_fn = {__a, __check_fn};
     return __libcpp_thread_poll_with_backoff(__test_fn, __backoff_fn);
 }

 #else // _LIBCPP_HAS_NO_PLATFORM_WAIT

 template <class _Tp>
 _LIBCPP_INLINE_VISIBILITY void __cxx_atomic_notify_all(__cxx_atomic_impl<_Tp> const volatile*) { }
 template <class _Tp>
 _LIBCPP_INLINE_VISIBILITY void __cxx_atomic_notify_one(__cxx_atomic_impl<_Tp> const volatile*) { }
-template <class _Atp, class _Fn>
-_LIBCPP_INLINE_VISIBILITY bool __cxx_atomic_wait(_Atp*, _Fn && __test_fn)
+template <class _Atp, class _TFn, class _CFn>
+_LIBCPP_INLINE_VISIBILITY bool __cxx_atomic_wait(_Atp*, _TFn && __test_fn, _CFn &&)
 {
 #if defined(_LIBCPP_HAS_NO_THREADS)
     using _Policy = __spinning_backoff_policy;
 #else
     using _Policy = __libcpp_timed_backoff_policy;
 #endif
     return __libcpp_thread_poll_with_backoff(__test_fn, _Policy());
 }

 #endif // _LIBCPP_HAS_NO_PLATFORM_WAIT

 template <class _Atp, class _Tp>
 struct __cxx_atomic_wait_test_fn_impl {
     _Atp* __a;
     _Tp __val;
     memory_order __order;
     _LIBCPP_INLINE_VISIBILITY bool operator()() const
     {
         return !__cxx_nonatomic_compare_equal(__cxx_atomic_load(__a, __order), __val);
     }
 };

+struct __cxx_atomic_wait_check_fn_impl {
+    __cxx_contention_t __val;
+    _LIBCPP_INLINE_VISIBILITY bool operator()(__cxx_contention_t __cur) const
+    {
+        return !__cxx_nonatomic_compare_equal(__cur, __val);
+    }
+};
+
 template <class _Atp, class _Tp>
 _LIBCPP_AVAILABILITY_SYNC
 _LIBCPP_INLINE_VISIBILITY bool __cxx_atomic_wait(_Atp* __a, _Tp const __val, memory_order __order)
 {
     __cxx_atomic_wait_test_fn_impl<_Atp, _Tp> __test_fn = {__a, __val, __order};
-    return __cxx_atomic_wait(__a, __test_fn);
+    __cxx_atomic_wait_check_fn_impl __check_fn = {static_cast<__cxx_contention_t>(__val)};
+    return __cxx_atomic_wait(__a, __test_fn, __check_fn);
 }

 // general atomic<T>

 template <class _Tp, bool = is_integral<_Tp>::value && !is_same<_Tp, bool>::value>
 struct __atomic_base // false
 {
     mutable __cxx_atomic_impl<_Tp> __a_;

libcxx/include/latch

     {
         return 0 == __a.load(memory_order_acquire);
     }
     inline _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
     void wait() const
     {
         __cxx_atomic_wait(&__a.__a_, [&]() -> bool {
             return try_wait();
+        }, [](__cxx_contention_t __cur) -> bool {
+            return __cur == 0;
         });
     }
     inline _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
     void arrive_and_wait(ptrdiff_t __update = 1)
     {
         count_down(__update);
         wait();
     }

libcxx/include/semaphore

	Show First 20 Lines • Show All 82 Lines • ▼ Show 20 Lines

	public:			public:
	_LIBCPP_INLINE_VISIBILITY			_LIBCPP_INLINE_VISIBILITY
	constexpr explicit __atomic_semaphore_base(ptrdiff_t __count) : __a(__count)			constexpr explicit __atomic_semaphore_base(ptrdiff_t __count) : __a(__count)
	{			{
	}			}
	_LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY			_LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
	void release(ptrdiff_t __update = 1)			void release(ptrdiff_t __update = 1)
	{			{
	if(0 < __a.fetch_add(__update, memory_order_release))			if(__a.fetch_add(__update, memory_order_release) == 0)
	;			// Always notify all, regardless of the value of __update
	else if(__update > 1)			// (see https://llvm.org/PR47013)
	__a.notify_all();			__a.notify_all();
	else
	__a.notify_one();
	}			}
				ldionneAuthorUnsubmitted Done Reply Inline Actions We might want to reformulate this as if(__a.fetch_add(__update, memory_order_release) == 0) __a.notify_all(); This seems easier to understand. ldionne: We might want to reformulate this as ``` if(__a.fetch_add(__update, memory_order_release) ==…
      _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
      void acquire()
      {
          auto const __test_fn = [this]() -> bool {
              auto __old = __a.load(memory_order_relaxed);
              return (__old != 0) && __a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed);
          };
-         __cxx_atomic_wait(&__a.__a_, __test_fn);
+         auto const __check_fn = [this](__cxx_contention_t & __monitor) -> bool {
+             ptrdiff_t __old = __monitor;
+             bool __r = __try_acquire_impl(__old);
+             __monitor = static_cast<__cxx_contention_t>(__old);
+             return __r;
+         };
+         __cxx_atomic_wait(&__a.__a_, __test_fn, __check_fn);
      }
      template <class Rep, class Period>
      _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
      bool try_acquire_for(chrono::duration<Rep, Period> const& __rel_time)
      {
          if (__rel_time == chrono::duration<Rep, Period>::zero())
              return try_acquire();
          auto const __test_fn = [this]() { return try_acquire(); };
          return __libcpp_thread_poll_with_backoff(__test_fn, __libcpp_timed_backoff_policy(), __rel_time);
      }
      _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
      bool try_acquire()
      {
          auto __old = __a.load(memory_order_acquire);
+         return __try_acquire_impl(__old);
+     }
+
+ private:
+     _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
+     bool __try_acquire_impl(ptrdiff_t & __old)
+     {
          while (true) {
              if (__old == 0)
                  return false;
              if (__a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed))
                  return true;
          }
      }
  };

libcxx/test/std/thread/thread.semaphore/lost_wakeup.pass.cpp

This file was added.

//===----------------------------------------------------------------------===//

ldionne (Author) · Unsubmitted · Done

Do you think this test is the reason why the CI nodes timed out after running for 2 hours? It seems like it's not consistently slow (I just restarted the macOS CI and it finished quickly, but before that it froze for 2 hours, which I've never seen). https://buildkite.com/llvm-project/libcxx-ci/builds/6835#4278d152-fa4d-482f-a12e-e9697caa99e3

fwolff · Unsubmitted · Done

I don't see what could be causing this, but I agree that it looks suspicious.

The fix itself should be uncontroversial, though, at least as far as correctness is concerned, because we are always waking up at least as many threads as before, so if anything, we're making deadlocks less likely to occur.

What options do we have regarding this test? Could the problem also lie with, say, std::barrier?

ldionne (Author) · Unsubmitted · Done

I pulled the patch down to run it locally. The first run went well; on the second run, the test is still going after a couple of minutes. It looks like it's hanging for some reason.

ldionne (Author) · Unsubmitted · Done

Oh, and I'm running on a Mac, obviously.

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// UNSUPPORTED: libcpp-has-no-threads

// UNSUPPORTED: c++03, c++11, c++14, c++17

Quuxplusone · Unsubmitted · Done

I assume that REQUIRES: long_tests is a tradeoff: faster tests in most configurations, but this specific test runs in only one (or a few) configurations; right?
This test takes only 1.3 seconds on my laptop. Maybe we should just knock the iteration count down from 100'000 to 10'000 (so it takes 0.13 seconds) and then we can definitely remove this REQUIRES. I defer to @ldionne here.

ldionne (Author) · Unsubmitted · Done

REQUIRES: long_tests was used for slow devices like simulators. It's not really maintained anymore.

By default, it is enabled. I'm OK with the current state, i.e. we always run this test.

// This test requires the dylib support introduced in D68480, which shipped in macOS 11.0.

ldionne (Author) · Unsubmitted · Done

You will need to add:

// This test requires the dylib support introduced in D68480, which shipped in macOS 11.0.
// XFAIL: use_system_cxx_lib && target={{.+}}-apple-macosx10.{{9|10|11|12|13|14|15}}

You will also need to add:

// TODO(ldionne): This test fails on Ubuntu Focal on our CI nodes (and only there), in 32 bit mode.
// UNSUPPORTED: linux && 32bits-on-64bits

Sorry, I still have not figured this one out.

// XFAIL: use_system_cxx_lib && target={{.+}}-apple-macosx10.{{9|10|11|12|13|14|15}}

// This is a regression test for https://llvm.org/PR47013.

// <semaphore>

ldionne (Author) · Unsubmitted · Done

This can be removed now.

#include <barrier>

ldionne (Author) · Unsubmitted · Done

// UNSUPPORTED: linux && 32bits-on-64bits
- // This is a regression test for PR#47013.
+ // This is a regression test for https://llvm.org/PR47013
// <semaphore>

#include <semaphore>

#include <thread>

#include <vector>

#include "make_test_thread.h"

static std::counting_semaphore s(0);

static std::barrier b(8 + 1);

void acquire() {

for (int i = 0; i < 10'000; ++i) {

s.acquire();

b.arrive_and_wait();

}

}

void release() {

for (int i = 0; i < 10'000; ++i) {

Quuxplusone · Unsubmitted · Done

Throughout: please int [ij] = 0 instead of auto [ij] = 0.
Also, I find this nested-loop version much harder to grok at first sight (I mean, maybe just because I wrote the other version myself ;)). IMO it would help to unroll the j loop.
I guess the real question is, does the change from "2 acquiring threads, 2 releases in 1 releasing thread" to "8 acquiring threads, 8 releases in 2 releasing threads" make the buggy situation more or less likely to occur? From the description in https://bugs.llvm.org/show_bug.cgi?id=47013#c1 , I think that increasing the number of waiting acquirers is good (because each adjacent pair of acquirers has a chance to race with each other, so we have 7x the chances to reproduce in each loop iteration), but I don't see how increasing the number of releasers helps. If you think it'd be harmless to go back to just one releaser (keeping all 8 acquirers, and ideally still unrolling the loop), that would simplify this code IMHO.

fwolff · Unsubmitted · Done

It's hard to say. From my understanding, the releases have to happen in quick enough succession that none of the acquirer threads gets around to decrementing the variable in between. This might be more likely to happen with two releaser threads, but maybe not, because that also increases the contention on the atomic variable. I really don't know either, so I've implemented your suggestion for now.

s.release(1);

s.release(1);

s.release(1);

s.release(1);

s.release(1);

s.release(1);

s.release(1);

s.release(1);

b.arrive_and_wait();

}

}

int main(int, char**) {

std::vector<std::thread> threads;

ldionne (Author) · Unsubmitted · Done

for (int i = 0; i < 8; ++i)
- threads.emplace_back(acquire);
+ threads.push_back(support::make_test_thread(acquire));
- threads.emplace_back(release);
+ threads.emplace_back(support::make_test_thread(release));
for (auto& thread : threads)

support::make_test_thread is used for some freestanding configurations where there is no way to create a thread without providing a stack size. In those configurations, support::make_test_thread can be used to pass a stack size.

That isn't covered by CI yet, so you couldn't have noticed.
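For readers unfamiliar with the helper, the idea can be sketched roughly as follows. This is an assumption about its general shape, not the contents of libc++'s make_test_thread.h: the namespace `sketch_support` and the `count_started_threads` function are hypothetical, and the default configuration is assumed to simply forward to `std::thread`. A configuration that needs an explicit stack size could then change this one function rather than every test.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>
#include <vector>

namespace sketch_support { // hypothetical; mirrors the idea, not the real header

// Single place where tests create threads. A platform that requires a stack
// size could construct the thread differently here without touching the tests.
template <class F, class... Args>
std::thread make_test_thread(F&& f, Args&&... args) {
    return std::thread(std::forward<F>(f), std::forward<Args>(args)...);
}

} // namespace sketch_support

// Hypothetical usage: start n threads through the helper and count how many ran.
int count_started_threads(int n) {
    std::atomic<int> ran{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.push_back(sketch_support::make_test_thread([&ran] { ++ran; }));
    for (auto& t : threads)
        t.join();
    return ran.load();
}
```

Because the helper returns a `std::thread` by value, `push_back` moves it into the vector, which matches how the suggested edit uses it.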

for (int i = 0; i < 8; ++i)

threads.push_back(support::make_test_thread(acquire));

threads.push_back(support::make_test_thread(release));

for (auto& thread : threads)

thread.join();

return 0;

}
