This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] [test] Make the condvar wait_for tests less brittle
ClosedPublic

Authored by mstorsjo on Mar 23 2021, 5:03 AM.

Details

Summary

These seem to fail occasionally (they are marked as possibly requiring
a retry).

When doing a condvar wait_for(), it can wake up before the timeout
as a spurious wakeup. In these cases, the wait_for() method returns that
the timeout wasn't hit, and the test reruns another wait_for().

On Windows, it seems like the wait_for() operation often can end up
returning slightly before the intended deadline - when intending to
wait for 250 milliseconds, it can return after e.g. 235 milliseconds.
In these cases, the wait_for() doesn't indicate a timeout.

Previously, the test then reran a new wait_for() for a full 250
milliseconds each time. So for N consecutive wakeups slightly too early,
we'd wait for (N+1)*250 milliseconds. Now it only reruns wait_for() for
the remaining intended wait duration.

Diff Detail

Event Timeline

mstorsjo requested review of this revision.Mar 23 2021, 5:03 AM
mstorsjo created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptMar 23 2021, 5:03 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Quuxplusone requested changes to this revision.Mar 23 2021, 6:51 AM
Quuxplusone added a subscriber: Quuxplusone.
Quuxplusone added inline comments.
libcxx/test/std/thread/thread.condition/thread.condition.condvar/wait_for.pass.cpp
47–52

What happens with the subtraction when wait_end < Clock::now()?
(This could happen because of a libc++ bug; or more likely it could happen because wait_for woke spuriously after 249.9ms and then it took us 0.2ms to get around to calling Clock::now() again.)

But also, there's no way this code should ever be intentionally waiting 250ms even once. The main thread is just waiting for line 44's test1=1; to happen, and then it immediately sets test2=1 and notifies (line 76).

I think the only way that our line 48/49 can ever time out if if it entirely misses the main thread's notify due to not being asleep yet. Which actually probably happens a lot. But if the point of this test is to time out, then the main thread really shouldn't be notifying at all ever (and the timeout should probably be shorter, say 50ms). And if that's not the point of this test, then the main thread should be notifying only after f has gone to sleep, so the notify doesn't get missed. I think the straightforward way to do that is to swap lines 75+76 so that the main thread notifies under the lock. (That way, there's no semantic race condition where f spuriously wakes, then main notifies (without holding the lock) which races with f's going back to sleep (under the lock).

This revision now requires changes to proceed.Mar 23 2021, 6:51 AM
mstorsjo added inline comments.Mar 23 2021, 7:09 AM
libcxx/test/std/thread/thread.condition/thread.condition.condvar/wait_for.pass.cpp
47–52

Ok, regarding the subtraction, you're possibly right, but for the rest of the test, you're missing what's happening. Please back off and re-read the test. I would say it's mostly correct as is, except for the repeated timeouts.

Trying to respond to some of your points:

  • You claim "Then the main thread immediately sets test2=1 and notifies" No, it doesn't always. The main thread does two tests. First the main thread waits for the thread to start up (line 72), sets test2 (74) and notifies (76). Then it repeats the whole experiment, without setting test2 and without notifying (lines 81-90). This is the part of the test that actually exercises the timeouts.
  • You claim "I think the only way that our line 48/49 can ever time out if if it entirely misses the main thread's notify due to not being asleep yet." No, it can't miss that, because the main thread first locks the mutex (68), then enters a wait loop (71-72). At this point the mutex is released, and the thread can proceed. Once the thread has proceeded past lines 44-45 (setting the test1 flag and notifying) and releasing the mutex while waiting in line 49, the main thread is unblocked and can proceed to set the flag. So there's no way that the thread can miss the notify in the way you describe.
  • You claim "then the main thread should be notifying only after f has gone to sleep", well it does! The main thread and f-thread are mutually exclusive, and main waits at line 72 until the thread has reached its waiting point.

Now back to the topic of the test: How would you suggest to change the wait duration?

You claim "Then the main thread immediately sets test2=1 and notifies" No, it doesn't always.

Ah, I did miss that main did two tests (one without timeout, then one with timeout). You're definitely right that in the one expecting a timeout, spurious wakeups can prolong the test indefinitely. Setting an "end time" as you've done is probably the best solution for that.

If I rewrote this test, I guess it'd look something like this. I admit the do-while is a little fiddly, even if you replace it with a while. Getting rid of the runs variable is an unmitigated win.

void f(bool expect_timeout)
{
    typedef std::chrono::system_clock Clock;
    typedef std::chrono::milliseconds milliseconds;
    std::unique_lock<std::mutex> lk(mut);
    assert(test2 == 0);
    test1 = 1;
    cv.notify_one();
    Clock::time_point t0 = Clock::now();
    Clock::time_point t_end = t0 + milliseconds(100);
    std::cv_status status;
    do {
        Clock::time_point t_now = Clock::now();
        if (t_now >= t_end) break;
        status = cv.wait_for(lk, t_end - t_now);
    } while (test2 == 0 && status == std::cv_status::no_timeout);
    if (expect_timeout)
    {
        assert(status == std::cv_status::timeout);
        assert(test2 == 0);
        assert(Clock::now() >= t_end);
    }
    else
    {
        assert(status == std::cv_status::no_timeout);
        assert(test2 == 1);
        assert(Clock::now() < t_end); // in fact practically instantaneous
    }
}

int main(int, char**)
{
    {
        std::unique_lock<std::mutex> lk(mut);
        std::thread t = support::make_test_thread(f, false);
[...]
mstorsjo updated this revision to Diff 332693.Mar 23 2021, 9:13 AM

Expanded the loops to check for now >= end and breaking, before calling wait_for again.

Ping, this one crops up every now and then when testrunning the Windows CI

LGTM!

libcxx/test/std/thread/thread.condition/thread.condition.condvar/wait_for.pass.cpp
54

FWIW, I still think it'd be nice to rename runs == 0 to !expect_timeout.

mstorsjo added inline comments.Mar 31 2021, 7:45 AM
libcxx/test/std/thread/thread.condition/thread.condition.condvar/wait_for.pass.cpp
54

Sure, I can make a separate change for that after landing this (after a second review).

@curdeius - Can you give this a second approval? It'd be good for avoiding false alarms in the windows CI :-)

curdeius added inline comments.Apr 1 2021, 2:28 AM
libcxx/test/std/thread/thread.condition/thread.condition.condvar/wait_for.pass.cpp
49–52

How about this to fix the possible negative duration?

mstorsjo updated this revision to Diff 334627.Apr 1 2021, 2:40 AM

Changed to use a local variable for the duration. The variable used within the while clause needs to be declared outside of the loop, and thus can't use auto. But that allows removing the variable now from outside the loop.

curdeius accepted this revision.Apr 1 2021, 2:44 AM

LGTM.

This revision is now accepted and ready to land.Apr 1 2021, 2:44 AM
Quuxplusone added inline comments.Apr 1 2021, 9:13 AM
libcxx/test/std/thread/thread.condition/thread.condition.condvarany/wait_for.pass.cpp
54

I instinctively object to code that produces negative durations (e.g. milliseconds(-1)).
If you can show me where the standard says milliseconds(-1) is definitely supported, then I'll reluctantly relent — but expect me to forget that within a year and be back to my mental model that "you can't have a negative duration." :)
With my current mental model, I'd like to see a fix where we never subtract x - y until we've verified that x >= y.

mstorsjo added inline comments.Apr 1 2021, 11:33 AM
libcxx/test/std/thread/thread.condition/thread.condition.condvarany/wait_for.pass.cpp
54

https://eel.is/c++draft/time.clock.system#members-1 seems to say that the rep type for system_clock is a signed type.

Quuxplusone accepted this revision.Apr 1 2021, 11:35 AM

@Quuxplusone, all standard duration types are required to use a signed integer type as representation.
Cf. https://en.cppreference.com/w/cpp/chrono/duration, http://eel.is/c++draft/time.syn (convenience typedefs).

This revision was landed with ongoing or failed builds.Apr 1 2021, 11:45 AM
This revision was automatically updated to reflect the committed changes.