
Hold mutex lock while notify_all is called at notify_all_at_thread_exit
AcceptedPublic

Authored by Quuxplusone on Jul 10 2021, 5:19 AM.

Details

Reviewers
vinipsmaker
ldionne
post.kadirselcuk
lewissbaker
Group Reviewers
Restricted Project
Summary

Releasing the mutex before the call to notify_all is an optimization.
That optimization cannot be used here: the thread waiting on the
condition might destroy the associated resources (mutex + condition
variable), and the notifier thread would then access a destroyed
object (the condition variable). In fact, notify_all_at_thread_exit
exists precisely to allow joining on detached threads, and the waiting
thread doesn't expect the notifier thread to access any further shared
resources, which makes this scenario very likely. The waiting thread
might also wake spuriously when the mutex lock is released. Reordering
the two calls is necessary to prevent this race.

Further details can be found at
https://cplusplus.github.io/LWG/issue3343.

Diff Detail

Event Timeline

vinipsmaker requested review of this revision.Jul 10 2021, 5:19 AM
vinipsmaker created this revision.
Herald added a reviewer: Restricted Project. · View Herald TranscriptJul 10 2021, 5:19 AM
vinipsmaker edited the summary of this revision. (Show Details)Jul 10 2021, 5:21 AM

LGTM based on a quick read of that LWG issue. @ldionne, WDYT?

post.kadirselcuk planned changes to this revision.Jul 10 2021, 10:53 PM
post.kadirselcuk accepted this revision.
xgupta added a subscriber: xgupta.Jul 10 2021, 11:09 PM

Kadir, I guess it is enough for today 😂

I disabled post.kadirselcuk who is apparently spamming the website.

ldionne requested changes to this revision.EditedJul 12 2021, 8:01 AM

The change does look legit, but the submitter not so much. We should:

  1. Add a test
  2. Update the LWG status page

Edit Oh, sorry, I thought the submitter was the spammer, not @vinipsmaker. In all cases, my comments above hold but let's move forward with this.

Update the LWG status page

Given you'll spend time with the committee anyway, could one of you also write a related unlock_at_thread_exit? The use case is described at https://lists.isocpp.org/std-proposals/2021/07/2809.php.

Quuxplusone commandeered this revision.Jul 20 2021, 8:36 AM
Quuxplusone edited reviewers, added: post.kadirselcuk; removed: Quuxplusone.

Commandeering to add a test and update the LWG status page, per ldionne's request.

tambre added a subscriber: tambre.Jul 20 2021, 8:42 AM

Commandeering to add a test and update the LWG status page, per ldionne's request.

What exactly does "commandeering" refer to here? I don't understand this comment.

@vinipsmaker wrote:

What exactly does "commandeering" refer to here?

It means I'm taking over the responsibility of updating this patch, i.e., it's "mine" now instead of "yours" (as far as Phabricator is concerned). Louis's impression was that you weren't planning to follow up on adding tests and updating the status page, so he asked me to commandeer the patch. However, if you want it back so that you can add tests and status-page update, then I have no problem with you commandeering it back. I'm not planning to actively work on it for a few hours at least. :)

Btw, "could one of you also write a related unlock_at_thread_exit?" — I agree with your rationale and think such a function ought to exist, but I don't think @ldionne would consent to add it to libc++ unless it was already in the Standard (or at least tentatively ready for adoption by the Standard), and I'm not personally willing to write the paper and champion it and so on. But I think it's a good idea for someone (not me) to write that paper. :)

It means I'm taking over the responsibility of updating this patch, i.e., it's "mine" now instead of "yours" (as far as Phabricator is concerned).

Okay, I see. Thanks.

Louis's impression was that you weren't planning to follow up on adding tests and updating the status page, so he asked me to commandeer the patch.

His impression is correct. I just want to solve my problem. I want to stay as far away as possible from C++ committees. Thanks for taking care of this.

[...] I agree with your rationale and think such a function [unlock_at_thread_exit] ought to exist, but I don't think @ldionne would consent to add it to libc++ unless it was already in the Standard [...] But I think it's a good idea for someone (not me) to write that paper. :)

I didn't express myself clearly. A paper is actually what I was suggesting. Several of you are already in constant contact with the C++ committee. I just want to solve my problem. I don't want my name in the standard nor any contact with the C++ committee. Just like yourself... I don't want to write such a paper.

Looks like this function won't see the light of day, but that's okay.

Quuxplusone added inline comments.
libcxx/src/thread.cpp
162

Btw, update: I've emailed @lewissbaker to ask why the proof-of-concept wandbox from https://cplusplus.github.io/LWG/issue3343 (namely, https://wandbox.org/permlink/eUu3eiQbLl7JQKMm ) doesn't reproduce anymore. It's trivial to turn that code snippet into a libc++ unit test; but since it's already green on trunk, it wouldn't be very useful.
He hasn't gotten back to me yet. But this PR is still on my radar!

xgupta removed a subscriber: xgupta.Jul 21 2021, 8:07 PM
Quuxplusone added a reviewer: lewissbaker.

Still haven't heard from @lewissbaker , so this did fall off my radar. At this point I'm inclined to say we should just commit the fix, presumably without the attached test case (because the test case doesn't actually reproduce the bug with libc++ anyway). However, I'm showing the test case in this PR right now, just so that people see what I tried.
Again, this test case does not reproduce the bug: it's green both before and after the proposed change.

At this point I'm inclined to say we should just commit the fix, presumably without the attached test case

Well... I can't even use notify_all_at_thread_exit in its current state, and my workaround isn't general enough to cover all cases. Tests should exist to ensure the library is correct, not the other way around. If a test case is just going to guarantee that libc++ stays buggy... I don't see the point of having a test case at all.

@vinipsmaker wrote:

If a test case is just going to guarantee libc++ will be buggy...

Well, this test case is green both before and after the change. So it wouldn't "guarantee libc++ will be buggy"; it simply would "not guarantee libc++ won't be buggy."

And actually it looks like this test case does fail with GCC 11! https://reviews.llvm.org/harbormaster/unit/view/943944/
Is that bot using libstdc++ instead of libc++? or is something weird going on here?

@ldionne, what are your thoughts?
Situation at the moment: This apparently fixes a fairly important bug in <thread>. The attached test case is supposed to detect the bug, but AFAICT it does not. The wording for the fix is proposed in LWG3343 but has not yet been adopted into any draft.
I think the obvious alternatives are "Land this to fix the bug and get it over with," or "Continue sitting on this indefinitely until LWG3343 gets adopted." Either way, we'll have to grapple with the fact that the test case doesn't seem to reproduce the bug.

(Also, re-ping: @lewissbaker , but I don't think he's listening)

ldionne accepted this revision.Aug 24 2021, 1:58 PM

LGTM, but...

Is that bot using libstdc++ instead of libc++? or is something weird going on here?

No, it should be using libc++ (which you can validate by looking at the command-line used to compile the test). Are you able to reproduce in Docker with this?

$ ./libcxx/utils/ci/run-buildbot-container
$ ./libcxx/utils/ci/run-buildbot generic-gcc-cxx11
This revision is now accepted and ready to land.Aug 24 2021, 1:58 PM
lewissbaker added inline comments.Aug 27 2021, 8:36 PM
libcxx/src/thread.cpp
162

One approach you could take to verify that this fix works is to put a sleep between the unlock() and notify_all() calls, in both the before and after code, to increase the chances that the other thread acquires the lock, sees the state that says the thread is done, and then destroys the CV before the exiting thread calls notify_all().

It's obviously not something you'd leave in the code, but it could serve to give higher confidence in the fix.

libcxx/test/std/thread/thread.condition/notify_all_at_thread_exit_lwg3343.pass.cpp
62 ↗(On Diff #365644)

We are trying to make the cv destructor race with the call to notify_all().
This is generally only going to happen if the main thread happens to see threads_active as 0 between the lock being released and the cv being notified.

You could possibly increase the likelihood of a race by taking out a lock on the mutex before launching the threads.
The idea is that threads that launch quickly will wait to acquire the lock, and once the main thread releases the lock inside the call to cv.wait(), the other threads all start at once.

I'd also consider removing the sleep_for() in the threads to encourage some more raciness of all of the threads exiting at the same time.
Really, it's only ever going to be the last thread to execute that will potentially run into the lifetime issue.

Another approach could be to have the main thread simply spin, acquiring the lock, looking at the threads_active and then releasing the lock and having it continue on and destroy the CV once it sees that no more threads are active.
i.e. just do a spin-loop version of cv.wait() without actually calling cv.wait()

If you do all of this in conjunction with inserting a sleep_for() call in the implementation of notify_all_at_thread_exit() between the unlock and notify_all calls (just for testing) then you should be able to reproduce the issue.