This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] Fix a data race in call_once
ClosedPublic

Authored by kubamracek on Aug 30 2016, 7:15 AM.

Details

Summary

call_once is using relaxed atomic load to perform double-checked locking, which contains a data race. The fast-path load has to be an acquire atomic load.

Diff Detail

Event Timeline

kubamracek updated this revision to Diff 69676.Aug 30 2016, 7:15 AM
kubamracek retitled this revision from to [libcxx] Fix a data race in call_once.
kubamracek updated this object.
kubamracek added subscribers: zaks.anna, dcoughlin.
dvyukov added inline comments.Aug 30 2016, 7:31 AM
include/memory
674

Do we care about these platforms? This code is not correct.
I understand that this particular patch does not make things worse, but people can use this primitive in other places assuming that it works.

src/mutex.cpp
201–202

The first part of this comment does not make sense:
"Changes to flag are done via relaxed atomic stores, because they're protected by a mutex;"
If they are protected by a mutex, why do we use atomics?

compnerd added inline comments.
include/memory
674

Well, in a unthreaded environment, I don't see how this is incorrect. Perhaps you mean do we care about threaded environments without the atomics. In the latter, I suspect not, since during the configuration phase, we explicitly check how to get access to the atomics. But, I would defer to @mclow.lists or @EricWF on that point.

src/mutex.cpp
201–202

This comment makes sense; however, it is poorly phrased. What its getting at is that we can avoid the fences around this access due to the mutex guarding. This comment is no longer valid though since the relaxed store is being modified below though. Id say this comment can just go.

dvyukov added inline comments.Aug 30 2016, 8:00 AM
include/memory
674

I meant the "defined(ATOMIC_ACQUIRE) && (has_builtin(__atomic_load_n) || _GNUC_VER >= 407)" part.

src/mutex.cpp
201–202

Agree.
As demonstrated writing of such comments does not help to get the code right :)

kubamracek updated this revision to Diff 69686.Aug 30 2016, 8:04 AM

Okay, removing the comment.

EricWF edited edge metadata.Sep 1 2016, 11:10 AM

Is it possible to write a test case for this that TSAN would diagnose? If so I would like to see that added.

call_once is using relaxed atomic load to perform double-checked locking, which contains a data race. The fast-path load has to be an acquire atomic load.

Am I correct in saying this isn't a correctness issue? If a data race occurs in call_once it will be handled by the if (flag == 0) check in __call_once.

Instead I'm assuming this patch is intended as a performance improvement?

dvyukov edited edge metadata.Sep 1 2016, 12:25 PM

This is a correctness issue -- user state is not properly synchronized.

This is a correctness issue -- user state is not properly synchronized.

Can you provide an example execution that demonstrates the correctness issues?

I cannot seem to see such a case. I understand there is a data race between the relaxed loads/stores but I fail to see how that causes a synchronization issues. If a thread 'A' fails to observe the relaxed store in call_once it will still observe it inside __call_once after taking the mutex.

If a thread 'A' fails to observe the relaxed store in call_once it will still observe it inside __call_once after taking the mutex.

The problem is when a thread _does_ observe the store, but fail to observe stores to the associated user state. There is nothing that guarantees that it will.

EricWF added a comment.Sep 1 2016, 1:42 PM

The problem is when a thread _does_ observe the store, but fail to observe stores to the associated user state. There is nothing that guarantees that it will.

I don't understand whan "user state" refers to.

compnerd edited edge metadata.Sep 1 2016, 2:12 PM

Im not sure I follow this entirely either. Perhaps Im just overlooking something, but similar to @EricWF, I think that the relaxed acquisitions should be fine because the mutex will do the stronger barrier to guarantee the correctness. That is to say, the data race is benign.

The following code reproduces the correctness issue on AArch64 (but not on x86). It may take several runs, but eventually it will crash (it crashes on almost every run on a two-core device that I’m using).

long global;

static const long N = 1000000;
std::once_flag once_token[N];
pthread_barrier_t barrier;

void thread_func1() {
  for (int i = 0; i < N; i++) {
    pthread_barrier_wait(&barrier);

    std::call_once(once_token[i], [i] {
      global = 17 + i;
    });

    if (global != 17 + i) {
      abort();
    }

    if (i % (N / 100) == 0) {
      fprintf(stderr, ".");
    }
  }
}

int main() {
  pthread_barrier_init(&barrier, NULL, 4);
  std::thread t1(thread_func1);
  std::thread t2(thread_func1);
  std::thread t3(thread_func1);
  std::thread t4(thread_func1);
  t1.join();
  t2.join();
  t3.join();
  t4.join();
}

The scenario is this: Threads A and B enter call_once simultaneously. Thread A finds the once_flag zero, runs the user code and after that it relaxed-stores ~0 to the flag. Since this is a relaxed store, there can still be other pending stores (from user code) that haven’t been finished. Thread B finds ~0 in the flag and returns from call_once immediately, but the following user code doesn’t see all the changes made by user code run in thread A.

The fact that thread A’s operations happens under a mutex doesn’t change anything, because the other thread doesn’t take the mutex.

EricWF accepted this revision.Sep 2 2016, 1:12 AM
EricWF edited edge metadata.

That explanation makes sense. This LGTM.

This revision is now accepted and ready to land.Sep 2 2016, 1:12 AM
This revision was automatically updated to reflect the committed changes.
jlebar added a subscriber: jlebar.Apr 21 2017, 11:37 AM

This patch changes two call_once's to use acquire loads, but there's a third call_once that does not have an acquire load. Presumably this is a (serious!) bug.

I've sent D32402 to add the missing acquire_load.