call_once is using relaxed atomic load to perform double-checked locking, which contains a data race. The fast-path load has to be an acquire atomic load.
Details
Diff Detail
Event Timeline
include/memory | ||
---|---|---|
674 | Do we care about these platforms? This code is not correct. | |
src/mutex.cpp | ||
201–202 | The first part of this comment does not make sense: |
include/memory | ||
---|---|---|
674 | Well, in a unthreaded environment, I don't see how this is incorrect. Perhaps you mean do we care about threaded environments without the atomics. In the latter, I suspect not, since during the configuration phase, we explicitly check how to get access to the atomics. But, I would defer to @mclow.lists or @EricWF on that point. | |
src/mutex.cpp | ||
201–202 | This comment makes sense; however, it is poorly phrased. What its getting at is that we can avoid the fences around this access due to the mutex guarding. This comment is no longer valid though since the relaxed store is being modified below though. Id say this comment can just go. |
Is it possible to write a test case for this that TSAN would diagnose? If so I would like to see that added.
call_once is using relaxed atomic load to perform double-checked locking, which contains a data race. The fast-path load has to be an acquire atomic load.
Am I correct in saying this isn't a correctness issue? If a data race occurs in call_once it will be handled by the if (flag == 0) check in __call_once.
Instead I'm assuming this patch is intended as a performance improvement?
This is a correctness issue -- user state is not properly synchronized.
Can you provide an example execution that demonstrates the correctness issues?
I cannot seem to see such a case. I understand there is a data race between the relaxed loads/stores but I fail to see how that causes a synchronization issues. If a thread 'A' fails to observe the relaxed store in call_once it will still observe it inside __call_once after taking the mutex.
If a thread 'A' fails to observe the relaxed store in call_once it will still observe it inside __call_once after taking the mutex.
The problem is when a thread _does_ observe the store, but fail to observe stores to the associated user state. There is nothing that guarantees that it will.
The problem is when a thread _does_ observe the store, but fail to observe stores to the associated user state. There is nothing that guarantees that it will.
I don't understand whan "user state" refers to.
Im not sure I follow this entirely either. Perhaps Im just overlooking something, but similar to @EricWF, I think that the relaxed acquisitions should be fine because the mutex will do the stronger barrier to guarantee the correctness. That is to say, the data race is benign.
The following code reproduces the correctness issue on AArch64 (but not on x86). It may take several runs, but eventually it will crash (it crashes on almost every run on a two-core device that I’m using).
long global; static const long N = 1000000; std::once_flag once_token[N]; pthread_barrier_t barrier; void thread_func1() { for (int i = 0; i < N; i++) { pthread_barrier_wait(&barrier); std::call_once(once_token[i], [i] { global = 17 + i; }); if (global != 17 + i) { abort(); } if (i % (N / 100) == 0) { fprintf(stderr, "."); } } } int main() { pthread_barrier_init(&barrier, NULL, 4); std::thread t1(thread_func1); std::thread t2(thread_func1); std::thread t3(thread_func1); std::thread t4(thread_func1); t1.join(); t2.join(); t3.join(); t4.join(); }
The scenario is this: Threads A and B enter call_once simultaneously. Thread A finds the once_flag zero, runs the user code and after that it relaxed-stores ~0 to the flag. Since this is a relaxed store, there can still be other pending stores (from user code) that haven’t been finished. Thread B finds ~0 in the flag and returns from call_once immediately, but the following user code doesn’t see all the changes made by user code run in thread A.
The fact that thread A’s operations happens under a mutex doesn’t change anything, because the other thread doesn’t take the mutex.
This patch changes two call_once's to use acquire loads, but there's a third call_once that does not have an acquire load. Presumably this is a (serious!) bug.
Do we care about these platforms? This code is not correct.
I understand that this particular patch does not make things worse, but people can use this primitive in other places assuming that it works.