This is an archive of the discontinued LLVM Phabricator instance.

Differential D24028

[libcxx] Fix a data race in call_once
ClosedPublic

Authored by kubamracek on Aug 30 2016, 7:15 AM.

Details

Reviewers

compnerd
dvyukov
EricWF
dexonsmith

Commits

rG224264ade067: [libcxx] Fix a data race in call_once
rCXX280621: [libcxx] Fix a data race in call_once
rL280621: [libcxx] Fix a data race in call_once

Summary

call_once uses a relaxed atomic load on the fast path of its double-checked locking, which is a data race. The fast-path load has to be an acquire atomic load.

Diff Detail

Repository: rL LLVM

Event Timeline

kubamracek updated this revision to Diff 69676. Aug 30 2016, 7:15 AM

kubamracek retitled this revision to [libcxx] Fix a data race in call_once.

kubamracek updated this object.

kubamracek added reviewers: EricWF, dexonsmith, dvyukov, compnerd.

kubamracek added subscribers: zaks.anna, dcoughlin.

dvyukov added inline comments. Aug 30 2016, 7:31 AM

include/memory
674 ↗	(On Diff #69676)	Do we care about these platforms? This code is not correct. I understand that this particular patch does not make things worse, but people can use this primitive in other places assuming that it works.
src/mutex.cpp
202 ↗	(On Diff #69676)	The first part of this comment does not make sense: "Changes to flag are done via relaxed atomic stores, because they're protected by a mutex;" If they are protected by a mutex, why do we use atomics?

compnerd added a subscriber: mclow.lists. Aug 30 2016, 7:51 AM

compnerd added inline comments.

include/memory
674 ↗	(On Diff #69676)	Well, in an unthreaded environment, I don't see how this is incorrect. Perhaps you mean: do we care about threaded environments without the atomics? In the latter case, I suspect not, since during the configuration phase we explicitly check how to get access to the atomics. But I would defer to @mclow.lists or @EricWF on that point.
src/mutex.cpp
202 ↗	(On Diff #69676)	This comment makes sense; however, it is poorly phrased. What it's getting at is that we can avoid the fences around this access because the mutex guards it. The comment is no longer valid, though, since the relaxed store is being modified below. I'd say this comment can just go.

dvyukov added inline comments. Aug 30 2016, 8:00 AM

include/memory
674 ↗	(On Diff #69676)	I meant the "defined(__ATOMIC_ACQUIRE) && (__has_builtin(__atomic_load_n) || _GNUC_VER >= 407)" part.
src/mutex.cpp
202 ↗	(On Diff #69676)	Agree. As demonstrated, writing such comments does not help to get the code right :)

Okay, removing the comment.

Is it possible to write a test case for this that TSAN would diagnose? If so I would like to see that added.

> call_once uses a relaxed atomic load on the fast path of its double-checked locking, which is a data race. The fast-path load has to be an acquire atomic load.

Am I correct in saying this isn't a correctness issue? If a data race occurs in call_once it will be handled by the if (flag == 0) check in __call_once.

Instead I'm assuming this patch is intended as a performance improvement?

This is a correctness issue -- user state is not properly synchronized.

> This is a correctness issue -- user state is not properly synchronized.

Can you provide an example execution that demonstrates the correctness issues?

I cannot seem to see such a case. I understand there is a data race between the relaxed loads/stores, but I fail to see how that causes a synchronization issue. If a thread 'A' fails to observe the relaxed store in call_once it will still observe it inside __call_once after taking the mutex.

> If a thread 'A' fails to observe the relaxed store in call_once it will still observe it inside __call_once after taking the mutex.

The problem is when a thread _does_ observe the store, but fails to observe stores to the associated user state. There is nothing that guarantees that it will.

> The problem is when a thread _does_ observe the store, but fails to observe stores to the associated user state. There is nothing that guarantees that it will.

I don't understand what "user state" refers to.
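For readers following the thread, one way to read "user state" is: any ordinary (non-atomic) memory that the callable writes and that callers read after call_once returns. The sketch below is not libc++ code; the names user_state, done, publisher, observer, and run_handoff are hypothetical, and std::atomic stands in for the library's internal flag. It shows the release/acquire pairing that makes the callable's writes visible to a thread that only reads the flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration; none of this is libc++ code.
long user_state = 0;            // plain, non-atomic data written by the "callable"
std::atomic<bool> done{false};  // stands in for the once_flag's internal state

// Publisher: write the user state, then publish the flag with release
// ordering so the write above is ordered before the flag store.
void publisher() {
  user_state = 17;
  done.store(true, std::memory_order_release);
}

// Observer: the acquire load pairs with the release store. Once it reads
// true, the write to user_state is guaranteed to be visible. With a relaxed
// load here, the read of user_state below would be a data race.
long observer() {
  while (!done.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  return user_state;
}

// Runs one publisher and one observer and returns what the observer saw.
long run_handoff() {
  user_state = 0;
  done.store(false);
  long seen = -1;
  std::thread reader([&seen] { seen = observer(); });
  std::thread writer(publisher);
  writer.join();
  reader.join();
  assert(done.load());  // the publisher has finished by now
  return seen;
}
```

Calling run_handoff() returns 17 under the C++ memory model. If the load in observer() were relaxed, the program would have undefined behavior even on hardware where it happens to work.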

I'm not sure I follow this entirely either. Perhaps I'm just overlooking something, but similar to @EricWF, I think that the relaxed acquisitions should be fine because the mutex will do the stronger barrier to guarantee the correctness. That is to say, the data race is benign.

The following code reproduces the correctness issue on AArch64 (but not on x86). It may take several runs, but eventually it will crash (it crashes on almost every run on a two-core device that I’m using).

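#include <pthread.h>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <thread>

// Plain (non-atomic) user state written inside the call_once callable.
long global;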

static const long N = 1000000;
std::once_flag once_token[N];
pthread_barrier_t barrier;

void thread_func1() {
  for (int i = 0; i < N; i++) {
    pthread_barrier_wait(&barrier);

    std::call_once(once_token[i], [i] {
      global = 17 + i;
    });

    if (global != 17 + i) {
      abort();
    }

    if (i % (N / 100) == 0) {
      fprintf(stderr, ".");
    }
  }
}

int main() {
  pthread_barrier_init(&barrier, NULL, 4);
  std::thread t1(thread_func1);
  std::thread t2(thread_func1);
  std::thread t3(thread_func1);
  std::thread t4(thread_func1);
  t1.join();
  t2.join();
  t3.join();
  t4.join();
}

The scenario is this: Threads A and B enter call_once simultaneously. Thread A finds the once_flag zero, runs the user code and after that it relaxed-stores ~0 to the flag. Since this is a relaxed store, there can still be other pending stores (from user code) that haven’t been finished. Thread B finds ~0 in the flag and returns from call_once immediately, but the following user code doesn’t see all the changes made by user code run in thread A.

The fact that thread A’s operations happen under a mutex doesn’t change anything, because the other thread doesn’t take the mutex.
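The scenario above can be sketched as a simplified once-initialization. This is not the libc++ implementation: simple_once and run_simple_once are hypothetical names, the callable runs while the mutex is held (libc++ drops the mutex around the call), and there is no waiting or exception handling. The point is only that the fast path skips the mutex, so it needs its own acquire load paired with a release store.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified once-initialization for illustration only.
struct simple_once {
  std::atomic<unsigned long> state{0};  // 0 = not run, ~0ul = complete
  std::mutex mut;

  template <class F>
  void call(F&& f) {
    // Fast path: this thread never takes the mutex, so the acquire load is
    // the only thing ordering it after the writes made by f().
    if (state.load(std::memory_order_acquire) == ~0ul) return;

    std::lock_guard<std::mutex> lock(mut);
    // Slow path: re-check under the mutex. Relaxed is enough here because
    // the mutex already orders this load after earlier critical sections.
    if (state.load(std::memory_order_relaxed) == 0) {
      f();
      // Release: publishes everything f() wrote to fast-path readers.
      state.store(~0ul, std::memory_order_release);
    }
  }
};

// Runs `threads` threads through call() and returns how many of them saw
// the initialized value afterwards, or -1 if the callable ran more than once.
int run_simple_once(int threads) {
  assert(threads > 0);
  simple_once once;
  long value = 0;
  int calls = 0;
  std::atomic<int> saw_value{0};
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&] {
      once.call([&] { value = 17; ++calls; });
      if (value == 17) saw_value.fetch_add(1);
    });
  }
  for (auto& th : pool) th.join();
  return calls == 1 ? saw_value.load() : -1;
}
```

With four threads, run_simple_once(4) returns 4: the callable runs once and every thread sees its write. Changing the fast-path load to memory_order_relaxed reintroduces the race described in this review.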

That explanation makes sense. This LGTM.

This revision is now accepted and ready to land. Sep 2 2016, 1:12 AM

Closed by commit rL280621: [libcxx] Fix a data race in call_once (authored by kuba.brecka). Sep 4 2016, 3:03 AM

This revision was automatically updated to reflect the committed changes.

This patch changes two call_once's to use acquire loads, but there's a third call_once that does not have an acquire load. Presumably this is a (serious!) bug.

jlebar mentioned this in D32402: Add missing acquire_load to call_once overload. Apr 23 2017, 10:02 AM

I've sent D32402 to add the missing acquire_load.

jlebar mentioned this in rL301132: Add missing acquire_load to call_once overload. Apr 23 2017, 10:11 AM

Revision Contents

Path	Size
libcxx/trunk/include/memory	12 lines
libcxx/trunk/include/mutex	4 lines
libcxx/trunk/src/mutex.cpp	5 lines
Diff 70286

libcxx/trunk/include/memory

(657 lines not shown)
 #if !defined(_LIBCPP_HAS_NO_THREADS) && \
     defined(__ATOMIC_RELAXED) && \
     (__has_builtin(__atomic_load_n) || _GNUC_VER >= 407)
     return __atomic_load_n(__value, __ATOMIC_RELAXED);
 #else
     return *__value;
 #endif
 }
 
+template <class _ValueType>
+inline _LIBCPP_ALWAYS_INLINE
+_ValueType __libcpp_acquire_load(_ValueType const* __value) {
+#if !defined(_LIBCPP_HAS_NO_THREADS) && \
+    defined(__ATOMIC_ACQUIRE) && \
+    (__has_builtin(__atomic_load_n) || _GNUC_VER >= 407)
+    return __atomic_load_n(__value, __ATOMIC_ACQUIRE);
+#else
+    return *__value;
+#endif
+}
+
 // addressof moved to <__functional_base>
 
 template <class _Tp> class allocator;
 
 template <>
 class _LIBCPP_TYPE_VIS_ONLY allocator<void>
 {
 public:
(5,137 lines not shown)
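The helper in this hunk falls back to a plain dereference when the builtin is unavailable, which is the case questioned in the inline comments on line 674. As a minimal sketch of the same shape outside libc++, assuming a GCC- or Clang-compatible compiler, the following uses hypothetical names (acquire_load, release_store, check_round_trip) and pairs the acquire load with a matching release store. The fallback branches give no ordering guarantee and are kept only to mirror the structure of the patch.

```cpp
#include <cassert>

// Hypothetical helper names; a sketch mirroring the shape of the libc++ helper.
template <class T>
inline T acquire_load(T const* value) {
#if defined(__ATOMIC_ACQUIRE)
  // GCC/Clang builtin: atomic load with acquire ordering.
  return __atomic_load_n(value, __ATOMIC_ACQUIRE);
#else
  return *value;  // plain read: no atomicity or ordering guarantee
#endif
}

template <class T>
inline void release_store(T* dest, T value) {
#if defined(__ATOMIC_RELEASE)
  // GCC/Clang builtin: atomic store with release ordering.
  __atomic_store_n(dest, value, __ATOMIC_RELEASE);
#else
  *dest = value;  // plain write: no atomicity or ordering guarantee
#endif
}

// Single-threaded round trip; only checks that the helpers compile and agree.
inline void check_round_trip() {
  unsigned long flag = 0;
  release_store(&flag, ~0ul);
  assert(acquire_load(&flag) == ~0ul);
}
```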

libcxx/trunk/include/mutex

(568 lines not shown)

 #ifndef _LIBCPP_HAS_NO_VARIADICS
 
 template<class _Callable, class... _Args>
 inline _LIBCPP_INLINE_VISIBILITY
 void
 call_once(once_flag& __flag, _Callable&& __func, _Args&&... __args)
 {
-    if (__libcpp_relaxed_load(&__flag.__state_) != ~0ul)
+    if (__libcpp_acquire_load(&__flag.__state_) != ~0ul)
     {
         typedef tuple<_Callable&&, _Args&&...> _Gp;
         _Gp __f(_VSTD::forward<_Callable>(__func), _VSTD::forward<_Args>(__args)...);
         __call_once_param<_Gp> __p(__f);
         __call_once(__flag.__state_, &__p, &__call_once_proxy<_Gp>);
     }
 }
 
 #else  // _LIBCPP_HAS_NO_VARIADICS
 
 template<class _Callable>
 inline _LIBCPP_INLINE_VISIBILITY
 void
 call_once(once_flag& __flag, _Callable& __func)
 {
-    if (__libcpp_relaxed_load(&__flag.__state_) != ~0ul)
+    if (__libcpp_acquire_load(&__flag.__state_) != ~0ul)
     {
         __call_once_param<_Callable> __p(__func);
         __call_once(__flag.__state_, &__p, &__call_once_proxy<_Callable>);
     }
 }
 
 template<class _Callable>
 inline _LIBCPP_INLINE_VISIBILITY
(72 lines not shown)

libcxx/trunk/src/mutex.cpp

(193 lines not shown)
 // call into dispatch_once_f instead of here. Relevant radar this code needs to
 // keep in sync with: 7741191.
 
 #ifndef _LIBCPP_HAS_NO_THREADS
 static __libcpp_mutex_t mut = _LIBCPP_MUTEX_INITIALIZER;
 static __libcpp_condvar_t cv = _LIBCPP_CONDVAR_INITIALIZER;
 #endif
 
-/// NOTE: Changes to flag are done via relaxed atomic stores
-/// even though the accesses are protected by a mutex because threads
-/// just entering 'call_once` concurrently read from flag.
 void
 __call_once(volatile unsigned long& flag, void* arg, void(*func)(void*))
 {
 #if defined(_LIBCPP_HAS_NO_THREADS)
     if (flag == 0)
     {
 #ifndef _LIBCPP_NO_EXCEPTIONS
         try
(20 lines not shown)
 #ifndef _LIBCPP_NO_EXCEPTIONS
         try
         {
 #endif  // _LIBCPP_NO_EXCEPTIONS
             __libcpp_relaxed_store(&flag, 1ul);
             __libcpp_mutex_unlock(&mut);
             func(arg);
             __libcpp_mutex_lock(&mut);
-            __libcpp_relaxed_store(&flag, ~0ul);
+            __libcpp_atomic_store(&flag, ~0ul, _AO_Release);
             __libcpp_mutex_unlock(&mut);
             __libcpp_condvar_broadcast(&cv);
 #ifndef _LIBCPP_NO_EXCEPTIONS
         }
         catch (...)
         {
             __libcpp_mutex_lock(&mut);
             __libcpp_relaxed_store(&flag, 0ul);
(13 lines not shown)